mirror of
https://github.com/marcredhat/kql
synced 2026-06-08 13:23:58 +00:00
Initial commit: KQL ↔ SDL PowerQuery proof of equivalence
@@ -0,0 +1,402 @@
|
||||
# MPP vs KQL: Why SentinelOne's Architecture Wins Cross-Source Threat Hunting
|
||||
|
||||
> **Companion to [github.com/marcredhat/kql](https://github.com/marcredhat/kql).**
|
||||
> The architecture claims in this document are backed by the 17-rule
|
||||
> end-to-end equivalence proof in that repository: every "ready-to-use" KQL
|
||||
> query from the Microsoft Sentinel data-lake docs was converted to SDL
|
||||
> PowerQuery, run against the same deterministic dataset on both engines,
|
||||
> and asserted to produce equivalent verdicts. See `reports/PROOF.md` after
|
||||
> running `./run_proof.sh`.
|
||||
|
||||
## TL;DR
|
||||
|
||||
KQL on Microsoft Sentinel is a clever query language on top of a
|
||||
general-purpose log analytics store (Azure Data Explorer / Kusto).
|
||||
SentinelOne's Singularity Data Lake (SDL) is a purpose-built, indexless,
columnar, always-hot security lake with a massively parallel processing
(MPP) query engine that dedicates the entire cluster to every interactive
query.
|
||||
|
||||
For 90-day cross-source hunts joining endpoint + identity + DNS + cloud,
|
||||
the SDL design removes the three things that actually slow KQL down in
|
||||
production: index/shard locality, workspace boundaries, and tiered
|
||||
storage rehydration.
|
||||
|
||||
**Demonstrated end-to-end on a public repo:** 17 KQL hunts ↔ 17
|
||||
PowerQueries, same data, asserted equivalence —
|
||||
[github.com/marcredhat/kql](https://github.com/marcredhat/kql).
|
||||
|
||||
---
|
||||
|
||||
## 1. Storage model: inverted+columnar shards (Kusto) vs pure columnar epochs (SDL)
|
||||
|
||||
### Microsoft Sentinel / KQL (ADX/Kusto under the hood)
|
||||
|
||||
Kusto stores data as **extents** (immutable shards), columnar within an
|
||||
extent, but each extent has a shard-level inverted **term index** ("shard
|
||||
index") and column-level bloom/range indexes.
|
||||
|
||||
Performance is excellent when the query predicate hits an indexed column
|
||||
with good selectivity (`where Computer == "x"`, `where SourceIP has "1.2.3.4"`).
|
||||
|
||||
Performance degrades when:
|
||||
|
||||
- Predicates are on **unindexed or high-cardinality JSON dynamic fields**
|
||||
→ falls back to scan.
|
||||
- You use `has_any`, `matches regex`, or `contains` on large columns
|
||||
→ index bypass.
|
||||
- Joins cross tables with different partitioning keys → shuffle.
|
||||
- You query **Basic Logs / Auxiliary Logs / Archive** tiers → restricted
|
||||
KQL, no joins, or rehydration jobs (search jobs / restore) measured in
|
||||
hours, billed separately.
|
||||
|
||||
> **Sidebar — what an inverted index actually is.**
|
||||
> Inverted index = `term → [doc, doc, …]`. Great for selective full-text
|
||||
> lookup, useless for unselective scans where most of the data *is* the
|
||||
> answer. That is exactly the shape of cross-source threat-hunt queries.
|
||||
|
||||
### SentinelOne SDL
|
||||
|
||||
No inverted index. Data is written in ~5-minute **epochs** to columnar,
|
||||
append-only segments backed by object storage (the Scalyr/DataSet
|
||||
lineage).
|
||||
|
||||
- Every byte ever ingested is hot — the epoch reader path is identical
|
||||
for "last 15 minutes" and "90 days ago."
|
||||
- There is no index to tune, no extent merge policy, no "is this column
|
||||
indexed?" question. Cost of a scan is predictable and linear in
|
||||
bytes-after-columnar-pruning.
|
||||
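To make "linear in bytes-after-columnar-pruning" concrete, here is a minimal back-of-envelope sketch in Python. The column names and sizes are invented illustration values, not SDL measurements, and `bytes_scanned` is a hypothetical helper. The only figure taken from the text is the ~5-minute epoch (12 per hour, 288 per day).

```python
# Toy model: bytes a columnar scan must read for one query.
# All sizes are invented for illustration; they are not SDL measurements.

def bytes_scanned(column_bytes_per_epoch, referenced_columns, n_epochs):
    """Sum the per-epoch size of only the referenced columns, times epochs."""
    per_epoch = sum(column_bytes_per_epoch[c] for c in referenced_columns)
    return per_epoch * n_epochs

# Hypothetical per-epoch column sizes in bytes.
columns = {"timestamp": 1_000, "user": 2_000, "host": 2_000, "cmdline": 20_000}

# Only two of the four columns are referenced, so only they are read.
one_hour = bytes_scanned(columns, ["timestamp", "cmdline"], n_epochs=12)
one_day = bytes_scanned(columns, ["timestamp", "cmdline"], n_epochs=288)

print(one_hour)            # 252000
print(one_day / one_hour)  # 24.0
```

Under this model a window 24 times longer costs 24 times the bytes: no cliff, but also no free lunch.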
|
||||
**Why this matters for hunting:** the most useful hunt queries are
|
||||
exactly the ones KQL indexes least well — wide union across tables,
|
||||
regex/substring on command lines, joins across identity↔endpoint↔network
|
||||
on fields that aren't the partitioning key.
|
||||
|
||||
---
|
||||
|
||||
## 2. Query execution: per-query resource ceilings (Kusto) vs full-cluster-per-query (SDL MPP)
|
||||
|
||||
### Kusto / Sentinel
|
||||
|
||||
Kusto is multi-tenant per cluster and uses a workload group / resource
|
||||
governor model. Each query gets a slice of cluster CPU/memory, bounded
|
||||
by `MaxMemoryPerQueryPerNode`, `MaxConcurrentRequests`, request rate
|
||||
limits, etc.
|
||||
|
||||
Sentinel customers don't even own the cluster — they get a logical
|
||||
workspace on shared infrastructure with opaque, throttled capacity. You
|
||||
routinely see `E_QUERY_RESULT_SET_TOO_LARGE`, `Request is throttled`,
|
||||
partial results, or the 10-minute query timeout.
|
||||
|
||||
Joins are particularly painful: `join kind=inner` tends to broadcast the
left side when the planner estimates it is small; if it turns out to be
too big you must add `hint.strategy=shuffle` and `hint.shufflekey=...`.
Get the hint wrong on a 90-day join and the query OOMs or times out.
|
||||
|
||||
### SDL MPP
|
||||
|
||||
The architecture is explicit: every CPU core on every compute node works
|
||||
on one interactive query at a time, with horizontal scheduling across the
|
||||
tenant.
|
||||
|
||||
- Each worker does **early predicate pushdown + local
|
||||
aggregation/reduction** on its epoch segments.
|
||||
- The coordinator merges already-reduced outputs, not raw rows.
|
||||
|
||||
> **Real-world latency from our 17-rule proof.** SDL PowerQuery latency
|
||||
> on a freshly-ingested 445-event corpus was **1.7–2.6 s end-to-end**
|
||||
> (HTTP + parse + scan + aggregate) for hunt-shape queries. The same
|
||||
> queries against a wider `30d` startTime window were noticeably slower
|
||||
> — a reminder that even on a full-cluster MPP, **window sizing still
|
||||
> matters**; the value is that the *runtime path is identical* for 2 h
|
||||
> vs 90 d, not that it's free.
|
||||
|
||||
**Why this matters:** KQL's bottleneck on a hard query is rarely the
|
||||
language — it's that you can't actually have the whole cluster for 8
|
||||
seconds. SDL's whole point is that you can.
|
||||
|
||||
---
|
||||
|
||||
## 3. The "parallel scan + local reduction" point is the real one
|
||||
|
||||
Many engines parallelize the scan. Few parallelize the **reduction**.
|
||||
Kusto fans out the scan, then often funnels intermediate results back to
|
||||
a coordinator/data-node tier for `summarize`, `join`, `top`,
|
||||
`make-series`. On a 90-day, multi-table hunt that intermediate set can be
|
||||
enormous, and that's where queries stall.
|
||||
|
||||
SDL keeps `filter → project → partial-aggregate → partial-join`
|
||||
distributed for as long as possible, so what flows up the tree is small.
|
||||
This is the same insight behind Snowflake / Presto / Trino, applied
|
||||
specifically to security telemetry shapes (high-cardinality
|
||||
`process_name`, `device_id`, `user_principal_name`, `dns_question`,
|
||||
etc.).
|
||||
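The "local reduction" idea can be sketched in a few lines of Python. This is a toy model of the pattern described above, not SDL's implementation; the worker data and function names are hypothetical.

```python
from collections import Counter

def worker_partial_count(rows, key):
    """Each worker reduces its own rows to a small {key: count} map."""
    return Counter(row[key] for row in rows)

def coordinator_merge(partials):
    """The coordinator only adds up already-reduced maps, never raw rows."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

# Three hypothetical workers, each holding a slice of process events.
worker_rows = [
    [{"process_name": "powershell.exe"}, {"process_name": "cmd.exe"}],
    [{"process_name": "powershell.exe"}, {"process_name": "powershell.exe"}],
    [{"process_name": "mshta.exe"}],
]

partials = [worker_partial_count(rows, "process_name") for rows in worker_rows]
result = coordinator_merge(partials)
print(result["powershell.exe"])  # 3
```

What crosses the network is one small map per worker, so the coordinator's input grows with the number of distinct keys rather than the number of raw events.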
|
||||
---
|
||||
|
||||
## 4. Schema and normalization: ASIM (best-effort views) vs OCSF (native)
|
||||
|
||||
### Sentinel / ASIM
|
||||
|
||||
Microsoft's Advanced Security Information Model (ASIM) is implemented as KQL
|
||||
functions/parsers on top of raw tables. Every cross-source query expands
|
||||
at runtime into a union of parsers (`_Im_NetworkSession`,
|
||||
`_Im_ProcessEvent`, …).
|
||||
|
||||
Each parser is a function call; the optimizer can't always push
|
||||
predicates through them cleanly, so wide ASIM queries can be
|
||||
significantly slower than native-table queries.
|
||||
|
||||
Coverage is partial — many 3rd-party sources have no ASIM parser and
|
||||
must be hand-normalized.
|
||||
|
||||
### SDL / OCSF
|
||||
|
||||
Data is normalized to OCSF at ingest by the AI-native pipeline. The
|
||||
columns on disk are already the unified schema.
|
||||
|
||||
No runtime parser expansion, no union of synthetic views — cross-source
|
||||
joins are just joins on real columns.
|
||||
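As a rough illustration of "normalize once at ingest", the Python sketch below maps source-specific field names onto one shared schema before storage. The source names, field names, and `FIELD_MAPS` table are hypothetical and far smaller than any real OCSF mapping.

```python
# Hypothetical per-source field mappings; real OCSF mappings are much larger.
FIELD_MAPS = {
    "okta": {"actor": "user_name", "client_host": "host"},
    "dns": {"src_user": "user_name", "src_host": "host", "qname": "dns_question"},
}

def normalize_at_ingest(source, raw_event):
    """Rename known fields to the unified schema; drop unmapped fields."""
    mapping = FIELD_MAPS[source]
    return {mapping[k]: v for k, v in raw_event.items() if k in mapping}

stored = [
    normalize_at_ingest("okta", {"actor": "alice", "client_host": "wks-1"}),
    normalize_at_ingest(
        "dns", {"src_user": "alice", "src_host": "wks-1", "qname": "c2.example.com"}
    ),
]

# A cross-source filter is now a plain comparison on shared columns.
hits = [e for e in stored if e["user_name"] == "alice" and e["host"] == "wks-1"]
print(len(hits))  # 2
```

The renaming cost is paid once per event at write time, instead of once per query at read time.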
|
||||
> **Honest hedge.** This holds for first-party connectors and the
|
||||
> SentinelOne ingest catalog. For raw `/api/uploadLogs` ingest, the
|
||||
> customer-supplied parser determines schema fidelity — we hit this
|
||||
> while building the equivalence proof and ended up using a `json`
|
||||
> parser plus a per-event `event_type` discriminator column to mimic
|
||||
> the table-per-source shape of Sentinel.
|
||||
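The discriminator approach from the hedge above might look like the following Python sketch. It only builds the tagged payload; the newline-delimited layout and the `tag_events` helper are assumptions for illustration, not the repo's actual `upload_logs()` code.

```python
import json

def tag_events(events, event_type, run_id):
    """Return newline-delimited JSON with a discriminator and a run marker."""
    lines = []
    for event in events:
        # event_type mimics Sentinel's table-per-source shape;
        # proof_run_id is the custom marker attribute mentioned in this doc.
        tagged = dict(event, event_type=event_type, proof_run_id=run_id)
        lines.append(json.dumps(tagged, sort_keys=True))
    return "\n".join(lines)

body = tag_events([{"user": "alice"}], event_type="SigninLogs", run_id="run-001")
print(body)
# {"event_type": "SigninLogs", "proof_run_id": "run-001", "user": "alice"}
```

A query can then filter on `event_type` where the KQL original would have named a table.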
|
||||
**Why this matters for cross-source hunts:** the realistic 90-day Okta +
|
||||
DNS + EDR hunt in Sentinel is
|
||||
`union isfuzzy=true (_Im_WebSession) (_Im_Dns) (_Im_ProcessEvent) | join ...`
|
||||
and it is a known performance cliff. In SDL the same hunt is one query
|
||||
against one schema.
|
||||
|
||||
---
|
||||
|
||||
## 5. Retention and tiering: the silent killer for KQL hunts
|
||||
|
||||
| Dimension | Sentinel / Log Analytics | SentinelOne SDL |
|---|---|---|
| Default interactive retention | 90 days (Analytics tier) | Entire retention window |
| Beyond that | Basic Logs (restricted KQL, no joins, no alerts) → Auxiliary → Archive (search jobs, hours-long restores) | Same query path, same engine |
| Cost shape | Pay per GB ingested **and** per tier transition **and** per search job | Flat hot lake |
| Join across tiers | Not supported | Native |
|
||||
|
||||
A "90-day cross-source hunt" in Sentinel silently becomes a
|
||||
tiered-storage project. In SDL it's a query.
|
||||
|
||||
> **Concrete detail from the equivalence proof.** SentinelOne's
|
||||
> `/api/uploadLogs` accepted **445 events spanning a wide range of
|
||||
> embedded timestamps in a single 217 KB POST**; the same query path
|
||||
> then served them <2 s later. There is no warm-tier flip, no
|
||||
> rehydration job, no separate billing meter. (See
|
||||
> `harness/sdl_client.py` `upload_logs()` in the repo.)
|
||||
|
||||
---
|
||||
|
||||
## 6. Concrete walkthrough: the 90-day Okta → DNS → process hunt
|
||||
|
||||
### Sentinel / KQL (realistic shape — and its failure modes)
|
||||
|
||||
```kusto
let suspect_domains = dynamic(["c2.example.com", "suspect.example.net"]);
let suspicious_users =
    SigninLogs
    | where TimeGenerated > ago(90d)
    // ResultType is a string column in SigninLogs, so compare against "0".
    | where ResultType != "0" or RiskLevelDuringSignIn == "high"
    | summarize by UserPrincipalName;
let bad_dns =
    _Im_Dns(starttime=ago(90d))
    | where DnsQuery has_any (suspect_domains)
    | project TimeGenerated, SrcIpAddr, DnsQuery;
_Im_ProcessEvent(starttime=ago(90d))
| where ProcessCommandLine has_any ("powershell","rundll32","mshta")
| join kind=inner hint.strategy=shuffle hint.shufflekey=DvcHostname (
    // Caution: this equates a hostname with a source IP, so it only matches
    // where DvcHostname is recorded as an IP address.
    bad_dns | extend DvcHostname = tostring(SrcIpAddr)
  ) on DvcHostname
| join kind=inner (suspicious_users)
    on $left.ActorUsername == $right.UserPrincipalName
```
|
||||
|
||||
> `_Im_*` are runtime **parser functions**, not tables; each call
|
||||
> re-unions and re-projects the underlying sources, defeating extent
|
||||
> pruning.
|
||||
|
||||
Real-world failure modes:
|
||||
|
||||
1. 90 d on `_Im_ProcessEvent` blows past workspace query memory →
|
||||
partial results or timeout.
|
||||
2. `has_any` on `ProcessCommandLine` bypasses the term index → full scan
|
||||
of the largest table.
|
||||
*(Deep dive: [`kql_cliffs_explained.md` §1](kql_cliffs_explained.md#1-has_any-on-processcommandline-bypasses-the-term-index).)*
|
||||
3. ASIM parsers re-union underlying tables on every call.
|
||||
4. If process events are in **Basic Logs** to save money: `join` is not
|
||||
allowed in Basic. Query refuses to run. You now schedule a Search
|
||||
Job (async, hours, separate billing) and stitch results manually.
|
||||
5. `hint.strategy=shuffle hint.shufflekey=DvcHostname` is **required**
|
||||
to keep the cross-table join from OOMing at 90 d; the hint has to
|
||||
be re-tuned as data volume grows.
|
||||
*(Deep dive: [`kql_cliffs_explained.md` §2](kql_cliffs_explained.md#2-hintshufflekey-is-required-to-avoid-oom-on-the-cross-table-join).)*
|
||||
|
||||
### SDL / PowerQuery (same intent — fully runnable)
|
||||
|
||||
The version below uses the SDL named-input join syntax that we
|
||||
validated against `xdr.us1.sentinelone.net` while building the
|
||||
equivalence proof in this repo. See
|
||||
`docs/runnable_examples/90day_okta_dns_process.pq` for the same query
|
||||
ready to paste into your tenant.
|
||||
|
||||
> NOTE on `loginIsSuccessful = 'false'`: SDL stores booleans as
|
||||
> lowercase strings via the JSON parser, so the quoted form fires on
|
||||
> synthetic data ingested via `uploadLogs`. On a tenant whose OCSF
|
||||
> parser emits native booleans, drop the quotes.
|
||||
|
||||
```
| join
    failed_signins = (
      event.category = 'logins'
      AND event.login.loginIsSuccessful = 'false'
      | columns userName = event.login.userName,
                host = endpoint.name
      | group n_fails = count() by userName, host
    ),
    bad_dns = (
      event.type = 'DNS Resolved'
      AND dns.question.name matches '(c2|suspect)\.example\.'
      | columns userName = src.endpoint.user.name,
                host = endpoint.name,
                domain = dns.question.name
      | group dns_hits = count(),
              domains = array_agg_distinct(domain, 20)
        by userName, host
    ),
    susp_proc = (
      event.type = 'Process Creation'
      AND src.process.cmdline matches '(?i)(powershell|rundll32|mshta)'
      | columns userName = src.process.user,
                host = endpoint.name,
                cmdline = src.process.cmdline
      | group proc_hits = count(),
              cmdlines = array_agg_distinct(cmdline, 20)
        by userName, host
    )
  on userName, host
| columns userName,
          host,
          hits = n_fails + dns_hits + proc_hits,
          n_fails,
          dns_hits,
          proc_hits,
          domains,
          cmdlines
| sort -hits
| limit 100
```
|
||||
|
||||
Why this shape:
|
||||
|
||||
- **Plain `join`, not `sql join`** — `sql join` currently supports at most
|
||||
two subqueries; plain `join` supports 2+ with inner-join semantics by
|
||||
default.
|
||||
- **Each branch is pre-aggregated** to one row per `(userName, host)`
|
||||
so the join can't collapse multiple DNS/process matches to a single
|
||||
arbitrary first match. `array_agg_distinct(..., 20)` preserves the
|
||||
per-pair rollup.
|
||||
- **`failed_signins` emits `host`** so the multi-key join on
|
||||
`userName, host` is satisfied symmetrically across all three sides.
|
||||
- **Final `columns` references join keys bare** (`userName`, `host`);
|
||||
named-subquery prefixes are reserved for aggregate fields, where
|
||||
needed.
|
||||
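The pre-aggregation bullet is easier to see with data. The Python sketch below models a simplified "first match wins" join input next to a pre-aggregated one; the rows are hypothetical, and the first-match behaviour is a simplification of what the text describes rather than a statement about SDL internals.

```python
# Hypothetical DNS matches for a single (userName, host) pair.
dns_rows = [
    ("alice", "wks-1", "c2.example.com"),
    ("alice", "wks-1", "suspect.example.net"),
    ("alice", "wks-1", "c2.example.com"),
]

# Simplified "first match wins" join input: later matches are dropped.
first_match = {}
for user, host, domain in dns_rows:
    first_match.setdefault((user, host), domain)

# Pre-aggregated input: one row per pair that keeps a count and a capped set,
# analogous to count() plus array_agg_distinct(domain, 20).
MAX_DOMAINS = 20
agg = {}
for user, host, domain in dns_rows:
    entry = agg.setdefault((user, host), {"dns_hits": 0, "domains": set()})
    entry["dns_hits"] += 1
    if len(entry["domains"]) < MAX_DOMAINS:
        entry["domains"].add(domain)

print(first_match[("alice", "wks-1")])             # c2.example.com
print(agg[("alice", "wks-1")]["dns_hits"])         # 3
print(sorted(agg[("alice", "wks-1")]["domains"]))
# ['c2.example.com', 'suspect.example.net']
```

The first-match row silently loses `suspect.example.net`; the pre-aggregated row keeps both the hit count and the distinct domains.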
|
||||
One schema (OCSF), one engine, one storage tier, full cluster for the
|
||||
query. 90 d isn't a different code path — it's more epochs scanned in
|
||||
parallel.
|
||||
|
||||
---
|
||||
|
||||
## 7. Where KQL is genuinely good (be fair)
|
||||
|
||||
- KQL as a language is excellent — arguably more expressive than
|
||||
PowerQuery for ad-hoc shaping (`make-series`, `mv-expand`,
|
||||
`bag_unpack`, `series_decompose_anomalies`).
|
||||
- For narrow, indexed, recent queries
|
||||
(`where IPAddress == x and TimeGenerated > ago(1h)`), Kusto is
|
||||
extremely fast.
|
||||
- ADX **outside Sentinel** (your own cluster, your own SKU) lets you
|
||||
actually size compute — most of the pain above is Sentinel's
|
||||
multi-tenant workspace packaging, not Kusto itself.
|
||||
|
||||
The architectural argument isn't "KQL is bad." It's "for the workload
|
||||
security teams actually run — wide, long, cross-source, unselective
|
||||
predicates — an indexless columnar always-hot lake with MPP wins by
|
||||
design."
|
||||
|
||||
---
|
||||
|
||||
## 7a. What the KQL → PQ translation actually looks like in practice
|
||||
|
||||
From the 17-rule conversion in this repo: **15 of 17 translate
|
||||
mechanically**; only the 2 using `series_fit_line` required a redesign.
|
||||
|
||||
| KQL idiom | SDL PowerQuery | Friction |
|---|---|---|
| `where TimeGenerated > ago(1d)` | `startTime` param + `ts_epoch_ms ≥ N` | None |
| `summarize n=count() by X` | `\| group n=count() by X` | None |
| `dcount(X)` | `estimate_distinct(X)` | None |
| `make_set(X)` | `array_agg_distinct(X, N)` | None (must specify cap) |
| `in~ ('a','b')` | `in ('a','b')` | None |
| `contains` / `has` | `contains` | None |
| `extend Y = ...` | `\| let Y = ...` | Light (must pick a fresh name) |
| `join kind=leftanti` | Filter on baseline set built via `array_agg` | Light |
| `top N by X` | `\| sort -X \| limit N` | None |
| `bin(t, 1h)` / `make-series` | `timebucket('1 hour')` | Light (no make-series) |
| `series_fit_line` (ML) | **No equivalent** | Hard: pre-aggregate or mark at ingest |
|
||||
|
||||
ML-on-query-time is the wrong place to do anomaly fitting at
|
||||
security-lake scale. SDL pushes it left (to ingest pipelines / Purple
|
||||
AI) on purpose.
|
||||
|
||||
---
|
||||
|
||||
## 8. The actual operator experience, side by side
|
||||
|
||||
Pulled directly from the lessons embedded in the repo's `README.md`.
|
||||
|
||||
| Pain point in Sentinel / KQL | Equivalent (or absence) in SDL: what to flag |
|---|---|
| `E_QUERY_RESULT_SET_TOO_LARGE` | Not seen during proof; but wide `startTime` (`30d`) can return `matching=0` where a `30m` window returns N. **Always pass the tightest window that contains your data.** |
| ASIM parser unions at query time | Single OCSF schema, no fix needed |
| Search Jobs (hours, separate billing) | `uploadLogs` ingest path; events queryable <30 s after upload |
| `bytesCharged: 0` confusion | **Not a rejection signal**: it is a billing meter, not an error code |
| KQL function discovery (IntelliSense) | `group_unique_values()` does **not** exist; use `array_agg_distinct(field, N)`. Publish the full SDL agg function list in your skills. |
| `~=` vs `contains` | `~=` is **case-insensitive equality**, not substring: a common foot-gun |
| `addEvents` silent drops | Use `/api/uploadLogs` for historical ingest; `addEvents` silently drops events whose `ts` is outside its acceptance window |
| `serverHost` in `addEvents` payload | **Not honoured**; use a custom marker attribute (we use `proof_run_id`) |
|
||||
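One way to act on the `addEvents` row is a pre-flight split by timestamp age, sketched below in Python. The source does not state the real acceptance window, so `max_age_s`, the `ts_s` field (seconds since epoch), and `split_by_age` are all hypothetical placeholders.

```python
def split_by_age(events, now_s, max_age_s):
    """Partition events into (recent, historical) by timestamp age in seconds."""
    recent, historical = [], []
    for event in events:
        target = recent if now_s - event["ts_s"] <= max_age_s else historical
        target.append(event)
    return recent, historical

now_s = 1_800_000_000  # arbitrary "current time" for the example
events = [
    {"id": 1, "ts_s": now_s - 60},           # one minute old
    {"id": 2, "ts_s": now_s - 90 * 86_400},  # ninety days old
]

# max_age_s is a placeholder; substitute the window your tenant documents.
recent, historical = split_by_age(events, now_s, max_age_s=3_600)
print([e["id"] for e in recent])      # [1]
print([e["id"] for e in historical])  # [2]
```

Routing the `historical` list to `/api/uploadLogs` avoids depending on a drop that produces no error.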
|
||||
---
|
||||
|
||||
## 9. The architectural scoreboard for cross-source threat hunting
|
||||
|
||||
| Dimension | Sentinel + KQL | SentinelOne SDL + MPP |
|---|---|---|
| Storage | Columnar extents + inverted/bloom indexes | Indexless columnar epochs |
| Hot vs cold | Analytics / Basic / Auxiliary / Archive | All hot |
| Schema | ASIM (runtime parser functions) | OCSF at ingest |
| Query resources | Slice of shared cluster, governed | Whole cluster per interactive query |
| Reduction | Often funneled to coordinator | Distributed local reduction |
| Cross-source join over 90 d | Multi-table union + shuffle hints + tier limits | Single engine, single schema |
| AI assistant value | Bottlenecked by backend latency | Purple AI is useful because backend latency is low (1.7–2.6 s in the proof) |
|
||||
|
||||
---
|
||||
|
||||
## Bottom line
|
||||
|
||||
KQL is a great query language sitting on a general-purpose analytics
|
||||
database that Microsoft repackaged as a SIEM. SentinelOne built the
|
||||
storage, ingest, and execution layers together, for security telemetry
|
||||
shapes: high cardinality, wide joins, long retention, unselective
|
||||
predicates. That co-design — indexless columnar epochs + OCSF +
|
||||
full-cluster MPP with distributed reduction — is why the 15-minute hunt
|
||||
becomes the 8-second hunt, and why Purple AI is operational rather than
|
||||
a demo.
|
||||
|
||||
**Proof artifact:** [github.com/marcredhat/kql](https://github.com/marcredhat/kql).
|
||||
@@ -0,0 +1,216 @@
|
||||
# Two Kusto performance cliffs explained
|
||||
|
||||
Companion deep-dive to [`MPP_vs_KQL.md`](MPP_vs_KQL.md) §6. Two phrases in
|
||||
the annotated KQL block point at real, well-known performance cliffs that
|
||||
deserve their own explanation rather than a footnote:
|
||||
|
||||
1. `has_any on ProcessCommandLine bypasses the term index`
|
||||
2. `hint.shufflekey is required to avoid OOM on the cross-table join`
|
||||
|
||||
---
|
||||
|
||||
## 1. `has_any` on `ProcessCommandLine` bypasses the term index
|
||||
|
||||
### What the term index actually does
|
||||
|
||||
Kusto builds a **per-shard inverted term index** on string columns. At
|
||||
ingest time each string value is tokenized into "terms" using a fixed
|
||||
tokenizer that splits on non-alphanumeric ASCII (whitespace,
|
||||
punctuation, `\`, `/`, `.`, `-`, etc.) and lowercases. The resulting
|
||||
tokens are written to the shard's inverted index alongside the columnar
|
||||
data.
|
||||
|
||||
When you write `where Col has "x"`, Kusto:
|
||||
|
||||
1. Tokenizes `"x"` the **same way** the indexer did at ingest.
|
||||
2. Looks up the resulting term in the shard's inverted index.
|
||||
3. Reads **only the rows in shards whose index says "this term might be
|
||||
present here"** — entire shards get skipped.
|
||||
|
||||
This is the difference between a 50 ms hunt and a 5-minute one.
|
||||
|
||||
### Why `has_any` on `ProcessCommandLine` falls out of that fast path
|
||||
|
||||
Three independent reasons compound:
|
||||
|
||||
**a) The needle contains characters that the tokenizer treats as separators.**
|
||||
|
||||
`ProcessCommandLine` values look like:
|
||||
|
||||
```
|
||||
"C:\Windows\System32\WindowsPowerShell\v1.0\powershell.exe" -nop -w hidden -enc JABzAD0A...
|
||||
```
|
||||
|
||||
If you write `has_any ("powershell.exe", "rundll32.exe")` you're not
|
||||
searching for one token — `powershell.exe` is the **two tokens**
|
||||
`powershell` and `exe` joined by a `.` (a separator). The index never
|
||||
stored `powershell.exe` as a single term, so the lookup misses and Kusto
|
||||
falls back to a row scan.
|
||||
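The tokenization argument can be reproduced with a small Python model. It follows the description above (split on non-alphanumeric ASCII, lowercase) and is a sketch only; the real Kusto tokenizer has additional rules that are not modelled here.

```python
import re

def tokenize(value):
    """Split on non-alphanumeric characters and lowercase, as described above."""
    return [t for t in re.split(r"[^A-Za-z0-9]+", value.lower()) if t]

cmdline = (
    r'"C:\Windows\System32\WindowsPowerShell\v1.0\powershell.exe" -nop -w hidden'
)

print(tokenize("powershell.exe"))             # ['powershell', 'exe']
print("powershell.exe" in tokenize(cmdline))  # False
print("powershell" in tokenize(cmdline))      # True
```

In this model the string `powershell.exe` never exists as a single stored term, while the bare token `powershell` does.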
|
||||
Quick fix: search for the bare token (`has_any ("powershell", "rundll32")`)
|
||||
— Kusto's planner will index-prune on each individual token. But
|
||||
analysts almost never write it that way because they're thinking "the
|
||||
binary is `powershell.exe`."
|
||||
|
||||
**b) `has_any` blows past the term-index cardinality threshold.**
|
||||
|
||||
For each candidate term in the `has_any` list, Kusto has to consult the
|
||||
inverted index, accumulate row-id sets, then union them. The query
|
||||
optimizer has an internal threshold: above some number of needles (or
|
||||
above some estimated selectivity), it gives up on index lookups and
|
||||
just scans, because the OR-merge of many indexed lookups costs more
|
||||
than the scan would.
|
||||
|
||||
The exact threshold is undocumented and changes between versions;
|
||||
empirically it kicks in fast on `ProcessCommandLine` because that column
|
||||
has the highest term cardinality in the schema — most of those terms
|
||||
are unique GUIDs, paths, base64 blobs, hashes, etc. — so the inverted
|
||||
index is huge and per-term lookup is expensive.
|
||||
|
||||
**c) `ProcessCommandLine` itself blows up the indexer's effectiveness.**
|
||||
|
||||
Even when you do hit the index, the **selectivity** is terrible. A term
|
||||
like `powershell` matches a large fraction of all process-creation rows
|
||||
on a typical workstation fleet. The index tells Kusto "this shard might
|
||||
contain it" — but every shard *does* contain it, so no shards get
|
||||
pruned. You still scan everything.
|
||||
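A toy Python model of shard-level pruning shows why selectivity matters more than the index itself. The shard names and term sets are hypothetical.

```python
def shards_to_scan(shard_terms, term):
    """Return the shards whose term set might contain `term`."""
    return [name for name, terms in shard_terms.items() if term in terms]

# Hypothetical shards with the terms their index recorded.
shard_terms = {
    "shard-1": {"powershell", "svchost", "a1b2c3"},
    "shard-2": {"powershell", "chrome"},
    "shard-3": {"powershell", "mimikatz"},
}

print(shards_to_scan(shard_terms, "mimikatz"))         # ['shard-3']
print(len(shards_to_scan(shard_terms, "powershell")))  # 3
```

A rare term prunes two of the three shards; a ubiquitous term prunes none, so the index lookup is pure overhead on top of the full scan.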
|
||||
This is the deepest reason of the three: even a perfectly written,
|
||||
single-token, indexed `has` query on `ProcessCommandLine` gives you the
|
||||
*index path's CPU cost* on top of *the scan you were going to do anyway*.
|
||||
|
||||
### The escape hatch most people don't know about
|
||||
|
||||
If you must do this in KQL, push the substring match into a `where`
|
||||
clause that the planner can convert into a true scan-with-early-exit,
|
||||
and pre-narrow with something that *is* selective:
|
||||
|
||||
```kusto
|
||||
SecurityEvent
|
||||
| where TimeGenerated > ago(1h) // narrow time first — selective
|
||||
| where EventID == 4688 // narrow event-id — selective
|
||||
| where ProcessCommandLine matches regex @"(?i)powershell|rundll32|mshta"
|
||||
```
|
||||
|
||||
One hour of data and an `EventID` filter is usually enough that the
|
||||
scan-after-prune is cheap. **At 90 days with no narrowing predicate,
|
||||
you've lost.** That's the cliff the doc refers to.
|
||||
|
||||
### What SDL does instead
|
||||
|
||||
No index, so no "did I hit the index?" cliff. Every query is a columnar
|
||||
scan over the epochs that overlap the time window, with column-level
|
||||
prefix pruning and run-length compression doing the heavy lifting.
|
||||
`matches "(powershell|rundll32|mshta)"` on `src.process.cmdline` at 90
|
||||
days is the same code path as at 1 hour — just more epochs in parallel.
|
||||
|
||||
---
|
||||
|
||||
## 2. `hint.shufflekey` is required to avoid OOM on the cross-table join
|
||||
|
||||
### How Kusto's distributed join works by default
|
||||
|
||||
Kusto is distributed. Tables are split into extents (shards), and
|
||||
extents live on different data nodes. When you write:
|
||||
|
||||
```kusto
|
||||
A | join kind=inner B on Key
|
||||
```
|
||||
|
||||
Kusto picks one of two physical strategies:
|
||||
|
||||
- **Broadcast** (the default for "small × large"): take the smaller
|
||||
side, replicate it to every node holding the larger side, then do
|
||||
local hash joins. Fast when small really is small.
|
||||
- **Shuffle**: hash both sides on `Key`, send all rows with hash bucket
|
||||
*i* to node *i*, then do local hash joins. Needed when both sides are
|
||||
big.
|
||||
|
||||
The planner chooses based on a **statistics estimate** of how big each
|
||||
side is *after* the upstream `where` filters apply.
|
||||
|
||||
### Where the OOM comes from
|
||||
|
||||
For a 90-day cross-source hunt the planner's estimate is almost always
|
||||
wrong:
|
||||
|
||||
1. `bad_dns` after the `has_any (suspect_domains)` filter is **probably**
|
||||
small — but if the IOC list has 200 entries or a wildcard sneaks in,
|
||||
it can be millions of rows.
|
||||
2. The planner picks **broadcast** because it estimates `bad_dns` is
|
||||
small.
|
||||
3. At runtime, `bad_dns` turns out to be huge.
|
||||
4. Kusto tries to ship the entire `bad_dns` payload to every node
|
||||
holding `_Im_ProcessEvent` extents (which at 90 d is **every node**).
|
||||
5. Each node tries to hold the broadcasted copy in memory while
|
||||
streaming `_Im_ProcessEvent` past it.
|
||||
6. `MaxMemoryPerQueryPerNode` (a tenant-level resource governor knob;
|
||||
on Sentinel it's a shared, opaque value) gets hit.
|
||||
7. You get `Request was aborted due to exceeding query memory limits`
|
||||
— or worse, partial results with no warning.
|
||||
|
||||
The same shape, with a left side just under the broadcast limit, OOMs
|
||||
intermittently as data volume grows day to day. That's the "silent
|
||||
regression" that makes 90-day Sentinel hunts unreliable in production.
|
||||
|
||||
### What `hint.shufflekey` does
|
||||
|
||||
```kusto
|
||||
A | join kind=inner hint.strategy=shuffle hint.shufflekey=Key (B) on Key
|
||||
```
|
||||
|
||||
You override the planner's estimate and force the shuffle strategy on
|
||||
`Key`. Both sides get re-hashed by `Key`, sent across the network into
|
||||
hash buckets, joined locally. No node has to hold all of either side:
each node holds only **its bucket's slice**, so per-node memory is roughly
the data size divided by the node count, instead of every node holding a
full copy of the broadcast side.
|
||||
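The memory difference between the two strategies is simple arithmetic, sketched here in Python. The row counts and node counts are hypothetical, and the model assumes perfectly even hashing with no key skew.

```python
def broadcast_build_rows_per_node(build_rows):
    """Broadcast: every node holds a full copy of the build (smaller) side."""
    return build_rows

def shuffle_build_rows_per_node(build_rows, n_nodes):
    """Shuffle: each node holds about 1/n of the build side (no skew assumed)."""
    return build_rows // n_nodes

build_rows = 50_000_000  # hypothetical size of bad_dns after filtering
for n_nodes in (10, 20, 40):
    print(
        n_nodes,
        broadcast_build_rows_per_node(build_rows),
        shuffle_build_rows_per_node(build_rows, n_nodes),
    )
# 10 50000000 5000000
# 20 50000000 2500000
# 40 50000000 1250000
```

Adding nodes does nothing for the broadcast case, which is why a mis-estimated "small" side hits the per-node memory limit regardless of cluster size.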
|
||||
### Why this is a footgun
|
||||
|
||||
1. **You have to know to use it.** The default broadcast looks fine on
|
||||
a 1-day window and silently breaks at 90 days.
|
||||
2. **You have to pick the right key.** If `hint.shufflekey` is something
|
||||
with high skew (e.g. one user has 95% of the events), one node still
|
||||
OOMs while the others sit idle. You'd then add `hint.num_partitions=N`
|
||||
and tune it. Production hunts often have 3+ hints stacked just to
|
||||
keep them stable.
|
||||
3. **You can't compose well.** Two joins in the same query each need
|
||||
their own carefully chosen shuffle key. Get one wrong and the second
|
||||
join breaks.
|
||||
4. **The hints are advisory, not contractual.** A future Kusto version
|
||||
may ignore your hint if its own cost model thinks broadcast is
|
||||
better. Sentinel's update cadence means a query stable today can
|
||||
regress on a Tuesday with no warning.
|
||||
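The skew problem in point 2 can be demonstrated with a short Python sketch. It uses `zlib.crc32` as a stand-in hash so the result is deterministic; the hash function Kusto uses and the key values below are not taken from the source.

```python
import zlib
from collections import Counter

def bucket_sizes(keys, n_nodes):
    """Count how many rows land on each node when hashing by key."""
    return Counter(zlib.crc32(k.encode()) % n_nodes for k in keys)

# Hypothetical skew: one account produces 95% of the rows.
keys = ["svc-account"] * 950 + [f"user-{i}" for i in range(50)]
sizes = bucket_sizes(keys, n_nodes=4)

# All rows for one key hash to the same node, so one bucket holds >= 950 rows.
print(max(sizes.values()) >= 950)  # True
print(sum(sizes.values()))         # 1000
```

However many nodes are added, the hot key's rows still land together, which is why a shuffle on a skewed key can OOM one node while the rest sit idle.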
|
||||
### What SDL does instead
|
||||
|
||||
The reduction is distributed **by construction** — there is no
|
||||
broadcast vs shuffle planner because the engine never moves un-reduced
|
||||
rows across the network. Each worker filters → projects →
|
||||
partial-aggregates → partial-joins on its local epochs and sends only
|
||||
the reduced state up to the coordinator. The 90-day join in the SDL
|
||||
example in [`MPP_vs_KQL.md`](MPP_vs_KQL.md) §6 needs **zero hints**:
|
||||
the joiner doesn't need to estimate sizes because it never decides to
|
||||
broadcast.
|
||||
|
||||
---
|
||||
|
||||
## TL;DR sentences (drop-in for sidebars)
|
||||
|
||||
> **`has_any` cliff.** Kusto's term index tokenizes string columns at
|
||||
> ingest. `has_any` on `ProcessCommandLine` defeats it three ways at
|
||||
> once: the needles often contain separator characters (so the indexed
|
||||
> terms don't match), the OR-merge of many needles exceeds the
|
||||
> planner's index-vs-scan threshold, and `ProcessCommandLine` has such
|
||||
> a long-tail term distribution that index lookups rarely prune shards
|
||||
> anyway. At 90 d the scan that results is the largest single column
|
||||
> scan in the workspace.
|
||||
|
||||
> **`hint.shufflekey` cliff.** Kusto's join planner picks broadcast vs
|
||||
> shuffle from an estimated cardinality. On a 90-day cross-source hunt
|
||||
> the estimate is almost always wrong, the planner picks broadcast, and
|
||||
> the smaller side turns out to be tens of millions of rows. Without
|
||||
> `hint.strategy=shuffle hint.shufflekey=...` the query OOMs against
|
||||
> `MaxMemoryPerQueryPerNode`. The hint is required for stability and
|
||||
> has to be re-tuned per query and per data-volume change — a
|
||||
> maintenance tax SDL's distributed-reduction engine doesn't impose.
|
||||
@@ -0,0 +1,41 @@
|
||||
// 90-day cross-source hunt: failed Entra sign-ins ∧ bad-domain DNS ∧
|
||||
// LOLBin process execution on the SAME user/host.
|
||||
//
|
||||
// Paste into a Microsoft Sentinel workspace that has SigninLogs +
|
||||
// _Im_Dns + _Im_ProcessEvent populated.
|
||||
//
|
||||
// Known cliffs on a real workspace at 90d:
|
||||
// * memory pressure on _Im_ProcessEvent
|
||||
// * has_any on ProcessCommandLine bypasses the term index
|
||||
// * hint.shufflekey is required to avoid OOM on the cross-table join
|
||||
|
||||
let suspect_domains = dynamic([
|
||||
"c2.example.com",
|
||||
"suspect.example.net"
|
||||
]);
|
||||
let lolbins = dynamic([
|
||||
"powershell", "rundll32", "mshta"
|
||||
]);
|
||||
let suspicious_users =
|
||||
SigninLogs
|
||||
| where TimeGenerated > ago(90d)
|
||||
| where ResultType != "0" or RiskLevelDuringSignIn == "high"
|
||||
| summarize FailedSignins = count() by UserPrincipalName;
|
||||
let bad_dns =
|
||||
_Im_Dns(starttime=ago(90d))
|
||||
| where DnsQuery has_any (suspect_domains)
|
||||
| project TimeGenerated, SrcIpAddr, DnsQuery, UserName = SrcUsername;
|
||||
_Im_ProcessEvent(starttime=ago(90d))
|
||||
| where ProcessCommandLine has_any (lolbins)
|
||||
| join kind=inner hint.strategy=shuffle hint.shufflekey=DvcHostname (
|
||||
bad_dns
|
||||
| extend DvcHostname = tostring(SrcIpAddr)
|
||||
) on DvcHostname
|
||||
| join kind=inner (suspicious_users)
|
||||
on $left.ActorUsername == $right.UserPrincipalName
|
||||
| summarize Hits = count(),
|
||||
Domains = make_set(DnsQuery, 20),
|
||||
Cmdlines = make_set(ProcessCommandLine, 20)
|
||||
by ActorUsername, DvcHostname
|
||||
| order by Hits desc
|
||||
| take 100
|
||||
@@ -0,0 +1,63 @@
|
||||
// Same 90-day cross-source hunt expressed in SentinelOne SDL PowerQuery.
|
||||
//
|
||||
// Paste into a SDL tenant (e.g. https://xdr.us1.sentinelone.net) with
|
||||
// `startTime = "90d"`. Replace the synthetic domain regex if you have
|
||||
// a real IOC list.
|
||||
//
|
||||
// Why this looks different from the KQL:
|
||||
// * one schema (OCSF) instead of three runtime parser unions
|
||||
// * one engine path; 90d is more epochs scanned in parallel, NOT a
|
||||
// different code path (no Basic/Auxiliary/Archive tiers)
|
||||
// * the join is a single named-input join; no shuffle hint required
|
||||
// because reduction is already distributed
|
||||
//
|
||||
// NOTE on `loginIsSuccessful = 'false'`:
|
||||
// SDL stores booleans as lowercase strings. On a tenant whose OCSF
|
||||
// parser emits true/false as native booleans, drop the quotes:
|
||||
// AND event.login.loginIsSuccessful = false
|
||||
|
||||
| join
|
||||
failed_signins = (
|
||||
event.category = 'logins'
|
||||
AND event.login.loginIsSuccessful = 'false'
|
||||
| columns userName = event.login.userName,
|
||||
host = endpoint.name
|
||||
| group n_fails = count() by userName, host
|
||||
),
|
||||
bad_dns = (
|
||||
event.type = 'DNS Resolved'
|
||||
AND dns.question.name matches '(c2|suspect)\.example\.'
|
||||
| columns userName = src.endpoint.user.name,
|
||||
host = endpoint.name,
|
||||
domain = dns.question.name
|
||||
| group dns_hits = count(),
|
||||
domains = array_agg_distinct(domain, 20)
|
||||
by userName, host
|
||||
),
|
||||
susp_proc = (
|
||||
event.type = 'Process Creation'
|
||||
AND src.process.cmdline matches '(?i)(powershell|rundll32|mshta)'
|
||||
| columns userName = src.process.user,
|
||||
host = endpoint.name,
|
||||
cmdline = src.process.cmdline
|
||||
| group proc_hits = count(),
|
||||
cmdlines = array_agg_distinct(cmdline, 20)
|
||||
by userName, host
|
||||
)
|
||||
on userName, host
|
||||
| columns userName,
|
||||
host,
|
||||
hits = n_fails + dns_hits + proc_hits,
|
||||
n_fails,
|
||||
dns_hits,
|
||||
proc_hits,
|
||||
domains,
|
||||
cmdlines
|
||||
// If your tenant complains about bare field names after the join, use the
|
||||
// fully-prefixed form instead:
|
||||
// | columns userName, host,
|
||||
// hits = failed_signins.n_fails + bad_dns.dns_hits + susp_proc.proc_hits,
|
||||
// failed_signins.n_fails, bad_dns.dns_hits, susp_proc.proc_hits,
|
||||
// bad_dns.domains, susp_proc.cmdlines
|
||||
| sort -hits
|
||||
| limit 100
|
||||