MPP vs KQL: Why SentinelOne's Architecture Wins Cross-Source Threat Hunting
Companion to github.com/marcredhat/kql. The architecture claims in this document are backed by the 17-rule end-to-end equivalence proof in that repository: every "ready-to-use" KQL query from the Microsoft Sentinel data-lake docs was converted to SDL PowerQuery, run against the same deterministic dataset on both engines, and asserted to produce equivalent verdicts. See
reports/PROOF.md after running ./run_proof.sh.
TL;DR
KQL on Microsoft Sentinel is a clever query language on top of a general-purpose log analytics store (Azure Data Explorer / Kusto). SentinelOne's Singularity Data Lake (SDL) is a purpose-built, indexless, columnar, always-hot security lake with a massively parallel processing (MPP) query engine that dedicates the entire cluster to every interactive query.
For 90-day cross-source hunts joining endpoint + identity + DNS + cloud, the SDL design removes the three things that actually slow KQL down in production: index/shard locality, workspace boundaries, and tiered storage rehydration.
Demonstrated end-to-end on a public repo: 17 KQL hunts ↔ 17 PowerQueries, same data, asserted equivalence — github.com/marcredhat/kql.
1. Storage model: inverted+columnar shards (Kusto) vs pure columnar epochs (SDL)
Microsoft Sentinel / KQL (ADX/Kusto under the hood)
Kusto stores data as extents (immutable shards), columnar within an extent, but each extent has a shard-level inverted term index ("shard index") and column-level bloom/range indexes.
Performance is excellent when the query predicate hits an indexed column
with good selectivity (where Computer == "x", where SourceIP has "1.2.3.4").
Performance degrades when:
- Predicates are on unindexed or high-cardinality JSON dynamic fields → falls back to scan.
- You use has_any, matches regex, or contains on large columns → index bypass.
- Joins cross tables with different partitioning keys → shuffle.
- You query Basic Logs / Auxiliary Logs / Archive tiers → restricted KQL, no joins, or rehydration jobs (search jobs / restore) measured in hours, billed separately.
Sidebar: what an inverted index actually is. Inverted index = term → [doc, doc, …]. Great for selective full-text lookup, useless for unselective scans where most of the data is the answer. That is exactly the shape of cross-source threat-hunt queries.
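The sidebar's distinction can be made concrete with a toy model. The sketch below is plain Python, not Kusto internals: it assumes a whitespace tokenizer and a made-up four-event corpus, and only illustrates why a whole-term lookup can use a posting list while a mid-term substring match has to read every row.

```python
from collections import defaultdict

# Toy corpus: each "doc" is one event's command line (hypothetical data).
events = [
    "powershell -enc AAAA",
    "svchost.exe -k netsvcs",
    "powershell Get-Process",
    "rundll32 shell32.dll",
]

def build_inverted_index(docs):
    """Map each lowercase whitespace-delimited term to the sorted doc ids containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def term_lookup(index, term):
    """Selective lookup: touches only the posting list for one term."""
    return index.get(term.lower(), [])

def substring_scan(docs, needle):
    """Substring match: the term index cannot help, so every doc is read."""
    needle = needle.lower()
    return [doc_id for doc_id, text in enumerate(docs) if needle in text.lower()]

index = build_inverted_index(events)
print(term_lookup(index, "powershell"))  # whole-term hit served from the index
print(term_lookup(index, "dll"))         # "dll" is not a whole term, so the index finds nothing
print(substring_scan(events, "dll"))     # the substring is only found by scanning every doc
```

The real tokenization rules in Kusto differ from a whitespace split; the point is only the shape of the two access paths.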
SentinelOne SDL
No inverted index. Data is written in ~5-minute epochs to columnar, append-only segments backed by object storage (the Scalyr/DataSet lineage).
- Every byte ever ingested is hot — the epoch reader path is identical for "last 15 minutes" and "90 days ago."
- There is no index to tune, no extent merge policy, no "is this column indexed?" question. Cost of a scan is predictable and linear in bytes-after-columnar-pruning.
Why this matters for hunting: the most useful hunt queries are exactly the ones KQL indexes least well — wide union across tables, regex/substring on command lines, joins across identity↔endpoint↔network on fields that aren't the partitioning key.
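The "linear in bytes-after-columnar-pruning" claim can be read as simple arithmetic. The sketch below is a back-of-the-envelope model only: the column names are borrowed from the queries in this document, the daily sizes are invented, and nothing here is a measured SDL figure.

```python
def scan_bytes(column_bytes_per_day, columns_used, days):
    """Bytes read = (sum of the referenced columns' daily sizes) * days in the window."""
    per_day = sum(column_bytes_per_day[name] for name in columns_used)
    return per_day * days

# Hypothetical daily compressed sizes per column, in bytes.
column_bytes_per_day = {
    "timestamp": 2_000_000,
    "endpoint.name": 5_000_000,
    "src.process.cmdline": 40_000_000,
    "dns.question.name": 10_000_000,
}

# A query that references two columns reads only those two columns.
narrow = scan_bytes(column_bytes_per_day, ["timestamp", "endpoint.name"], days=90)
# A query that references every column reads all of them.
wide = scan_bytes(column_bytes_per_day, list(column_bytes_per_day), days=90)

print(narrow, wide)
```

Under this model, doubling the window doubles the bytes read, and dropping an unused wide column (such as the command line) is the main lever on cost.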
2. Query execution: per-query resource ceilings (Kusto) vs full-cluster-per-query (SDL MPP)
Kusto / Sentinel
Kusto is multi-tenant per cluster and uses a workload group / resource
governor model. Each query gets a slice of cluster CPU/memory, bounded
by MaxMemoryPerQueryPerNode, MaxConcurrentRequests, request rate
limits, etc.
Sentinel customers don't even own the cluster — they get a logical
workspace on shared infrastructure with opaque, throttled capacity. You
routinely see E_QUERY_RESULT_SET_TOO_LARGE, Request is throttled,
partial results, or the 10-minute query timeout.
Joins are particularly painful: join kind=inner defaults to
broadcasting the left side; if it's too big you must hint
hint.strategy=shuffle + hint.shufflekey=.... Get the hint wrong on a
90-day join and the query OOMs or times out.
SDL MPP
The architecture is explicit: every CPU core on every compute node works on one interactive query at a time, with horizontal scheduling across the tenant.
- Each worker does early predicate pushdown + local aggregation/reduction on its epoch segments.
- The coordinator merges already-reduced outputs, not raw rows.
Real-world latency from our 17-rule proof. SDL PowerQuery latency on a freshly-ingested 445-event corpus was 1.7–2.6 s end-to-end (HTTP + parse + scan + aggregate) for hunt-shape queries. The same queries against a wider
30d startTime window were noticeably slower, a reminder that even on a full-cluster MPP, window sizing still matters; the value is that the runtime path is identical for 2 h vs 90 d, not that it's free.
Why this matters: KQL's bottleneck on a hard query is rarely the language — it's that you can't actually have the whole cluster for 8 seconds. SDL's whole point is that you can.
3. The "parallel scan + local reduction" point is the real one
Many engines parallelize the scan. Few parallelize the reduction.
Kusto fans out the scan, then often funnels intermediate results back to
a coordinator/data-node tier for summarize, join, top,
make-series. On a 90-day, multi-table hunt that intermediate set can be
enormous, and that's where queries stall.
SDL keeps filter → project → partial-aggregate → partial-join
distributed for as long as possible, so what flows up the tree is small.
This is the same insight behind Snowflake / Presto / Trino, applied
specifically to security telemetry shapes (high-cardinality
process_name, device_id, user_principal_name, dns_question,
etc.).
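The scan-versus-reduction distinction above can be sketched in a few lines. This is a generic illustration of partial aggregation, not SDL's or Kusto's actual execution code; the segments, the process names, and the worker/coordinator function names are all hypothetical.

```python
from collections import Counter

# Hypothetical per-worker segments: each worker holds rows of (process_name, device_id).
segments = [
    [("powershell", "d1"), ("cmd", "d1"), ("powershell", "d2")],
    [("powershell", "d1"), ("mshta", "d3")],
    [("cmd", "d2"), ("cmd", "d2"), ("rundll32", "d1")],
]

SUSPICIOUS = {"powershell", "mshta", "rundll32"}

def worker_reduce(rows):
    """Filter and count locally, so only one small Counter leaves the worker."""
    return Counter(name for name, _device in rows if name in SUSPICIOUS)

def coordinator_merge(partials):
    """Merge already-reduced partial counts; no raw rows reach this step."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

partials = [worker_reduce(seg) for seg in segments]
result = coordinator_merge(partials)

rows_shipped = sum(len(p) for p in partials)   # partial-aggregate rows sent upward
raw_rows = sum(len(seg) for seg in segments)   # what a scan-only fan-out would send
print(result.most_common())
print(rows_shipped, "partial rows shipped vs", raw_rows, "raw rows")
```

On real telemetry the gap between the two numbers grows with the window, which is the reason the reduction step matters more than the scan step on a 90-day hunt.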
4. Schema and normalization: ASIM (best-effort views) vs OCSF (native)
Sentinel / ASIM
Microsoft's Advanced Security Information Model (ASIM) is implemented as KQL
functions/parsers on top of raw tables. Every cross-source query expands
at runtime into a union of parsers (_Im_NetworkSession,
_Im_ProcessEvent, …).
Each parser is a function call; the optimizer can't always push predicates through them cleanly, so wide ASIM queries can be significantly slower than native-table queries.
Coverage is partial — many 3rd-party sources have no ASIM parser and must be hand-normalized.
SDL / OCSF
Data is normalized to OCSF at ingest by the AI-native pipeline. The columns on disk are already the unified schema.
No runtime parser expansion, no union of synthetic views — cross-source joins are just joins on real columns.
Honest hedge. This holds for first-party connectors and the SentinelOne ingest catalog. For raw
/api/uploadLogs ingest, the customer-supplied parser determines schema fidelity. We hit this while building the equivalence proof and ended up using a json parser plus a per-event event_type discriminator column to mimic the table-per-source shape of Sentinel.
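The discriminator-column workaround described above might look roughly like the following. This is a sketch under assumptions, not the repo's harness code: the source names, the payload field names, and the choice of one JSON object per line are all illustrative, and only the event_type column name comes from the text.

```python
import json

# Hypothetical raw events from different sources; field names are illustrative only.
raw_events = [
    ("signin", {"user": "alice", "ok": False}),
    ("dns", {"host": "wks-1", "query": "c2.example.com"}),
    ("process", {"host": "wks-1", "cmdline": "powershell -enc AAAA"}),
]

def tag_event(source, payload):
    """Return a copy of the payload with an event_type discriminator column added."""
    tagged = dict(payload)
    tagged["event_type"] = source
    return tagged

def to_ndjson(events):
    """Serialize one JSON object per line (assumed here to suit a line-oriented JSON parser)."""
    return "\n".join(
        json.dumps(tag_event(source, payload), sort_keys=True)
        for source, payload in events
    )

body = to_ndjson(raw_events)
print(body)
```

With the discriminator in place, a filter on event_type plays the role that picking a table plays in Sentinel.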
Why this matters for cross-source hunts: the realistic 90-day Okta +
DNS + EDR hunt in Sentinel is
union isfuzzy=true (_Im_WebSession), (_Im_Dns), (_Im_ProcessEvent) | join ...
and it is a known performance cliff. In SDL the same hunt is one query
against one schema.
5. Retention and tiering: the silent killer for KQL hunts
| Dimension | Sentinel/Log Analytics | SentinelOne SDL |
|---|---|---|
| Default interactive retention | 90 days (Analytics tier) | Entire retention window |
| Beyond that | Basic Logs (restricted KQL, no joins, no alerts) → Auxiliary → Archive (search jobs, hours-long restores) | Same query path, same engine |
| Cost shape | Pay per GB ingested and per tier transition and per search job | Flat hot lake |
| Join across tiers | Not supported | Native |
A "90-day cross-source hunt" in Sentinel silently becomes a tiered-storage project. In SDL it's a query.
Concrete detail from the equivalence proof. SentinelOne's
/api/uploadLogs accepted 445 events spanning a wide range of embedded timestamps in a single 217 KB POST; the same query path then served them <2 s later. There is no warm-tier flip, no rehydration job, no separate billing meter. (See harness/sdl_client.py upload_logs() in the repo.)
6. Concrete walkthrough: the 90-day Okta → DNS → process hunt
Sentinel / KQL (realistic shape — and its failure modes)
let suspect_domains = dynamic(["c2.example.com", "suspect.example.net"]);
let suspicious_users =
SigninLogs
| where TimeGenerated > ago(90d)
| where ResultType != 0 or RiskLevelDuringSignIn == "high"
| summarize by UserPrincipalName;
let bad_dns =
_Im_Dns(starttime=ago(90d))
| where DnsQuery has_any (suspect_domains)
| project TimeGenerated, SrcIpAddr, DnsQuery;
_Im_ProcessEvent(starttime=ago(90d))
| where ProcessCommandLine has_any ("powershell","rundll32","mshta")
| join kind=inner hint.strategy=shuffle hint.shufflekey=DvcHostname (
bad_dns | extend DvcHostname = tostring(SrcIpAddr)
) on DvcHostname
| join kind=inner (suspicious_users)
on $left.ActorUsername == $right.UserPrincipalName
_Im_* are runtime parser functions, not tables; each call re-unions and re-projects the underlying sources, defeating extent pruning.
Real-world failure modes:
- 90 d on _Im_ProcessEvent blows past workspace query memory → partial results or timeout.
- has_any on ProcessCommandLine bypasses the term index → full scan of the largest table. (Deep dive: kql_cliffs_explained.md §1.)
- ASIM parsers re-union underlying tables on every call.
- If process events are in Basic Logs to save money: join is not allowed in Basic. Query refuses to run. You now schedule a Search Job (async, hours, separate billing) and stitch results manually.
- hint.strategy=shuffle hint.shufflekey=DvcHostname is required to keep the cross-table join from OOMing at 90 d; the hint has to be re-tuned as data volume grows. (Deep dive: kql_cliffs_explained.md §2.)
SDL / PowerQuery (same intent — fully runnable)
The version below uses the SDL named-input join syntax that we
validated against xdr.us1.sentinelone.net while building the
equivalence proof in this repo. See
docs/runnable_examples/90day_okta_dns_process.pq for the same query
ready to paste into your tenant.
NOTE on loginIsSuccessful = 'false': SDL stores booleans as lowercase strings via the JSON parser, so the quoted form fires on synthetic data ingested via uploadLogs. On a tenant whose OCSF parser emits native booleans, drop the quotes.
| join
failed_signins = (
event.category = 'logins'
AND event.login.loginIsSuccessful = 'false'
| columns userName = event.login.userName,
host = endpoint.name
| group n_fails = count() by userName, host
),
bad_dns = (
event.type = 'DNS Resolved'
AND dns.question.name matches '(c2|suspect)\.example\.'
| columns userName = src.endpoint.user.name,
host = endpoint.name,
domain = dns.question.name
| group dns_hits = count(),
domains = array_agg_distinct(domain, 20)
by userName, host
),
susp_proc = (
event.type = 'Process Creation'
AND src.process.cmdline matches '(?i)(powershell|rundll32|mshta)'
| columns userName = src.process.user,
host = endpoint.name,
cmdline = src.process.cmdline
| group proc_hits = count(),
cmdlines = array_agg_distinct(cmdline, 20)
by userName, host
)
on userName, host
| columns userName,
host,
hits = n_fails + dns_hits + proc_hits,
n_fails,
dns_hits,
proc_hits,
domains,
cmdlines
| sort -hits
| limit 100
Why this shape:
- Plain join, not sql join: sql join currently supports at most two subqueries; plain join supports 2+ with inner-join semantics by default.
- Each branch is pre-aggregated to one row per (userName, host) so the join can't collapse multiple DNS/process matches to a single arbitrary first match. array_agg_distinct(..., 20) preserves the per-pair rollup.
- failed_signins emits host so the multi-key join on userName, host is satisfied symmetrically across all three sides.
- Final columns references join keys bare (userName, host); named-subquery prefixes are reserved for aggregate fields, where needed.
One schema (OCSF), one engine, one storage tier, full cluster for the query. 90 d isn't a different code path — it's more epochs scanned in parallel.
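The pre-aggregation point in the list above is easier to see on a toy dataset. The sketch below models the idea in plain Python and makes no claim about how the SDL join is implemented: the rows are invented, rollup() is a hypothetical stand-in for group ... by userName, host with a capped distinct list, and inner_join() is an ordinary dictionary intersection.

```python
from collections import defaultdict

# Hypothetical matched rows per branch: (userName, host, matched value).
dns_rows = [
    ("alice", "wks-1", "c2.example.com"),
    ("alice", "wks-1", "suspect.example.net"),
    ("bob", "wks-2", "c2.example.com"),
]
proc_rows = [
    ("alice", "wks-1", "powershell -enc AAAA"),
    ("alice", "wks-1", "mshta http://c2.example.com/x"),
]

def rollup(rows, cap=20):
    """Collapse rows to one record per (userName, host): hit count plus capped distinct values."""
    grouped = defaultdict(lambda: {"hits": 0, "values": []})
    for user, host, value in rows:
        record = grouped[(user, host)]
        record["hits"] += 1
        if value not in record["values"] and len(record["values"]) < cap:
            record["values"].append(value)
    return dict(grouped)

def inner_join(left, right):
    """Inner join on the shared key; each side already has exactly one row per key."""
    return {key: (left[key], right[key]) for key in left.keys() & right.keys()}

joined = inner_join(rollup(dns_rows), rollup(proc_rows))
for (user, host), (dns, proc) in joined.items():
    print(user, host, dns["hits"], proc["hits"], dns["values"], proc["values"])
```

Because each side carries one row per key, the join cannot silently keep one DNS match and drop the other; both domains and both command lines survive in the rolled-up lists, and bob drops out only because he has no process match.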
7. Where KQL is genuinely good (be fair)
- KQL as a language is excellent, arguably more expressive than PowerQuery for ad-hoc shaping (make-series, mv-expand, bag_unpack, series_decompose_anomalies).
- For narrow, indexed, recent queries (where IPAddress == x and TimeGenerated > ago(1h)), Kusto is extremely fast.
- ADX outside Sentinel (your own cluster, your own SKU) lets you actually size compute; most of the pain above is Sentinel's multi-tenant workspace packaging, not Kusto itself.
The architectural argument isn't "KQL is bad." It's "for the workload security teams actually run — wide, long, cross-source, unselective predicates — an indexless columnar always-hot lake with MPP wins by design."
7a. What the KQL → PQ translation actually looks like in practice
From the 17-rule conversion in this repo: 15 of 17 translate
mechanically; only the 2 using series_fit_line required a redesign.
| KQL idiom | SDL PowerQuery | Friction |
|---|---|---|
| where TimeGenerated > ago(1d) | startTime param + ts_epoch_ms ≥ N | None |
| summarize n=count() by X | \| group n=count() by X | None |
| dcount(X) | estimate_distinct(X) | None |
| make_set(X) | array_agg_distinct(X, N) | None (must specify cap) |
| in~ ('a','b') | in ('a','b') | None |
| contains / has | contains | None |
| extend Y = ... | \| let Y = ... | Light (must pick a fresh name) |
| join kind=leftanti | Filter on baseline set built via array_agg | Light |
| top N by X | \| sort -X \| limit N | None |
| bin(t, 1h) / make-series | timebucket('1 hour') | Light (no make-series) |
| series_fit_line (ML) | No equivalent | Hard: pre-aggregate or mark at ingest |
ML-on-query-time is the wrong place to do anomaly fitting at security-lake scale. SDL pushes it left (to ingest pipelines / Purple AI) on purpose.
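To show what "translate mechanically" means for the zero-friction rows of the table above, here is a deliberately tiny rewrite sketch. It is not the converter used in the repo: it covers only three idioms (summarize, top, extend), splits naively on the pipe character, and the function names are hypothetical.

```python
import re

# Each rule is (compiled KQL pattern, PowerQuery replacement template).
# Only three mechanical rows of the table are covered; this is not a real converter.
RULES = [
    (re.compile(r"^summarize\s+(.+)$"), r"| group \1"),
    (re.compile(r"^top\s+(\d+)\s+by\s+(\w+)$"), r"| sort -\2 | limit \1"),
    (re.compile(r"^extend\s+(.+)$"), r"| let \1"),
]

def translate_stage(kql_stage):
    """Translate one KQL pipeline stage (without its leading '|'), or raise if unsupported."""
    stage = kql_stage.strip()
    for pattern, template in RULES:
        if pattern.match(stage):
            return pattern.sub(template, stage)
    raise ValueError(f"no mechanical translation for: {stage!r}")

def translate_pipeline(kql):
    """Split on '|' (naive: breaks on pipes inside string literals) and translate each stage."""
    stages = [s for s in kql.split("|") if s.strip()]
    return " ".join(translate_stage(s) for s in stages)

print(translate_pipeline("summarize n=count() by X | top 10 by n"))
```

Anything outside the rule list, such as series_fit_line, raises instead of guessing, which mirrors the "required a redesign" category.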
8. The actual operator experience, side by side
Pulled directly from the lessons embedded in the repo's README.md.
| Pain point in Sentinel / KQL | Equivalent (or absence) in SDL — what to flag |
|---|---|
| E_QUERY_RESULT_SET_TOO_LARGE | Not seen during proof; but wide startTime (30d) can return matching=0 where a 30m window returns N. Always pass the tightest window that contains your data. |
| ASIM parser unions at query time | Single OCSF schema; no fix needed |
| Search Jobs (hours, separate billing) | uploadLogs ingest path; events queryable <30 s after upload |
| bytesCharged: 0 confusion | Not a rejection signal: a billing meter, not an error code |
| KQL function discovery (IntelliSense) | group_unique_values() does not exist; use array_agg_distinct(field, N). Publish the full SDL agg function list in your skills. |
| ~= vs contains | ~= is case-insensitive equality, not substring (a common foot-gun) |
| addEvents silent drops | Use /api/uploadLogs for historical ingest; addEvents silently drops events whose ts is outside its acceptance window |
| serverHost in addEvents payload | Not honoured; use a custom marker attribute (we use proof_run_id) |
9. The architectural scoreboard for cross-source threat hunting
| Dimension | Sentinel + KQL | SentinelOne SDL + MPP |
|---|---|---|
| Storage | Columnar extents + inverted/bloom indexes | Indexless columnar epochs |
| Hot vs cold | Analytics / Basic / Auxiliary / Archive | All hot |
| Schema | ASIM (runtime parser functions) | OCSF at ingest |
| Query resources | Slice of shared cluster, governed | Whole cluster per interactive query |
| Reduction | Often funneled to coordinator | Distributed local reduction |
| Cross-source join over 90 d | Multi-table union + shuffle hints + tier limits | Single engine, single schema |
| AI assistant value | Bottlenecked by backend latency | Purple AI is useful because the backend answers in seconds |
Bottom line
KQL is a great query language sitting on a general-purpose analytics database that Microsoft repackaged as a SIEM. SentinelOne built the storage, ingest, and execution layers together, for security telemetry shapes: high cardinality, wide joins, long retention, unselective predicates. That co-design — indexless columnar epochs + OCSF + full-cluster MPP with distributed reduction — is why the 15-minute hunt becomes the 8-second hunt, and why Purple AI is operational rather than a demo.
Proof artifact: github.com/marcredhat/kql.