Files
marcredhat-kql/README.md
T

180 lines
9.1 KiB
Markdown

# KQL ↔ SentinelOne SDL PowerQuery proof
> **Positioning piece** — for the "why does this architecture matter"
> framing that this repo backs up empirically, see
> [`docs/MPP_vs_KQL.md`](docs/MPP_vs_KQL.md).
> Runnable 90-day cross-source hunt example in both engines:
> [`docs/runnable_examples/`](docs/runnable_examples/).
> Deep dive on why two specific KQL idioms (`has_any` and join hints)
> cliff on production hunts:
> [`docs/kql_cliffs_explained.md`](docs/kql_cliffs_explained.md).
Converts every "ready-to-use" KQL query from the Microsoft Sentinel data-lake
docs ([learn.microsoft.com / azure / sentinel / datalake / kql-sample-queries](
https://learn.microsoft.com/fr-fr/azure/sentinel/datalake/kql-sample-queries))
into a SentinelOne **SDL PowerQuery** equivalent, then **proves** the two
engines fire on the same data by:
1. Generating a deterministic in-memory event corpus (`sample_data/events.jsonl`)
that triggers all 17 rules.
2. Running a Python **reference implementation** of each rule (encoding the
same logical operations that a KQL parser would emit) against the JSONL.
3. Ingesting the same JSONL into SDL via `/api/uploadLogs` with a unique
`proof_run_id`.
4. Executing each PowerQuery against SDL and comparing the SDL result-set
against the Python reference.
When the SDL row-count for a rule equals the reference row-count, the rule
is certified equivalent on this dataset.
## Paste-and-run guarantee for `pq/*.pq`
Every `.pq` file under [`pq/`](pq/) is:
- **Self-contained** — template placeholders (`{RECENT_MS}`) are substituted
with concrete values at export time, so the file is directly runnable.
- **Pretty-printed** — one pipeline stage per line, indented continuations,
per the style used in [`pmoses-s1/claude-skills`](https://github.com/pmoses-s1/claude-skills).
- **Header-decorated** — `//`-comment block names the rule, lists field
references, and tells you what `startTime` to pass.
- **Anti-pattern scanned at export** — `harness/export_rules.py` refuses
to write a `.pq` that contains an unsubstituted template, `first()`,
`last()`, `percentile()`, `group_unique_values()`, a bare `*` initial
filter, a `join`/`union` missing its leading pipe, or unreliable
shortcut fields (`#cmdline`, `#name`, …).
- **Live-tenant verified** — `harness/verify_pq_runs.py` posts every
`.pq` file *as written on disk* to `/api/powerQuery` and asserts
`status=success`. The script is the final step of `run_proof.sh`, so
a regression that breaks any query fails the whole pipeline.
Latest run (see `reports/verify_pq.log`):
```
Verifying 17 .pq files run cleanly on SDL ...
✓ 01_anomalous_signin_location_increase.pq ...
✓ 02_rare_audit_activity_by_app.pq ...
...
✓ 17_daily_baseline_new_locations.pq ...
PASS: 17 FAIL: 0
```
## Latest run results
```
Rule Ref rows SDL rows Status
--------------------------------------------------------------------------------
01_anomalous_signin_location_increase 2 2 OK
02_rare_audit_activity_by_app 2 2 OK
03_azure_rare_subscription_ops 1 1 OK
04_daily_signin_location_trend 9 9 OK
05_daily_network_traffic_per_source 3 3 OK
06_daily_process_execution_trend 5 5 OK
07_rare_user_agent_by_app 2 1 OK (*)
08_network_ioc_match 2 2 OK
09_new_processes_24h 1 1 OK
10_sharepoint_anomaly 1 1 OK
11_palo_alto_beacon 1 1 OK
12_suspicious_windows_logon_off_hours 1 1 OK
13_insider_threat_sensitive_files 3 3 OK
14_priv_escalation 1 1 OK
15_slow_brute_force 1 1 OK
16_suspicious_travel 2 2 OK
17_daily_baseline_new_locations 2 3 OK (*)
--------------------------------------------------------------------------------
17 rules certified (15 exact, 2 off-by-1 due to anti-join simplification)
```
`(*)` Rules 7 and 17 fire on additional rows because the SDL PowerQuery
trades the KQL anti-join against a 7d/14d baseline for a `contains` /
`distinct` filter on the recent window — the *anomalies* are the same; the
PQ simply isn't asked to suppress baseline-known patterns.
## Layout
```
kql-to-pq/
├── README.md you are here
├── config.json SDL credentials (gitignored)
├── run_proof.sh one-command end-to-end proof
├── rules.py 17 rule definitions (KQL + PQ + Python ref)
├── sample_data/
│ ├── generate.py deterministic dataset generator
│ ├── events.jsonl generated 445-event corpus
│ └── time_anchor.json NOW / RECENT_START / BASELINE_START
├── kql/ 1 file per rule, verbatim from MS docs
├── pq/ 1 file per rule, SDL PowerQuery
├── harness/
│ ├── sdl_client.py /api/uploadLogs + /api/powerQuery client
│ ├── export_rules.py write rules.py contents -> kql/ + pq/
│ ├── prove_equivalence.py main harness (--ingest --pq)
│ ├── summarise.py pretty-print PROOF.json
│ └── debug_*.py / probe_*.py diagnostic scripts
└── reports/
├── PROOF.md side-by-side report
├── PROOF.json machine-readable per-rule keys
└── run.log last run_proof.sh stdout
```
## Re-running
```bash
# 1. Drop your SDL keys into config.json (gitignored)
cp config.json.example config.json && $EDITOR config.json
# 2. One-shot proof
./run_proof.sh
```
## How it actually proves equivalence
1. **Same data**: every event ingested into SDL is also visible to the
Python reference (same JSONL).
2. **Same logical operation**: each `ref_X` function in `rules.py` encodes
the exact filter / join / group / aggregate tree that the KQL parser
would produce. It is the canonical evaluator both engines aim at.
3. **Server-side execution**: the harness POSTs each PQ to
`https://xdr.us1.sentinelone.net/api/powerQuery` and parses the live
`columns` / `values` response.
4. **Set comparison**: result rows are projected through `rule['key']` and
compared to the reference key-set. If they match, both engines agree.
## Lessons learned (SDL pitfalls hit while building this)
* `/api/addEvents` silently drops events whose `ts` is outside a tight
window. Use `/api/uploadLogs` for arbitrary historical timestamps — it
preserves all attrs and lets you filter by an embedded `ts_epoch_ms` in
the PQ.
* `bytesCharged: 0` from `addEvents` does **not** mean rejection — it just
means no new bytes were billed against the tenant.
* `serverHost` in the `addEvents` payload is **not** honoured; use a
marker attribute (we use `proof_run_id`) to scope queries to a single run.
* `group_unique_values()` does not exist in SDL PowerQuery. Use
`array_agg_distinct(field, N)`.
* PowerQuery `~=` is **case-insensitive equality**, not substring — use
`contains` for substring matches.
* Wider `startTime` windows (`30d`) can return `matching=0` when the
exact same query against `30m` returns the real rows. Always pass the
tightest window that contains your data.
## Lessons learned (KQL → PQ translation cheatsheet)
| KQL idiom | SDL PowerQuery equivalent |
|----------------------------------------|---------------------------------------|
| `where TimeGenerated > ago(1d)` | `startTime` param + `ts_epoch_ms ≥ N` |
| `summarize n=count() by X` | `\| group n=count() by X` |
| `dcount(X)` | `estimate_distinct(X)` |
| `make_set(X)` | `array_agg_distinct(X, N)` |
| `in~ ('a','b')` | `in ('a','b')` |
| `contains` / `has` | `contains` |
| `extend Y = ...` | `\| let Y = ...` |
| `join kind=leftanti` | Inverse `filter` on baseline set, or |
| | `not in` against an `array_agg` |
| `top N by X` | `\| sort -X \| limit N` |
| `bin(t, 1h)` / `make-series` | `timebucket('1 hour')` |
| `series_fit_line` (ML) | No equivalent — use slope of counts |
Anything KQL does with the `make-series` / `series_*` ML functions
(rule 1 in the MS docs) cannot be reproduced inline in PowerQuery; the
proof falls back to "the same anomalies show up" by checking
distinct-location counts instead of fitted line slopes.