Sync upstream features; preserve fork KV scanner, parsers, verifier

Brought in 35 upstream commits (MITRE heatmap, health score, dependency map, PowerQuery playground, onboarding tracker, product grouping, modern UI redesign).

Preserved fork additions:
- backend/routers/quality.py: KV scanner, pattern refs, JS keys, JSON mode, /parsers + /sync-from-sdl endpoints
- parsers/: 96 OCSF + tenant parsers
- tools/stormshield-verify/: end-to-end ingest regression test
- .gitignore: un-ignored parsers/*
- CHANGES.md, PATCHES.md
2026-06-08 20:37:12 +00:00 · 2026-05-22 18:19:52 +02:00
parent a7ebcac9a6
commit 7c1687efce
102 changed files with 13912 additions and 178 deletions
@@ -0,0 +1,168 @@
+# SIEM-Toolkit Patches & Helper Scripts
+
+A drop-in patch set that fixes several issues in the upstream
+[`mickbrowns1/SIEM-Toolkit`](https://github.com/mickbrowns1/SIEM-Toolkit) and
+adds helper scripts for syncing parsers from a SentinelOne SDL tenant and
+probing PowerQuery / event data.
+
+## What's fixed in the upstream code
+
+| File | Fix |
+|---|---|
+| `backend/routers/ingest.py` | **Filter Simulator** PowerQuery rewritten — replaced legacy `count() as events` and `src.name` field with valid SDL `\| filter dataSource.name=='X' \| group events=count()` |
+| `backend/routers/quality.py` | New `GET /api/quality/parsers` endpoint lists actual parser files; `_flatten_event` now JSON-parses nested `message` payloads so the **Field Population** tool reports real coverage (was always 0% for sources where the parser isn't applied at query time) |
+| `backend/routers/quality.py` (Parser Test Runner) | Detects SDL JSON auto-extract format `$=json{parse=json}$` and parses log lines as JSON; applies parser `rewrites` (`input/output/match/replace` blocks) with correct `$0`/`$N` backreference handling; accepts **single JSON / JSON array / NDJSON** input |
+| `frontend/index.html` | Parser dropdown now loads from `/api/quality/parsers` (was filtering `coverage/map` which only has `detected in data` placeholders); added **Last 7d** lookback to both Field Population and Sample Events; Test Runner UI now shows mode badge (`JSON auto-extract` vs `regex format`), payload count for multi-line input, and separate tables for extracted vs derived/rewritten fields |
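Reviewer sketch of the `message` flattening idea that the `quality.py` row describes for `_flatten_event`. This is not the patched code: the function name `flatten_event`, the dotted key paths, and the `message.` prefix are assumptions chosen for illustration.

```python
import json
from typing import Any


def flatten_event(event: dict[str, Any], prefix: str = "") -> dict[str, Any]:
    """Flatten nested dicts into dotted keys, JSON-decoding string `message` values."""
    flat: dict[str, Any] = {}
    for key, value in event.items():
        path = f"{prefix}{key}"
        if key == "message" and isinstance(value, str):
            # Try to treat the raw message as a JSON object.
            try:
                parsed = json.loads(value)
            except json.JSONDecodeError:
                parsed = None
            if isinstance(parsed, dict):
                flat.update(flatten_event(parsed, prefix=f"{path}."))
                continue
        if isinstance(value, dict):
            flat.update(flatten_event(value, prefix=f"{path}."))
        else:
            # Scalars, lists, and non-JSON messages are kept as-is.
            flat[path] = value
    return flat
```

With a flattening step like this, a field-coverage count sees the keys inside the payload rather than only `timestamp` and `message`.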
+
+## What's NOT fixed in the upstream code (configuration)
+
+The repo's `docker-compose.yml` interpolates `S1_BASE_URL` etc. from
+`.env` at compose-up time. **A `docker compose restart` does NOT pick up
+`.env` changes** — always use `docker compose up -d --force-recreate backend`.
+
+`S1_BASE_URL` must be the **per-tenant management console subdomain** (e.g.
+`usea1-XXXX.sentinelone.net`), not the regional SDL/XDR endpoint. If you
+only know the XDR URL, you can probe candidates with curl:
+
+```bash
+TOKEN=$(jq -r .api_token < ~/.../mgmt-config.json)
+for H in usea1-yourtenant usea1-purple usea1-partners; do
+  printf "%-45s  %s\n" "$H" \
+    "$(curl -s -o /dev/null -w '%{http_code}' \
+       "https://$H.sentinelone.net/web/api/v2.1/cloud-detection/rules?limit=1" \
+       -H "Authorization: ApiToken $TOKEN")"
+done
+# 200 = correct host
+```
+
+## Contents
+
+```
+.
+├── README.md                       (this file)
+├── env.example                     template for the toolkit's .env
+├── sdl_config.example.json         template for helper scripts' SDL config
+├── patched-files/
+│   ├── backend/routers/
+│   │   ├── ingest.py               <- copy over upstream
+│   │   └── quality.py              <- copy over upstream
+│   └── frontend/
+│       └── index.html              <- copy over upstream
+└── scripts/
+    ├── sync_sdl_parsers.py         pull all /logParsers/* from the tenant into ./parsers/
+    ├── probe_pq_syntax.py          test what PowerQuery dialect the tenant accepts
+    ├── probe_avelios.py            sample probe: find a source's events + columns
+    ├── probe_avelios_wide.py       same, sweeping 1d/3d/7d
+    ├── probe_avelios_fields.py     parse JSON `message` payloads & count fields
+    ├── test_avelios_parser.py      hit /api/quality/test-parser with one JSON line
+    └── test_avelios_multi.py       same, with multi-line NDJSON
+```
+
+## Applying the patches
+
+1. Clone the upstream repo:
+   ```bash
+   git clone https://github.com/mickbrowns1/SIEM-Toolkit.git
+   cd SIEM-Toolkit
+   ```
+2. Overlay the patched files:
+   ```bash
+   PATCH=/path/to/this/dir
+   cp "$PATCH"/patched-files/backend/routers/quality.py backend/routers/quality.py
+   cp "$PATCH"/patched-files/backend/routers/ingest.py  backend/routers/ingest.py
+   cp "$PATCH"/patched-files/frontend/index.html        frontend/index.html
+   ```
+3. Configure:
+   ```bash
+   cp "$PATCH"/env.example .env
+   $EDITOR .env                          # fill in your real values
+   ```
+4. Start the stack:
+   ```bash
+   docker compose up -d --build
+   open http://localhost:3001            # macOS; use xdg-open on Linux
+   ```
+
+## Helper-script setup
+
+The helper scripts read a small JSON config (separate from the toolkit's `.env`)
+containing your SDL log-read / config-read keys:
+
+```bash
+cp sdl_config.example.json scripts/sdl_config.json
+$EDITOR scripts/sdl_config.json
+# or set the env var
+export SDL_CONFIG=/somewhere/sdl_config.json
+```
+
+## Helper-script usage
+
+### Sync parsers from the SDL tenant into the toolkit's `parsers/` dir
+
+```bash
+PARSERS_DIR=/path/to/SIEM-Toolkit/parsers \
+  python3 scripts/sync_sdl_parsers.py
+```
+
+`PARSERS_DIR` defaults to `../parsers`, resolved relative to the script.
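Reviewer sketch of how the two environment variables described in this README could be resolved. The helper names and the rule that a set variable overrides the default are assumptions; the helper scripts themselves may do this differently.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_config_path(script_dir: str, env: Optional[Mapping[str, str]] = None) -> Path:
    """Return $SDL_CONFIG if set, else sdl_config.json next to the scripts."""
    env = os.environ if env is None else env
    override = env.get("SDL_CONFIG")
    return Path(override) if override else Path(script_dir) / "sdl_config.json"


def resolve_parsers_dir(script_dir: str, env: Optional[Mapping[str, str]] = None) -> Path:
    """Return $PARSERS_DIR if set, else ../parsers relative to the scripts."""
    env = os.environ if env is None else env
    override = env.get("PARSERS_DIR")
    return Path(override) if override else Path(script_dir) / ".." / "parsers"
```

Passing `env` explicitly keeps the functions easy to exercise without touching the real environment.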
+
+### Probe PowerQuery syntax compatibility on your tenant
+
+```bash
+python3 scripts/probe_pq_syntax.py
+```
+
+The output shows which command shapes (`| group ...`, `filter ...`, `count() as`, etc.)
+work on the active deployment.
+
+### Inspect what a given source's events actually look like
+
+```bash
+python3 scripts/probe_avelios.py            # finds a source's name + 1-line sample
+python3 scripts/probe_avelios_wide.py       # sweeps 1d/3d/7d top sources
+python3 scripts/probe_avelios_fields.py     # if `message` is JSON, flatten & count fields
+```
+
+The scripts are named `*_avelios*` after the original use case but work for **any
+source**: open the file and change the `dataSource.name` filter.
+
+### Smoke-test the patched Parser Test Runner endpoint
+
+```bash
+python3 scripts/test_avelios_parser.py      # single-line JSON
+python3 scripts/test_avelios_multi.py       # multi-line NDJSON
+```
+
+These hit `http://localhost:8001/api/quality/test-parser` directly so you can
+verify the backend without using the UI.
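Reviewer sketch of one way to normalise the three input shapes the Test Runner is said to accept (single JSON, JSON array, NDJSON). The function name `split_payloads` is hypothetical, and the backend may detect the shapes differently.

```python
import json
from typing import Any


def split_payloads(raw: str) -> list[Any]:
    """Return the individual JSON documents contained in `raw`."""
    text = raw.strip()
    if not text:
        return []
    try:
        doc = json.loads(text)
    except json.JSONDecodeError:
        # Not a single JSON document: treat it as NDJSON,
        # one document per non-blank line.
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    # A top-level array yields one payload per element.
    return doc if isinstance(doc, list) else [doc]
```

A line that is not valid JSON still raises `json.JSONDecodeError` in the NDJSON branch, so malformed input is reported rather than silently dropped.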
+
+## Common pitfalls
+
+- **Parser dropdown is empty** → run `sync_sdl_parsers.py`. The upstream "Load
+  SDL Parsers" button only indexes whatever already exists in `parsers/`.
+- **Field Population shows 0% everywhere** → the source's parser isn't being
+  applied at query time, so PowerQuery returns just `timestamp`+`message`.
+  This patch's `_flatten_event` parses JSON inside `message`. Also try widening
+  the window (the new **Last 7d** option) — some sources are low-volume.
+- **PowerQuery 400 "Unknown command [count]"** → fixed in `ingest.py`. If you
+  hit it elsewhere, the rule is: SDL PowerQuery requires `| group events=count()`,
+  never `| count() as events`; `count()` must appear inside a `group`.
+- **STAR rules → 302 to /404** → `S1_BASE_URL` is pointed at the SDL/XDR URL
+  instead of the management-console subdomain.
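Reviewer sketch of the PowerQuery rule from the pitfalls list: build the `| filter ... | group events=count()` shape and flag the legacy `| count() as` shape. Both helper names are hypothetical. Because this README does not cover PowerQuery string escaping, the builder rejects names containing quotes or backslashes rather than guessing at an escape rule.

```python
import re

# Matches the legacy shape `| count() as <name>` reported above as
# failing with "Unknown command [count]".
LEGACY_COUNT = re.compile(r"\|\s*count\(\)\s+as\b")


def count_events_suffix(source_name: str) -> str:
    """Build the filter-and-count pipeline suffix for one data source."""
    if "'" in source_name or "\\" in source_name:
        raise ValueError("source name contains characters this sketch does not escape")
    return f"| filter dataSource.name=='{source_name}' | group events=count()"


def has_legacy_count(query: str) -> bool:
    """Return True if the query uses `count()` outside a `group`, legacy style."""
    return LEGACY_COUNT.search(query) is not None
```

A check like `has_legacy_count` could be run over saved queries to find the ones that still need rewriting.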
+
+## Verification
+
+After applying patches and recreating containers:
+
+```bash
+curl http://localhost:8001/health
+curl http://localhost:8001/api/quality/parsers | python3 -m json.tool   # count > 0
+curl 'http://localhost:8001/api/ingest/top-sources?hours=24'           # real numbers
+curl -X POST http://localhost:8001/api/coverage/load-star-rules        # not 502
+```
+
+In the UI:
+- **Coverage Map**: shows `parsers_loaded` and `rules_loaded` > 0
+- **Ingest → Filter Simulator**: returns matched events + projected GB/month
+- **Parser Quality → Parser Test Runner**: dropdown lists all parsers
+- **Parser Quality → Field Population**: real coverage rates (not all 0%)