Commit Graph

24 Commits

Author SHA1 Message Date
Jason 00d5d5f220 fix(mcu): PR-V — ADF4382A Stage-5 audit fixes (F-5.1..F-5.10)
F-5.1: revert PWM scaffolding to binary DELADJ. Schematic-verified:
  PG7/PG13 on STM32F746ZGT7 have no TIM3 alternate function (Port G AFs are
  FMC/ETH/USART6/SAI2/SDMMC2 — no TIMx routes), and the FreqSynth-board
  DELADJ net has only a 200 kOhm pulldown (R22, R35) — no series-R +
  shunt-C LPF for PWM-to-DC. The 3979693 (Bug #5) + c466021 (B15) PWM
  scaffolding was a false-fix; 5fbe97f's original honest TODO matched the
  actual hardware. Delete htim3, MX_TIM3_Init, start/stop_deladj_pwm,
  phase_ps_to_duty_cycle. Rewrite test_bug5 for binary; delete test_bug15.

F-5.2: split ADF4382A ref_div per device. RX 10.38 GHz / 300 MHz = 34.6 is
  fractional mode, but ADF4382_PFD_FREQ_FRAC_MAX = 250 MHz — driver does
  not reject the out-of-spec config, ldwin_pw silently left at 0. Set
  rx_param.ref_div = 2 -> PFD = 150 MHz, in spec. TX unchanged (integer).

F-5.3: free prior tx_dev/rx_dev in Manager_Init before re-allocating. The
  recovery dispatch on TX/RX unlock calls Manager_Init again; previous
  adf4382_dev allocations were leaking. Mirrors F-4.5 fix for AD9523.

F-5.4: fix upstream adf4382_remove() — only freed dev struct on FAILED SPI
  removal (success path leaked) and always returned 0. Now: NULL guard,
  unconditional free, propagate ret.

F-5.8: lock-detect uses register reg[0x58] LOCKED bit as authoritative.
  GPIO disagreement still logged via DIAG_WARN but no longer flips the
  result — a mis-routed GPIO LKDET would otherwise trigger false-unlock
  recovery loops.

F-5.10: drop stale "EZSYNC" diagnostic string (post-C-14a residue).

Bench-side checks for first power-on:
- Scope PG13 (TX_DELADJ) and PG7 (RX_DELADJ) — both should be HIGH (3.3V)
  after SetPhaseShift(500,500) runs at boot.
- Confirm both ADF4382A LOs lock with PFD=150 MHz on RX (was 300 MHz).
  Lock-time may be slightly longer; phase-noise sidebands shift.
- Confirm no false-unlock storms on the recovery path — the GPIO LKDET
  disagreement DIAG_WARN should no longer flip the lock decision.

Regression: tests/ make test 34/34 PASS (was 35/35 baseline; -1 from
test_bug15 deletion as planned).
2026-05-05 09:20:06 +05:45
Jason 534905263f mcu(health): poll PD15 + dispatch ERROR_FPGA_DSP_STALL (AUDIT-S10 follow-up)
AUDIT-S10 (commit `58154a6`) split the FPGA's six-flag aggregate
gpio_dig5 into two MCU-visible bits: gpio_dig5 keeps signal-saturation
(AGC reacts), gpio_dig7 (PD15) carries control-fault classes
(range_decim_watchdog | cic_fir_overrun). Until now the MCU did NOT
poll PD15, so DSP control faults were invisible to the recovery
dispatcher.

Changes:

- New `ERROR_FPGA_DSP_STALL` enum value placed AFTER ERROR_WATCHDOG_TIMEOUT
  so the dispatcher routes to attemptErrorRecovery (FPGA reset pulse) not
  Emergency_Stop. Updated error_strings[] in lockstep (static_assert
  enforces).

- checkSystemHealth section 10 polls PD15 at 1 Hz with 2-sample debounce.
  `last_dsp_check` is committed BEFORE the early return per AUDIT-CAL
  pattern, so a flapping fault never bypasses the rate-limit. Streak
  counter resets to 0 after firing (armed for next post-recovery
  assertion) AND resets naturally when PD15 returns LOW.

- attemptErrorRecovery: ERROR_FPGA_DSP_STALL fans into the existing
  ERROR_FPGA_COMM PD12 reset case (stacked case labels, same body). No
  MCU-driven reset_monitors path exists; full bitstream reload clears
  all sticky monitors as a side effect.

Tests:
- tests/test_audit_s10_dsp_stall_polling.c (NEW, 7 scenarios, 7/7 PASS):
  T1 healthy 60s, T2 single-sample glitch blocked by debounce, T3
  sustained fault fires once, T4 post-fire rate-limit holds within
  window, T5 sustained fault rate bounded (29 errors / 60s -- MCU-N1
  latch at error_count>10 fires in ~22s, gives operator time to
  intervene), T6 counter-test demos no-debounce false-positive on
  glitch, T7 HAL_GetTick 32-bit wrap.
- MCU host suite 35/35 PASS (was 34/34; +1 new, 0 regressions).
2026-04-29 23:42:21 +05:45
Jason 1b1b5f4fb2 mcu(health): commit rate-limit window before early returns (AUDIT-CAL follow-up)
checkSystemHealth() had three watchdog blocks with the identical
"last_X_check not updated on error path" bug — same root cause as
AUDIT-CAL (BMP180 fix in commit 95aed35), distinct sites:

  AD9523 clock check   (5 s)  main.cpp:693-705
  ADAR1000 comm check  (2 s)  main.cpp:729-749
  IMU comm check       (10 s) main.cpp:752-760

Pre-fix, each block placed `last_X_check = HAL_GetTick();` below the
early-return path, so once the underlying check (STATUS0/1 RESET,
SCRATCHPAD verify fail, GY85_Update false) started failing, the
rate-limit window never engaged. Every subsequent iteration of the
main while(1) loop re-fired the corresponding ERROR_*. With
error_count > 10 latching system_emergency_state per MCU-N1, the
radar would trip into SAFE-MODE within ~10 main-loop iterations of
the first transient — far short of the intended ~100-150 s grace
window meant for operator intervention or attemptErrorRecovery
to succeed. ADAR1000 comm-failure also re-ran the 16 ms blocking
SPI verify (4 devices × 4 ms HAL_Delay) per iteration → chirp jitter.

Fix at all three sites: move the timestamp update INTO the if-block
and BEFORE any sub-check call. Mirrors the AUDIT-CAL post-fix
BMP180 block at main.cpp:771-780. ADAR1000 overtemp check stays
per-loop (unchanged) — over-temperature must remain responsive.

Test: tests/test_audit_imu_watchdog_cadence.c (6 tests, 6/6 PASS)
exercises the post-fix predicate against simulated HAL_GetTick()
ticks and a controllable GY85_Update() mock; counter-test runs the
pre-fix predicate to demonstrate the regression. Test uses IMU as
representative; AD9523 (5 s) and ADAR1000 (2 s) sites have identical
control flow.

Verification: full MCU host suite 34/34 PASS (was 33/33; +1 new test,
0 regressions).
2026-04-29 20:57:50 +05:45
Jason 95aed35d89 mcu(bmp180): call cal-coefficient init at boot + watchdog cadence fix (AUDIT-CAL)
The BMP180 driver had no public init method and never called
readCalibrationCoefficients() from anywhere -- _calCoeff ran at the
C++ in-class member-initializer defaults (all zeros) at runtime.

Consequence chain:
  - computeB5(UT) short-circuited via 0/0 (Cortex-M7 SDIV with
    SCB->CCR.DIV_0_TRP=0 returns 0 silently -- system_stm32f7xx.c does
    not enable the trap)
  - getPressure() always tripped the `if (B4 == 0)` guard, returning
    the I2C-error sentinel (post-AUDIT-C17: INT32_MIN; pre-: 255)
  - health watchdog at main.cpp:758 fired ERROR_BMP180_COMM every
    main-loop iteration because last_bmp_check was only updated on the
    success path, so the 15 s rate-limit never engaged once the check
    started failing
  - error_count > 10 latched system_emergency_state = true (per the
    MCU-N1 fix), driving SAFE-MODE within ~25 s of every boot

Fix:
  - Added BMP180::begin() public method: probes chip ID, then reads the
    11 factory cal coefficients (registers 0xAA..0xBE step 2). Returns
    true only on full success; false on chip-ID mismatch or any I2C
    failure mid-loop.
  - main.cpp BAROMETER INIT calls myBMP.begin() with up to 3 retries
    (50 ms backoff) and sets a file-scope bmp180_operational flag.
    Altitude-baseline loop now gated on success -- failure path leaves
    RADAR_Altitude at 0.0f instead of letting pow(negative, fractional)
    propagate NaN into gps_data telemetry.
  - Health watchdog gates BMP180 check on bmp180_operational AND
    updates last_bmp_check regardless of the error path. A single bad
    pressure reading no longer tight-loops into SAFE-MODE; legit sensor
    failure now takes the intended ~150 s (10 errors x 15 s) before
    the MCU-N1 latch trips, giving the operator time to intervene.

Verification:
  - new test_audit_cal_bmp180_begin.c, 3/3 PASS:
      T1 every coefficient loaded in order with correct signed/unsigned types
      T2 chip-mismatch and I2C-fail short-circuit semantics correct
      T3 regression demo: zero-cal computeB5 returns 0 for any UT (the
         silent-fail mode); datasheet cal reproduces 15.0 C
  - full MCU regression 33/33 PASS (was 32/32; +1 new test, 0 regressions)

Bug introduced in 5fbe97f (initial upload of the driver from the
Arduino enjoyneering79 BMP180 library -- the begin()/init pattern from
the upstream Arduino version was lost in the STM32 port). Latent until
this audit cycle.
2026-04-29 19:21:35 +05:45
Jason 4b142166be mcu(bmp180): replace in-band sentinel + fix uint16->int16 narrowing (AUDIT-C17)
BMP180_ERROR=255 was an in-band sentinel returned by uint16_t I/O helpers
(read16, readRawTemperature) on I2C failure. 255 is also a valid uint16
register reading (0x00FF appears across the calibration block and is
reachable as a raw temperature/pressure sample), so a sensor failure was
indistinguishable from a real reading.

getTemperature() additionally narrowed the uint16_t raw read to int16_t
before passing to computeB5(). Raw bit-patterns >= 0x8000 (reachable across
the BMP180 -40..+85 C operating window) flipped to negative int16_t and
sign-extended into computeB5(), producing temperature errors of order
100s of C (e.g. -347 C instead of +51 C for raw UT = 0x8000).

Fix:
  - Internal I/O helpers (read8/read16/readRawTemperature/readRawPressure)
    now return bool and pass the value through an out-param. None of the
    new sentinels collide with valid sensor output:
      * getTemperature       -> NaN          on error
      * getPressure          -> INT32_MIN    on error
      * getSeaLevelPressure  -> INT32_MIN    on error
  - getTemperature() keeps raw as uint16_t and widens value-preservingly
    via (int32_t)raw before computeB5().
  - readRawPressure() reads XLSB through the bool-out-param contract;
    previously OR'd in 0xFF on I2C fail, silently corrupting the LSB.

Verification: test_audit_c17_bmp180_sentinel_and_cast 4/4 PASS, including
datasheet UT=27898 -> 15.0 C reproduction and 64/64 finite outputs across
a full uint16 sweep (vs 32/32 collapses in the upper half under the buggy
narrowing). Full MCU regression 32/32 PASS.

Caller-side: no external code references BMP180_ERROR; main.cpp's existing
range check at the health-watchdog catches INT32_MIN via the < 30000.0
branch.
2026-04-29 18:55:48 +05:45
Jason 26f8d1fa72 fix(mcu): MCU-A4 — BKPSRAM warm-restart bypass for OCXO 180 s warmup
Every boot waited the full 180 s OCXO warmup soak — even an
IWDG/SYSRESETREQ reset that takes seconds and leaves the OCXO oven hot
lost three minutes of bringup time.

Added BKPSRAM slot 3 (magic 0xCA1C1F1E) with warmup_persist_set/check
helpers next to the existing MCU-A2/A7 BKPSRAM block. Cold-boot path
now arms the flag at the end of the full 180 s soak; subsequent boots
that find the flag still set know the OCXO oven is still hot and the
crystal is settled, so they wait 5 s and move on. Power-cycle clears
BKPSRAM and forces the full soak again — safe default, operator can't
accidentally skip the warmup by yanking and re-applying power.

Added test_mcu_a4_ocxo_warm_restart (7 cases): cold boot soaks 180 s
and sets the flag; warm reset is 5 s; 5 consecutive warm resets stay
fast; power-cycle restores the cold path; cold-after-power-cycle
re-arms the bypass; pre-fix regression confirms 10 warm restarts save
1750 s vs the old always-180-s path. MCU regression now 82/82.
2026-04-28 09:50:32 +05:45
Jason 0a49320e31 fix(mcu): MCU-A2 — site-configurable mag declination, persisted in BKPSRAM
The magnetometer yaw correction used a hardcoded -0.61 deg literal baked
in for one deployment site. Yaw_Sensor was wrong by (site_decl + 0.61)
deg at every other site whenever the UM982 dual-antenna heading was
unavailable.

Backed the value with BKPSRAM (slots 1+2 — slot 0 is the MCU-A7
emergency flag) and exposed set_mag_declination_deg / get_mag_declination_deg.
Default returns the legacy -0.61 deg when no override has been written so
the original site stays correct out of the box; a host command (or
future GPS-derived auto-calibration) writes the new site value once and
it persists across every reset path until main-power removal.

Hardened with a +/-30 deg range clamp on both write AND read paths — real
magnetic declinations are roughly +/-25 deg worldwide, so a wider value
indicates a calibration error or BKPSRAM corruption (VBAT brown-out, bit
flip) rather than a legitimate site. Defensive read-side clamp prevents
a corrupted slot from propagating a wild heading offset.

Replaced the single use site at the magnetometer yaw computation with
the getter; legacy global Mag_Declination retained and kept in sync by
the setter for any external linkage.

Added test_mcu_a2_mag_declination (10 cases): default, set/get,
persistence across reset, power-cycle clear, write-side clamp both
directions, plausible-site passthrough, defensive read-side clamp on
corruption, wrong-magic fallback, pre-fix bearing-error regression.
MCU regression now 81/81.
2026-04-28 09:45:41 +05:45
Jason 4a102e30fe fix(mcu): MCU-A6 — recovery handlers for AD9523_CLOCK and FPGA_COMM
attemptErrorRecovery() previously fell through to the default log-only
branch for both ERROR_AD9523_CLOCK and ERROR_FPGA_COMM. checkSystemHealth
keeps re-firing the same error every pass with no recovery action ever
attempted, so the system limps along until escalation kicks in.

ERROR_AD9523_CLOCK: AD9523_RESET_ASSERT, 10 ms settle, then re-run
configure_ad9523() (releases reset, selects REFB, reprograms, waits for
lock). On second failure we log and let the next health pass re-fire so
a transient brown-out on the 100 MHz reference does not drop straight
into Emergency_Stop.

ERROR_FPGA_COMM: pulse PD12 LOW->10 ms->HIGH (matches the boot reset
pattern). PA rails left untouched at runtime; brief adar_tr_x undefined
window is acceptable vs. losing the radar entirely.

Added test_mcu_a6_recovery_dispatch (11 cases) covering both new
handlers, all existing routes, the default branch, a pre-fix regression
check, and an explicit assertion that RF_PA_OVERCURRENT escalates
upstream (handleSystemError) rather than recovering inline. MCU
regression now 80/80.
2026-04-28 09:26:35 +05:45
Jason 1317a91e01 fix(mcu): MCU-A5 — gate Idq health-window during PA calibration walk
The boot-time Idq calibration walks DAC_val from 126 down toward the
1.680 A target. Mid-walk readings sit well above the 2.5 A overcurrent
threshold by design, and a channel that hits the safety_counter timeout
(50 iters) can be left above the window. Without a gate, the next
checkSystemHealth() pass would trip ERROR_RF_PA_OVERCURRENT and route
straight into Emergency_Stop, killing the system mid-bringup.

Added a `pa_calibration_in_progress` flag set TRUE around both DAC1 and
DAC2 cal walks. checkSystemHealth's Idq window short-circuits while the
flag is set; bias-fault and overcurrent thresholds remain fully active
once the walk completes, so any genuinely stuck-high channel surfaces on
the very next health pass and routes through the normal handler.

Other health checks (lock, comm, temperature, watchdog) stay live during
cal — no behavioural change to anything except the Idq window.

Added test_mcu_a5_pa_cal_gate (7 cases): mid-walk masking, post-cal
re-arming, stuck-high channel surfacing after gate clears, bias-fault
gating, PowerAmplifier=false short-circuit, and a pre-fix regression
case showing the buggy path would have tripped overcurrent mid-walk.
MCU regression now 79/79.
2026-04-28 09:21:43 +05:45
Jason f28a0eaa80 fix(mcu): MCU-A7 — persist emergency state across MCU resets in BKPSRAM
Emergency_Stop's hold loop refreshed IWDG forever, so any reset path that
DID fire (SYSRESETREQ from another fault, brown-out) would re-run
startup and re-energize the PA rails — there was no record that the
system had been in emergency state. Watchdog defeat in the hold loop
masked the problem.

BKPSRAM gives us a flag that survives every reset path but is lost on
main-power removal — exactly the recovery semantics we want:
power-cycle is the deliberate operator action that clears emergency,
every other reset stays in safe-hold.

  - Added emergency_persist_set/check helpers (BKPSRAM @ 0x40024000,
    magic 0xDEAD5A5A); enable PWR + backup-access + BKPSRAM clock.
  - Emergency_Stop now writes the flag BEFORE the rail-cut sequence so
    even an interrupted shutdown still leaves the persisted state set.
  - main() checks the flag immediately after MX_IWDG_Init and before
    any PA enable code; if set, calls Emergency_Stop directly. GPIO
    init has already forced all PA enables LOW, so the safe-hold path
    is reached without a single PA rail going hot.

Hold-loop IWDG refresh kept intentionally: a healthy hold loop does not
need to cycle the MCU, but if the loop itself wedges (stack corruption,
bus fault), refresh stops, IWDG fires, and the persist flag routes the
reset right back into safe-hold.

Added test_mcu_a7_emergency_persist (6 cases) modelling BKPSRAM
persistence vs power-cycle, including a regression check that exercises
the pre-fix "no persistence" boot to confirm it would have re-energized
the PAs. MCU regression now 78/78.
2026-04-27 19:52:13 +05:45
Jason df0b2fd469 fix(mcu): MCU-A1 — replace 25 C cooling stub with 70/60 C hysteresis
Cooling-fan trip in main.cpp's periodic temperature block was a 25 C dev
stub that latched the fan ON at room temperature on every boot. Replaced
with production thermal control: ON at 70 C, OFF at 60 C. The 10 C
dead-band prevents relay/fan chatter near the threshold; the 70 C ON
point sits below the 75 C SAFE-mode gate in checkSystemHealth() so the
fan engages before the system shuts down.

Driven from the existing `temperature` global (max of 8 sensors,
populated just above by the GAP-3 fix) instead of re-OR'ing the eight
Temperature_N variables — single source of truth, and the diag now
prints the actual peak temperature on each transition.

Added test_mcu_a1_cooling_hysteresis (9 cases) covering cold-start,
upward crossing, dead-band hold, downward crossing, and a regression
guard at 30 C that would have engaged the fan under the old stub.
MCU regression now 77/77.
2026-04-27 19:42:42 +05:45
Jason 2c34323bcb fix(mcu): MCU-N5/C4 — runRadarPulseSequence stops shadowing m/n/y globals
runRadarPulseSequence was redeclaring `int m, n, y` at function scope,
which shadowed the file-scope `uint8_t m, n, y` globals at lines
~190-192 that getStatusString reports to the GUI as
BeamPos|Azimuth|ChirpCount. The function's increments updated only the
locals, then discarded them — so telemetry was permanently frozen at
"BeamPos:1|Azimuth:1|ChirpCount:1" no matter how many beam positions
or revolutions had elapsed.

Fix: drop the three local declarations; the body already references
m/n/y by name, so removing the locals lets the writes hit the globals.
A comment documents the pitfall so the locals do not get re-added by
a future cleanup. Numeric ranges are safe (m_max=32, n_max=31,
y_max=50, all fit in uint8_t).

Test: new standalone test_bug16_runradar_shadows_globals.c reproduces
both the buggy (locals shadow globals) and fixed (globals advance)
patterns and asserts the expected post-sweep values
(g_n=16, g_m=1 wraps each iter, g_y=2 after one revolution).

MCU regression: 76/76 (was 75).
2026-04-27 13:36:28 +05:45
copilot-swe-agent[bot] df875bdf4d Merge origin/develop into feat/um982-gps-driver
Co-authored-by: JJassonn69 <83615043+JJassonn69@users.noreply.github.com>
2026-04-16 06:23:05 +00:00
3aLaee 35539ea934 fix(mcu): harden checkSystemHealth() watchdog against cold-start + stale-ts
checkSystemHealth()'s internal watchdog (pre-fix step 9) had two linked
defects that, combined with the previous commit's escalation of
ERROR_WATCHDOG_TIMEOUT to Emergency_Stop(), would false-latch AERIS-10:

  1. Cold-start false trip:
       static uint32_t last_health_check = 0;
       if (HAL_GetTick() - last_health_check > 60000) { trip; }
     On the first call, last_health_check == 0, so the subtraction
     against a seeded-zero sentinel exceeds 60 000 ms as soon as the MCU
     has been up >60 s -- normal after the ADAR1000 / AD9523 / ADF4382
     init sequence -- and the watchdog trips spuriously.

  2. Stale timestamp after early returns:
       last_health_check = HAL_GetTick();   // at END of function
     Every earlier sub-check (IMU, BMP180, GPS, PA Idq, temperature) has
     an `if (fault) return current_error;` path that skips the update.
     After ~60 s of transient faults, the next clean call compares
     against a long-stale last_health_check and trips.

With ERROR_WATCHDOG_TIMEOUT now escalating to Emergency_Stop(), either
failure mode would cut the RF rails on a perfectly healthy system.

Fix: move the watchdog check to function ENTRY. A dedicated cold-start
branch seeds the timestamp on the first call without checking. On every
subsequent call, the elapsed delta is captured first and
last_health_check is updated BEFORE any sub-check runs, so early returns
no longer leave a stale value. 32-bit tick-wrap semantics are preserved
because the subtraction remains on uint32_t.

Add test_gap3_health_watchdog_cold_start.c covering cold-start, paced
main-loop, stall detection, boundary (exactly 60 000 ms), recovery
after trip, and 32-bit HAL_GetTick() wrap -- wired into tests/Makefile
alongside the existing gap-3 safety tests.
2026-04-15 20:36:19 +02:00
Jason b0e5b298fe feat(gps): add UM982 GPS driver replacing broken TinyGPS++
Implement a complete UM982 GNSS driver (um982_gps.h/.c) with:
- NMEA parser for GGA, RMC, THS, VTG with multi-talker support (GP/GN/GL/GA/GB)
- Correct coordinate parsing using decimal-point-based degree detection
  (fixes PR #68 bug: 3-digit longitude degrees)
- Checksum verification on all incoming sentences
- Non-blocking line assembler with ring buffer
- Init sequence: UNLOG, HEADING FIXLENGTH, baseline config, NMEA enables,
  VERSIONA handshake (no SAVECONFIG to avoid NVM wear)
- Validity/age checks with configurable timeouts

Integration into main.cpp:
- Replace TinyGPSPlus with UM982_GPS_t, UART5 baud 9600->115200
- Non-blocking um982_process() in main loop (single-byte UART reads)
- GPS heading override with magnetometer fallback
- Health check using um982_position_age()

Test infrastructure:
- 49 unit tests covering checksums, coordinate parsing, all sentence types,
  talker IDs, feed/assembly, validity, init sequence, edge cases
- Mock HAL_UART_Receive with per-UART ring buffer for integration tests
- All 72 MCU tests passing (23 existing + 49 new)

Fixes all 12 bugs identified in PR #68 analysis (5 compile errors + 7 functional).
2026-04-15 17:46:21 +05:45
3aLaee 4900282042 fix(mcu-tests): strip stray literal backslash-r in Makefile continuations
The previous commit accidentally introduced the literal 2-byte sequence
'\r' at the end of two backslash-continuation lines (TESTS_STANDALONE
and the .PHONY list). GNU make on Linux treats that as text rather than
a line continuation, which orphans the following line with leading
spaces and aborts CI with:

  Makefile:68: *** missing separator (did you mean TAB instead of 8 spaces?)

Strip the extraneous 'r' so each continuation ends with a real backslash
+ LF.
2026-04-15 09:16:03 +02:00
3aLaee a2686b7424 fix(mcu): escalate overtemp and watchdog-timeout faults to Emergency_Stop()
handleSystemError() only called Emergency_Stop() for error codes in
[ERROR_RF_PA_OVERCURRENT .. ERROR_POWER_SUPPLY] (9..13). Two critical
faults were left out of the gate and fell through to attemptErrorRecovery()'s
default log-and-continue branch:

  - ERROR_TEMPERATURE_HIGH (14): raised by checkSystemHealth() when the
    hottest of 8 PA thermal sensors exceeds 75 C. Without cutting bias
    (DAC CLR) and the PA 5V0/5V5/RFPA_VDD rails, the 10 W GaN QPA2962
    stages remain biased in an overtemperature state -- a thermal-runaway
    path in AERIS-10E.

  - ERROR_WATCHDOG_TIMEOUT (16): indicates the health-check loop has
    stalled (>60 s since last pass). Transmitter state is unknown;
    relying on IWDG to reset the MCU re-runs startup and re-energises
    the PA rails rather than latching the safe state.

Fix: extend the critical-error predicate so these two codes also trigger
Emergency_Stop(). Add test_gap3_overtemp_emergency_stop.c covering all
17 SystemError_t values (must-trigger and must-not-trigger), wired into
tests/Makefile alongside the existing gap-3 safety tests.
2026-04-14 21:53:39 +02:00
Jason 666527fa7d feat: AGC phases 4-5 — STM32 outer-loop AGC class + main.cpp integration
Implements the STM32 outer-loop AGC (ADAR1000_AGC) that reads the FPGA
saturation flag on DIG_5/PD13 once per radar frame and adjusts the
ADAR1000 VGA common gain across all 16 RX channels.

Phase 4 — ADAR1000_AGC class (new files):
- ADAR1000_AGC.h/.cpp: attack/recovery/holdoff logic, per-channel
  calibration offsets, effectiveGain() with OOB safety
- test_agc_outer_loop.cpp: 13 tests covering saturation, holdoff,
  recovery, clamping, calibration, SPI spy, reset, mixed sequences

Phase 5 — main.cpp integration:
- Added #include and global outerAgc instance
- AGC update+applyGain call between runRadarPulseSequence() and
  HAL_IWDG_Refresh() in main loop

Build system & shim fixes:
- Makefile: added CXX/CXXFLAGS, C++ object rules, TESTS_WITH_CXX in
  ALL_TESTS (21 total tests)
- stm32_hal_mock.h: const uint8_t* for HAL_UART_Transmit (C++ compat),
  __NOP() macro for host builds
- shims/main.h + real main.h: FPGA_DIG5_SAT pin defines

All tests passing: MCU 21/21, GUI 92/92, cross-layer 29/29.
2026-04-13 20:14:31 +05:45
Jason f3bbf77ca1 Gap 3 Safety Architecture: IWDG watchdog, Emergency_Stop PA rail cutoff, temp max, periodic IDQ re-read, emergency state ordering + 5 tests (20/20 pass) 2026-03-19 21:58:39 +02:00
Jason c466021bb6 Fix bugs B12-B17 (PA cal loop, ADC buffer, DIAG_SECTION args, htim3 init, stale annotations) with regression tests
B12: PA IDQ calibration loop condition inverted (< 0.2 -> > 0.2) for both DAC1/DAC2
B13: DAC2 ADC buffer mismatch — reads from hadc2 now correctly stored to adc2_readings
B14: DIAG_SECTION macro call sites changed from 2-arg to 1-arg form (4 sites)
B15: htim3 definition + MX_TIM3_Init() added (PWM mode, CH2+CH3, Period=999)
B16: Removed stale NO-OP annotation on TriggerTimedSync (already fixed in Bug #3)
B17: Updated stale GPIO-only warnings to reflect TIM3 PWM implementation (Bug #5)

All 15 tests pass (11 original + 4 new for B12-B15).
2026-03-19 11:04:53 +02:00
Jason 49c9aa28ad Fix Bug #11 (platform SPI transmit-only), FPGA B2 (chirp BRAM migration), FPGA B3 (DSP48 pipelining)
Bug #11: platform_noos_stm32.c used HAL_SPI_Transmit instead of
HAL_SPI_TransmitReceive — reads returned garbage. Changed to in-place
full-duplex. Dead code (never called), fixed per audit recommendation.
Test added: test_bug11_platform_spi_transmit_only.c. Mock infrastructure
updated with SPI spy types. All 11 firmware tests pass.

FPGA B2: Migrated long_chirp_lut[0:3599] from ~700 lines of hardcoded
assignments to BRAM with (* ram_style = "block" *) attribute and
$readmemh("long_chirp_lut.mem"). Added sync-only read block for proper
BRAM inference. 1-cycle read latency introduced. short_chirp_lut left
as distributed RAM (60 entries, too small for BRAM).

FPGA B3: Added BREG (window_val_reg) and MREG (mult_i_raw/mult_q_raw)
pipeline stages to doppler_processor.v. Eliminates DPIP-1 and DPOP-2
DRC warnings. S_LOAD_FFT retimed: fft_input_valid starts at sub=2,
+1 cycle total latency. BREG primed in S_PRE_READ at no extra cost.
Both FPGA files compile clean with Icarus Verilog.
2026-03-19 10:31:16 +02:00
Jason 3b32f67087 Fix SPI bugs #9 (NULL platform_ops) and #10 (missing CS toggle), widen chip_select to uint16_t
Bug #9: Both TX and RX SPI init params had platform_ops = NULL, causing
adf4382_init() -> no_os_spi_init() to fail with -EINVAL. Fixed by setting
platform_ops = &stm32_spi_ops and passing stm32_spi_extra with correct CS
port/pin for each device.

Bug #10: stm32_spi_write_and_read() never toggled chip select. Since TX
and RX ADF4382A share SPI4, every register write hit both PLLs. Rewrote
stm32_spi.c to assert CS LOW before transfer and deassert HIGH after,
using stm32_spi_extra metadata. Backward-compatible: legacy callers
(e.g., AD9523) with cs_port=NULL skip CS management.

Also widened chip_select from uint8_t to uint16_t in no_os_spi.h since
STM32 GPIO_PIN_xx values (e.g., GPIO_PIN_14=0x4000) overflow uint8_t.

10/10 tests pass (8 original + 2 new regression tests).
2026-03-19 10:00:05 +02:00
Jason 397969348e Fix all 8 firmware bugs with regression tests
Bugs fixed in adf4382a_manager.c:
- Bug #1: Move initialized=true before sync setup, propagate sync failure
- Bug #3: Implement TriggerTimedSync with sw_sync pulse (was no-op)
- Bug #5: Replace GPIO-only placeholder with TIM3 PWM for DELADJ
- Bug #7: Correct GPIOG pin definitions to match CubeMX (pins 6-15)

Bugs fixed in main.cpp:
- Bug #2: Remove pre-reset ad9523_setup() call (keep only post-reset)
- Bug #4: Move init error check before phase shift calls
- Bug #6: Fix timer variable (last_check -> last_check1) in temp block
- Bug #8: Uncomment uart_print/uart_println debug helpers

Test harness updates:
- All 8 tests rewritten to assert correct post-fix behavior
- Added TIM PWM mock (SPY_TIM_PWM_START/STOP/SET_COMPARE)
- Added mock_adf4382_set_timed_sync_retval for failure injection
- Updated shims and Makefile for new test dependencies
- All 8 tests pass: make clean && make test -> 8/8 passed
2026-03-19 09:42:59 +02:00
Jason 28a66889ad Add MCU firmware test harness with 8 bug-confirming tests
Complete test infrastructure for the observe-before-fix methodology:

- stm32_hal_mock: HAL stub types + spy/recording ring buffer (512 entries)
- ad_driver_mock: ADF4382/AD9523 mock drivers with configurable returns
- 9 shim headers redirecting real #includes to mock types
- Makefile with individual (test_bug1..8) and aggregate (test) targets

All 8 tests pass, confirming:
  #1 Timed sync init ordering (SetupTimedSync before initialized=true)
  #2 AD9523 double setup (first call before reset release)
  #3 TriggerTimedSync no-op (prints messages, no HW action)
  #4 Phase shift before init error check
  #5 SetFinePhaseShift GPIO-only placeholder (no PWM)
  #6 Timer variable collision (last_check vs last_check1)
  #7 GPIO pin mapping conflict (manager.h vs CubeMX main.h)
  #8 uart_print/uart_println commented out
2026-03-19 09:28:19 +02:00