perf(fpga): move CIC comb stages to fabric — 80→70 DSPs (-10)

Strip the explicit DSP48E1 instance from comb stage 0 and the (* use_dsp = "yes" *) attribute from comb stages 1-4. The combs are gated by data_valid_comb_pipe (fires once every 4 clk_400m cycles post-decimation), so a multicycle path of 4 -setup / 3 -hold, scoped to the comb registers in xc7a50t_ftg256.xdc, gives STA a 10 ns setup budget (4 × 2.5 ns) instead of 2.5 ns. That is enough for fabric carry chains to close the 28-bit subtracts comfortably.

Pipeline depth and bit-widths are unchanged: the new fabric model mirrors the prior CREG+AREG+BREG+PREG structure exactly, so data_valid_comb_0_out alignment is preserved and downstream stages 1-4 see bit-identical samples. The CIC behavioral simulation model now lives outside the SIMULATION ifdef branch and is used unconditionally, since there is no longer a synthesis-only DSP48E1 for it to replace.

50T post-impl results (Vivado 2025.2):
  DSPs:          80 → 70 / 120 (66.7% → 58.3%, freed 10)
  LUTs:          22114 / 32600 (67.8%)
  BRAM:          55.5 / 75 (74.0%, unchanged)
  adc_dco_p WNS: +0.022 ns → +0.906 ns (margin improved)
  All clocks meet timing, 0 failing endpoints.

Local regression: 32/34 PASS, same as baseline. The two failures (Receiver Integration, Matched Filter Chain) are the pre-existing RX-NEW-3 (FFT throughput) issue and are unaffected by this change. Bit-exact through the DDC chain (NCO→CIC→FIR), and MF cosim verified.

Cumulative DSP savings today: 112 → 70 (freed 42). That leaves 50 of 120 DSPs free, enough headroom for the Xilinx LogiCORE FFT Pipelined Streaming swap (~33 DSPs for the 3-instance matched-filter chain) with 17 DSPs to spare.
2026-06-09 15:07:14 +00:00 · 2026-04-23 11:32:03 +05:45
parent 0b2f75620e
commit cc6691dec9
2 changed files with 76 additions and 139 deletions
@@ -457,6 +457,33 @@ set_false_path -from [get_cells -hierarchical -filter {NAME =~ *reset_sync*_reg*
 set_false_path -from [get_clocks clk_100m] -to [get_clocks adc_dco_p]
 set_false_path -from [get_clocks adc_dco_p] -to [get_clocks clk_100m]

+# --------------------------------------------------------------------------
+# CIC comb stages — multicycle path (4-cycle setup / 3-cycle hold)
+# --------------------------------------------------------------------------
+# Comb registers (cic_*/comb_reg[*], cic_*/comb_delay_reg[*][*],
+# cic_*/comb_0_c_reg, cic_*/comb_0_ab_reg, cic_*/comb_0_p_reg) are clocked at
+# adc_dco_p (400 MHz) but their CE pins are driven by data_valid_comb_pipe /
+# data_valid_comb_0_out, which fire once every 4 cycles after the 4× decimator.
+# Effective throughput is 100 MHz, so STA can budget 4·2.5 ns = 10 ns for
+# setup instead of 2.5 ns. This frees the DSP48E1s these stages previously
+# occupied (5 per channel × 2 channels = 10 DSPs) and lets fabric carry-chain
+# subtracts close timing comfortably. See cic_decimator_4x_enhanced.v header
+# comment on the comb array declaration.
+set_multicycle_path 4 -setup \
+  -from [get_cells -hierarchical -filter {NAME =~ *cic_*/comb_*reg*}] \
+  -to   [get_cells -hierarchical -filter {NAME =~ *cic_*/comb_*reg*}]
+set_multicycle_path 3 -hold \
+  -from [get_cells -hierarchical -filter {NAME =~ *cic_*/comb_*reg*}] \
+  -to   [get_cells -hierarchical -filter {NAME =~ *cic_*/comb_*reg*}]
+# Also relax the launch path from integrator_sampled_comb (fed by integrator_4
+# DSP48E1 at decimated rate) into comb_0_c_reg.
+set_multicycle_path 4 -setup \
+  -from [get_cells -hierarchical -filter {NAME =~ *cic_*/integrator_sampled_comb_reg*}] \
+  -to   [get_cells -hierarchical -filter {NAME =~ *cic_*/comb_*reg*}]
+set_multicycle_path 3 -hold \
+  -from [get_cells -hierarchical -filter {NAME =~ *cic_*/integrator_sampled_comb_reg*}] \
+  -to   [get_cells -hierarchical -filter {NAME =~ *cic_*/comb_*reg*}]
+
 # clk_100m ↔ clk_120m_dac: CDC via synchronizers in radar_system_top
 set_false_path -from [get_clocks clk_100m] -to [get_clocks clk_120m_dac]
 set_false_path -from [get_clocks clk_120m_dac] -to [get_clocks clk_100m]
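
A multicycle exception is only safe if every register the wildcard matches is really gated by the decimated clock enable, and it has no effect at all if the wildcard matches no cells. The following is a minimal sketch of a post-route sanity check for the constraints in this hunk, assuming the implemented design is open in the Vivado Tcl console. The variable names and the path count are illustrative and not part of this commit.

```tcl
# Illustrative check only; not part of the committed XDC.
# Reuse the exact filters from the constraints so the check sees the same cells.
set comb_regs [get_cells -hierarchical -filter {NAME =~ *cic_*/comb_*reg*}]
set src_regs  [get_cells -hierarchical -filter {NAME =~ *cic_*/integrator_sampled_comb_reg*}]

# An empty match means the multicycle constraints apply to nothing.
puts "comb registers matched:       [llength $comb_regs]"
puts "integrator sample regs found: [llength $src_regs]"

# Worst setup paths between comb registers. With the 4-cycle exception applied,
# the reported requirement should be 10 ns rather than 2.5 ns.
report_timing -from $comb_regs -to $comb_regs -delay_type max -max_paths 5

# Worst hold paths on the same endpoints, to confirm the 3-cycle hold
# adjustment moved the hold check back to the launch edge.
report_timing -from $comb_regs -to $comb_regs -delay_type min -max_paths 5

# List any timing exceptions the tool ignored or overrode.
report_exceptions -ignored
```

If the match list includes a register that is not enabled by data_valid_comb_pipe or data_valid_comb_0_out, the filter pattern would need narrowing before the relaxed timing can be trusted.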