perf(fpga): move CIC comb stages to fabric — 80→70 DSPs (-10)

Strip the explicit DSP48E1 instance from comb stage 0 and the
(* use_dsp = "yes" *) attribute from comb stages 1-4. The combs are
gated by data_valid_comb_pipe (fires once every 4 clk_400m cycles
post-decimation), so a multicycle path of 4 -setup / 3 -hold scoped
to the comb registers in xc7a50t_ftg256.xdc gives STA a 10 ns setup
budget (4 x 2.5 ns) instead of 2.5 ns, enough for fabric carry chains
to close the 28-bit subtracts comfortably.
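
The budget arithmetic behind the multicycle constraint can be checked
with a short sketch. It assumes only the 400 MHz clock and the 4-cycle
clock-enable period stated above; the function and variable names are
illustrative and do not come from the repository.

```python
from fractions import Fraction

CLK_MHZ = 400      # adc_dco_p / clk_400m rate stated in this commit
CE_PERIOD = 4      # comb clock enable fires once every 4 cycles


def multicycle_budget(clk_mhz, setup_mult):
    """Return (clock period in ns, setup budget in ns, matching hold multiplier)."""
    period_ns = Fraction(1000, clk_mhz)
    setup_budget_ns = setup_mult * period_ns
    # For a same-clock path, -setup N moves the capture edge N cycles out and
    # drags the hold check with it; -hold N-1 moves the hold check back to the
    # launch edge so hold is analysed as in the single-cycle case.
    hold_mult = setup_mult - 1
    return period_ns, setup_budget_ns, hold_mult


period, budget, hold = multicycle_budget(CLK_MHZ, CE_PERIOD)
print(f"period={float(period)} ns, setup budget={float(budget)} ns, "
      f"hold multiplier={hold}")
```

The 4 -setup / 3 -hold pair is valid only because every register on
these paths is gated by the same once-in-4-cycles enable.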

Pipeline depth and bit-widths unchanged: the new fabric model mirrors
the prior CREG+AREG+BREG+PREG structure exactly, so data_valid_comb_0_out
alignment and downstream stages 1-4 see bit-identical samples. The CIC
behavioral model now lives outside the SIMULATION ifdef branch and is
used unconditionally, since there is no longer a synthesis-only
DSP48E1 for it to replace.
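
To illustrate why a clock-enable-gated register pipeline can stay
bit-identical to a plain comb, here is a minimal Python model. It is a
sketch under assumptions: the two-register structure (one input
register, one output register) is hypothetical and simpler than the
actual CREG+AREG+BREG+PREG mirror in cic_decimator_4x_enhanced.v, which
is not shown in this commit excerpt. Only the 28-bit width and the
once-in-4-cycles enable come from the text above.

```python
WIDTH = 28
MASK = (1 << WIDTH) - 1


def wrap(v):
    """Reduce v to WIDTH-bit two's complement, as a fabric subtractor would."""
    v &= MASK
    return v - (1 << WIDTH) if v & (1 << (WIDTH - 1)) else v


def comb_reference(samples):
    """Ideal comb: y[n] = x[n] - x[n-1], with x[-1] = 0."""
    out, prev = [], 0
    for x in samples:
        out.append(wrap(x - prev))
        prev = x
    return out


class GatedComb:
    """Hypothetical comb stage whose registers advance only when ce is high."""

    def __init__(self):
        self.in_reg = 0    # input-side register (assumed)
        self.delay = 0     # comb delay element holding the previous sample
        self.out_reg = 0   # output register (assumed)

    def clock(self, x, ce):
        if ce:
            # Every register updates from the values held before this edge.
            self.out_reg, self.delay, self.in_reg = (
                wrap(self.in_reg - self.delay), self.in_reg, x)
        return self.out_reg


def run_gated(samples, ce_period=4):
    """Clock the stage ce_period times per sample, enabling it on phase 0 only."""
    dut, seen = GatedComb(), []
    for x in samples:
        for phase in range(ce_period):
            ce = (phase == 0)
            y = dut.clock(x, ce)
            if ce:
                seen.append(y)
    return seen


# Includes values that overflow 28 bits to exercise the wraparound.
samples = [0, 5, -3, (1 << 27) - 1, -(1 << 27), 7]
# The gated pipeline matches the ideal comb, delayed by one enabled cycle.
assert run_gated(samples)[1:] == comb_reference(samples)[:-1]
```

The disabled cycles never change any register, so the extra three
clk_400m edges per sample affect timing analysis only, not the data.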

50T post-impl results (Vivado 2025.2):
  DSPs:         80 → 70 / 120 (66.7% → 58.3%, freed 10)
  LUTs:         22114 / 32600 (67.8%)
  BRAM:         55.5 / 75 (74.0%, unchanged)
  adc_dco_p WNS: +0.022 ns → +0.906 ns (margin improved)
  All clocks meet timing, 0 failing endpoints.

Local regression: 32/34 PASS, same as baseline. The two failures
(Receiver Integration, Matched Filter Chain) are the pre-existing
RX-NEW-3 (FFT throughput) issue and are unaffected by this change.
Bit-exact output through the DDC chain (NCO→CIC→FIR) and MF cosim
verified.

Cumulative DSP savings today: 112 → 70 (freed 42). That leaves 50 of
120 DSPs free, enough headroom for the Xilinx LogiCORE FFT Pipelined
Streaming swap (~33 DSPs for the 3-instance matched-filter chain) with
~17 DSPs to spare.
Jason
2026-04-23 11:32:03 +05:45
parent 0b2f75620e
commit cc6691dec9
2 changed files with 76 additions and 139 deletions
@@ -457,6 +457,33 @@ set_false_path -from [get_cells -hierarchical -filter {NAME =~ *reset_sync*_reg*
set_false_path -from [get_clocks clk_100m] -to [get_clocks adc_dco_p]
set_false_path -from [get_clocks adc_dco_p] -to [get_clocks clk_100m]
# --------------------------------------------------------------------------
# CIC comb stages — multicycle path (4-cycle setup / 3-cycle hold)
# --------------------------------------------------------------------------
# Comb registers (cic_*/comb_reg[*], cic_*/comb_delay_reg[*][*],
# cic_*/comb_0_c_reg, cic_*/comb_0_ab_reg, cic_*/comb_0_p_reg) are clocked at
# adc_dco_p (400 MHz) but their CE pins are driven by data_valid_comb_pipe /
# data_valid_comb_0_out, which fire once every 4 cycles after the 4× decimator.
# Effective throughput is 100 MHz, so STA can budget 4·2.5 ns = 10 ns for setup
# instead of 2.5 ns. This frees the DSP48E1s these stages previously
# occupied (5 per channel × 2 channels = 10 DSPs) and lets fabric carry-chain
# subtracts close timing comfortably. See cic_decimator_4x_enhanced.v header
# comment on the comb array declaration.
set_multicycle_path 4 -setup \
-from [get_cells -hierarchical -filter {NAME =~ *cic_*/comb_*reg*}] \
-to [get_cells -hierarchical -filter {NAME =~ *cic_*/comb_*reg*}]
set_multicycle_path 3 -hold \
-from [get_cells -hierarchical -filter {NAME =~ *cic_*/comb_*reg*}] \
-to [get_cells -hierarchical -filter {NAME =~ *cic_*/comb_*reg*}]
# Also relax the launch path from integrator_sampled_comb (fed by integrator_4
# DSP48E1 at decimated rate) into comb_0_c_reg.
set_multicycle_path 4 -setup \
-from [get_cells -hierarchical -filter {NAME =~ *cic_*/integrator_sampled_comb_reg*}] \
-to [get_cells -hierarchical -filter {NAME =~ *cic_*/comb_*reg*}]
set_multicycle_path 3 -hold \
-from [get_cells -hierarchical -filter {NAME =~ *cic_*/integrator_sampled_comb_reg*}] \
-to [get_cells -hierarchical -filter {NAME =~ *cic_*/comb_*reg*}]
# clk_100m ↔ clk_120m_dac: CDC via synchronizers in radar_system_top
set_false_path -from [get_clocks clk_100m] -to [get_clocks clk_120m_dac]
set_false_path -from [get_clocks clk_120m_dac] -to [get_clocks clk_100m]