In the last post, I uncovered a bug in the Vivado implementation which accidently removes the DIFF_TERM from my input buffer. With that problem solved, I picked up the project again with a goal to achieve high speed imaging. Now I’m going to cover the design principal and its intermediate steps to achieve it.
End Result – My customized VDMA IP highlighted
The end goal is to have an unified, resource-efficient and high-performance VDMA to receive, decode, and transfer data to an external memory. In this case, it will be the DDR3 on PS side. The screenshot above looks very satisfactory and concise. But in reality, any error in along its data path will complicates the entire debugging process. Thus I decided to tackle them one module at a time from the upstream, only combining them in the final run.
The initial step is to make sure all receiving banks are decoding the SOL/EOL sync code properly. The stock ILA IP could be attached to the immediate downstream of the receiving module to verify their function. The ILA requires a free running clock. In this case, I had to let the sensor run in high power and continuous mode to feed a constant clock into FPGA.
Intermediate Design: Receiver feeding multiple external AXI-S Data FIFOs
After verifying that all 4 banks are functioning, the next step is to receive the data with integrity. Due to inter-bank skew, each pixel from a different bank could arrive at a slightly different clock period. And the source synchronizing clocks are not aligned among banks. Asynchronous FIFO is the answer. Each bank will feed a FIFO at its own source clock. But all 4 banks are read out under a single internal clock when all FIFOs have available data. The AXI-Stream Infrastructure IPs contains a Data FIFO for AXI4-S protocol. Its TReady on the slave port will not go down before FIFO is full. To read off from all 4 FIFOs in an aligned manner, all data must be valid.
assign Ready = (Data_valid_0_in & Data_valid_1_in) & (Data_valid_2_in & Data_valid_3_in);
The above method was tested under 2015.2. It will no longer work in the most recent version for some unknown reasons. This drives me to use the FIFO_DUAL_CLOCK_MACRO provided by Xilinx. It is essentially a BRAM with built in hardware asynchronous FIFO support on 7 Series devices. The read enable signal must be de-asserted in the same cycle when any of their Empty signal rises.
assign RDEN_all = !EMPTY_0 & !EMPTY_1 & !EMPTY_2 & !EMPTY_3;
Also, the FIFO must be reset for 5 period on both reading and writing clock before use. Thus I implemented a resetting handshake mechanism to enable reset as soon as the source clock starts running. This will give plenty time before actual data arrives. The downstream goes through an AXI-S Data FIFO buffer before the AXI-DMA.
So far so good, I’ve got a complete image. But many areas should be improved. First of, the pixels are not organized. Each transfer of 64 bits are 2 pixels from even row and 2 from odd rows. This column-wise de-interlacing simply costs too much for an ARM CPU to do. Secondly, at least a quarter of the bandwidth is wasted at 12bit ADC or lower. Under the 150MHz timing constrain of AXI-DMA. this severely hinders high frame rate operation. Thus I decided to use an interleave transfer mechanism coupled with bit packing. To use interleave transfer, address generation and switching is essential to jump between even and odd rows. As for bit packing, it’s a common trick used in many uncompressed RAW formats (DNG, Nikon NEF). In this instance, I will use 3 bytes 2 pixel packing at 10/12bit ADC.
The most complicated part is 4K boundary check required by AXI protocol. Consider the following code block:
If (Target_addr[11:3] + Burst_count > 4096)
m_axi_s2mm_awlen <= ((Target_addr | 12’hFF8) – Target_addr) >> 3;
else m_axi_s2mm_awlen <= Burst_count – 1;
This is doomed to fail at high frequency due to large number subtraction. The critical path length will be doubled given the bit width of the target address. The above step can be split into pipeline fashion. First checking if condition and set a flag FDRE. In the next cycle, use the flag to direct calculation. The pipeline latency is high in my implementation, totaling at 5 clock cycle assuming constant AWReady state. However, considering that the address can be generated in advanced while the last data burst is in transfer, the actual latency can be ignored.
Eventually, the design is successfully constrained under 200MHz AXI-HP clock. After hardware validation, the actual bandwidth for a 4K@60FPS video will be a whopping of 1.2GB/s given 1.6GB/s available.
Implementation Map: AXI-mem-interconnect; AXI-GP-interconnect; Peripheral Resets
Receiver and Sync Detector; Interleave FIFO Control; S2MM Channel Core; AXI-GP Slave Control Registers
The synthesis and implementation run is only 3 minutes combined, given only 5 core Verilog files within my VDMA IP. The rest is just 2 AXI-interconnects and peripheral resets. The resource utilization is only 11% for LUT and less for FDREs in a 7010 device, leaving much space for future use.
Test pattern looks perfect! In my next post, I’ll showcase some actual images.