Cheaper yet powerful camera solutions

It’s been a while since my last blog post. During the past year, I’ve built a few other cameras that have not yet appeared on this blog. In the meantime, I have been looking into options to turn this work into a viable product for fellow amateur astronomers. One major blocker is cost. FPGAs are expensive devices for two reasons: 1. They are produced in far smaller volumes than ASICs, yet still use state-of-the-art silicon processes. 2. A massive die area is dedicated to routing and configuration logic. Consider a simple comparison: the MicroZed board I’m using costs $200 and carries a dual-core Cortex-A9 clocked at 666 MHz. Contrast that with the quad-core Raspberry Pi 3B, which clocks at twice the frequency and costs only $30.

However, these single-board-computer SoCs are not free of challenges. Most scientific CMOS sensors do not output data over a standard MIPI CSI-2 interface, so an FPGA fabric is still required for conversion. Beyond that, we need a SoC whose CSI-2 interfaces support a high enough total bandwidth. Factoring in functionality, it would be preferable to enable edge computing/storage and internet hosting in a single solution. In the end, I concluded the next generation should offer the following connectivity:

1. 1000Base-T Ethernet and built-in WiFi support

2. USB 3.0 on a Type-C connector

3. Fast storage via a PCIe NVMe SSD

Besides these, the device should be reasonably open, with a Technical Reference Manual (TRM) and driver source code available for its various IP blocks. The Raspberry Pi clearly drops out due to its limited CSI-2 bandwidth and absence of fast I/O. After a lengthy and careful comparison, I landed on the Rockchip RK3399. It has dual CSI-2 receivers providing 1.5 GB/s of total bandwidth, plus six powerful A72/A53 cores running above 1.5 GHz for any processing. Among all RK3399 dev kits, the FriendlyARM NanoPC-T4 is the most compact. The board also has its I/O interfaces aligned along one edge, which makes case design straightforward. And it is vastly cheaper than a Zynq MPSoC with similar I/O connectivity.

NanoPC T4

Two MIPI CSI-2 connectors on the right

What remains is to provide a cheap FPGA bridge between the sensor and the CSI-2 interface. The difficult part is, of course, the 1.5 Gb/s-per-lane MIPI CSI-2 transmitter. On the datasheet, the 7 Series HR bank OSERDES is rated at 1250 Mb/s. But like any other chip vendor, Xilinx derates the I/O with a conservative margin. It has been shown before that these I/Os can be toggled safely at 1.5 Gb/s for 1080p60 HDMI operation. Still, that is TMDS33, with a much larger swing than the LVDS/SLVS used by the MIPI D-PHY. To test this out, I put a compatible connector on the last carrier card design using spare I/Os. Because D-PHY is a mixed I/O standard running on the same wires, only the latest UltraScale+ devices support it natively. To combine low-power single-ended LVCMOS12 with high-speed differential SLVS on a cheap 7 Series FPGA, we must add an external resistor network per Figure 10 in Xilinx XAPP894.

PCB resistor network with some rework

It is possible, though, to merge all the LP positive lines and all the LP negative lines respectively to save some I/O, provided we only use high-speed differential signaling. Tying the LP lines together means they toggle all four lanes into HS mode simultaneously. The resistor divider ratio also had to be changed, because the same HR bank is shared with LVDS25 signals from the CMOS sensor.

To produce an image, I wrote a test pattern generator that outputs a simple ramp, incrementing the pixel value along each line; the starting value increases by four on every frame. Timing closure was achieved at 190 MHz for the AXI stream, which prevents FIFO underrun at 1.5 Gb/s across four lanes. I then used the stock OV13850 camera as the device to mimic. A simple bare-metal application runs on the PS7: it listens for I2C command interrupts, configures the MMCM clocking, sets the image size and blanking, and enables the core.
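As a reference point, here is roughly how I’d model that pattern in Python (a sketch only: the frame geometry and the 12-bit wrap-around are assumptions of mine, and the real generator is RTL):

import numpy as np

def tpg_frame(frame_idx, width=1920, height=1080, bits=12):
    # Ramp along each line, wrapping at the ADC bit depth; the starting
    # value advances by four on every frame, producing the rolling effect.
    start = (4 * frame_idx) % (1 << bits)
    line = (start + np.arange(width)) % (1 << bits)
    return np.tile(line, (height, 1)).astype(np.uint16)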

Finally, some non-trivial changes were needed on the RK3399 side to receive the stream correctly. After a lengthy driver code review, I found two places requiring changes. First, the lane frequency setting in the driver: this eventually populates a V4L2 struct that affects the HS settle timing between the LP and HS transitions. Second, the device tree contains an entry for the number of lanes used by this sensor.

MicroZed stacked on top of the NanoPC-T4. The jumper cables carry I2C.

There’s a mode that disables all ISP functions and returns RAW data. This proved extremely helpful for verifying data integrity. In the end, we won’t need the ISP for astronomical imaging anyway.
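As a rough illustration, checking a captured frame against the rolling-ramp TPG can be done in a few numpy lines (this assumes the ramp model sketched above, including its 12-bit wrap):

import numpy as np

def check_ramp(frame, bits=12):
    # Every pixel should exceed its left neighbor by exactly one,
    # modulo the wrap-around at the ADC bit depth.
    diff = (frame[:, 1:].astype(np.int32) - frame[:, :-1]) % (1 << bits)
    return bool((diff == 1).all())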

The low-power toggle plus HS settle timing costs 21% overhead

The rolling ramp TPG wrapping around, shown via the HDMI screen preview

This work paves the way for the full-fledged adapter board now in progress. Stay tuned for more information soon!


CMOS Camera – P7: Streaming Lossless RAW Compression

Now this post covers some serious stuff involving video compression. Early this year I decided to build a lossless compression IP core for my camera, in case I one day use it for video. And because it’s for video, the compression has to be stream-operable in real time. That means you cannot stash frames in DDR RAM and do random lookups during compression. JPEG needs to buffer at least 8 rows, as it compresses 8×8 blocks. More complex algorithms such as H.264 require even larger transient memory for inter-frame lookup. Most of these lossy compression cores consume a lot of logic resources, which my cheap Zynq 7010 doesn’t have, or fall short on performance when squeezed into a small device. Besides, I would prefer a lossless video stream over a lossy one.

There’s an algorithm every RAW image format uses but that is rarely implemented in common image formats. NEF, CR2, DNG, you name it. It’s Lossless JPEG, defined in 1993. The process is very simple: use the neighboring pixels’ intensity to predict the current pixel you’d like to encode. In other words, record the difference instead of the full intensity. It’s simple yet powerful (up to 50% size reduction), because most of our images are continuous-tone and lack high spatial frequency detail. This method is called differential pulse-code modulation (DPCM). A Huffman code is then attached in front to record the number of significant bits in the difference.
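To make the idea concrete, here is a toy Python model of the encoder front end. It uses the simplest predictor (the pixel to the left); real Lossless JPEG defines seven predictors, handles the first pixel specially, and Huffman-codes the category, all of which this sketch glosses over:

def ssss(diff):
    # Bit category ("SSSS" in the JPEG spec): bits needed for |diff|.
    return abs(diff).bit_length()

def dpcm_encode(line):
    # Record (category, difference) pairs; the category gets a Huffman
    # code in the bit stream, followed by the difference bits themselves.
    prev, symbols = 0, []
    for px in line:
        diff = px - prev
        prev = px
        symbols.append((ssss(diff), diff))
    return symbols

# dpcm_encode([100, 101, 103, 100]) -> [(7, 100), (1, 1), (2, 2), (2, -3)]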

The block design

Sounds easy, huh? But once I decided to make it parallel and high speed, the implementation became very hard. All the later bits have to be shifted correctly to form a contiguous bit stream, and timing becomes a real concern as the number of potential bits grows with the degree of parallelism. So I split the process into 8 pipeline stages running in lock step. Six pixels are processed simultaneously on each clock cycle. At 12 bits, the worst possible length is 144 bits: up to 12 bits of Huffman code plus 12 bits of differential code per pixel. The result has to enter a 64-bit bus, concatenated with the leftover bits from the previous clock cycle. A double buffer is inserted between the concatenator and the compressor, and FIFOs are employed upstream and downstream of the compression core to relieve pressure on the AXI data bus.
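Functionally, the concatenator boils down to the following Python model (the 8-stage pipelining and the double buffering are ignored, and MSB-first packing is an assumption of mine):

def pack_words(chunks, width=64):
    # Concatenate variable-length (value, nbits) chunks into fixed-width
    # words, carrying the leftover bits over to the next cycle.
    acc, nacc, words = 0, 0, []
    for value, nbits in chunks:
        acc = (acc << nbits) | (value & ((1 << nbits) - 1))
        nacc += nbits
        while nacc >= width:
            nacc -= width
            words.append((acc >> nacc) & ((1 << width) - 1))
    return words, acc & ((1 << nacc) - 1), nacc  # full words + leftover bits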

Now resource usage is high!

By optimizing some control RTL, the core now closes timing comfortably at 200 MHz. Theoretically it could process at an insane rate of 1.2 GPixel/s (200 MHz × 6 pixels per cycle), although this sensor could not sustain that with four LVDS banks. Properly modified, it could suit other sensors that do block-based parallel readout. As for resource usage, many of the LUTs cannot be placed in the same slice as the flip-flops. Splitting the bit shifter into more pipeline stages would definitely reduce LUT usage and improve timing, but the FF count would shoot up to match the number of LUTs, so the overall slice usage would probably end up identical.

During the test, I used the Zynq core to set up the Huffman lookup table. The tree can be modified between frames, so near-optimal compression can be realized for the current scene using a statistics block I built in.
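For illustration, a set of optimal code lengths could be derived from the per-frame category histogram along these lines (a sketch; the actual statistics block and the core’s table format are not shown):

import heapq, itertools

def huffman_lengths(freq):
    # freq: {category: count}. Returns a code length per category, from
    # which a canonical Huffman table can be written to the lookup RAM.
    tie = itertools.count()  # tie-breaker so the heap never compares dicts
    heap = [(f, next(tie), {sym: 0}) for sym, f in freq.items() if f]
    while len(heap) > 1:
        fa, _, a = heapq.heappop(heap)
        fb, _, b = heapq.heappop(heap)
        merged = {s: d + 1 for s, d in {**a, **b}.items()}
        heapq.heappush(heap, (fa + fb, next(tie), merged))
    return heap[0][2]

# huffman_lengths({0: 10, 1: 40, 2: 25, 3: 15}) -> {1: 1, 2: 2, 0: 3, 3: 3}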

So far I have verified that the bit stream decompresses correctly through DNG/dcraw/libraw converters. The only additions are a file header and a stuffing zero byte after each 0xFF, in compliance with the JPEG stream format.
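The stuffing itself is trivial; in Python it is no more than the following (the file header layout is omitted here):

def stuff(scan: bytes) -> bytes:
    # JPEG entropy-coded data must not contain a bare 0xFF byte:
    # each one is followed by a 0x00 stuffing byte.
    return scan.replace(b"\xff", b"\xff\x00")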

2017/10/31

CMOS Camera – P6: First Light

In July I finally received the UV/IR cut filter for this camera. I designed a simple filter rack and 3D printed it; the whole thing now fits together nicely in front of the sensor. The IR cut is necessary because a huge proportion of light pollution falls in the near-infrared spectrum.

Filter rack

UV/IR cut filter taped to the plastic rack.

With all the hardware in place, I added a single-trigger exposure mode to the camera firmware, along with a corresponding protocol command to issue a shutter release from the PC software.

70SA

The camera is then attached to a SkyRover 70SA astrograph. In the camera angle adjuster sits a 12nm-bandwidth Ha filter, which lets me reject light pollution while imaging in front of my house. Focusing through the Ha filter is extremely difficult: I chose a bright star and pushed the exposure time to maximum during liveview. Finally, before the battery pack (supplying both the AZ-EQ6 mount and my camera) went dry, I managed to obtain 15 frames of 5 minutes each.

NGC7000

No dark frames were used for this first-light image, and the guiding performance was exceptional. Ironically, that combination foils the kappa-sigma algorithm for hot pixel removal: without dithering, every hot pixel lands on the same spot in all frames and never looks like an outlier, which leaves the background noisy. Still, NGC7000 already shows rich details!
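For reference, here is a minimal numpy sketch of kappa-sigma stacking. Note that it can only reject an outlier that moves from frame to frame, which is exactly why undithered hot pixels survive it (kappa and the iteration count are arbitrary choices of mine):

import numpy as np

def kappa_sigma_stack(frames, kappa=2.5, iters=3):
    # Iteratively mask pixels farther than kappa*sigma from the per-pixel
    # mean across the stack, then average whatever survives.
    data = np.ma.MaskedArray(np.stack(frames).astype(np.float64))
    for _ in range(iters):
        mean, sigma = data.mean(axis=0), data.std(axis=0)
        data = np.ma.masked_where(np.abs(data - mean) > kappa * sigma, data)
    return data.mean(axis=0).filled(np.nan)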

Remarks

1. This sensor has a higher dark current than Sony CMOS sensors, somewhere above 4-fold at the same temperature. However, its doubling temperature is small; in other words, its dark current drops quickly with cooling. Last time I observed no dark noise at –15C (see the sketch after these remarks). Thus imaging the Horsehead during winter would be brilliant here in Michigan!

2. Power issues. The sensor consumes ~110 mA @ 5 V during long integration, versus ~400 mA for continuous readout, which is minimal. However, the Zynq SoC plus Ethernet PHY consumes far more than a fully running CMOS sensor, so several power saving techniques could be employed: CPU throttling during long integration/standby, powering down the fabric in standby mode, moving the bulk of the RTOS into OCM instead of DDR, etc. But many of these require substantial work.
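To put the doubling-temperature remark in numbers, a tiny sketch (the exact 5°C doubling temperature and the 20°C reference point are assumptions for illustration):

def dark_rate(rate_ref, temp, temp_ref=20.0, doubling=5.0):
    # Dark current roughly doubles every `doubling` degrees C:
    # D(T) = D(T_ref) * 2 ** ((T - T_ref) / doubling)
    return rate_ref * 2 ** ((temp - temp_ref) / doubling)

# Cooling from 20C to -15C spans 7 doubling steps, i.e. a 128x reduction:
print(dark_rate(1.0, -15))  # ~0.0078 of the room-temperature rate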

 

Anyway, I’m going to use this camera during the solar eclipse here in the USA!

CMOS Camera – P5: Ethernet Liveview

To make camera control easier, I spent the last several weeks building a control scheme over Ethernet. The camera acts as a server, with LWIP tasks running on the FreeRTOS operating system; the client is my computer, on any OS platform. The only thing connecting the two is a 1G Ethernet cable. To speed things up, the client demo program is written in Python 3.


Client application based on Tkinter

Once the RTOS boots up, a core task sets up the network and instantiates a listening port. On the client side, all control commands are sent over TCP once the connection is established. At the application layer there’s really not much protocol going on: I decode each command as a magic code followed by the actual command id. Four commands are established so far:

1. Send Setting

2. Start Capture (RTOS will create the CMOS run task)

3. Halt Capture

4. Send Image

Once the TCP handshake is done, the client can send commands 1 and 2 to begin video capture with the specified settings. During capture, command 4 retrieves the latest image to decode and display on the GUI. The camera settings include exposure time and gain, frame definition and on-chip binning, shutter mode and ADC depth, as well as many other readout-related registers.
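On the wire, a client command might look like the sketch below; the magic value, address, port number, and field widths are placeholders of mine, not the camera’s real protocol constants:

import socket, struct

MAGIC = 0xCAFE  # assumed 16-bit magic; the real value differs

def send_command(sock, cmd_id, payload=b""):
    # Command ids 1..4: send setting / start capture / halt / send image.
    sock.sendall(struct.pack(">HH", MAGIC, cmd_id) + payload)

with socket.create_connection(("192.168.1.10", 5001)) as s:  # assumed address
    send_command(s, 2)  # Start Capture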

The images are transferred as linear RAW data, so numpy functions become very helpful for implementing level control and post-readout binning. The RAW images can also be written to disk as RAW video, given fast enough I/O.
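As a sketch, the level control plus software binning amounts to a few numpy lines (the black/white points and the 8-bit preview output are illustrative choices):

import numpy as np

def preview(raw, black, white, bin_factor=2):
    # Linear RAW makes level control a pure linear map and software
    # binning a simple block average.
    h = raw.shape[0] - raw.shape[0] % bin_factor
    w = raw.shape[1] - raw.shape[1] % bin_factor
    img = raw[:h, :w].astype(np.float64)
    img = img.reshape(h // bin_factor, bin_factor,
                      w // bin_factor, bin_factor).mean(axis=(1, 3))
    img = np.clip((img - black) / (white - black), 0.0, 1.0)
    return (img * 255).astype(np.uint8)  # 8-bit frame for the Tkinter GUI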

Several improvements are in progress. First and foremost is Ethernet performance. In a direct point-to-point connection there really should be no reliability issues, and in my tests TCP achieves ~75 MB/s over gigabit Ethernet. UDP would be even faster, though it might have to cope with potential packet drops. Either way, TCP can already handle 24 FPS 1080p liveview, but both server and client need optimization. Other outstanding issues include the file saving task on the RTOS and better long-exposure control.

Update 6/24

Some updates on the board operating system.

1. By modifying the socket API, I incorporated the zero-copy mode of TCP operation: a pointer to the data memory is passed directly to the EMAC task, with no memcpy into the stack. This provides a 15% bandwidth gain under TCP operation; top speed is around 70 MB/s of payload.

2. I added an interrupt event to the SDIO driver to avoid polling the status register, so I/O no longer wastes CPU cycles and the single core can keep up with the EMAC listening task. As a result, SD file I/O can run simultaneously with the video liveview.

Cooled CMOS Camera – P4: Lens Mount

Things have been going slowly lately. Instead of improving the image acquisition pipeline, I decided to apply some mechanical touches to make the camera more stable. The PCI-E connector is, without a doubt, the weakest link in the entire structure. I also need to actually turn this into a camera by mounting a lens on it, instead of leaving it as several pieces of PCB.


3D Visualization with PCBs

Notice that the linkage of the side plate consists of three slots instead of holes. This was designed to allow tuning of the flange distance from the focal plane. Both PCBs are mounted on M3×0.5 mm standoffs, just like the motherboard in a computer case.


View through the lens mount

An EF macro extension tube is used to mount the lens. The flange distance is approximately 44 mm. The electrical contacts are left floating for now. I attached a 50mm 1.8D lens using a mount adapter.

50mm Lens

The first image this camera saw, through my window.

Cooled CMOS Camera – P3: Image Quality

In the previous post I successfully captured the test pattern with my custom VDMA core. The next step is to implement an operating system plus software on the host machine; to obtain real-time liveview and control, both must be developed in parallel. So in this post, let’s take a look at image quality using a simple bare-metal application.

The sensor is capable of 10 FPS at 14-bit, 30 FPS at 12-bit, or 70 FPS at 10-bit ADC resolution. For astrophotography, 14-bit provides the best dynamic range and achieves unity gain at the default setting. The sensor’s IR filter holder and the camera mounting plate are still in design, so I will only provide a glimpse at some bias and dark images for the moment.

To facilitate dark current estimation, the protective tape over the cover glass was glued to a piece of cardboard, the whole sensor was shielded from light with a metal can lid, and the camera assembly was placed inside a box exposed to the –15°C winter temperature. The camera then continuously acquired 2-minute dark frames for 2 hours, followed by 50 bias frames.

Bias Hist

Pixel Intensity distribution for a 2×4 repeating block (Magenta, Green, Blue for odd rows)

The above distribution reflects a RAW bias frame. It appears each readout bank is built with a slightly different bias voltage. The readout bank assignment is a 2-row by 4-column repeating pattern, one color per channel. The spikes at regular intervals in the histogram imply a scaling factor is applied to odd rows after digitization, to correct for uneven gain between the top and bottom ADCs.

Read Noise Distribution

Read Noise – Mode 3.12, Median 4.13, Mean 4.81

The read noise distribution is obtained by taking, for each pixel, the standard deviation across the 50 bias frames, and then plotting the distribution to find its mode, median, and mean. The result is much better than a typical CCD.
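In numpy, the per-pixel measurement is essentially the following (a sketch; the binning used to locate the mode is my own choice, and unity gain is assumed per the 14-bit default setting):

import numpy as np

def read_noise_stats(bias_frames, gain=1.0):
    # Per-pixel standard deviation across the bias stack, in electrons.
    noise = np.stack(bias_frames).astype(np.float64).std(axis=0, ddof=1) * gain
    hist, edges = np.histogram(noise, bins=1000)
    mode = 0.5 * (edges[np.argmax(hist)] + edges[np.argmax(hist) + 1])
    return mode, np.median(noise), noise.mean()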

Dark current at –15°C

Finally, the dark current is measured from the series of 2-minute exposures by subtracting the master bias frame. Two interesting observations: 1. The density plot gets sharper (taller, narrower) as temperature decreases, corresponding to an ever lower dark generation rate at colder temperatures. 2. The bias drifts with temperature. This could originate in my voltage regulator, in the sensor, or a combination of the two.

The bias drift is usually compensated internally by a clamping circuit prior to the ADC, but I had to turn that calibration off due to a specific issue with this particular sensor design; I will elaborate in a later post. So to measure the dark generation rate, I instead use the FWHM of the noise distribution and compare it against that of a bias frame. At temperature stabilization, the dark-frame FWHM registered 8.774, versus 8.415 e- for a corrected bias frame. For a Gaussian distribution, the FWHM is 2.3548 sigma, so with independent noise sources the variance contributed by the accumulated dark current is 1.113. The dark generation rate at this temperature is therefore below 0.01 e-/s. Excellent!
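The same arithmetic, spelled out with the measured values:

FWHM_PER_SIGMA = 2.3548                      # Gaussian FWHM = 2.3548 * sigma
sigma_dark = 8.774 / FWHM_PER_SIGMA          # dark frame at stabilization
sigma_bias = 8.415 / FWHM_PER_SIGMA          # corrected bias frame
var_dark = sigma_dark**2 - sigma_bias**2     # independent sources add in variance
print(var_dark)                              # ~1.113 e-: the accumulated dark signal
print(var_dark / 120)                        # ~0.009 e-/s over the 2-minute exposure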

Preliminary Summary

The sensor performs well in terms of noise. For long exposures, the dark generation rate of this CMOS is more sensitive to temperature change than that of CCDs: the dark current drops massively when cooled below the freezing point, with a doubling temperature below 5°C.

LEXP_001

An uncorrected dark frame after a 120s exposure, showing visible column bias and hot pixels

The Making of a Cooled CMOS Camera – P2

In the last post, I uncovered a bug in the Vivado implementation that accidentally removed the DIFF_TERM from my input buffers. With that problem solved, I picked the project up again with the goal of high-speed imaging. Here I’ll cover the design principles and the intermediate steps taken to achieve it.

VDMA

End Result – My customized VDMA IP highlighted

The end goal is a unified, resource-efficient, high-performance VDMA that receives, decodes, and transfers data to external memory; in this case, the DDR3 on the PS side. The screenshot above looks satisfyingly concise, but in reality any error along the data path complicates the entire debugging process. So I decided to tackle the modules one at a time from the upstream end, only combining them in the final run.

Step 1

The initial step is to make sure all receiving banks decode the SOL/EOL sync codes properly. The stock ILA IP can be attached immediately downstream of the receiving module to verify its function. The ILA requires a free-running clock, so I had to run the sensor in high-power continuous mode to feed a constant clock into the FPGA.

Step 2

External FIFO

Intermediate Design: Receiver feeding multiple external AXI-S Data FIFOs

After verifying that all 4 banks function, the next step is to receive the data with integrity. Due to inter-bank skew, pixels from different banks can arrive on slightly different clock periods, and the source-synchronous clocks are not aligned among banks. Asynchronous FIFOs are the answer: each bank feeds a FIFO on its own source clock, while all 4 banks are read out on a single internal clock whenever every FIFO has data available. The AXI4-Stream Infrastructure IP contains a Data FIFO for the AXI4-S protocol, whose TReady on the slave port will not drop until the FIFO is full. To read all 4 FIFOs in an aligned manner, all of their data must be valid:

// Pop all four lane FIFOs only when every one of them has valid data
assign Ready = (Data_valid_0_in & Data_valid_1_in) & (Data_valid_2_in & Data_valid_3_in);

Step 3

The above method was tested under Vivado 2015.2; it no longer works in the most recent version for reasons unknown. This drove me to the FIFO_DUALCLOCK_MACRO provided by Xilinx, which is essentially a BRAM using the built-in hardware asynchronous FIFO support of 7 Series devices. The read enable signal must be de-asserted in the same cycle that any of the Empty signals rises:

// Stall the common read side as soon as any lane FIFO runs empty
assign RDEN_all = !EMPTY_0 & !EMPTY_1 & !EMPTY_2 & !EMPTY_3;

Also, the FIFO must be held in reset for 5 cycles on both the read and write clocks before use, so I implemented a reset handshake that asserts reset as soon as the source clock starts running. This leaves plenty of time before actual data arrives. Downstream, the data passes through an AXI-S Data FIFO buffer before reaching the AXI-DMA.

 Internal FIFO

Final Step

So far so good; I’ve got a complete image. But many areas need improvement. First off, the pixels are not organized: each 64-bit transfer carries 2 pixels from an even row and 2 from an odd row, and this column-wise de-interleaving simply costs too much for an ARM CPU. Secondly, at least a quarter of the bandwidth is wasted at 12-bit ADC or lower, which, under the 150 MHz timing constraint of the AXI-DMA, severely hinders high-frame-rate operation. So I decided to use an interleaved transfer mechanism coupled with bit packing. For interleaved transfers, address generation and switching are essential to jump between even and odd rows. As for bit packing, it’s a common trick used in many uncompressed RAW formats (DNG, Nikon NEF); in this instance, I use 3-bytes-2-pixels packing at 10/12-bit ADC.
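As an illustration of the packing, two 12-bit pixels fold into three bytes like this (the exact nibble order is an assumption of mine, and an even pixel count is assumed):

def pack_12bit(pixels):
    # 3-bytes-2-pixels packing at 12-bit ADC, in the spirit of packed
    # RAW formats such as DNG or Nikon NEF.
    out = bytearray()
    for a, b in zip(pixels[0::2], pixels[1::2]):
        out += bytes(((a >> 4) & 0xFF,               # high 8 bits of pixel A
                      ((a & 0xF) << 4) | (b >> 8),   # low nibble of A + high nibble of B
                      b & 0xFF))                     # low 8 bits of pixel B
    return bytes(out)

# pack_12bit([0xABC, 0xDEF]) -> b'\xab\xcd\xef'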

The most complicated part is the 4K boundary check required by the AXI protocol. Consider the following code block:

// Naive single-cycle check: a 4K page holds 512 beats of 8 bytes
if (Target_addr[11:3] + Burst_count > 512)
    // Burst would cross the 4K boundary: shorten it to the beats left in this page
    m_axi_s2mm_awlen <= ((Target_addr | 12'hFF8) - Target_addr) >> 3;
else
    m_axi_s2mm_awlen <= Burst_count - 1;

This is doomed to fail at high frequency because of the wide subtraction: the critical path roughly doubles given the bit width of the target address. The step can instead be split into a pipeline: first evaluate the condition and register it into a flag FDRE, then in the next cycle use the flag to steer the calculation. The pipeline latency in my implementation is high, totaling 5 clock cycles assuming AWReady stays asserted. However, since the next address can be generated in advance while the last data burst is still in flight, the latency is effectively hidden.

Eventually, the design met timing under a 200 MHz AXI-HP clock. After hardware validation, the actual bandwidth for 4K@60FPS video is a whopping 1.2 GB/s out of the 1.6 GB/s available.

Implementation

Implementation Map: AXI-mem-interconnect; AXI-GP-interconnect; Peripheral Resets

Receiver and Sync Detector; Interleave FIFO Control; S2MM Channel Core; AXI-GP Slave Control Registers

Synthesis and implementation take only 3 minutes combined, given just 5 core Verilog files in my VDMA IP; the rest is only 2 AXI interconnects and the peripheral resets. Resource utilization is a mere 11% of LUTs, and less for FDREs, on a 7010 device, leaving plenty of room for future use.

The test pattern looks perfect! In my next post, I’ll showcase some actual images.

2016/12/5