Python Multi-core / GPU Digital Phosphor Rendering of Huge Waveform Data

Most modern oscilloscopes are marketed as Digital Phosphor Oscilloscopes (DPOs) because the waveforms shown on these scopes look night-and-day different from those of their older counterparts.


DPO vs DSO, from Tektronix TDS784D marketing materials

This is because although a traditional DSO can capture data at a blazingly fast speed, it lacks the processing bandwidth to show all of it on the display: it may be able to capture 100 million waveform points for one trigger event and store them in sample memory, but the monitor is only, say, 1024 pixels wide. The DSO simply throws away most of the points, resulting in an ugly, aliased appearance with 1 bit per pixel.

To achieve the nice, smooth look of a DPO, what we want to do is downsample the 100 million points to the 1024-pixel display width with a proper downsampling algorithm.

Recently I've been working with some huge waveform captures with more than 1G points. Plotting such data with the beloved matplotlib results in an ugly blob with all the nuances lost. Even worse, matplotlib uses a vector drawing method, so the speed is extremely slow: drawing anything beyond 100M points becomes intolerable.

One way to speed things up and achieve a DPO look is to use plt.hist2d. There seems to be some automatic vectorization happening, but it's still ugly and slow. Another issue with this approach is that hist2d draws only the points, not the lines connecting them. In some applications (like the NTSC video signal example shown in the TDS784D comparison), the thin traces in regions like the rising and falling edges of a square wave will disappear. And pyplot's hist2d implementation also seems to have some aliasing on the x axis. A minimal sketch of this baseline is shown below.
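For reference, here is a rough sketch of that hist2d baseline, using a toy AM signal of my own (the parameters, bin counts, and colormap are my assumptions, not the actual capture):

```python
import numpy as np
import matplotlib.pyplot as plt

# Toy AM signal standing in for a real capture (all parameters are assumed).
fs = 10e6                               # sample rate
t = np.arange(10_000_000) / fs
y = (1 + 0.5 * np.sin(2 * np.pi * 10 * t)) * np.sin(2 * np.pi * 1e5 * t)

# Bin the raw samples into a 1024-wide 2-D histogram. Note that this grades
# sample density only -- no lines are drawn between consecutive samples, so
# fast edges with few samples on them simply vanish.
plt.hist2d(t, y, bins=(1024, 512), cmap="inferno")
plt.colorbar(label="hit count")
plt.show()
```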

I couldn't find any useful library for this particular requirement, so I wrote my own multithreaded rasterizer that can be deployed on the CPU or GPU. The speedup is significant, and the result is a nice-looking DPO plot with vector lines connecting all the dots.

A Random Walk Sequence, 100M Points

An AM Signal with fc=1MHz, fmod=10Hz, fs=10MHz, 100M Points

An AM Signal with fc=100kHz, fmod=0.1Hz, fs=10MHz, 100M Points

The implementation is straightforward. Think about a single-threaded implementation: all we want to do is use the Bresenham line drawing algorithm to draw all the lines connecting the points in the dataset into a pixel buffer. With Numba, this can be turned into parallel code such that each worker is responsible for one Bresenham line. The only modification we need to make is to change the add instruction into an atomic operation, which can be done with numba.cuda.atomic.add.
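Here is a minimal sketch of that idea with Numba's CUDA target. The buffer size, the scaling of samples into pixel coordinates, and the launch configuration are my own placeholder assumptions, not the author's exact code:

```python
import numpy as np
from numba import cuda

WIDTH, HEIGHT = 1024, 512          # output pixel buffer size (assumed)

@cuda.jit
def draw_lines(ys, counts):
    # One thread per line segment: sample i -> sample i+1.
    i = cuda.grid(1)
    n = ys.shape[0]
    if i >= n - 1:
        return
    # Map sample index to x pixel; ys is assumed pre-scaled to [0, HEIGHT-1].
    scale = (WIDTH - 1) / (n - 1)
    x0 = int(i * scale)
    x1 = int((i + 1) * scale)
    y0 = int(ys[i])
    y1 = int(ys[i + 1])
    # Standard integer Bresenham; each visited pixel is incremented atomically
    # because many segments may land on the same pixel concurrently.
    dx = abs(x1 - x0)
    sx = 1 if x0 < x1 else -1
    dy = -abs(y1 - y0)
    sy = 1 if y0 < y1 else -1
    err = dx + dy
    while True:
        if 0 <= x0 < WIDTH and 0 <= y0 < HEIGHT:
            cuda.atomic.add(counts, (y0, x0), 1)
        if x0 == x1 and y0 == y1:
            break
        e2 = 2 * err
        if e2 >= dy:
            err += dy
            x0 += sx
        if e2 <= dx:
            err += dx
            y0 += sy

# Host side: scale the waveform into pixel rows and launch one thread per segment.
samples = np.random.randn(10_000_000).cumsum()            # stand-in random walk
ys = (samples - samples.min()) / np.ptp(samples) * (HEIGHT - 1)
d_counts = cuda.to_device(np.zeros((HEIGHT, WIDTH), dtype=np.int32))
threads = 256
blocks = (len(ys) - 1 + threads - 1) // threads
draw_lines[blocks, threads](cuda.to_device(ys), d_counts)
image = d_counts.copy_to_host()    # per-pixel hit counts
```

The resulting image holds a hit count per pixel, so a log-scaled colormap (e.g. imshow of log1p(image)) gives the intensity-graded phosphor look.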

The summation operation can also be parallelized for an extra speedup. Based on my experiments with 6GB of VRAM, 64 workers already saturate the processing capability of my graphics card.
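One way to read that summation step (my interpretation, not necessarily the author's exact scheme) is to give each worker its own private pixel buffer to reduce atomic contention, then sum the stack of buffers with a second kernel:

```python
import numpy as np
from numba import cuda

N_WORKERS = 64                     # assumed: one private buffer per worker
HEIGHT, WIDTH = 512, 1024

@cuda.jit
def reduce_buffers(buffers, out):
    # One thread per output pixel; sum that pixel across all worker buffers.
    row, col = cuda.grid(2)
    if row < out.shape[0] and col < out.shape[1]:
        total = 0
        for w in range(buffers.shape[0]):
            total += buffers[w, row, col]
        out[row, col] = total

buffers = cuda.to_device(np.zeros((N_WORKERS, HEIGHT, WIDTH), dtype=np.int32))
out = cuda.to_device(np.zeros((HEIGHT, WIDTH), dtype=np.int32))
tpb = (16, 16)
bpg = ((HEIGHT + tpb[0] - 1) // tpb[0], (WIDTH + tpb[1] - 1) // tpb[1])
reduce_buffers[bpg, tpb](buffers, out)
```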