A second application that benefits from parallel signal processing techniques is the use of multiple instruments for simultaneous input and output. In general, these are referred to as hardware-in-the-loop (HIL) or in-line processing applications. In this scenario, you may use either a high-speed digitizer or high-speed digital I/O module to acquire a signal. In your software, you perform a digital signal processing algorithm. Finally, the result is generated by another modular instrument. A typical block diagram is illustrated in Figure 7.
Figure 7. This diagram shows the steps in a typical hardware-in-the-loop (HIL) application.
Common HIL applications include in-line digital signal processing (such as filtering and interpolation), sensor simulation, and custom component emulation. You can use several techniques to achieve the best throughput for in-line digital signal processing applications.
In general, you can use two basic programming structures – the single-loop structure and the pipelined multiloop structure with queues. The single-loop structure is simple to implement and offers low latency for small block sizes. In contrast, multiloop architectures are capable of much higher throughput because they better utilize multicore processors.
Using the traditional single-loop approach, you place a high-speed digitizer read function, signal processing algorithm, and high-speed digital I/O write function in sequential order. As the block diagram in Figure 8 illustrates, each of these subroutines must execute in a series, as determined by the LabVIEW programming model.
Figure 8. With the LabVIEW single-loop approach, each subroutine must execute in series.
The single-loop structure is subject to several limitations. Because each stage is performed in a series, the processor is limited from performing instrument I/O while processing the data. With this approach, you cannot efficiently use a multicore CPU because the processor only executes one function at a time. While the single-loop structure is sufficient for lower acquisition rates, a multiloop approach is required for higher data throughput.
The multiloop architecture uses queue structures to pass data between each while loop. Figure 9 illustrates this programming between while loops with a queue structure.
Figure 9. With queue structures, multiple loops can share data.
Figure 9 represents what is typically referred to as a producer-consumer loop structure. In this case, a high-speed digitizer acquires data in one loop and passes a new data set to the FIFO during each iteration. The consumer loop simply monitors the queue status and writes each data set to disk when it becomes available. The value of using queues is that both loops execute independently of one another. In the example above, the high-speed digitizer continues to acquire data even if there is a delay in writing it to disk. The extra samples are simply stored in the FIFO in the meantime. Generally, the producer-consumer pipelined approach provides greater data throughput with more efficient processor utilization. This advantage is even more apparent in multicore processors because LabVIEW dynamically assigns processor threads to each core.
For an in-line signal processing application, you can use three independent while loops and two queue structures to pass data among them. In this scenario, one loop acquires data from an instrument, one performs dedicated signal processing, and the third writes data to a second instrument.
Figure 10. This block diagram illustrates pipelined signal processing with multiple loops and queue structures.
In Figure 10, the top loop is a producer loop that acquires data from a high-speed digitizer and passes it to the first queue structure (FIFO). The middle loop operates as both a producer and a consumer. During each iteration, it unloads (consumes) several data sets from the queue structure and processes them independently in a pipelined fashion. This pipelined approach improves performance in multicore processors by processing up to four data sets independently. Note that the middle loop also operates as a producer by passing the processed data into the second queue structure. Finally, the bottom loop writes the processed data to a high-speed digital I/O module.
Parallel processing algorithms improve processor utilization on multicore CPUs. In fact, the total throughput is dependent on two factors – processor utilization and bus transfer speeds. In general, the CPU and data bus operate most efficiently when processing large blocks of data. Also, you can reduce data transfer times even further using PXI Express instruments, which offer faster transfer times.
Figure 11. The throughput of multiloop structures is much faster than single-loop structures.
Figure 11 illustrates the maximum throughput in terms of sample rate, according to the acquisition size in samples. All benchmarks illustrated here were performed on 16-bit samples. In addition, the signal processing algorithm used was a seventh-order Butterworth low-pass filter with a cutoff of 0.45x sample rate. As the data illustrates, you achieve the most data throughput with the four-stage pipelined (multiloop) approach. Note that a two-stage signal processing approach yields better performance than the single-loop method (sequential), but it does not use the processor as efficiently as the four-stage method. The sample rates listed above are the maximum sample rates of both input and output for an NI PXIe-5122 high-speed digitizer and an NI PXIe-6537 high-speed digital I/O module. Note that at 20 MS/s, the application bus is transferring data at rates of 40 MB/s for input and 40 MB/s for output for a total bus bandwidth of 80 MB/s.
It is also important to consider that the pipelined processing approach does introduce latency between input and output. The latency is dependent upon several factors, including the block size and sample rate. Tables 1 and 2 below compare the measured latency according to block size and maximum sample rate for the single-loop and four-stage multiloop architectures.
Tables 1 and 2. These tables illustrate the latency of single-loop and four-stage pipeline benchmarks.
As you expect, the latency increases as the CPU usage approaches 100 percent utilization. This is particularly evident in the four-stage pipeline example with a sample rate of 20 MS/s. By contrast, the CPU usage barely exceeds 50 percent in any of the single-loop examples.