High-Speed Data Streaming: Programming and Benchmarks

Publish Date: Jun 05, 2017

Overview

PXI Express is changing the way engineers design systems. This document discusses the technology that enables high-speed data streaming, application design that maximizes system streaming performance, and data rate benchmarks that can be achieved in stream-to-disk and stream-to-memory applications.

Table of Contents

  1. Introduction
  2. Streaming Technology
  3. Best Programming Practices for Stream-to-Disk Applications
  4. Hardware and Software Considerations
  5. Stream to/from Disk Benchmarks
  6. Stream to/from Memory Benchmarks
  7. Conclusion

NOTE: The products and benchmarks in this document do not reflect NI's latest technology and solutions. Please visit ni.com/streaming for more current information. 

 

1. Introduction

Many engineers use streaming, but in numerous applications data cannot be generated or acquired fast enough. In these situations, engineers must compromise, either by using a slower sample rate so that data can be transferred over the bus, or by sampling at the necessary high speeds only for the short periods of time that onboard instrument memory allows. Neither sacrifice is desirable.

Traditionally, benchtop instruments such as oscilloscopes, logic analyzers, and arbitrary waveform generators have implemented limited data streaming. Although many instruments have incredibly fast sampling rates and high bandwidths, the bus that returns data to the PC is often overlooked, and a slow bus can dramatically increase overall test times. For example, the majority of acquisitions performed with stand-alone oscilloscopes are finite. The duration of the acquisition is dictated by the amount of onboard memory available in the oscilloscope (a stand-alone arbitrary waveform generator has the same limitation, except that the waveform is downloaded to the onboard device memory for generation). After the acquisition is complete, the data is transferred to the controlling PC using Ethernet or, more commonly, GPIB. Consider a case where data is sampled at 1 GS/s after an event trigger. If the device has 256 MB of onboard memory per channel, the memory fills and the acquisition ends after about 250 ms. If the instrument interfaces using the GPIB bus (which has a bandwidth of about 1 MB/s), the user must wait more than 4 minutes (about 250 s) for this data to be transferred to the computer for analysis. Now compare this to an NI digitizer/oscilloscope with the same sample rate and onboard memory. The same data transfer would take fewer than 3 seconds over the high-bandwidth PCI/PXI bus: a more than 80x improvement! The PCI Express/PXI Express bus enables even faster data transfers.
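The arithmetic in this example can be checked with a short Python sketch. It assumes one byte per sample, decimal megabytes, and the theoretical 132 MB/s peak of 32-bit, 33 MHz PCI. None of these values is stated in the text above, and sustained PCI throughput is lower in practice.

```python
def acquisition_time_s(memory_bytes, sample_rate_hz, bytes_per_sample=1):
    """Time until onboard memory is full (assumes 1 byte per sample by default)."""
    return memory_bytes / (sample_rate_hz * bytes_per_sample)


def transfer_time_s(num_bytes, bus_bytes_per_s):
    """Time to move a block of data across a bus at a sustained rate."""
    return num_bytes / bus_bytes_per_s


MEMORY_BYTES = 256e6      # 256 MB of onboard memory per channel
GPIB_RATE = 1e6           # about 1 MB/s, as quoted in the text
PCI_RATE = 132e6          # assumed theoretical peak of 32-bit, 33 MHz PCI

fill_time = acquisition_time_s(MEMORY_BYTES, 1e9)      # seconds until memory is full
gpib_time = transfer_time_s(MEMORY_BYTES, GPIB_RATE)   # seconds over GPIB
pci_time = transfer_time_s(MEMORY_BYTES, PCI_RATE)     # seconds over PCI

print(f"memory fills in {fill_time * 1e3:.0f} ms")
print(f"GPIB transfer:  {gpib_time:.0f} s")
print(f"PCI transfer:   {pci_time:.2f} s")
```

Under these assumptions the GPIB transfer takes a little over four minutes and the PCI transfer about two seconds, consistent with the figures quoted above.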

 


2. Streaming Technology

PXI Express, built on PCI Express technology, offers dedicated bandwidth per instrument. PCI Express, available in x1, x4, x8, and x16 links (pronounced “by 1,” “by 4,” and so on), provides 250 MB/s of throughput per lane with very low latency. The x1 and x4 options are most common for instrument-class hardware and provide 250 MB/s and 1 GB/s (four lanes at 250 MB/s) of dedicated throughput, respectively. As a result, total system throughput increases as the number of instruments in a chassis increases. The figure below plots the bandwidth of various buses against their latency. Latency is the delay that occurs in any transmission of data, and it is frequently forgotten in system design. Many people recognize that higher bandwidth is desirable, but high latency can also lengthen test times and should be considered when designing a system.

 

 

Figure 1. Bandwidth vs. Latency of Popular Instrument Buses
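The per-lane figure quoted above scales linearly with link width, which a minimal Python sketch can illustrate. The instrument mix in the example is hypothetical, and the aggregate assumes every instrument has its own dedicated link all the way to the controller, which depends on the chassis backplane.

```python
MB_PER_S_PER_LANE = 250  # per-lane throughput quoted in the text


def link_throughput_mb_s(lanes):
    """Dedicated throughput of one PCI Express link of the given width."""
    if lanes not in (1, 4, 8, 16):
        raise ValueError("expected a x1, x4, x8, or x16 link")
    return lanes * MB_PER_S_PER_LANE


def aggregate_throughput_mb_s(instrument_lanes):
    """Sum of the dedicated link rates for a set of instruments."""
    return sum(link_throughput_mb_s(n) for n in instrument_lanes)


# Hypothetical chassis population: two x4 instruments and one x1 instrument.
total = aggregate_throughput_mb_s([4, 4, 1])
print(f"x1 link: {link_throughput_mb_s(1)} MB/s")
print(f"x4 link: {link_throughput_mb_s(4)} MB/s")
print(f"aggregate for [x4, x4, x1]: {total} MB/s")
```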

The PXI platform, since it is based on the high-bandwidth PCI and PCI Express buses, enables instruments to stream data to or from sources other than onboard device memory. A PXI/PXI Express digitizer or oscilloscope is able to continuously acquire at a high sample rate because the high bandwidth of the bus allows theoretical real-time data transfer to PC memory or disk at rates up to 1 GB/s so that data can be fetched before it is overwritten in device memory. 

Consequently, the bottleneck for an acquisition or generation is no longer the bus, but reading or writing the data on the system storage: a hard drive or even a Redundant Array of Inexpensive Disks (RAID). Again, this means engineers can acquire or generate data for long periods of time at the high sampling rates they need, instead of compromising their sample rate or test time. For example, using an NI PXIe-5122 digitizer and a 24-drive RAID array with a capacity of 24 TB, data can be captured at the maximum sampling rate of 100 MS/s on both simultaneously sampled channels for more than 15 hours.
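The recording time in this example follows from dividing the array capacity by the data rate. The sketch below assumes decimal terabytes, 2 bytes per sample (the figure used for this digitizer later in the document), and no file system overhead.

```python
def recording_hours(capacity_bytes, sample_rate_hz, bytes_per_sample, channels):
    """Hours of continuous recording that fit in the given storage capacity."""
    throughput = sample_rate_hz * bytes_per_sample * channels  # bytes per second
    return capacity_bytes / throughput / 3600.0


# 24 TB array, 100 MS/s, 2 bytes per sample, 2 channels.
hours = recording_hours(24e12, 100e6, 2, 2)
print(f"about {hours:.1f} hours of continuous capture")
```

With these assumptions the array holds somewhat more than 16 hours of data, in line with the "more than 15 hours" stated above.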

What does all this mean? Many application challenges were previously unsolvable because they required expensive proprietary systems, but now these challenges become feasible using commercially available PXI Express. Some applications include RF/IF data streaming in signal intelligence, data recording and playback, digital video generation/streaming for image sensor and display panel testing, and other high data throughput applications.

 


3. Best Programming Practices for Stream-to-Disk Applications

It is widely recognized that the progression of applications from single-threaded to multithreaded architectures is a significant programming challenge. LabVIEW offers an ideal programming environment for multicore processors because LabVIEW applications are inherently multithreaded. As a result, LabVIEW programmers can benefit from multicore processors with little or no extra code. Multithreaded applications provide the greatest benefits to parallel test and stream-to-disk applications, and using proper programming in streaming applications allows maximum performance of PXI Express instruments. Both these benefits are attained by parallelizing the code.

 

The same rules of parallelism apply for creating stream-data-to-disk applications or for getting the most performance from the computer processor(s). In a streaming application, the two main bus- and processor-intensive tasks are: 1) acquiring data from the digitizer and 2) writing data to a file. To better utilize processor resources, users can divide these tasks into separate loops. Data is passed between the loops through a LabVIEW queue, an arrangement commonly referred to as the producer/consumer architecture.


 
 
Figure 2.  Producer/Consumer Loop Architecture with Queue Structure

In the preceding figure, the top loop (the producer) acquires data from a high-speed digitizer and passes it to a queue. The bottom loop (the consumer) reads data from the queue and writes it to the hard disk. LabVIEW implements the queue as a block of allocated PC memory, which serves as a temporary storage FIFO for data passing between the two loops. In most programming languages, sharing memory between multiple processes requires significant programming overhead. However, LabVIEW handles all the memory access to ensure that read-write race conditions do not occur. The execution of a queue structure can be visualized with the following diagram.

 
Figure 3.  Data-Flow Programming Model of Queue Structure

As data is acquired from the digitizer, it is placed into memory in a first-in-first-out (FIFO) buffer using the queue structure (element 0, element 1, ..., element n-1, element n). As the figure illustrates, queues can pass data between multiple loops. The dequeue operation accesses the same memory FIFO, removing elements in the same order (starting with element 0). LabVIEW automatically creates independent execution threads for the two While loops. Stream-to-disk applications benefit from this parallel execution because one task does not have to finish before the other can proceed. By contrast, the sequential model that most text-based programming languages employ by default drastically reduces performance.
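For readers working in a text-based language, the same producer/consumer pattern can be sketched in Python with two threads and a thread-safe FIFO queue. This is an illustration of the architecture only, not the LabVIEW implementation: the producer fabricates data blocks instead of fetching from a digitizer, and the function names, block sizes, and queue depth are all hypothetical.

```python
import queue
import threading


def producer(q, num_blocks, block_size):
    """Stand-in for the acquisition loop: enqueue simulated data blocks."""
    for i in range(num_blocks):
        block = bytes([i % 256]) * block_size  # placeholder for a digitizer fetch
        q.put(block)                           # blocks if the queue is full
    q.put(None)                                # sentinel: acquisition finished


def consumer(q, sink):
    """Stand-in for the disk loop: dequeue blocks in FIFO order and write them."""
    while True:
        block = q.get()
        if block is None:
            break
        sink.write(block)


def stream(sink, num_blocks=8, block_size=4096, max_queued=4):
    """Run both loops in parallel, sharing data through a bounded FIFO queue."""
    q = queue.Queue(maxsize=max_queued)
    threads = [
        threading.Thread(target=producer, args=(q, num_blocks, block_size)),
        threading.Thread(target=consumer, args=(q, sink)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

Calling `stream(open("capture.bin", "wb"))` would write eight 4 KB blocks in acquisition order. The bounded queue makes the producer wait when the consumer falls behind; in a real acquisition that condition means device memory is at risk of overflowing, so the queue depth and block size would need to be chosen with the disk rate in mind.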

Beyond overall application architecture, stream-to-disk or stream-from-disk rates can be affected by the following factors, some of which are discussed in more detail later:

  • Running background programs such as virus scan
  • How the hard drive is formatted to group data
  • Using system restore or the recycle bin
  • Disk fragmentation
  • Location of the file on the hard drive


4. Hardware and Software Considerations

 

Controllers

NI PXIe-8135 Embedded Controller

The PXIe-8135 controller provides optimal streaming performance for certain PCI Express links that are connected to the PXIe backplane.  In order to understand which slots in which chassis are affected by these special behaviors, refer to the following table:

        

Figure 4.  PCI/PCIe Links on Chassis Backplane

Links 3 and 4 both return 128-byte write completions rather than the 64-byte write completions returned by links 1 and 2. Links 3 and 4 come from a x8 link. For the controller to return 128-byte completions, the read requests must be 128 or 256 bytes (the default PCI Express capability is 512 bytes). This can be accomplished with Modular Instruments arbitrary waveform generators and high-speed digital I/O devices by changing the Preferred Packet Size attribute in the API to 256 bytes.

Performance issues:

Resizing windows, including minimize and maximize operations, can severely affect streaming applications. This is a CPU limitation rather than an issue with data transfer across the PCIe bus.

To turn off this effect, do the following:

  1. Go to Start>>My Computer, right-click My Computer, and choose Properties
  2. Choose the Advanced tab and click Settings under Performance
  3. Under the Visual Effects tab, choose Custom, and uncheck
    1. Animate windows when minimizing and maximizing (recommended)
    2. Show window contents while dragging (optional)


Figure 5.  Performance Options Window Used to Change Window Visual Effects

 

Chassis

PXIe-1062Q Chassis:

In the PXIe-1062Q chassis, slots 3, 4, and 5 each have a dedicated x4 PCIe link to the controller, which allows high-bandwidth measurements from these slots. The PCI connections from slots 2, 3, 5, 6, 7, and 8 share a PCIe-PCI bridge to the controller. These connections perform at PCI transfer rates.

PXIe-1065 Chassis:

The best performance in the PXIe-1065 chassis comes from using PXI Express devices in slots 7 and 8. Slots 7 and 8 each have a dedicated x4 link to the host controller. These slots do not share bandwidth with any other devices, and because they do not pass through a backplane switch, no switching considerations apply.

Slots 9-18 all share one PCIe switch. Slots 9-14 each have an individual x4 link to the switch, and slots 15-18 share a x1 link to the PCIe switch. This greatly reduces the ability to perform high-bandwidth generations or measurements with multiple devices in slots 9-18 simultaneously.

PXIe-1075 Chassis:

The slots on the PXIe-1075 chassis are divided into groups that share 4 PCIe switches. It is important to note that the right-hand PCI segment is connected to switch 3, not switch 4, via a PCIe/PCI bridge. So, if you are trying to maximize PCI performance by minimizing the traffic on the switch that also serves the PCIe/PCI bridge, reduce the load on switches 1 (left-hand PCI segment) and 3 (right-hand PCI segment).

 

LabVIEW I/O Performance

The standard LabVIEW File I/O VIs (Open/Create/Replace File.vi, Read From Binary File.vi, Close File.vi) need to be functional for all situations, so before LabVIEW 8.6 these functions could not be optimized for streaming applications. LabVIEW 8.6 added a Boolean input to the Open/Create/Replace File.vi that disables buffering, optimizing the function for streaming applications. Note that when using this option you must read from or write to the file in integer multiples of the disk sector size. The Read and Write VIs return an error if reads or writes of an inappropriate size are attempted. The following figure shows this option on the Open/Create/Replace File.vi.

Figure 6.  Disable Buffering Option for the Open/Create/Replace File.vi in LabVIEW
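The sector-size requirement for unbuffered access means transfer sizes usually need to be rounded before a read or write is issued. The following Python sketch shows one way to do that arithmetic. The 512-byte default is an assumption for illustration; many drives use 4096-byte sectors, and a real application should query the actual sector size of the target volume.

```python
def is_sector_aligned(num_bytes, sector_size=512):
    """True if the size is an integer multiple of the sector size."""
    return num_bytes % sector_size == 0


def round_up_to_sector(num_bytes, sector_size=512):
    """Smallest multiple of the sector size that holds num_bytes."""
    if sector_size <= 0:
        raise ValueError("sector_size must be positive")
    return -(-num_bytes // sector_size) * sector_size  # ceiling division


# A 1,000,000-byte fetch is not sector-aligned, so pad it before writing.
requested = 1_000_000
padded = round_up_to_sector(requested)
print(is_sector_aligned(requested), padded, is_sector_aligned(padded))
```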

An analogous option exists for C file I/O: on Windows, the CreateFile function accepts a special flag that disables buffering (see Figure 7).

Figure 7.  C Code for Creating a File that Disables Buffering

 

Effects of Using a Virus Scanner

Virus scanners can have a significant impact on any time-critical application where the application requires access to disk or depends on scheduled access to CPU resources.

Virus scanners can interrupt sustained operation of an application for tasks such as scheduled daily scans or scheduled daily updates to the scanner. National Instruments recommends disabling scheduled scans and updates for the entire duration of extended streaming applications. The on-access scanner can be left in place to provide real-time protection from viruses.

 

RAID Arrays

The HDD-8264 12-drive RAID array is one of the highest-performing RAID arrays that National Instruments offers (for higher performance, consider the HDD-8266). The HDD-8264 has maximum possible transfer rates of approximately 685 MB/s write and 740 MB/s read. These values reflect the limit of the RAID controller's transfer rate across the PXI Express x4 link. The rates are lower when reading from and writing to disk in a streaming application that uses hardware such as an arbitrary waveform generator or high-speed digitizer. Benchmarks are shown below for streaming to arbitrary waveform generators.

The other option for a RAID array is the HDD-8263 4-drive RAID array. This array has possible peak rates of about 325 MB/s with files located on the outer rim of the hard disk, as shown later in this section.

 

MXI Express x4 Remote Controller

NI PXIe/PCIe-8371, NI PXIe/PCIe-8372 Remote PCI Express Control of PXI Express

On the 2-port PCIe-8372, port 2 provides higher performance (throughput) than port 1.  This is due to the internal architecture of the PCI Express switch used on this product.  The 1-port PCIe-8371 exposes port 2 and depopulates port 1.

The maximum aggregate rate at which data can be sent upstream (e.g. digitizers writing to memory) through the PXIe-8370 is no more than 799 MB/s, due to hardware limitations on this module.

 

Optimizations

A user can take several steps to slightly improve the read and write rates of the RAID array. One factor is where on the physical disk the file is located. Performance is considerably better when the file is located near the outer rim of the hard disk.

For example, a write-to-disk test was run on the HDD-8263 with a 950 GB file that was broken into three parts. The first segment of the file was located near the inner rim of the disks, the second segment somewhere near the middle, and the third segment starting at the outer rim of the disks.

Figure 8.  Performance of LV Write to Different Disk Locations

 

As the figure shows, the first part of the write, near the middle of the disks, runs at significantly lower rates than the writes of the two subsequent file segments.

Another way to optimize file reads and writes is to pre-allocate the file space and then replace the contents during the write-to-disk operation. This tactic applies only to streaming to disk, for applications such as high-speed digitizer or HSDIO streaming. Note that in the above example, the file space was allocated dynamically by the operating system, so slightly higher performance could be possible with pre-allocated file space. When using this approach, make sure to select the open option for the operation (0:open) input of the Open/Create/Replace File.vi.

Figure 9.  Open/Create/Replace File.vi Using "Open" Input to Overwrite the Pre-Allocated File

If you use "replace" or "create or replace" as the input, the application replaces the files that already exist instead of reusing them as they are.
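The difference between opening a pre-allocated file and replacing it can be illustrated outside LabVIEW as well. In the Python sketch below, the function names are hypothetical and `truncate` is used as a simple stand-in for pre-allocation. On many file systems `truncate` only sets the file length and may create a sparse file, so a real application might write zeros or use a platform-specific allocation call instead.

```python
import os


def preallocate(path, size_bytes):
    """Create the file once at its final size, before streaming starts."""
    with open(path, "wb") as f:
        f.truncate(size_bytes)  # sets the length; may be sparse on some file systems


def overwrite_in_place(path, blocks):
    """Open the existing file without truncating it and replace its contents."""
    # "r+b" keeps the existing file and its size; "wb" would discard it,
    # which is the analogue of choosing "replace" instead of "open".
    with open(path, "r+b") as f:
        for block in blocks:
            f.write(block)
    return os.path.getsize(path)
```

For example, after `preallocate("capture.bin", 4096)`, writing two 1024-byte blocks with `overwrite_in_place` leaves the file at 4096 bytes, with the new data occupying the first 2048 bytes.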

Another optimization that can be made when reading from and writing to a RAID array is to use overlapped I/O (asynchronous reads and writes). While this works well for C applications, it is not practical for LabVIEW applications because of LabVIEW’s synchronous dataflow programming model. However, by writing to multiple files simultaneously on the HDD-8264, you can achieve aggregate rates that are faster than reading from or writing to one file. The reason is that the “dead time” between writes to one file is used by the write operations to the other file(s). As long as the read or write size is large enough, the penalty for relocating the drive heads to different locations on disk for each file should be small compared to the performance benefit. Taking this option a step further, the HDD-8264 can be formatted as 3 separate RAID volumes of 4 drives each, with each volume holding one pre-allocated file, so that the application reads from or writes to 3 separate files. This should allow reads and writes to stay on the outer edge of the disks in each volume.

National Instruments recommends reading from or writing to multiple files that are located on separate volumes of the HDD-8264.
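The idea of keeping the storage busy by writing several files at once can be sketched in Python with one worker thread per file. This is a minimal illustration under assumed names and paths, not a LabVIEW or NI API example. Any throughput gain depends on the storage hardware, and none is claimed for this code.

```python
from concurrent.futures import ThreadPoolExecutor


def write_file(path, blocks):
    """Write all blocks to one file and return the number of bytes written."""
    total = 0
    with open(path, "wb") as f:
        for block in blocks:
            total += f.write(block)
    return total


def write_files_concurrently(jobs):
    """Write several files in parallel.

    jobs maps each file path (ideally on a separate volume) to the list of
    data blocks destined for that file. Returns bytes written per path.
    """
    with ThreadPoolExecutor(max_workers=max(1, len(jobs))) as pool:
        futures = {
            path: pool.submit(write_file, path, blocks)
            for path, blocks in jobs.items()
        }
        return {path: fut.result() for path, fut in futures.items()}
```

In a layout like the one described above, each path would point to a file on a different RAID volume, so that while one write is waiting the others can proceed.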

 


5. Stream to/from Disk Benchmarks

Earlier discussion described how data streaming speed for traditional instrumentation systems is limited by the amount of data that can be pushed through the bus. The high bandwidth of PXI/PXI Express completely changes the bottleneck so the read and write speed of the storage system becomes the new limiting factor. On most PXI controllers, the hard disk is capable of speeds of around 40 MB/s. However, these disk rates can be increased significantly by using external ExpressCard or PXI Express RAID-0 hard drive configurations. RAID technology is an easy way to combine multiple hard disk drives for faster disk speeds.

When calculating stream-to-disk or stream-to-memory throughput for an instrument, we can use the following equation:
Throughput = Sampling rate x Bytes/Sample x Number of Channels
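The equation can be written as a small Python helper and applied to the two instrument configurations quoted in this section. Treating the 32 one-bit digital channels as a single 4-byte word per clock cycle is an assumption made here to fit the formula, and the function name is illustrative.

```python
def throughput_bytes_per_s(sample_rate_hz, bytes_per_sample, channels):
    """Throughput = sampling rate x bytes per sample x number of channels."""
    return sample_rate_hz * bytes_per_sample * channels


# Digitizer: 100 MS/s, 2 bytes per sample, 2 channels.
digitizer = throughput_bytes_per_s(100e6, 2, 2)

# Digital I/O: 50 MHz clock, 32 one-bit channels packed into one 4-byte word.
digital_io = throughput_bytes_per_s(50e6, 4, 1)

print(f"digitizer:   {digitizer / 1e6:.0f} MB/s")
print(f"digital I/O: {digital_io / 1e6:.0f} MB/s")
```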

NI-Scope Benchmarks

For an NI PXIe-5122 high-speed digitizer with a x4 connector, sampling at the maximum sampling rate of 100 MS/s on two 14-bit channels (each sample is transferred as 2 bytes) translates to 400 MB/s of data over the bus. This number is well within the bandwidth limit of x4 PCI Express, so we can address stream-to-disk applications using a RAID-0 hard drive configuration. Using the NI PXIe-5122, we achieved the following benchmarks for stream-to-disk applications.


Figure 10. Maximum Stream-to-Disk Rates for NI PXIe-5122

For the NI PXIe-5122 benchmarks shown in the preceding table and also for the following NI PXIe-6537 and NI PXIe-5442 benchmarks, a PXI Express dual-core controller was used with a x4 PXI Express RAID-0 hard drive configuration. The maximum hard drive read and write speeds were tested at over 600 MB/s, and the acquisition size for the test results shown above was 40 GB.  The NI PXIe-5122 devices used in this test came with 256 MB of onboard memory, and the PXIe-5442 devices had 512 MB of onboard memory.

 

NI-HSDIO Benchmarks

For an NI PXIe-6537 high-speed digital I/O module with a x1 connector, sampling at the maximum clock rate of 50 MHz on all 32 channels translates to 200 MB/s of data over the bus. Using the NI PXIe-6537 with the RAID-0 hard drive configuration, we achieved the following benchmarks for stream-to-disk and stream-from-disk applications.


Figure 11. Maximum Stream-to/from-Disk Rates for NI PXIe-6537

One number that requires an explanation is the throughput for 32 or more channels streaming-from-disk (generation).  The lower throughput is not a limitation of PXI Express bandwidth; it is actually a result of the maximum allowable packet transfer size the controller chipset allows.


Figure 12. Maximum Stream-to/from-Disk Rates for NI PXIe-6537 using NI PXIe-1065 and NI PXIe-8130

Because of the controller chipset, generating data with the NI PXIe-6537 in Slots 7 and 8 of the NI PXIe-1065 and Slots 3 and 5 of the NI PXIe-1062Q results in lower maximum output rates. NI recommends using the NI PXIe-6537 in Slots 9 through 14 of the NI PXIe-1065 and Slot 4 of the NI PXIe-1062Q for maximum generation performance.


Figure 13. Maximum Stream-to/from-Disk Rates for PXIe-6537 High-Speed Digital I/O using PXIe-1062Q and PXIe-8130

 

NI-FGEN Benchmarks

Below are benchmarks for stream-from-disk configurations that use the hardware combinations listed in the table. Testing focused on the PXIe-1062Q, PXIe-1065, and PXIe-1075 PXI Express chassis, and the benchmarks were performed using both the HDD-8263 and HDD-8264 RAID arrays. These tests benchmarked only the PXI-5421, PXI-5441, PXIe-5442, and PXIe-5450, as these are the devices most commonly used for streaming.

 

 

Figure 14. Maximum Stream from Disk Rates for Multiple Arbitrary Waveform Generator, Chassis and RAID Combinations.

 


6. Stream to/from Memory Benchmarks

As a variation of a stream-to-disk application, we also can stream data from a high-speed digitizer into the onboard memory of our PXI controller. This scenario conclusively shows that even in the previous example, the bus is not limiting the throughput; the disk write speed of the RAID-0 array is the bottleneck. In this experiment, the acquisition size is actually limited by the amount of available PC memory. As a result, the following performance for a stream-to-memory application using the NI PXIe-5122 high-speed digitizer can be achieved.  

 


Figure 15. Maximum Stream-to-Memory Rates for NI PXIe-5122

In the test described previously, a PXI Express dual-core controller with 2 GB of onboard memory was used. The acquisition length was 100,000,000 samples per channel, which requires 800 MB of PC memory for four channels (2 bytes per sample). The NI PXIe-5122 devices used in this test came with 256 MB of onboard memory. A similar test can be run with the NI PXIe-6537 high-speed digital I/O module, as shown in the following table. 


Figure 16. Maximum Stream-to/from-Memory Rates for NI PXIe-6537

For the same reason described above for streaming from disk, the generation throughput of the NI PXIe-6537 is limited by the controller chipset, not PXI Express bandwidth. Using the same setup as the digitizer test, we can stream to the NI PXIe-5442 at 200 MB/s per channel. As seen below, we can generate from memory on up to four channels at the full device sample rate.


Figure 17. Maximum Stream-from-Memory Rates for NI PXIe-5442

The most important takeaway from these stream-to/from-memory benchmarks is that the system throughput increases above the write speed of the RAID array. This increase shows that the bus can carry more data than the disks can store, so the bus is not the bottleneck. One reason both stream-to/from-disk and stream-to/from-memory applications can achieve such high throughput in PXI Express is the high-bandwidth, low-latency data bus: PCI Express.

 


7. Conclusion

PXI and PXI Express are enabling engineers to take the capabilities of their systems to the next level. The high bandwidth of the PCI bus used in the PXI platform allows high sampling rates and long acquisitions to coexist. By integrating PCI Express technology into the platform, even higher performance is possible with data rates up to 4 GB/s per slot and that speed is ever increasing with newer generations of PCI Express. Good application design can help maximize the streaming performance of a system, and several PXI Express instruments can now stream to or from PC memory or disk at their maximum sampling rates so that entire data sets can be later processed or analyzed.


Related Links:
Modular Instruments for PCI Express and PXI Express
NI-SCOPE Stream to Disk Examples
NI-HSDIO Stream to Disk Examples
NI-HSDIO Stream from Disk Examples
            

               
