This document provides an overview of a typical control application created as a benchmarking reference platform for testing the performance characteristics of NI hardware & software in this application space. This application is an example of one type of benchmark that NI produces: a whole-system benchmark. The purpose of such whole-system benchmarks is to show the interactions common in a large system and to analyze the impact different parts of an application have not only on the performance of the system but also on the system architecture. In addition to such whole-system benchmarks, NI also publishes a suite of targeted benchmarks that analyze individual parts of such systems in isolation.
In this paper, we begin with an in-depth review of the application architecture, follow with a discussion of the tools and techniques used to develop the system and evaluate its performance, and end with detailed benchmarking results.
1. Application Architecture
Figure 1 shows our application, which comprises four major subsystems. Three of the subsystems are interdependent loops on a real-time target, each focused on a task common in a closed-loop control application: control, monitoring, and communication/local logging. The fourth subsystem is the host PC, which provides the operator interface and handles communication to and from the real-time target.
The LabVIEW Real-Time Module is used on the real-time target to provide the determinism required by the control subsystem. In this application example, we also run the monitoring and logging subsystems on the RT target.
Figure 2 shows a breakdown of the control subsystem. The control subsystem is a collection of control loops set at high priority to ensure deterministic performance. The control loops run at a base rate of x. The monitoring and communication/logging loops run at a fraction of this base rate.
The purpose of this benchmark is to determine the maximum value of x such that the entire system remains stable. We define a stable system as one in which no control loop misses a period (is never “late”) and no buffer used to communicate information from the high-priority tasks to lower-priority tasks overflows.
The control loops utilize both discrete and continuous data from a mix of I/O provided by NI data acquisition hardware and NI-DAQmx driver software. In this case, the control loops use analog, digital, and counter I/O. In addition to I/O, each control loop performs a PID calculation or simple discrete control algorithm depending on the nature of the signals.
The monitoring subsystem is shown in Figure 3. The monitoring loops run at normal priority and are responsible for collecting a large set of auxiliary inputs at a slower rate than the control I/O. The monitored inputs are not processed but are logged locally and communicated to the host. The monitoring subsystem runs at one-tenth the rate of the control loop.
The communication and logging subsystem shown in Figure 4 is responsible for communicating all control data—acquired, processed, and output—to the host computer. The subsystem also monitors input from the host computer and logs all of the control and monitoring data to local storage to preserve an execution record of the system. An important additional function of this subsystem is to allow a user on the host computer to change set points and thresholds used in the real-time subsystems.
The final subsystem of the application provides the operator interface & host logging. From the host computer, a user can monitor the status of the real-time target, view the acquired and processed data, and change the set points and thresholds of the control tasks on the fly. Additionally, the host subsystem redundantly logs the communicated data to its local storage.
In the next few sections, we go into more detail regarding some of the important implementation choices made for this application. Many of the topics discussed are good choices for other applications written with the LabVIEW Real-Time Module and/or NI-DAQmx.
All data points acquired, processed, and output within the control and monitoring loops are passed to the communication loop for local logging and transfer to the host. In the other direction, PID set point values and stop commands are passed from the communication loop to the control and monitoring loops. To transfer data deterministically between subsystems on the RT target, this application uses Real-Time FIFOs, as shown in Figure 5.
Real-Time FIFOs are non-blocking queues that do not use the memory manager and thus can provide the determinism necessary when used to communicate data between high-priority and low-priority loops. To learn more about RT FIFOs, navigate to the Real-Time Module»Real-Time Module Concepts»Sharing Data in Deterministic Applications book in the Contents tab of the LabVIEW Help. In LabVIEW, select Help»Search the LabVIEW Help to display the LabVIEW Help.
The size of an RT FIFO is set at creation time to ensure that it does not allocate memory at run time. The size of the RT FIFO is determined by the production and consumption rates of the tasks that use the FIFO for communication. In this application, the rates for the control task and the communication & logging task are x and x/50, respectively, meaning that the control loop runs 50 times for every iteration of the communication & logging loop. If we assume a perfectly balanced system, where the control loop creates one data point every iteration and the communication & logging loop always empties the FIFO when it runs, a FIFO size of 50 is sufficient. Since there is always some additional jitter in any system, setting the FIFO size larger ensures that no data is lost. In this application, we chose a safety factor of 20 and set the FIFO size between the control and communication & logging tasks to 1000.
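The sizing arithmetic above can be captured in a few lines. The following Python sketch is purely illustrative (the application itself is written in LabVIEW, and the function and parameter names are our own):

```python
# Illustrative FIFO-sizing rule; names are hypothetical, not an NI API.

def rt_fifo_size(consumer_divisor: int, safety_factor: int) -> int:
    """Depth needed when the producer enqueues one element per iteration
    and the consumer runs once every `consumer_divisor` iterations,
    padded by a safety factor to absorb jitter."""
    return consumer_divisor * safety_factor

# The communication & logging loop runs once per 50 control iterations;
# a safety factor of 20 yields the 1000-element FIFO used in this application.
depth = rt_fifo_size(consumer_divisor=50, safety_factor=20)  # → 1000
```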
This application requires multiple loops running at different rates for control and monitoring subsystems of the application. The LabVIEW Timed Loop provides an ideal structure for developing this type of multi-rate system.
A timing source drives the execution of a Timed Loop; by default, this is the 1 kHz clock of the operating system. In systems that include a supported hardware device, such as a National Instruments data acquisition board, alternate timing sources, such as the end-of-scan interrupt of an E-Series board AI engine, can drive the Timed Loop. Figure 6 shows how this is accomplished in a LabVIEW block diagram.
Since we want to run the control loop ten times faster than the monitoring loop, we could have created separate timing sources for the control and monitoring loops and run the two loops in parallel. The issue with this approach is that the separate timing sources will not be synchronized together and, over time, the clocks driving the two parallel timed loops will drift relative to each other.
To ensure that our control & monitoring loops stay in sync, we created one timebase for the control loop and divided it down by 10 for the monitoring loop. Figure 7 shows how the DAQmx Sample Clock Timebase Divisor property performs this division. The sample clock division must take place after the sample clock is created and before the task begins.
We create a timing source for the control loop and use that same timing source for the monitoring loop. The dt input on the Timed Loop allows us to specify a period for the Timed Loop; we set this period to “1” for the control loop and “10” for the monitoring loop, allowing the monitoring loop to execute once for every 10 iterations of the control loop.
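A rough Python analogue of this scheme (hypothetical code, not the LabVIEW implementation) shows why deriving both rates from one timebase prevents drift: both loops count the same ticks, so the 10:1 ratio is exact by construction.

```python
# One shared tick counter drives both tasks; the monitoring task fires on
# every 10th tick, mirroring dt = 1 vs. dt = 10 on the Timed Loops.

def run_shared_timebase(ticks: int, monitor_divisor: int = 10):
    control_runs = 0
    monitor_runs = 0
    for tick in range(ticks):
        control_runs += 1                 # control loop: period 1
        if tick % monitor_divisor == 0:
            monitor_runs += 1             # monitoring loop: period 10
    return control_runs, monitor_runs

control, monitor = run_shared_timebase(1000)   # → (1000, 100)
```

With two independent clocks, the counts would slowly diverge from this exact ratio as the clocks drift relative to each other.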
To get the best performance for this benchmark application, we used a number of techniques, presented here, that should also help in other applications.
By default, the NI-DAQmx 8.3 driver acquires samples from multiple channels using a semi-round-robin method. This means that the device uses about 50% of the available time to convert and read the required samples and these acquisitions are evenly spread out over this span. The DAQmx Timing Property Node can return the maximum possible rate, which can be used to configure the analog input task so that it is performed at the fastest possible convert rate of the hardware device, as shown in Figure 8.
The AI Convert Maximum Rate property outputs the maximum convert rate of the analog input channel. This maximum convert rate is then written to the DAQmx Timing AI Convert Rate for the appropriate analog input task, which overwrites the round robin convert rate of the task. This speeds up the rate at which samples are converted and allows for higher execution rates of the program.
The “DAQmx Control Loop from Task” Timing Source
A feature of the DAQmx Control Loop from Task timing source is the ability to specify a period of time to wait (or “sleep”) before polling for data. This reduces CPU utilization by avoiding unnecessary polling for data that is not yet available. Analysis of traces made with the Execution Trace Toolkit showed that the DAQmx AI Read VI in the control loop was spending 80 microseconds polling before data was available. Thus, by configuring a sleep of 80 microseconds, we were able to dramatically reduce CPU utilization, as shown in Figure 9.
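The sleep-then-poll pattern can be sketched as follows in Python; `data_ready`, the poll interval, and the timeout are our own stand-ins, and only the 80-microsecond sleep comes from the trace analysis above.

```python
import time

SLEEP_BEFORE_POLL_S = 80e-6  # from the trace: data is never ready sooner

def read_when_ready(data_ready, poll_interval_s=1e-6, timeout_s=0.01):
    """Sleep first (yielding the CPU to lower-priority loops), then poll."""
    start = time.perf_counter()
    time.sleep(SLEEP_BEFORE_POLL_S)      # no wasted polling during this window
    while not data_ready():
        if time.perf_counter() - start > timeout_s:
            raise TimeoutError("data not ready within timeout")
        time.sleep(poll_interval_s)
    return time.perf_counter() - start

elapsed = read_when_ready(lambda: True)  # at least the 80 µs sleep elapses
```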
Note that while the sleep does free up resources for lower priority loops to run while the control loop waits for data, it does not allow the control loop to finish any earlier. As a result, the application eventually reaches a rate at which the control loop fails to finish on time, despite the CPU usage being well below 100%.
The communication & logging task in this application is responsible for logging all of the input, output, and processed data from the control loop and monitoring loops to local storage on the RT target. A characteristic of file I/O operations is that the rate at which data can be written to file depends upon the total size of the data being written. In our system, if the size of the data being written is a multiple of 512 bytes, the speed at which the data can be written dramatically increases. Figure 10 illustrates this behavior in LabVIEW 7.1 with two different PXI controllers showing the spikes in throughput corresponding to optimal file writing sizes.
To take advantage of this behavior, data was written to the file in 512-byte chunks.
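A minimal Python sketch of the chunked-write idea (the helper names are hypothetical; the real application logs from LabVIEW):

```python
CHUNK_BYTES = 512  # write size that maximizes throughput on this target

def split_into_chunks(data: bytes, chunk: int = CHUNK_BYTES):
    """Split a buffer into chunk-sized pieces; the last piece may be short."""
    return [data[i:i + chunk] for i in range(0, len(data), chunk)]

def append_log(path: str, data: bytes) -> None:
    """Append log data to a file one 512-byte block at a time."""
    with open(path, "ab") as f:
        for piece in split_into_chunks(data):
            f.write(piece)  # each write is at most one 512-byte block
```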
Similar to the file I/O optimization described above, we optimized the communication loop to reduce the time it takes to transfer data across the network. The rate at which data can be transferred over a network is related to the amount of data being sent. As shown in Figure 11, larger payloads increase the overall throughput of the application.
The communication loop in this application sends a large amount of data over the TCP network to the host. To take advantage of the network throughput behavior described above, we broke up the data being sent over the network into 512-byte portions to get a good balance between network throughput and memory use.
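The same portioning idea applied to the TCP link can be sketched as follows (Python, with a local socket pair standing in for the RT-target-to-host connection; all names are our own):

```python
import socket

PORTION_BYTES = 512

def send_in_portions(sock: socket.socket, payload: bytes,
                     portion: int = PORTION_BYTES) -> int:
    """Send a payload in fixed-size portions; returns the portion count."""
    count = 0
    for i in range(0, len(payload), portion):
        sock.sendall(payload[i:i + portion])
        count += 1
    return count

rt_side, host_side = socket.socketpair()  # stands in for RT target <-> host
portions = send_in_portions(rt_side, b"\x00" * 2048)  # 4 portions of 512 B
received = bytearray()
while len(received) < 2048:               # drain everything on the "host" side
    received += host_side.recv(4096)
rt_side.close()
host_side.close()
```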
4. Benchmarking Results
The performance of the application was tested by running the applications at successively higher rates until a loop was late or a FIFO overflowed. The hardware and software used are detailed below:
Controller: NI PXI-8196
I/O: (2) NI PXI-6071; NI PXI-6602; NI PXI-6713
Software: LabVIEW Real-Time 8.20
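The test procedure (stepping the application rate up until a loop runs late or a FIFO overflows) can be sketched as a simple search; `system_is_stable` below is a hypothetical stand-in for one full run of the application at a given rate.

```python
def max_stable_rate(system_is_stable, start_hz=100, step_hz=100,
                    limit_hz=100_000) -> int:
    """Raise the rate in fixed steps; return the last rate that passed."""
    best = 0
    rate = start_hz
    while rate <= limit_hz and system_is_stable(rate):
        best = rate          # this rate ran with no late loops or overflows
        rate += step_hz
    return best

# Example: a model system that becomes unstable above 2 kHz.
result = max_stable_rate(lambda hz: hz <= 2000)  # → 2000
```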
We benchmarked two versions of the application which differed only in the communication method used between the host and the real-time target, either TCP/IP or network-published shared variables.
Figure 12 illustrates the performance using both TCP/IP and network-published shared variables, plotting application rates versus total CPU utilization.
Table 1 shows some sample rates and the corresponding CPU usage.