This White Paper is part of a series of articles related to Real-Time High-Performance Computing (RT-HPC) and Scientific Computing applications with NI LabVIEW. In this document, we present a comprehensive review of the different approaches, technologies and methods that can be implemented in NI LabVIEW to take advantage of parallel computing architectures.
Some scientific and technical applications are very demanding in terms of computational intensity, size of data sets and number of I/O channels for online data streaming, analysis and visualization. These applications perform High-Performance Computing (HPC)-type computations under real-time constraints. As processing capabilities increase and parallel programming barriers decrease, we expect this trend to continue and most likely accelerate (see Figure 1.1) in terms of Tera- and Giga-Floating Point Operations Per Second (TFLOPS and GFLOPS) and Giga Multiply-Accumulate operations per Second (GMACS). Such applications include plasma control in nuclear fusion, adaptive optics in extremely large telescopes, RF MIMO systems, smart electrical grid control, high-resolution medical imaging, multi-agent distributed robotic systems and large real-time hardware-in-the-loop simulations for complex dynamic systems, among many others.
Fig. 1.1: Real-Time HPC Trend
To respond to these and other challenges, the following key elements are supported: a) Intel® multicore CPUs (mCPUs or nCores) running a real-time operating system (RTOS), b) direct implementation of algorithms and signal processing functions at the embedded level, on Xilinx® Field Programmable Gate Arrays (FPGAs), c) access to nVIDIA® Graphics Processing Units (GPUs) as hardware accelerators for numerical/computational tasks through the CUDA™ and CUBLAS™ toolkits and libraries, d) multi-channel, high-speed, mixed-signal (analog/digital) acquisition and generation with NI’s modular hardware and e) precise synchronization/timing between all measurements and computations with NI’s synchronization boards.
By combining different hardware components and sub-systems, available as Commercial-Off-The-Shelf (COTS) technologies, a hybrid, heterogeneous architecture is easily configured. This architecture combines mCPUs, GPUs and FPGAs as part of an integrated system. The hardware architecture is based on the PXI Express (PXIe) platform and is compatible with the PICMG’s CompactPCI Express standard. CompactPCI Express brings PCI Express technology to the popular PICMG 2.0 CompactPCI (cPCI) form factor, while maintaining compatibility with CompactPCI hardware and software. The suggested system is based on a combination of the PXIe 1U and 3U form factors.
Fig. 1.2: RT-HPC system with PXI and nVIDIA® GPU Computing system
The system includes a numerical computing/accelerator sub-system, and a measurement/control sub-system.
1.1 Numerical Computing/Accelerator Sub-system
As shown in Figure 1.3, two NI 8353 multicore computers (1U form factor) are used in conjunction with one nVIDIA® Tesla™ GPU computing system (which includes four T10 GPUs) for highly demanding computational tasks and as co-processors or accelerators for the PXIe controllers. Together, these two NI 8353 computers and the nVIDIA® Tesla™ GPU computing system make up the Numerical Computing/Accelerator Sub-system. Although GPUs are not directly programmed with NI LabVIEW, the architecture of this framework is designed to integrate GPU execution into LabVIEW's parallel execution system by establishing a compute context in which to execute the CUDA™ code.
Fig. 1.3: Numerical Computing/Accelerator Sub-system
At the hardware level, access to four nVIDIA® T10 processors via a Tesla™ S1070 computing system (see Figure 1.4) offers a theoretical maximum of 4 TFLOPS on single-precision data that, in practice, could yield 400 GFLOPS or better performance. The NI 8353 multicore computers are designed for high-performance measurement and control applications. Each of these computers includes four removable SATA II hard drives in a RAID 0 arrangement (high-capacity hard disks are supported).
Fig. 1.4: Four nVIDIA® T10 GPUs in the Numerical Computing/Accelerator Sub-system
1.2 Real-Time Measurement and Control Sub-system
For direct, real-time measurement and control, two or more PXIe chassis (8 or more slots each) are used; in this example, they are PXIe-1082 chassis. Each chassis has its own embedded controller with an mCPU and one or more FPGA-based data acquisition/control boards. The number of boards installed in each chassis can vary. In this example, two PXIe controllers take advantage of the FPGAs available on the data acquisition boards as co-processors and algorithm accelerators. A Real-Time Operating System (RTOS) integrated with LabVIEW RT is used on each multicore controller.
Fig. 1.5: Real-Time Measurement and Control Sub-system
More specifically, one PXIe chassis (chassis on the left, Figure 1.5) includes two FPGA-based, reconfigurable data acquisition boards (PXI-7852R), each one with 8 analog inputs/outputs with 16-bit resolution ADC/DACs and a simultaneous sampling rate of up to 750 kHz per channel (analog inputs). The other PXI Express chassis (chassis on the right, Figure 1.5) includes three FlexRIO high-speed digitizers based on DSP-optimized FPGAs (PXIe-7965R), each one with a FlexRIO Adapter Module (FAM) with 32 analog inputs and 12-bit ADCs that are simultaneously sampled at 50 MS/s (NI 5752). Using Gigabit Ethernet, the two PXIe embedded controllers communicate with the NI 8353 multicore computers, which in turn have access to the four GPUs in the nVIDIA® computing system to offload computations of the control algorithms and log all acquired data for further use.
Fig. 1.6: P2P using two FlexRIO boards on the PXI Express Bus
Another feature of this configuration is the new NI peer-to-peer (P2P) streaming technology that uses the PXIe bus to enable direct, point-to-point transfers between two or more FlexRIO boards and other instruments without sending data through the host processor or memory of the embedded controller. This enables devices, such as the FPGA-based FlexRIO and R Series DAQ boards in each PXIe system, to share information directly on the PXIe bus without burdening other system resources (see Figure 1.6 above).
Finally, each PXIe chassis includes a timing board (PXIe-6672) for synchronization purposes. A timing board on one of the two PXIe chassis (as shown in Figure 1.5) is configured as the “master” and provides a synchronization signal through a high-resolution DDS clock generation (DC to 105 MHz) and an onboard high-stability reference TCXO (1 ppm); the other timing board is configured as a “slave”.
2. RT-HPC Programming
A graphical system design approach is used, taking advantage of the NI LabVIEW programming environment. All the software for an NI LabVIEW-based RT-HPC system is written in G (LabVIEW’s compiled programming language), including the embedded, parallel code to be executed on the mCPUs, FPGAs and GPUs (see Figure 2.1); NI LabVIEW is also used for all the GPU-based algorithms as well as all the communication and synchronization functions. The LabVIEW Professional Development System (PDS), the LabVIEW Real-Time module and LabVIEW FPGA module plus various toolkits and toolsets (Control Design and Simulation, Advanced Signal Processing, etc.) are used to design, prototype and deploy the data acquisition, control, analysis and visualization system that makes up the RT-HPC system.
Figure 2.1: LabVIEW as a common programming tool for all “targets”
Based on the dataflow paradigm, the NI LabVIEW programming language is intrinsically parallel and can easily take advantage of multicore processors such as the Intel® Core Duo/Quad and higher-core-density processors using such programming techniques as Task Parallelism, Data Parallelism, Pipelining, Structured Grid (Systolic Arrays) and more. NI LabVIEW is also used to program FPGAs; GPUs are supported through a “wrapper” that allows access to the nVIDIA® CUDA™ and CUBLAS™ toolkit and libraries. Additionally, C code can be generated by NI LabVIEW and compiled with a different tool chain to target a specific architecture or custom hardware.
3. Taking Advantage of Multicore Processors with NI LabVIEW
Thanks to its dataflow architecture and the support of different models of computation (e.g. State-Charts), the NI LabVIEW graphical programming paradigm makes parallel programming easy, even for novice users. In addition to the three common methods to achieve parallelism (Figure 2.1), other methods can also be implemented in NI LabVIEW.
“Task parallelism” is easily implemented in NI LabVIEW, where independent parallel tasks can run concurrently on all available cores in the mCPU, resulting in significant speedups. “Data parallelism” and “Pipelining” can also be implemented on multicore processors using NI LabVIEW. To show how easy it is to implement these methods in NI LabVIEW, an example of a parallelized matrix-vector multiplication operation is shown in Figure 3.1.
Figure 3.1: Matrix vector multiplication in NI LabVIEW (one core vs. eight cores)
As shown, a matrix-vector multiplication can go from one to eight cores very easily in NI LabVIEW. In a real-world application, data could be captured from sensors to provide the vector on a per-loop basis. The result of the matrix-vector multiplication operation can then be used to control the corresponding actuators.
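The data-parallel decomposition behind Figure 3.1 can also be sketched in text form. The Python fragment below is only an illustration under our own assumptions (it is not the LabVIEW diagram): the function names are hypothetical, and the matrix is simply cut into contiguous row blocks, one per worker. In CPython, threads show the partitioning but do not speed up pure-Python arithmetic; a process pool or a native library would be needed for a real gain.

```python
from concurrent.futures import ThreadPoolExecutor

def matvec_rows(rows, x):
    """Multiply a block of matrix rows by the vector x."""
    return [sum(a * b for a, b in zip(row, x)) for row in rows]

def parallel_matvec(A, x, workers=8):
    """Split A into contiguous row blocks and process one block per worker."""
    n = len(A)
    workers = max(1, min(workers, n))
    step = -(-n // workers)  # ceiling division: rows per block
    blocks = [A[i:i + step] for i in range(0, n, step)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(matvec_rows, blocks, [x] * len(blocks))
        # pool.map preserves block order, so the result rows stay in order.
        return [y for part in parts for y in part]

# Illustrative data: an 8 x 4 matrix and a 4-element "sensor" vector.
A = [[float(i + j) for j in range(4)] for i in range(8)]
x = [1.0, 2.0, 3.0, 4.0]
y = parallel_matvec(A, x, workers=8)
print(y)
```

Because each block of rows depends only on the shared input vector, the blocks need no communication with each other, which is why this operation scales so directly from one core to eight.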
The same applies to matrix-matrix multiplication; the performance gain when using multicore CPUs is shown in Figure 3.2. As the number of threads (and cores) increases, so does the performance gain, and the gain holds as the size of the matrix increases. These gains are achieved more deterministically when using an RTOS and LabVIEW RT; still, as shown in the figure, NI LabVIEW for MS Windows offers significant performance gains, although with less determinism (higher “jitter”).
Figure 3.2: Multicore Matrix Multiplication Performance Gains
To take advantage of these parallel architectures, different programming techniques are required (Figure 2.1). In addition to the obvious ones (Task Parallelism, Data Parallelism and Pipelining), NI LabVIEW supports other models of computation that can take advantage of parallel architectures. One of those models is the Systolic Array, which is analyzed in the next section.
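Of the techniques named above, pipelining is perhaps the least obvious to picture without a diagram. The following Python sketch is an assumption-laden illustration, not LabVIEW code: each stage runs in its own thread and hands results to the next stage through a bounded queue, so that while one sample is being clipped the next one is already being scaled. The stage functions and the queue depth are hypothetical choices.

```python
import queue
import threading

STOP = object()  # sentinel that tells a stage to shut down

def stage(func, inbox, outbox):
    """Apply func to every item from inbox and forward the result."""
    while True:
        item = inbox.get()
        if item is STOP:
            outbox.put(STOP)  # propagate shutdown to the next stage
            return
        outbox.put(func(item))

def run_pipeline(samples, funcs):
    """Run samples through funcs, one thread per stage."""
    queues = [queue.Queue(maxsize=4) for _ in range(len(funcs) + 1)]
    threads = [threading.Thread(target=stage, args=(f, queues[i], queues[i + 1]))
               for i, f in enumerate(funcs)]

    def feed():
        for s in samples:
            queues[0].put(s)
        queues[0].put(STOP)

    threads.append(threading.Thread(target=feed))
    for t in threads:
        t.start()
    results = []
    while True:  # drain the last queue in the calling thread
        item = queues[-1].get()
        if item is STOP:
            break
        results.append(item)
    for t in threads:
        t.join()
    return results

def scale(v):
    return v * 0.5

def offset(v):
    return v + 1.0

def clip(v):
    return min(v, 3.0)

out = run_pipeline([0.0, 2.0, 4.0, 6.0, 8.0, 10.0], [scale, offset, clip])
print(out)
```

Each stage is a single consumer of a first-in, first-out queue, so sample order is preserved from input to output.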
3.1 Systolic Arrays (Structured Grids) on Multicore processors:
Many computations that involve physical models can take advantage of Systolic Arrays (SA), also known as structured grid patterns (Figure 3.3). SAs are commonly used in such applications as genomics, bioinformatics, sequence alignment, computational fluid dynamics and dynamic programming.
In SAs, you calculate a 2D or nD grid or array on every iteration; each updated grid value is a function of its neighbors, as each processing element (PE) shares information with them. Information is shared after each PE performs the needed operations on the data, flowing synchronously across the array between neighbors. Dataflow computing, multicore processors and FPGAs are natural platforms for SAs.
Just as in SAs where cells are connected by “wires”, in NI LabVIEW graphical programming paradigm blocks or nodes or “icons” are connected with “graphical wires”.
With a parallel version of an SA, you split the grid into sub-grids and compute each sub-grid independently. Communication between workers is limited to a border whose width equals that of the neighborhood. Parallel efficiency is therefore a function of the area-to-perimeter ratio of each sub-grid.
Figure 3.3: A basic Systolic Array
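The area-to-perimeter argument can be made concrete with a small calculation. The sketch below assumes a square grid cut into square sub-grids, a neighborhood one cell wide, and an interior sub-grid with neighbors on all four sides (corner cells are ignored); the function name and the 1024 x 1024 grid size are illustrative choices, not values from the systems described here.

```python
def subgrid_cost(n, tiles_per_side, halo=1):
    """Return (cells computed, cells exchanged, ratio) for one square sub-grid."""
    side = n // tiles_per_side
    computed = side * side          # "area": work done locally
    exchanged = 4 * side * halo     # "perimeter": four edges, corners ignored
    return computed, exchanged, computed / exchanged

for tiles in (2, 4, 8, 16):
    computed, exchanged, ratio = subgrid_cost(1024, tiles)
    print(f"{tiles * tiles:4d} workers: {computed:7d} cells computed, "
          f"{exchanged:5d} exchanged, ratio {ratio:g}")
```

Under these assumptions the ratio halves every time the number of sub-grids per side doubles, so very fine partitions spend proportionally more of their time communicating.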
SAs are also an option for performing basic matrix-matrix multiplication operations. As shown in Figure 3.4, a systolic algorithm can be easily applied for a 3x3 matrix multiplication operation.
Figure 3.4: A Matrix-Matrix multiplication operation with a Systolic Array
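To make the data movement in a systolic matrix multiplication explicit, the Python sketch below simulates one common arrangement, which may differ in detail from the one drawn in Figure 3.4: each PE holds one accumulator of the result, rows of A enter from the left, columns of B enter from the top, and each input stream is delayed by one step per row or column. The function and variable names are our own.

```python
def systolic_matmul(A, B):
    """Simulate an n x n output-stationary systolic array computing A x B."""
    n = len(A)
    acc = [[0] * n for _ in range(n)]     # accumulator held in each PE
    a_reg = [[0] * n for _ in range(n)]   # values travelling left to right
    b_reg = [[0] * n for _ in range(n)]   # values travelling top to bottom
    for t in range(3 * n - 2):            # last product is formed at t = 3n - 3
        for i in range(n):
            # Shift row i one PE to the right, starting from the far end.
            for j in range(n - 1, 0, -1):
                a_reg[i][j] = a_reg[i][j - 1]
            k = t - i                     # row i is skewed by i steps
            a_reg[i][0] = A[i][k] if 0 <= k < n else 0
        for j in range(n):
            # Shift column j one PE down, starting from the far end.
            for i in range(n - 1, 0, -1):
                b_reg[i][j] = b_reg[i - 1][j]
            k = t - j                     # column j is skewed by j steps
            b_reg[0][j] = B[k][j] if 0 <= k < n else 0
        # Every PE multiplies the pair it currently holds and accumulates.
        for i in range(n):
            for j in range(n):
                acc[i][j] += a_reg[i][j] * b_reg[i][j]
    return acc

A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
B = [[9, 8, 7], [6, 5, 4], [3, 2, 1]]
C = systolic_matmul(A, B)
print(C)
```

For the 3x3 case the array needs seven steps, and at every step each PE talks only to its immediate left and upper neighbors, which is what makes the scheme attractive for both multicore and FPGA targets.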
SAs are also natural architectures for developing solvers for Partial Differential Equations (PDEs). For example, the block diagram in Figure 3.5 (a 2-D Systolic Array) can solve the Heat Equation, where the boundary conditions constantly change. The 16 visible icons in Figure 3.5 represent tasks that can solve the equation for a certain grid size. In this example, the 16 tasks map to a 16-core Intel® multicore CPU. On every iteration, the 16 cores exchange boundary conditions and the process builds up a global solution. The Feedback Nodes, which appear as arrows above small dots, represent data exchange between elements. You can also map such a block diagram to a two-, four-, or eight-core CPU. As computers with more cores become available, you can use a similar strategy.
Figure 3.5: Solving the Heat Equation with NI LabVIEW using a Systolic Array (Structured Grid)
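The exchange of boundary values between sub-grids can be illustrated with a much smaller case than the 16-task diagram. The Python sketch below is a simplified stand-in under several assumptions of ours: an explicit finite-difference update, fixed (not time-varying) boundary values, and only two blocks instead of sixteen. Each block carries one extra "ghost" row copied from its neighbor, and the two block updates inside an iteration are independent, so each could run on its own core.

```python
ALPHA = 0.2  # diffusion number; this explicit scheme needs ALPHA <= 0.25

def step(block):
    """One explicit update of the interior of a block (edge cells are held)."""
    rows, cols = len(block), len(block[0])
    new = [row[:] for row in block]
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            new[i][j] = block[i][j] + ALPHA * (
                block[i - 1][j] + block[i + 1][j]
                + block[i][j - 1] + block[i][j + 1] - 4 * block[i][j])
    return new

def solve_whole(grid, iterations):
    """Reference solution on the undivided grid."""
    for _ in range(iterations):
        grid = step(grid)
    return grid

def solve_split(grid, iterations):
    """Same solve with the grid cut into a top and a bottom block."""
    mid = len(grid) // 2
    top = [row[:] for row in grid[:mid + 1]]     # owned rows + ghost row below
    bottom = [row[:] for row in grid[mid - 1:]]  # ghost row above + owned rows
    for _ in range(iterations):
        top, bottom = step(top), step(bottom)    # independent work per block
        top[-1] = bottom[1][:]                   # exchange boundary rows
        bottom[0] = top[-2][:]
    return top[:-1] + bottom[1:]                 # drop ghosts, reassemble

# 8 x 8 plate, top edge held at 100, everything else starts at 0.
grid = [[100.0] * 8] + [[0.0] * 8 for _ in range(7)]
whole = solve_whole(grid, 50)
split = solve_split(grid, 50)
print(split == whole)
```

The split solve reproduces the undivided solve because every cell sees exactly the same neighbor values in both cases; only one row per block boundary has to be communicated on each iteration.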
Systolic Arrays can also be implemented on FPGAs. Processing elements or cells can be created on the FPGA, together with the corresponding shift registers, adders and other required elements. For example, NI LabVIEW FPGA can be used to implement a two-dimensional SA that performs a matrix-vector or matrix-matrix multiplication operation in a few clock cycles. Clock periods can be as short as 10 ns (100 MHz clock).
Performance is usually measured in terms of Giga Cell Updates Per Second (GCUPS), representing the ratio of the number of processors or cells in the array to the total time consumed. On FPGAs, the performance can also be measured by the number of cells in the array and the clock frequency used. In short, the performance of SAs in FPGAs mainly depends on the number and efficiency of the cells, including the clock cycles used and the area required (number of slices). For example, a popular algorithm in bioinformatics is BLAST (Basic Local Alignment Search Tool), used for comparing primary biological sequence information, searching both protein and DNA sequence databases for sequence similarities; the BLAST algorithm is a strong candidate for an NI LabVIEW implementation of a Systolic Array on an FPGA.
4. Taking Advantage of GPUs in NI LabVIEW
Some applications are tailor-made for deployment to GPUs, such as those related to matrix-vector operations. LabVIEW GPU Computing unleashes the computing power of nVIDIA® GPUs via the CUDA™ toolkit interface from within a LabVIEW application. Code that calls the GPU for computation is integrated into the native parallel execution system of LabVIEW as if it were any other multi-threaded external library function call.
Supporting GPU computing from NI LabVIEW requires an interface component and a compute framework. The interface is made up of two LabVIEW libraries – lvcuda.lvlib and lvcublas.lvlib – that contain Controls and VIs that map to CUDA™ types and CUDA™ runtime or CUBLAS™ functions. The interface exposes key nVIDIA® CUDA™ resource and CUBLAS™ library management functions. It cannot include the real-world solutions developed by users, designed to run on GPUs, and invoked from C/C++ code. For the user-defined class of CUDA™ implementations, a special compute framework is available through NILabs.
The NI LabVIEW GPU Computing includes:
- A collection of LabVIEW data types and VIs (see Figure 4.1), called LVCUDA and LVCUBLAS, for interfacing with the CUDA runtime and CUBLAS Library functions.
- A framework, called NI LabVIEW GPU Computing, that establishes a compute context in which user-defined GPU functions execute.
Figure 4.1: LabVIEW VIs (LVCUDA) for the nVIDIA® CUDA™ and CUBLAS™ libraries
These data types, VIs and framework allow LabVIEW users to:
- Target multiple GPU devices from the LabVIEW diagram.
- Manage resources across all GPU devices.
- Facilitate numeric array data transfers to and from the GPU.
- Execute GPU processes in parallel with CPU execution.
- Provide shared resource protection when performing GPU processes from different user libraries.
- Block invalid references when used on the wrong GPU device.
- Establish clean-up callbacks that ensure resources are freed even when the application is aborted.
A basic matrix-matrix multiplication operation (AxB) using the NI LabVIEW GPU Computing library for nVIDIA® GPUs is shown in Figure 4.2.
Figure 4.2: Matrix-matrix multiplication in LabVIEW using the nVIDIA® CUBLAS™ library
In the above example, a square matrix (m=n) of size 300 is used for the multiplication operation. In order to perform this operation, the following four steps are taken:
a) Step 1: Set the GPU Device Context and generate the two matrices (A and B), as shown in Figure 4.3
Figure 4.3: Step 1 for the AxB matrix multiplication operation with NI LabVIEW and an nVIDIA® GPU
b) Step 2: Memory allocation for matrices A, B and C (Figure 4.4)
Figure 4.3: Step 2, Memory allocation for “C” on the CPU and for A, B and C on the GPU.
c) Step 3: AxB multiplication operation to obtain C (Figure 4.4)
Figure 4.4: Step 3, AxB multiplication operation to obtain “C”.
d) Step 4: Free allocated resources (Figure 4.5)
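The four steps above follow a resource lifecycle that is worth seeing in one place. The Python sketch below does not call a GPU and does not reproduce the LVCUDA or LVCUBLAS interfaces: `ToyDevice` and all of its methods are hypothetical stand-ins that keep buffers in host memory, used here only to show the order of operations and why the clean-up step should run even when an earlier step fails.

```python
class ToyDevice:
    """Hypothetical stand-in for a GPU context; buffers live in host memory."""

    def __init__(self, index):
        self.index = index
        self.buffers = {}
        self._next = 0

    def alloc(self, rows, cols):
        handle = self._next
        self._next += 1
        self.buffers[handle] = [[0.0] * cols for _ in range(rows)]
        return handle

    def upload(self, handle, matrix):
        self.buffers[handle] = [row[:] for row in matrix]

    def gemm(self, a, b, c):
        """C = A x B on the 'device' (plain Python here)."""
        A, B = self.buffers[a], self.buffers[b]
        cols_b = list(zip(*B))
        self.buffers[c] = [[sum(x * y for x, y in zip(row, col))
                            for col in cols_b] for row in A]

    def download(self, handle):
        return [row[:] for row in self.buffers[handle]]

    def free(self, handle):
        del self.buffers[handle]

def multiply_on_device(A, B, dev):
    shapes = [(len(A), len(A[0])), (len(B), len(B[0])), (len(A), len(B[0]))]
    handles = []
    try:
        for rows, cols in shapes:          # step 2: allocate A, B and C
            handles.append(dev.alloc(rows, cols))
        a, b, c = handles
        dev.upload(a, A)
        dev.upload(b, B)
        dev.gemm(a, b, c)                  # step 3: C = A x B
        return dev.download(c)
    finally:
        for h in handles:                  # step 4: always free resources
            dev.free(h)

dev = ToyDevice(0)                         # step 1: select the device context
A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
C = multiply_on_device(A, B, dev)
print(C, dev.buffers)
```

Placing the release of buffers in a `finally` clause mirrors the intent of the clean-up callbacks listed earlier: resources are returned whether the computation completes or is interrupted.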
The performance gain when using GPUs can also be significant. As shown in Figure 4.6, GPUs are another processor architecture that has peak performance beyond that of CPUs and multicore processors. Instead of throughput, data size dictates the tipping point between CPU and GPU usage.
Figure 4.6: GPU-based parallel Architectures Drive Performance (Source: nVIDIA®)
5. Taking Advantage of FPGAs with NI LabVIEW
The NI LabVIEW FPGA Module uses LabVIEW embedded technology to extend LabVIEW graphical development and target field-programmable gate arrays (FPGAs) on NI reconfigurable I/O (RIO) hardware. LabVIEW is distinctly suited for FPGA programming because it clearly represents parallelism and dataflow. With the LabVIEW FPGA Module, you can create custom measurement and control hardware without low-level hardware description languages or board-level design. You can use this custom hardware for unique timing and triggering routines, ultrahigh-speed control, interfacing to digital protocols, digital signal processing (DSP), RF and communications, and many other applications requiring high-speed hardware reliability and tight determinism.
Despite great strides in the raw processing performance of CPUs, they cannot match the raw computing ability of FPGAs. In fact, some applications are tailor-made for deployment to FPGAs. Like GPUs, FPGAs can be an excellent option for accelerating algorithms and computations in real-time simulation and control systems (see Figure 5.1).
Figure 5.1: FPGA-based Parallel Architectures Drive Performance
When using FPGAs, the inherent parallelism of LabVIEW is realized in hardware as true simultaneous execution. The compiled code is implemented in hardware by configuring logic cells in the FPGA. Your embedded VI does not need access to a processor to execute. Independent sections of code, such as parallel “while loops”, are implemented in independent sections of the FPGA.
Figure 5.2: LabVIEW FPGA-based code
After the FPGA is configured, data is clocked through the device at a rate specified by the onboard clock, executing independent areas of the chip simultaneously. In Figure 5.2, the parallelism of LabVIEW FPGA and reconfigurable I/O enables the analog and digital loops to execute simultaneously without competing for execution time from the host processor. The timing functions take advantage of the onboard clock, achieving a timing resolution of 25 nanoseconds or better (clock rates of 40 MHz, 80 MHz, 100 MHz, 200 MHz and more are available).
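The timing resolution quoted above follows directly from the clock rate, since one clock tick lasts the reciprocal of the frequency. A minimal Python check, with a function name of our own choosing:

```python
def clock_period_ns(freq_mhz):
    """Duration of one clock tick in nanoseconds for a clock rate in MHz."""
    return 1000.0 / freq_mhz

for freq in (40, 80, 100, 200):
    print(f"{freq:3d} MHz -> {clock_period_ns(freq):g} ns per tick")
```

The 40 MHz clock gives the 25 ns figure, and the faster clocks give proportionally finer resolution.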
As an additional example, the following streaming application represents a class of problems that simply cannot be efficiently addressed by more general purpose processors like CPUs.
The problem consists of applying a trivial pattern-matching algorithm to megapixel images at exceptionally high frame rates (20,000 frames per second). This represents a throughput of 20 GSamples/s, which is well out of the reach of CPUs (including multicore CPUs) but well within the bounds of FPGAs (see Figure 5.3).
Figure 5.3: LabVIEW FPGA-based streaming sensor data processing
In order to solve this problem, a four (4) quadrant partitioning is used to address the input saturation by the high frame rate. The goal is to scale an incoming 1k x 1k sensor grid by a 1k x 1k coefficient matrix, and integrate it to one measurement. “Image” sensor data is streaming at 50 us per frame, or 20,000 frames per second. The computation challenge is 1k x 1k / 50us = 20 Giga Operations per Second (20 x 10^9 operations per second or 20 GOPS).
Using a single NI FlexRIO board with LabVIEW FPGA (see Figure 5.4), it is possible to process 64 “pixel” streams to compute a 512x512 image size at a rate of 24,000 frames per second. The benchmarked computational power is 512x512x24k/s = 6.3 GOPS. Even at these very high rates, the resource (FPGA) utilization is only at 25% of the logic and only 10% (64/640) of the DSP blocks, at only 100MHz (10 ns), on the NI FlexRIO (PXIe-7965R). This application requires at least 3 FlexRIO modules to handle the incoming data bandwidth if each sensor “pixel” is 8 bits. For 16-bit sensor data, we would require twice the number of FlexRIO modules.
Figure 5.4: Mask and integrate a 64 PE Array in LabVIEW FPGA
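The throughput figures in this example can be reproduced with a few lines of arithmetic. The Python sketch below takes "1k" as 1000 and counts one operation per pixel per frame, as the text does; the per-stream rate additionally assumes that the work is spread evenly over the 64 pixel streams. Function and variable names are ours.

```python
def ops_per_second(width, height, frames_per_second):
    """One operation per pixel per frame."""
    return width * height * frames_per_second

# Target problem: one 1k x 1k frame every 50 microseconds (1k taken as 1000).
target_fps = 1_000_000 // 50
target_ops = ops_per_second(1000, 1000, target_fps)

# Benchmarked single-board result: 512 x 512 at 24,000 frames per second.
bench_ops = ops_per_second(512, 512, 24_000)

# Spread evenly over 64 pixel streams (assumption).
per_stream_rate = bench_ops // 64

print(f"target:     {target_ops / 1e9:.1f} GOPS at {target_fps} frames/s")
print(f"benchmark:  {bench_ops / 1e9:.1f} GOPS")
print(f"per stream: {per_stream_rate / 1e6:.1f} MS/s")
```

Under these assumptions each stream handles about 98.3 million samples per second, which is consistent with processing roughly one sample per stream on each tick of the 100 MHz clock mentioned above.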
The availability of COTS technologies for real-time high-performance computing and real-time simulation and control allows scientists and engineers to easily respond to new challenges and demanding applications with high computational requirements, large data sets and large numbers of I/O channels. Powerful multicore processors, FPGAs and GPUs provide increased processing capabilities, overcoming the parallel programming barrier. Dataflow-based graphical programming for heterogeneous architectures is part of the response to these new challenges. We expect this trend to accelerate in terms of GFLOPS, GOPS, GCUPS and GMACS in a clear path to reach Exascale computing levels in the next few years.
NI LabVIEW and the NI hardware platform allow scientists and engineers to develop scalable systems that do not require highly specialized application programmers to explicitly manage system complexity: how to achieve true parallelism, how to handle memory issues and how to program heterogeneous architectures from a high level of abstraction. The result is higher performance, shorter time-to-prototype and time-to-deployment, and dynamic adaptability to new requirements (reconfigurability) at the cabinet, desktop and embedded levels. NI LabVIEW provides a wide variety of tools and models of computation for rapid conversion (design-to-deployment) of parallelized algorithms to COTS hardware, using readily available IP cores or custom-designed ones, as well as an extensive library of math and signal processing functions that can be easily implemented on multicore CPUs and FPGAs, or accelerated by GPUs through the nVIDIA® CUDA™ and CUBLAS™ toolkit/libraries.
There does not seem to be one solution for all applications, so the heterogeneous, hybrid architecture approach seems to be the best option for today’s very demanding scientific and engineering applications. The scalability of these systems also seems to be a key success factor, as a typical application could include very small form factor embedded elements (e.g. smart sensors) as well as large, multichannel real-time systems such as the ones described in this document, as part of a large, distributed, complex cyber-physical system that involves different data types and structures, math operations, signal processing and analysis, and visualization, all within one integrated system. RT-HPC responds to some of those needs and can be easily adapted to different applications and scalability needs.
Contact: Igor Alvarado – Academic Business Development Manager