A common saying holds that “real-time doesn't necessarily mean real fast” -- but that may be changing. Engineers and scientists are now able to address a new domain of problem-solving based on “Real-Time Numerical Analysis” using a high-performance computing (HPC) approach with off-the-shelf hardware.
This new area of innovation is driven by the advent of multicore processors and more sophisticated real-time OS technologies that use symmetric multiprocessing (SMP) to load-balance real-time software across multiple CPU cores. This paper examines how a parallel programming approach enables domain experts (such as control engineers, physicists, geologists, and biomedical researchers) to create HPC applications that meet real-time constraints.
2. How Real-Time HPC fits in with a Traditional HPC approach
In all forms of HPC, whether the compute engine is a supercomputer or a distributed computer, the goal is to accelerate the calculation of the problem at hand. Because the tasks that require acceleration are so computationally intensive, the typical HPC problem traditionally could not be solved with a normal desktop computer, let alone an embedded system. However, disruptive technologies such as multicore processors, GPUs, and FPGAs now enable more and more HPC applications to be solved with off-the-shelf hardware.
Where the concept of “real-time HPC” enters the picture is latency, and more specifically the time it takes to close a loop. Many HPC applications perform off-line simulations thousands of times and then report the results -- this is not a real-time operation, because there is no timing constraint specifying how quickly the results must be returned. The results just need to be calculated as fast as possible.
Real-time applications also have algorithms that need to be accelerated, but they often involve the control of real-world physical systems, so the traditional HPC approach is not applicable. In a real-time scenario the result of an operation must be returned in a predictable amount of time. The challenge is that, until recently, it has been very hard to solve an HPC problem while simultaneously closing a loop in under 1 ms.
Furthermore, a more embedded approach may need to be implemented, where physical size and power constraints place limitations on the design of the system.
3. Addressing Latency Concerns
Many HPC applications are developed using a message-passing interface (such as an MPI implementation like MPICH or LAM/MPI) to divide tasks across the different nodes in the system. A typical distributed computer has one head node that acts as a master and distributes processing to the slave nodes in the system:
Figure 1: Example configuration in an HPC system. Adapted from Cleary and Hobbs “A Comparison of LAM-MPI and MPICH Messaging Calls with Cluster Computing.”
By default, such a configuration is not real-time friendly because of latencies associated with networking technologies such as Ethernet. In addition, the synchronization implied by the message-passing protocol is not necessarily predictable with granular timing in the millisecond range. Such a configuration could potentially be made real-time by replacing the communication layer with real-time hardware and software (such as reflective memory), and by adding manual synchronization to prioritize tasks and ensure their completion within a bounded timeframe. Generally speaking, though, the standard HPC approach was not designed for real-time systems: it solves other HPC applications extremely well but falls short in real-time applications. According to Wikipedia, “distributed programs often must deal with heterogeneous environments, network links of varying latencies, and unpredictable failures in the network or the computers.” These present serious challenges when real-time control is needed.
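To make the master/worker pattern concrete, here is a minimal sketch in Python. It is not MPI code: the `multiprocessing` module stands in for a real message-passing layer, and the function and variable names (`head_node`, `worker_task`) are hypothetical, chosen only to mirror the head-node/slave-node structure described above.

```python
# Hypothetical sketch: a head node dividing work among worker processes,
# analogous to the scatter/gather pattern used with MPI.
from multiprocessing import Pool

def worker_task(chunk):
    # Each "node" processes its slice of the data independently.
    return sum(x * x for x in chunk)

def head_node(data, n_workers=4):
    # The head node splits the problem and distributes the pieces.
    chunks = [data[i::n_workers] for i in range(n_workers)]
    with Pool(n_workers) as pool:
        partial = pool.map(worker_task, chunks)   # scatter + compute
    return sum(partial)                           # gather + reduce

if __name__ == "__main__":
    print(head_node(list(range(1000))))  # sum of squares 0..999
```

Note that `pool.map` blocks until every worker has answered -- exactly the kind of implicit synchronization whose timing, over a real network, is hard to bound.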
Now consider a multicore architecture, where today you can find up to 16 processing cores on a single chip. In the near future we will see 32 cores, and Intel has unveiled an 80-core prototype as part of its tera-scale research program -- the first programmable chip to deliver more than one trillion floating-point operations per second (1 teraflops) while consuming very little power. In other words, off-the-shelf multicore processors have the potential to become miniaturized supercomputers on a single piece of silicon.
From a latency perspective, instead of communicating over Ethernet, a multicore architecture found in off-the-shelf hardware uses inter-core communication whose speed is determined by system bus speeds, so return-trip times are much more tightly bounded. Consider a simplified diagram of a quad-core system:
Figure 2: Example configuration in an HPC system. Adapted from Tian and Shih, “Software Techniques for Shared-Cache Multi-Core Systems, Intel Software Network. ”
In addition, multicore processors can utilize symmetric multiprocessing (SMP) operating systems -- a technology found for years in general-purpose OSs such as Windows, Linux, and Mac OS -- to automatically load-balance tasks across available CPU resources. Now real-time operating systems are offering SMP support as well. This means a developer can specify timing and prioritize tasks across many cores at once while the OS handles the thread interactions. This is a tremendous simplification compared with message passing and manual synchronization, and it can all be done in real time.
In addition, a correlation can be drawn between the typical number of nodes in an HPC system and how many cores CPUs will offer in the near future. Douglas Eadline, Senior HPC Editor for Linux Magazine, noted that “A high number of clusters have 64 nodes or less, almost no clusters have between 64 and 256 nodes, then above 256 the number increases.” While embedded system designers will not be using 64 or 256 cores anytime soon, the scalability that multicore offers for real-time acceleration will soon be comparable to the 16- or 32-node systems found at the lower end of HPC use cases. Note that the prevalence of systems with 64 nodes or fewer stems in large part from the fact that creating highly parallel code is not trivial.
What are some applications that need real-time acceleration? Two examples of accelerating real-time applications with multicore processors are an autonomous vehicle application and nuclear fusion research.
In the autonomous vehicle application, TORC Technologies and Virginia Tech used LabVIEW to implement parallel processing while developing vision intelligence for their autonomous vehicle in the 2007 DARPA Urban Challenge. LabVIEW runs on the two quad-core servers that perform the primary perception in the vehicle. TORC did not require hard real-time and was able to implement a soft real-time solution with a general-purpose operating system.
At the Max Planck Institute for Plasma Physics in Garching, Germany, researchers implemented a tokamak control system to more effectively confine plasma. For the primary processing, they developed a LabVIEW application that split up matrix multiplication operations using a data parallelism technique on an octal-core system. A hard real-time OS with symmetric multiprocessing (SMP) support was installed on an off-the-shelf system based on an Intel multicore architecture. Dr. Louis Giannone, the lead researcher on the project, was able to speed up the matrix multiplication operations by a factor of five while meeting the 1 ms real-time control loop rate.
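The data-parallelism idea -- splitting one matrix multiplication across cores -- can be sketched as follows. This is only an illustration of the technique, not the Max Planck Institute implementation; the function names are invented, and threads are used here for brevity (in CPython the GIL limits the speedup of pure-Python arithmetic, so a real implementation would use processes or a native math library).

```python
# Minimal sketch of data-parallel matrix multiplication: the rows of the
# output C = A x B are split into bands, one band per worker.
from concurrent.futures import ThreadPoolExecutor

def matmul_band(A, B, rows):
    # Compute the given band of output rows of C = A x B.
    n, k = len(B[0]), len(B)
    return [[sum(A[r][j] * B[j][c] for j in range(k)) for c in range(n)]
            for r in rows]

def parallel_matmul(A, B, n_workers=4):
    # Split the row indices into contiguous bands and farm them out.
    rows = list(range(len(A)))
    step = -(-len(rows) // n_workers)  # ceiling division
    bands = [rows[i:i + step] for i in range(0, len(rows), step)]
    with ThreadPoolExecutor(n_workers) as pool:
        results = pool.map(lambda band: matmul_band(A, B, band), bands)
    return [row for band in results for row in band]
```

Because each band touches a disjoint set of output rows, the workers need no locking, which is what makes this decomposition attractive under a hard deadline.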
Other areas that show potential include structural and geological research, in particular the simulation of earthquakes to improve the design of bridges and buildings. We will begin to see hardware-in-the-loop (HIL) techniques applied in new domains that in the past required off-line simulation and were unable to utilize an HIL approach.
The key consideration to implementing a Real-Time HPC approach is the software design and architecture in the system.
4. Example of Parallel Programming: Pipelining
One widely accepted technique for improving the performance of serial software tasks is pipelining. Simply put, pipelining is the process of dividing a serial task into concrete stages that can be executed in assembly-line fashion. Let's consider a use case in LabVIEW.
Essentially, you can use LabVIEW to make an “assembly line” out of any given program. The graphical source code in Figure 3 below shows how a sample pipelined application might run on several CPU cores. The code implements a pipelined approach with the gray “while loop” structure surrounding functions that must execute one after another. A construct on the border of the loop, called a “shift register,” stores the previous iteration's value and feeds it into the next block in the algorithm. This ensures the sequential order of execution is met while allowing the pipelined stages to execute in parallel.
Figure 3. Pipelining in NI LabVIEW.
In this implementation of basic pipelining, LabVIEW automatically threads the application, and in most cases the OS scheduler will run the stages on separate cores of a multicore system.
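For readers outside LabVIEW, the same assembly-line pattern can be sketched in text form. This is a minimal Python analogue, not LabVIEW's implementation: each stage runs in its own thread, and FIFO queues play a role loosely analogous to the shift registers described above, handing each iteration's result to the next stage while all stages run concurrently.

```python
# Minimal pipeline sketch: each stage is a thread; queues connect stages.
import queue
import threading

def stage(fn, inbox, outbox):
    # Run one pipeline stage: consume, transform, forward, until poisoned.
    while True:
        item = inbox.get()
        if item is None:          # sentinel: shut this stage down
            outbox.put(None)
            return
        outbox.put(fn(item))

def run_pipeline(items, fns):
    # Wire the stages together with FIFO queues and feed the items through.
    queues = [queue.Queue() for _ in range(len(fns) + 1)]
    threads = [threading.Thread(target=stage, args=(fn, queues[i], queues[i + 1]))
               for i, fn in enumerate(fns)]
    for t in threads:
        t.start()
    for item in items:
        queues[0].put(item)
    queues[0].put(None)           # poison the head of the pipeline
    results = []
    while (out := queues[-1].get()) is not None:
        results.append(out)
    for t in threads:
        t.join()
    return results
```

As in the LabVIEW version, once the pipeline is full, every stage is busy on a different iteration's data at the same time, while the FIFO ordering preserves the sequential semantics of the original serial task.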
A more advanced implementation of pipelining in LabVIEW utilizes a special “Timed Structure,” which can set processor affinity for a section of code, along with low-latency real-time FIFO structures that pass data between cores via queues. This implementation is best used when optimizing for cache performance.
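The processor-affinity half of that idea can be hedged into a short sketch. This is not the Timed Structure mechanism itself, only an OS-level analogue: `os.sched_setaffinity` is a real but Linux-specific call, so the helper (a hypothetical name) degrades gracefully on platforms without it.

```python
# Sketch of pinning a pipeline stage to one core, analogous in spirit to
# the processor affinity set by a LabVIEW Timed Structure.
import os

def pin_to_core(core=None):
    # Restrict the calling process to a single core, if the OS supports it.
    if not hasattr(os, "sched_setaffinity"):   # e.g. macOS, Windows
        return None
    allowed = os.sched_getaffinity(0)          # cores we may legally use
    core = core if core is not None else min(allowed)
    os.sched_setaffinity(0, {core})
    return core
```

Pinning a stage keeps its working set resident in one core's cache, which is precisely why the affinity-plus-FIFO variant pays off when optimizing for cache performance.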
It is important to note that acceleration in a real-time environment is not limited to multicore processor systems; it is also commonly achieved in heterogeneous systems involving a combination of CPUs, FPGAs, and DSPs. LabVIEW can represent parallel software for both multicore processors and FPGAs with the same graphical code.
Figure 4. Parallelism represented in NI LabVIEW.
“Real-time” doesn't necessarily mean “real fast,” but there is no reason it cannot. Engineers and scientists are now able to address a new domain of problem-solving based on “Real-Time Numerical Analysis” using a high-performance computing (HPC) approach with off-the-shelf hardware. Two example applications noted in this paper were autonomous vehicle vision perception and nuclear fusion control. The key consideration for developers is finding a programming approach that allows the implementation of parallel architectures, such as the pipelining example demonstrated above with LabVIEW.
Cleary and Hobbs, California Institute of Technology. “A Comparison of LAM-MPI and MPICH Messaging Calls with Cluster Computing.” http://www.jyi.org/research/re.php?id=752
Tian and Shih. “Software Techniques for Shared-Cache Multi-Core Systems.” Intel Software Network. July 9, 2007. http://softwarecommunity.intel.com/articles/eng/2760.htm
Meisel and Weltzin. “Programming Strategies for Multicore Processing: Pipelining.” TechOnline. http://www.techonline.com/electronics_directory/techpaper/207600982
Eadline, Douglas. “Polls, Trends, and the Multi-core Effect.” Linux Magazine. September 18, 2007. http://www.linux-mag.com/id/4127