The while loop is a fundamental structure that can be used with a variety of programming patterns (task parallelism, data parallelism, pipelining, or structured grid). Depending on the pattern, a regular while loop may suffice; in other situations, a specialized type of while loop (such as the timed loop) may be more appropriate.
Shift Registers and Feedback Nodes
For the pipelining approach outlined above, use either shift registers or feedback nodes to carry data from one loop iteration to the next (their behavior is equivalent in this scenario).
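LabVIEW code is graphical, so there is no direct text equivalent, but the following Python sketch illustrates the idea with a loop-carried variable playing the role of the shift register; stage1, stage2, and the sample data are hypothetical stand-ins, and in LabVIEW the two stages would execute in parallel on separate cores rather than sequentially as here.

    def stage1(x):
        return x * 2              # hypothetical first pipeline stage

    def stage2(x):
        return x + 1              # hypothetical second pipeline stage

    # "carried" acts like a shift register: it holds stage 1's output from
    # one iteration so that stage 2 can consume it on the next. Stage 2
    # always works one iteration behind stage 1, which is exactly the
    # handoff that pipelining requires.
    carried = None
    for sample in range(10):
        if carried is not None:
            print(stage2(carried))    # process the previous iteration's result
        carried = stage1(sample)      # produce this iteration's result
    print(stage2(carried))            # drain the final value after the loop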
Parallel For Loop
The Parallel For Loop allows you to programmatically set the number of parallel "workers" that execute your code, achieving implicit parallelism (the loop abstracts away the complexity of mapping different workers to different cores). To maximize parallel execution, create one worker per processor core.
The Parallel For Loop is a valid approach for an intensive operation that must execute over and over in a loop without dependencies from one iteration to the next. If there are such dependencies, the Parallel For Loop should not be used, because the dependencies imply that the algorithm must execute sequentially; in that case, use another technique, such as pipelining, to achieve parallelism. A minimal sketch of the dependency-free case follows.
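As a rough textual analogue (again, LabVIEW itself is graphical), the Python sketch below uses a process pool as the parallel "workers"; intensive_op and the worker count of four are hypothetical.

    from multiprocessing import Pool

    def intensive_op(sample):
        # Each call is independent: no value flows from one iteration to
        # the next, which is what makes this loop safe to parallelize.
        return sample * sample

    if __name__ == "__main__":
        with Pool(processes=4) as pool:   # e.g. one worker per core on a quad-core
            results = pool.map(intensive_op, range(100_000))
        print(results[:5])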
Timed Loop
The timed loop acts like a while loop but has special characteristics that can help you optimize performance for a particular multicore hardware configuration. For example, unlike a regular while loop, which may execute across multiple threads, all code enclosed within a timed loop executes in a single thread. This may seem counterintuitive, and one might wonder why single-threaded execution would be desirable on a multicore system. It is a useful characteristic on real-time systems and wherever optimizing for cache is important. In addition to executing in a single thread, a timed loop can set processor affinity, a mechanism that assigns its thread to a particular CPU (and hence helps optimize for cache).
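There is no desktop text-language equivalent of the timed loop itself, but the affinity idea can be sketched. The Python fragment below assumes Linux (os.sched_setaffinity is Linux-specific), and critical_iteration and the loop period are hypothetical.

    import os
    import time

    # Pin the calling process, and therefore the single thread running this
    # loop, to CPU 1, roughly analogous to setting a timed loop's
    # processor affinity.
    os.sched_setaffinity(0, {1})

    def critical_iteration():
        pass                      # hypothetical time-critical work

    for _ in range(1000):
        critical_iteration()
        time.sleep(0.001)         # stand-in for the timed loop's period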
It is important to note that parallel patterns that work well within regular while loops (such as data parallelism and pipelining) will not work within a single timed loop, because no parallelism is achievable in a single thread. Instead, these techniques can be implemented using multiple timed loops. With pipelining, for example, each timed loop can represent a distinct stage of the pipeline, with data transferred between loops via FIFOs, as sketched below.
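A minimal Python sketch of that structure, assuming two hypothetical stages and using a bounded queue in place of the FIFO:

    import queue
    import threading

    fifo = queue.Queue(maxsize=16)        # stands in for the inter-loop FIFO

    def stage1_loop():                    # first loop: acquire / pre-process
        for i in range(100):
            fifo.put(i * 2)               # hypothetical stage-1 work
        fifo.put(None)                    # sentinel: end of data

    def stage2_loop():                    # second loop: finish processing
        while True:
            item = fifo.get()
            if item is None:
                break
            print(item + 1)               # hypothetical stage-2 work

    t1 = threading.Thread(target=stage1_loop)
    t2 = threading.Thread(target=stage2_loop)
    t1.start(); t2.start()
    t1.join(); t2.join()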
Queues and RT FIFOs
Queues are important for synchronizing data between multiple loops. For example, they can be used to implement a producer/consumer architecture. That architecture is not covered in detail in this document because it is not unique to parallel programming; it is a general-purpose design. Nonetheless, it works quite well on multicore CPUs and helps minimize CPU usage, and the combination of loops and queues is what makes it possible.
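The structure is the same as the pipelining sketch above; what changes is the intent. In the hypothetical Python sketch below, the producer and consumer run at independent rates, and the blocking dequeue lets the consumer sleep rather than poll, which is what keeps CPU usage low.

    import queue
    import threading
    import time

    q = queue.Queue()

    def producer():                    # e.g. an acquisition loop
        for i in range(5):
            q.put(f"measurement {i}")
            time.sleep(0.1)            # hypothetical acquisition rate
        q.put(None)                    # sentinel: end of data

    def consumer():                    # e.g. a processing/logging loop
        while True:
            item = q.get()             # blocks until data arrives, so an
            if item is None:           # idle consumer costs no CPU
                break
            print("processed", item)

    threading.Thread(target=producer).start()
    consumer()                         # run the consumer in the main thread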
Note that queues are not a deterministic mechanism for sharing data between loops; if determinism is required, as in real-time applications, use RT FIFOs instead.
CPU Pool Reservation VIs
Specific to LabVIEW Real-Time, CPUs can be "reserved" for certain thread pools using the CPU Pool VIs. This is another mechanism for optimizing cache usage.
For example, imagine an application that will execute on a quad-core system and is intended to operate on a dataset as quickly as possible, over and over again. This type of operation is ideal to run entirely in cache, assuming the dataset actually fits in the CPU cache. In fact, running the operation in cache may be more effective than attempting to parallelize the code across all four CPUs. So, instead of allowing the OS to schedule tasks across all four CPUs (0-3), the developer may choose to reserve only two CPUs, such as CPU 0 and CPU 2, for the OS scheduler. (Perhaps the quad-core in question has one large cache shared between CPUs 0 and 1 and another shared between CPUs 2 and 3.) By reserving CPUs, the developer can help ensure that the data stays in cache and that the two large shared caches are at the operation's full disposal.
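The CPU Pool VIs are specific to LabVIEW Real-Time, but a loosely analogous effect can be sketched at the OS level. The Python fragment below assumes Linux and mirrors the hypothetical quad-core above by confining general-purpose threads to CPUs 0 and 2.

    import os

    # Confine this process's threads to CPUs 0 and 2 (one per shared
    # cache), leaving CPUs 1 and 3 free for the time-critical operation.
    os.sched_setaffinity(0, {0, 2})
    print("allowed CPUs:", os.sched_getaffinity(0))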
CPU Information VIs
The CPU Information VIs provide information about the system on which the LabVIEW application is running. This information is very useful if the application may be deployed on a variety of different machines (such as dual-core, quad-core, or even octal-core systems).
Using the CPU Information VIs, the application can read parameters such as "# of logical processors" and feed the result on a given machine into a Parallel For Loop. For example, if the application is running on a dual-core machine, the number of logical processors is 2, and the optimal number of workers for the Parallel For Loop is therefore also two. This allows the code to adapt more easily to the underlying hardware, as the sketch below illustrates.
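A minimal Python analogue of that adaptive sizing, reusing the hypothetical intensive_op from earlier and treating os.cpu_count() as the counterpart of "# of logical processors":

    import os
    from multiprocessing import Pool

    def intensive_op(sample):
        return sample * sample            # hypothetical per-item work

    if __name__ == "__main__":
        workers = os.cpu_count()          # the machine's logical processor count
        with Pool(processes=workers) as pool:
            results = pool.map(intensive_op, range(1000))
        print(f"{workers} workers processed {len(results)} items")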
The Desktop Execution Trace Toolkit and the Real-Time Trace Viewer
Tracing is a very useful way to debug multicore applications and can be performed on both desktop and real-time systems. Refer to the product documentation for the Desktop Execution Trace Toolkit and the Real-Time Trace Viewer. In LabVIEW 2013 and prior releases of the LabVIEW Real-Time Module, the Real-Time Trace Viewer is packaged as a separate product, the Real-Time Execution Trace Toolkit.