Improving Performance with Parallel For Loops

Publish Date: Nov 09, 2012

Overview

For Loop iteration parallelism is a feature introduced in LabVIEW 2009 that executes the iterations of a For Loop concurrently in multiple threads, yielding greater CPU utilization and reduced processing time on multicore machines. This document provides an overview of the feature, how and when to use it, performance tips, and example code.

Table of Contents

  1. Feature Overview
  2. Configuring Iteration Parallelism
  3. When to Use Iteration Parallelism
  4. Performance Tips
  5. Performance Study
  6. Summary
  7. Related Links

1. Feature Overview

LabVIEW automatically takes advantage of multicore machines by executing independent portions of diagrams in different threads. In LabVIEW 2009 and later, you can get even more parallelism from your diagrams by enabling iteration parallelism on For Loops. This feature can be applied to a For Loop if the computation in one iteration does not depend on the results from another iteration. With iteration parallelism enabled, the iterations of the loop execute in parallel on multiple cores.

Before this feature was introduced, one way to execute the iterations of a loop in parallel was to make several copies of the loop and divide the work among the loop copies, as shown in the left of Figure 1. This can be tedious and difficult to maintain.

With LabVIEW 2009 and later, you can enable iteration parallelism on a For Loop, and the iterations will automatically execute in parallel. The parallel instances (P) terminal appears on the loop signifying that parallelism is enabled. The code in the right of Figure 1 is equivalent to the manually parallelized code in the left of the same figure. The LabVIEW compiler generates multiple independent loop instances to execute the iterations. The loop instances execute in parallel using LabVIEW’s multithreaded execution environment, just like parallel sections of code in a diagram.

Figure 1 Loop parallelism in LabVIEW

In LabVIEW 2009, each loop instance executes a statically predetermined subset of the iterations. LabVIEW 2010 load balances better by assigning iterations to the loop instances dynamically. The iterations are divided into chunks, and each instance requests a chunk, executes it, and then requests another chunk. If one instance runs on a thread that gets more time on the CPU or executes chunks of iterations that contain fewer operations, the instance can help reduce the overall execution time by executing additional iterations.
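LabVIEW diagrams are graphical, so the scheduling behavior cannot be shown as text, but the dynamic chunking idea translates to any language. The following is an illustrative Python sketch (the function names and the fixed chunk size are assumptions, not LabVIEW internals): several workers share a cursor into the iteration space, and each worker repeatedly requests a chunk, executes it, and comes back for more.

```python
# Conceptual sketch of dynamic chunk scheduling (not LabVIEW code):
# each worker grabs the next available chunk of iterations until
# the iteration space is exhausted.
import threading

def parallel_for(n, body, num_workers=4, chunk_size=8):
    """Run body(i) for i in range(n), handing out chunks dynamically."""
    results = [None] * n
    next_start = [0]                 # shared cursor into the iteration space
    lock = threading.Lock()

    def worker():
        while True:
            with lock:               # request the next chunk
                start = next_start[0]
                if start >= n:
                    return           # nothing left to do
                next_start[0] = min(start + chunk_size, n)
                end = next_start[0]
            for i in range(start, end):   # execute the chunk
                results[i] = body(i)

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

A worker that happens to get more CPU time, or whose chunks contain cheaper iterations, simply returns for more chunks sooner, which is how the faster instance "helps out" the slower ones.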

Iteration parallelism can dramatically improve the performance of loops on multicore machines. The next section describes how to enable and configure iteration parallelism on For Loops.


2. Configuring Iteration Parallelism

To enable this feature, right-click on a For Loop, select Configure Iteration Parallelism..., and check Enable loop iteration parallelism in the dialog box. The parallel instances (P) terminal, and optionally the chunk size (C) terminal, will appear on the loop, as shown below in Figure 2. This section provides a description of the dialog box, followed by guidance on wiring the terminals.

For Loop Iteration Parallelism Dialog Box



Figure 2 The For Loop Iteration Parallelism dialog box in LabVIEW 2010

The Number of generated parallel loop instances setting specifies the number of loop instances to generate at compile time. At run-time, the loop will use the minimum of the value entered in the dialog box and the value wired to (P), so enter the maximum amount of parallelism you ever expect to use in the dialog. For example, if you expect to execute this application on an eight-core computer in the future, enter eight for this setting.

The Iteration partitioning schedule determines how the iterations are divided into chunks before being distributed to the parallel loop instances. This configuration option is available in LabVIEW 2010 and later. With the Automatically partition iterations setting, LabVIEW will generate code such that the initial chunks of iterations are large, and the chunk size decreases. Starting with large chunks reduces the overall number of chunks, which reduces the scheduling overhead. Ending with smaller chunk sizes makes it less likely that one loop instance will be assigned a large amount of work at the end when the other loop instances are sitting idle, which achieves better load balancing.
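One way to picture a "large chunks first, smaller chunks later" schedule is the guided strategy used by OpenMP, where each chunk is a fraction of the remaining iterations. The exact sizes LabVIEW generates are not documented here; this Python sketch uses an assumed formula purely to show the shape of such a schedule:

```python
def guided_chunks(total, num_instances):
    # Illustrative decreasing-chunk-size schedule, similar in spirit to
    # OpenMP "guided" scheduling. The formula (remaining / 2P, minimum 1)
    # is an assumption for illustration, not LabVIEW's actual rule.
    chunks = []
    remaining = total
    while remaining > 0:
        size = max(1, remaining // (2 * num_instances))
        chunks.append(size)
        remaining -= size
    return chunks
```

For 100 iterations on 4 instances this yields a handful of large chunks up front and single-iteration chunks at the tail, which is exactly the trade-off described above: few chunks overall, fine granularity at the end.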

The Automatically partition iterations setting works well across a variety of load patterns and does not require any additional input on your diagram. It is possible, however, that there may be a better performing partitioning strategy for a particular algorithm or data set. If you believe this to be the case, then you can choose the Specify partitioning with chunk size (C) terminal setting. You must then wire a chunk size, or array of chunk sizes, to the (C) terminal, as described later in this section.

You cannot probe or put breakpoints in a For Loop when iteration parallelism is enabled. If you want to temporarily debug the loop, check the Allow debugging box. The iterations of the loop will execute serially, but the (P) and (C) terminals will remain on the loop. Turn debugging off when you are finished debugging to reenable parallel execution.

Wiring the (P) and (C) Terminals

After you enable iteration parallelism on a For Loop and close the dialog, you can configure the loop further by wiring values to the (P) and (C) terminals.

The number of loop instances used is the minimum of the value entered in the dialog box and the run-time value specified at the (P) terminal. If you leave (P) unwired, the default run-time value is the number of logical processors on the machine, so it is recommended that you leave (P) unwired. To specify a different number of loop instances to use at run-time, wire a value to the (P) terminal. See Table 1 below for an explanation of how the number wired to (P) translates to the number of loop instances used at run-time. The special cases for -1 and 0 are available in LabVIEW 2010 and later.

Value wired to (P)          Number of instances used at run-time
1 ... positive infinity     Min(value wired to (P), dialog value)
0, or not wired             Min(number of logical processors, dialog value)
-1                          Dialog value
Negative infinity ... -2    1

Table 1 Behavior of different values wired to (P), where “Dialog value” is the number entered in the For Loop Iteration Parallelism dialog
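The rules in Table 1 can be expressed compactly as a function. This Python sketch (the function name and argument names are my own) mirrors the table, with `None` standing in for an unwired (P) terminal:

```python
def instances_used(p_wired, dialog_value, num_processors):
    # Mirrors Table 1: maps the value wired to (P) to the number of
    # loop instances used at run time (LabVIEW 2010 semantics).
    if p_wired is None or p_wired == 0:          # 0, or not wired
        return min(num_processors, dialog_value)
    if p_wired >= 1:                             # 1 ... positive infinity
        return min(p_wired, dialog_value)
    if p_wired == -1:                            # -1: use the dialog value
        return dialog_value
    return 1                                     # -2 and below: serial
```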

If you choose the Specify partitioning with chunk size (C) terminal schedule, you must wire a chunk size to the (C) terminal. Consider the total number of iterations when selecting the chunk size. If the chunk size is too large, it will limit the amount of parallel work available. If the chunk size is too small, it will increase the amount of overhead incurred by requesting the chunks.

For finer control over the chunk sizes, you can wire an array of chunk sizes to the (C) terminal. For example, if you know that the first iterations of the loop take longer than the last iterations, you may want to create an array with small chunk sizes at the beginning to prevent the first chunks from containing too many long iterations and with large chunk sizes at the end to bundle the short iterations together. If you wire too many chunk sizes, LabVIEW ignores the extra values. If you wire too few chunk sizes, LabVIEW uses the last element in the array to determine the size of the remaining chunks of iterations.
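The two rules for a wired chunk-size array — extra values are ignored, and the last value is reused once the array runs out — can be sketched in Python (the function name is my own; this is an illustration of the stated behavior, not LabVIEW code):

```python
def partition(total, chunk_sizes):
    # Mirrors the (C) terminal rules: consume chunk sizes in order,
    # ignore extras, and reuse the last size for remaining iterations.
    chunks, start, idx = [], 0, 0
    while start < total:
        size = chunk_sizes[min(idx, len(chunk_sizes) - 1)]
        chunks.append(min(size, total - start))  # clip the final chunk
        start += chunks[-1]
        idx += 1
    return chunks
```

For example, partitioning 10 iterations with sizes [2, 3] produces chunks of 2, 3, 3, and 2 iterations: the 3 is reused, and the final chunk is clipped to the iterations that remain.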


3. When to Use Iteration Parallelism

In general, you should only use iteration parallelism when the loop would produce the same results regardless of the order in which the iterations are executed. If the computation in one iteration of the loop relies on a value computed in an earlier iteration, reordering the iterations may produce incorrect results. For example, if an element of an array is written on the i'th iteration and read on the (i+1)'th iteration, parallelizing the loop may execute those iterations in a different order, so the read could happen before the write and produce a different value.

Iteration Dependence Analysis

One of the strengths of LabVIEW is that you do not need to analyze loops for dependencies yourself, since LabVIEW automatically determines whether there are dependencies between the iterations. When you enable iteration parallelism on a For Loop, LabVIEW analyzes the reads and writes to the data accessed in the loop to determine if the same data could be written on one iteration and read or written on another, creating a dependence.

When LabVIEW detects an iteration dependence, it breaks the VI and describes the problem in the Error List window. In addition, if the For Loop contains nodes that have side effects, the Error List window displays a warning. (You can configure LabVIEW to show warnings by default in the Debugging section of the Environment category in the Tools>>Options dialog.)

In LabVIEW, most For Loops that do not have shift registers are safe to parallelize. For example, the loop in Figure 3 reads the i'th value of array A and produces the i'th value of the result array on every iteration. These iterations can safely execute in any order.

Figure 3 Loop that can be parallelized

Certain types of For Loops with shift registers are safe to parallelize. For example, you can rewrite the loop from Figure 3 using shift registers as shown in Figure 4. Each iteration of the loop replaces a different element of the result array. Since the iterations do not depend on values computed in other iterations, it is safe to execute the iterations in parallel.

Figure 4 Loop with shift registers that can be parallelized

Figure 5 shows another example of a parallelizable loop with shift registers. Ordinarily, this loop would not be safe to parallelize because each iteration adds to the result from the previous iteration. However, LabVIEW recognizes this pattern as a reduction and generates code to compute partial sums in parallel and to add the partial sums at the end. See the shipping example Parallel For Loop Reduction.vi for additional information on reductions.

Figure 5 Loop computing a reduction that can be parallelized
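The reduction pattern that LabVIEW recognizes in Figure 5 can be sketched in Python (the names here are my own, and `ThreadPoolExecutor` stands in for LabVIEW's loop instances): each instance sums its own share of the values, and the partial sums are combined at the end.

```python
# Conceptual sketch of a parallel sum reduction (not LabVIEW code):
# each instance computes a partial sum, and the partial sums are
# added together at the end.
from concurrent.futures import ThreadPoolExecutor

def parallel_sum(values, num_instances=4):
    def partial(inst):
        # Each instance sums a strided share of the input.
        return sum(values[inst::num_instances])
    with ThreadPoolExecutor(max_workers=num_instances) as pool:
        return sum(pool.map(partial, range(num_instances)))
```

With integer inputs the result is identical to a serial sum; with floating-point inputs the regrouping of additions is exactly the source of the low-order-bit differences discussed later in this section.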

Possible Errors and Warnings

Table 2 lists the possible errors and warnings that will be reported if the loop may not be safe to parallelize. If there is a shift register on the For Loop, the value must be an array where the iterations access different elements, or the value must be used in a recognized reduction. (When there is a dependence between iterations, consider using another technique for obtaining parallelism, such as pipelining.)

Additionally, the shift register cannot be stacked, and it must be initialized. A For Loop cannot be parallelized if it contains a conditional terminal, a feedback node, or a Boolean control with latching mechanical action.

If the For Loop contains a node that may have side effects, the error list window will give you a warning. Examples of nodes with side effects include Local Variables and the Write to Text File function. When you see this warning, you should evaluate whether it is safe for your application to execute the operations out of order. For example, the order in which results are written to a file may or may not matter, depending on the application.

This feature is supported on desktop and real-time targets, but it is not supported on FPGA, PDA, Touch Panel, or embedded devices.

Array dependence between loop iterations
Dependence between loop iterations
Stacked shift register
Uninitialized shift register
Conditional terminal
Feedback node
Boolean control with latch mechanical action
Node with side effects
Feature not supported on target

Table 2 All parallel For Loop errors and warnings

Due to rounding effects with floating point numbers, the results can differ, though usually only in the lowest-order bits, when operations are performed in a different order. See the article on precision for more information about how this can occur with iteration parallelism and also in sequential code.

Find Parallelizable Loops Tool

To easily identify opportunities for iteration parallelism, use the Find Parallelizable Loops tool (Tools>>Profile>>Find Parallelizable Loops…). The tool lists all For Loops in the hierarchy of the current VI or project, and marks them as Parallelizable, Possibly not parallelizable, or Not parallelizable. Figure 6 shows a screenshot of this tool.

 

Figure 6 Find Parallelizable Loops tool


4. Performance Tips

Iteration parallelism is able to achieve significant performance gains on multicore machines, as shown in the Performance Study section. However, to get the most out of this feature, there are some performance tips to keep in mind. This section gives a brief overview of the LabVIEW execution system and explains how to configure iteration parallelism to get the best performance.

LabVIEW manages a pool of execution system threads for running sections of LabVIEW diagrams. (In LabVIEW 2010, the number of threads is at least as many as the number of cores on your machine, and no fewer than four.) During a sequential phase of a LabVIEW application, some of these threads may be sleeping. When the amount of parallelism increases, LabVIEW must signal the operating system to wake up the idle threads. It is the operating system’s responsibility to resume execution of these idle threads and to preemptively share the available processors with the other threads in the system.

Within each execution system thread, LabVIEW cooperatively multitasks among sections of code called clumps. The LabVIEW compiler creates clumps by determining which sections of your diagram can run in parallel with other sections. At execution time, clumps periodically yield their execution to the scheduler to give other clumps that may be waiting a chance to run. If another clump is waiting, the scheduler pauses the currently running clump and executes the waiting clump. If nothing is waiting, the clump continues running.

With iteration parallelism, the compiler puts each parallel loop instance into its own clump and generates code in each loop instance for getting the next chunk of iterations. Between this generated code and the execution system, there is a small amount of overhead associated with enabling iteration parallelism. Thus, loops that perform a trivial amount of computation will likely not benefit from iteration parallelism. The time saved by executing the iterations in parallel must be greater than the time spent scheduling the iterations for there to be a performance improvement.

In general, when clumps execute efficiently without blocking, it may not improve performance to have more clumps than there are threads. In this case, task switching does not reduce the overall execution time, and when the number of clumps is significantly larger than the number of threads, it can cause unnecessary overhead.

When For Loops with iteration parallelism are nested, the total number of loop instances is the product of the number of instances for each loop, which can easily exceed the number of threads. Additionally, the overhead of waking up threads for the inner loop is repeated on each iteration of the outer loop. However, if only the outer loop is parallelized, this overhead is only incurred once. As a result, it is usually best to enable parallelism only on the outermost loop.

Similarly, if you are aware of other sections of code which will execute at the same time, and you want to limit the amount of resources given to the For Loop, you can wire a fraction of the number of logical processors to (P) using the CPU Information VI. Figure 7 shows two For Loops with iteration parallelism enabled. Since the loops can execute at the same time, it may be beneficial to wire half of the number of logical processors to (P) on each loop.

Figure 7 Limiting the number of workers

Alternatively, when the computation must wait for something like an I/O operation to complete before proceeding, it can be beneficial to have more clumps available than there are threads. This is commonly referred to as “oversubscribing.” This creates additional clumps which can be swapped in when other clumps are waiting. In the LabVIEW execution system, when a clump executes an operation that causes it to wait, the clump yields to allow other clumps to execute. Thus, if the For Loop contains blocking nodes, like I/O operations, you may want to use more parallel loop instances than the number of cores in your machine.
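The oversubscription advice translates directly to any threaded runtime. In this Python sketch (the names are my own, and `time.sleep` stands in for a blocking node such as file or network I/O), more workers than cores lets some workers make progress while others are blocked:

```python
# Sketch of oversubscription for I/O-bound loops (not LabVIEW code):
# each "iteration" blocks briefly, so extra workers can overlap the waits.
import os
import time
from concurrent.futures import ThreadPoolExecutor

def io_bound_loop(n, workers):
    def iteration(i):
        time.sleep(0.01)      # stands in for a blocking node (I/O)
        return i * 2
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(iteration, range(n)))

# For a heavily blocking loop, something like 4x the core count is a
# reasonable starting point to experiment with:
workers = 4 * (os.cpu_count() or 1)
```

The same trade-off applies to the (P) terminal: for CPU-bound loops, match the core count; for loops dominated by blocking nodes, a larger value can improve throughput.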

Finally, you should avoid calling serializing functions, like non-reentrant subVIs, in the loop, since the parallel loop instances would have to take turns executing the function. If possible, make subVIs reentrant to increase the parallelism available. Use the VI Analyzer or the Find Parallelizable Loops tool to find calls to non-reentrant subVIs in For Loops with parallelism enabled.

In summary,

  • only enable iteration parallelism when the loop performs a significant amount of computation to outweigh the scheduling overhead,
  • limit the total number of parallel loop instances to the number of cores unless the iterations call blocking nodes, and
  • avoid calling serializing nodes in the loop.


5. Performance Study

The loop shown below calculates the Mandelbrot set. Each iteration of the outer loop computes one row of the result array. Since the values computed for one row do not depend on the values computed for any other row, the iterations of the outer loop can execute in parallel.



Figure 8 Mandelbrot set computation

              Sequential Time   Parallel Time   Speedup
LabVIEW 2009      14.9 s            5.8 s         2.6
LabVIEW 2010       9.5 s            2.6 s         3.7

Table 3 Performance results for computing a 500 by 500 Mandelbrot set on a quad-core machine

Using LabVIEW 2010, the sequential version of this algorithm takes 9.46 seconds on a 500 by 500 set. Changing the outer loop to use iteration parallelism reduces the execution time to 2.55 seconds on a quad-core machine, which is 3.7 times faster than the sequential version.
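The speedup figures in Table 3 follow directly from the measured times:

```python
# Speedup = sequential time / parallel time, using the LabVIEW 2010
# measurements quoted above (9.46 s sequential, 2.55 s parallel).
seq_2010, par_2010 = 9.46, 2.55
speedup_2010 = seq_2010 / par_2010   # about 3.7x on a quad-core machine
```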

This algorithm highlights the benefit of the dynamic scheduling strategy introduced in LabVIEW 2010. When the same benchmarks are executed using LabVIEW 2009, the parallel version is 2.6 times faster than the sequential version. While this is a significant improvement, the scheduling strategy in LabVIEW 2010 can achieve even higher performance.


6. Summary

To achieve better performance on multicore machines, consider enabling iteration parallelism on For Loops where the iterations do not depend on each other. The Find Parallelizable Loops tool can help you find loops which are candidates for iteration parallelism in your projects or VI hierarchies. Loops that perform a significant amount of computation per iteration and that do not call serializing nodes, like non-reentrant subVIs, will benefit the most from this feature.


7. Related Links

To see everything that's new in LabVIEW, visit ni.com/labview/whatsnew.

To try this and other new features, download LabVIEW to evaluate it for free.

To read more about parallelism in LabVIEW, visit ni.com/multicore and Parallel Programming for Everyone – Take Advantage of Multicore CPUs with LabVIEW.

Mary Fletcher is a software engineer at National Instruments. She holds a bachelor's degree in computer science from the University of Oklahoma and a master's degree with a focus on parallelizing compilers from Rice University.

