Using NI LabVIEW FPGA IP Builder to Optimize and Port VIs for Use on FPGAs

Publish Date: Aug 01, 2014

Overview

FPGA IP Builder is a feature of LabVIEW that you can use to create highly optimized FPGA implementations using natural, high-level code. This white paper covers the underlying technology of FPGA IP Builder, its key benefits, and examples that show how you can iteratively optimize your code for the FPGA. This paper assumes a basic understanding of FPGAs, LabVIEW, and the LabVIEW FPGA Module.

Table of Contents

  1. What is FPGA IP Builder?
  2. How to use FPGA IP Builder
  3. Directive-Driven Optimization
  4. Conclusion
  5. Next Steps

1. What is FPGA IP Builder?

The LabVIEW FPGA Module makes it easy to extend your LabVIEW programming skills to the FPGA. However, to create truly high-performance FPGA designs, you need to use the Single Cycle Timed-Loop. Because the code inside a Single Cycle Timed-Loop executes in one clock cycle, the loop restricts the programming elements that you can use inside it. Frequently, you have to refactor or rewrite your algorithm code using advanced techniques, like loop unrolling and pipelining, to make it suitable for execution inside a Single Cycle Timed-Loop. Refactoring your code requires knowledge of your algorithm and the underlying FPGA hardware. For complex algorithms, this process can be tedious and error-prone and can make the resultant code harder to debug and maintain. FPGA IP Builder is a feature of the LabVIEW FPGA Module that enables you to create reusable algorithm code and eliminates the need for manual optimization. FPGA IP Builder enables you to:

1. Automatically optimize algorithm code for your FPGA

With FPGA IP Builder, you can create your algorithm code using high-level programming elements, like arrays and loops, and automatically optimize it for your FPGA using directives. Directives are specifications that you can use to tailor the optimization process.

2. Create reusable algorithm code

Timing performance of your algorithm code is typically measured in terms of its throughput, latency and the maximum clock frequency that it can run at. On FPGAs, timing performance is typically obtained by parallelizing your code, which tends to use more resources. Consequently, optimizing your algorithm code is a tradeoff between timing performance and resource utilization. Your optimization goals are typically governed by the requirements of your overall application.

FPGA IP Builder enables you to reuse the same high-level algorithm VI in multiple applications by simply using different directives to optimize it.

3. Quickly estimate timing performance without compiling your code

For your algorithm code and a specified set of directives, FPGA IP Builder can give you an estimate of timing performance and resource utilization within minutes. These results are typically available only after a full compilation run, which can take several minutes to an hour. The quick estimation tool enables you to rapidly iterate over many different values of directives until you meet your optimization goals.

 

IP builder concept.png

Fig 1. FPGA IP Builder uses your directives to generate an optimized FPGA implementation of your algorithm code that can be used inside a Single Cycle Timed-Loop in your Top-Level FPGA VI

 


2. How to use FPGA IP Builder

 

IP builder workflow.png

Fig 2. A typical FPGA IP Builder workflow takes high-level algorithm code through multiple rounds of directive-driven estimation. Once your optimization goals are met, you can generate optimized code and integrate it into your Top-Level FPGA VI.


IP Builder is available as a project item under supported FPGA devices in the LabVIEW project.

 

 



1. Create your high-level algorithm VI under the IP Builder project item. Alternatively, you can copy or move VIs from other targets.
While IP Builder VIs have fewer palette restrictions than Single Cycle Timed-Loops, they do not support all the functions and data types that LabVIEW supports on the desktop. Refer to the product documentation for details on supported functions and data types.

Algorithm VIs should be hardware agnostic and should not contain IO, timing or hard-coded constants.


2. Create a directives item for your algorithm VI. You can specify your directives and run quick estimations in the directives property dialog. Choosing directives and evaluating results is covered in the next section. You can create one or more sets of directives for your algorithm VI.


3. Once you achieve your optimization goals in the estimation tool, you should run a thorough estimation to validate your design. Thorough estimation provides a very accurate performance and resource estimate by actually compiling your optimized algorithm code using the Xilinx toolchain.


4. Create a Build Specification from your directives to generate the optimized FPGA implementation of your algorithm VI. The Build Specification generates a VI containing the optimized FPGA implementation in a folder named IP Builder generated VIs under your FPGA target.


5. Create a top-level VI under your FPGA target containing a Single Cycle Timed-Loop with the necessary timing and IO. Insert the IP Builder generated VI into the Single Cycle Timed-Loop.



3. Directive-Driven Optimization

The following section uses examples to explain how directives optimize your algorithm code. Each example lists the location of the source VI and a step-by-step procedure to achieve the stated optimization goal. For a detailed procedure, refer to the FPGA IP Builder Tutorial in the Product Documentation.

NOTE: The examples below were run on an NI CompactRIO 9068 target. Results presented here might not replicate exactly. If you see significant discrepancies while replicating these results, please report them on the LabVIEW FPGA IP Builder Community.

Example 1: FIR Filter

Objective: This example demonstrates the use of the Initiation Interval directive to achieve a significant improvement in Throughput.


Source VI: Consider the FIR Filter code below. The block diagram code shown below is included as FIR.vi in the FIR shipping example with the LabVIEW FPGA Module.
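
The block diagram of FIR.vi is not reproduced in this text. As a rough textual analogue, the Python sketch below shows a generic FIR filter written with arrays and a loop, the kind of high-level structure this example optimizes. It is not a transcription of FIR.vi: the function name, the delay-line layout, and the coefficient values are assumptions made for illustration.

```python
def fir_filter(sample, history, coefficients):
    """Process one input sample and return (output, new_history).

    history holds the len(coefficients) - 1 most recent past samples,
    newest first. All names here are illustrative assumptions.
    """
    # Shift the new sample into the front of the delay line.
    taps = [sample] + list(history)
    # Multiply-accumulate over every tap. This is the loop that an
    # optimizer could unroll to run the multiplications in parallel.
    acc = 0
    for coeff, tap in zip(coefficients, taps):
        acc += coeff * tap
    # Drop the oldest sample so the delay line keeps a fixed length.
    return acc, taps[:-1]


# Feeding a unit impulse returns the coefficients one at a time.
history = [0, 0]
outputs = []
for x in [1, 0, 0, 0]:
    y, history = fir_filter(x, history, [1, 2, 3])
    outputs.append(y)
print(outputs)
```

Like FIR.vi as described later in this example, the sketch accepts one data value per call.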



Setup: Copy the source VI under the IP Builder project item. Right-click FIR.vi and select Create Directives from the shortcut menu. LabVIEW creates an FIR directives project item above FIR.vi. Double-click FIR directives to display the FIR directives Properties dialog box. Navigate to the Directives tab.



The Directives tab shows you the Block Diagram of FIR.vi. In the top-left corner, you can see a hierarchical list of Block Diagram components. The bottom-left corner shows directives that you can configure for a selected component in the Block Diagram hierarchy. Directives that are specified with the Top-Level VI selected are referred to as Top Level Directives in this paper.


Iteration 1

Establish the baseline timing performance of the design by running a quick estimate using the default value of 40 MHz for the Clock rate (MHz) directive. No other directives are specified. Navigate to the Estimates tab and click Quick estimate to generate an estimation report.




The Report Summary contains a Device Utilization estimate that lists the percentage of FPGA hardware resources used by your design. A thorough understanding of FPGA hardware resources is not necessary to follow this example. To learn more about them, refer to the product documentation.


The Report Summary also contains a Performance estimate in terms of Clock rate (MHz), Initiation interval (cycles) and minimum and maximum latency.

 

Clock rate (MHz) is the frequency of the clock that drives the Single Cycle Timed-Loop containing your algorithm code. If your Single Cycle Timed-Loop can process one sample of data per iteration, a faster clock rate directly improves performance. Clock rate is ultimately limited by the critical path in your design. The critical path is the sequence of logical or mathematical operations that takes the longest time to execute in your design. Specifying a high clock rate causes FPGA IP Builder to pipeline your design and break up the critical path whenever possible. To learn more about pipelining, refer to the product documentation.
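
To make the relationship between clock rate, critical path, and pipelining concrete, here is a minimal Python sketch. It assumes a chain of operations with made-up delays in nanoseconds and groups them greedily into pipeline stages that fit within one clock period. The delays, the function names, and the greedy grouping are all illustrative assumptions; this is not a description of how FPGA IP Builder actually pipelines a design.

```python
def split_into_stages(op_delays_ns, clock_mhz):
    """Greedily group consecutive operations into pipeline stages.

    A new stage (a register boundary) starts whenever adding the next
    operation would exceed one clock period.
    """
    period_ns = 1000.0 / clock_mhz
    stages, current, used = [], [], 0.0
    for delay in op_delays_ns:
        if current and used + delay > period_ns:
            stages.append(current)
            current, used = [], 0.0
        current.append(delay)
        used += delay
    if current:
        stages.append(current)
    return stages


def max_clock_mhz(stages):
    """The slowest stage is the critical path and caps the clock rate."""
    return 1000.0 / max(sum(stage) for stage in stages)


# Hypothetical operation delays, in nanoseconds.
delays = [4.0, 3.0, 2.0, 4.0, 3.0]
for target in (40, 200):
    stages = split_into_stages(delays, target)
    print(target, "MHz target:", len(stages), "stage(s), limit",
          max_clock_mhz(stages), "MHz")
```

In this toy model the unpipelined chain is limited to 62.5 MHz, while a 200 MHz target splits it into four stages. The additional stages are paid for in latency, measured in cycles.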


Initiation Interval (cycles) is the number of cycles between inputs to your design. It is a measure of how frequently your design can accept new inputs. On FPGA hardware, inputs are typically passed into logic blocks on clock pulses, or ticks. If your design can accept an input on every consecutive tick, there is one clock period (or cycle) between inputs, which implies an initiation interval of 1. Setting this directive causes FPGA IP Builder to recursively unroll loops until your Initiation Interval specification can be met.
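
The link between loop unrolling and Initiation Interval can be sketched with a deliberately simplified model. The Python code below assumes that a loop body takes one cycle per iteration and that each unrolled copy of the body works in parallel, so the interval is roughly the iteration count divided by the unroll factor. The model, the doubling strategy, and the 32-iteration loop are assumptions chosen for illustration; the real tool also accounts for pipelining, memory access, and resource limits.

```python
import math


def initiation_interval(loop_iterations, unroll_factor):
    """Cycles between inputs in the simplified one-cycle-per-iteration model."""
    return math.ceil(loop_iterations / unroll_factor)


def unroll_for_target(loop_iterations, target_ii):
    """Double the unroll factor until the target interval is met
    or the loop is fully unrolled."""
    factor = 1
    while (initiation_interval(loop_iterations, factor) > target_ii
           and factor < loop_iterations):
        factor *= 2
    return min(factor, loop_iterations)


# A hypothetical 32-iteration loop.
for target in (32, 8, 1):
    factor = unroll_for_target(32, target)
    print("target II", target, "-> unroll factor", factor,
          "-> II", initiation_interval(32, factor))
```

Each doubling of the unroll factor duplicates the loop body, which is why a lower Initiation Interval generally costs more FPGA resources.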


Latency is the number of clock cycles that it takes your design to complete one iteration of your algorithm VI. It can also be interpreted as the time between an input entering your design and a corresponding output being generated. In certain applications, like closed loop control, this time is an important consideration.
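
Because latency is reported in cycles, comparing designs that run at different clock rates requires converting cycles to time. The short Python sketch below does that conversion for the latency values reported for the FIR example in this paper (31 cycles at 40 MHz, 91 cycles at 200 MHz, and 6 cycles at 200 MHz). The function name is an assumption, and the conversion uses the clock rate specified in the directives.

```python
def latency_ns(latency_cycles, clock_mhz):
    """Convert a latency in clock cycles to nanoseconds."""
    return latency_cycles * 1000.0 / clock_mhz


# (label, latency in cycles, clock rate in MHz) from the FIR example.
estimates = [("Iter. 1", 31, 40), ("Iter. 2", 91, 200), ("Iter. 3", 6, 200)]
for label, cycles, mhz in estimates:
    print(label, latency_ns(cycles, mhz), "ns")
```

Note that a higher cycle count does not necessarily mean a longer delay: 91 cycles at 200 MHz is a shorter time than 31 cycles at 40 MHz.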


Throughput is a key performance metric for your algorithm, but the estimation tool does not estimate it directly. Throughput is the number of data samples that your design can process per second. You can calculate throughput for your design using the formula below:

Throughput (MS/s) = Samples per Iteration × Clock Rate (MHz) / Initiation Interval (cycles)

Samples per Iteration is simply the number of samples that your design can accept per call. You can look at the inputs of your algorithm VI to learn this.


If you analyze FIR.vi, you will see that it accepts one data value (sample) per iteration of the VI. Therefore, Samples per Iteration is 1. To calculate the Throughput, use the Initiation Interval obtained from the quick estimate summary.


NOTE: The clock rate in the quick estimate summary might be higher than the one specified in the directive. However, it might not always be possible to generate an arbitrary clock rate on the FPGA. Furthermore, your Single Cycle Timed-Loop clock rate might be limited by other functionality in the loop. Hence, this paper uses the clock rate specified in the directives to calculate throughput.


Throughput of your design is measured in megasamples per second.
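
The throughput calculation can be checked numerically. The Python sketch below assumes that throughput equals Samples per Iteration times the clock rate in MHz, divided by the Initiation Interval in cycles; this relationship reproduces the throughput values tabulated for the FIR example. The function name is illustrative.

```python
def throughput_ms_per_s(samples_per_iteration, clock_mhz, initiation_interval):
    """Throughput in megasamples per second (MS/s).

    clock_mhz is the clock rate specified in the directives, and
    initiation_interval is the value from the quick estimate summary.
    """
    return samples_per_iteration * clock_mhz / initiation_interval


# FIR example: one sample per call, values from the estimate tables.
print(throughput_ms_per_s(1, 40, 32))             # baseline at 40 MHz
print(round(throughput_ms_per_s(1, 200, 92), 2))  # 200 MHz, interval 92
print(throughput_ms_per_s(1, 200, 1))             # 200 MHz, interval 1
```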


The table below summarizes the inputs and outputs of the quick estimation:


Top Level Directives: Clock Rate (MHz) : 40 MHz


| | Slice Registers (%) | Slice LUTs (%) | DSP48s (%) | BRAMs (%) | Clock Rate (MHz) | Initiation Interval (cycles) | Min. Latency (cycles) | Max. Latency (cycles) | Throughput (MS/s) |
|---|---|---|---|---|---|---|---|---|---|
| Iter. 1 | 0.1% | 0.2% | 0.5% | 0.7% | 40 | 32 | 31 | 31 | 1.25 |


Iteration 2

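Set Clock rate (MHz) to 200 MHz and run a quick estimate. With no other changes, increasing the clock rate results in higher throughput (2.17 MS/s, up from 1.25 MS/s). The gain is smaller than the 5X increase in clock rate because the Initiation Interval rises from 32 to 92 cycles. Resource utilization is virtually unchanged.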


Top Level Directives: Clock Rate (MHz) : 200 MHz


| | Slice Registers (%) | Slice LUTs (%) | DSP48s (%) | BRAMs (%) | Clock Rate (MHz) | Initiation Interval (cycles) | Min. Latency (cycles) | Max. Latency (cycles) | Throughput (MS/s) |
|---|---|---|---|---|---|---|---|---|---|
| Iter. 1 | 0.1% | 0.2% | 0.5% | 0.7% | 40 | 32 | 31 | 31 | 1.25 |
| Iter. 2 | 0.1% | 0.2% | 0.5% | 0.7% | 200 | 92 | 91 | 91 | 2.17 |



Iteration 3

Keep Clock rate (MHz) at 200 MHz and set Initiation Interval to 1. Run a quick estimate and review the results. Throughput jumps to 200 MS/s and latency drops to 6 cycles. However, resource utilization goes up significantly.


Top Level Directives: Clock Rate (MHz) : 200 MHz, Initiation Interval (cycles) : 1


| | Slice Registers (%) | Slice LUTs (%) | DSP48s (%) | BRAMs (%) | Clock Rate (MHz) | Initiation Interval (cycles) | Min. Latency (cycles) | Max. Latency (cycles) | Throughput (MS/s) |
|---|---|---|---|---|---|---|---|---|---|
| Iter. 1 | 0.1% | 0.2% | 0.5% | 0.7% | 40 | 32 | 31 | 31 | 1.25 |
| Iter. 2 | 0.1% | 0.2% | 0.5% | 0.7% | 200 | 92 | 91 | 91 | 2.17 |
| Iter. 3 | 1.0% | 0.8% | 6.8% | 0% | 200 | 1 | 6 | 6 | 200 |


Takeaway

The Initiation Interval directive forces FPGA IP Builder to unroll loops until the design can accept one input per cycle. Unrolling loops leads to a highly parallel design that consumes more resources on the FPGA.

Using directive-driven optimization, you can achieve, within minutes, a 160X improvement in throughput (from 1.25 MS/s to 200 MS/s) and roughly one-fifth the latency, in cycles, of the original design.



Understanding the significance and relative priority of directives can enable you to drive to your desired result faster. Refer to the product documentation to better understand directives.


Example 2: Matrix-Vector Multiplication

Objective: This example demonstrates the use of interface directives to reduce resource utilization while retaining the performance gains achieved with loop unrolling. Interface directives allow you to specify how FPGA IP Builder handles the inputs and outputs of your algorithm VI.


Source VI: Consider the Matrix multiplication example below. You can download this example at the LabVIEW FPGA IP Builder Community.

Setup: Open the source project and create a directives project item for the source VI.


Iteration 1

Establish the baseline timing performance of the design by running a quick estimate using the default value of 40 MHz for the Clock rate (MHz) directive. No other directives are specified.


If you analyze MatrixVector.vi, you will see that it accepts an 8 x 8 matrix per iteration of the VI. Therefore, Samples per Iteration is 64. To calculate the Throughput, use Initiation Interval obtained from the quick estimate summary.

Throughput = 64 samples × 40 MHz / 423 cycles ≈ 6.05 MS/s


The following table shows the inputs and outputs of the quick estimate:


Top Level Directives: Clock Rate (MHz) : 40 MHz


| | Slice Registers (%) | Slice LUTs (%) | DSP48s (%) | BRAMs (%) | Clock Rate (MHz) | Initiation Interval (cycles) | Min. Latency (cycles) | Max. Latency (cycles) | Throughput (MS/s) |
|---|---|---|---|---|---|---|---|---|---|
| Iter. 1 | 0.1% | 0.3% | 0.5% | 1.4% | 40 | 423 | 358 | 422 | 6.05 |



Iteration 2

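Set Clock rate (MHz) to 200 MHz and Initiation Interval to 1. Run a quick estimate and review the results. Throughput and latency improve as soon as the clock rate and initiation interval are specified: throughput rises to 200 MS/s and latency falls to 83 cycles. The estimate reports an Initiation Interval of 64 cycles rather than the requested 1. However, resource utilization goes up significantly.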


Top Level Directives: Clock Rate (MHz) : 200 MHz, Initiation Interval : 1


| | Slice Registers (%) | Slice LUTs (%) | DSP48s (%) | BRAMs (%) | Clock Rate (MHz) | Initiation Interval (cycles) | Min. Latency (cycles) | Max. Latency (cycles) | Throughput (MS/s) |
|---|---|---|---|---|---|---|---|---|---|
| Iter. 1 | 0.1% | 0.3% | 0.5% | 1.4% | 40 | 423 | 358 | 422 | 6.05 |
| Iter. 2 | 3.1% | 0.8% | 3.6% | 0% | 200 | 64 | 83 | 83 | 200 |



Iteration 3

To reduce resource utilization, set the array interfaces at the top level to unbuffered. You can do this for array inputs and outputs as long as the elements in those arrays are accessed only once, sequentially, for every call of the VI. In the matrix-vector multiply example, Matrix A (row wise) and the Result C array meet these criteria. The directives let FPGA IP Builder remove unnecessary buffers and related circuitry, which makes the design more efficient. Here, Slice Register utilization falls from 3.1% to 0.8% and latency falls from 83 to 69 cycles, while the Initiation Interval and throughput are unchanged.
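
The access-pattern requirement can be illustrated with a small streaming sketch in Python. The code below assumes a row-wise stream of Matrix A elements and a second input vector, here called `vector_b` (a hypothetical name; the source names only Matrix A and Result C). Each element of A is consumed exactly once, in order, and each element of C is produced exactly once, in order, so neither needs to be stored as a whole array. The vector is indexed again for every row, so it does not meet the same criteria.

```python
def streaming_matvec(a_stream, vector_b, rows):
    """Consume Matrix A element by element (row wise) and yield
    Result C element by element."""
    cols = len(vector_b)
    for _ in range(rows):
        acc = 0
        for col in range(cols):
            # Each element of A is read once, sequentially, so it never
            # has to be buffered. vector_b is re-read for every row.
            acc += next(a_stream) * vector_b[col]
        # Each element of C is written once, sequentially.
        yield acc


# A 2 x 2 example: A = [[1, 2], [3, 4]] streamed row wise, B = [5, 6].
a_elements = iter([1, 2, 3, 4])
print(list(streaming_matvec(a_elements, [5, 6], rows=2)))
```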


Top Level Directives: Clock Rate (MHz) : 200 MHz, Initiation Interval : 1

Interface Directives: Matrix A (row wise), Result C: Element-by-element, unbuffered


| | Slice Registers (%) | Slice LUTs (%) | DSP48s (%) | BRAMs (%) | Clock Rate (MHz) | Initiation Interval (cycles) | Min. Latency (cycles) | Max. Latency (cycles) | Throughput (MS/s) |
|---|---|---|---|---|---|---|---|---|---|
| Iter. 1 | 0.1% | 0.3% | 0.5% | 1.4% | 40 | 423 | 358 | 422 | 6.05 |
| Iter. 2 | 3.1% | 0.8% | 3.6% | 0% | 200 | 64 | 83 | 83 | 200 |
| Iter. 3 | 0.8% | 0.6% | 3.6% | 0% | 200 | 64 | 69 | 69 | 200 |


Iteration 4

To decrease the number of multipliers (DSP48 elements) required by the design, set the Share Multipliers directive. Multipliers are often a scarce resource on FPGAs, so LabVIEW FPGA IP Builder offers a directive intended exclusively to optimize their use. The Share Multipliers directive instructs the tool to generate implementations in which different parts of the design reuse the same multipliers. LabVIEW FPGA IP Builder multiplexes multiple streams of data onto a shared multiplier resource, which in this case brings the number of multipliers down from 8 to 2.

Also notice that 64 is the lowest Initiation Interval that FPGA IP Builder can achieve for this design, so you can set the directive to 64.
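
A rough model shows why sharing multipliers need not reduce throughput here. An 8 x 8 matrix-vector product requires 64 multiplications per call, and a new call starts only every 64 cycles. The Python sketch below assumes that each multiplier can start one multiplication per cycle and that products are assigned round-robin; both are simplifying assumptions for illustration, not a description of the schedule FPGA IP Builder generates.

```python
import math


def cycles_needed(multiplications, multipliers):
    """Cycles to issue all multiplications, one per multiplier per cycle."""
    return math.ceil(multiplications / multipliers)


def share_schedule(multiplications, multipliers):
    """Round-robin assignment: entry m lists the product indices that
    multiplier m handles, in cycle order."""
    return [list(range(m, multiplications, multipliers))
            for m in range(multipliers)]


initiation_interval = 64  # cycles, from the estimate for this design
for count in (8, 2):
    cycles = cycles_needed(64, count)
    print(count, "multipliers:", cycles, "cycles, fits in interval:",
          cycles <= initiation_interval)
```

Under these assumptions, both 8 and 2 multipliers finish the 64 products within the 64-cycle interval, which is consistent with throughput staying at 200 MS/s while DSP48 utilization drops.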


Top Level Directives: Clock Rate (MHz) : 200 MHz, Initiation Interval : 64, Share Multipliers : True

Interface Directives: Matrix A (row wise), Result C: Element-by-element, unbuffered


| | Slice Registers (%) | Slice LUTs (%) | DSP48s (%) | BRAMs (%) | Clock Rate (MHz) | Initiation Interval (cycles) | Min. Latency (cycles) | Max. Latency (cycles) | Throughput (MS/s) |
|---|---|---|---|---|---|---|---|---|---|
| Iter. 1 | 0.1% | 0.3% | 0.5% | 1.4% | 40 | 423 | 358 | 422 | 6.05 |
| Iter. 2 | 3.1% | 0.8% | 3.6% | 0% | 200 | 64 | 83 | 83 | 200 |
| Iter. 3 | 0.8% | 0.6% | 3.6% | 0% | 200 | 64 | 69 | 69 | 200 |
| Iter. 4 | 1.0% | 1.0% | 0.9% | 0% | 200 | 64 | 73 | 73 | 200 |


Refer to the product documentation (see problem #3 and possible solutions) for more tips on improving FPGA resource utilization.


Takeaway

For certain designs, you can use Share Multipliers and interface directives to reduce resource utilization while still retaining gains in throughput and reductions in latency.

Using directive-driven optimization, you can achieve, within minutes, about a 33X improvement in throughput and roughly one-fifth the latency, in cycles, of the original design.





4. Conclusion

You can use FPGA IP Builder to automatically optimize your high-level algorithm VIs for your FPGA. As the examples show, within a few iterations you can obtain up to a 160X increase in throughput without code modifications. The quick estimation process typically takes under a minute per iteration and enables you to iteratively optimize your code within minutes.



5. Next Steps

Starting in 2014, FPGA IP Builder is a feature of the LabVIEW FPGA Module. The principles outlined in this paper can help you get started with algorithm optimization. In many cases, you can extract even higher performance by starting with well-designed code than by using directives alone. For code creation best practices, training, and examples, visit the FPGA IP Builder Community. You can also communicate directly with FPGA IP Builder developers on the community by asking them specific questions.

