Data Streaming Optimization
- Updated 2026-03-09
- 4 minute read
Buffer Management
- Use large, pre-allocated buffers to reduce overhead.
- Data coming out of a P2PDMA Read node in LabVIEW FPGA can have high and variable latency, because it is read from DRAM that may have other activity from the PCIe bus. Directly feeding this data into a fixed-speed data transport such as Aurora at high rates will result in data loss, because new data cannot be reliably supplied on each clock cycle.
- To interface the data with a fixed-speed bus, pass the data through a large buffer, such as URAM or BRAM, that can reliably output data on each cycle. The buffer may need to hold 4,000 to 16,000 cycles of data to smooth out the DRAM interface jitter.
- Run the P2PDMA FIFO interface in a clock domain faster than the fixed-speed bus data rate so that when lags occur in the DRAM data channel, the data can catch up again.
- When sending data out of the PXIe-8290 to a fixed-speed device (for example, transmitting data to a VST), completely fill the buffers in the transmit direction before the fixed-speed device starts consuming data (for example, before the start trigger in the VST or other PXIe I/O endpoint starts data flow). This gives clean startup behavior: the DRAM provides its full bandwidth to the reader, then settles into steady state as the writer starts refilling the empty space. If the writer uses all the bandwidth initially, the reader starts with reduced throughput until it reaches steady-state operation.
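The buffering guidance above can be illustrated with a toy clock-cycle model. The numbers below are illustrative, not measured PXIe-8290 behavior: a producer in a faster clock domain goes quiet periodically (modeling DRAM/PCIe latency jitter), while the fixed-speed bus consumes exactly one word per cycle. A pre-filled buffer deep enough to cover the stall avoids underflow; a shallow one does not.

```python
def simulate(depth_words, cycles, stall_len=2000, period=6000):
    """Count fixed-speed-bus underflows for a given smoothing-buffer depth."""
    buf = depth_words        # buffer is pre-filled before consumption starts
    underflows = 0
    for t in range(cycles):
        # The DRAM side goes quiet for stall_len cycles out of every period,
        # a stand-in for PCIe/DRAM latency jitter.
        if (t % period) >= stall_len:
            buf = min(depth_words, buf + 2)  # faster clock domain catches up
        if buf > 0:
            buf -= 1                         # fixed-speed bus: 1 word/cycle
        else:
            underflows += 1                  # bus starved: data loss
    return underflows

print(simulate(4000, 12000))  # deep buffer rides out the stalls
print(simulate(1000, 12000))  # shallow buffer underflows
```

A buffer that covers the longest expected quiet spell (here, 2,000 cycles) never starves the bus, which matches the 4,000 to 16,000 cycle sizing suggestion above.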
DRAM Performance and Interleaving
- For best performance, do not use the four DRAM banks for both the PXIe-8290 streaming functionality (P2PDMA) and LabVIEW FPGA memory access at the same time.
- For maximum performance of two streams at up to 10 GB/s each, streaming data must be
interleaved between two DRAM banks per stream (thereby consuming all four of the
available banks for two streams).
- When interleaving is enabled, a stream can use either DRAM banks 0 and 1 interleaved together or banks 2 and 3 interleaved together. Either grouping has similar performance, with both fully supporting 10 GB/s. When read/write transactions swap from one DRAM bank to the other, latency changes, which is why the buffer management suggestions above are necessary. Note that the data path from each of the two interleaved bank groupings propagates through a separate PCIe bus to the Ethernet NIC (unlike when interleaving is disabled; see below).
- To achieve 10 GB/s performance, the host application must ensure that the reader and
writer never access the same DRAM bank at the same time. To do this, do the
following:
- Choose a large P2P DMA FIFO size (4/8/16 GB).
- Choose an interleaving cadence that is half the FIFO size (2/4/8 GB).
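The two sizing rules above reduce to a simple relationship: the interleaving cadence is half the P2P DMA FIFO size, so the reader and writer always occupy different halves of the FIFO (and therefore different DRAM banks). A minimal sketch, with an illustrative helper name:

```python
GIB = 1 << 30  # bytes per GB (power-of-two convention)

def interleaving_cadence(fifo_size_bytes: int) -> int:
    """Return the interleaving cadence (bytes) for a P2P DMA FIFO size:
    half the FIFO, per the sizing rule above."""
    return fifo_size_bytes // 2

# The supported FIFO sizes map to the cadences listed above.
for fifo_gb in (4, 8, 16):
    cadence_gb = interleaving_cadence(fifo_gb * GIB) // GIB
    print(f"{fifo_gb} GB FIFO -> {cadence_gb} GB cadence")
```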
- Most applications have a rate limited reader or writer such as a VST and a non-rate
limited reader or writer such as the PXIe-8290 NIC. To ensure
both reader and writer do not access the same DRAM bank, when
the non-rate limited reader or writer finishes its current
segment (interleaving cadence), the user application must poll
until the rate limited reader or writer moves off the next DRAM bank.
- The user application can do this by polling until the next segment is entirely available. You can do this with the GetBytesAvailable call. The application should poll until GetBytesAvailable is greater than the interleaving cadence.
- For optimal performance, this polling thread should have a 1 ms wait.
- After GetBytesAvailable is greater than the interleaving
cadence, the application should immediately acquire all
regions in the segment and pass those regions into RDMA
for the NIC to use.
- Because the NIC runs faster than the rate limited FIFO, if started quickly enough, it completes its data transfer faster and begins to poll again.
- Excessive system jitter can sometimes cause the RDMA transactions to start too late, resulting in underflows or overflows. Elevating the priority of the LabVIEW process can help mitigate this.
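The polling flow described above can be sketched as follows. `get_bytes_available`, `acquire_regions`, and `post` are hypothetical stand-ins for the application's actual FIFO and NI-RDMA calls, not real driver APIs:

```python
import time

def drain_segments(fifo, rdma, cadence_bytes, n_segments, poll_s=0.001):
    """Move n_segments of cadence_bytes each from the FIFO into RDMA.

    fifo and rdma are placeholder objects; the method names stand in for
    the application's real FIFO and NI-RDMA calls.
    """
    for _ in range(n_segments):
        # Poll until the rate-limited side has moved off the next DRAM bank,
        # i.e. a whole segment (one interleaving cadence) is available.
        while fifo.get_bytes_available() <= cadence_bytes:
            time.sleep(poll_s)  # ~1 ms wait keeps polling overhead low
        # Immediately acquire every region in the segment and hand the
        # regions to RDMA so the NIC can drain them before the next poll.
        regions = fifo.acquire_regions(cadence_bytes)
        rdma.post(regions)
```

Because the NIC side is not rate limited, each posted segment finishes ahead of the rate-limited side, and the loop returns to polling, keeping the two sides on different banks.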
- For performance of 5 GB/s to 8 GB/s per stream, interleaving is not required.
- With interleaving disabled, you can use each of the four DRAM banks independently, with one or more streams per bank. Because a stream's read and write pointers then access a single DRAM bank, performance is limited to a maximum of 8 GB/s, assuming the stream has sole access to that bank. If multiple streams share a DRAM bank, performance is reduced much further.
- The four DRAM banks do not share identical performance characteristics
(see the following table). Note that banks 0 and 1 share one PCIe link,
and banks 2 and 3 share a second PCIe link. For best performance to the
NIC, balance total streaming throughput between the two PCIe links. The
latency row indicates PCIe to DRAM latency across the Network-On-Chip in
Versal, and the differences between banks do not impact performance in
an appreciable way.
| DRAM Bank Number | Throughput and Data Rate | Latency | PCIe Bus |
|---|---|---|---|
| 0 | Fastest (3900 MT/s) | Lowest | 0 |
| 1 | Fast (3200 MT/s) | High | 0 |
| 2 | Faster (3700 MT/s) | Lower | 1 |
| 3 | Fastest (3900 MT/s) | Low | 1 |
RDMA
- When opening an NI-RDMA session, the max outstanding requests control specifies how many buffers are transferred in parallel. Set it to at least 2 to reduce the idle time between completing one transfer and starting the next. Larger numbers of outstanding requests result in slightly worse DRAM throughput, because the accesses are more fragmented. Use larger numbers only if the server side of the application requires them.
- Note The PXIe-8290 is designed for use with RDMA. Using other protocols requires additional validation to understand performance and behavior. Any use that passes data to the PXI controller should work and is limited by the backplane and controller bandwidth. Peer-to-peer communication with the P2PDMA buffers works with only certain offloaded protocols such as RDMA.
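A toy timing model shows why at least two outstanding requests matter. With one request in flight, the link idles during the software turnaround between transfers; with two, the next buffer is already queued when the current one completes. The timings are illustrative, not NI-RDMA measurements:

```python
def link_busy_fraction(n_outstanding, n_buffers=100, t_xfer=1.0, t_post=0.2):
    """Fraction of wall time the link spends moving data, given how many
    requests are kept outstanding. t_xfer is one buffer's transfer time;
    t_post is the software turnaround to post the next request."""
    t = 0.0                                      # current time
    busy = 0.0                                   # total time transferring
    ready = [0.0] * min(n_outstanding, n_buffers)  # requests posted up front
    posted = len(ready)
    for _ in range(n_buffers):
        start = max(t, ready.pop(0))             # link waits for a request
        t = start + t_xfer
        busy += t_xfer
        if posted < n_buffers:
            ready.append(t + t_post)             # repost after turnaround
            posted += 1
    return busy / t

print(link_busy_fraction(1))  # link idles t_post between every transfer
print(link_busy_fraction(2))  # next buffer is always queued: no idle gaps
```

In this model, two outstanding requests fully hide the turnaround as long as `t_post < t_xfer`; going beyond that adds no link utilization, which is consistent with keeping the count small to avoid fragmenting DRAM accesses.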