Artavazd Khachatryan, Grovf LLC
Implementing an in-memory key-value store, such as memcached, with 10 Gb/s line-rate processing for all packet sizes and a 31X improvement in requests per second per watt compared with a standard array of x86 servers.
Using NI hardware and the LabVIEW FPGA Module to rapidly prototype our key-value database store architecture and prove that FPGAs benefit applications that are memory intensive and parallel in nature.
Artavazd Khachatryan - Grovf LLC
Khachik Sahakyan - Grovf LLC
Grovf LLC is an engineering company that develops hardware and software products to address Industrial Internet of Things (IIoT) and “big data” problems. At Grovf, we help our customers solve Internet of Things (IoT)-generated big data storage problems with a hardware (FPGA) implementation of time-series database functionality that closes the gap between IoT-generated data throughput and software database transaction speed. This results in transactions that are 10X faster with 3X lower power consumption.
The performance of well-known key-value store implementations, such as memcached, is still substantially below the maximum packet rate of 10 Gb/s Ethernet. Currently, these systems are implemented on x86 CPU-based servers. However, typical x86-based systems offer limited performance scalability and high power consumption because their architecture, optimized for single-thread performance, is poorly matched to the memory-intensive and parallel nature of this application. We decided to implement a softwareless key-value system on a single FPGA fabric, including a full 10 Gb/s TCP stack, the core key-value database operations (such as read, write, and delete) built on hash tables, and DRAM connectivity for data storage.
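The maximum packet rate of 10 Gb/s Ethernet follows from its framing: besides the frame itself, every frame occupies 20 extra bytes on the wire (a 7-byte preamble, a 1-byte start-of-frame delimiter, and a 12-byte inter-frame gap). The short Python sketch below computes that ceiling for a given frame size; the function and constant names are our own, chosen for illustration.

```python
# Per-frame overhead on the wire for Ethernet (IEEE 802.3):
# 7-byte preamble + 1-byte start-of-frame delimiter, plus a 12-byte inter-frame gap.
PREAMBLE_SFD_BYTES = 8
INTER_FRAME_GAP_BYTES = 12
MIN_FRAME_BYTES = 64


def max_frame_rate(line_rate_bps: float, frame_bytes: int) -> float:
    """Maximum frames per second at a given line rate and frame size.

    frame_bytes is the Ethernet frame size including header and FCS
    (64 bytes minimum, 1518 bytes for a standard untagged frame).
    """
    if frame_bytes < MIN_FRAME_BYTES:
        raise ValueError("Ethernet frames are at least 64 bytes")
    bits_on_wire = (frame_bytes + PREAMBLE_SFD_BYTES + INTER_FRAME_GAP_BYTES) * 8
    return line_rate_bps / bits_on_wire


if __name__ == "__main__":
    # Print the 10 Gb/s ceiling for a few common frame sizes.
    for size in (64, 128, 256, 512, 1024, 1518):
        rate = max_frame_rate(10e9, size)
        print(f"{size:5d}-byte frames: {rate / 1e6:6.2f} Mframes/s")
```

For minimum-size 64-byte frames this gives about 14.88 million frames per second, which is why sustaining line rate for small packets is the hardest case for a software stack.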
The recent rise of IoT and big data has irreversibly changed how the database industry must evolve. We expect 30 to 50 billion IoT devices to be connected by 2020 (compared to 6.4 billion in 2016). Growing data volumes increase the demand for higher database transaction performance, and with it the demand for dedicated hardware for IIoT-generated data storage that replaces software to accelerate the overall system. General-purpose processors are inefficient for this workload because of its random memory access pattern and large memory footprint, which leads to considerable energy waste. Finally, the high latency of the operating system's communication stack heavily degrades both throughput and response time.
Our key objectives in designing an FPGA-based database server were 10 Gb/s line-rate processing with a scalable architecture, minimal latency, and power efficiency. To achieve these goals, we selected a Xilinx Kintex-7 FPGA as the processing unit. The system, deployed on a single FPGA, acts as a TCP server on a 10 G Ethernet network: it accepts TCP traffic and parses it into keys and values. A hash table then stores and retrieves the keys and values efficiently. Finally, a DRAM controller reads and writes variable-length values to and from DRAM.
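The article does not specify which request format the TCP server parses into keys and values. Assuming memcached's ASCII text protocol (the store is described as memcached-like), the parsing step might look roughly like the following Python sketch. Because a request can span several TCP segments, the parser reports when it needs more bytes instead of failing. `Request` and `parse_request` are hypothetical names, and only single-key `get`, `set`, and `delete` are handled.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Request:
    op: str                        # "get", "set", or "delete"
    key: bytes
    value: Optional[bytes] = None  # only present for "set"


def parse_request(buf: bytes) -> Optional[Tuple[Request, int]]:
    """Parse one request from the front of a TCP byte stream.

    Returns (request, bytes_consumed), or None if the buffer does not yet
    hold a complete request. Assumes a subset of the memcached ASCII
    protocol: flags, exptime, and noreply fields are ignored.
    """
    end = buf.find(b"\r\n")
    if end < 0:
        return None  # command line not complete yet
    parts = buf[:end].split()
    if not parts:
        raise ValueError("empty command line")
    cmd = parts[0]
    if cmd == b"get" and len(parts) == 2:
        return Request("get", parts[1]), end + 2
    if cmd == b"delete" and len(parts) >= 2:
        return Request("delete", parts[1]), end + 2
    if cmd == b"set" and len(parts) >= 5:
        # set <key> <flags> <exptime> <bytes>\r\n<data>\r\n
        length = int(parts[4])
        data_start = end + 2
        data_end = data_start + length
        if len(buf) < data_end + 2:
            return None  # value block not fully received yet
        if buf[data_end:data_end + 2] != b"\r\n":
            raise ValueError("value block not terminated by CRLF")
        return Request("set", parts[1], buf[data_start:data_end]), data_end + 2
    raise ValueError(f"unsupported command: {buf[:end]!r}")


if __name__ == "__main__":
    stream = b"set foo 0 0 3\r\nbar\r\nget foo\r\n"
    first, used = parse_request(stream)
    second, _ = parse_request(stream[used:])
    print(first, second)
```

Returning the number of consumed bytes mirrors how a streaming datapath would advance through a receive buffer one request at a time.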
Before adopting the LabVIEW FPGA Module, we performed this large HDL development exclusively in VHDL in the Vivado IDE and then integrated it into LabVIEW FPGA as a CLIP or IP Node to deploy it on NI hardware. However, we found that HDL design workflows are time intensive, especially for proof-of-concept work. By moving to LabVIEW FPGA as our main development tool, we reduced HDL development time and gained considerable flexibility for further improvements and changes. The VHDL/Verilog integration tools, such as CLIP and IP Node, let us preserve the work we had already done by integrating existing HDL into LabVIEW FPGA.
As a hardware platform, we chose a PXI Express system with the PXIe-6592 high-speed serial instrument module, which is well suited to high-speed computing because it combines a Xilinx Kintex-7 FPGA with 2 GB of directly connected DRAM and SFP+ connectors for 10 G Ethernet networks.
During system development, we found that, when used in conjunction with other HDL designs, LabVIEW FPGA offers a wide range of functionality. Many open-source HDL implementations can be integrated into LabVIEW FPGA through CLIP and IP Nodes, which lets us reuse existing, time-proven HDL designs rather than reinventing the wheel.
NI’s platform provides a complete bundle of hardware and software for rapidly developing proofs of concept and carrying them forward into industrial products. In particular, by using LabVIEW FPGA with high-speed serial instruments, we gained the full power of the LabVIEW FPGA graphical programming workflow while keeping the effort of deploying the design on the FPGA fabric minimal.
The key drawback of hardware-accelerated design is the increased development time that comes with the low-level nature and complexity of the traditional VHDL/Verilog hardware design flow. The LabVIEW FPGA graphical programming approach shows promising results in reducing development, verification, and performance optimization effort. Thanks to the NI platform, our database server is not only highly optimized compared to existing solutions but also flexible enough to meet new requirements.
HDL design is always challenging, especially for large systems. Implementing the TCP stack was the most challenging part because it requires large memory buffers and complex logic to handle TCP/IP; even the most basic implementation needs a memory buffer and a complex state machine. We overcame this challenge by implementing everything above the MAC layer in LabVIEW FPGA, while everything below it was implemented in VHDL and imported into LabVIEW FPGA as a Socketed CLIP. For the DRAM controller, which reads and writes variable-length values, we built on the existing DRAM connectivity libraries in LabVIEW FPGA. We implemented the hash tables and key-value database operations entirely in LabVIEW FPGA.
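The hash tables and key-value operations were built graphically in LabVIEW FPGA, and the article does not describe their organization. As a rough software model only, the Python sketch below assumes a set-associative table with a fixed number of slots per bucket (a common hardware choice, since on-chip memory cannot grow), FNV-1a as the hash function, and a flat byte array standing in for DRAM with a simple bump allocator for variable-length values. `KVStore`, `fnv1a_32`, and all parameters are hypothetical.

```python
from typing import List, Optional, Tuple

FNV_OFFSET = 2166136261
FNV_PRIME = 16777619

Slot = Optional[Tuple[bytes, int, int]]  # (key, dram_offset, length) or None


def fnv1a_32(data: bytes) -> int:
    """32-bit FNV-1a hash: only XOR and multiply per byte."""
    h = FNV_OFFSET
    for b in data:
        h ^= b
        h = (h * FNV_PRIME) & 0xFFFFFFFF
    return h


class KVStore:
    """Fixed-size, set-associative hash table whose values live in a flat
    byte array standing in for DRAM. A full bucket rejects new keys."""

    def __init__(self, num_buckets: int = 1024, ways: int = 4,
                 dram_bytes: int = 1 << 20):
        self.num_buckets = num_buckets
        self.table: List[List[Slot]] = [[None] * ways for _ in range(num_buckets)]
        self.dram = bytearray(dram_bytes)
        self.next_free = 0  # bump allocator: space is never reclaimed

    def _bucket(self, key: bytes) -> List[Slot]:
        return self.table[fnv1a_32(key) % self.num_buckets]

    def write(self, key: bytes, value: bytes) -> bool:
        if self.next_free + len(value) > len(self.dram):
            return False  # out of value memory
        bucket = self._bucket(key)
        # Overwrite the existing slot for this key, else take the first free slot.
        index = next((i for i, s in enumerate(bucket) if s and s[0] == key), None)
        if index is None:
            index = next((i for i, s in enumerate(bucket) if s is None), None)
        if index is None:
            return False  # all ways in this bucket are occupied
        offset = self.next_free
        self.dram[offset:offset + len(value)] = value  # variable-length write
        self.next_free += len(value)
        bucket[index] = (key, offset, len(value))
        return True

    def read(self, key: bytes) -> Optional[bytes]:
        for slot in self._bucket(key):
            if slot and slot[0] == key:
                _, offset, length = slot
                return bytes(self.dram[offset:offset + length])
        return None

    def delete(self, key: bytes) -> bool:
        bucket = self._bucket(key)
        for i, slot in enumerate(bucket):
            if slot and slot[0] == key:
                bucket[i] = None  # frees the slot; value bytes are not reclaimed
                return True
        return False
```

Keeping the table geometry fixed means every lookup touches exactly one bucket of known size, which is what lets a hardware pipeline compare all ways in parallel in a bounded number of cycles.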
We used LabVIEW FPGA and high-speed serial instruments to achieve the following results:
Performance: Measured in successfully served requests per second, the system handles up to 12.8 million requests per second (MR/s) with consistent 10 Gb/s line-rate processing for any packet size, outperforming other known implementations by 10X.
Round-trip latency: We measured round-trip times between 4.9 µs and 6 µs, depending on packet size, a two-orders-of-magnitude improvement over standard x86-based approaches.
Power efficiency: The FPGA fabric achieves up to 220 K RPS/W (requests per second per watt).
Our novel FPGA architecture delivers a maximum round-trip time of 6 µs and a 31X increase in RPS/W over the best published x86 numbers, showing that FPGAs are promising computing platforms for the NoSQL database industry and will play a significant role in the near future.
The IIoT-enabled world brings new challenges for time-series (key-value) databases and memcached servers. Correspondingly, the data storage industry has become extremely competitive and lucrative, and it demands constant improvement. Several institutions around the world are evaluating FPGA use cases in the datacenter and cloud computing industries, but at the moment the market lacks turnkey solutions.
Our investigation into introducing new technologies, such as FPGAs, as computing units in the datacenter and cloud computing industries led us to the NI hardware and software platform. We were looking for flexible tools to create a proof of concept powerful enough to exceed well-known systems by at least an order of magnitude in all respects. As a continuation of our work, we plan to add external SAS 3 connectivity to SSDs at a 12 Gb/s line rate using the PXIe-6591 high-speed serial instrument, increasing storage capacity and providing non-volatile storage.
This research was supported by an IBM ISTC research grant and the Armenian National Engineering Laboratories (ANEL).
Artavazd Khachatryan
Grovf LLC
Engineering City
Yerevan
Armenia
Tel: +374 94 618089
support@grovf.com