Integrating optimized RTL Kernels into Accelerated Applications using Vitis Technology
Oct 11, 2019
This article focuses on the process used to integrate optimized RTL Kernels into an accelerated application.
Introduction
The Xilinx® Vitis™ unified software platform provides a framework for developing and delivering accelerated, heterogeneous compute applications based on industry-standard programming languages. The host application is written in C/C++ with OpenCL API calls. The host code offloads the application's compute-intensive tasks to be accelerated in the Programmable Logic within a Xilinx FPGA/SoC, or even within the AI Engines in our new Versal™ ACAP devices.

The host code can run on an x86 processor and interface with the acceleration kernels running within a Xilinx FPGA over the PCIe bus. An example of this type of acceleration platform is one or more Xilinx Alveo™ cards running in a data center server. With the Vitis unified software platform, the host code can also run on the Arm processor in our Zynq® UltraScale+™ and Versal devices and interface with the acceleration kernels running within the Programmable Logic via the AXI interconnect. This type of acceleration platform could be one of the Zynq UltraScale+ development platforms, such as the ZCU102 or ZCU104 cards, or a customer can develop their own Zynq UltraScale+ card tailored to their exact platform acceleration needs.
The Vitis tool allows the user to develop acceleration kernels in several different languages: C/C++, OpenCL, and RTL code. Kernels developed with C/C++ or OpenCL are compiled with the Vivado High-Level Synthesis (HLS) tool. Kernels developed with RTL code are compiled and packaged via the Vitis tool's packaging utility. There is also an RTL Kernel Wizard available that greatly simplifies this RTL packaging process. The Vitis tool makes it easy to combine/link the individual acceleration kernels into one accelerated application, regardless of the language(s) used to develop them.

Why RTL Kernels
One of the big advantages of the Vitis tool is that it supports higher-level languages, such as C/C++ and OpenCL, for the development of hardware acceleration kernels. These higher-level languages can enable productivity gains both in development time and simulation time compared with traditional RTL design flows. Some functions that could take hundreds of lines of RTL code to implement can be described in tens of lines of C/C++ code. Likewise, C/C++ simulation times for functions that process large amounts of data, such as video processing algorithms, can be on the order of 1000x faster than the equivalent RTL simulation, turning days of RTL simulation into a matter of minutes. So, with these apparent advantages, why would anyone want to use RTL to implement their acceleration kernels?
Here are three examples of why a customer may want to utilize RTL Kernels:
Example 1: A customer's application may be pushing the performance and/or area limits of the Xilinx device. Some customers with seasoned FPGA designers use low-level RTL coding techniques to squeeze the absolute maximum performance out of our devices, often achieving 500+ MHz performance in our latest UltraScale+ families with relatively complex designs. Our HLS compiler is getting better and better at delivering high-performance, area-optimized kernels coded in C/C++, but it is still hard to compete with an FPGA designer who is highly experienced in RTL coding best practices and timing closure techniques. This is similar to the comparison of coding in C/C++ for a microprocessor versus coding in assembly, especially in the earlier days when the software compilers were not as good as they are today.
Example 2: Many Xilinx customers have built up their own extensive library of parametrizable RTL functions over years of designing FPGAs into their systems. Many of these RTL functions have been optimized for area and performance. They have also been thoroughly verified both in simulation and in hardware running in real systems. Examples of these parametrizable RTL functions are FFT cores, FIR filter cores, and high-speed encryption/decryption cores. It makes a lot of sense to reuse these IP functions when possible.
Example 3: Some customers may be interested in purchasing an IP core that is critical to accelerating their design and the IP provider may only offer the IP as RTL source code. In this case, the customer would need to be able to add this function as an RTL Kernel.
RTL Kernel Hardware/Software Interface Requirements
For an RTL function to be used as an acceleration kernel within the Vitis tool framework, it must adhere to certain hardware and software interface requirements. These requirements may dictate some minor additions and/or modifications to the original RTL source code.
The following is a summary of the hardware interface requirements for RTL Kernels:
- The RTL Kernel must have one, and only one, AXI4-Lite interface, which is used as a control interface and to pass arguments from the host to the kernel.
- The RTL Kernel must also have at least one of the following interfaces (it can also have both):
  - An AXI4 master interface to communicate with memory
  - An AXI4-Stream interface to communicate with other kernels and/or with the host
- At least one clock port to clock the AXI interface logic as well as the kernel logic. There is also an option to have two clock ports: one to clock the AXI interface logic and a separate one to clock the kernel logic. This is useful if there is a requirement to run the kernel logic faster or slower than the AXI interfaces.
RTL Kernels have the same software interface model as C/C++ and OpenCL kernels. They are all seen by the host application as functions with a void return value, scalar arguments, and pointer arguments. Here is an example function prototype for a matrix multiplication kernel:
void mmult (unsigned int length, int *a, int *b, int *output)
The following is a summary of the software interface requirements for RTL Kernels:
- Scalar arguments are written directly to the kernel via the AXI4-Lite slave interface.
- Pointer arguments are transferred to/from global memory.
- Kernels are expected to read/write data in global memory through one or more AXI4 master interfaces, and/or to stream data directly between the host and kernel over AXI4-Stream interfaces.
- Kernels are controlled by the host application via a control register accessed through the AXI4-Lite interface.
Additional details on the hardware and software interface requirements can be found in the RTL Kernel chapter of the Vitis User Guide.
RTL Kernel Wizard
The RTL Kernel Wizard is a tool that helps automate the steps needed to ensure your RTL function is packaged into a kernel that can be successfully integrated into Vitis. This tool can be launched either from within Vitis or from within the Vivado IDE. It is recommended that the wizard be launched from the Vitis tool, since this method provides a more seamless experience by automatically importing the generated RTL Kernel and example host code back into the Vitis tool once the kernel is generated.
The RTL Kernel Wizard has the following high-level features and capabilities:
- Allows the user to enter information about their RTL acceleration kernel through a series of GUI screens:
  - General information, such as the kernel name and the number of clock ports.
  - The scalar arguments (name and type) used to pass control information from the host software to the kernel.
  - The kernel's AXI4 master interfaces: number of interfaces, interface name, interface width, number of arguments associated with each interface, and argument names.
  - The kernel's AXI4-Stream interfaces: number of interfaces, interface name, interface mode (master or slave), and interface width.
- After the user enters all information, a summary screen allows the user to check the entries for correctness before generating the kernel.

- Generating the RTL Kernel provides the following output products:
  - An RTL wrapper for the kernel which meets the RTL Kernel interface requirements.
  - An AXI4-Lite interface module, included in the RTL wrapper, with all the necessary control logic and registers.
  - An "example" kernel IP included inside the RTL wrapper. The example IP is a simple adder called VADD. The user will eventually replace this example IP with their own RTL kernel IP.
  - A Vivado project with an example design consisting of the RTL wrapper, the AXI4-Lite interface module, the example VADD kernel IP, and a SystemVerilog testbench which can be used to simulate the RTL Kernel.
  - Example host code which can be used to exercise the example RTL Kernel via Software and/or Hardware Emulation within the Vitis tool and in actual hardware.
  - A C++ simulation model for the example RTL Kernel IP.
Additional details on the RTL Kernel Wizard can be found in the RTL Kernel chapter of the Vitis User Guide.
Final Steps for RTL Kernel Integration
After the RTL Kernel has been designed and verified within Vivado, the final step is to package the RTL Kernel for use in Vitis. This can be done by simply clicking on the "Generate RTL Kernel" menu option within the Vivado Flow Navigator window.


Once the packaging options are selected, the user can click OK to generate the RTL Kernel object file (.xo file). If the RTL Kernel Wizard was launched from the Vitis tool, as recommended above, then the RTL Kernel object file and the example host code are automatically imported into the source folder within the Vitis tool.

At this point, the user can run Software Emulation of the host code and the RTL Kernel (if a C++ model of the RTL Kernel is available), run Hardware Emulation of the host code and RTL Kernel, and/or build the system, which generates the executable host code and an FPGA binary image that can be downloaded, run, and verified on a hardware platform. This hardware platform could be an Alveo card plugged into an x86 server or a Zynq UltraScale+ based platform. Detailed steps on how to run Software/Hardware Emulation and how to build the system can be found in the Vitis User Guide.
Where to Learn More?
If you are interested in learning more details about integrating optimized RTL Kernels into an Accelerated Application within the Vitis tool, I recommend referencing the following document.
Vitis Application Acceleration Development Guide (UG1393)
I also recommend going through the following hands-on tutorials which provide step-by-step instructions on how to integrate RTL Kernels into your application.
https://github.com/Xilinx/Vitis-Tutorials/tree/master/docs/getting-started-rtl-kernels
https://github.com/Xilinx/Vitis-Tutorials/tree/master/docs/mixing-c-rtl-kernels