Get Moving with Alveo: Guided Software Examples
Full source for the examples in this article can be found here: https://github.com/xilinx/get_moving_with_alveo
For our first example, let’s look at how to load images onto the Alveo™ Data Center accelerator card. When you power on the system, the Alveo card will initialize its shell (see figure 2.1). Recall from earlier that the shell implements connectivity with the host PC but leaves most of the logic as a blank canvas for you to build designs. Before we can use that logic in our applications, we must first configure it.
Also, recall from earlier that certain operations are inherently “expensive” in terms of latency. Configuring the FPGA is, in fact, one of the most inherently time-consuming parts of the application flow. To get a feel for exactly how expensive, let’s try loading an image.
In this example, we initialize the OpenCL runtime API for XRT, create a command queue, and, most importantly, configure the Alveo card FPGA. This is generally a one-time operation: once the card is configured, it will typically remain configured until power is removed or it is reconfigured by another application. Note that if multiple independent applications attempt to load hardware into the card, the second application will be blocked until the first one relinquishes control, although multiple independent applications can share the same image running on a card.
To begin, we must include the headers as shown in listing 3.1. Note that the line numbers in the documentation correspond to the line numbers in the file 00_load_kernels.cpp.
Listing 3.1: XRT and OpenCL Headers
// Xilinx OpenCL and XRT includes
#include"xcl2.hpp"
#include <CL/cl.h>
Of these two, only CL/cl.h is required. xcl2.hpp is a library of Xilinx-provided helper functions that wrap some of the required initialization functions.
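To make the convenience concrete, here is a rough sketch of the platform scan a helper like xcl::get_xil_devices() performs. This is our approximation of the idea, not the actual xcl2.hpp implementation, and it assumes the OpenCL C++ bindings have already been included:

// Approximate sketch of what a helper like xcl::get_xil_devices() does;
// the real xcl2.hpp implementation may differ in its details.
std::vector<cl::Device> get_xilinx_devices()
{
    // Enumerate every OpenCL platform installed on the host
    std::vector<cl::Platform> platforms;
    cl::Platform::get(&platforms);

    for (auto &platform : platforms) {
        // XRT registers itself under the platform name "Xilinx"
        if (platform.getInfo<CL_PLATFORM_NAME>() == "Xilinx") {
            std::vector<cl::Device> devices;
            platform.getDevices(CL_DEVICE_TYPE_ACCELERATOR, &devices);
            return devices;
        }
    }
    return {}; // No Xilinx platform found
}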
Once we include the appropriate headers, we need to initialize the command queue, load the binary file, and program it into the FPGA, as shown in listing 3.2. This is effectively boilerplate code you’ll need to include in every program at some point.
Listing 3.2: OpenCL Initialization
// This application will use the first Xilinx device found in the system
std::vector<cl::Device> devices = xcl::get_xil_devices();
cl::Device device = devices[0];
cl::Context context(device);
cl::CommandQueue q(context, device);
std::string device_name = device.getInfo<CL_DEVICE_NAME>();
std::string binaryFile = xcl::find_binary_file(device_name, argv[1]);
cl::Program::Binaries bins = xcl::import_binary_file(binaryFile);
devices.resize(1);
cl::Program program(context, devices, bins);
The workflow here can be summed up as follows: we find the Xilinx devices installed in the system, select the first one, create a context and command queue on it, and then locate and import the xclbin file before programming the device.
Line 44 is where the programming operation is actually triggered. During the programming phase, the runtime checks the current Alveo card configuration. If the card is already programmed with the requested binary, the runtime can return after loading the device metadata from the xclbin; if not, it programs the device now.
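If you want to confirm explicitly that the programming step succeeded, the cl::Program constructor takes optional status outputs. The following fragment is a sketch we have added on top of listing 3.2, not part of the example source; it assumes <iostream> and <cstdlib> are available:

// Continuing from listing 3.2: program the device with explicit error checks.
// binary_status receives one status code per device; err is the overall result.
cl_int err = CL_SUCCESS;
std::vector<cl_int> binary_status;
cl::Program program(context, devices, bins, &binary_status, &err);
if (err != CL_SUCCESS) {
    std::cerr << "Failed to program the Alveo card (error code " << err << ")\n";
    return EXIT_FAILURE;
}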
With XRT initialized, run the application with the following command from the build directory:
./00_load_kernels alveo_examples
The program will output a message similar to this:
-- Example 0: Loading the FPGA Binary --
Loading XCLBin to program the Alveo board:
Found Platform
Platform Name: Xilinx
XCLBIN File Name: alveo_examples
INFO: Importing ./alveo_examples.xclbin
Loading: './alveo_examples.xclbin'
FPGA programmed, example complete!
-- Key execution times --
OpenCL Initialization : 1624.634 ms
Note that our FPGA took 1.6 seconds to initialize. Be aware of this kernel load time; it includes disk I/O, PCIe latency, configuration overhead, and a number of other operations. Usually you will want to configure the FPGA during your application’s startup, or even pre-configure it. Let’s run the application again with the bitstream already loaded:
-- Key execution times --
OpenCL Initialization : 262.374 ms
0.26 seconds is much better than 1.6 seconds! We still have to read the file from disk, parse it, and verify that the xclbin already loaded into the FPGA matches it, but the overall initialization time is significantly lower.
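As an aside, the reported times are measured in the host code. A simple way to time a span like this is with std::chrono; the following sketch shows the general technique (the example source may implement its timer differently):

#include <chrono>
#include <iostream>

// Wrap the initialization code from listing 3.2 in a wall-clock timer
auto start = std::chrono::high_resolution_clock::now();
// ... device discovery, context/queue creation, cl::Program construction ...
auto stop = std::chrono::high_resolution_clock::now();

std::chrono::duration<double, std::milli> elapsed = stop - start;
std::cout << "OpenCL Initialization : " << elapsed.count() << " ms" << std::endl;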
Some things to try to build on this experiment: time the initialization with the xclbin stored on a faster or slower disk to see how much of the load time is file I/O, or launch two instances of the application at once and observe the second one block until the first releases the card.
Now that we can load images into the FPGA, let’s run something!
Rob Armstrong leads the AI and Software Acceleration technical marketing team at AMD, bringing the power of adaptive compute to bear on today’s most exciting challenges. Rob has extensive experience developing FPGA and ACAP accelerated hardware applications ranging from small-scale, low-power edge applications up to high-performance, high-demand workloads in the datacenter.