Full source for the examples in this article can be found here: https://github.com/xilinx/get_moving_with_alveo
Ensuring that our allocated memory is aligned to page boundaries gave us a significant improvement over our initial configuration. There is another workflow we can use with OpenCL, though, which is to have OpenCL and XRT allocate the buffers and then map them to userspace pointers for use by the application. Let’s experiment with that and see the effect it has on our timing.
Conceptually this is a small change, but unlike Example 2 this example is a bit more involved in terms of the required code changes. This is mostly because instead of using standard userspace memory allocation, we’re going to ask the OpenCL runtime to allocate buffers for us. Once we have the buffers, we then need to map them into userspace so that we can access the data they contain.
For our allocation, we change from listing 3.10 to listing 3.11.
Listing 3.11: Allocating Aligned Buffers with OpenCL
std::vector<cl::Memory> inBufVec, outBufVec;
cl::Buffer a_buf(context,
static_cast<cl_mem_flags>(CL_MEM_READ_ONLY |
CL_MEM_ALLOC_HOST_PTR),
BUFSIZE*sizeof(uint32_t),
NULL,
NULL);
cl::Buffer b_buf(context,
static_cast<cl_mem_flags>(CL_MEM_READ_ONLY |
CL_MEM_ALLOC_HOST_PTR),
BUFSIZE*sizeof(uint32_t),
NULL,
NULL);
cl::Buffer c_buf(context,
static_cast<cl_mem_flags>(CL_MEM_WRITE_ONLY |
CL_MEM_ALLOC_HOST_PTR),
BUFSIZE*sizeof(uint32_t),
NULL,
NULL);
cl::Buffer d_buf(context,
static_cast<cl_mem_flags>(CL_MEM_READ_WRITE |
CL_MEM_ALLOC_HOST_PTR),
BUFSIZE*sizeof(uint32_t),
NULL,
NULL);
inBufVec.push_back(a_buf);
inBufVec.push_back(b_buf);
outBufVec.push_back(c_buf);
In this case we’re allocating our OpenCL buffer objects significantly earlier in the program, and we also don’t have userspace pointers yet. We can still, though, pass these buffer objects to enqueueMigrateMemObjects() and other OpenCL functions. The backing storage is allocated at this point; we just don’t have a userspace pointer to it.
The call to the cl::Buffer constructor looks very similar to what we had before. In fact, only two things have changed: we pass in the flag CL_MEM_ALLOC_HOST_PTR instead of CL_MEM_USE_HOST_PTR to tell the runtime that we want it to allocate a buffer instead of using an existing one. We also no longer need to pass in a pointer to the user buffer (since we’re allocating a new one), so we pass NULL instead.
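The relationship between the two flags and the host pointer argument can be sketched as a small check. This is a minimal illustration of the rules the OpenCL specification gives for clCreateBuffer, not code from the example sources: the constants are local stand-ins that mirror the bit values in the Khronos cl.h header, and hostPtrArgsAreValid is a hypothetical helper name.

```cpp
#include <cstdint>

// Local stand-ins mirroring the bit values in the Khronos cl.h header.
// Real code should include <CL/cl.h> and use the CL_MEM_* macros directly.
constexpr std::uint64_t kMemUseHostPtr   = 1u << 3;  // CL_MEM_USE_HOST_PTR
constexpr std::uint64_t kMemAllocHostPtr = 1u << 4;  // CL_MEM_ALLOC_HOST_PTR
constexpr std::uint64_t kMemCopyHostPtr  = 1u << 5;  // CL_MEM_COPY_HOST_PTR

// Hypothetical helper: returns true when the flags/host_ptr combination is
// one that clCreateBuffer accepts according to the OpenCL specification.
inline bool hostPtrArgsAreValid(std::uint64_t flags, const void* host_ptr) {
    const bool use   = (flags & kMemUseHostPtr) != 0;
    const bool alloc = (flags & kMemAllocHostPtr) != 0;
    const bool copy  = (flags & kMemCopyHostPtr) != 0;

    // USE_HOST_PTR cannot be combined with ALLOC_HOST_PTR or COPY_HOST_PTR.
    if (use && (alloc || copy)) return false;

    // A host pointer is required with USE/COPY and must be NULL otherwise.
    const bool needs_ptr = use || copy;
    return needs_ptr == (host_ptr != nullptr);
}
```

Under these rules, CL_MEM_ALLOC_HOST_PTR with NULL (what listing 3.11 does) is valid, while keeping the old user pointer alongside the new flag would be rejected.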
We then need to map our OpenCL buffers to the userspace pointers a, b, and d that we’ll use immediately in software. There’s no need to map a pointer to c at this time; we can do that later, when we need to read from that buffer after kernel execution. We do this with the code in listing 3.12.
Listing 3.12: Mapping Allocated Aligned Buffers to Userspace Pointers
uint32_t*a = (uint32_t*)q.enqueueMapBuffer(a_buf,
CL_TRUE,
CL_MAP_WRITE,
0,
BUFSIZE*sizeof(uint32_t));
uint32_t*b = (uint32_t*)q.enqueueMapBuffer(b_buf,
CL_TRUE,
CL_MAP_WRITE,
0,
BUFSIZE*sizeof(uint32_t));
uint32_t*d = (uint32_t*)q.enqueueMapBuffer(d_buf,
CL_TRUE,
CL_MAP_WRITE | CL_MAP_READ,
0,
BUFSIZE*sizeof(uint32_t));
Once we perform the mapping, we can use the userspace pointers as normal to access the buffer contents. One thing to note, though, is that the OpenCL runtime reference-counts the mapped buffers, so we need a corresponding call to enqueueUnmapMemObject() for each buffer that we map.
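One way to keep each map paired with exactly one unmap is a small scope guard. The sketch below is an illustration rather than part of the example sources: MappedBuffer, FakeQueue, FakeBuffer, and outstandingMapsAfterScope are hypothetical names, and the queue and buffer types are template parameters so that the sketch compiles without the OpenCL headers. The intent is that it could be instantiated with cl::CommandQueue and cl::Buffer, whose enqueueMapBuffer() and enqueueUnmapMemObject() calls it mirrors, but we have not built it against XRT.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Scope guard: maps in the constructor, enqueues the unmap in the destructor.
template <typename T, typename Queue, typename Buffer>
class MappedBuffer {
public:
    MappedBuffer(Queue& q, Buffer& buf, std::uint64_t map_flags, std::size_t count)
        : q_(q), buf_(buf),
          ptr_(static_cast<T*>(q.enqueueMapBuffer(buf, /*blocking=*/1u, map_flags,
                                                  /*offset=*/0, count * sizeof(T)))) {}

    ~MappedBuffer() {
        if (ptr_ != nullptr) q_.enqueueUnmapMemObject(buf_, ptr_);
    }

    // Non-copyable, so a mapping can never be unmapped twice.
    MappedBuffer(const MappedBuffer&) = delete;
    MappedBuffer& operator=(const MappedBuffer&) = delete;

    T* data() const { return ptr_; }
    T& operator[](std::size_t i) const { return ptr_[i]; }

private:
    Queue& q_;
    Buffer& buf_;
    T* ptr_;
};

// Stand-ins used only to exercise the guard without an OpenCL device.
struct FakeBuffer { std::vector<std::uint32_t> storage; };
struct FakeQueue {
    int outstanding = 0;  // maps that have not been unmapped yet
    void* enqueueMapBuffer(FakeBuffer& b, unsigned, std::uint64_t,
                           std::size_t offset, std::size_t) {
        ++outstanding;
        return reinterpret_cast<char*>(b.storage.data()) + offset;
    }
    int enqueueUnmapMemObject(FakeBuffer&, void*) { --outstanding; return 0; }
};

// Returns the number of mappings still open after the guard leaves scope.
inline int outstandingMapsAfterScope() {
    FakeQueue q;
    FakeBuffer a{std::vector<std::uint32_t>(16)};
    {
        MappedBuffer<std::uint32_t, FakeQueue, FakeBuffer> map_a(q, a, 0, 16);
        map_a[0] = 42;  // use the pointer as normal while mapped
    }                   // destructor enqueues the unmap here
    return q.outstanding;
}
```

With a real command queue the guards would need to go out of scope before the final q.finish(), since the destructor only enqueues the unmap.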
The execution flow through the kernel is the same, but we see something new when the time comes to migrate the output buffer back from the device. Rather than manually enqueueing a migration, we can instead just map the buffer. The OpenCL runtime will recognize that the buffer contents are currently resident in the Alveo™ Data Center accelerator card global memory and will take care of migrating the buffer back to the host for us. This is a coding style choice you must make, but fundamentally the code in listing 3.13 is sufficient to migrate c back to the host memory.
Listing 3.13: Mapping Kernel Output to a Userspace Pointer
uint32_t*c = (uint32_t*)q.enqueueMapBuffer(c_buf,
CL_TRUE,
CL_MAP_READ,
0,
BUFSIZE * sizeof(uint32_t));
Finally, as we mentioned earlier, you need to unmap the memory objects so that they can be destroyed cleanly by the runtime. We do this at the end of the program instead of using free() on the buffers as before. This must be done before the command queue is finished, as in listing 3.14.
Listing 3.14: Unmapping OpenCL-Allocated Buffers
q.enqueueUnmapMemObject(a_buf, a);
q.enqueueUnmapMemObject(b_buf, b);
q.enqueueUnmapMemObject(c_buf, c);
q.enqueueUnmapMemObject(d_buf, d);
q.finish();
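To summarize the key workflow for this use model, we need to:
1. Allocate our buffers using the CL_MEM_ALLOC_HOST_PTR flag.
2. Map our input buffers to userspace pointers so that we can populate them.
3. Map our output buffer to migrate it back to host memory.
4. Unmap all of our buffers once we’re done using them so that they can be destroyed cleanly.
With the XRT initialized, run the application by running the following command from the build directory: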
./03_buffer_map alveo_examples
The program will output a message similar to this:
-- Example 3: Allocate and Map Contiguous Buffers --
Loading XCLBin to program the Alveo board:
Found Platform
Platform Name: Xilinx
XCLBIN File Name: alveo_examples
INFO: Importing ./alveo_examples.xclbin
Loading: ’./alveo_examples.xclbin’
Running kernel test with XRT-allocated contiguous buffers
OCL-mapped contiguous buffer example complete!
--------------- Key execution times ---------------
OpenCL Initialization: 247.460 ms
Allocate contiguous OpenCL buffers: 30.365 ms
Map buffers to userspace pointers: 0.222 ms
Populating buffer inputs: 22.527 ms
Software VADD run : 24.852 ms
Memory object migration enqueue : 6.739 ms
Set kernel arguments: 0.014 ms
OCL Enqueue task: 0.102 ms
Wait for kernel to complete : 92.068 ms
Read back computation results : 2.243 ms
Table 3.3: Timing Summary - Example 3
Operation | Example 2 | Example 3 | ∆2→3 |
---|---|---|---|
OCL Initialization | 256.254 ms | 247.460 ms | - |
Buffer Allocation | 55 μs | 30.365 ms | 30.310 ms |
Buffer Population | 47.884 ms | 22.527 ms | −25.357 ms |
Software VADD | 35.808 ms | 24.852 ms | −10.956 ms |
Buffer Mapping | 9.103 ms | 222 μs | −8.881 ms |
Write Buffers Out | 6.615 ms | 6.739 ms | - |
Set Kernel Args | 14 μs | 14 μs | - |
Kernel Runtime | 92.110 ms | 92.068 ms | - |
Read Buffer In | 2.479 ms | 2.243 ms | - |
∆Alveo→CPU | −330.889 ms | −323.996 ms | −6.893 ms |
∆FPGA→CPU (algorithm only) | −74.269 ms | −76.536 ms | - |
You may have expected a speedup here, but we see that rather than speeding up any particular operation, instead we’ve shifted the latencies in the system around. Effectively we’ve paid our taxes from a different bank account, but at the end of the day we can’t escape them. On embedded systems with a unified memory map for the processor and the kernels we would see significant differences here, but on server-class CPUs we don’t.
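One way to read "shifted the latencies around" is to add up the three buffer-handling rows of Table 3.3 for each example. The short sketch below does only that arithmetic; grouping allocation, population, and mapping together is our own choice for illustration, and the names are hypothetical.

```cpp
#include <array>
#include <numeric>

// Buffer-handling rows from Table 3.3, in milliseconds, in the order
// allocation, population, mapping.
constexpr std::array<double, 3> kExample2Ms = {0.055, 47.884, 9.103};
constexpr std::array<double, 3> kExample3Ms = {30.365, 22.527, 0.222};

// Sums one example's buffer-handling rows.
inline double totalMs(const std::array<double, 3>& rows) {
    return std::accumulate(rows.begin(), rows.end(), 0.0);
}

// totalMs(kExample2Ms) is about 57.042 ms and totalMs(kExample3Ms) is about
// 53.114 ms: individual rows moved by tens of milliseconds, but the combined
// cost changed by only about 3.9 ms.
```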
One thing to think about is that although pre-allocating the buffers in this way took longer, you don’t generally want to allocate buffers in your application’s critical path. By using this mechanism, the runtime use of your buffers is much faster.
You may even wonder why it’s faster for the CPU to access this memory. While we haven’t discussed it to this point, allocating memory via this API pins the virtual addresses to physical memory. This makes it more efficient for both the CPU and the DMA to access it. As with all things in engineering, though, this comes at a price - allocation time is higher and you run the risk of fragmenting the available memory if you allocate many small buffers.
In general, buffers should be allocated outside the critical path of your application and this method shifts the burden away from your high-performance sections if used correctly.
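As a rough sketch of keeping allocation out of the critical path, the pool below does all of its allocation up front and only hands existing storage back and forth afterwards. It uses plain std::vector storage so that it runs anywhere; BufferPool and availableAfterRoundTrip are hypothetical names, and in an application like this one the pool would presumably hold cl::Buffer objects created with CL_MEM_ALLOC_HOST_PTR instead.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical pool: every allocation happens in the constructor, so
// acquire() and release() only move already-allocated storage around.
class BufferPool {
public:
    BufferPool(std::size_t buffer_count, std::size_t elems_per_buffer) {
        free_.reserve(buffer_count);
        for (std::size_t i = 0; i < buffer_count; ++i)
            free_.emplace_back(elems_per_buffer);
    }

    std::size_t available() const { return free_.size(); }

    // Precondition: available() > 0. Hands out a pre-allocated buffer.
    std::vector<std::uint32_t> acquire() {
        std::vector<std::uint32_t> buf = std::move(free_.back());
        free_.pop_back();
        return buf;
    }

    // Returns a buffer to the pool for reuse.
    void release(std::vector<std::uint32_t> buf) {
        free_.push_back(std::move(buf));
    }

private:
    std::vector<std::vector<std::uint32_t>> free_;
};

// Usage sketch: set up the pool once, then borrow and return a buffer.
inline std::size_t availableAfterRoundTrip(std::size_t buffer_count) {
    BufferPool pool(buffer_count, 1024);  // slow part, done during setup
    std::vector<std::uint32_t> buf = pool.acquire();
    buf[0] = 1;                           // critical-path work goes here
    pool.release(std::move(buf));
    return pool.available();
}
```

The trade-off matches the one described above: setup takes longer and memory stays reserved, but the performance-sensitive section no longer pays for allocation.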
Some things to try to build on this experiment:
Our long pole is becoming our kernel runtime, though, and we can easily speed that up. Let’s take a look at that in our next few examples.
Rob Armstrong leads the AI and Software Acceleration technical marketing team at AMD, bringing the power of adaptive compute to bear on today’s most exciting challenges. Rob has extensive experience developing FPGA and ACAP accelerated hardware applications ranging from small-scale, low-power edge applications up to high-performance, high-demand workloads in the datacenter.