DPUv3E is a member of the Xilinx® DPU IP family for convolutional neural network (CNN) inference applications. It is designed for the latest Xilinx Alveo U50/U280 adaptable accelerator cards with HBM support. DPUv3E is a high-performance CNN inference IP optimized for throughput and data center workloads. It runs a highly optimized instruction set and supports all mainstream convolutional neural networks, such as VGG, ResNet, GoogLeNet, YOLO, SSD, MobileNet, and FPN.
DPUv3E is one of the fundamental IPs (Overlays) of the Xilinx Vitis™ AI development environment, and users can complete full-stack ML development for DPUv3E with the Vitis AI toolchain. Users can also follow the standard Vitis flow to integrate DPUv3E with other customized acceleration kernels and build powerful X+ML solutions. DPUv3E is provided as encrypted RTL or as an XO file for Vivado-based or Vitis-based integration flows.
The major supported Neural Network operators include:
DPUv3E is highly configurable. A DPUv3E kernel consists of several Batch Engines, an Instruction Scheduler, a Shared Weight Buffer, and a Control Register Bank. Following is the block diagram of a DPUv3E kernel with five Batch Engines.
The Batch Engine is the core computation unit of DPUv3E. A Batch Engine handles one input image at a time, so multiple Batch Engines in a DPUv3E kernel can process several input images simultaneously. The number of Batch Engines in a DPUv3E kernel can be configured based on available FPGA resources and the customer's performance requirements. For example, on the Alveo U280 card, SLR0 (with direct HBM connection) can contain a DPUv3E kernel with at most four Batch Engines, while SLR1 or SLR2 can contain a DPUv3E kernel with five Batch Engines. Each Batch Engine contains a convolution engine to handle regular convolution/deconvolution computation and a MISC engine to handle pooling, ReLU, and other miscellaneous operations. The MISC engine is also configurable, with optional functions selected according to specific neural network requirements. Each Batch Engine uses an AXI read/write master interface to exchange feature map data with device memory (HBM).
Similar in concept to a general-purpose processor, the Instruction Scheduler carries out instruction fetch, decode, and dispatch. Because all the Batch Engines in a DPUv3E kernel run the same neural network, the Instruction Scheduler serves all of them with the same instruction stream. The host CPU loads the instruction stream into device memory (HBM) over the PCIe interface, and the Instruction Scheduler uses an AXI read master interface to fetch DPU instructions for the Batch Engines.
The Shared Weight Buffer contains the strategy and control logic that manages loading neural network weights from Alveo device memory and transferring them to the Batch Engines efficiently. Because all the Batch Engines in a DPUv3E kernel run the same neural network, the weights are loaded into the on-chip buffer once and shared by all the Batch Engines, which eliminates unnecessary memory accesses and saves bandwidth. The Shared Weight Buffer uses two AXI read master interfaces to load weight data from device memory (HBM).
The Control Register Bank is the control interface between the DPUv3E kernel and the host CPU. It implements a set of control registers compliant with the Vitis development flow. The Control Register Bank has an AXI slave interface.
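The interface descriptions above imply a simple count of AXI master ports per kernel: one per Batch Engine, one for the Instruction Scheduler, and two for the Shared Weight Buffer. The following is a minimal sketch of that arithmetic only. `DpuAxiPorts` and `dpu_axi_ports` are hypothetical names introduced for illustration and are not part of any Xilinx API. The AXI slave interface of the Control Register Bank is not counted, because it is not a master port to device memory.

```cpp
#include <cassert>

// AXI master interfaces of one DPUv3E kernel, counted from the text above.
struct DpuAxiPorts {
    int feature_map;  // one read/write master per Batch Engine
    int instruction;  // one read master for the Instruction Scheduler
    int weight;       // two read masters for the Shared Weight Buffer

    int total() const { return feature_map + instruction + weight; }
};

// Hypothetical helper: port count for a kernel with the given
// number of Batch Engines.
inline DpuAxiPorts dpu_axi_ports(int batch_engines) {
    assert(batch_engines >= 1);
    return DpuAxiPorts{batch_engines, 1, 2};
}
```

For a kernel with five Batch Engines, `dpu_axi_ports(5).total()` evaluates to 8, which agrees with the eight `dpu_axi_*` ports per kernel instance in the example v++ configuration file later in this document.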
The following table lists DPUv3E performance for a single kernel with a single Batch Engine (the smallest configuration) on some typical neural networks, as well as for two possible implementations on Alveo U50/U280. Because of the characteristics of the HBM memory system, overall performance scales nearly linearly with the number of kernels and Batch Engines, which provides great flexibility to meet specific performance requirements with the least resource usage and power consumption.
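Configuration | NN Model | Frame Rate (FPS) |
---|---|---|
Single kernel with a single Batch Engine | ResNet50 | 98 |
Single kernel with a single Batch Engine | Inception V1 | 193 |
Single kernel with a single Batch Engine | Inception V2 | 138 |
Single kernel with a single Batch Engine | Inception V3 | 44 |
Single kernel with a single Batch Engine | Inception V4 | 23 |
Single kernel with a single Batch Engine | Yolo V2 | 30 |
Single kernel with a single Batch Engine | Tiny Yolo V2 | 90 |
Single kernel with a single Batch Engine | FRCNN | 35 |
Two kernels with five Batch Engines each (5+5) (U50 without power throttle limitation) | ResNet50 | 1000 |
Three kernels, with five Batch Engines in two kernels and four Batch Engines in one kernel (5+5+4) (U280) | ResNet50 | 1550 |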
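The near-linear scaling claim can be checked against the ResNet50 rows of the table with a little arithmetic. The sketch below is illustrative only. `fps_per_engine` and `linear_fps_estimate` are hypothetical helper names, and the estimate assumes ideal linear scaling from the single-engine figure, which the table does not guarantee.

```cpp
#include <cassert>

// Frame rate per Batch Engine for a multi-engine configuration.
inline double fps_per_engine(double total_fps, int total_engines) {
    assert(total_engines >= 1);
    return total_fps / total_engines;
}

// Ideal linear estimate: single-engine frame rate times the engine count.
inline double linear_fps_estimate(double single_engine_fps, int total_engines) {
    assert(total_engines >= 1);
    return single_engine_fps * total_engines;
}
```

With the single-engine ResNet50 figure of 98 FPS, the linear estimate is 980 FPS for ten Batch Engines (1000 FPS listed for the U50 configuration) and 1372 FPS for fourteen Batch Engines (1550 FPS listed for the U280 configuration). The listed multi-kernel figures correspond to 100 FPS and about 110.7 FPS per Batch Engine. The table does not state clock frequencies or other build settings, so the difference from the single-engine figure is not explained here.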
The following table shows the resource utilization of a typical DPUv3E kernel with five Batch Engines.
Configuration | LUT | FF | BRAM | URAM | DSP |
---|---|---|---|---|---|
5 Batch Engines with Leaky ReLU support | 250290 | 310752 | 628 | 320 | 2600 |
DPUv3E uses the HBM of Alveo HBM cards (U280, U50) as external memory. Alveo U280 and U50 have 8 GB of HBM (two 4 GB stacks), providing thirty-two HBM pseudo channels for customer logic and thirty-two hardened 256-bit HBM AXI ports. The Vitis target platforms for U280 and U50 (such as xilinx_u280_xdma_201920_1 and xilinx_u50_xdma_201920_1) provide the required AXI bus fabric as well as the host-device interaction layer.
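The HBM figures above translate into a per-channel capacity and a channel budget that are useful when planning port assignments. The following sketch works through that arithmetic. It assumes that 8 GB means 8 GiB and that the capacity is split evenly across the thirty-two pseudo channels. The constant and function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Figures from the text: 8 GB of HBM exposed as 32 pseudo channels.
// Assumption: "8 GB" is treated as 8 GiB.
constexpr std::uint64_t kHbmBytes = 8ULL * 1024 * 1024 * 1024;
constexpr int kPseudoChannels = 32;

// Capacity of one pseudo channel in bytes, assuming an even split.
inline std::uint64_t bytes_per_pseudo_channel() {
    return kHbmBytes / kPseudoChannels;
}

// Number of pseudo channels still unassigned after `used` are taken.
inline int free_pseudo_channels(int used) {
    assert(used >= 0 && used <= kPseudoChannels);
    return kPseudoChannels - used;
}
```

Under these assumptions each pseudo channel holds 256 MiB. The example design later in this document assigns HBM[0] through HBM[21], that is 22 pseudo channels, so `free_pseudo_channels(22)` leaves 10 unassigned.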
With Vitis, the steps to integrate DPUv3E on a U280/U50 card are simple and straightforward:
The following diagram shows an example HBM connection scheme for an Alveo U280 design with two DPUv3E kernels (five Batch Engines each), four JPEG decoder kernels, and an image resizer kernel.
The JPEG decoder in the example is a Xilinx IP, an RTL kernel packaged as an XO file. Following is the top-level port diagram; the kernel name is jpeg_decoder_v1_0.
The image resizer kernel is an HLS kernel based on the Vitis vision library and synthesized to an XO file. Below is the main body of the kernel; the kernel name is resize_accel.
#include "ap_int.h"
#include "common/xf_common.hpp"
#include "common/xf_utility.hpp"
#include "hls_stream.h"
#include "imgproc/xf_resize.hpp"
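// Width in bits of the AXI interfaces to device memory
#define AXI_WIDTH 256
// Eight pixels per clock cycle; 8-bit unsigned pixels with three channels
#define NPC XF_NPPC8
#define TYPE XF_8UC3
// Helpers that allow macros to be expanded inside HLS pragmas
#define PRAGMA_SUB(x) _Pragma(#x)
#define DYN_PRAGMA(x) PRAGMA_SUB(x)
// Maximum supported input and output image sizes
#define MAX_IN_WIDTH 3840
#define MAX_IN_HEIGHT 2160
#define MAX_OUT_WIDTH 3840
#define MAX_OUT_HEIGHT 2160
// Depth of the internal streams between dataflow stages
#define STREAM_DEPTH 8
// Down-scale parameter passed to xf::cv::resize
#define MAX_DOWN_SCALE 7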
extern "C"
{
    void resize_accel(ap_uint<AXI_WIDTH> *image_in,
                      ap_uint<AXI_WIDTH> *image_out,
                      int width_in,
                      int height_in,
                      int width_out,
                      int height_out)
    {
#pragma HLS INTERFACE m_axi port = image_in offset = slave bundle = image_in_gmem
#pragma HLS INTERFACE m_axi port = image_out offset = slave bundle = image_out_gmem
#pragma HLS INTERFACE s_axilite port = image_in bundle = control
#pragma HLS INTERFACE s_axilite port = image_out bundle = control
#pragma HLS INTERFACE s_axilite port = width_in bundle = control
#pragma HLS INTERFACE s_axilite port = height_in bundle = control
#pragma HLS INTERFACE s_axilite port = width_out bundle = control
#pragma HLS INTERFACE s_axilite port = height_out bundle = control
#pragma HLS INTERFACE s_axilite port = return bundle = control

        xf::cv::Mat<TYPE, MAX_IN_HEIGHT, MAX_IN_WIDTH, NPC> in_mat(height_in, width_in);
        DYN_PRAGMA(HLS stream variable = in_mat.data depth = STREAM_DEPTH)
        xf::cv::Mat<TYPE, MAX_OUT_HEIGHT, MAX_OUT_WIDTH, NPC> out_mat(height_out, width_out);
        DYN_PRAGMA(HLS stream variable = out_mat.data depth = STREAM_DEPTH)

#pragma HLS DATAFLOW
        xf::cv::Array2xfMat<AXI_WIDTH, TYPE, MAX_IN_HEIGHT, MAX_IN_WIDTH, NPC>(image_in, in_mat);
        xf::cv::resize<XF_INTERPOLATION_AREA,
                       XF_8UC3,
                       MAX_IN_HEIGHT,
                       MAX_IN_WIDTH,
                       MAX_OUT_HEIGHT,
                       MAX_OUT_WIDTH,
                       NPC,
                       MAX_DOWN_SCALE>(in_mat, out_mat);
        xf::cv::xfMat2Array<AXI_WIDTH, TYPE, MAX_OUT_HEIGHT, MAX_OUT_WIDTH, NPC>(out_mat, image_out);
    }
}
The following example v++ configuration file corresponds to the HBM connection block diagram.
[connectivity]
# ---------------------------------------------------------------
# multiple instances of 'jpeg_decoder_v1_0'
nk=jpeg_decoder_v1_0:4:jpeg_decoder_1.jpeg_decoder_2.jpeg_decoder_3.jpeg_decoder_4
# SLR assignment of 'jpeg_decoder'
slr=jpeg_decoder_1:SLR0
slr=jpeg_decoder_2:SLR0
slr=jpeg_decoder_3:SLR0
slr=jpeg_decoder_4:SLR0
# HBM port assignment of 'jpeg_decoder'
sp=jpeg_decoder_1.m00_axi:HBM[16]
sp=jpeg_decoder_2.m00_axi:HBM[17]
sp=jpeg_decoder_3.m00_axi:HBM[18]
sp=jpeg_decoder_4.m00_axi:HBM[19]
# ---------------------------------------------------------------
# single instance of 'resize_accel'
nk=resize_accel:1:resize_accel_1
# SLR assignment of 'resize_accel'
slr=resize_accel_1:SLR0
# HBM port assignment of 'resize_accel'
sp=resize_accel_1.m_axi_image_in_gmem:HBM[20]
sp=resize_accel_1.m_axi_image_out_gmem:HBM[21]
# ---------------------------------------------------------------
# multiple instances of 'dpuv3e_5be'
nk=dpuv3e_5be:2:dpuv3e_5be_1.dpuv3e_5be_2
# SLR assignment of 'dpuv3e_5be'
slr=dpuv3e_5be_1:SLR1
slr=dpuv3e_5be_2:SLR2
# HBM port assignment of 'dpuv3e_5be'
sp=dpuv3e_5be_1.dpu_axi_0:HBM[0]
sp=dpuv3e_5be_1.dpu_axi_1:HBM[1]
sp=dpuv3e_5be_1.dpu_axi_2:HBM[2]
sp=dpuv3e_5be_1.dpu_axi_3:HBM[3]
sp=dpuv3e_5be_1.dpu_axi_4:HBM[4]
sp=dpuv3e_5be_1.dpu_axi_i:HBM[5]
sp=dpuv3e_5be_1.dpu_axi_w0:HBM[6]
sp=dpuv3e_5be_1.dpu_axi_w1:HBM[7]
sp=dpuv3e_5be_2.dpu_axi_0:HBM[8]
sp=dpuv3e_5be_2.dpu_axi_1:HBM[9]
sp=dpuv3e_5be_2.dpu_axi_2:HBM[10]
sp=dpuv3e_5be_2.dpu_axi_3:HBM[11]
sp=dpuv3e_5be_2.dpu_axi_4:HBM[12]
sp=dpuv3e_5be_2.dpu_axi_i:HBM[13]
sp=dpuv3e_5be_2.dpu_axi_w0:HBM[14]
sp=dpuv3e_5be_2.dpu_axi_w1:HBM[15]
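The `sp=` lines for each DPUv3E instance in the configuration above follow a regular pattern, so they can be generated rather than typed by hand. The sketch below reproduces that pattern. `dpu_sp_lines` is a hypothetical helper and not part of the Vitis tools. The port names are copied from the example, and their roles (feature maps, instructions, weights) are inferred from the names and the architecture description, not from a port specification. Assigning consecutive HBM indices, one per port, simply mirrors the example.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Builds the `sp=` lines for one DPUv3E kernel instance, assigning one
// consecutive HBM pseudo channel per AXI master port, as in the example.
inline std::vector<std::string> dpu_sp_lines(const std::string& instance,
                                             int batch_engines,
                                             int first_hbm_index) {
    assert(batch_engines >= 1 && first_hbm_index >= 0);

    // Port names as they appear in the example configuration file.
    std::vector<std::string> ports;
    for (int i = 0; i < batch_engines; ++i) {
        ports.push_back("dpu_axi_" + std::to_string(i));  // per Batch Engine
    }
    ports.push_back("dpu_axi_i");   // presumably instruction fetch
    ports.push_back("dpu_axi_w0");  // presumably weight loading
    ports.push_back("dpu_axi_w1");

    std::vector<std::string> lines;
    int hbm = first_hbm_index;
    for (const std::string& port : ports) {
        lines.push_back("sp=" + instance + "." + port + ":HBM[" +
                        std::to_string(hbm++) + "]");
    }
    return lines;
}
```

Calling `dpu_sp_lines("dpuv3e_5be_2", 5, 8)` yields eight lines, from `sp=dpuv3e_5be_2.dpu_axi_0:HBM[8]` to `sp=dpuv3e_5be_2.dpu_axi_w1:HBM[15]`, matching the second kernel instance in the example.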
You can use the following command line to build the xclbin (assuming the XO files for DPUv3E, the JPEG decoder, and the image resizer are dpuv3e_5be.xo, jpeg_decoder_v1_0.xo, and resize_accel.xo):
v++ --link \
--target hw \
--platform xilinx_u280_xdma_201920_1 \
--config example_config.txt \
--output dpuv3e_integration.xclbin \
dpuv3e_5be.xo jpeg_decoder_v1_0.xo resize_accel.xo
DPUv3E is a flexible, high-performance convolutional neural network inference IP targeting Alveo HBM cards. The number of DPUv3E kernels and the number of Batch Engines in each kernel are fully configurable, which makes it easy to adapt to the requirements of a specific scenario. Users can easily integrate DPUv3E with their own acceleration kernels using the Vitis and Vitis AI flows.