This article explains the differences between the three levels of Vitis AI Library APIs. The Vitis AI Library is based on the Xilinx Vitis Unified Software Platform and provides three different levels of APIs; choosing the level that is right for your development is important for reducing development effort and improving performance. The Vitis AI Library APIs are available on both the MPSoC and Alveo platforms. This article is based on Vitis AI Library v1.1.
The Vitis AI Library is a set of high-level libraries and APIs. The development of a neural network application usually includes the following steps.
Generally, image frames from the sensor may need to be color-converted or resized; this is a system design requirement, not a requirement of the AI algorithm itself. The data then needs to be normalized, which is required by the AI algorithm. The output of the NN model needs to be post-processed; for example, in a detection algorithm the bounding boxes need to be decoded, which is also required by the AI algorithm. System-level, non-AI post-processing is also required; for example, in an ADAS system the results of the AI algorithm are further processed for vehicle control or status updates.
Both system-level (non-AI) pre-processing and post-processing depend on the scenario, such as performing color conversion and resizing on data from cameras or video-transfer devices, or processing detection bounding boxes to capture photos or generate alerts. The Vitis AI Library provides optimized code for the entire algorithmic flow, an open and flexible architecture for easy extension, and direct support for the models in the Model Zoo.
The Xilinx Runtime (XRT) provides the unified base APIs. The Vitis AI Runtime (VART) is built on top of XRT and uses it to build its five unified APIs. The DpuTask APIs are built on top of VART; as opposed to VART, they encapsulate not only the DPU runner but also the algorithm-level pre-processing, such as mean and scale. The highest level is the Vitis AI Library APIs, which are based on the DpuTask APIs; with these, you do not need to care about the implementation of the algorithms and can focus on the system-level application.
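To make the layering concrete, the entry point used at each level is shown below, taken from the examples later in this article (model_dir stands for the model directory passed on the command line in the resnet50 sample; the model names are the ones used in those examples).
// VART: create a DpuRunner; tensors, mean/scale, and post-processing are yours to manage.
auto runners = vitis::ai::DpuRunner::create_dpu_runner(model_dir);
// DpuTask: wraps the runner plus the algorithm-level pre-processing (mean/scale).
auto task = vitis::ai::DpuTask::create("yolov3_voc");
// Vitis AI Library: wraps the whole algorithm, including post-processing.
auto face = vitis::ai::FaceDetect::create("densebox_640_360");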
The VART APIs are distributed across several header files. We use these unified APIs to handle the DPU execution part. DpuRunnerExt is defined to obtain the fixed-point information from the model.
The class inheritance relationship of VART is described below. Runner is the base class; DpuRunner is derived from Runner and, in addition to the runner APIs, implements wait(). DpuRunnerExt is derived from DpuRunner and extends it with APIs that retrieve the fixed-point information of the model after it has been compiled into ELF format. We can get the fixed-point positions by calling the get_input_scale and get_output_scale methods. Once we have the input scale information, we can apply the mean and scale to the input image data. After executing the runner and getting the result, we convert the result from fixed point to floating point using the information returned by get_output_scale.
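As a concrete illustration of that conversion, below is a minimal, self-contained sketch of the arithmetic only; the fix_pos values are hypothetical, and on a real model the scales come from the get_input_scale()/get_output_scale() calls mentioned above.
#include <cmath>
#include <cstdint>
#include <cstdio>

int main() {
  // Hypothetical fixed-point positions for one input and one output tensor.
  // For the DPU, input scale = 2^fix_pos and output scale = 2^(-fix_pos).
  const int in_fix_pos = 6;
  const int out_fix_pos = 2;
  const float input_scale = std::pow(2.0f, in_fix_pos);
  const float output_scale = std::pow(2.0f, -out_fix_pos);
  // Pre-processing: apply the algorithm's mean/scale, then quantize into the int8 input buffer.
  const float pixel = 200.0f, mean = 0.0f, norm = 1.0f / 255.0f;
  const int8_t fixed_in = static_cast<int8_t>((pixel - mean) * norm * input_scale);
  // Post-processing: convert an int8 value read from the output tensor back to floating point.
  const int8_t fixed_out = 53;
  const float float_out = fixed_out * output_scale;
  std::printf("fixed_in = %d, float_out = %.4f\n", fixed_in, float_out);
  return 0;
}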
Take resnet50 as an example; for the source code, refer to the resnet50 sample. The APIs we will use are included in the following header files:
#include <vart/dpu/dpu_runner_ext.hpp>
#include <vitis/dpu/dpu_runner.hpp>
#include "vart/dpu/vitis_dpu_runner_factory.hpp"
As shown in the figure above, when developing applications based on the VART APIs, the red modules are the ones that need to be implemented by developers, while the green modules encapsulate the DPU implementation. The developer's scope includes color conversion and decoding of video and image data, plus the algorithm-level pre-processing such as mean and scale; a DPU runner is then created via VART. Copy the pre-processed data to the input TensorBuffer, run the runner, and get the output TensorBuffer. Finally, post-process the output results and implement the system-level post-processing.
/*create runner*/
auto runners = vitis::ai::DpuRunner::create_dpu_runner(argv[1]);
auto runner = runners[0].get();
// ai::XdpuRunner* runner = new ai::XdpuRunner("./");
/*get in/out tensor*/
auto inputTensors = runner->get_input_tensors();
auto outputTensors = runner->get_output_tensors();
/*get in/out tensor shape*/
int inputCnt = inputTensors.size();
int outputCnt = outputTensors.size();
TensorShape inshapes[inputCnt];
TensorShape outshapes[outputCnt];
shapes.inTensorList = inshapes;
shapes.outTensorList = outshapes;
getTensorShape(runner, &shapes, inputCnt, outputCnt);
std::vector<vitis::ai::CpuFlatTensorBuffer> inputs, outputs;
vector<Mat> imageList;
float* imageInputs = new float[inSize * batchSize];
float* softmax = new float[outSize];
float* FCResult = new float[batchSize * outSize];
std::vector<vitis::ai::TensorBuffer*> inputsPtr, outputsPtr;
std::vector<std::shared_ptr<ai::Tensor>> batchTensors;
for (unsigned int n = 0; n < images.size(); n += batchSize) {
unsigned int runSize =
(images.size() < (n + batchSize)) ? (images.size() - n) : batchSize;
in_dims[0] = runSize;
out_dims[0] = batchSize;
for (unsigned int i = 0; i < runSize; i++) {
Mat image = imread(baseImagePath + images[n + i]);
/*image pre-process*/
Mat image2 = cv::Mat(inHeight, inWidth, CV_8SC3);
resize(image, image2, Size(inHeight, inWidth), 0, 0, INTER_NEAREST);
if (runner->get_tensor_format() == DpuRunner::TensorFormat::NHWC) {
for (int h = 0; h < inHeight; h++)
for (int w = 0; w < inWidth; w++)
for (int c = 0; c < 3; c++)
imageInputs[i * inSize + h * inWidth * 3 + w * 3 + c] =
image2.at<Vec3b>(h, w)[c] - mean[c];
} else {
for (int c = 0; c < 3; c++)
for (int h = 0; h < inHeight; h++)
for (int w = 0; w < inWidth; w++)
imageInputs[i * inSize + (c * inHeight * inWidth) +
(h * inWidth) + w] =
image2.at<Vec3b>(h, w)[c] - mean[c];
}
imageList.push_back(image);
}
/* in/out tensor refactory for batch input/output */
batchTensors.push_back(std::shared_ptr<ai::Tensor>(new ai::Tensor(
inputTensors[0]->get_name(), in_dims, ai::Tensor::DataType::FLOAT)));
inputs.push_back(
ai::CpuFlatTensorBuffer(imageInputs, batchTensors.back().get()));
batchTensors.push_back(std::shared_ptr<ai::Tensor>(new ai::Tensor(
outputTensors[0]->get_name(), out_dims, ai::Tensor::DataType::FLOAT)));
outputs.push_back(
ai::CpuFlatTensorBuffer(FCResult, batchTensors.back().get()));
/*tensor buffer input/output */
inputsPtr.clear();
outputsPtr.clear();
inputsPtr.push_back(&inputs[0]);
outputsPtr.push_back(&outputs[0]);
auto job_id = runner->execute_async(inputsPtr, outputsPtr);
runner->wait(job_id.first, -1);
for (unsigned int i = 0; i < runSize; i++) {
cout << "\nImage : " << images[n + i] << endl;
/* Calculate softmax on CPU and display TOP-5 classification results */
CPUCalcSoftmax(&FCResult[i * outSize], outSize, softmax);
TopK(softmax, outSize, 5, kinds);
/* Display the image */
cv::imshow("Classification of ResNet50", imageList[i]);
cv::waitKey(10000);
}
imageList.clear();
inputs.clear();
outputs.clear();
}
delete[] FCResult;
delete[] softmax;
delete[] imageInputs;
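For completeness, below is a minimal sketch of the two CPU post-processing helpers called above; the actual sample ships its own CPUCalcSoftmax and TopK implementations, so treat this as illustrative only.
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdio>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Softmax over `size` logits, writing probabilities into `result`.
static void CPUCalcSoftmax(const float* data, size_t size, float* result) {
  double sum = 0.0;
  for (size_t i = 0; i < size; ++i) {
    result[i] = std::exp(data[i]);
    sum += result[i];
  }
  for (size_t i = 0; i < size; ++i) result[i] /= sum;
}

// Print the k highest-probability classes together with their label text.
static void TopK(const float* softmax, int size, int k,
                 const std::vector<std::string>& kinds) {
  std::vector<std::pair<float, int>> scored;
  for (int i = 0; i < size; ++i) scored.emplace_back(softmax[i], i);
  std::partial_sort(scored.begin(), scored.begin() + k, scored.end(),
                    std::greater<std::pair<float, int>>());
  for (int i = 0; i < k; ++i)
    std::printf("top[%d] prob = %.6f  name = %s\n", i, scored[i].first,
                kinds[scored[i].second].c_str());
}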
When an instance of DpuTask is created using the create method, an instance of DpuTaskImp is generated automatically; DpuTaskImp is the implementation of DpuTask. The relationship between the DpuTask APIs and the VART APIs is that DpuTaskImp holds a pointer to a DpuRunner: when DpuTaskImp runs the DPU, it calls the run method of DpuRunner. It is clear from this that the implementation of DpuTask depends heavily on the implementation of DpuRunner.
std::unique_ptr<DpuTask> DpuTask::create(const std::string& model_name) {
return std::unique_ptr<DpuTask>(new DpuTaskImp(model_name));
}
DpuTaskImp::DpuTaskImp(const std::string& model_name)
: model_name_{model_name},
dirname_{find_module_dir_name(model_name)},
runners_{vitis::ai::DpuRunner::create_dpu_runner(dirname_)},
mean_{std::vector<float>(3, 0.f)}, //
scale_{std::vector<float>(3, 1.f)}, //
do_mean_scale_{false} {}
Take demo_yolov3.cpp as an example. The first step is to create a DpuTask using the static create method of the DpuTask class.
The modules in red need to be developed. Inspecting the green modules, the DpuTask encapsulates creating the runner, getting the tensor format, setting the mean and scale, and running the DpuRunner. The developer still needs to implement the system-level pre-processing, the algorithm-level post-processing, and the system-level post-processing.
auto kernel_name = "yolov3_voc";
// An image file.
auto image_file_name = argv[1];
// Create a dpu task object.
auto task = vitis::ai::DpuTask::create(kernel_name);
auto input_image = cv::imread(image_file_name);
if (input_image.empty()) {
cerr << "cannot load " << image_file_name << endl;
abort();
}
// Resize it if its size does not match.
cv::Mat image;
auto input_tensor = task->getInputTensor(0u);
CHECK_EQ((int)input_tensor.size(), 1)
<< " the dpu model must have only one input";
auto width = input_tensor[0].width;
auto height = input_tensor[0].height;
auto size = cv::Size(width, height);
if (size != input_image.size()) {
cv::resize(input_image, image, size);
} else {
image = input_image;
}
task->setMeanScaleBGR({0.0f, 0.0f, 0.0f},
{0.00390625f, 0.00390625f, 0.00390625f});
// Set the input image into dpu.
task->setImageRGB(image);
task->run(0u);
/* Post-process part */
// Get output.
auto output_tensor = task->getOutputTensor(0u);
// Create a config and set the correlating data to control post-process.
vitis::ai::proto::DpuModelParam config;
// Fill all the parameters.
auto ok =
google::protobuf::TextFormat::ParseFromString(yolov3_config, &config);
if (!ok) {
cerr << "Set parameters failed!" << endl;
abort();
}
// Execute the yolov3 post-processing.
auto results = vitis::ai::yolov3_post_process(
input_tensor, output_tensor, config, input_image.cols, input_image.rows);
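The config object filled in above is an instance of the DpuModelParam protobuf message defined by the Vitis AI Library; the message definition, together with the YoloV3Param sub-message it references, is shown below.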
message DpuModelParam {
optional string name = 1;
repeated DpuKernelParam kernel = 2;
enum ModelType {
UNKNOWN_TYPE = 0;
REFINEDET = 1;
SSD = 2;
YOLOv3 = 3;
CLASSIFICATION = 4;
DENSE_BOX = 5;
MULTI_TASK = 6;
OPENPOSE = 7;
ROADLINE = 8;
SEGMENTATION = 9;
POSEDETECT = 10;
LANE = 11;
BLINKER = 12;
SEGDET = 13;
ROADLINE_DEEPHI = 14;
FACEQUALITY5PT = 15;
REID = 16;
}
optional ModelType model_type = 3;
optional RefineDetParam refine_det_param = 4;
optional YoloV3Param yolo_v3_param = 5;
optional SSDParam ssd_param = 6;
optional ClassificationParam classification_param = 7;
optional DenseBoxParam dense_box_param = 8;
optional MultiTaskParam multi_task_param = 9;
optional RoadlineParam roadline_param = 10;
optional SegmentationParam segmentation_param = 11;
optional LaneParam lane_param = 12;
optional BlinkerParam blinker_param = 13;
optional SegdetParam segdet_param = 14;
optional RoadlineDeephiParam roadline_dp_param = 15;
optional bool is_tf = 16;
optional FaceQuality5ptParam face_quality5pt_param = 17;
optional TfssdParam tfssd_param = 18;
}
message YoloV3Param{
optional int32 num_classes = 1;
optional int32 anchorCnt = 2;
optional float conf_threshold = 3;
optional float nms_threshold = 4;
repeated float biases = 5;
optional bool test_mAP = 6;
repeated string layer_name = 7;
}
const string yolov3_config = {
" name: \"yolov3_voc_416\" \n"
" model_type : YOLOv3 \n"
" yolo_v3_param { \n"
" num_classes: 20 \n"
" anchorCnt: 3 \n"
" conf_threshold: 0.3 \n"
" nms_threshold: 0.45 \n"
" biases: 10 \n"
" biases: 13 \n"
" biases: 16 \n"
" biases: 30 \n"
" biases: 33 \n"
" biases: 23 \n"
" biases: 30 \n"
" biases: 61 \n"
" biases: 62 \n"
" biases: 45 \n"
" biases: 59 \n"
" biases: 119 \n"
" biases: 116 \n"
" biases: 90 \n"
" biases: 156 \n"
" biases: 198 \n"
" biases: 373 \n"
" biases: 326 \n"
" test_mAP: false \n"
" } \n"};
The Vitis AI Library encapsulates each model according to that model's requirements. Take the facedetect sample as an example; for the source code, refer to facedetect.cpp.
The inheritance relationship is shown below: the model class, such as FaceDetect, encapsulates a series of interface-related methods. Take the facedetect model as an example.
First, when the FaceDetect object is created, a DetectImp instance is generated at the same time; the actual run method is implemented by DetectImp.
Second, in the constructor of DetectImp we can see that the parent class TConfigurableDpuTask is constructed as well. This class has a member variable configurable_dpu_task_, which reads the model-specific parameters from the config file located at /usr/share/vitis_ai_library/models/densebox_640_360.
Third, the run() function of DetectImp contains the pre-processing, running the DpuTask, and the post-processing. The ConfigurableDpuTask is ultimately implemented on top of DpuTask.
std::unique_ptr<FaceDetect> FaceDetect::create(const std::string &model_name, bool need_preprocess) {
return std::unique_ptr<FaceDetect>(
new DetectImp(model_name, need_preprocess));
}
DetectImp::DetectImp(const std::string &model_name, bool need_preprocess)
: vitis::ai::TConfigurableDpuTask<FaceDetect>(model_name, need_preprocess),
det_threshold_(configurable_dpu_task_->getConfig()
.dense_box_param()
.det_threshold()) {}
FaceDetectResult DetectImp::run(const cv::Mat &input_image) {
__TIC__(FACE_DETECT_E2E)
// Set input image into DPU Task
cv::Mat image;
auto size = cv::Size(getInputWidth(), getInputHeight());
if (size != input_image.size()) {
cv::resize(input_image, image, size, 0);
} else {
image = input_image;
}
__TIC__(FACE_DETECT_SET_IMG)
configurable_dpu_task_->setInputImageBGR(image);
__TOC__(FACE_DETECT_SET_IMG)
__TIC__(FACE_DETECT_DPU)
configurable_dpu_task_->run(0);
__TOC__(FACE_DETECT_DPU)
__TIC__(FACE_DETECT_POST_ARM)
auto ret = vitis::ai::face_detect_post_process(
configurable_dpu_task_->getInputTensor(),
configurable_dpu_task_->getOutputTensor(),
configurable_dpu_task_->getConfig(), det_threshold_);
__TOC__(FACE_DETECT_POST_ARM)
__TOC__(FACE_DETECT_E2E)
return ret[0];
}
The Vitis AI Library APIs can separate out the pre-processing. As shown below, if need_preprocess is set to false, the resize and mean/scale steps need to be implemented by the developer; otherwise, if need_preprocess is set to true, you only need to take care of the image decode and color conversion.
// need_preprocess=true
static std::unique_ptr<FaceDetect> create(const std::string &model_name,
bool need_preprocess = true);
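Conversely, here is a minimal sketch of the other path, assuming the same create signature shown above: with need_preprocess set to false, the resize and mean/scale described earlier become the caller's responsibility.
// Sketch only: disable the library's internal mean/scale; the exact input format
// expected from the caller in this mode depends on the model and release.
auto model = vitis::ai::FaceDetect::create("densebox_640_360", /*need_preprocess=*/false);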
int main(int argc, char *argv[]) {
string model = argv[1];
return vitis::ai::main_for_jpeg_demo(
argc, argv,
[model] {
return vitis::ai::FaceDetect::create(model);
},
process_result, 2);
}
auto image_file_name = std::string{argv[i]};
auto image = cv::imread(image_file_name);
if (image.empty()) {
LOG(FATAL) << "cannot load " << image_file_name << std::endl;
abort();
}
auto result = model->run(image);
With the different levels of APIs, development requirements can be divided into three categories: if you use a model from the Model Zoo (or one with the same pre- and post-processing), the Vitis AI Library APIs are the quickest path; if you use your own model and want the library to handle the DPU runner and the algorithm-level pre-processing while you implement the post-processing, the DpuTask APIs are a good fit; and if you need full control over the pre-processing and post-processing, use the VART APIs.
Dachang is a Sr. Technical Marketing Engineer working on Software and AI Platforms at AMD, responsible for the Vitis AI Runtime and Vitis AI Library. Before joining AMD in July 2018, he served as product lead in the DeePhi business department. He currently focuses on edge-side AI and X+ML solutions.