Introduction

This article explains the differences between the three levels of Vitis AI Library APIs. The Vitis AI Library is based on the Xilinx Vitis Unified Software Platform and provides three different levels of APIs; choosing the level that is right for your development is important for reducing development effort and improving performance. The Vitis AI Library APIs are available on both the MPSoC and Alveo platforms. This article is based on Vitis AI Library v1.1.


Overview

Process flow

The Vitis AI Library is a set of high-level libraries and APIs. The development of a neural network application usually includes the following steps:

  • Sensor output frames
  • System pre-processing
  • Algorithm pre-processing
  • DPU running NN model
  • Algorithm post-processing
  • System post-processing

Generally, image frames from the sensor may need to be color converted or resized. This is a system design requirement, not a requirement of the AI algorithm. The data then needs to be normalized, which is required by the AI algorithm. The output of the NN model needs to be post-processed; for example, in a detection algorithm the bounding boxes need to be decoded, which is also required by the AI algorithm. System-level, non-AI post-processing is also required throughout the development process; for example, in an ADAS system the results of the AI algorithm are further processed for vehicle control or status updates.
Both the non-AI pre-processing and post-processing are system-level and depend on the scenario, such as performing color conversion and resizing on frames from a camera or video transfer device, or processing detection bounding boxes to capture photos or generate alerts. The Vitis AI Library provides optimized code for the entire algorithmic flow, an open and flexible architecture for easy extension, and direct support for the models in the Model Zoo.

 General Processing Flow
Figure 1. General Processing Flow

Hierarchy of APIs

The Xilinx Runtime (XRT) provides the unified base APIs. The Vitis AI Runtime (VART) is built on top of XRT and uses it to provide the five unified APIs. The DpuTask APIs are built on top of VART; as opposed to VART, the DpuTask APIs encapsulate not only the DPU runner but also the algorithm-level pre-processing, such as mean and scale. The highest level is the Vitis AI Library APIs, which are based on the DpuTask APIs; with these, you do not need to care about the implementation of the algorithms and can focus on the system-level application.
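Below is a side-by-side sketch of how an application object is created at each level. The three calls are taken from the examples later in this article, and the directory, kernel, and model names are the ones those examples use.

    // VART: the lowest level. You drive the DPU runner directly and implement
    // all pre- and post-processing yourself.
    auto runners = vitis::ai::DpuRunner::create_dpu_runner(argv[1]);

    // DpuTask: built on VART. It also handles the algorithm-level
    // pre-processing such as mean and scale.
    auto task = vitis::ai::DpuTask::create("yolov3_voc");

    // Vitis AI Library: built on DpuTask. The model class encapsulates the
    // pre-processing, the DPU run, and the post-processing.
    auto model = vitis::ai::FaceDetect::create("densebox_640_360");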

APIs_Level
Figure 2. Hierarchy of APIs

VART APIs

VART API definitions

The VART APIs are declared in the header files listed below. We use these unified APIs to handle the DPU execution. DpuRunnerExt is defined to obtain the fixed-point information from the model.

The class inheritance relationship of VART is described below. Runner is the base class; DpuRunner is derived from Runner and, in addition to the base APIs, implements wait(). DpuRunnerExt is derived from DpuRunner and extends it with APIs that retrieve the fixed-point information of the model after it has been compiled into the ELF format. We can therefore get the fixed-point positions by calling the get_input_scale and get_output_scale methods. Once we have the input scale information, we can apply mean and scale to the input image data. After executing the runner and getting the result, we need to convert the result from fixed point to floating point using the information returned by get_output_scale.
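As an illustration of how this scale information is typically used, here is a minimal sketch, not taken from the library sources. It assumes that get_input_scale()/get_output_scale() return one scale value per tensor and that the DPU works on int8 data, as described above.

    #include <cstdint>
    #include <vector>
    #include <vart/dpu/dpu_runner_ext.hpp>

    // Quantize normalized (mean/scale-processed) float data into the DPU's
    // fixed-point input format using the input scale reported by the runner.
    std::vector<int8_t> to_fixed(vitis::ai::DpuRunnerExt* runner,
                                 const std::vector<float>& normalized) {
      float in_scale = runner->get_input_scale()[0];
      std::vector<int8_t> fixed(normalized.size());
      for (size_t i = 0; i < normalized.size(); ++i)
        fixed[i] = static_cast<int8_t>(normalized[i] * in_scale);
      return fixed;
    }

    // Convert the DPU's fixed-point output back to floating point using the
    // output scale reported by the runner.
    std::vector<float> to_float(vitis::ai::DpuRunnerExt* runner,
                                const std::vector<int8_t>& fixed) {
      float out_scale = runner->get_output_scale()[0];
      std::vector<float> result(fixed.size());
      for (size_t i = 0; i < fixed.size(); ++i)
        result[i] = static_cast<float>(fixed[i]) * out_scale;
      return result;
    }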

VART
Figure 3. VART APIs structure

How to program with VART

Take the resnet50 sample as an example, which uses the resnet50 model; for the source code, refer to the resnet50 sample. The APIs we will use are included in the following header files:

    #include <vart/dpu/dpu_runner_ext.hpp>
    #include <vitis/dpu/dpu_runner.hpp>
    #include "vart/dpu/vitis_dpu_runner_factory.hpp"
VART_CANDIDATE
Figure 4. VART APIs development workload

As shown in the figure above, when developing applications based on the VART APIs, the red modules are the ones that need to be implemented by developers, while the green modules encapsulate the DPU implementation. The developer's scope includes color conversion and decoding of the video and image data, the algorithm-level pre-processing such as mean and scale, and then creating a DPU runner and feeding it the prepared data via VART. Copy the pre-processed data to the inputTensor pointer, run the runner, and then get the outputTensor. Finally, post-process the output results and implement the system-level post-processing.

  • The source code below refers to resnet50. First, we need to create a DPU runner, which contains all the information we need about the tensor shapes. Dimensional information such as height, width, number of channels, and size can be extracted from the DPU runner.
  /* create runner */
  auto runners = vitis::ai::DpuRunner::create_dpu_runner(argv[1]);
  auto runner = runners[0].get();
  // ai::XdpuRunner* runner = new ai::XdpuRunner("./");
  /*get in/out tensor*/
  auto inputTensors = runner->get_input_tensors();
  auto outputTensors = runner->get_output_tensors();

  /* get in/out tensor shapes; the shapes struct is declared elsewhere in the sample */
  int inputCnt = inputTensors.size();
  int outputCnt = outputTensors.size();
  TensorShape inshapes[inputCnt];
  TensorShape outshapes[outputCnt];
  shapes.inTensorList = inshapes;
  shapes.outTensorList = outshapes;
  getTensorShape(runner, &shapes, inputCnt, outputCnt);
  • The next step is to prepare the input/output tensor buffers and fill the input buffer with the image data and tensor shape information. The struct CpuFlatTensorBuffer is the actual container of the image data. Buffer preparation is also required for softmax and FCResult. The runner's execute_async() function requires the data type TensorBuffer*, so before you run the runner you need to convert the CpuFlatTensorBuffer* into TensorBuffer*.
      std::vector<vitis::ai::CpuFlatTensorBuffer> inputs, outputs;

  vector<Mat> imageList;
  float* imageInputs = new float[inSize * batchSize];

  float* softmax = new float[outSize];
  float* FCResult = new float[batchSize * outSize];
  std::vector<vitis::ai::TensorBuffer*> inputsPtr, outputsPtr;
  std::vector<std::shared_ptr<ai::Tensor>> batchTensors;
  • As shown in the source code below, the image data needs to be pre-processed. With the VART APIs, functions such as resize, mean, and scale need to be implemented by the developer. Resize is the system-level pre-processing; mean and scale are the algorithm-level pre-processing.
     for (unsigned int n = 0; n < images.size(); n += batchSize) {
    unsigned int runSize =
        (images.size() < (n + batchSize)) ? (images.size() - n) : batchSize;
    in_dims[0] = runSize;
    out_dims[0] = batchSize;
    for (unsigned int i = 0; i < runSize; i++) {
      Mat image = imread(baseImagePath + images[n + i]);

      /*image pre-process*/
      Mat image2 = cv::Mat(inHeight, inWidth, CV_8SC3);
      resize(image, image2, Size(inHeight, inWidth), 0, 0, INTER_NEAREST);
      if (runner->get_tensor_format() == DpuRunner::TensorFormat::NHWC) {
        for (int h = 0; h < inHeight; h++)
          for (int w = 0; w < inWidth; w++)
            for (int c = 0; c < 3; c++)
              imageInputs[i * inSize + h * inWidth * 3 + w * 3 + c] =
                  image2.at<Vec3b>(h, w)[c] - mean[c];
      } else {
        for (int c = 0; c < 3; c++)
          for (int h = 0; h < inHeight; h++)
            for (int w = 0; w < inWidth; w++)
              imageInputs[i * inSize + (c * inHeight * inWidth) +
                          (h * inWidth) + w] =
                  image2.at<Vec3b>(h, w)[c] - mean[c];
      }
      imageList.push_back(image);
    }
  • In this step, after pre-processing the image data, we need to wrap the data in the class CpuFlatTensorBuffer, which is derived from the class TensorBuffer. Once we push the CpuFlatTensorBuffer pointers into the container of TensorBuffer pointers, the preparation of the data is complete.
    /* in/out tensor refactory for batch input/output */
    batchTensors.push_back(std::shared_ptr<ai::Tensor>(new ai::Tensor(
        inputTensors[0]->get_name(), in_dims, ai::Tensor::DataType::FLOAT)));
    inputs.push_back(
        ai::CpuFlatTensorBuffer(imageInputs, batchTensors.back().get()));
    batchTensors.push_back(std::shared_ptr<ai::Tensor>(new ai::Tensor(
        outputTensors[0]->get_name(), out_dims, ai::Tensor::DataType::FLOAT)));
    outputs.push_back(
        ai::CpuFlatTensorBuffer(FCResult, batchTensors.back().get()));

    /*tensor buffer input/output */
    inputsPtr.clear();
    outputsPtr.clear();
    inputsPtr.push_back(&inputs[0]);
    outputsPtr.push_back(&outputs[0]);
  • All that's left is to execute the DPU runner, get the results back, send the results to the post-processing functions, display the results, and finish the whole process. In this part, displaying the result and the TopK function are the system-level post-processing, while CPUCalcSoftmax is the algorithm-level post-processing; a minimal sketch of such a CPU softmax is shown after the code below.
    auto job_id = runner->execute_async(inputsPtr, outputsPtr);
    runner->wait(job_id.first, -1);
    for (unsigned int i = 0; i < runSize; i++) {
      cout << "\nImage : " << images[n + i] << endl;
      /* Calculate softmax on CPU and display TOP-5 classification results */
      CPUCalcSoftmax(&FCResult[i * outSize], outSize, softmax);
      TopK(softmax, outSize, 5, kinds);
      /* Display the image */
      cv::imshow("Classification of ResNet50", imageList[i]);
      cv::waitKey(10000);
    }
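For reference, the algorithm-level post-processing mentioned above is implemented by the sample itself. The following is a minimal sketch of what a CPU softmax such as CPUCalcSoftmax() does; it is illustrative only and not copied from the sample's common code.

    #include <cmath>
    #include <cstddef>

    // Compute softmax over `size` DPU output values on the CPU.
    static void CPUCalcSoftmax(const float* data, size_t size, float* result) {
      double sum = 0.0;
      for (size_t i = 0; i < size; ++i) {
        result[i] = std::exp(data[i]);
        sum += result[i];
      }
      for (size_t i = 0; i < size; ++i) result[i] /= sum;  // normalize to probabilities
    }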
  • In summary, to use the VART APIs:
    • First, create the runner and get the tensor shapes needed by the DPU runner, such as height, width, channels, and size.
    • Second, pre-process the image using the tensor shape information; the mean and scale also need to be applied to the image data.
    • Third, create the input and output tensor buffers and run the DPU.
    • Fourth, wait for the call to complete, then perform the algorithm-level and system-level post-processing. In this case, all of the pre- and post-processing functions need to be implemented by yourself.

DpuTask APIs

API definitions

When an instance of DpuTask is created using the create method, an instance of DpuTaskImp is automatically generated; DpuTaskImp is the implementation of DpuTask. The relationship between the DpuTask APIs and the VART APIs is that DpuTaskImp holds a pointer to a DpuRunner; when DpuTaskImp runs the DPU, it calls into the DpuRunner. It is clear from this that the implementation of DpuTask depends heavily on the implementation of DpuRunner; a simplified sketch of this delegation follows the code below.

    std::unique_ptr<DpuTask> DpuTask::create(const std::string& model_name) {
  return std::unique_ptr<DpuTask>(new DpuTaskImp(model_name));
}

DpuTaskImp::DpuTaskImp(const std::string& model_name)
    : model_name_{model_name},
      dirname_{find_module_dir_name(model_name)},
      runners_{vitis::ai::DpuRunner::create_dpu_runner(dirname_)},
      mean_{std::vector<float>(3, 0.f)},   //
      scale_{std::vector<float>(3, 1.f)},  //
      do_mean_scale_{false} {}
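The run path itself is not shown in the snippet above. The following is a simplified sketch of the delegation described earlier, not the actual library source: it is expressed with the execute_async()/wait() calls from the VART example, and the inputs_/outputs_ members are placeholders for whatever tensor buffers DpuTaskImp prepares internally.

    // Sketch only: DpuTaskImp forwards its run() call to the DpuRunner it owns.
    void DpuTaskImp::run(size_t idx) {
      auto runner = runners_[idx].get();                    // created in the constructor above
      auto job = runner->execute_async(inputs_, outputs_);  // inputs_/outputs_: hypothetical members
      runner->wait(job.first, -1);                          // block until the DPU finishes
    }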

DpuTask_Candidate
Figure 5. DpuTask APIs structure

How to Program with DpuTask APIs

  • Take demo_yolov3.cpp as an example. The first step is to create a DpuTask by using the create method of the class DpuTask.

  • The modules in red need to be developed by you. Looking at the green modules, the DpuTask encapsulates creating the runner, getting the tensor format, setting the mean and scale, and running the DpuRunner. You are still required to develop the system-level pre-processing as well as the algorithm-level and system-level post-processing functions.

DpuTask
Figure 6. DpuTask APIs workload
  • Create a DpuTask with the kernel name, which is generated by the compiler "dnnc".
      auto kernel_name = "yolov3_voc";
  // A image file.
  auto image_file_name = argv[1];
  // Create a dpu task object.
  auto task = vitis::ai::DpuTask::create(kernel_name);
  • Prepare the image data the DpuTask needs. The width and height can be obtained from the task; use them to resize the image.
     auto input_image = cv::imread(image_file_name);
  if (input_image.empty()) {
    cerr << "cannot load " << image_file_name << endl;
    abort();
  }
  // Resize it if its size is not match.
  cv::Mat image;
  auto input_tensor = task->getInputTensor(0u);
  CHECK_EQ((int)input_tensor.size(), 1)
      << " the dpu model must have only one input";
  auto width = input_tensor[0].width;
  auto height = input_tensor[0].height;
  auto size = cv::Size(width, height);
  if (size != input_image.size()) {
    cv::resize(input_image, image, size);
  } else {
    image = input_image;
  }
  • Set the mean and scale parameters of the DpuTask, then fill in the image data by using the setImageRGB() method.
      task->setMeanScaleBGR({0.0f, 0.0f, 0.0f},
                        {0.00390625f, 0.00390625f, 0.00390625f});
  // Set the input image into dpu.
  task->setImageRGB(image);
  • Run the task and do the post-processing. In this case, we are using the post-processing function from xnnpp. Be careful: the post-processing functions defined in xnnpp require specific config parameters, so you need to fill in the post-processing parameters according to the format defined in the DpuModelParam.proto message. Note that if you use a post-processing function implemented on your own, you do not need to set the config parameters that xnnpp requires.
     task->run(0u);

  /* Post-process part */
  // Get output.
  auto output_tensor = task->getOutputTensor(0u);
  // Create a config and set the correlating data to control post-process.
  vitis::ai::proto::DpuModelParam config;
  // Fill all the parameters.
  auto ok =
      google::protobuf::TextFormat::ParseFromString(yolov3_config, &config);
  if (!ok) {
    cerr << "Set parameters failed!" << endl;
    abort();
  }
  // Execute the yolov3 post-processing.
  auto results = vitis::ai::yolov3_post_process(
      input_tensor, output_tensor, config, input_image.cols, input_image.rows);
  • To use the post-processing libraries in xnnpp, you need to initialize the DpuModelParam config. In the message DpuModelParam, we first need to set a specific value for the ModelType; the available types are shown below. When it is set to YOLOv3, the corresponding parameters should be set in yolo_v3_param accordingly.
    message DpuModelParam {
   optional string name = 1;
   repeated DpuKernelParam kernel = 2;
   enum ModelType {
     UNKNOWN_TYPE = 0;
     REFINEDET = 1;
     SSD = 2;
     YOLOv3 = 3;
     CLASSIFICATION = 4;
     DENSE_BOX = 5;
     MULTI_TASK = 6;
     OPENPOSE = 7;
     ROADLINE = 8;
     SEGMENTATION = 9;
     POSEDETECT = 10;
     LANE = 11;
     BLINKER = 12;
     SEGDET = 13;
     ROADLINE_DEEPHI = 14;
     FACEQUALITY5PT = 15; 
     REID = 16; 
   }
   optional  ModelType model_type = 3;
   optional  RefineDetParam refine_det_param = 4;
   optional  YoloV3Param yolo_v3_param = 5;
   optional  SSDParam ssd_param = 6;
   optional  ClassificationParam classification_param = 7;
   optional  DenseBoxParam dense_box_param = 8;
   optional  MultiTaskParam multi_task_param = 9;
   optional  RoadlineParam roadline_param = 10;
   optional  SegmentationParam segmentation_param = 11;
   optional  LaneParam lane_param = 12;
   optional  BlinkerParam blinker_param = 13;
   optional  SegdetParam segdet_param = 14;
   optional  RoadlineDeephiParam roadline_dp_param = 15;

   optional  bool is_tf = 16;
   optional  FaceQuality5ptParam face_quality5pt_param = 17;

   optional TfssdParam tfssd_param = 18;
}
  • The format of the config needs to follow the message YoloV3Param in the protobuf below. Take the configuration defined in the example demo_yolov3.cpp as an example: when using the DpuTask APIs along with the post-processing functions in xnnpp, the following yolov3_config parameters are used, and they can be adapted to the requirements of the model and application.
    message YoloV3Param{
   optional int32 num_classes = 1;
   optional int32 anchorCnt = 2;
   optional float conf_threshold = 3;
   optional float nms_threshold = 4;
   repeated float biases = 5;
   optional bool test_mAP = 6;
   repeated string layer_name = 7;
}


const string yolov3_config = {
    "   name: \"yolov3_voc_416\" \n"
    "   model_type : YOLOv3 \n"
    "   yolo_v3_param { \n"
    "     num_classes: 20 \n"
    "     anchorCnt: 3 \n"
    "     conf_threshold: 0.3 \n"
    "     nms_threshold: 0.45 \n"
    "     biases: 10 \n"
    "     biases: 13 \n"
    "     biases: 16 \n"
    "     biases: 30 \n"
    "     biases: 33 \n"
    "     biases: 23 \n"
    "     biases: 30 \n"
    "     biases: 61 \n"
    "     biases: 62 \n"
    "     biases: 45 \n"
    "     biases: 59 \n"
    "     biases: 119 \n"
    "     biases: 116 \n"
    "     biases: 90 \n"
    "     biases: 156 \n"
    "     biases: 198 \n"
    "     biases: 373 \n"
    "     biases: 326 \n"
    "     test_mAP: false \n"
    "   } \n"};
  • This completes the DpuTask API workflow. Note the post-processing part: if you need to use the functions defined in xnnpp, you must prepare the configuration in the DpuModelParam format.
DpuTask_Workflow
Figure 7. DpuTask APIs workflow

Vitis AI Library APIs

API definitions

  • The Vitis AI Library encapsulates each model according to its requirements. Take the facedetect sample as an example; the source code refers to facedetect.cpp.

  • The inheritance relationship is shown in the following figure. Each model class, such as the class FaceDetect, encapsulates a series of related interface methods. Take the facedetect model as an example.
    First, when the FaceDetect object is created, a DetectImp object is generated at the same time; the real implementation of the run method is owned by DetectImp.

Vitis_AI_Library_APIs_Candidate
Figure 8. Vitis AI Library APIs structure

Second, in the constructor of DetectImp we can see that the parent class TConfigurableDpuTask is constructed too. This class has a member variable configurable_dpu_task_ (a ConfigurableDpuTask), which is used to get the model-specific parameters from the config file located at "/usr/share/vitis_ai_library/models/densebox_640_360".

Third, the run() function of DetectImp includes the pre-processing, the DpuTask run, and the post-processing. The ConfigurableDpuTask is ultimately implemented on top of DpuTask.

    std::unique_ptr<FaceDetect> FaceDetect::create(const std::string &model_name, bool need_preprocess) {
  return std::unique_ptr<FaceDetect>(
      new DetectImp(model_name, need_preprocess));
}


DetectImp::DetectImp(const std::string &model_name, bool need_preprocess)
    : vitis::ai::TConfigurableDpuTask<FaceDetect>(model_name, need_preprocess),
    det_threshold_(configurable_dpu_task_->getConfig()
                         .dense_box_param()
                         .det_threshold()) {}


FaceDetectResult DetectImp::run(const cv::Mat &input_image) {
  __TIC__(FACE_DETECT_E2E)
  // Set input image into DPU Task
  cv::Mat image;
  auto size = cv::Size(getInputWidth(), getInputHeight());
  if (size != input_image.size()) {
    cv::resize(input_image, image, size, 0);
  } else {
    image = input_image;
  }
  __TIC__(FACE_DETECT_SET_IMG)
  configurable_dpu_task_->setInputImageBGR(image);
  __TOC__(FACE_DETECT_SET_IMG)

  __TIC__(FACE_DETECT_DPU)
  configurable_dpu_task_->run(0);
  __TOC__(FACE_DETECT_DPU)

  __TIC__(FACE_DETECT_POST_ARM)
  auto ret = vitis::ai::face_detect_post_process(
      configurable_dpu_task_->getInputTensor(),
      configurable_dpu_task_->getOutputTensor(),
      configurable_dpu_task_->getConfig(), det_threshold_);
  __TOC__(FACE_DETECT_POST_ARM)

  __TOC__(FACE_DETECT_E2E)
  return ret[0];
}

How to Program with Vitis AI Library APIs

The Vitis AI Library APIs can also separate out the pre-processing. As shown below, if need_preprocess is set to false, the resize and mean/scale steps need to be developed by you.
Otherwise, if you set need_preprocess to true, you only need to take care of the image decoding and color conversion.
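A minimal sketch of the need_preprocess = false case follows, assuming the caller then performs the resize and the algorithm-level mean/scale before calling run(); the header path and the mean/scale step are illustrative placeholders, not the model's real values.

    #include <opencv2/opencv.hpp>
    #include <vitis/ai/facedetect.hpp>

    int main(int argc, char* argv[]) {
      // Pre-processing is now the caller's responsibility.
      auto model = vitis::ai::FaceDetect::create("densebox_640_360",
                                                 /*need_preprocess=*/false);
      cv::Mat raw = cv::imread(argv[1]);
      cv::Mat image;
      // System-level pre-processing: resize to the model's input size.
      cv::resize(raw, image,
                 cv::Size(model->getInputWidth(), model->getInputHeight()));
      // Algorithm-level pre-processing: placeholder mean/scale step; the real
      // values come from the model's configuration.
      image.convertTo(image, CV_8SC3, /*alpha=*/1.0, /*beta=*/-128.0);
      auto result = model->run(image);
      return 0;
    }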

Vitis_AI_Library_APIs
Figure 9. Vitis AI Library workload
  • Take test_jpg_facedetect.cpp as an example; the default value of need_preprocess is true. So in this case, all you need to do is read an image and then send it to the run() method of the FaceDetect instance.
    // need_preprocess=true
  static std::unique_ptr<FaceDetect> create(const std::string &model_name,
                                            bool need_preprocess = true);


int main(int argc, char *argv[]) {
  string model = argv[1];
  return vitis::ai::main_for_jpeg_demo(
      argc, argv,
      [model] {
        return vitis::ai::FaceDetect::create(model);
      },
      process_result, 2);
}
  • When an image is loaded, the run() method of the created FaceDetect object checks the size of the image; if it does not fit the model, it resizes the image. It also automatically extracts the mean and scale parameters and uses them for the algorithm-level pre-processing. So you do not need to care about the pre-processing.
      auto image_file_name = std::string{argv[i]};
    auto image = cv::imread(image_file_name);
    if (image.empty()) {
      LOG(FATAL) << "cannot load " << image_file_name << std::endl;
      abort();
    }
    auto result = model->run(image);
  • Post-processing is also implemented inside the encapsulated model. In the run() of DetectImp, the post-processing function is called after configurable_dpu_task_->run() finishes.
     __TIC__(FACE_DETECT_DPU)
  configurable_dpu_task_->run(0);
  __TOC__(FACE_DETECT_DPU)

  __TIC__(FACE_DETECT_POST_ARM)
  auto ret = vitis::ai::face_detect_post_process(
      configurable_dpu_task_->getInputTensor(),
      configurable_dpu_task_->getOutputTensor(),
      configurable_dpu_task_->getConfig(), det_threshold_);
  __TOC__(FACE_DETECT_POST_ARM)

  __TOC__(FACE_DETECT_E2E)
  return ret[0];
  • When using the Vitis AI Library APIs to build applications, developers do not need to implement the post-processing themselves. The model encapsulation calls the post-processing libraries, which have been implemented with high performance. The model parameters can be customized by modifying the model's config file, which follows the dpu_model_param.proto format.
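For example, such a config file uses the DpuModelParam text format shown earlier. The snippet below is an illustrative sketch for a face detection model; the field names come from the message definition and the DetectImp code above, and the value is a placeholder, not the model's real threshold.

    # Illustrative DpuModelParam text-format config (values are placeholders)
    name : "densebox_640_360"
    model_type : DENSE_BOX
    dense_box_param {
      det_threshold : 0.9
    }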

Conclusion

With different levels of APIs, the requirements can be divided into three categories.

  • If you want to deploy a brand-new model that is not included in the Vitis AI Library, we suggest that you handle the pre-processing and post-processing yourself and use the VART APIs for programming. The post-processing library can be used as a reference.
  • If you want to use a completely new model but do not want to implement the algorithm-level pre-processing yourself, the DpuTask APIs will do the trick. Furthermore, DpuTask allows you to independently implement and invoke the post-processing function.
  • If the model is in the supported network list of the Vitis AI Library, we suggest you use the Vitis AI Library APIs, which will give you higher performance and speed up your development significantly. These APIs also provide the option of handling the pre-processing separately, as long as you set need_preprocess to false. For post-processing, you can adapt it to the needs of the model and the application.

About Dachang Li

Dachang is a Sr. Technical Marketing Engineer working on Software and AI Platforms at AMD, responsible for the Vitis AI Runtime and Vitis AI Library. Before joining AMD in July 2018, he served as a product lead in the DeePhi business department. He currently focuses on edge-side AI and X+ML solutions.