# SDAccel Environment Profiling and Optimization Guide UG1207 (v2018.3) January 24, 2019 # **Revision History** The following table shows the revision history for this document. | General updates SDAccel Optimization Flow Overview Space of Optimization Flow Overview Wowed and split figures into: Baselining Functionalities and Performance Optimizing Data Movement Optimizing Data Movement Optimizing AXI Interfaces to PLRAM Added information regarding PLRAM. SDAccel Build Process General updates Updated figures and images throughout the document. SDAccel Build Process Added new figure and build target information. Overlapping Data Transfers with Kernel Computation Wowed the Buffer Memory Segmentation section into: Moved the Buffer Memory Segmentation section into: Multiple Compute Units (new topic) Compute Units (new topic) Compute Units (new topic) Compute Units (new topic) Compute Units (new topic) Compute Units (new topic) Added that the section also suggests corrective actions. 10/02/2018 Version 2018.2. xxlf Added a note regarding the need to align DDR assignment, and SLR placement of a kernel. Of/02/2018 Version 2018.2. Entire document Minor editorial changes for 2018.2. Entire document Space of the Waveform viewer with the School and rewrite. Some background material has been removed to focus on profiling and optimization. Guidance Discusses the use of the Guidance view for improving project performance. Waveform Viewer Discusses the use of the Waveform viewer with the SDAccel® environment. Using Implementation Tools Added a discussion of Controlling FPGA Implementation with the Vivado Design Suite. Interface Optimization Using Computational Parallelism Loop Parallelism and Task Parallelism Optimizing Computational Units Optimizing Memory Architecture 04/04/2018 Version 2018.1 Entire document Minor editorial edits for 2018.1 Command Line Command Line Change to profile kernel option. | Section | Revision Summary | | | | | | |--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------------------------------------------|-------------------------------------------------------------------------------------|--|--|--|--|--| | SDAccel Optimization Flow Overview Moved and split figures into: Baselining Functionalities and Performance Optimizing Data Movement Optimizing Maries of Performance Optimizing Maries of Performance Optimizing Kernel Computation Using Multiple DDR Banks | 01/24/2019 Version 2018.3 | | | | | | | | # Baselining Functionalities and Performance | General updates | Updated figures and minor editorial changes. | | | | | | | Using Multiple DDR Banks Assigning AXI Interfaces to PLRAM 12/05/2018 Version 2018.3 General updates Updated figures and images throughout the document. SDAccel Build Process Added new figure and build target information. Overlapping Data Transfers with Kernel Computation Worlapping Data Transfers with Kernel Computation Overlapping Data Transfers with Kernel Computation Overlapping Data Transfers with Kernel Computation Added new figure and build target information. Moved the Buffer Memory Segmentation section into: Multiple Compute Units (new topic) Compute Units (new topic) Compute Units Scheduling (existing topic) Added and the section also suggests corrective actions. 10/02/2018 Version 2018.2.xaff Assigning Kernels to SLR regions Added a note regarding the need to align DDR assignment, and SLR placement of a kernel. 10/02/2018 Version 2018.2 Entire document Minor editorial changes for 2018.2. This document has gone through an extensive reorganization and rewrite. Some background material has been removed to focus on profiling and optimization. Guidance Discusses the use of the Guidance view for improving project performance. Waveform Viewer Discusses the use of the Waveform viewer with the SDAccel® environment. Using Implementation Tools Added a discussion of Controlling FPGA Implementation with the Vivado Design Suite. Interface Optimization Optimizing Computational Units Optimizing Computational Units Optimizing Memory Architecture 04/04/2018 Version 2018.1 Entire document Minor editorial edits for 2018.1 | SDAccel Optimization Flow Overview | Moved and split figures into: | | | | | | | Using Multiple DDR Banks Assigning AXI Interfaces to PLRAM Toverlapping Data Transfers with Kernel Computation Moved the Buffer Memory Segmentation section into: Moved the Buffer Memory Segmentation section into: Moved the Buffer Memory Segmentation section into: Moved the Buffer Memory Segmentation section into: Moved the Buffer Memory Segmentation section into: Moved the Buffer Memory Segmentation section into: Multiple Compute Units (new topic) Compute Units (new topic) Compute Units (new topic) Compute Unit Scheduling (existing topic) Added that the section also suggests corrective actions. Total Survey of Stranger Section and Surgests corrective actions. Minor editorial changes for 2018.2. Entire document Minor editorial changes for 2018.2. This document has gone through an extensive reorganization and rewrite. Some background material has been removed to focus on profilling and optimization. Guidance Discusses the use of the Guidance view for improving project performance. Waveform Viewer Discusses the use of the Guidance view for improving project performance. Waveform Viewer Discusses the use of the Waveform viewer with the SDAccell™ environment. Using Implementation Tools Added a discussion of Controlling FPGA Implementation with the Vivado Design Suite. Interface Optimization Optimizing Computational Parallelism Optimizing Computational Units Optimizing Computational Units Optimizing Memory Architecture Minor editorial edits for 2018.1 | | Baselining Functionalities and Performance | | | | | | | Using Multiple DDR Banks Assigning AXI Interfaces to PLRAM 12/05/2018 Version 2018.3 General updates Updated figures and images throughout the document. Added new figure and build target information. Moved the Buffer Memory Segmentation section into: Multiple Compute Units (new topic) Compute Unit Scheduling (existing topic) Chapter 5: Topological Optimization Added that the section also suggests corrective actions. 10/02/2018 Version 2018.2.xdf Assigning Kernels to SLR regions Added a note regarding the need to align DDR assignment, and SLR placement of a kernel. 05/06/2018 Version 2018.2 Entire document Minor editorial changes for 2018.2. This document has gone through an extensive reorganization and rewrite. Some background material has been removed to focus on profiling and optimization. Guidance Discusses the use of the Guidance view for improving project performance. Waveform Viewer Discusses the use of the Waveform viewer with the SDAccel™ environment. Using Implementation Tools Added a discussion of Controlling FPGA Implementation with the Vivado Design Suite. Interface Optimization Optimizing Computational Parallelism Loop Parallelism and Task Parallelism Optimizing Computational Units Optimizing Memory Architecture 04/04/2018 Version 2018.1 Entire document Minor editorial edits for 2018.1 | | Optimizing Data Movement | | | | | | | Assigning AXI Interfaces to PLRAM 12/05/2018 Version 2018.3 General updates Updated figures and images throughout the document. Added new figure and build target information. Moved the Buffer Memory Segmentation section into: 'Multiple Compute Units (new topic) 'Compute Unit Scheduling (existing topic) Chapter 5: Topological Optimization Added that the section also suggests corrective actions. 10/02/2018 Version 2018.2.xdf Assigning Kernels to SLR regions Added a note regarding the need to align DDR assignment, and SLR placement of a kernel. 07/02/2018 Version 2018.2 Entire document Minor editorial changes for 2018.2. This document has gone through an extensive reorganization and rewrite. Some background material has been removed to focus on profiling and optimization. Guidance Discusses the use of the Guidance view for improving project performance. Waveform Viewer Discusses the use of the Waveform viewer with the SDAccel™ environment. Using Implementation Tools Added a discussion of Controlling FPGA Implementation with the Vivado Design Suite. Interface Optimization Optimizing Computational Parallelism Loop Parallelism and Task Parallelism Optimizing Computational Units Optimizing Memory Architecture 04/04/2018 Version 2018.1 Entire document Minor editorial edits for 2018.1 | | Optimizing Kernel Computation | | | | | | | Seneral updates Updated figures and images throughout the document. | | Added information regarding PLRAM. | | | | | | | General updates Updated figures and images throughout the document. SDAccel Build Process Added new figure and build target information. Overlapping Data Transfers with Kernel Computation Moved the Buffer Memory Segmentation section into: • Multiple Compute Units (new topic) • Compute Unit Scheduling (existing topic) Chapter 5: Topological Optimization Added that the section also suggests corrective actions. 10/02/2018 Version 2018.2.Xdf Assigning Kernels to SLR regions Added a note regarding the need to align DDR assignment, and SLR placement of a kernel. 07/02/2018 Version 2018.2 Entire document Minor editorial changes for 2018.2. This document has gone through an extensive reorganization and rewrite. Some background material has been removed to focus on profiling and optimization. Guidance Discusses the use of the Guidance view for improving project performance. Waveform Viewer Discusses the use of the Waveform viewer with the SDAccel™ environment. Using Implementation Tools Added a discussion of Controlling FPGA Implementation with the Vivado Design Suite. Interface Optimization Using AXI4Data Width Optimizing Computational Parallelism Loop Parallelism and Task Parallelism Optimizing Computational Units Optimizing Memory Architecture 04/04/2018 Version 2018.1 Entire document Minor editorial edits for 2018.1 | | | | | | | | | SDAccel Build Process Added new figure and build target information. Overlapping Data Transfers with Kernel Computation Woved the Buffer Memory Segmentation section into: • Multiple Compute Units (new topic) • Compute Units Scheduling (existing topic) Chapter 5: Topological Optimization Added that the section also suggests corrective actions. 10/02/2018 Version 2018.2.xdf Assigning Kernels to SLR regions Added a note regarding the need to align DDR assignment, and SLR placement of a kernel. 07/02/2018 Version 2018.2 Entire document Minor editorial changes for 2018.2. This document has gone through an extensive reorganization and rewrite. Some background material has been removed to focus on profiling and optimization. Guidance Discusses the use of the Guidance view for improving project performance. Waveform Viewer Discusses the use of the Waveform viewer with the SDAccel™ environment. Using Implementation Tools Added a discussion of Controlling FPGA Implementation with the Vivado Design Suite. Interface Optimization Using AXI4Data Width Optimizing Computational Parallelism Loop Parallelism and Task Parallelism Optimizing Computational Units Optimizing Memory Architecture 04/04/2018 Version 2018.1 Entire document Minor editorial edits for 2018.1 | | | | | | | | | Overlapping Data Transfers with Kernel Computation Moved the Buffer Memory Segmentation section into: Multiple Compute Units (new topic) Compute Unit Scheduling (existing topic) Added that the section also suggests corrective actions. 10/02/2018 Version 2018.2.xdf Assigning Kernels to SLR regions Added a note regarding the need to align DDR assignment, and SLR placement of a kernel. 07/02/2018 Version 2018.2 Entire document Minor editorial changes for 2018.2. This document has gone through an extensive reorganization and rewrite. Some background material has been removed to focus on profiling and optimization. Guidance Discusses the use of the Guidance view for improving project performance. Waveform Viewer Discusses the use of the Waveform viewer with the SDAccel <sup>120</sup> environment. Using Implementation Tools Added a discussion of Controlling FPGA Implementation with the Vivado Design Suite. Interface Optimization Using AXI4Data Width Optimizing Computational Parallelism Optimizing Computational Units Optimizing Memory Architecture 04/04/2018 Version 2018.1 Entire document Minor editorial edits for 2018.1 | General updates | Updated figures and images throughout the document. | | | | | | | • Multiple Compute Units (new topic) • Compute Units Scheduling (existing topic) Chapter 5: Topological Optimization Added that the section also suggests corrective actions. 10/02/2018 Version 2018.2.xdf Assigning Kernels to SLR regions Added a note regarding the need to align DDR assignment, and SLR placement of a kernel. 07/02/2018 Version 2018.2 Entire document Minor editorial changes for 2018.2. This document has gone through an extensive reorganization and rewrite. Some background material has been removed to focus on profiling and optimization. Guidance Discusses the use of the Guidance view for improving project performance. Waveform Viewer Discusses the use of the Waveform viewer with the SDAcce <sup>IM</sup> environment. Using Implementation Tools Added a discussion of Controlling FPGA Implementation with the Vivado Design Suite. Interface Optimization Using AXI4Data Width Optimizing Computational Parallelism Optimizing Computational Units Optimizing Memory Architecture 04/04/2018 Version 2018.1 Entire document Minor editorial edits for 2018.1 | SDAccel Build Process | Added new figure and build target information. | | | | | | | Compute Unit Scheduling (existing topic) Chapter 5: Topological Optimization Added that the section also suggests corrective actions. 10/02/2018 Version 2018.2.xdf Assigning Kernels to SLR regions Added a note regarding the need to align DDR assignment, and SLR placement of a kernel. 07/02/2018 Version 2018.2 Entire document Minor editorial changes for 2018.2. This document has gone through an extensive reorganization and rewrite. Some background material has been removed to focus on profiling and optimization. Guidance Discusses the use of the Guidance view for improving project performance. Waveform Viewer Discusses the use of the Waveform viewer with the SDAccel™ environment. Using Implementation Tools Added a discussion of Controlling FPGA Implementation with the Vivado Design Suite. Interface Optimization Using AXI4Data Width Optimizing Computational Parallelism Loop Parallelism and Task Parallelism Optimizing Computational Units Optimizing Memory Architecture 04/04/2018 Version 2018.1 Entire document Minor editorial edits for 2018.1 | Overlapping Data Transfers with Kernel Computation | | | | | | | | Added that the section also suggests corrective actions. 10/02/2018 Version 2018.2.xdf Assigning Kernels to SLR regions Added a note regarding the need to align DDR assignment, and SLR placement of a kernel. 07/02/2018 Version 2018.2 Entire document Minor editorial changes for 2018.2. 1 This document has gone through an extensive reorganization and rewrite. Some background material has been removed to focus on profiling and optimization. Guidance Discusses the use of the Guidance view for improving project performance. Waveform Viewer Discusses the use of the Waveform viewer with the SDAccel™ environment. Using Implementation Tools Added a discussion of Controlling FPGA Implementation with the Vivado Design Suite. Interface Optimization Using AXI4Data Width Optimizing Computational Parallelism Loop Parallelism and Task Parallelism Optimizing Computational Units Optimizing Memory Architecture 04/04/2018 Version 2018.1 Entire document Minor editorial edits for 2018.1 | | | | | | | | | 10/02/2018 Version 2018.2.xdf Assigning Kernels to SLR regions Added a note regarding the need to align DDR assignment, and SLR placement of a kernel. 07/02/2018 Version 2018.2 Entire document Minor editorial changes for 2018.2. 06/06/2018 Version 2018.2 This document has gone through an extensive reorganization and rewrite. Some background material has been removed to focus on profiling and optimization. Guidance Discusses the use of the Guidance view for improving project performance. Waveform Viewer Discusses the use of the Waveform viewer with the SDAccel™ environment. Using Implementation Tools Added a discussion of Controlling FPGA Implementation with the Vivado Design Suite. Interface Optimization Using AXI4Data Width Optimizing Computational Parallelism Optimizing Computational Units Optimizing Memory Architecture 04/04/2018 Version 2018.1 Entire document Minor editorial edits for 2018.1 | | Compute Unit Scheduling (existing topic) | | | | | | | Assigning Kernels to SLR regions Added a note regarding the need to align DDR assignment, and SLR placement of a kernel. O7/02/2018 Version 2018.2 Entire document Minor editorial changes for 2018.2. O6/06/2018 Version 2018.2 This document has gone through an extensive reorganization and rewrite. Some background material has been removed to focus on profiling and optimization. Guidance Discusses the use of the Guidance view for improving project performance. Waveform Viewer Discusses the use of the Waveform viewer with the SDAccel™ environment. Using Implementation Tools Added a discussion of Controlling FPGA Implementation with the Vivado Design Suite. Interface Optimization Using AXI4Data Width Optimizing Computational Parallelism Loop Parallelism and Task Parallelism Optimizing Computational Units Optimizing Memory Architecture 04/04/2018 Version 2018.1 Entire document Minor editorial edits for 2018.1 | Chapter 5: Topological Optimization | Added that the section also suggests corrective actions. | | | | | | | and SLR placement of a kernel. 07/02/2018 Version 2018.2 Entire document Minor editorial changes for 2018.2. 06/06/2018 Version 2018.2 This document has gone through an extensive reorganization and rewrite. Some background material has been removed to focus on profiling and optimization. Guidance Discusses the use of the Guidance view for improving project performance. Waveform Viewer Discusses the use of the Waveform viewer with the SDAccel™ environment. Using Implementation Tools Added a discussion of Controlling FPGA Implementation with the Vivado Design Suite. Interface Optimization Using AXI4Data Width Optimizing Computational Parallelism Loop Parallelism and Task Parallelism Optimizing Computational Units Optimizing Memory Architecture 04/04/2018 Version 2018.1 Entire document Minor editorial edits for 2018.1 | 10/02/2018 Ve | ersion 2018.2.xdf | | | | | | | Entire document Minor editorial changes for 2018.2. O6/06/2018 Version 2018.2 This document has gone through an extensive reorganization and rewrite. Some background material has been removed to focus on profiling and optimization. Guidance Discusses the use of the Guidance view for improving project performance. Waveform Viewer Discusses the use of the Waveform viewer with the SDAccel™ environment. Using Implementation Tools Added a discussion of Controlling FPGA Implementation with the Vivado Design Suite. Interface Optimization Using AXI4Data Width Optimizing Computational Parallelism Loop Parallelism and Task Parallelism Optimizing Computational Units Optimizing Memory Architecture O4/04/2018 Version 2018.1 Entire document Minor editorial edits for 2018.1 | Assigning Kernels to SLR regions | | | | | | | | This document has gone through an extensive reorganization and rewrite. Some background material has been removed to focus on profiling and optimization. Guidance Discusses the use of the Guidance view for improving project performance. Waveform Viewer Discusses the use of the Waveform viewer with the SDAccel™ environment. Using Implementation Tools Added a discussion of Controlling FPGA Implementation with the Vivado Design Suite. Interface Optimization Using AXI4Data Width Optimizing Computational Parallelism Loop Parallelism and Task Parallelism Optimizing Computational Units Optimizing Memory Architecture 04/04/2018 Version 2018.1 Entire document Minor editorial edits for 2018.1 | 07/02/2018 | Version 2018.2 | | | | | | | This document has gone through an extensive reorganization and rewrite. Some background material has been removed to focus on profiling and optimization. Guidance Discusses the use of the Guidance view for improving project performance. Waveform Viewer Discusses the use of the Waveform viewer with the SDAccel™ environment. Using Implementation Tools Added a discussion of Controlling FPGA Implementation with the Vivado Design Suite. Interface Optimization Using AXI4Data Width Optimizing Computational Parallelism Loop Parallelism and Task Parallelism Optimizing Computational Units Optimizing Memory Architecture 04/04/2018 Version 2018.1 Entire document Minor editorial edits for 2018.1 | Entire document | Minor editorial changes for 2018.2. | | | | | | | reorganization and rewrite. Some background material has been removed to focus on profiling and optimization. Guidance Discusses the use of the Guidance view for improving project performance. Discusses the use of the Waveform viewer with the SDAccel™ environment. Using Implementation Tools Added a discussion of Controlling FPGA Implementation with the Vivado Design Suite. Interface Optimization Using AXI4Data Width Optimizing Computational Parallelism Loop Parallelism and Task Parallelism Optimizing Computational Units Optimizing Memory Architecture 04/04/2018 Version 2018.1 Entire document Minor editorial edits for 2018.1 | 06/06/2018 | Version 2018.2 | | | | | | | project performance. Waveform Viewer Discusses the use of the Waveform viewer with the SDAccel™ environment. Using Implementation Tools Added a discussion of Controlling FPGA Implementation with the Vivado Design Suite. Interface Optimization Using AXI4Data Width Optimizing Computational Parallelism Loop Parallelism and Task Parallelism Optimizing Computational Units Optimizing Memory Architecture 04/04/2018 Version 2018.1 Entire document Minor editorial edits for 2018.1 | | reorganization and rewrite. Some background material has | | | | | | | SDAccel™ environment. Using Implementation Tools Added a discussion of Controlling FPGA Implementation with the Vivado Design Suite. Interface Optimization Using AXI4Data Width Loop Parallelism and Task Parallelism Optimizing Computational Units Optimizing Computational Units Optimizing Memory Architecture 04/04/2018 Version 2018.1 Entire document Minor editorial edits for 2018.1 | Guidance | | | | | | | | With the Vivado Design Suite.Interface OptimizationUsing AXI4Data WidthOptimizing Computational ParallelismLoop Parallelism and Task ParallelismOptimizing Computational UnitsOptimizing Memory Architecture04/04/2018 Version 2018.1Entire documentMinor editorial edits for 2018.1 | Waveform Viewer | | | | | | | | Optimizing Computational Parallelism Loop Parallelism and Task Parallelism Optimizing Computational Units Optimizing Memory Architecture 04/04/2018 Version 2018.1 Entire document Minor editorial edits for 2018.1 | Using Implementation Tools | Added a discussion of Controlling FPGA Implementation with the Vivado Design Suite. | | | | | | | Optimizing Computational Units Optimizing Memory Architecture 04/04/2018 Version 2018.1 Entire document Minor editorial edits for 2018.1 | Interface Optimization | Using AXI4Data Width | | | | | | | Optimizing Memory Architecture 04/04/2018 Version 2018.1 Entire document Minor editorial edits for 2018.1 | Optimizing Computational Parallelism | - | | | | | | | 04/04/2018 Version 2018.1 Entire document Minor editorial edits for 2018.1 | | Optimizing Computational Units | | | | | | | Entire document Minor editorial edits for 2018.1 | | Optimizing Memory Architecture | | | | | | | | | | | | | | | | Command Line Change to profile_kernel option. | Entire document | Minor editorial edits for 2018.1 | | | | | | | | Command Line | Change to profile_kernel option. | | | | | | | Section | Revision Summary | | | |----------------------------------|---------------------------------------------|--|--| | Assigning Kernels to SLR regions | RTL Kernel Wizard kernel naming convention. | | | # **Table of Contents** | Revision History | 2 | |--------------------------------------------------------|----| | Chapter 1: Introduction | 6 | | Execution Model of an SDAccel Application | 6 | | SDAccel Build Process | 8 | | SDAccel Optimization Flow Overview | 11 | | Chapter 2: SDAccel Profiling and Optimization Features | 17 | | System Estimate | 17 | | HLS Report | 22 | | Profile Summary Report | 25 | | Application Timeline | 31 | | Waveform Viewer | | | Guidance | | | Using Implementation Tools | 48 | | Chapter 3: Kernel Optimization | 51 | | Interface Attributes (Detailed Kernel Trace) | 51 | | Optimizing Computational Parallelism | 60 | | Optimizing Compute Units | 72 | | Optimizing Memory Architecture | 74 | | Chapter 4: Host Optimization | 80 | | Reducing Overhead of Kernel Enqueing | 80 | | Data Transfers | 81 | | Compute Unit Scheduling | 84 | | Using clEnqueueMigrateMemObjects to Transfer Data | 86 | | Chapter 5: Topological Optimization | 88 | | Multiple Compute Units | 88 | | Using Multiple DDR Banks | 88 | | Appendix A: Examples | 93 | | Appendix B: Additional Resources and Legal Notices | 94 | |----------------------------------------------------|----| | Xilinx Resources | | | Documentation Navigator and Design Hubs | | | References | | | Please Read: Important Legal Notices | 95 | # Introduction This guide presents all SDx™ development environment features related to performance analysis of the design. It is also logically structured to assist in the actual performance improvement effort. Dedicated sections are available for the main components of the SDAccel™ environment performance bottlenecks, namely Accelerator, PCle® bus transfer, and Host code. Each of these sections is structured to guide the developer from recognizing bottlenecks all the way to solution approaches to increase overall system performance. **Note:** Performance optimization assumes, as a starting point, a working design intended for performance improvement. If erroneous behavior is encountered, look for guidance in SDAccel Environment Debugging Guide (UG1281). Similarly, the general concepts regarding coding of host code or accelerator kernels are not explained here; these concepts are introduced in the SDAccel Environment Programmers Guide (UG1277). # **Execution Model of an SDAccel Application** The SDAccel environment is designed to provide a simplified development experience for FPGA-based software acceleration platforms. The general structure of the acceleration platform is shown in the following figure. Figure 1: Architecture of an SDAccel Application The custom application is running on the host x86 server and uses OpenCL API calls to interact with the FPGA accelerators. The Xilinx runtime (XRT) manages those interactions. The application is written in C/C++ using OpenCL APIs. The custom kernels are running within a Xilinx FPGA with the XRT managing interactions between the host application and the accelerator. Communication between the host x86 machine and the accelerator board occurs across the PCIe bus. The SDAccel hardware platform contains global memory banks. The data transfer between the host machine and kernels, in either direction, occurs through these global memory banks. The kernels running on the FPGA can have one or more memory interfaces. The connection from the memory banks to those memory interfaces is programmable and determined by linking options of the compiler. The SDAccel execution model follows these steps: - 1. The host application writes the data needed by a kernel into the global memory of the attached device through the PCIe interface. - 2. The host application programs the kernel with its input parameters. - 3. The host application triggers the execution of the kernel function on the FPGA. - 4. The kernel performs the required computation while reading and writing data from global memory, as necessary. - 5. The kernels write data back to the memory banks, and notify the host that it has completed its task. - 6. The host application reads data back from global memory into the host memory space, and continues processing as needed. The FPGA can accommodate multiple kernel instances at one time; this can occur between different types of kernels or multiple instances of the same kernel. The XRT transparently orchestrates the communication between the host application and the kernels in the accelerator. The number of instances of a kernel is determined by compilation options. #### **SDAccel Build Process** The SDAccel environment offers all of the features of a standard software development environment: - Optimized compiler for host applications - Cross-compilers for the FPGA - Robust debugging environment to help identify and resolve issues in the code - Performance profilers to identify bottlenecks and optimize the code Within this environment, the build process uses a standard compilation and linking process for both the software elements, and the hardware elements of the project. As shown in the following figure, the host application is built through one process using standard GCC compiler, and the FPGA binary is built through a separate process using the Xilinx xocc compiler. Figure 2: Software/Hardware Build Process 1. Host application build process using GCC: - Each host application source file is compiled to an object file (. o). - The object files (.o) are linked with the Xilinx SDAccel runtime shared library to create the executable (.exe). - 2. FPGA build process is highlighted in the following figure: - Each kernel is independently compiled to a Xilinx object (.xo) file. - $_{\circ}$ C/C++ and OpenCL C kernels are compiled for implementation on an FPGA using the xocc compiler. This step leverages the Vivado® HLS compiler. Pragmas and attributes supported by Vivado HLS can be used in C/C++ and OpenCL C kernel source code to specify the desired kernel micro-architecture and control the result of the compilation process. - RTL kernels are compiled using the package\_xo utility. The RTL kernel wizard in the SDAccel environment can be used to simplify this process. - The kernel .xo files are linked with the hardware platform (shell) to create the FPGA binary (.xolbin). Important architectural aspects are determined during the link step. In particular, this is where connections from kernel ports to global memory banks are established and where the number of instances for each kernel is specified. - When the build target is software or hardware emulation, as described below, xocc generates simulation models of the device contents. When the build target is the system (actual hardware), xocc generates the FPGA binary for the device leveraging the Vivado Design Suite to run synthesis and implementation. Figure 3: FPGA Build Process **Note:** The xocc compiler automatically uses the Vivado HLS and Vivado Design Suite tools to build the kernels to run on the FPGA platform. It uses these tools with predefined settings which have proven to provide good quality of results. Using the SDAccel environment and the xocc compiler does not require knowledge of these tools; however, hardware-savvy developers can fully leverage these tools and use all their available features to implement kernels. #### **Build Targets** The SDAccel tool build process generates the host application executable (.exe) and the FPGA binary (.xclbin). The SDAccel build target defines the nature of FPGA binary generated by the build process. The SDAccel tool provides three different build targets, two emulation targets used for debug and validation purposes, and the default hardware target used to generate the actual FPGA binary: - Software Emulation (sw\_emu): Both the host application code and the kernel code are compiled to run on the x86 processor. This allows iterative algorithm refinement through fast build-and-run loops. This target is useful for identifying syntax errors, performing source-level debugging of the kernel code running together with application, and verifying the behavior of the system. - Hardware Emulation (hw\_emu): The kernel code is compiled into a hardware model (RTL) which is run in a dedicated simulator. This build and run loop takes longer but provides a detailed, cycle-accurate, view of kernel activity. This target is useful for testing the functionality of the logic that will go in the FPGA and for getting initial performance estimates. - System (hw): The kernel code is compiled into a hardware model (RTL) and is then implemented on the FPGA device, resulting in a binary that will run on the actual FPGA. ## **SDAccel Optimization Flow Overview** SDAccel environment is a complete software development environment for creating, compiling, and optimizing C/C++/OpenCL applications to be accelerated on Xilinx FPGAs. The SDAccel environment includes the three recommended flows for optimizing an application. For information on each flow, refer to the following: - Baselining Functionalities and Performance - Optimizing Data Movement - Optimizing Kernel Computation #### **Baselining Functionalities and Performance** It is very important to understand the performance of your application before you start any optimization effort. This is achieved by baselining the application in terms of functionalities and performance. Run application on processor Profile application to identify bottlenecks and select functions to be accelerated. Convert host code to use OpenCL APIs Convert target functions to CL or C/C++ kernels Run Software Emulation Verify Function Correctness Run Hardware Emulation Analyze Kernel Compilation Reports, Profile Summary, Timeline Trace, Device **HW Transactions** Build and Run application on FPGA acceleration card Analyze Profile Summary Analyze Timeline Trace Function/Performance baselined X22238-012219 Figure 4: Baselining Functionalities and Performance Flow #### **Identify Bottlenecks** The first step is to identify the bottlenecks of the current application running on your existing platform. The most effective way is to run the application with profiling tools, like <code>valgrind</code>, <code>callgrind</code>, and GNU <code>gprof</code>. The profiling data generated by these tools show the call graph with the number of calls to all functions and their execution time. The functions that consume the most execution time are good candidates to be offloaded and accelerated onto FPGAs. #### **Convert Target Functions** After the target functions are selected, convert them to OpenCL C kernels or C/C++ kernels without any optimization. The application code calling these kernels will also need to be converted to use OpenCL APIs for data movement and task scheduling. **TIP:** Keep everything as simple as possible and minimize changes to the existing code in this step so you can quickly generate a working design on the FPGA and get the baselined performance and resource number. #### **Run Software and Hardware Emulation** Next, run software and hardware emulation to verify the function correctness and generate profiling data on the host code and the kernels. Analyze the kernel compilation reports, profile summary, timeline trace, and device hardware transactions to understand the baselined performance estimate such as timing, interval, and latency and resource utilization, such as DSP and block RAM. #### **Build and Run the Application** The last step in baselining is building and running the application on an FPGA acceleration card. Analyze the reports from the system compilation and the profiling data from application execution to see the actual performance and resource utilization. **TIP:** Save all the reports during baselining, so that you can reference and compare results during optimization. #### **Optimizing Data Movement** Figure 5: Optimizing Data Movement Flow In the OpenCL API, all data is transferred from the host memory to the global memory on the device first and then from the global memory to the kernel for computation. The computation results are written back from the kernel to the global memory and lastly from the global memory to the host memory. A key factor in determining strategies for kernel optimization is understanding how data can be efficiently moved around. **RECOMMENDED:** Optimize the data movement in the application before optimizing computation. During data movement optimization, it is important to isolate data transfer code from computation code because inefficiency in computation might cause stalls in data movement. Xilinx recommends that you modify the host code and kernels with data transfer code only for this optimization step. The goal is to maximize the system level data throughput by maximizing PCIe bandwidth usage and DDR bandwidth usage. It usually takes multiple iterations of running software emulation, hardware emulation, as well as execution on FPGAs, to achieve the goal. #### **Optimizing Kernel Computation** Figure 6: Optimizing Kernel Computation Flow One of the key benefits of an FPGA is that you can create custom logic for your specific application. The goal of kernel computation optimization is to create processing logic that can consume all the data as soon as they arrive at kernel interfaces. The key metric during this step is the initiation interval (II). This is generally achieved by expanding the processing code to match the data path with techniques such as function pipelining, loop unrolling, array partitioning, data flowing, etc. The SDAccel environment produces various compilation reports and profiling data during hardware emulation and system run to assist your optimization effort. Refer to Chapter 2: SDAccel Profiling and Optimization Features for details on the compilation and profiling report. # SDAccel Profiling and Optimization Features The SDAccel™ environment generates various reports on the kernel resource and performance during compilation. It also collects profiling data during application execution in emulation mode and on the FPGA acceleration card. The reports and profiling data provide you with information on performance bottlenecks in the application and optimization techniques that can be used to improve performance. This chapter describes how to generate the reports and collect, display, and read the profiling results in the SDAccel environment. # System Estimate Generating FPGA programming files is the step in the SDAccel development environment with the longest execution time. It is also the step in which the execution time is most affected by the target device and the number of compute units placed on the FPGA fabric. Therefore, it is essential for the application programmer to have a quicker way to understand the performance of the application before running it on the target device, so they can spend more time iterating and optimizing their applications instead of waiting for the FPGA programming file to be generated. The system estimate in the SDAccel development environment takes into account the target hardware device and each compute unit in the application. Although an exact performance metric can only be measured on the target device, the estimation report in the SDAccel environment provides an accurate representation of the expected behavior. #### **GUI Flow** This report is automatically generated during the hardware emulation flow. There is one report generated for each kernel and a top report for the complete binary container. It is easy to access the reports from the Assistant window in the Emulation-HW folder. The following figure shows the Assistant window with a System Estimate report for the binary\_container\_1 and the kernel with the name run. Figure 7: System Estimate Report in the Assistant Window #### **Command Line** The following command generates the system performance estimate report system\_estimate.xtxt for all kernels in kernel.cl: ``` xocc - c - t \ hw\_emu --platform \ xilinx:adm-pcie-7v3:1ddr:3.0 --report estimate kernel.cl ``` The performance estimate report generated by the xocc -report estimate option provides information on every binary container in the application, as well as every compute unit in the design. The report is structured as follows: - Target device information - Summary of every kernel in the application - Detailed information on every binary container in the solution #### **Data Interpretation** The following example report file represents the information generated for the estimate report: ``` OpenCL Binary: xcl_xocc Kernels mapped to: clc_region Timing Information (MHz) Compute Unit Kernel Name Module Name Target Frequency smithwaterman_1 smithwaterman smithwaterman 200 Estimated Frequency 202.020203 Latency Information (clock cycles) Compute Unit Kernel Name Module Name Start Interval smithwaterman_1 smithwaterman smithwaterman 29468 Best Case Avg Case Worst Case 29467 29467 29467 Area Information Compute Unit Kernel Name Module Name FF LUT DSP BRAM smithwaterman_1 smithwaterman smithwaterman 2925 4304 1 10 ``` #### **Design and Target Device Summary** All design estimate reports begin with an application summary and information about the target device. The device information is provided in the following section of the report: ``` Design Name: __xocc_compile_kernel_bin.dir Target Device: xilinx:adm-pcie-ku3:2ddr-xpr:3.3 Target Clock: 200MHz Total number of kernels: 1 ``` For the design summary, the only information that is provided is the design name and the selection of the target device. The other information provided in this section is the target board and the clock frequency. - **Target Board:** The name of the board that runs the application compiled by the SDAccel development environment. - **Clock Frequency:** Defines how fast the logic runs for compute units mapped to the FPGA fabric. Both of these parameters are fixed by the device developer. These parameters cannot be modified from within the SDAccel environment. #### **Kernel Summary** The Kernel Summary section lists all of the kernels defined for the current SDAccel solution. The following example shows the kernel summary: In addition to the kernel name, the summary also provides the execution target and type of the input source. Because there is a difference in compilation and optimization methodology for $\mathsf{OpenCL}^{\mathsf{TM}}$ , C, and $\mathsf{C/C++}$ source files, the type of kernel source file is specified. The Kernel Summary section is the last summary information in the report. From here, detailed information on each compute unit binary container is presented. #### **Timing Information** For each binary container, the detail section begins with the execution target of all compute units. It also provides timing information for every compute unit. As a general rule, if an estimated frequency is higher than that of the device target, the compute unit will be able to run in the device. If the estimated frequency is below the target frequency, the kernel code for the compute unit needs to be further optimized for the compute unit to run correctly on the FPGA fabric. This information is shown in the following example: ``` OpenCL Binary: xcl_xocc Kernels mapped to: clc_region Timing Information (MHz) Compute Unit Kernel Name Module Name Target Frequency smithwaterman_1 smithwaterman smithwaterman 200 Estimated Frequency 202.020203 ``` It is important to understand the difference between the target and estimated frequencies. Compute units are not placed in isolation into the FPGA fabric. Compute units are placed as part of a valid FPGA design that can include other components defined by the device developer to support a class of applications. Because the compute unit custom logic is generated one kernel at a time, an estimated frequency that is higher than the device target indicates to the developer using the SDAccel environment that there should not be any timing problems during the creation of the FPGA programming files. #### **Latency Information** The latency information presents the execution profile of each compute unit in the binary container. When analyzing this data, it is important to keep in mind that all values are measured from the compute unit boundary through the custom logic. In-system latencies associated with data transfers to global memory are not reported as part of these values. Also, the latency numbers reported are only for compute units targeted at the FPGA fabric. Following is an example of the latency report: ``` Latency Information (clock cycles) Compute Unit Kernel Name Module Name Start Interval Best Case smithwaterman_1 smithwaterman smithwaterman 29468 29467 Avg Case Worst Case 29467 29467 ``` The latency report is divided into the following fields: - Start interval - Best case latency - Average case latency - Worst case latency The start interval defines the amount of time that has to pass between invocations of a compute unit for a given kernel. The best, average, and worst case latency numbers refer to how much time it takes the compute unit to generate the results of one ND Range data tile for the kernel. For cases where the kernel does not have data dependent computation loops, the latency values will be the same. Data dependent execution of loops introduces data specific latency variation that is captured by the latency report. The interval or latency numbers will be reported as "undef" for kernels with one or more conditions listed below: - OpenCL kernels that do not have explicit reqd\_work\_group\_size(x,y,z) - Kernels that have loops with variable bounds **Note:** The latency information reflects estimates based on the analysis of the loop transformations and exploited parallelism of the model. These advanced transformations such as pipelining and data flow can heavily change the actual throughput numbers. Therefore, latency can only be used as relative guides between different runs. #### **Area Information** Although the FPGA can be thought of as a blank computational canvas, there are a limited number of fundamental building blocks available in each FPGA. These fundamental blocks (FF, LUT, DSP, block RAM) are used by SDAccel development environment to generate the custom logic for each compute unit in the design. The number of each fundamental resource needed to implement the custom logic in a compute unit determines how many compute units can be simultaneously loaded into the FPGA fabric. The following example shows the area information reported for a compute unit: | Area Information | | | | | | | |------------------|---------------|---------------|------|------|-----|------| | Compute Unit | Kernel Name | Module Name | FF | LUT | DSP | BRAM | | | | | | | | | | smithwaterman_1 | smithwaterman | smithwaterman | 2925 | 4304 | 1 | 10 | # **HLS Report** After compiling a kernel using the SDx<sup>™</sup> development environment GUI or the XOCC command line, the Vivado<sup>®</sup> High-Level Synthesis (HLS) tool HLS report is available. The HLS report includes details about the performance and logic usage of the custom-generated hardware logic from user kernel code. These details provide advanced users many insights into the kernel compilation results to guide kernel optimization. #### **GUI Flow** After compiling a kernel using the SDx environment GUI, you can view the HLS Report in the Assistant window. The report is under the Emulation-HW or System build configuration, and has the <br/>binary container> name, and the <kernel> name. This is illustrated in the following Assistant window: Figure 8: Assistant Window #### **Command Line** The HLS Report is designed to be viewed by the SDAccel environment GUI. However, for command line users, a textual representation of this report is also published. This report can be found inside the report directory situated under the kernel synthesis directory in the Vivado High-Level Synthesis (HLS) tool solution directory. Because the xocc command generates several additional levels of hierarchy above this synthesis directory, it is best to simply locate the file by name: ``` find . -name <module>_csynth.rpt ``` Where <module> is the name of the kernel. **Note:** The find command also supports the lookup using wildcards such that the following command will lookup all synthesis reports in any subdirectory: ``` find . -name "*_csynth.rpt" ``` #### **Data Interpretation** The left pane of the HLS Report shows the module hierarchy. Each module generated as part of the high level synthesis run is represented in this hierarchy. You can select any of these modules to present the synthesis details of the module in the right side of the Synthesis Report window. Figure 9: HLS Report Window The Synthesis Report is separated into several sections, namely: - General Information - Performance Estimates (timing and latency) - Utilization Estimates - Interface Information If this information is part of a hierarchical block, it will sum up the information of the blocks contained in the hierarchy. Due to this fact, the hierarchy can also be navigated from within the report, when it is clear which instance contributes what to the overall design. **CAUTION!** Regarding the absolute counts of cycles and latency, these numbers are based on estimates identified during synthesis, especially with advanced transformations, such as pipelining and dataflow; these numbers might not accurately reflect the final results. If you encounter question marks in the report, this might be due to variable bound loops, and you are encouraged to set trip counts for such loops to have some relative estimates presented in this report. ## **Profile Summary Report** The SDAccel runtime automatically collects profiling data on host applications. After the application finishes execution, the profile summary is saved in HTML, .csv, and Google Protocol Buffer formats in the solution report directory or working directory. These reports can be reviewed in a web browser, spreadsheet viewer, or the integrated Profile Summary Viewer in the SDAccel environment. The profile reports are generated in both SDAccel GUI and XOCC command line flows. #### **GUI Flow** When you compile and execute an application from SDAccel environment, the profile summary is automatically generated. To control the generation of profile information, simply edit the run configuration through the context menu of the build configuration, and select $Run \rightarrow Run$ Configurations. After the configuration is run, the Assistant window enables easy access to the report from below the Run Configuration item. After the run configuration has executed, modifying the configuration can now be initiated directly through the context menu of the run configuration item in the Assistant window. Figure 10: Profile Summary access in SDAccel GUI Flow Double-click the report to open it. #### **Command Line** Command Line users execute standalone applications outside the SDAccel environment. To generate the profile summary data, you can compile your design without any additional options. However, linking the bitstream file (xclbin) requires the --profile\_kernel option. The argument provided through the --profile\_kernel option can be used to limit data collection, which might be required in large systems. The general syntax for the profile\_kernel option with respect to the profile summary report is: ``` --profile_kernel <[data]:<[kernel_name|all]:[compute_unit_name|all]: [interface_name|all]:[counters|all]> ``` Three fields, kernel\_name, compute\_unit\_name, and interface\_name can be specified to determine the interface which the performance monitor is applied to. However, you can also specify the keyword all to apply the monitoring to all existing kernels, compute units, and interfaces with a single option. The last option, <counters|all> allows you to restrict the information gathering to just counters for large designs, while all (default) will include the collection of actual trace information. **Note:** The profile\_kernel option is additive and can be used multiple times on the link line. Executing the program creates an sdaccel\_profile\_summary.csv file, if profile = true is specified in the sdaccel.ini file. ``` [Debug] profile = true ``` The .osv file needs to be manually converted to Google Protocol Buffer format (.xprf) before the profiling result can be viewed in the integrated Profile Summary Viewer. The following command line example generates an .xprf file from the .osv input file: ``` sdx_analyze profile sdaccel_profile_summary.csv ``` #### Display the Profile Summary Use the following methods to display the SDAccel environment Profile Summary view created from the command line. #### **Web Browser** Before the HTML profile summary can be displayed in a web browser the following command needs to be executed to create an HTML file representing the data. ``` sdx_analyze profile -i sdaccel_profile_summary.csv -f html ``` This creates an HTML file that can be opened by the web browser of your choice. The file contains the same profiling result as presented in GUI Flow. #### **Profile Summary View** Use the integrated Profile Summary view to display the profile summary generated by the command line flow. Follow these steps to open the profile summary in the Profile Summary view: 1. Convert the .csv data file into the protobuf format. ``` sdx_analyze profile -i sdaccel_profile_summary.csv -f protobuf ``` 2. Start SDAccel tool GUI by running the sdx command: \$sdx - 3. Choose the default workspace when prompted. - 4. Select File → Open File. - 5. Browse to and then open the .xprf file created by the sdx\_analyze command run in step 1. The following graphic shows the Profile Summary view that displays OpenCL API calls, kernel executions, data transfers, and profile rule checks (PRCs). #### **Profile Summary Window** #### **Data Interpretation** The profile summary includes a number of useful statistics for your OpenCL application. This can provide you with a general idea of the functional bottlenecks in your application. The profile summary consists of the following sections: Top Operations - Top Data Transfer: Kernels and Global Memory: This table displays the profile data for top data transfers between FPGA and device memory. - **Device**: Name of device - Compute Unit: Name of compute unit - Number of Transfers: Sum of write and read AXI transactions monitored on device - Average Bytes per Transfer: (Total Read Bytes + Total Write Bytes) / (Total Read AXI Transactions + Total Write AXI Transactions) - Transfer Efficiency (%): (Average Bytes per Transfer) / min(4K, (Memory Bit Width/8 \* 256)) AXI4 specification limits the max burst length to 256 and max burst size to 4K bytes. - Total Data Transfer (MB): (Total Read Bytes + Total Write Bytes) / 1.0e6 - Total Write (MB): (Total Write Bytes) / 1.0e6 - Total Read (MB): (Total Read Bytes) / 1.0e6 - Transfer Rate (MB/s): (Total Data Transfer) / (Compute Unit Total Time) - Top Kernel Execution - **Kernel Instance Address**: Host address of kernel instance (in hex) - **Kernel**: Name of kernel - Context ID: Context ID on host - Command Queue ID: Command queue ID on host - Device: Name of device where kernel was executed (format: <device>-<ID>) - Start Time (ms): Start time of execution (in ms) - **Duration (ms)**: Duration of execution (in ms) - Global Work Size: NDRange of kernel - Local Work Size: Work group size of kernel - Top Memory Writes: Host and Device Global Memory - Buffer Address: Host address of buffer (in hex) - Context ID: Context ID on host - Command Queue ID: Command queue ID on host - Start Time (ms): Start time of write transfer (in ms) - **Duration (ms)**: Duration of write transfer (in ms) - **Buffer Size (KB)**: Size of write transfer (in KB) - Writing Rate (MB/s): Writing Rate = (Buffer Size) / (Duration) - Top Memory Reads: Host and Device Global Memory - **Buffer Address**: Host address of buffer (in hex) - Context ID: Context ID on host - Command Queue ID: Command queue ID on host - Start Time (ms): Start time of read transfer (in ms) - **Duration (ms)**: Duration of read transfer (in ms) - Buffer Size (KB): Size of read transfer (in KB) - Reading Rate (MB/s): Reading Rate = (Buffer Size) / (Duration) - Kernels & Compute Units - **Kernel Execution (includes estimated device times)**: This table displays the profile data summary for all kernel functions scheduled and executed. - Kernel: Name of kernel - Number of Enqueues: Number of times kernel is enqueued - Total Time (ms) Sum of runtimes of all enqueues (measured from START to END in OpenCL execution model) - Minimum Time (ms) Minimum runtime of all enqueues - Average Time (ms) (Total Time) / (Number of Enqueues) - Maximum Time (ms) Maximum runtime of all enqueues - Compute Unit Utilization (includes estimated device times): This table displays the summary profile data for all compute units on the FPGA. - Device: Name of device (format: <device>-<ID>) - Compute Unit: Name of Compute Unit - **Kernel**: Kernel this Compute Unit is associated with - Global Work Size: NDRange of kernel (format is x:y:z) - Local Work Size: Local work group size (format is x:y:z) - Number of Calls: Number of times the Compute Unit is called - Total Time (ms): Sum of runtimes of all calls - Minimum Time (ms): Minimum runtime of all calls - Average Time (ms): (Total Time) / (Number of Work Groups) - Maximum Time (ms): Maximum runtime of all calls - Clock Frequency (MHz): Clock frequency used for a given accelerator (in MHz) - Data Transfers - Data Transfer: Host and Global Memory: This table displays the profile data for all read and write transfers between the host and device memory via PCI Express<sup>®</sup> link. - Context:Number of Devices: Context ID and number of devices in context - Transfer Type: READ or WRITE - **Number of Transfers**: Number of host data transfers **Note:** May contain printf transfers - Transfer Rate (MB/s) (Total Bytes Sent) / (Total Time in usec) where Total Time includes software overhead - Average Bandwidth Utilization (%): (Transfer Rate) / (Max. Transfer Rate) where Max. Transfer Rate = (256/8 bytes) \* (300 MHz) = 9.6 GBps - Average Size (KB): (Total KB sent) / (number of transfers) - Total Time (ms): Sum of transfer times - Average Time (ms): (Total Time) / (number of transfers) - Data Transfer: Kernels and Global Memory: This table displays the profile data for all read and write transfers between the FPGA and device memory. - **Device**: Name of device - Compute Unit/Port Name: <Name of Compute Unit>/<Name of Port> - Kernel Arguments: List of arguments connected to this port - DDR Bank: DDR bank number this port is connected to - Transfer Type: READ or WRITE - Number of Transfers: Number of AXI transactions monitored on device **Note:** Might contain printf transfers) - Transfer Rate (MB/s): (Total Bytes Sent) / (Compute Unit Total Time) - Compute Unit Total Time = Total execution time of compute unit - Total Bytes Sent = sum of bytes across all transactions - Average Bandwidth Utilization (%): (Transfer Rate) / (0.6 \*Max. Transfer Rate) where Max. Transfer Rate = (512/8 bytes) \* (300 MHz) = 19200 MBps - Average Size (KB): (Total KB sent) / (number of AXI transactions) - Average Latency (ns): (Total latency of all transaction) / (number of AXI transactions) - OpenCL API Calls: This table displays the profile data for all OpenCL host API function calls executed in the host application. - API Name: Name of API function (e.g., clCreateProgramWithBinary, clEnqueueNDRangeKernel) - Number of Calls: Number of calls to this API - Total Time (ms): Sum of runtimes of all calls - Minimum Time (ms): Minimum runtime of all calls - Average Time (ms): (Total Time) / (Number of Calls) - Maximum Time (ms): Maximum runtime of all calls # **Application Timeline** Application Timeline collects and displays host and device events on a common timeline to help you understand and visualize the overall health and performance of your systems. These events include: - OpenCL API calls from the host code. - Device trace data including AXI transaction start/stop, kernel start/stop, etc. While useful for debugging and profiling the application, timeline and device trace data are not collected by default because the runtime needs to periodically unload the trace data from the FPGA, which can add additional time to the overall application execution. However, the device data is collected with dedicated hardware inside the FPGA, so the data collection does not affect kernel functionality on the FPGA. The following sections describe setups required to enable time and device data collection. Turning on device profiling is intrusive and can negatively affect overall performance. This feature should be used for system performance debugging only. **Note:** Device profiling can be used in Emulation-HW without negative impact. #### **GUI Flow** Timeline and device trace data collection is part of run configuration for an SDAccel™ project created from the integrated SDAccel environment. Follow the steps below to enable it: 1. Instrumenting the code is required for System execution. This is done through the Hardware Function Settings dialog box. In the Assistant window, right-click the kernel under the System [Hardware] configuration, and select the Settings Command. With respect to application timeline functionality, you can enable Data Transfer, Execute Profile, and Stall Profiling. These options are instrumenting all ports of each instance of any kernel. As these options insert additional hardware, instrumenting all ports might be too much. Towards that end, more control is available through command line options as detailed in the Command Line section. These options are only valid for system runs. During hardware emulation, this data is generated by default. - Data Transfer: This option enables monitoring of data ports. - **Execute Profiling:** This option provides minimum port data collection during system run. This option records the execution times of the compute units. Execute profiling is enabled by default for data and stall profiling. - Stall Profiling: This option includes the stall monitoring logic in the bitstream. - 2. Specify what information is actually going to be reported during a run. **Note:** Only information actually exposed from the hardware during system execution is reported. To configure reporting, click the down arrow next to the Debug or Run button, and then select **Run Configurations** to open the Run Configurations window. 3. In the Run Configurations window, click the Profile tab. Ensure the Enable profiling check box is selected. This enables basic profiling support. With respect to trace data, ensure that Generate timeline trace report actually gathers the information in the build config you are running. Default implies that no trace data capturing is supported in system execution, but enabled by default in hardware emulation. Additionally, you can select the amount of information to gather during runtime. Select the granularity for trace data collection independently for Data Transfer Trace and Stall Trace. The Data Transfer Trace options are as follows: - **Coarse:** Show compute unit transfer activity from beginning of first transfer to end of last transfer (before compute unit transfer ends). - Fine: Show all AXI-level burst data transfers. - **None:** Turn off reading and reporting of device-level trace during runtime. The Stall Trace Options are as follows: - None: Turn off any stall trace information gathering. - All: Record all stall trace information. - External Memory Stall: Memory stalls to DDR (for example, AXI4 read from DDR). - Internal Dataflow Stall: Intra-kernel streams (for example, writing to a full FIFO between data flow blocks). - Inter CU Pipe Stall: Inter-kernel pipe (for example, writing to a full OpenCL™ pipe between kernels). If you have multiple run configurations for the same project, you must change the profile settings for each run configuration. 4. After running configurations, in the Assistant window, double-click **Application Timeline** to open the Application Timeline window. #### **Command Line** Complete the following steps to enable timeline and device trace data collection in the Command Line flow: 1. This step is responsible for the FPGA bitstream instrumentation with SDx Accel Monitors (SAM) and SDx Performance Monitors (SPMs). The instrumentation is performed through the --profile\_kernel, which has three distinct instrumentation options (data, stall, exec). **Note:** The --profile\_kernel option is ignored except for system compilation and linking. During hardware emulation, this data is generated by default. The --profile\_kernel option has three fields that are required to determine the specific kernel interface to which the monitors are applied. However, if resource usage is not an issue, the keyword all enables you to apply the monitoring to all existing kernels, compute units, and interfaces with a single option. Otherwise, you can specify the kernel\_name, compute\_unit\_name, and interface\_name explicitly to limit instrumentation. The last option, <counters |all> allows you to restrict the information gathering to just counters for large designs, while all (default) includes the collection of actual trace information. Note: The --profile\_kernel option is additive and can be used multiple times on the link line. • data: This option enables monitoring of data ports through SAM and SPM IPs. This option needs to be set only during linking. ``` -1 --profile_kernel <[data]:<[kernel_name|all]:[compute_unit_name|all]: [interface_name|all]:[counters|all]> ``` • stall: This option needs to be applied during compilation: ``` -c --profile_kernel <[stall]:<[kernel_name|all]:[compute_unit_name| all]:[counters|all]> ``` #### and during linking: ``` -1 --profile_kernel <[stall]:<[kernel_name|all]:[compute_unit_name|all]:[counters|all]> ``` This option includes the stall monitoring logic (using SAM IP) in the bitstream. However, it does require that stall ports are present on the kernel interface. To facilitate this, the option is required for compilation of the C/C++/OpenCL kernel modules. exec: This option provides minimum port data collection during system run. It simply records the execution times of the kernel through the use of SAM IP. This feature is by default enabled on any port that uses the data or stall data collection. This option needs to be provided only during linking. ``` -1 --profile_kernel <[exec]:<[kernel_name|all]:[compute_unit_name| all]>:[counters|all] ``` 2. After the kernels are instrumented, data gathering must be enabled during runtime execution. Do this through the use of the sdaccel.ini file that is in the same directory as the host executable. The following sdaccel.ini file will enable maximum information gathering during runtime: ``` [Debug] profile=true timeline_trace=true data_transfer_trace=coarse stall_trace=all ``` - profile=<true | false>: When this option is specified as true, basic profile monitoring is enabled. Without any additional options, this implies that the host runtime logging profile summary is enabled. However, without this option enabled, no monitoring is performed at all. - timeline\_trace=<true|false>: This option will enable timeline trace information gathering of the data. Without adding profile IP into the FPGA (data), it will only show host information. At a minimum, to get more compute unit start and end execution times in the timeline trace, the compute unit needs to be linked with --profile\_kernel exec. - data\_transfer\_trace=<coarse|fine|off>: This option enables device-level AXI data transfers trace: - coarse: Show compute unit transfer activity from beginning of first transfer to end of last transfer (before compute unit transfer ends). - fine: Show all AXI-level burst data transfers. - off: Turn off reading and reporting of device-level trace during runtime. - stall\_trace=<dataflow|memory|pipe|all|off>: Specify what types of stalls to capture and report in timeline trace. The default is off. - off: Turn off any stall trace information gathering. **Note:** Enabling stall tracing can often fill the trace buffer, which results in incomplete and potentially corrupt timeline traces. This can be avoided by setting trace\_stall=off. - all: Record all stall trace information. - dataflow: Intra-kernel streams (for example, writing to full FIFO between dataflow blocks). - memory: External memory stalls (for example, AXI4 read from the DDR. - pipe: Inter-kernel pipe (for example, writing to full OpenCL pipe between kernels). - 3. In command line mode, CSV files are generated to capture the trace data. These CSV reports need to be converted to the Application Timeline format using the sdx\_analyze utility before they can be opened and displayed in the SDAccel environment GUI. $\verb|sdx_analyze| trace | \verb|sdaccel_timeline_trace.csv|$ This creates the sdaccel\_timeline\_trace.wdb file by default, which can be opened from the GUI. - 4. To view the timeline report host and device waveforms, do the following: - a. Start the SDx environment by running the command: \$sdx - b. Choose a workspace when prompted. - c. Select **File** $\rightarrow$ **Open File**, browse to the .wdb file generated during hardware emulation or system run, and open it. #### **Data Interpretation** The following figure shows the Application Timeline window that displays host and device events on a common timeline. This information helps you to understand details of application execution and identify potential areas for improvements. Figure 11: Application Timeline Window Application timeline trace has two main sections, Host and Device. The host section shows the trace of all the activity originating from the host side. The device section shows the activity of the compute-units on the FPGA. Under the host different activities are categorized as OpenCL™ API calls, Data Transfer, and the Kernels. The complete tree has the following structure: #### Host - OpenCL API Calls: All OpenCL API calls are traced here. The activity time is measured from the host perspective. - **General**: All general OpenCL API calls such as clCreateProgramWithBinary(), clCreateContext(), clCreateCommandQueue etc are traced here - Queue: OpenCL API calls that are associated with a specific command queue are traced here. This includes commands such as clEnqueueMigrateMemObjects, clEnqueueNDRangeKernel etc. If the user application creates multiple command queues, then this section show as many queues and activities under it. - Data Transfer: In this section the DMA transfers from the host to the device memory are traced. There are multiple DMA threads implemented in the OpenCL runtime and there is typically an equal number of DMA channels. The DMA transfer is initiated by the user application by calling OpenCL APIs such as clEnqueueMigrateMemObjects. These DMA requests are forwarded to the runtime which delegates to one of the threads. The data transfer from the host to the device appear under Write, and the transfers from device to host appear under Read. - Kernel Enqueues: The active kernel executions are shown here. The kernels here should not be confused with your kernels/compute-unit on the device. By kernels here we mean the NDRangeKernels and the Tasks created by APIs clEnqueueNDRangeKernels() and clEnqueueTask() and these are plotted against the time measured from the host's perspective. Multiple kernels can be scheduled to be executed at the same time and they are traced from the point they are scheduled to run until the end of kernel execution. This is the reason for multiple entries. The number of rows depend on the number of overlapping kernel executions. **Note:** Overlapping of the kernels should not be mistaken for actual real parallel execution on the device as the process might not be ready to actually execute right away. #### • Device "name" - Binary Container "name" - **Accelerator "name"**: This is the name of the compute unit (aka. Accelerator) on the FPGA. User Functions: In the case of the Vivado High-Level Synthesis (HLS) tool kernels, functions that are implemented as data flow processes are traced here. The trace for these functions show the number of active instances of these functions that are currently executing in parallel. These names are generated in hw emulation when waveform is enabled. Note: Function level activity is only possible in Hardware Emulation. - Function: "name a" - Function: "name b" - **Read**: A compute unit reads from the DDR over AXI-MM ports. The trace of data a read by a compute unit is shown here. The activity is shown as transaction and the tool-tip for each transaction shows more details of the AXI transaction. These names are generated when --profile\_kernel data is used. - m\_axi\_<bundle name>(port) - **Write**: A compute unit writes to the DDR over AXI-MM ports. The trace of data written by a compute unit is shown here. The activity is shown as transactions and the tool-tip for each transaction shows more details of the AXI transaction. This is generated when --profile\_kernel data is used. - m\_axi\_<bundle name>(port) ### **Waveform Viewer** The SDx development environment can generate a waveform view and launch a live waveform viewer when running hardware emulation. It displays in-depth details on the emulation results at system level, compute unit level, and at function level. The details include data transfers between the kernel and global memory, data flow via inter-kernel pipes as well as data flow via intrakernel pipes. They provide many insights into the performance bottleneck from the system level down to individual function call to help developers optimize their applications. By default, the waveform and live waveform viewers are not enabled. This is because the viewers require that the runtime generates a simulation waveform during hardware emulation, which consumes more time and disk space. The following sections describe the setup required to enable data collection. **Note:** The waveform view allows you to look directly at the device transactions from within the SDx development environment. In contrast, the live waveform capability actually spawns the simulation waveform viewer that visualizes the hardware transactions in addition to potentially user selected internal signals. ### **GUI Flow** Follow the steps below to enable waveform data collection and to open the viewer: 1. Open the Application Project Settings window, and select the Kernel debug check box. 2. Click the down arrow next to the Run button, and select **Run Configurations** to open the Run Configurations window. 3. On the Run Configurations window, click the Main tab, and select the **Use waveform for kernel debugging** check box. Optionally, you can select **Launch live waveform** to bring up the Simulation window to view the Live Waveform while the hardware emulation is running. 4. In the Run Configurations window, click the **Profile** tab, and ensure the **Enable profiling** check box is selected. This enables basic profiling support. If you have multiple run configurations for the same project, you must change the profile settings for each run configuration. 5. If you have not selected the Live Waveform viewer to be launched automatically, open the Waveform view from the SDx Development Environment. In the SDx Development Environment, double-click **Waveform** in the Assistant window to open the Waveform view window. ### **Command Line** Follow these instructions to enable waveform data collection from the Command Line during hardware emulation and open the viewer: 1. Turn on debug code generation during kernel compilation. ``` xocc -g -t hw_emu ... ``` 2. Create an sdaccel.ini file in the same directory as the host executable with the contents below: ``` [Debug] profile=true timeline_trace=true ``` This enables maximum observability. The options in detail are: - **profile=<true|false>:** Setting this option to true, enables profile monitoring. Without any additional options, this implies that the host runtime logging profile summary is enabled. However, without this option enabled, no monitoring is performed at all. - **timeline\_trace=<true|false>:** This option enables timeline trace information gathering of the data. - 4. To see the live waveform and additional simulation waveforms, add the following to the emulation section in the sdaccel.ini: ``` [Emulation] launch_waveform=gui ``` A Live Waveform viewer is spawned during the execution of the hardware emulation, which allows you to examine the waveforms in detail. - 5. If no Live Waveform viewer was requested, follow the steps below to open the Waveform view: - a. Start the SDx IDE by running the following command: \$sdx. - b. Choose a workspace when prompted. - c. Select File $\rightarrow$ Open File, browse to the .wdb file generated during hardware emulation. ### **Data Interpretation Waveform View** The following image shows the Waveform view: ♦ Q Q ½ ♥ K N ₾ ± + N % % % = ■ Waveform ☎ Hardware Emulation (Simulation Tim Report name: Waveform Project name: example Build configuration: Emulation-HW Created: 01 Jun 2018 09:50 Launch configuration: binary\_container\_1 | 0.02 ms | 0.04 ms | 0.06 ms | 0.08 ms | 0.10 ms | 0.12 ms → Binary Container "binary\_container\_1" ▼ Memory Data Transfers ♥ DDR Bank 0 P Read Address P Read Data ▶ Write Address Mrite Data DDR Bank 1 DDR Bank 2 DDR Bank 3 ∇ Kernel "run" 1:1:1 ▽ Compute Unit: run\_1 D CU Stalls (%) ▽ Data Transfers ¬ m\_axi\_gmem Read Address P Read Data ▶ Write Address Mrite Data ▷ m\_axi\_gmem1 ∇ User Functions ▼ Function: dat...w in loop U( ▶ Function Stalls Intra-Kernel Dataflow ▶ Function I/O Function: dat.../processB\_U( Function: dat...rocessA16\_U Figure 12: Waveform View The waveform view is organized hierarchically for easy navigation. **Note:** This viewer is based on the actual waveforms generated during hardware emulation (Kernel Trace). This allows this viewer to descend all the way down to the individual signals responsible for the abstracted data. However, as it is post processing the data, no additional signals can be added, and some of the runtime analysis such as DATAFLOW transactions cannot be visualized. The hierarchy tree and descriptions are: - Device "name": Target device name - Binary Container "name": Binary container name. - **Memory Data Transfers:** For each DDR Bank, this shows the trace of all the read and write request transactions arriving at the bank from the host. - **Kernel "name" 1:1:1:** For each kernel and for each compute unit of that kernel, this section breaks down the activities originating from the compute unit. - Compute Unit: "name": Compute unit name. - CU Stalls (%): Stall signals are provided by the HLS tool to inform you when a portion of their circuit is stalling because of external memory accesses, internal streams (i.e., dataflow), or external streams (i.e., OpenCL pipes). The stall bus, shown in detailed kernel trace, compiles all of the lowest level stall signals and reports the percentage that are stalling at any point in time. This provides a factor of how much of the kernel is stalling at any point in the simulation. For example: If there are 100 lowest level stall signals, and 10 are active on a given clock cycle, then the CU Stall percentage is 10%. If one goes inactive, then it would be 9%. - Data Transfers: This shows the read/write data transfer accesses originating from each Master AXI port of the compute unit to the DDR. - **User Functions:** This information is available for the HLS tool kernels and shows the user functions. Function: <name> - Function Stalls: Shows the different type stalls experienced by the process. It contains External Memory and Internal-Kernel Pipe stalls. The number of rows is dynamically incremented to accommodate the visualization of any concurrent execution. - Intra-Kernel Dataflow: FIFO activity internal to the kernel. - Function I/O: Actual interface signals. ### **Data Interpretation Live Waveform** The following figure shows the live waveform viewer while running hardware emulation. The live waveform viewer is organized hierarchically for easy navigation. Below are the hierarchy tree and descriptions. **Note:** As the live waveform viewer is presented only as part of the actual hardware simulation run (xsim), you can annotate extra signals and internals of the register transfer (RTL) to the same view. Also, all grouped and combined groups can be expanded all the way to the actual contributing signals. - **Device "name"**: Target device name. - Binary Container "name": Binary container name. - **Memory Data Transfers**: For each DDR Bank this shows the trace of all the read and write request transactions arriving at the bank from the host. - **Kernel "name" 1:1:1**: For each kernel and for each compute unit of that kernel this section breaks down the activities originating from the compute unit. - Compute Unit: "name": Compute unit name. - **CU Stalls (%)**: Stall signals are provided by the Vivado High-Level Synthesis (HLS) tool to inform you when a portion of the circuit is stalling because of external memory accesses, internal streams (i.e., dataflow), or external streams (i.e., OpenCL™ pipes). The stall bus shown in detailed kernel trace compiles all of the lowest level stall signals and reports the percentage that are stalling at any point in time. This provides a factor of how much of the kernel is stalling at any point in the simulation. - For example: If there are 100 lowest level stall signals, and 10 are active on a given clock cycle, then the CU Stall percentage is 10%. If one goes inactive, then it would be 9%. - **Data Transfers**: This shows the read/write data transfer accesses originating from each Master AXI port of the compute unit to the DDR. - User Functions: This information is available for the HLS kernels and shows the user functions. - Function: "name" - Dataflow/Pipeline Activity This shows the number of parallel executions of the function if the function is implemented as a dataflow process - **Active Iterations**: This shows the currently active iterations of the dataflow. The number of rows is dynamically incremented to accommodate the visualization of any concurrent execution. - **StallNoContinue**: This is a stall signal that tells if there were any output stalls experienced by the dataflow processes (function is done, but it has not received a continue from the adjacent dataflow process). - **RTL Signals**: These are the underlying RTL control signals that were used to interpret the above transaction view of the dataflow process. - **Function Stalls**: Shows the different types of stalls experienced by the process. - **External Memory**: Stalls experienced while accessing the DDR memory. - **Internal-Kernel Pipe**: If the compute units communicated between each other through pipes, then this will show the related stalls. - Intra-Kernel Dataflow: FIFO activity internal to the kernel. - **Function I/O**: Actual interface signals. - Function: "name" - Function: "name" ### **Guidance** The Guidance view is designed to provide feedback to users throughout the development process. It presents in a single location all issues encountered from building the actual design all the way through runtime analysis. It is crucial to understand that the Guidance view is intended to help you to identify potential issues in the design. These issues might be source code related or due to missed tool optimizations. Also, the rules are generic rules based on experiences on a vast set of reference designs. Nevertheless, these rules might not be applicable for a specific design. Therefore, it is up to you to understand the specific guidance rules, and take appropriate action based on your specific algorithm and requirements. ### **GUI Flow** The Guidance view is automatically populated and displayed in the lower central tab view. After running hardware emulation, the Guidance view might look like the following: Figure 13: Guidance View **Note:** You can produce the Guidance view through the Vivado High-Level Synthesis (HLS) tool post compilation as well, but you will not get Profile Rule Checks. To simplify visualizing the guidance information, the GUI flow allows you to search, and filter the Guidance view to locate specific guidance rule entries. It is also possible to collapse or expand the tree view or even suppress the hierarchical tree representation and visualize a condensed representation of the guidance rules. Finally, it is possible to select what is shown in the Guidance view. You can enable or disable the visualization of warnings, as well as met rules, and restrict the specific content based on the source of the messages such as build and emulation. By default, the Guidance view shows all guidance information for the project selected in the drop down. To restrict the content to an individual build or run step, do the following: - 1. Use the command Window → Preferences - 2. Select the category Xilinx Sdx → Guidance. - 3. Deselect Group guidance rule checks by project. ### **Command Line** The Guidance data is best analyzed through the GUI, which consolidates all guidance information for the flow. Nevertheless, the tool automatically generates HTML files containing the guidance information. As guidance information is generated throughout the tool flow, several guidance files are generated. The simplest way to locate the guidance reports is to search for the guidance.html files. ``` find . -name "*guidance.html" -print ``` This command lists all guidance files generated, which can be opened with any web-browser. ### **Data Interpretation** The Guidance view places each entry in a separate row. Each row might contain the name of the guidance rule, threshold value, actual value, and a brief but specific description of the rule. The last field provides a link to reference material intended to assist in understanding and resolving any of the rule violations. In the GUI Guidance view, guidance rules are grouped by categories and unique IDs in the Name column and annotated with symbols representing the severity. These are listed individually in the HTML report. In addition, as the HTML report does not show tooltips, a full Name column is included in the HTML report as well. The following list describes all fields and their purpose as included in the HTML guidance reports. • **Id:** Each guidance rule is assigned a unique id. Use this id to uniquely identify a specific message from the guidance report. - Name: The Name column displays a mnemonic name uniquely identifying the guidance rule. These names are designed to assist in memorizing specific guidance rules in the view. - **Severity:** The Severity column allows the easy identification of the importance of a guidance rule. - **Full Name:** The Full Name provides a less cryptic name compared to the mnemonic name in the Name column. - Categories: Most messages are grouped within different categories. This allows the GUI to display groups of messages within logical categories under common tree nodes in the Guidance view. - Threshold: The Threshold column displays an expected threshold value, which determines whether or not a rule is met. The threshold values are determined from many applications that follow good design and coding practices. - **Actual:** The Actual column displays the values actually encountered on the specific design. This value is compared against the expected value to see if the rule is met. - Details: The Details column provides a brief but specific message describing the specifics of the current rule. - **Resolution:** The Resolution column provides a pointer to common ways the model source code or tool transformations can be modified to meet the current rule. Clicking the link brings up a pop-up window or the documentation with tips and code snippets that you can apply to the specific issue. ## **Using Implementation Tools** ## **Exploring Kernel Optimizations Using Vivado HLS** All kernel optimizations using OpenCL or C/C++ can be performed from within the SDAccel environment. The primary performance optimizations, such as those discussed in this chapter (pipelining function and loops, applying dataflow to enable greater concurrency between functions and loops, unrolling loops, etc.), are performed by the Xilinx® FPGA design tool, Vivado® High-Level Synthesis (HLS) tool. The SDAccel environment automatically calls the HLS tool. However, to use the GUI analysis capabilities, you must launch the HLS tool directly from within the SDAccel environment. Using the HLS tool in standalone mode enables the following enhancements to the optimization methodology: Focusing solely on the kernel optimization, there is no requirement to execute emulation. - The ability to create multiple solutions, compare their results, and explore the solution space to find the most optimum design. - The ability to use the interactive Analysis Perspective to analyze the design performance. **IMPORTANT!** Only the kernel source code is incorporated back into the SDAccel environment. After exploring the optimization space, ensure that all optimizations are applied to the kernel source code as OpenCL attributes or C/C++ pragmas. To open the HLS tool in standalone mode, from the Assistant window, right-click the hardware function object, and select **Open HLS Project**, as shown in the following figure. Figure 14: Open HLS Project # Controlling FPGA Implementation with the Vivado Design Suite SDx development environment provides a smooth flow from an OpenCL/C/C++ model all the way to an FPGA accelerated implementation. In most cases, this flow completely abstracts away the underlying fact that the programmable region in the FPGA is configured to implement the kernel functionality. This fully isolates the developer from typical hardware constraints such as routing delays and kernel placement. However, in some cases these concerns will have to be looked at especially when large designs are to be implemented. Towards this end, SDx development environment allows you to fully control the Vivado Design Suite backend tool. The SDAccel environment calls the Vivado Design Suite to automatically run RTL synthesis and implementation. You also have the option of launching the design suite directly from within the SDAccel environment. When invoking the Vivado Integrated Design Environment (IDE) in standalone mode in the SDAccel environment, you can open the Vivado synthesis project or the Vivado implementation project to edit, manage, and control the project. The Vivado project can be opened in the SDAccel environment after the build targeting the System configuration has completed. To open Vivado IDE in standalone mode, from the Xilinx drop-down menu, select **Vivado Integration** and **Open Vivado Project**. Choose between the Vivado synthesis and implementation projects, and click **OK**. Using the Vivado IDE in standalone mode enables the exploration of various synthesis and implementation options for further optimizing the kernel for performance and area. Familiarity with the design suite is recommended to make the most use of these features. **IMPORTANT!** The optimization switches applied in the standalone project are not automatically incorporated back into the SDAccel environment. After exploring the optimization space, ensure that all optimization parameters are passed to the SDAccel environment using the --xp option for xocc. For example: $--xp \ "vivado\_prop:run.imp1\_1. \{STEPS.PLACE\_DESIGN.ARGS.TCL.POST\} = \{<File \ and \ path>\} "$ This optimization flow is supported in the command line flow by calling xocc - interactive to bring up the Vivado IDE, on the current project. In the IDE, generate a DCP, which can be saved and reused during linking with xocc. The specific options are: - --interactive allows the Vivado IDE to be launched from within the xocc environment, with the right project loaded. - --reuse\_synth allows a pre-synthesized Vivado Design Suite tool design checkpoint (.dep) file to be brought in and used directly in SDx environment flow to complete implementation and xclbin generation. - --reuse\_impl allows a pre-implemented and timing closed Vivado tool design checkpoint (.dcp) file to be brought in and used directly in SDx environment flow for xclbin generation. # **Kernel Optimization** One of the key advantages of an FPGA is its flexibility and capacity to create customized designs specifically for your algorithm. This enables various implementation choices to trade off algorithm throughput vs. power consumption. The downside of creating custom logic is that the design needs to go through the traditional FPGA design flow. The following guidelines help manage the design complexity and achieve the design goals. # Interface Attributes (Detailed Kernel Trace) The detailed kernel trace provides easy access to the AXI transactions and their properties. The AXI transactions are presented for the global memory, as well as the Kernel side (Kernel "pass" 1:1:1) of the AXI interconnect. The following figure illustrates a typical kernel trace of a newly accelerated algorithm. Figure 15: Accelerated Algorithm Kernel Trace Most interesting with respect to performance are the fields: - Burst Length: Describes how many packages are sent within one transaction - Burst Size: Describes the number of bytes being transferred as part of one package Given a burst length of 1 and just 4 Bytes per package, it will require many individual AXI transactions to transfer any reasonable amount of data. **Note:** The SDSoC<sup>™</sup> environment never creates burst sizes less than 4 bytes, even if smaller data is transmitted. In this case, if consecutive items are accessed without AXI bursts enabled, it is possible to observe multiple AXI reads to the same address. Small burst lengths, as well as burst sizes, considerably less than 512-bits are therefore good opportunities to optimize interface performance. The following sections show improved implementations: - Using Burst Data Transfers - Using Full AXI Data Width ### **Using Burst Data Transfers** Transferring data in bursts hides the memory access latency and improves bandwidth usage and efficiency of the memory controller. **RECOMMENDED:** Infer burst transfers from successive requests of data from consecutive address locations. Refer to "Inferring Burst Transfer from/to Global Memory" in SDAccel Environment Programmers Guide (UG1277) for more details. If burst data transfers occur, the detailed kernel trace will reflect the higher burst rate as a larger burst length number: Figure 16: Burst Data Transfer with Detailed Kernel Trace In the previous figure, it is also possible to observe that the memory data transfers following the AXI interconnect are actually implemented rather differently (shorter transaction time). Hover over these transactions, you would see that the AXI interconnect has packed the 16x4 Byte transaction into a single package transaction of 1x64 Bytes. This effectively uses the AXI4 bandwidth which is even more favorable. The next section focuses on this optimization technique in more detail. Burst inference is heavily dependent on coding style and access pattern. To avoid potential modeling pitfalls, refer to the *SDAccel Environment Programmers Guide* (UG1277). However, you can ease burst detection and improve performance by isolating data transfer and computation, as shown in the following code snippet: ``` void kernel(T in[1024], T out[1024]) { T tmpIn[1024]; T tmpOu[1024]; read(in, tmpIn); process(tmpIn, tmpOut); write(tmpOut, out); } ``` In short, the function read is responsible for reading from the AXI input to an internal variable (tmpIn). The computation is implemented by the function process working on the internal variables tmpIn and tmpOut. The function write takes the produced output and writes to the AXI output. The isolation of the read and write function from the computation results in: - Simple control structures (loops) in the read/write function which makes burst detection simpler. - The isolation of the computational function away from the AXI interfaces, simplifies potential kernel optimization. See the Kernel Optimization chapter for more information. - The internal variables are mapped to on-chip memory, which allow faster access compared to AXI transactions. Acceleration platforms supported in SDAccel environment can have as much as 10 MB on-chip memories that can be used as pipes, local memories, and private memories. Using these resources effectively can greatly improve the efficiency and performance of your applications. ### **Using Full AXI Data Width** The user data width between the kernel and the memory controller can be configured by the SDAccel compiler based on the data types of the kernel arguments. To maximize the data throughput, Xilinx recommends that you choose data types map to the full data width on the memory controller. The memory controller in all supported acceleration cards supports 512-bit user interface, which can be mapped to OpenCL<sup>TM</sup> vector data types, such as int16 or C/C++ arbitrary precision data type $ap\_int<512>$ . As shown on the following figure, you can observe burst AXI transactions (Burst Length 16) and a 512 bit package size (Burst Size 64 Bytes). Figure 17: Burst AXI Transactions This example shows good interface configuration as it maximizes AXI data width as well as it shows actual burst transactions. Complex structs or classes, used to declare interfaces, can lead to very complex hardware interfaces due to memory layout and data packing differences. This can introduce potential issues that are very difficult to debug in a complex system. **RECOMMENDED:** Use simple structs for kernel arguments that can be packed to 32-bit boundary. Refer to the Custom Data Type Example in kernel\_to\_gmem category at Xilinx On-boarding Example GitHub for the recommended way to use structs. ### **OpenCL Attributes** The OpenCL API provides attributes to support a more automatic approach to incrementing AXI data width usage. The change of the interface data types, as stated above is supported in the API as well but will require the same code changes as C/C++ to the algorithm to accommodate the larger input vector. To eliminate manual code modifications, the following OpenCL attributes are interpreted to perform data path widening and vectorization of the algorithm. A detailed description can be found in the SDx Pragma Reference Guide (UG1253). - vec\_type\_hint - reqd\_work\_group\_size - xcl\_zero\_global\_work\_offset Examine the combined functionality on the following case: ``` __attribute__((reqd_work_group_size(64, 1, 1))) __attribute__((vec_type_hint(int))) __attribute__((xcl_zero_global_work_offset)) __kernel void vector_add(__global int* c, __global const int* a, __global const int* b) { size_t idx = get_global_id(0); c[idx] = a[idx] + b[idx]; } ``` In this case, the hard coded interface is a 32-bit wide data path (int \*c, int\* a, int \*b), which drastically limits the memory throughput if implemented directly. However, the automatic widening and transformation is applied, based on the values of the three attributes. - \_\_attribute\_\_((vec\_type\_hint(int))): Declares that int is the main type used for computation and memory transfer (32 bit). This knowledge is used to calculate the vectorization/widening factor based on the target bandwidth of the AXI interface (512 bits). In this example the factor would be 16 = 512 bits / 32 bit. This implies that in theory, 16 values could be processed if vectorization can be applied. - \_\_attribute\_\_((reqd\_work\_group\_size(X, Y, Z))): Defines the total number of work items (where X, Y, and Z are positive constants). X\*Y\*Z is the maximum number of work items therefore defining the maximum possible vectorization factor which would saturate the memory bandwidth. In this example, the total number of work items is 64\*1\*1=64. The actual vectorization factor to be applied will be the greatest common divider of the vectorization factor defined by the actual coded type or the vec\_type\_hint, and the maximum possible vectorization factor defined through reqd\_work\_group\_size. The quotient of maximum possible vectorization factor divided by the actual vectorization factor provides the remaining loop count of the OpenCL description. As this loop is pipelined, it can be advantageous to have several remaining loop iterations to take advantage of a pipelined implementation. This is especially true if the vectorized OpenCL code has long latency. There is one optional parameter that is highly recommended to be specified for performance optimization on OpenCL interfaces. • The \_\_attribute\_\_((xcl\_zero\_global\_work\_offset)) instructs the compiler that no global offset parameter is used at runtime, and all accesses are aligned. This gives the compiler valuable information with regard to alignment of the work groups, which in turn usually propagate to the alignment of the memory accesses (less hardware). It should be noted, that the application of these transformations changes the actual design to be synthesized. Partially unrolled loops require reshaping of local arrays in which data is stored. This usually behaves nicely, but can interact poorly in rare situations. ### For example: - For partitioned arrays, when the partition factor is not divisible by the unrolling/vectorization factor. - The resulting access requires a lot of multiplexers and will create a difficult problem for the scheduler (might severely increase memory usage and compilation time), Xilinx recommends that you use partitioning factors that are powers of two (as the vectorization factor is always a power of two). - If the loop being vectorized has an unrelated resource constraint, the scheduler complains about II not being met. - This is not necessarily correlated with a loss of performance (usually it is still performing better) since the II is computed on the unrolled loop (which has therefore a multiplied throughput for each iteration). - The scheduler informs you of the possible resources constraints and resolving those will further improve the performance. - Note that a common occurrence is that a local array does not get automatically reshaped (usually because it is accessed in a later section of the code in non-vectorizable way). # Reducing Kernel to Kernel Communication Latency with OpenCL Pipes The OpenCL API 2.0 specification introduces a new memory object called a pipe. A pipe stores data organized as a FIFO. Pipe objects can only be accessed using built-in functions that read from and write to a pipe. Pipe objects are not accessible from the host. Pipes can be used to stream data from one kernel to another inside the FPGA without having to use the external memory, which greatly improves the overall system latency. In the SDAccel development environment, pipes must be statically defined outside of all kernel functions. Dynamic pipe allocation using the <code>OpenCL 2.x clCreatePipe</code> API is not currently supported. The depth of a pipe must be specified by using the <code>xcl\_reqd\_pipe\_depth</code> attribute in the pipe declaration. The valid depth values are as follows: - 16 - 32 - 64 - 128 - 256 - 512 - 1024 - 2048 - 4096 - 8192 - 16384 - 32768 A given pipe can have one and only one producer and consumer in different kernels. ``` pipe int p0 __attribute__((xcl_reqd_pipe_depth(32))); ``` Pipes can be accessed using standard OpenCL <code>read\_pipe()</code> and <code>write\_pipe()</code> built-in functions in non-blocking mode or using the Xilinx extended <code>read\_pipe\_block()</code> and <code>write\_pipe\_block()</code> functions in blocking mode. The status of pipes can be queried using OpenCL <code>get\_pipe\_num\_packets()</code> and <code>get\_pipe\_max\_packets()</code> built-in functions. See the OpenCL C Specification, Version 2.0 from Khronos Group for more details on these built-in functions. The following function signatures are the currently supported pipe functions, where gentype indicates the built-in OpenCL C scalar integer or floating-point data types. ``` int read_pipe_block (pipe gentype p, gentype *ptr) int write_pipe_block (pipe gentype p, const gentype *ptr) ``` The following "Blocking Pipes Example" from SDAccel Getting Started Examples on GitHub uses pipes to pass data from one processing stage to the next using blocking read\_pipe\_block() and write\_pipe\_block() functions: ``` pipe int p0 __attribute__((xcl_reqd_pipe_depth(32))); pipe int p1 __attribute__((xcl_reqd_pipe_depth(32))); // Input Stage Kernel : Read Data from Global Memory and write into Pipe P0 kernel __attribute__ ((reqd_work_group_size(1, 1, 1))) void input_stage(__global int *input, int size) { ``` ``` __attribute__((xcl_pipeline_loop)) mem_rd: for (int i = 0 ; i < size ; i++) //blocking Write command to pipe P0 write_pipe_block(p0, &input[i]); } } // Adder Stage Kernel: Read Input data from Pipe PO and write the result // into Pipe P1 kernel __attribute__ ((reqd_work_group_size(1, 1, 1))) void adder_stage(int inc, int size) __attribute__((xcl_pipeline_loop)) execute: for(int i = 0 ; i < size ; int input_data, output_data; //blocking read command to Pipe P0 read_pipe_block(p0, &input_data); output_data = input_data + inc; //blocking write command to Pipe P1 write_pipe_block(p1, &output_data); } } // Output Stage Kernel: Read result from Pipe P1 and write the result to Global // Memory kernel __attribute__ ((reqd_work_group_size(1, 1, 1))) void output_stage(__global int *output, int size) __attribute__((xcl_pipeline_loop)) mem_wr: for (int i = 0 ; i < size ; i++) //blocking read command to Pipe P1 read_pipe_block(p1, &output[i]); } } ``` The Device Traceline view shows the detailed activities and stalls on the OpenCL pipes after hardware emulation is run. This info can be used to choose the correct FIFO sizes to achieve the optimal application area and performance. Figure 18: Device Traceline View ## **Optimizing Computational Parallelism** By default, C/C++ does not model computational parallelism, as it always executes any algorithm sequentially. On the other hand, the OpenCL API does model computational parallelism with respect to work groups, but it does not use any additional parallelism within the algorithm description. However, fully configurable computational engines like FPGAs allow more freedom to exploit computational parallelism. ## **Coding Data Parallelism** To leverage computational parallelism during the implementation of an algorithm on the FPGA, it should be mentioned that the synthesis tool will need to be able to recognize computational parallelism from the source code first. Loops and functions are prime candidates for reflecting computational parallelism and compute units in the source description. However, even in this case, it is key to verify that the implementation takes advantage of the computational parallelism as in some cases the SDx tool might not be able to apply the desired transformation due to the structure of the source code. It is quite common, that some computational parallelism might not be reflected in the source code to begin with. In this case, it will need to be added. A typical example is a kernel that might be described to operate on a single input value, while the FPGA implementation might execute computations more efficiently in parallel on multiple values. This kind of parallel modeling is described in the Using Full AXI Data Width section. A 512-bit interface can be created using OpenCL vector data types such as int16 or C/C++ arbitrary precision data type ap\_int<512>. **Note:** These vector types can also be used as a powerful way to model data parallelism within a kernel, with up to 16 data paths operating in parallel in case of int16. Refer to the *Median Filter Example* in the vision category at SDAccel Getting Started Examples on GitHub for the recommended way to use vectors. ## **Loop Parallelism** Loops are the basic C/C++/OpenCL<sup>™</sup> API method of representing repetitive algorithmic code. The following example illustrates various implementation aspects of a loop structure: ``` for(int i = 0; i<255; i++) { out[i] = in[i]+in[i+1]; } out[255] = in[255];</pre> ``` This code iterates over an array of values and adds consecutive values, except the last value. If this loop is implemented as written, each loop iteration requires two cycles for implementation, which results in a total of 510 cycles for implementation. This can be analyzed in detail through the Schedule Viewer in the HLS Project: Figure 19: Implemented Loop Structure in Schedule Viewer This can also be analyzed in terms of total numbers and latency through the Vivado synthesis results: Performance Estimates □ Timing (ns) Summary Clock Target Estimated Uncertainty 0.62 ap clk 5.00 3.123 □ Latency (clock cycles) Summary Latency Interval min max min max Type 511 511 511 511 none Detail **∃** Instance Loop **Utilization Estimates** □ Summary BRAM 18K DSP48E FF LUT Name DSP Expression 47 FIFO Instance 1362 Memory Multiplexer 1145 Register 52 Total 0 52 2554 Available 5520 1326720 663360 4320 Available SLR 2160 2760 663360 331680 Utilization (%) 0 0 ~0 0 Utilization SLR (%) 0 ~0 ~0 Figure 20: Synthesis Results Performance Estimates The key numbers here are the latency numbers and total LUT usage. For example, depending on the configuration, you could get latency of 511 and total LUT usage of 47. As you will see, these values can widely vary based on the implementation choices. While this implementation will require very little area, it results in significant latency. ### **Unrolling Loops** Unrolling a loop enables the full parallelism of the model to be exploited. To do this, you can simply mark a loop to be unrolled and the tool will create the implementation with the most parallelism possible. To mark a loop to unroll, an OpenCL loop can be marked with the UNROLL attribute: \_\_attribute\_\_((opencl\_unroll\_hint)) or a C/C++ loop can utilize the unroll pragma: #pragma HLS UNROLL When applied to this specific example, the Schedule Viewer in the HLS Project will be: Synthesis(solution) ☐ Schedule Viewer(solution) ☐ Current Module : run Operation\Control Step 1 in 0 V read(read) in\_1\_V\_read(read) tmp\_3(+) node\_520(write) in\_2\_V\_read(read) tmp\_3\_1(+) node\_524(write) in\_3\_V\_read(read) tmp\_3\_2(+) node\_528(write) in\_4\_V\_read(read) tmp\_3\_3(+) node\_532(write) in\_5\_V\_read(read) $tmp_3_4(+)$ node\_536(write) in\_6\_V\_read(read) $tmp_3_5(+)$ node\_540(write) Figure 21: Schedule Viewer With an estimated performance of: in\_7\_V\_read(read) tmp\_3\_6(+) node\_544(write) in\_8\_V\_read(read) tmp\_3\_7(+) node\_548(write) Figure 22: Performance Estimates As you can see, the total latency was considerably improved to now be 127 cycles and as expected the computational hardware was increased to 4845 LUTs, to perform the same computation in parallel. However, if you analyze the for-loop, you might ask why this algorithm cannot be implemented in a single cycle, as each addition is completely independent of the previous loop iteration. The reason is the memory interface to be utilized for the variable out. SDx™ environment uses dual port memory by default for an array. However, this implies that at most two values can be written to the memory per cycle. Thus to see a fully parallel implementation, you must specify that the variable out should be kept in registers as in this example: #pragma HLS array\_partition variable= out complete dim= 0 For more information see the *pragma HLS array\_partition* section in *SDx Pragma Reference Guide* (UG1253). The results of this transformation can be observed in the following Schedule Viewer: ☐ Schedule Viewer(solution) ☐ Synthesis(solution) Current Module : run Operation\Control Step in\_0\_V\_read(read) in\_1\_V\_read(read) tmp\_3(+) node\_1029(write) in\_2\_V\_read(read) tmp\_3\_1(+) node\_1032(write) in\_3\_V\_read(read) tmp\_3\_2(+) node\_1035(write) in\_4\_V\_read(read) tmp\_3\_3(+) node 1038(write) in\_5\_V\_read(read) tmp\_3\_4(+) node\_1041(write) in\_6\_V\_read(read) tmp\_3\_5(+) node\_1044(write) in\_7\_V\_read(read) tmp\_3\_6(+) node\_1047(write) in\_8\_V\_read(read) tmp\_3\_7(+) node\_1050(write) Figure 23: Tranformation Results in Schedule Viewer The associated estimates are: **Performance Estimates** ☐ Timing (ns) Summary Clock Target Estimated Uncertainty 5.00 0.62 ap clk 1.352 Latency (clock cycles) ■ Summary Latency Interval min max min max Type 0 0 none 0 □ Detail ■ Instance Loop **Utilization Estimates** □ Summarv BRAM 18K DSP48E FF Name LUT DSP Expression 4845 FIFO -Instance Memory Multiplexer Register Total 0 0 4845 0 Available 5520 1326720 663360 4320 Available SLR 2160 2760 663360 331680 0 Utilization (%) 0 0 0 0 Utilization SLR (%) 0 1 Figure 24: Transformation Results Performance Estimates As you can see, this code can be implemented as a combinatorial function requiring only a fraction of the cycle to complete. ### **Pipelining Loops** Pipelining loops allows you to overlap iterations of a loop in time. Allowing iterations to operate concurrently is often a good compromise, as resources can be shared between iterations (less resource utilization), while requiring less execution time compared to loops that are not unrolled. Pipelining is enabled in C/C++ via the following pragma: #pragma HLS PIPELINE While the OpenCL API uses the following attribute: ``` __attribute__((xcl_pipeline_loop)) ``` **Note:** The OpenCL API has an additional way of specifying loop pipelining. This has to do with the fact that work item loops are not explicitly stated and pipelining these loops requires the attribute: ``` __attribute__((xcl_pipeline_workitems)) ``` More details to any of these specifications are provided in the SDx Pragma Reference Guide (UG1253) and the SDAccel Environment Programmers Guide (UG1277). In this example, the Schedule Viewer in the HLS Project produces the following information: Figure 25: Pipelining Loops in Schedule Viewer With the overall estimates being: Figure 26: Performance Estimates Because each iteration of a loop consumes only two cycles of latency, there can only be a single iteration overlap. This enables the total latency to be cut into half compared to the original, resulting in 257 cycles of total latency. However, this reduction in latency was achieved using fewer resources when compared to unrolling. In most cases, loop pipelining by itself can improve overall performance. However, the effectiveness of the pipelining will depend on the structure of the loop. Some common limitations are: - Resources with limited availability such as memory ports or process channels can limit the overlap of the iterations (Initiation Interval). - Similarly, loop-carried dependencies such as those created by variables conditions computed in one iteration affecting the next, might increase the initial interval of the pipeline. These are reported by the tool during high-level synthesis and can be observed and examined in the Schedule Viewer. For the best possible performance, the code might have to be modified to eliminate these limiting factors, or the tool needs to be instructed to eliminate some dependency by restructuring the memory implementation of an array or breaking the dependencies all together. ### **Task Parallelism** Task parallelism allows you to take advantage of data flow parallelism. In contrast to loop parallelism, when task parallelism is deployed, full execution units (tasks) are allowed to operate in parallel taking advantage of extra buffering introduced between the tasks. Look at the following example: When this code is executed, the function processA and processB are executed sequentially 128 times in a row. Given the combined latency for processA and processB in the loop is 278, the total latency can be estimated as: Performance Estimates □ Timing (ns) □ Summary Clock Target Estimated Uncertainty ap\_clk 3.33 2.433 0.90 □ Latency (clock cycles) □ Summary Latency Interval min max min max Type 35585 35585 35585 35585 none □ Detail □ Instance Figure 27: Performance Estimates The extra cycle is due to loop setup and can be observed in the Schedule Viewer. **⊞** Loop For C/C++ code, Task Parallelism is performed by adding the DATAFLOW pragma into the forloop: ``` #pragma HLS DATAFLOW ``` For OpenCL API code, add the attribute before the for-loop: ``` __attribute__ ((xcl_dataflow)) ``` Refer to SDx Pragma Reference Guide (UG1253) and SDAccel Environment Programmers Guide (UG1277) for more details regarding the specifics and limitations of these modifiers. As illustrated by the estimates in the HLS Report, applying the transformation will considerably improve the overall performance effectively using a double (ping pong) buffer scheme between the tasks: Figure 28: Performances Estimates The overall latency of the design has almost halved in this case due to concurrent execution of the different tasks of the different iterations. Given the 139 cycles per processing function and the full overlap of the 128 iterations, this allows the total latency to be: ``` (1x only processA + 127x both processes + 1x only processB) * 139 cycles = 17931 cycles ``` Using task parallelism is a very powerful way improve performance when it comes to implementation. However, the effectiveness of applying the DATAFLOW pragma to a specific and arbitrary piece of code might vary vastly. The coding guidelines for applying DATAFLOW effectively are provided in SDx Pragma Reference Guide (UG1253) and SDAccel Environment Programmers Guide (UG1277). However, it is often necessary to actually look at the execution pattern of the individual tasks to understand the final implementation of the DATAFLOW pragma. Towards that end, the SDAccel environment provides the Detailed Kernel Trace, which nicely illustrates concurrent execution. Figure 29: **Detailed Kernel Trace** For this Detailed Kernel Trace, the tool displays the start of the dataflowed loop, as shown in the previous figure. It illustrates how processA is starting up right away with the beginning of the loop, while processB waits until the completion of the processA before it can start up its first iteration. However, while processB completes the first iteration of the loop, processA begins operating on the second iteration and so forth. A more abstract representation of this information is presented in the Application Timeline (Host & Device) and Device Hardware Transaction View (device-only during hardware emulation). ## **Optimizing Compute Units** #### **Data Width** One, if not the most important, aspect for performance is the data width required for the implementation. The tool propagates port widths throughout the algorithm. In some cases, especially when starting out with an algorithmic description, the C/C++/OpenCL™ API code might only utilize large data types such as integers even at the ports of the design. However, as the algorithm gets mapped to a fully configurable implementation, smaller data types such as 10-or 12-bit might often suffice. Towards that end it is beneficial to check the size of basic operations in the HLS Synthesis report during optimization. In general, when the SDx™ environment maps an algorithm onto the FPGA, much processing is required to comprehend the C/C++/OpenCL API structure and extract operational dependencies. Therefore, to perform this mapping the SDx environment generally partitions the source code into operational units which are then mapped onto the FPGA. Several aspects influence the number and size of these operational units (ops) as seen by the tool. In the following table, the basic operations and their bitwidth are reported. Figure 30: Operations Utilization Estimates | <b>∃</b> Summary | | | | | | | | | | |----------------------|---------------|------|----------|------|-------|----|------|-------------|------------| | Name | BRAM | 18K | C DSP48E | | FF | | LU' | Т | | | DSP | - | | - | | - | | - | | | | Expression | - | | - | | | 0 | 1 | .02 | | | FIFO | - | | - | | - | | - | | | | Instance | - | | - | | - | | - | | | | Memory | 0 | | - | | 24 | | | 12 | | | Multiplexer | - | | - | | - | | | 80 | | | Register | - | | - | | 51 | | - | | | | Total | 0 | | 0 | | | 75 | 1 | .94 | | | Available | 4320 | | 552 | 20 1 | L3267 | 20 | 6633 | 60 | | | Available SLR | 216 | | 0 2760 | | 6633 | 60 | 3316 | 80 | | | Utilization (%) | 0 | | | 0 | | ~0 | | ~0 | | | Utilization SLR (%) | | 0 | | 0 | ~0 | | | ~0 | | | Detail | | | | | | | | | | | ■ Instance | | | | | | | | | | | <b>■ DSP48</b> | | | | | | | | | | | Memory | | | | | | | | | | | <b>⊞ FIFO</b> | | | | | | | | | | | <b>■ Expression</b> | | | | | | | | | | | Variable Nar | Variable Name | | ration | DS | P48E | FF | LUT | Bitwidth P0 | Bitwidth P | | i_1_fu_124_p2 | | | + | | 0 | 0 | 11 | 3 | | | i_2_fu_148_p2 | | | + | | 0 | 0 | 15 | 7 | | | i_3_fu_179_p2 | | + | | | 0 | 0 | 15 | 7 | | | sum_i9_fu_194_p2 | | + | | | 0 | 0 | 15 | 8 | | | sum_i_fu_158_p2 | | + | | | 0 | 0 | 15 | 8 | | | exitcond_fu_118 | | icmp | | 0 | 0 | 9 | 3 | | | | exitcond_i6_fu_1 | | icmp | | 0 | 0 | | 7 | | | | exitcond_i_fu_142_p2 | | | icmp | | 0 | 0 | 11 | 7 | | | Total | | | | | | | 102 | 50 | | Look for bit widths of 16, 32, and 64 bits commonly used in algorithmic descriptions, and verify that the associated operation from the C/C++/OpenCL API source actually requires the bit width to be this large. This can considerably improve the implementation of the algorithm, as smaller operations require less computation time. #### **Fixed Point Arithmetic** Some applications use floating point computation only because they are optimized for other hardware architecture. As explained in Deep Learning with INT8 Optimization on Xilinx Devices, using fixed point arithmetic for applications like deep learning can save the power efficiency and area significantly while keeping the same level of accuracy. It is recommended to explore fixed point arithmetic for your application before committing to using floating point operations. Register #### **Macro Operations** It is sometimes advantageous to think about larger computational elements. The tool will operate on the source code independently of the remaining source code, effectively mapping the algorithm without consideration of surrounding operations onto the FPGA. When applied, SDx tool keeps operational boundaries, effectively creating macro operations for specific code. This utilizes the following principles: - Operational locality to the mapping process. - Reduction in complexity for the heuristics. This might create vastly different results when applied. In C/C++ macro operations are created with the help of ``` #pragma HLS inline off ``` While in the OpenCL API, the same kind of macro operation can be generated by *not* specifying the following attribute, when defining a function.: ``` __attribute__((always_inline)) ``` #### **Using Optimized Libraries** The OpenCL specification provides many math built-in functions. All math built-in functions with the $native_-$ prefix are mapped to one or more native device instructions and will typically have better performance compared to the corresponding functions (without the $native_-$ prefix). The accuracy and in some cases the input ranges of these functions is implementation-defined. In SDAccel<sup>TM</sup> environment these $native_-$ built-in functions use the equivalent functions in the Vivado<sup>®</sup> Hight-Level Synthesis (HLS) tool Math library, which are already optimized for Xilinx<sup>®</sup> FPGAs in terms of area and performance. Xilinx recommends that you use $native_-$ built-in functions or the HLS tool Math library if the accuracy meets the application requirement. # **Optimizing Memory Architecture** Memory architecture is a key aspect of implementation. Due to the limited access bandwidth, it can heavily impact the overall performance, as shown in the following example.: ``` for( int j=0; j<256; j++) { #pragma HLS PIPELINE OFF ap_uint<16> sum = 0; for( int i = 0; i<4; i++) { sum += inMem[j][i]; } outMem[j] = sum; } ... Postprocess write local memory to output }</pre> ``` This code adds the four values associated with the inner dimension of the two dimensional input array. If implemented without any additional modifications, it results in the following estimates: Figure 31: Performance Estimates | Perfor | mance Est | timat | es | | | | | | |--------|---------------------|---------|-------|-------------------|--------------|---------|------------|-----------| | ⊡ Timi | ing (ns) | | | | | | | | | - S | ummary | | | | | | | | | Clo | ck Target | Estim | nated | Uncertainty | | | | | | ap_ | clk 3.33 | 2 | 2.433 | 0.90 | | | | | | _ Late | ency (cloc | k cyc | les) | | | | | | | - S | ummary | | | | | | | | | mi | tency In<br>n max m | | ax Ty | | | | | | | - D | etail | | | | | | | | | + | Instance | | | | | | | | | - | Loop | | | | | | | | | | | Latency | | | Initiation I | nterval | | | | L | oop Name | min | max | Iteration Latency | achieved | target | Trip Count | Pipelined | | - | Loop 1 | 1034 | 1034 | 12 | 1 | 1 | 1024 | yes | | - | Loop 2 | 4608 | 4608 | 18 | - | - | 256 | no | | - | + Loop 2.1 | 16 | 16 | 4 | - | - | 4 | no | | - | Loop 3 | 257 | 257 | 3 | 1 | 1 | 256 | yes | The overall latency of 4608 (Loop 2) is due to 256 iterations of 18 cycles (16 cycles spent in the inner loop, plus the reset of sum, plus the output being written). This is can be observed in the Schedule Viewer in the HLS Project. The estimates become considerably better when unrolling the inner loop. Figure 32: Performance Estimates However, this improvement is largely due to the fact that this process uses both ports of a dual port memory. This can be seen from the Schedule Viewer in the HLS Project: Operation\Control Step 13 14 15 16 17 18 - Loop 2 j5(phi\_mux) exitcond2(icmp) j\_3(+) tmp\_s(|) inMem\_V\_load(read) inMem\_V\_load\_1(read) tmp 6(|) tmp\_11(|) inMem\_V\_load\_2(read) inMem\_V\_load\_3(read) tmp1(+)tmp2(+)sum\_V\_3(+) node\_81(write) Figure 33: Schedule Viewer As you can see, two read operations are performed per cycle to access all the values from the memory to calculate the sum. This is often an undesired result as this completely blocks the access to the memory. To further improve the results, the memory can be split into four smaller memories along the second dimension: ``` #pragma HLS ARRAY_PARTITION variable=inMem complete dim=2 ``` This results in four array reads, all executed on different memories using a single port: Figure 34: Executed Four Arrays Results Using a total of 256 \* 4 cycles = 1024 cycles for loop 2. Figure 35: Performance Estimates Alternatively, the memory can be reshaped into to a single memory with four words in parallel. This is performed through the pragma: #pragma HLS array\_reshape variable=inMem complete dim=2 This results in the same latency as when the array partitioning, but with a single memory using a single port: Operation\Control Step 15 16 17 18 ▼ Loop 2 - Loop 2 j5(phi\_mux) - Loop 2 exitcond2(icmp) - Loop 2 j\_3(+) - Indicate the property of prope Figure 36: Latency Result Although, either solution creates comparable results with respect to overall latency and utilization, reshaping the array results in cleaner interfaces and less routing congestion making this the preferred solution. **Note:** This completes array optimization, in a real design the latency could be further improved by exploiting loop parallelism (see the Loop Parallelism section). ``` void run (ap_uint<16> in[256][4], ap_uint<16> out[256] ) [ ap_uint<16> inMem[256][4]; ap_uint<16> outMem[256]; #pragma HLS array_reshape variable=inMem complete dim=2 ... Preprocess input to local memory for( int j=0; j<256; j++) { #pragma HLS PIPELINE OFF ap\_uint<16> sum = 0; for( int i = 0; i < 4; i + +) { #pragma HLS UNROLL sum += inMem[j][i]; } outMem[j] = sum; ... Postprocess write local memory to output } ``` # **Host Optimization** This section focuses on Host Code optimization. The host code uses the OpenCL™ API to schedule the individual compute unit executions and data transfers from and to the FPGA board. As a result, you need to be thinking about concurrent execution through the OpenCL queue(s). This section discusses in detail common pitfalls and how to recognize and address them. # **Reducing Overhead of Kernel Enqueing** The OpenCL API execution model supports data parallel and task parallel programming models. Kernels are usually enqueued by the OpenCL Runtime multiple times and then scheduled to be executed on the device. You must send the command to start the kernel in one of two ways: - Using clEnqueueNDRange API for the data parallel case. - Using clEnqueueTask for the task parallel case. The dispatching process is executed on the host processor and the actual commands and kernel arguments need to be sent to the FPGA via PCle<sup>®</sup> link. In the current Xilinx runtime (XRT), the overhead of dispatching the command and arguments to the FPGA is between 30us and 60us, depending the number of arguments on the kernel. You can reduce the impact of this overhead by minimizing the number of times the kernel needs to be executed. For the data parallel case, Xilinx® recommends that you carefully choose the global and local work sizes for your host code and kernel so that the global work size is a small multiple of the local work size. Ideally, the global work size is the same as the local work size as shown in the code snippet below: For the task parallel case, Xilinx recommends that you minimize the calls to clEnqueueTask. Ideally, you should finish all the work load in a single call to clEnqueueTask. #### **Data Transfers** # Overlapping Data Transfers with Kernel Computation Applications, such as database analytics, have a much larger data set than the available memory on the acceleration device. They require the complete data to be transferred and processed in blocks. Techniques that overlap the data transfers with the computation are critical to achieve high performance for these applications. Below is the vector add kernel from the OpenCL Overlap Data Transfers with Kernel Computation Example in the host category from Xilinx On-boarding Example GitHub. There are four tasks to perform in the host application for this example: - 1. Write buffer a (Wa) - 2. Write buffer b (Wb) - 3. Execute vadd kernel - 4. Read buffer c (Rc) The asynchronous nature of OpenCL data transfer and kernel execution APIs allows overlap of data transfers and kernel execution as illustrated in the figure below. In this example, double buffering is used for all buffers so that the compute unit can process one set of buffers while the host can operate on the other set of buffers. The OpenCL event object provides an easy way to set up complex operation dependencies and synchronize host threads and device operations. The arrows in the figure below show how event triggering can be set up to achieve optimal performance. Figure 37: Event Triggering Set Up The host code snippet below enqueues the four tasks in a loop. It also sets up event synchronization between different tasks to ensure that data dependencies are met for each task. The double buffering is set up by passing different memory objects values to clEnqueueMigrateMemObjects API. The event synchronization is achieved by having each API call wait for other event as well as trigger its own event when the API completes. ``` for (size_t iteration_idx = 0; iteration_idx < num_iterations; iteration_idx++) { int flag = iteration_idx % 2; if (iteration_idx >= 2) { clWaitForEvents(1, &map_events[flag]); OCL_CHECK(clReleaseMemObject(buffer_a[flag])); OCL_CHECK(clReleaseMemObject(buffer_b[flag])); OCL_CHECK(clReleaseMemObject(buffer_c[flag])); OCL_CHECK(clReleaseEvent(read_events[flag])); OCL_CHECK(clReleaseEvent(kernel_events[flag])); } buffer_a[flag] = clCreateBuffer(world.context, CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR, bytes_per_iteration, &A[iteration_idx * elements_per_iteration], NULL); buffer_b[flag] = clCreateBuffer(world.context, CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR, bytes_per_iteration, &B[iteration_idx * elements_per_iteration], NULL); buffer_c[flag] = clCreateBuffer(world.context, CL_MEM_WRITE_ONLY | CL_MEM_USE_HOST_PTR, bytes_per_iteration, &device_result[iteration_idx * elements_per_iteration], NULL); array<cl_event, 2> write_events; printf("Enqueueing Migrate Mem Object (Host to Device) calls\n"); // These calls are asynchronous with respect to the main thread because // are passing the CL_FALSE as the third parameter. Because we are passing // the events from the previous kernel call into the wait list, it will wait // for the previous operations to complete before continuing OCL_CHECK(clEnqueueMigrateMemObjects( world.command_queue, 1, &buffer_a[iteration_idx % 2], 0 /* flags, 0 means from host */, O, NULL, &write_events[0])); set_callback(write_events[0], "ooo_queue"); OCL_CHECK(clEnqueueMigrateMemObjects( world.command_queue, 1, &buffer_b[iteration_idx % 2], 0 /* flags, 0 means from host */, O, NULL, &write_events[1])); ``` ``` set_callback(write_events[1], "ooo_queue"); xcl_set_kernel_arg(kernel, 0, sizeof(cl_mem), &buffer_c[iteration_idx % 2]); xcl_set_kernel_arg(kernel, 1, sizeof(cl_mem), &buffer_a[iteration_idx % 2]); xcl_set_kernel_arg(kernel, 2, sizeof(cl_mem), &buffer_b[iteration_idx % 2]); xcl_set_kernel_arg(kernel, 3, sizeof(int), &elements_per_iteration); printf("Enqueueing NDRange kernel.\n"); // This event needs to wait for the write buffer operations to complete // before executing. We are sending the write_events into its wait list // ensure that the order of operations is correct. OCL_CHECK(clEnqueueNDRangeKernel(world.command_queue, kernel, 1, nullptr, &global, &local, 2, write_events.data(), &kernel_events[flag])); set_callback(kernel_events[flag], "ooo_queue"); printf("Enqueueing Migrate Mem Object (Device to Host) calls\n"); // This operation only needs to wait for the kernel call. This call will // potentially overlap the next kernel call as well as the next read // operations OCL_CHECK( clEnqueueMigrateMemObjects(world.command_queue, 1, &buffer_c[iteration_idx % 2], CL_MIGRATE_MEM_OBJECT_HOST, 1, &kernel_events[flag], &read_events[flag])); set_callback(read_events[flag], "ooo_queue"); clEnqueueMapBuffer(world.command_queue, buffer_c[flag], CL_FALSE, CL_MAP_READ, 0, bytes_per_iteration, 1, &read_events[flag], &map_events[flag], 0); set_callback(map_events[flag], "ooo_queue"); OCL_CHECK(clReleaseEvent(write_events[0])); OCL_CHECK(clReleaseEvent(write_events[1])); } ``` The Application Timeline view below clearly shows that the data transfer time is completely hidden, while the compute unit vadd\_1 is running constantly. Figure 38: Data Transfer Time Hidden in Application Timeline View #### **Buffer Memory Segmentation** Allocation and deallocation of memory buffers can lead to memory segmentation in the DDRs. This might result in sub-optimal performance of compute units, even if they could theoretically execute in parallel. This problem occurs most often when multiple pthreads for different compute units are used, and the threads allocate and release many device buffers with different sizes every time they enqueue the kernels. In this case, the timeline trace will exhibit gaps between kernel executions and it just seems the processes are sleeping. Each buffer allocated by runtime should be continuous in hardware. For large memory, it might take a lot of time to wait for that space to be freed, when many buffers are allocated and deallocated. This can be resolved by allocating device buffer, and reusing it between different enqueues of a kernel. # **Compute Unit Scheduling** Scheduling kernel operations is key to overall system performance. This becomes even more important when implementing multiple compute units (of the same kernel or of different kernels). This section examines the different command queues responsible for scheduling the kernels. #### **Multiple In-Order Command Queues** The following figure shows an example with two in-order command queues, CQ0 and CQ1. The scheduler dispatches commands from each queue in order, but commands from CQ0 and CQ1 can be pulled out by the scheduler in any order. You must manage synchronization between CQ0 and CQ1 if required. Figure 39: Example with Two In-Order Command Queues Below is the code snippet from the *Concurrent Kernel Execution Example* in host category from SDAccel Getting Started Examples on GitHub that sets up multiple in-order command queues and enqueues commands into each queue: #### **Single Out-of-Order Command Queue** The following figure shows an example with a single out-of-order command queue. The scheduler can dispatch commands from the queue in any order. You must set up event dependencies and synchronizations explicitly if required. Figure 40: Example with Single Out-of-Order Command Queue Below is the code snippet from the *Concurrent Kernel Execution Example* from SDAccel Getting Started Examples on GitHub that sets up a single out-of-order command queue and enqueues commands: The Application Timeline view (as shown in the following figure) that the compute unit mmult\_1 is running in parallel with the compute units mscale\_1 and madd\_1, using both multiple inorder queues and single out-of-order queue methods. Figure 41: Application Timeline View Showing mult\_1 Running with mscale\_1 and madd 1 # Using clEnqueueMigrateMemObjects to Transfer Data The OpenCL framework provides a number of APIs for transferring data between the host and the device. Typically, data movement APIs, such as clEnqueueWriteBuffer and clEnqueueReadBuffer, implicitly migrate memory objects to the device after they are enqueued. They do not guarantee when the data is transferred. This makes it difficult for the host application to overlap the placements of the memory objects onto the device with the computation carried out by kernels. OpenCL 1.2 framework introduced a new API, clEnqueueMigrateMemObjects. Using this API, memory migration can be explicitly performed ahead of the dependent commands. This allows the application to preemptively change the association of a memory object, through regular command queue scheduling, to prepare for another upcoming command. This also permits an application to overlap the placement of memory objects with other unrelated operations before these memory objects are needed, potentially hiding transfer latencies. After the event associated by clEnqueueMigrateMemObjects has been marked CL\_COMPLETE, the memory objects specified in mem\_objects have been successfully migrated to the device associated with command\_queue. The clEnqueueMigrateMemObjects API can also be used to direct the initial placement of a memory object after creation, possibly avoiding the initial overhead of instantiating the object on the first enqueued command to use it. Another advantage of clEnqueueMigrateMemObjects is that it can migrate multiple memory objects in a single API call. This reduces the overhead of scheduling and calling functions for transferring data for more than one memory object. Below is the code snippet showing the usage of clEnqueueMigrateMemObjects from Vector Multiplication for XPR Device example in the host category from SDAccel Getting Started Examples on GitHub. # **Topological Optimization** This section focuses on the topological optimization. It looks at the attributes related to the rough layout and implementation of multiple compute units and their impact on performance. ## **Multiple Compute Units** Depending on available resources on the target device, multiple compute units of the same kernel (or different kernels) can be created to run in parallel, which improves the system processing time and throughput. Different kernels are provided as separate .xo files on the xocc link line. Multiple compute units of a kernel can be added by using the --nk option: ``` xocc -1 --nk <kernel_name:number(:compute_unit_name1.compute_unit_name2...)> ``` Note: Each of the individual kernels will have to be individually driven by the host code. # **Using Multiple DDR Banks** Acceleration cards supported in SDAccel<sup>™</sup> environment provide one, two, or four DDR banks, and up to 80 GB/s raw DDR bandwidth. For kernels moving large amount of data between the FPGA and the DDR, Xilinx<sup>®</sup> recommends that you direct the SDAccel compiler and runtime library to use multiple DDR banks. In addition to DDR banks, the host application can access PLRAM to transfer data directly to a kernel. This feature is enabled using the xocc --sp option with compatible platforms. To take advantage of multiple DDR banks, you need to assign CL memory buffers to different banks in the host code as well as configure the xclbin file to match the bank assignment in xocc command line. The following block diagram shows the *Global Memory Two Banks Example* in "kernel\_to\_gmem" category on SDAccel Getting Started Examples on GitHub that connects the input pointer to DDR bank 0 and output pointer to DDR bank 1. Figure 42: Global Memory Two Banks Example #### **Assigning DDR Bank in Host Code** Bank assignment in host code is supported by Xilinx vendor extension. The following code snippet shows the header file required, as well as assigning input, and output buffers to DDR bank 0 and bank 1 respectively: ``` #include <CL/cl_ext.h> int main(int argc, char** argv) cl_mem_ext_ptr_t inExt, outExt; // Declaring two extensions for both buffers inExt.flags = 0|XCL_MEM_TOPOLOGY; // Specify BankO Memory for input memory outExt.flags = 1|XCL_MEM_TOPOLOGY; // Specify Bank1 Memory for output inExt.obj = 0 ; outExt.obj = 0; // Setting Obj and Param to Zero inExt.param = 0 ; outExt.param = 0; //Allocate Buffer in BankO of Global Memory for Input Image using Xilinx Extension cl_mem buffer_inImage = clCreateBuffer(world.context, CL_MEM_READ_ONLY | CL_MEM_EXT_PTR_XILINX, image_size_bytes, &inExt, &err); if (err != CL_SUCCESS) { std::cout << "Error: Failed to allocate device Memory" << std::endl;</pre> return EXIT_FAILURE; //Allocate Buffer in Bank1 of Global Memory for Input Image using Xilinx Extension cl_mem buffer_outImage = clCreateBuffer(world.context, CL_MEM_WRITE_ONLY | CL_MEM_EXT_PTR_XILINX, image_size_bytes, &outExt, NULL); if (err != CL_SUCCESS) { std::cout << "Error: Failed to allocate device Memory" << std::endl; return EXIT_FAILURE; } } ``` cl\_mem\_ext\_ptr\_t is a struct as defined below: ``` typedef struct{ unsigned flags; void *obj; void *param; } cl_mem_ext_ptr_t; ``` - Valid values for flags are: - . XCL\_MEM\_DDR\_BANKO - XCL\_MEM\_DDR\_BANK1 - XCL\_MEM\_DDR\_BANK2 - . XCL\_MEM\_DDR\_BANK3 - <id>| XCL\_MEM\_TOPOLOGY **Note:** The <id> is determined by looking at the Memory Configuration section in the xxx.xclbin.info file generated next to the xxx.xclbin file. In the xxx.xclbin.info file, the global memory (DDR, PLRAM, etc.) is listed with an index representing the <id>. - obj is the pointer to the associated host memory allocated for the CL memory buffer only if CL\_MEM\_USE\_HOST\_PTR flag is passed to clCreateBuffer API, otherwise set it to NULL. - param is reserved for future use. Always assign it to 0 or NULL. #### **Assigning Global Memory for Kernel Code** #### Creating Multiple AXI Interfaces OpenCL<sup>™</sup> kernels, C/C++ kernels, and RTL kernels have different methods for assigning function parameters to AXI interfaces. • For OpenCL kernels, the --max\_memory\_ports option is required to generate one AXI4 interface for each global pointer on the kernel argument. The AXI4 interface name is based on the order of the global pointers on the argument list. The following code is taken from the example gmem\_2banks\_oc1 in the kernel\_to\_gmem category from the SDAccel Getting Started Examples on GitHub: ``` __kernel __attribute__ ((reqd_work_group_size(1, 1, 1))) void apply_watermark(__global const TYPE * __restrict input, __global TYPE * __restrict output, int width, int height) { ... } ``` In this example, the first global pointer input is assigned an AXI4 name M\_AXI\_GMEM0, and the second global pointer output is assigned a name M\_AXI\_GMEM1. For C/C++ kernels, multiple AXI4 interfaces are generated by specifying different "bundle" names in the HLS INTERFACE pragma for different global pointers. Refer to the SDAccel Environment Programmers Guide (UG1277) for more information. The following is a code snippet from the <code>gmem\_2banks\_c</code> example that assigns the <code>input</code> pointer to the bundle <code>gmem0</code> and the <code>output</code> pointer to the bundle <code>gmem1</code>. The bundle name can be any valid C string, and the AXI4 interface name generated will be <code>M\_AXI\_<bundle\_name></code>. For this example, the input pointer will have AXI4 interface name as <code>M\_AXI\_gmem0</code>, and the output pointer will have <code>M\_AXI\_gmem1</code>. ``` #pragma HLS INTERFACE m_axi port=input offset=slave bundle=gmem0 #pragma HLS INTERFACE m_axi port=output offset=slave bundle=gmem1 ``` • For RTL kernels, the port names are generated during the import process by the RTL kernel wizard. The default names proposed by the RTL kernel wizard are m00\_axi and m01\_axi. If not changed, these names have to be used when assigning a DDR bank through the --sp option. #### Assigning AXI Interfaces to DDR Banks **IMPORTANT!** When using more than one DDR interface, Xilinx requires you to specify the DDR memory bank for each kernel/CU using the --sp option, and specify in which SLR the kernel is placed. Refer to the XOCC command in the SDx Command and Utility Reference Guide (UG1279) for details of the --sp command option, and the SDAccel Environment User Guide (UG1023) for details on SLR placement. AXI4 interfaces are connected to DDR banks using the --sp option. The --sp option value is in the format of <kernel\_instance\_name>.<interface\_name>:<DDR\_bank\_name>. Valid DDR bank names for the --sp option are: - DDR[0] - DDR[1] - DDR[2] - DDR[3] The following is the command line example that connects the input pointer (M\_AXI\_GMEM0) to DDR bank 0 and the output pointer (M\_AXI\_GMEM1) to DDR bank 1: ``` xocc --max_memory_ports apply_watermark --sp apply_watermark_1.m_axi_gmem0:DDR[0] --sp apply_watermark_1.m_axi_gmem1:DDR[1] ``` You can use the Device Hardware Transaction view to observe the actual DDR Bank communication, and to analyze DDR usage. Figure 43: Device Hardware Transaction View Transactions on DDR Bank #### Assigning AXI Interfaces to PLRAM Some platforms support PLRAMs. In these cases, use the same <code>--sp</code> option as described in Assigning AXI Interfaces to DDR Banks, but use the name, PLRAM[id]. Valid names supported by specific platforms can be found in the Memory Configuration section of the <code>xclibin.info</code> file generated alongside <code>xclbin</code>. #### Assigning Kernels to SLR regions Assigning ports to DDR banks requires that the kernel will have to be physically routed on the FPGA to connect to the assigned DDR. Currently, large FPGAs use stacked silicon devices with several Super Logic Regions (SLRs). By default, the SDAccel environment will place the compute units in the same SLR as the shell. This might not always be desirable, especially when specific DDR banks are used that might be in another SLR region. As a result, Xilinx recommends to use the --slr option to map kernels to be close to the used DDR memory. For example, the $apply_watermark_1$ kernel above can be mapped to SLR 1 by applying the following link option: ``` xocc -1 --slr apply_watermark_1:SLR1 ``` To better understand the platform attributes, such as the number of DDRs and SLR regions, use platforminfo. For more information, refer to the SDx Command and Utility Reference Guide (UG1279). ### Appendix A # **Examples** To help users quickly get started with the SDAccel™ environment, SDAccel Getting Started Examples on GitHub hosts many examples to demonstrate good design practices, coding guidelines, design pattern for common applications, and most importantly optimization techniques to maximize application performance. The on-boarding examples are divided into several main categories. Each category has various key concepts that are illustrated by individual examples in both OpenCL™ C and C/C++, when applicable. All examples include Makefile for running software emulation, hardware emulation, and running on hardware and a README.md file that explains the example in detail. # Additional Resources and Legal Notices #### Xilinx Resources For support resources such as Answers, Documentation, Downloads, and Forums, see Xilinx Support. # **Documentation Navigator and Design Hubs** Xilinx® Documentation Navigator (DocNav) provides access to Xilinx documents, videos, and support resources, which you can filter and search to find information. DocNav is installed with the SDSoC™ and SDAccel™ development environments. To open it: - On Windows, select Start → All Programs → Xilinx Design Tools → DocNav. - At the Linux command prompt, enter docnav. Xilinx Design Hubs provide links to documentation organized by design tasks and other topics, which you can use to learn key concepts and address frequently asked questions. To access the Design Hubs: - In DocNav, click the **Design Hubs View** tab. - On the Xilinx website, see the Design Hubs page. Note: For more information on DocNav, see the Documentation Navigator page on the Xilinx website. #### References - 1. SDAccel Environment Release Notes, Installation, and Licensing Guide (UG1238) - 2. SDAccel Environment User Guide (UG1023) - 3. SDAccel Environment Profiling and Optimization Guide (UG1207) - 4. SDAccel Environment Getting Started Tutorial (UG1021) - 5. SDAccel™ Development Environment web page - 6. Vivado® Design Suite Documentation - 7. Vivado Design Suite User Guide: Designing IP Subsystems using IP Integrator (UG994) - 8. Vivado Design Suite User Guide: Creating and Packaging Custom IP (UG1118) - 9. Vivado Design Suite User Guide: Partial Reconfiguration (UG909) - 10. Vivado Design Suite User Guide: High-Level Synthesis (UG902) - 11. UltraFast Design Methodology Guide for the Vivado Design Suite (UG949) - 12. Vivado Design Suite Properties Reference Guide (UG912) - 13. Khronos Group web page: Documentation for the OpenCL standard - 14. Xilinx® Virtex® UltraScale+™ FPGA VCU1525 Acceleration Development Kit - 15. Xilinx® Kintex® UltraScale™ FPGA KCU1500 Acceleration Development Kit # **Please Read: Important Legal Notices** The information disclosed to you hereunder (the "Materials") is provided solely for the selection and use of Xilinx products. To the maximum extent permitted by applicable law: (1) Materials are made available "AS IS" and with all faults, Xilinx hereby DISCLAIMS ALL WARRANTIES AND CONDITIONS, EXPRESS, IMPLIED, OR STATUTORY, INCLUDING BUT NOT LIMITED TO WARRANTIES OF MERCHANTABILITY, NON-INFRINGEMENT, OR FITNESS FOR ANY PARTICULAR PURPOSE; and (2) Xilinx shall not be liable (whether in contract or tort, including negligence, or under any other theory of liability) for any loss or damage of any kind or nature related to, arising under, or in connection with, the Materials (including your use of the Materials), including for any direct, indirect, special, incidental, or consequential loss or damage (including loss of data, profits, goodwill, or any type of loss or damage suffered as a result of any action brought by a third party) even if such damage or loss was reasonably foreseeable or Xilinx had been advised of the possibility of the same. Xilinx assumes no obligation to correct any errors contained in the Materials or to notify you of updates to the Materials or to product specifications. You may not reproduce, modify, distribute, or publicly display the Materials without prior written consent. Certain products are subject to the terms and conditions of Xilinx's limited warranty, please refer to Xilinx's Terms of Sale which can be viewed at https://www.xilinx.com/legal.htm#tos; IP cores may be subject to warranty and support terms contained in a license issued to you by Xilinx. Xilinx products are not designed or intended to be fail-safe or for use in any application requiring fail-safe performance; you assume sole risk and liability for use of Xilinx products in such critical applications, please refer to Xilinx's Terms of Sale which can be viewed at https://www.xilinx.com/legal.htm#tos. #### **AUTOMOTIVE APPLICATIONS DISCLAIMER** AUTOMOTIVE PRODUCTS (IDENTIFIED AS "XA" IN THE PART NUMBER) ARE NOT WARRANTED FOR USE IN THE DEPLOYMENT OF AIRBAGS OR FOR USE IN APPLICATIONS THAT AFFECT CONTROL OF A VEHICLE ("SAFETY APPLICATION") UNLESS THERE IS A SAFETY CONCEPT OR REDUNDANCY FEATURE CONSISTENT WITH THE ISO 26262 AUTOMOTIVE SAFETY STANDARD ("SAFETY DESIGN"). CUSTOMER SHALL, PRIOR TO USING OR DISTRIBUTING ANY SYSTEMS THAT INCORPORATE PRODUCTS, THOROUGHLY TEST SUCH SYSTEMS FOR SAFETY PURPOSES. USE OF PRODUCTS IN A SAFETY APPLICATION WITHOUT A SAFETY DESIGN IS FULLY AT THE RISK OF CUSTOMER, SUBJECT ONLY TO APPLICABLE LAWS AND REGULATIONS GOVERNING LIMITATIONS ON PRODUCT LIABILITY. #### Copyright © Copyright 2016-2019 Xilinx, Inc. Xilinx, the Xilinx logo, Alveo, Artix, ISE, Kintex, Spartan, Versal, Virtex, Vivado, Zynq, and other designated brands included herein are trademarks of Xilinx in the United States and other countries. OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission by Khronos. HDMI, HDMI logo, and High-Definition Multimedia Interface are trademarks of HDMI Licensing LLC. AMBA, AMBA Designer, Arm, ARM1176JZ-S, CoreSight, Cortex, PrimeCell, Mali, and MPCore are trademarks of Arm Limited in the EU and other countries. All other trademarks are the property of their respective owners.