AI Engine-ML Intrinsics User Guide (v2024.2)
Intrinsics allowing you to perform MUL/MAC operations and a few of their variants.
For integer datatypes, a matrix A of size MxN is multiplied with a matrix B of size NxP. The naming convention for these operations is: [operation][_MxN_NxP]{_Cch}{_conf} or [operation]_conv_MxN{_Cch}{_conf}. Properties in [] are mandatory, properties in {} are optional. In this naming, conv indicates a convolutional operation, conf indicates the use of sub, zero or shift masks and C gives the number of channels.
For an MxN vector multiply convolution operation, the calculation performed is:
\[ \text{mul_conv_MxN}(F,G)(x) = \sum_{u=0}^{\text{N}-1}{G(u) F(x+u)}, \quad x = 0, \ldots, \text{M}-1 \]
where the vector \(F\) has length \(\text{M}+\text{N}-1\), the vector \(G\) has length \(\text{N}\), and \(x\) indexes the \(\text{M}\) output elements.
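The formula above can be checked against a scalar reference model. The following is a minimal sketch in plain C++, not an AIE-ML intrinsic; the function name `mul_conv_ref` and the choice of 8-bit inputs with 32-bit outputs are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Scalar reference model of the convolution formula; hypothetical helper,
// not part of the intrinsics API.
// F holds M + N - 1 elements, G holds N elements, the result holds M elements.
std::vector<int32_t> mul_conv_ref(const std::vector<int8_t>& F,
                                  const std::vector<int8_t>& G,
                                  std::size_t M) {
    const std::size_t N = G.size();
    assert(F.size() == M + N - 1);
    std::vector<int32_t> out(M, 0);
    for (std::size_t x = 0; x < M; ++x) {
        for (std::size_t u = 0; u < N; ++u) {
            // out(x) = sum over u of G(u) * F(x + u)
            out[x] += int32_t(G[u]) * int32_t(F[x + u]);
        }
    }
    return out;
}
```

For example, with M = 4, F = {1, 2, 3, 4, 5} and G = {1, -1}, every output element is F(x) - F(x+1) = -1.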
For element-wise operations, the naming is [operation_elem_C]{_N}. Here, C is the number of channels and N is the number of columns of matrix A/rows of matrix B. N is either two or it is omitted. The element-wise operations are executed channel by channel. The output will also be a matrix with C channels.
For complex datatypes, a multiplication of two matrices with complex elements is performed. The naming convention for these operations is [operation_elem_8]{_conf} for Multiply-accumulate of 32b x 16b complex integer datatypes and [operation_elem_8_2]{_conf} for Multiply-accumulate of 16b x 16b complex integer datatypes. Here, eight is the number of channels and two is the number of columns of matrix A/rows of matrix B. The matrix multiplication is performed individually for each channel of the input matrices. The output will also be a matrix with eight channels.
The following table shows the matrix multiplications that can be completed within a single cycle.
Precision Mode | Channels | Matrix A | Matrix B | Matrix C |
---|---|---|---|---|
8-bit x 4-bit = 32-bit | 1 | 4x16 | 16x8 | 4x8 |
8-bit x 4-bit = 32-bit | 1 | 4x32 | 32x8 (sparse) | 4x8 |
8-bit x 8-bit = 32-bit | 1 | 4x8 | 8x8 | 4x8 |
8-bit x 8-bit = 32-bit | 32 | 1x2 | 2x1 | 1x1 |
8-bit x 8-bit = 32-bit | 8 | 4x4 (convolution) | 4x1 | 4x1 |
8-bit x 8-bit = 32-bit | 4 | 8x8 (convolution) | 8x1 | 8x1 |
8-bit x 8-bit = 32-bit | 1 | 32x8 (convolution) | 8x1 | 32x1 |
8-bit x 8-bit = 32-bit | 1 | 4x16 | 16x8 (sparse) | 4x8 |
16-bit x 8-bit = 32-bit | 1 | 4x4 | 4x8 | 4x8 |
16-bit x 8-bit = 32-bit | 2 | 4x4 | 4x4 | 4x4 |
16-bit x 16-bit = 32-bit | 1 | 4x2 | 2x8 | 4x8 |
16-bit x 16-bit = 32-bit | 32 | 1x1 | 1x1 | 1x1 |
16-bit x 8-bit = 64-bit | 1 | 2x8 | 8x8 | 2x8 |
16-bit x 8-bit = 64-bit | 1 | 4x8 | 8x4 | 4x4 |
16-bit x 8-bit = 64-bit | 1 | 2x16 | 16x8 (sparse) | 2x8 |
16-bit x 16-bit = 64-bit | 1 | 2x4 | 4x8 | 2x8 |
16-bit x 16-bit = 64-bit | 1 | 4x4 | 4x4 | 4x4 |
16-bit x 16-bit = 64-bit | 16 | 1x2 | 2x1 | 1x1 |
16-bit x 16-bit = 64-bit | 1 | 16x4 (convolution) | 4x1 | 16x1 |
Complex 16-bit x Complex 16-bit = 64-bit | 8 | 1x2 | 2x1 | 1x1 |
16-bit x 16-bit = 64-bit | 1 | 2x8 | 8x8 (sparse) | 2x8 |
32-bit x 16-bit = 64-bit | 1 | 4x2 | 2x4 | 4x4 |
Complex 32-bit x Complex 16-bit = 64-bit | 8 | 1x1 | 1x1 | 1x1 |
bfloat16 x bfloat16 = fp32 | 1 | 4x8 | 8x4 | 4x4 |
bfloat16 x bfloat16 = fp32 | 16 | 1x2 | 2x1 | 1x1 |
bfloat16 x bfloat16 = fp32 | 1 | 4x16 | 16x4 (sparse) | 4x4 |
bfloat16 x cbfloat16 = fp32 | 1 | 2x8 | 8x2 | 2x2 |
cbfloat16 x bfloat16 = fp32 | 1 | 2x8 | 8x2 | 2x2 |
cbfloat16 x cbfloat16 = fp32 | 1 | 2x8 | 8x2 | 2x2 |
cbfloat16 x cbfloat16 = fp32 | 8 | 1x2 | 2x1 | 1x1 |
bfloat16 x cbfloat16 = fp32 | 8 | 1x2 | 2x1 | 1x1 |
cbfloat16 x bfloat16 = fp32 | 8 | 1x2 | 2x1 | 1x1 |
We can summarize the MUL and the MAC operation like this:
Here, 'x' denotes the matrix multiplication operator. In the same way, the MSC, NEGMUL, MACMUL and MAC/MSC variants with an additional acc_in2 input can be summarized as:
The convolve variants of these intrinsics differ in that they apply a convolution to the vectors instead of a matrix multiplication. Here, '*' denotes the vector convolution operator, X_vec is the matrix, and Y_vec is the kernel.
Some variants allow the passing of masks that are used to determine sign, zeroing and negation of vector or accumulator lanes. These masks are the following:
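The basic matrix forms (MUL, MAC, MSC and NEGMUL) can be modelled in scalar C++ as a sketch. The `Mat` type and the `*_ref` function names are hypothetical helpers for illustration, not intrinsics. The sketch leaves out the MACMUL and acc_in2 variants.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Row-major matrix of 32-bit values; a plain reference type, not an AIE-ML vector.
struct Mat {
    std::size_t rows, cols;
    std::vector<int32_t> v;
    Mat(std::size_t r, std::size_t c) : rows(r), cols(c), v(r * c, 0) {}
    int32_t& at(std::size_t r, std::size_t c) { return v[r * cols + c]; }
    int32_t at(std::size_t r, std::size_t c) const { return v[r * cols + c]; }
};

// acc_out = acc_in + sign * (X x Y), where 'x' is the matrix product.
static Mat fma_ref(const Mat& acc_in, const Mat& X, const Mat& Y, int32_t sign) {
    assert(X.cols == Y.rows && acc_in.rows == X.rows && acc_in.cols == Y.cols);
    Mat out = acc_in;
    for (std::size_t i = 0; i < X.rows; ++i)
        for (std::size_t j = 0; j < Y.cols; ++j)
            for (std::size_t k = 0; k < X.cols; ++k)
                out.at(i, j) += sign * X.at(i, k) * Y.at(k, j);
    return out;
}

// MUL:    acc = X x Y
Mat mul_ref(const Mat& X, const Mat& Y) { return fma_ref(Mat(X.rows, Y.cols), X, Y, +1); }
// NEGMUL: acc = -(X x Y)
Mat negmul_ref(const Mat& X, const Mat& Y) { return fma_ref(Mat(X.rows, Y.cols), X, Y, -1); }
// MAC:    acc = acc_in + X x Y
Mat mac_ref(const Mat& acc, const Mat& X, const Mat& Y) { return fma_ref(acc, X, Y, +1); }
// MSC:    acc = acc_in - X x Y
Mat msc_ref(const Mat& acc, const Mat& X, const Mat& Y) { return fma_ref(acc, X, Y, -1); }
```

Expressing all four forms through one signed accumulate keeps the model small; the hardware intrinsics select the equivalent behaviour through the operation name and, where available, the masks.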
Complex multiplications require some terms to be negated in order to implement conjugation and minus j multiplication. This is done through the sub_mask. The following examples show how this mask is used when two complex numbers, X and Y, are multiplied to get an output O. For Multiply-accumulate of 16b x 16b complex integer datatypes there are two complex numbers post-added. They are indicated by the postfix 0/1:
For Multiply-accumulate of 32b x 16b complex integer datatypes there is no post-adding and only four unique terms are needed. However, all 8 bits must be specified appropriately. In the following equation, the index bits used for one term must have the same value.
Some intrinsics are used for multiplications of matrices with a given number of channels. Each MxN matrix is stored in row-major and channel-minor fashion. The following example shows the resulting layout of elements in the vector for a 4x4 matrix with two channels. The indexes for each element are given as (m,n,c).
The elem variants allow you to perform element-wise operations. The operations are performed along the channels. For example, if you perform a (1x1x32) x (1x1x32) operation, a multiplication is done between the elements of the same channel: the elements of channel zero are multiplied, the elements of channel one are multiplied, and so on. The end result again has 32 channels.
Some of the elem variants perform matrix multiplications along the channels. For those cases the multiplication (1x2xC) x (2x1xC) is performed. The end result is a (1x1xC) matrix. Despite the name, this is not a true element-wise multiplication.
Convolutional operations work similarly to element-wise multiplication. In every step the kernel is multiplied with the matrix before it is shifted to the next position. The same is done for each channel. The difference from a regular element-wise multiplication is that, after the multiplications for each channel have been completed, the resulting matrices are added together so that the final result has only one channel.
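How negating individual product terms yields conjugation can be sketched in scalar C++. The 4-bit `neg` argument below uses a hypothetical bit assignment chosen only for this illustration. It is not the hardware sub_mask encoding, and `cmul_terms_ref` is not an intrinsic.

```cpp
#include <cassert>
#include <cstdint>

struct cint32 { int32_t re, im; };   // complex input (illustrative type)
struct cacc64 { int64_t re, im; };   // complex 64-bit result (illustrative type)

// Hypothetical negation bits, for illustration only:
//   bit 0 negates Xre*Yre, bit 1 negates Xim*Yim   (real part terms)
//   bit 2 negates Xre*Yim, bit 3 negates Xim*Yre   (imaginary part terms)
cacc64 cmul_terms_ref(cint32 x, cint32 y, unsigned neg) {
    assert(neg < 16u);
    auto term = [neg](int64_t p, unsigned bit) {
        return ((neg >> bit) & 1u) ? -p : p;
    };
    cacc64 o;
    o.re = term(int64_t(x.re) * y.re, 0) + term(int64_t(x.im) * y.im, 1);
    o.im = term(int64_t(x.re) * y.im, 2) + term(int64_t(x.im) * y.re, 3);
    return o;
}
```

Under this illustrative assignment, `neg = 0x2` gives the plain product X*Y, `0x8` gives conj(X)*Y, `0x4` gives X*conj(Y), and `0xE` gives conj(X)*conj(Y).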
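One reading of "row-major and channel-minor" is that the channel index varies fastest, then the column, then the row. Under that assumption, the position of element (m,n,c) in the vector can be computed as in the sketch below; `elem_offset` is a hypothetical helper, not an intrinsic.

```cpp
#include <cassert>
#include <cstddef>

// Offset of element (m, n, c) in a vector holding a matrix with N columns
// and C channels, assuming the channel index is the fastest-varying one.
inline std::size_t elem_offset(std::size_t m, std::size_t n, std::size_t c,
                               std::size_t N, std::size_t C) {
    assert(n < N && c < C);
    return (m * N + n) * C + c;
}
```

For a 4x4 matrix with two channels, this places (0,0,0) at offset 0, (0,0,1) at offset 1, (0,1,0) at offset 2, and (1,0,0) at offset 8.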
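The two kinds of elem operation can be compared in a scalar sketch. The function names are hypothetical and the inputs are assumed to be stored channel-minor, so element (0,n,c) of A and element (n,0,c) of B both sit at offset n*C + c.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// (1x1xC) x (1x1xC): one product per channel.
std::vector<int32_t> mul_elem_ref(const std::vector<int16_t>& a,
                                  const std::vector<int16_t>& b) {
    assert(a.size() == b.size());
    std::vector<int32_t> out(a.size());
    for (std::size_t c = 0; c < a.size(); ++c)
        out[c] = int32_t(a[c]) * int32_t(b[c]);
    return out;
}

// (1x2xC) x (2x1xC): a two-term dot product per channel.
std::vector<int32_t> mul_elem_2_ref(const std::vector<int16_t>& a,
                                    const std::vector<int16_t>& b,
                                    std::size_t C) {
    assert(a.size() == 2 * C && b.size() == 2 * C);
    std::vector<int32_t> out(C);
    for (std::size_t c = 0; c < C; ++c)
        out[c] = int32_t(a[c]) * int32_t(b[c]) +          // n = 0
                 int32_t(a[C + c]) * int32_t(b[C + c]);   // n = 1
    return out;
}
```

In both cases the result keeps one value per channel; the second form simply sums two products inside each channel.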
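The channel summation can be sketched as a scalar reference model. The function name and the channel-minor storage of both inputs (position p, channel c at offset p*C + c) are assumptions for illustration, not the documented vector layout of any specific intrinsic.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Multi-channel convolution reference: each channel is convolved with its
// own kernel column and the per-channel results are summed, so the output
// has M elements and a single channel.
std::vector<int32_t> mul_conv_channels_ref(const std::vector<int8_t>& F,
                                           const std::vector<int8_t>& G,
                                           std::size_t M, std::size_t N,
                                           std::size_t C) {
    assert(F.size() == (M + N - 1) * C && G.size() == N * C);
    std::vector<int32_t> out(M, 0);
    for (std::size_t x = 0; x < M; ++x)
        for (std::size_t u = 0; u < N; ++u)
            for (std::size_t c = 0; c < C; ++c)
                out[x] += int32_t(G[u * C + c]) * int32_t(F[(x + u) * C + c]);
    return out;
}
```

With M = 2, N = 2, C = 2, F = {1, 10, 2, 20, 3, 30} and an all-ones kernel, the outputs are (1 + 2) + (10 + 20) = 33 and (2 + 3) + (20 + 30) = 55.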
When multiplying with a scalar bfloat16 it will be internally cast to float which influences the rounding behaviour with negation. The following example shows how this behaviour affects the multiplication. As the cast involves a rounding operation it matters if the negation is performed before or after the cast. In the first case, the rounding happens to the positive result before the negation. For the second and third case the rounding happens before that which will lead to a different result.
Elementwise multiplication and matrix multiplication intrinsics for the FP32 input type are emulated using the bfloat16 datapath. There are 3 options to choose from. Default option (Most accurate but slow):
Fast and accurate option:
Fastest option with loss of accuracy:
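As a general illustration of how fp32 products can be built on a bfloat16 datapath, one common technique splits each operand into three bfloat16 pieces by truncation and sums partial products. The sketch below is an assumption about the general technique, not a description of how these intrinsics are implemented. The helper names and the 9/6/3 partial-product levels are hypothetical and are not confirmed by the source to match the three options.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Truncate a float to bfloat16 precision by clearing the low 16 bits.
// The value is kept in a float for convenience.
float trunc_bf16(float x) {
    uint32_t bits;
    std::memcpy(&bits, &x, sizeof bits);
    bits &= 0xFFFF0000u;
    std::memcpy(&x, &bits, sizeof x);
    return x;
}

struct Bf16x3 { float hi, mid, lo; };

// Split x so that hi + mid + lo == x, each piece exact in bfloat16.
Bf16x3 split3(float x) {
    Bf16x3 p;
    p.hi = trunc_bf16(x);
    p.mid = trunc_bf16(x - p.hi);
    p.lo = trunc_bf16(x - p.hi - p.mid);
    return p;
}

// Emulated product using 9, 6 or 3 bfloat16 partial products
// (illustrative accuracy levels). Small terms are accumulated first.
float mul_emulated(float a, float b, int terms) {
    assert(terms == 3 || terms == 6 || terms == 9);
    const Bf16x3 x = split3(a), y = split3(b);
    float acc = 0.0f;
    if (terms == 9) acc += x.lo * y.lo + (x.lo * y.mid + x.mid * y.lo);
    if (terms >= 6) acc += x.lo * y.hi + x.hi * y.lo + x.mid * y.mid;
    acc += x.mid * y.hi + x.hi * y.mid;
    acc += x.hi * y.hi;
    return acc;
}
```

Each partial product of two bfloat16 values is exact in fp32, so the accuracy of the result depends on how many partial products are kept and on the rounding of the final additions.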
Modules | |
Emulated Multiply-accumulate of 16b x 32b datatypes | |
Matrix multiplications in which matrix A has data elements of 16 bit and matrix B has data elements of 32 bit. These operations are emulated on top of Multiply-accumulate of 16b x 16b integer datatypes and might not have optimal performance. | |
Emulated Multiply-accumulate of 32b x 16b datatypes | |
Matrix multiplications in which matrix A has data elements of 32 bit and matrix B has data elements of 16 bit. These operations are emulated on top of Multiply-accumulate of 16b x 16b integer datatypes and might not have optimal performance. | |
Emulated Multiply-accumulate of 32b x 32b datatypes | |
Matrix multiplications in which matrix A has data elements of 32 bit and matrix B has data elements of 32 bit. These operations are emulated on top of Multiply-accumulate of 32b x 16b integer datatypes and Multiply-accumulate of 16b x 16b integer datatypes and might not have optimal performance. | |
Emulated Multiply-accumulate of Complex 32b x Complex 32b datatypes | |
Matrix multiplications in which matrix A has data elements of complex 32 bit and matrix B has data elements of complex 32 bit. These operations are emulated on top of Multiply-accumulate of 32b x 16b complex integer datatypes and might not have optimal performance. | |
Emulated Multiply-accumulate of fp32 x fp32 datatypes | |
Elementwise multiplication and matrix multiplication using the bfloat16 datapath. Two options are available: with or without set_rnd(0) for truncation before using these intrinsics. Use the AIE_FP32_EMULATION_SET_RND_MODE flag to set the rounding mode to truncation. For an explanation of how these operations work, see Multiply Accumulate. | |
Multiply-accumulate of 16b x 16b complex integer datatypes | |
Matrix multiplications in which matrix A and matrix B have complex data elements of 16 bit. For an explanation of how these operations work, see Multiply Accumulate. | |
Multiply-accumulate of 16b x 16b integer datatypes | |
Matrix multiplications in which matrix A and matrix B have data elements of 16 bit. | |
Multiply-accumulate of 16b x 8b integer datatypes | |
Matrix multiplications in which matrix A has data elements of 16 bit and matrix B has data elements of 8 bit. | |
Multiply-accumulate of 32b x 16b complex integer datatypes | |
Matrix multiplications in which matrix A has complex data elements of 32 bit and matrix B has complex data elements of 16 bit. | |
Multiply-accumulate of 32b x 16b integer datatypes | |
Matrix multiplications in which matrix A has data elements of 32 bit and matrix B has data elements of 16 bit. | |
Multiply-accumulate of 8b x 4b datatypes | |
Matrix multiplications in which matrix A has data elements of 8 bit and matrix B has data elements of 4 bit. These operations are emulated on top of int8 x int8. | |
Multiply-accumulate of 8b x 8b integer datatypes | |
Matrix multiplications in which matrix A and matrix B have data elements of 8 bit. | |
Multiply-accumulate of bfloat16 datatypes | |
Matrix multiplications in which matrix A and B have bfloat16 data elements. | |
Multiply-accumulate with a sparse matrix | |
Matrix multiplications in which matrix B is a sparse matrix. | |
Negation control in complex multiplication modes | |
In order to do complex multiplications, some terms need to be negated. | |