AI Engine Intrinsics User Guide
(AIE) r2p23
Intrinsics that operate on vectors but do not perform a multiplication follow a reduced or modified lane selection scheme compared with macs/muls. Such operations are adds, subs, abs, vector compares, and vector selections/shuffles. Because these instructions share the initial part of the integer macs/muls datapath, they operate mostly on fixed-point numbers. The only exceptions are float select and shuffle, which are possible because no arithmetic is performed. Floating-point arithmetic is always done in the floating-point datapath, which is documented separately. The table below summarizes the lane selection scheme for each operation and data type, using the following abbreviations:
RA: Regular selection scheme.
RA_16: Selection scheme used for 16-bit numbers.
na: Not implemented at intrinsics level.
fpdp: Floating point datapath.
Operation | i32 | i16 | ci32 | ci16 | float | cfloat |
---|---|---|---|---|---|---|
select/shuffle | RA | RA_16. Offsets are relative to 32 bits. Start is relative to 16 bits but must be a multiple of 2. Square is relative to 16 bits. | RA. Start and offsets are relative to a full ci32 (64 bits). | RA. Real and imaginary parts are never split. | RA | RA. Start and offsets are relative to a full cfp32 (64 bits). |
add/sub | RA | RA_16. Offsets are relative to 32 bits. Start is relative to 16 bits but must be a multiple of 2. Square is relative to 16 bits. | RA. Start and offsets are relative to a full ci32 (64 bits). | RA_16 modified. Start is doubled to represent a full ci16 (32 bits). Offset follows the RA_16 scheme. The 16-bit permute is disabled. | fpdp | fpdp |
abs | RA | RA_16. Offsets are relative to 32 bits. Start is relative to 16 bits but must be a multiple of 2. Square is relative to 16 bits. | na | na | fpdp | fpdp |
cmp | RA | RA_16. Offsets are relative to 32 bits. Start is relative to 16 bits but must be a multiple of 2. Square is relative to 16 bits. | na | na | fpdp | fpdp |
These intrinsics perform vector comparisons between data from two buffers, the X and Y buffers; the other parameters and options provide flexibility in selecting data within the vectors. When a single input buffer is used, both the X and Y inputs are taken from that buffer, each with its own start/offsets/square parameters.
Adding "+1" to an index always means advancing by one lane in the input buffer, regardless of the bit width of the data type.
    for i in 0,rows:
        id[i]  = start + offset[i] %input samples
        out[i] = f( in[id[i]] ) //f can be add, abs, sel ...
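The regular scheme can be sketched as a small Python model. This is an illustrative sketch rather than a description of the hardware: the function name `select_regular` is hypothetical, offsets are passed as a plain list with one entry per output lane, and reading `%input samples` as wrap-around modulo the number of input samples is an assumption.

```python
def select_regular(inp, start, offsets, f=lambda x: x):
    """Hypothetical model of the regular (RA) lane selection scheme."""
    n = len(inp)
    # Assumption: the index wraps modulo the number of input samples.
    return [f(inp[(start + off) % n]) for off in offsets]


lanes = list(range(100, 116))  # 16 input lanes holding 100..115
picked = select_regular(lanes, start=2, offsets=[0, 1, 2, 3, 7, 6, 5, 4])
print(picked)  # [102, 103, 104, 105, 109, 108, 107, 106]

# The same selection feeding an element-wise operation such as abs.
magnitudes = select_regular([-3, 4, -5, 6], start=1, offsets=[0, 1, 2, 3], f=abs)
print(magnitudes)  # [4, 5, 6, 3]
```

Each output lane reads exactly one input lane, so the data type width never appears in the index arithmetic.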
    //in and out are always treated as 16bits vectors, in[i] and in[i+1] are 16bits apart

    // First permutation stage
    // The concepts are simple:
    //  - output_lanes = N
    //  - N/2 offsets covers N output lanes -> 2*idx
    //  - This means that each offset is used to move two adjacent values -> perm_idx + 1
    //  - The parity of the idx selects the perm_idx formula
    for (idx = 0 ; idx < N/2; idx += 1)
        if even idx:
            perm_idx = start + 2*offset[idx]
        else //odd idx
            perm_idx = start + 2*offset[idx] + 2*(offset[idx - 1] + 1)

        data[2*idx  ] = input[ perm_idx ]
        data[2*idx+1] = input[ perm_idx + 1] //This is just the adjacent one

    // Second permutation stage
    for ( idx = 0 ; idx < N; idx += 4)
        // Square is used to permute on a smaller granularity
        output[idx]   = data[ idx + square [ 0 ] ]
        output[idx+1] = data[ idx + square [ 1 ] ]
        output[idx+2] = data[ idx + square [ 2 ] ]
        output[idx+3] = data[ idx + square [ 3 ] ]
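The two permutation stages can be tried out with a short Python model. This is a sketch that mirrors the pseudocode, not the hardware implementation; the name `permute_ra16` is hypothetical, and offsets and square are passed as plain lists instead of packed words.

```python
def permute_ra16(inp, start, offsets, square):
    """Hypothetical model of the two-stage 16-bit (RA_16) permutation.

    offsets: one entry per pair of output lanes (N/2 entries).
    square:  four entries; square[k] picks the source of lane k in each group of 4.
    """
    data = []
    for idx, off in enumerate(offsets):
        perm_idx = start + 2 * off
        if idx % 2 == 1:
            # Odd entries also add the preceding even offset plus one.
            perm_idx += 2 * (offsets[idx - 1] + 1)
        data += [inp[perm_idx], inp[perm_idx + 1]]
    out = []
    for base in range(0, len(data), 4):
        out += [data[base + s] for s in square]
    return out


lanes = list(range(16))
# An odd offset of 0 selects the pair right after the preceding even pair.
identity = permute_ra16(lanes, 0, offsets=[0, 0, 2, 0], square=[0, 1, 2, 3])
swapped = permute_ra16(lanes, 0, offsets=[0, 0, 2, 0], square=[1, 0, 3, 2])
print(identity)  # [0, 1, 2, 3, 4, 5, 6, 7]
print(swapped)   # [1, 0, 3, 2, 5, 4, 7, 6]
```

The example shows the split of responsibilities: offsets move pairs of 16-bit values, while square reorders individual values inside each group of four.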
Visually, what happens is the following (example for the first two idx):
    - Assume that the even offset selects [c,d] and the odd offset selects [g,h] (as an example)

            in = | a | b [ c | d ] e | f [ g | h ] i | l | m |
        f(offset_[idx0])---^               ^
        g(offset_[idx1])-------------------|

      Here the functions f, g represent the ones described in the previous pseudocode.

    - Then, data is shaped like this

        data = | c | d | g | h | .....

      The next ones are selected by idx2, idx3 ...

    - The square parameter finalizes the permutations, assume square 0x0123

        out[0] = data[ square[0] ] = data[ 3 ] = h
        out[1] = data[ square[1] ] = data[ 2 ] = g
        out[2] = data[ square[2] ] = data[ 1 ] = d
        out[3] = data[ square[3] ] = data[ 0 ] = c
        ..

    - And hence

        out = | h | g | d | c | .....
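The picture above can be reproduced numerically with a few lines of Python. The concrete values `start = 0` and `offsets = [1, 1]` are assumptions chosen so that the even offset lands on [c,d] and the odd offset on [g,h]; they are not given in the text. The square value 0x0123 is taken from the example, with `square[0]` read as the least significant nibble.

```python
inp = list("abcdefghilm")
start = 0
offsets = [1, 1]  # assumed values that reproduce the picture
square = 0x0123

# First stage: each offset moves two adjacent values.
data = []
for idx, off in enumerate(offsets):
    perm_idx = start + 2 * off
    if idx % 2 == 1:
        perm_idx += 2 * (offsets[idx - 1] + 1)
    data += inp[perm_idx:perm_idx + 2]

# Second stage: square[k] is the k-th nibble, least significant first.
out = [data[(square >> (4 * k)) & 0xF] for k in range(4)]
print(data)  # ['c', 'd', 'g', 'h']
print(out)   # ['h', 'g', 'd', 'c']
```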
The general naming convention for the integer vector compare intrinsics is shown below:
{ge|gt|le|lt|max|maxdiff|min}{16|32}
The general naming convention for the floating-point vector compare intrinsics is shown below:
fp{ge|gt|le|lt|max|min}
When the output has more than 8 lanes (e.g. 16), there is an extra offset parameter. In addition to the usual 'offsets' parameter, an 'offsets_hi' parameter selects the data that is placed into the upper lanes (8 to 15).
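As a minimal sketch of how the two parameters combine, the helper below expands 'offsets' and 'offsets_hi' into one offset per lane for a 16-lane output. The helper name `lane_offsets` is hypothetical, and the layout (one 4-bit field per lane, least significant field first) is an assumption of this sketch for the regular scheme.

```python
def lane_offsets(offsets, offsets_hi):
    """Hypothetical helper: per-lane offsets for a 16-lane output."""
    lo = [(offsets >> (4 * lane)) & 0xF for lane in range(8)]     # lanes 0..7
    hi = [(offsets_hi >> (4 * lane)) & 0xF for lane in range(8)]  # lanes 8..15
    return lo + hi


per_lane = lane_offsets(0x76543210, 0xFEDCBA98)
print(per_lane)
```

With these two words every lane gets its own index, so the printed list is simply the lane numbers 0 to 15.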
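Here is an example of a matrix transpose. Input matrix (each element is labeled with its row and column index):

    00 01 02 03 04 05 06 07
    10 11 12 13 14 15 16 17
    20 21 22 23 24 25 26 27
    30 31 32 33 34 35 36 37
    40 41 42 43 44 45 46 47
    50 51 52 53 54 55 56 57
    60 61 62 63 64 65 66 67
    70 71 72 73 74 75 76 77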
This matrix is loaded into a v64int16 vector to compute its transpose in 2x2 tiles.
Note that the data is real 16-bit, so we use the 16-bit real data scheme described above with the select32 intrinsic (Vector Lane Selection).
Our input data is packed as 2x2 tiles in vector registers and we would also like to output in this same format. Input:
    Data               : 00 01 10 11 02 03 12 13 04 05 14 15 06 07 16 17 20 21 30 31 22 23 32 33 24 25 34 35 26 27 36 37
    Index into v64int16: 0                       8                       16                      24

    Data               : 40 41 50 51 42 43 52 53 44 45 54 55 46 47 56 57 60 61 70 71 62 63 72 73 64 65 74 75 66 67 76 77
    Index into v64int16: 32                      40                      48                      56
In this case we would use the following indexing for the matrix transpose in “2x2 tiles”:

    select32 intrinsic settings
    // -- is used to show don't cares
    // For offsets:
    // xoffset and yoffset are used for the first 16 output lanes (out[15:0]).
    // xoffset_hi and yoffset_hi are used for the last 16 output lanes (out[31:16]).
    // Each 4-bit nibble chooses 2 output lanes, since the output lanes are 16 bits wide.
    // For an even position in xoffset (bits 0:3 or nibble=0, bits 8:11 or nibble=2, etc.), the 4-bit value is multiplied by 2 (nibble*2).
    // For an odd position in xoffset (bits 4:7 or nibble=1, bits 12:15 or nibble=3, etc.), add the two adjacent nibbles, increment by 1 and multiply by 2, e.g. (nibble0+nibble1+1)*2.
    xstart=0,              // Since the 1st element at the output is located at index 0
    xsquare=0x3120,
    xoffset=0x----0800,    // bits[7:0] select 00 01 10 11 from the input buffer. xsquare flips these 4 samples to 00 10 01 11.
                           // bits[15:8] select 20 21 30 31 from the input buffer. xsquare flips these 4 samples to 20 30 21 31.
    xoffset_hi=0x----0a02, // bits[7:0] select 02 03 12 13 from the input buffer. xsquare flips these 4 samples to 02 12 03 13.
                           // bits[15:8] select 22 23 32 33 from the input buffer. xsquare flips these 4 samples to 22 32 23 33.
    ystart=32,             // Since ystart is 32, yoffset values have +32
    ysquare=0x3120,
    yoffset=0x0800----,    // bits[23:16] select 40 41 50 51 from the input buffer. ysquare flips these 4 samples to 40 50 41 51.
                           // bits[31:24] select 60 61 70 71 from the input buffer. ysquare flips these 4 samples to 60 70 61 71.
    yoffset_hi=0x0a02----, // bits[23:16] select 42 43 52 53 from the input buffer. ysquare flips these 4 samples to 42 52 43 53.
                           // bits[31:24] select 62 63 72 73 from the input buffer. ysquare flips these 4 samples to 62 72 63 73.
    select=b11111111000000001111111100000000 // 0 selects lanes computed by xstart, xoffset and xsquare; 1 selects from y
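The settings above can be checked with a small software model written in Python. This is a sketch under stated assumptions, not the intrinsic itself: `model_select32` and `permute_ra16` are hypothetical names, the argument order is chosen for readability, don't-care nibbles are set to 0, and the behavior follows the two-stage 16-bit pseudocode given earlier in this section.

```python
def nibbles(word, count=8):
    """Split a word into 4-bit fields, least significant first."""
    return [(word >> (4 * i)) & 0xF for i in range(count)]


def permute_ra16(inp, start, offsets, offsets_hi, square):
    """Model of the 16-bit scheme for 32 output lanes (16 offset nibbles)."""
    offs = nibbles(offsets) + nibbles(offsets_hi)
    sq = nibbles(square, 4)
    data = []
    for idx, off in enumerate(offs):
        perm_idx = start + 2 * off
        if idx % 2 == 1:
            perm_idx += 2 * (offs[idx - 1] + 1)
        data += [inp[perm_idx], inp[perm_idx + 1]]
    return [data[base + s] for base in range(0, 32, 4) for s in sq]


def model_select32(select, buf, xstart, xoffs, xoffs_hi, xsq,
                   ystart, yoffs, yoffs_hi, ysq):
    """Hypothetical software model; bit i of select picks y (1) or x (0)."""
    x = permute_ra16(buf, xstart, xoffs, xoffs_hi, xsq)
    y = permute_ra16(buf, ystart, yoffs, yoffs_hi, ysq)
    return [y[i] if (select >> i) & 1 else x[i] for i in range(32)]


# 8x8 input, element "rc" = row r, column c, packed as 2x2 tiles (64 lanes).
tiled = [f"{r}{c}" for tr in range(0, 8, 2) for tc in range(0, 8, 2)
         for r in (tr, tr + 1) for c in (tc, tc + 1)]

# Don't-care nibbles ("-") are set to 0 here.
out = model_select32(0b11111111000000001111111100000000, tiled,
                     0, 0x00000800, 0x00000A02, 0x3120,
                     32, 0x08000000, 0x0A020000, 0x3120)
print(" ".join(out[:16]))  # 00 10 01 11 20 30 21 31 40 50 41 51 60 70 61 71
print(" ".join(out[16:]))  # 02 12 03 13 22 32 23 33 42 52 43 53 62 72 63 73
```

In this model both the X and Y selections read from the same 64-lane buffer, and the select mask interleaves them in groups of 8 lanes.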
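32 outputs in 2x2 tiles:

    00 10 01 11 20 30 21 31 40 50 41 51 60 70 61 71
    02 12 03 13 22 32 23 33 42 52 43 53 62 72 63 73

Constituting the first 4 rows of the transposed matrix with 2x2 packing:

    00 10 20 30 40 50 60 70
    01 11 21 31 41 51 61 71
    02 12 22 32 42 52 62 72
    03 13 23 33 43 53 63 73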
If the user doesn't want 2x2 packing in the output vector as it might not conform to the input requirements of the subsequent kernel, it is possible to generate a “row-major” transpose using a 2nd select32.
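32 outputs of the 1st select32 in the first example (output in 2x2 tiles format):

    00 10 01 11 20 30 21 31 40 50 41 51 60 70 61 71
    02 12 03 13 22 32 23 33 42 52 43 53 62 72 63 73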
select32 settings to change the output vector format. The input is the 2x2 packed output vector from step 1 above.

    xstart=0,
    xsquare=0x3210,
    xoffset=0x15111410,    // bits[7:0] select 00 10 20 30 from the input buffer. xsquare keeps these 4 samples as 00 10 20 30.
                           // bits[15:8] select 40 50 60 70 from the input buffer. xsquare keeps these 4 samples as 40 50 60 70.
                           // bits[31:16] similarly select samples 01 11 21 31 41 51 61 71.
    xoffset_hi=0x1d191c18, // bits[7:0] select 02 12 22 32 from the input buffer. xsquare keeps these 4 samples as 02 12 22 32.
                           // bits[15:8] select 42 52 62 72 from the input buffer. xsquare keeps these 4 samples as 42 52 62 72.
    ystart=don't care
    select=b00000000000000000000000000000000
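The second step can be checked the same way. The sketch below assumes the two-stage 16-bit model from the pseudocode earlier in this section; `permute_ra16` is a hypothetical name, and because select is all zeros only the X selection is modeled.

```python
def nibbles(word, count=8):
    """Split a word into 4-bit fields, least significant first."""
    return [(word >> (4 * i)) & 0xF for i in range(count)]


def permute_ra16(inp, start, offsets, offsets_hi, square):
    """Model of the 16-bit scheme for 32 output lanes (16 offset nibbles)."""
    offs = nibbles(offsets) + nibbles(offsets_hi)
    sq = nibbles(square, 4)
    data = []
    for idx, off in enumerate(offs):
        perm_idx = start + 2 * off
        if idx % 2 == 1:
            perm_idx += 2 * (offs[idx - 1] + 1)
        data += [inp[perm_idx], inp[perm_idx + 1]]
    return [data[base + s] for base in range(0, 32, 4) for s in sq]


# Output of the first select32: rows 0..3 of the transpose in 2x2 tiles.
packed = [f"{c}{r}" for tr in (0, 2) for tc in range(0, 8, 2)
          for r in (tr, tr + 1) for c in (tc, tc + 1)]

row_major = permute_ra16(packed, 0, 0x15111410, 0x1D191C18, 0x3210)
for row in range(4):
    print(" ".join(row_major[8 * row: 8 * row + 8]))
```

Each printed line is one row of the transposed matrix, that is, one column of the original input.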
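32 outputs of the 2nd select32 generating the “row-major” transpose:

    00 10 20 30 40 50 60 70 01 11 21 31 41 51 61 71
    02 12 22 32 42 52 62 72 03 13 23 33 43 53 63 73

Which is the first 4 rows of the “row-major” transpose:

    00 10 20 30 40 50 60 70
    01 11 21 31 41 51 61 71
    02 12 22 32 42 52 62 72
    03 13 23 33 43 53 63 73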
The steps above show how to get the first 4 rows of the transposed matrix. It takes one more select32 to get the next 4 rows.
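One possible way to obtain the next 4 rows, in the software model only, is to keep the offsets and squares of the first step and move both start values forward by 8 lanes. The values `xstart=8` and `ystart=40` are assumptions of this sketch; they are not taken from the guide and have not been checked against hardware constraints. The function names are hypothetical.

```python
def nibbles(word, count=8):
    """Split a word into 4-bit fields, least significant first."""
    return [(word >> (4 * i)) & 0xF for i in range(count)]


def permute_ra16(inp, start, offsets, offsets_hi, square):
    """Model of the 16-bit scheme for 32 output lanes (16 offset nibbles)."""
    offs = nibbles(offsets) + nibbles(offsets_hi)
    sq = nibbles(square, 4)
    data = []
    for idx, off in enumerate(offs):
        perm_idx = start + 2 * off
        if idx % 2 == 1:
            perm_idx += 2 * (offs[idx - 1] + 1)
        data += [inp[perm_idx], inp[perm_idx + 1]]
    return [data[base + s] for base in range(0, 32, 4) for s in sq]


# 8x8 input, element "rc" = row r, column c, packed as 2x2 tiles (64 lanes).
tiled = [f"{r}{c}" for tr in range(0, 8, 2) for tc in range(0, 8, 2)
         for r in (tr, tr + 1) for c in (tc, tc + 1)]
SELECT = 0xFF00FF00  # same mask as in the first step


def transpose_half(xstart, ystart):
    """Model one select32 call with the offsets and squares of the first step."""
    x = permute_ra16(tiled, xstart, 0x00000800, 0x00000A02, 0x3120)
    y = permute_ra16(tiled, ystart, 0x08000000, 0x0A020000, 0x3120)
    return [y[i] if (SELECT >> i) & 1 else x[i] for i in range(32)]


first = transpose_half(0, 32)   # settings from the text
second = transpose_half(8, 40)  # assumed settings for the remaining rows
print(" ".join(second[:16]))  # 04 14 05 15 24 34 25 35 44 54 45 55 64 74 65 75
print(" ".join(second[16:]))  # 06 16 07 17 26 36 27 37 46 56 47 57 66 76 67 77
```

Concatenating `first` and `second` gives the full transposed matrix in 2x2 tile packing within this model.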