AI Engine Intrinsics User Guide
(v2023.2)
These are the intrinsic functions used for implementing a peak cancellation based crest factor reduction (PC-CFR) application. The functionality for this application is split between AIE and programmable logic (PL), where the PL carries out the peak detections and AIE computes the aggregate cancellation signal for the detected peaks. The cancellation signal samples computed by the AIE are subtracted in the PL from the delayed original signal, to cancel the peaks.
The AIE computes the cancellation signal samples by scaling the cancellation pulse (CP) coefficients (which are stored in the AIE memory) for different peaks and summing them up. The two input stream interfaces of the AI Engine are used to receive the following information from the PL: 1) Metadata for LUT indices to read CP coefficients + configuration information for the vectorized mul/mac operations, 2) Complex scaling factors for the detected peaks. The output stream interface of the AI Engine is employed to send the computed cancellation signal samples to the PL.
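The scale-and-sum described above can be sketched in portable C++ as a rough reference model. This is not the intrinsic itself: `CInt16`, `CAcc` and `aggregate_cancellation` are hypothetical names introduced for illustration, each peak is assumed to contribute to the same output window, and no shift, rounding or saturation behaviour of the hardware is modelled.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-ins for cint16 samples and a wide complex accumulator.
struct CInt16 { std::int16_t re, im; };
struct CAcc   { std::int64_t re, im; };

// out[k] = sum over peaks p of cp_per_peak[p][k] * scale[p]
// (complex multiply, accumulated per output sample).
inline std::vector<CAcc> aggregate_cancellation(
    const std::vector<std::vector<CInt16>>& cp_per_peak,
    const std::vector<CInt16>& scale) {
    assert(cp_per_peak.size() == scale.size());
    const std::size_t len = cp_per_peak.empty() ? 0 : cp_per_peak[0].size();
    std::vector<CAcc> out(len, CAcc{0, 0});
    for (std::size_t p = 0; p < scale.size(); ++p) {
        assert(cp_per_peak[p].size() == len);
        const std::int64_t sr = scale[p].re;
        const std::int64_t si = scale[p].im;
        for (std::size_t k = 0; k < len; ++k) {
            const std::int64_t cr = cp_per_peak[p][k].re;
            const std::int64_t ci = cp_per_peak[p][k].im;
            out[k].re += cr * sr - ci * si;  // real part of (cr + j*ci)(sr + j*si)
            out[k].im += cr * si + ci * sr;  // imaginary part
        }
    }
    return out;
}
```

In the real application the PL subtracts these samples from the delayed original signal; the model only covers the AIE side of that split.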
Typically the AIE program computing the aggregate cancellation signal for N detected peaks comprises the following steps :
Functions | |
void | split (int a, unsigned n, int &d0, unsigned &d1) |
Intrinsic used to split the 32 bit input data into two resulting variables at the n-th bit. | |
CFR Multiplication Intrinsics | |
v8cacc48 | mul8_cfr (v16cint16 xbufa, v16cint16 xbufb, int rev_xstart, int xrot, v8cint16 zbuf, unsigned int zstart) |
Complex multiply intrinsic function for cancellation signal calculations in peak-cancellation crest factor reduction algorithm. | |
v8cacc48 | mac8_cfr (v8cacc48 acc, v16cint16 xbufa, v16cint16 xbufb, int rev_xstart, int xrot, v8cint16 zbuf, unsigned int zstart) |
Complex multiply-accumulate intrinsic function for cancellation signal calculations in peak-cancellation crest factor reduction algorithm. | |
v8cacc48 mac8_cfr (v8cacc48 acc, v16cint16 xbufa, v16cint16 xbufb, int rev_xstart, int xrot, v8cint16 zbuf, unsigned int zstart)
Complex multiply-accumulate intrinsic function for cancellation signal calculations in peak-cancellation crest factor reduction algorithm.
Parameter | Description | Valid bits |
---|---|---|
acc | Running accumulation vector (8 x cint48 lanes). Only in mac variant. | All |
xbufa | First input buffer of 16 complex samples of type cint16. | All |
xbufb | Second input buffer of 16 complex samples of type cint16. | All |
rev_xstart | MSB: flag for backward input selection. 4 LSBs: starting point within the input data. | 5 LSBs |
xrot | Selects which 256-bit lanes of 8 complex samples from xbufa and xbufb to use. This must be a compile-time constant. | 2 LSBs |
zbuf | Buffer of scaling factors, one for each qualified peak. | All |
zstart | Selects which of the 8 scaling factor values is used. This must be a compile-time constant. | 3 LSBs |
The input data provided by xbufa and xbufb can be seen as a concatenation of 8 cancellation pulse (CP) coefficients of type cint16 from xbufa followed by the next 8 coefficients from xbufb as selected by xrot. The resulting 16 samples will be referred to as "CP" in this document. The CP coefficients are loaded to xbufa and xbufb from memory. zbuf contains the 8 scaling factor values, one for each qualified peak. The zstart parameter is used to select the scaling factor for each mul operation.
xrot selects the first or second set of 8 values to be used from each of the buffers xbufa and xbufb :
xrot value | selection in xbufa | selection in xbufb |
---|---|---|
0x0 | coefficients 0 to 7 | coefficients 0 to 7 |
0x1 | coefficients 8 to 15 | coefficients 0 to 7 |
0x2 | coefficients 0 to 7 | coefficients 8 to 15 |
0x3 | coefficients 8 to 15 | coefficients 8 to 15 |
Examples :
If you have previously updated xbufa with upd_w(0) (values 0 to 7 have been replaced) and xbufb with upd_w(1) (values 8 to 15), you would choose xrot=0x2 :
CP(0) | CP(1) | CP(2) | CP(3) | CP(4) | CP(5) | CP(6) | CP(7) | CP(8) | CP(9) | CP(10) | CP(11) | CP(12) | CP(13) | CP(14) | CP(15) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
xbufa(0) | xbufa(1) | xbufa(2) | xbufa(3) | xbufa(4) | xbufa(5) | xbufa(6) | xbufa(7) | xbufb(8) | xbufb(9) | xbufb(10) | xbufb(11) | xbufb(12) | xbufb(13) | xbufb(14) | xbufb(15) |
If you have xrot=0x1 :
CP(0) | CP(1) | CP(2) | CP(3) | CP(4) | CP(5) | CP(6) | CP(7) | CP(8) | CP(9) | CP(10) | CP(11) | CP(12) | CP(13) | CP(14) | CP(15) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
xbufa(8) | xbufa(9) | xbufa(10) | xbufa(11) | xbufa(12) | xbufa(13) | xbufa(14) | xbufa(15) | xbufb(0) | xbufb(1) | xbufb(2) | xbufb(3) | xbufb(4) | xbufb(5) | xbufb(6) | xbufb(7) |
It is standard practice to use only upd_w(0) and leave xrot at 0x0 unless your application can benefit from this option.
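The xrot table above can be read as two independent half selections: bit 0 picks the half of xbufa and bit 1 picks the half of xbufb. The following is a minimal sketch of that reading in portable C++, not AIE code. `CInt16`, `Buf16` and `assemble_cp` are hypothetical names used only for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for cint16.
struct CInt16 { std::int16_t re, im; };
using Buf16 = std::array<CInt16, 16>;

// Build the 16-entry CP view: CP(0..7) come from xbufa, CP(8..15) from xbufb.
// Bit 0 of xrot selects coefficients 8..15 of xbufa when set, bit 1 selects
// coefficients 8..15 of xbufb when set, matching the xrot table.
inline Buf16 assemble_cp(const Buf16& xbufa, const Buf16& xbufb, unsigned xrot) {
    assert(xrot <= 0x3);  // only the 2 LSBs are valid
    const unsigned a_off = (xrot & 0x1) ? 8u : 0u;
    const unsigned b_off = (xrot & 0x2) ? 8u : 0u;
    Buf16 cp{};
    for (unsigned i = 0; i < 8; ++i) {
        cp[i] = xbufa[a_off + i];
        cp[8 + i] = xbufb[b_off + i];
    }
    return cp;
}
```

With xrot=0x2 the model returns xbufa(0..7) followed by xbufb(8..15), and with xrot=0x1 it returns xbufa(8..15) followed by xbufb(0..7), as in the two example tables.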
zstart selects which of the 8 scaling factor values in zbuf is used for the multiply operation. It ranges from 0x0 to 0x7.
The 4 LSBs of rev_xstart select the starting point within the 16 CP values, since only 8 input CP values are used for a mul or mac operation.
Once the starting point is selected, the MSB of rev_xstart determines the direction in which the CP values are read. This flag improves memory efficiency for conjugate-symmetric CPs, since only half of the CP coefficients need to be present in memory.
Example :
If the 4 LSBs are set to 0x7, CP(7) is selected as the starting point. The MSB of rev_xstart then determines the direction in which the 8 values are taken.
With the MSB cleared (forward selection) :
CP(0) | CP(1) | CP(2) | CP(3) | CP(4) | CP(5) | CP(6) | CP(7) | CP(8) | CP(9) | CP(10) | CP(11) | CP(12) | CP(13) | CP(14) | CP(15) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | | | | | | | start | ---> | ---> | ---> | ---> | ---> | ---> | end | |
With the MSB set (backward selection) :
CP(0) | CP(1) | CP(2) | CP(3) | CP(4) | CP(5) | CP(6) | CP(7) | CP(8) | CP(9) | CP(10) | CP(11) | CP(12) | CP(13) | CP(14) | CP(15) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
end | <--- | <--- | <--- | <--- | <--- | <--- | start |
For both examples, the CP values will have been loaded into the lower half of xbufa and xbufb before they are passed to the function, and xrot can be left at 0x0.
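The rev_xstart encoding described above can be illustrated with a small decoder. This is a sketch in portable C++, not AIE code. `RevXStart`, `decode_rev_xstart` and `selected_cp_indices` are hypothetical helpers, and the sketch assumes that the flag sits in bit 4 (the MSB of the 5 valid bits) and that the 8 values are taken from consecutive CP indices. It does not model what the hardware does if the walk runs past CP(0) or CP(15), nor any conjugation applied to the samples.

```cpp
#include <array>
#include <cassert>

// Decoded view of the 5 valid bits of rev_xstart.
struct RevXStart {
    bool reverse;    // bit 4: set for backward input selection
    unsigned start;  // bits 3..0: starting CP index, 0..15
};

inline RevXStart decode_rev_xstart(int rev_xstart) {
    // Bits above the 5 LSBs are ignored, as only the 5 LSBs are valid.
    const unsigned field = static_cast<unsigned>(rev_xstart) & 0x1Fu;
    return RevXStart{(field & 0x10u) != 0u, field & 0x0Fu};
}

// CP indices visited by one operation: start, start+1, ... when forward,
// start, start-1, ... when backward. Results outside 0..15 are returned
// unchanged because the out-of-range behaviour is not modelled here.
inline std::array<int, 8> selected_cp_indices(int rev_xstart) {
    const RevXStart f = decode_rev_xstart(rev_xstart);
    std::array<int, 8> idx{};
    for (int i = 0; i < 8; ++i) {
        idx[i] = f.reverse ? static_cast<int>(f.start) - i
                           : static_cast<int>(f.start) + i;
    }
    return idx;
}
```

For the example above, 0x07 yields indices 7 up to 14 and 0x17 yields indices 7 down to 0.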
The mac variant of this intrinsic accumulates into acc, whereas the mul variant assigns the result.
Command : mul8_cfr(xbufa, xbufb, 0x04, 0x0, zbuf, 0x2)
Resulting operation :
Command : mul8_cfr(xbufa, xbufb, 0x09, 0x0, zbuf, 0x3)
Resulting operation :
v8cacc48 mul8_cfr (v16cint16 xbufa, v16cint16 xbufb, int rev_xstart, int xrot, v8cint16 zbuf, unsigned int zstart)
Complex multiply intrinsic function for cancellation signal calculations in peak-cancellation crest factor reduction algorithm.
Parameter | Description | Valid bits |
---|---|---|
acc | Running accumulation vector (8 x cint48 lanes). Only in mac variant. | All |
xbufa | First input buffer of 16 complex samples of type cint16. | All |
xbufb | Second input buffer of 16 complex samples of type cint16. | All |
rev_xstart | MSB: flag for backward input selection. 4 LSBs: starting point within the input data. | 5 LSBs |
xrot | Selects which 256-bit lanes of 8 complex samples from xbufa and xbufb to use. This must be a compile-time constant. | 2 LSBs |
zbuf | Buffer of scaling factors, one for each qualified peak. | All |
zstart | Selects which of the 8 scaling factor values is used. This must be a compile-time constant. | 3 LSBs |
The input data provided by xbufa and xbufb can be seen as a concatenation of 8 cancellation pulse (CP) coefficients of type cint16 from xbufa followed by the next 8 coefficients from xbufb as selected by xrot. The resulting 16 samples will be referred to as "CP" in this document. The CP coefficients are loaded to xbufa and xbufb from memory. zbuf contains the 8 scaling factor values, one for each qualified peak. The zstart parameter is used to select the scaling factor for each mul operation.
xrot selects the first or second set of 8 values to be used from each of the buffers xbufa and xbufb :
xrot value | selection in xbufa | selection in xbufb |
---|---|---|
0x0 | coefficients 0 to 7 | coefficients 0 to 7 |
0x1 | coefficients 8 to 15 | coefficients 0 to 7 |
0x2 | coefficients 0 to 7 | coefficients 8 to 15 |
0x3 | coefficients 8 to 15 | coefficients 8 to 15 |
Examples :
If you have previously updated xbufa with upd_w(0) (values 0 to 7 have been replaced) and xbufb with upd_w(1) (values 8 to 15), you would choose xrot=0x2 :
CP(0) | CP(1) | CP(2) | CP(3) | CP(4) | CP(5) | CP(6) | CP(7) | CP(8) | CP(9) | CP(10) | CP(11) | CP(12) | CP(13) | CP(14) | CP(15) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
xbufa(0) | xbufa(1) | xbufa(2) | xbufa(3) | xbufa(4) | xbufa(5) | xbufa(6) | xbufa(7) | xbufb(8) | xbufb(9) | xbufb(10) | xbufb(11) | xbufb(12) | xbufb(13) | xbufb(14) | xbufb(15) |
If you have xrot=0x1 :
CP(0) | CP(1) | CP(2) | CP(3) | CP(4) | CP(5) | CP(6) | CP(7) | CP(8) | CP(9) | CP(10) | CP(11) | CP(12) | CP(13) | CP(14) | CP(15) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
xbufa(8) | xbufa(9) | xbufa(10) | xbufa(11) | xbufa(12) | xbufa(13) | xbufa(14) | xbufa(15) | xbufb(0) | xbufb(1) | xbufb(2) | xbufb(3) | xbufb(4) | xbufb(5) | xbufb(6) | xbufb(7) |
It is standard practice to use only upd_w(0) and leave xrot at 0x0 unless your application can benefit from this option.
zstart selects which of the 8 scaling factor values in zbuf is used for the multiply operation. It ranges from 0x0 to 0x7.
The 4 LSBs of rev_xstart select the starting point within the 16 CP values, since only 8 input CP values are used for a mul or mac operation.
Once the starting point is selected, the MSB of rev_xstart determines the direction in which the CP values are read. This flag improves memory efficiency for conjugate-symmetric CPs, since only half of the CP coefficients need to be present in memory.
Example :
If the 4 LSBs are set to 0x7, CP(7) is selected as the starting point. The MSB of rev_xstart then determines the direction in which the 8 values are taken.
With the MSB cleared (forward selection) :
CP(0) | CP(1) | CP(2) | CP(3) | CP(4) | CP(5) | CP(6) | CP(7) | CP(8) | CP(9) | CP(10) | CP(11) | CP(12) | CP(13) | CP(14) | CP(15) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | | | | | | | start | ---> | ---> | ---> | ---> | ---> | ---> | end | |
With the MSB set (backward selection) :
CP(0) | CP(1) | CP(2) | CP(3) | CP(4) | CP(5) | CP(6) | CP(7) | CP(8) | CP(9) | CP(10) | CP(11) | CP(12) | CP(13) | CP(14) | CP(15) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
end | <--- | <--- | <--- | <--- | <--- | <--- | start |
For both examples, the CP values will have been loaded into the lower half of xbufa and xbufb before they are passed to the function, and xrot can be left at 0x0.
The mac variant of this intrinsic accumulates into acc, whereas the mul variant assigns the result.
Command : mul8_cfr(xbufa, xbufb, 0x04, 0x0, zbuf, 0x2)
Resulting operation :
Command : mul8_cfr(xbufa, xbufb, 0x09, 0x0, zbuf, 0x3)
Resulting operation :
void split (int a, unsigned n, int &d0, unsigned &d1)
Intrinsic used to split the 32 bit input data into two resulting variables at the n-th bit.
The split intrinsic separates the 32 bits of the input into index information, used to update the CP LUT pointers, and the remaining low-order bits. In DPD applications, the intrinsic prepares the magnitude values for further processing. The parameters are the following:
a | Input data as a 32bit signed integer. |
n | Number of LSBs that shall end up in d1. This must be a compile-time constant. |
d0 | Output variable that will contain bits n to 31 of the input. Intended as an index; it is a signed (sign-extended) number. |
d1 | Output variable that will contain bits 0 to n-1 of the input. |
Example :
Command : split(data, 6, out1, out2)
Assume that data = 0x44FA, which gives the following operation :
data = 0100 0100 11|11 1010 (split after the n-th LSB, which is 6 in this example)
This gives :
out1 = 0000 0001 0001 0011
out2 = 0000 0000 0011 1010
Crest Factor Reduction Application
For Peak Cancellation CFR, one of the two input streams into an AI Engine is dedicated to communicating 32-bit metadata samples. The 27 MSBs of a metadata sample are used for Cancellation Pulse (CP) LUT indexing, and the 5 LSBs provide configuration information for the subsequent mul or mac operation. See the following intrinsics for more information on how the 5 LSBs configure the mul and mac operations :
v8cacc48 mul8_cfr(v16cint16 xbufa, v16cint16 xbufb, int rev_xstart, int xrot, v8cint16 zbuf, unsigned int zstart)
v8cacc48 mac8_cfr(v8cacc48 acc, v16cint16 xbufa, v16cint16 xbufb, int rev_xstart, int xrot, v8cint16 zbuf, unsigned int zstart)
Digital Pre-Distortion Application
The split intrinsic used in DPD applications is slightly different and has an additional parameter :
void split(int mag, int frac_bits, int lut_width, int& idx, unsigned& frac)
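The bit manipulation performed by the CFR form of split (four parameters) can be modelled in portable C++ as follows. This is a sketch for checking expected values on a host, not the intrinsic itself: `split_model` is a hypothetical name, n is taken as a run-time argument here although the intrinsic requires a compile-time constant, and the right shift of a negative value is assumed to be arithmetic (guaranteed from C++20 onward).

```cpp
#include <cassert>
#include <cstdint>

// Host-side model of split(a, n, d0, d1):
//   d1 receives bits 0..n-1 of a (unsigned),
//   d0 receives bits n..31 of a, sign extended.
inline void split_model(std::int32_t a, unsigned n,
                        std::int32_t& d0, std::uint32_t& d1) {
    assert(n < 32);
    const std::uint32_t mask = (std::uint32_t{1} << n) - 1u;  // n low bits set
    d1 = static_cast<std::uint32_t>(a) & mask;
    d0 = a >> n;  // assumed arithmetic shift, keeps the sign of a
}
```

Calling `split_model(0x44FA, 6, hi, lo)` reproduces the worked example, giving hi = 0x0113 and lo = 0x3A. For a CFR metadata sample, n = 5 would place the 27-bit LUT index in d0 and the 5 configuration bits in d1.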