Overview

These are the intrinsic functions used to implement a peak-cancellation-based crest factor reduction (PC-CFR) application. The functionality of this application is split between the AIE and the programmable logic (PL): the PL performs peak detection, and the AIE computes the aggregate cancellation signal for the detected peaks. The PL then subtracts the cancellation signal samples computed by the AIE from the delayed original signal to cancel the peaks.

The AIE computes the cancellation signal samples by scaling the cancellation pulse (CP) coefficients, which are stored in AIE memory, for each peak and summing the results. The two input stream interfaces of the AI Engine receive the following information from the PL: 1) metadata containing the LUT indices used to read the CP coefficients, plus configuration information for the vectorized mul/mac operations, and 2) the complex scaling factors for the detected peaks. The output stream interface of the AI Engine sends the computed cancellation signal samples to the PL.

Typically, the AIE program that computes the aggregate cancellation signal for N detected peaks comprises the following steps:

1. Read the input streams:
   - Split the metadata from input stream port 0 into the CP LUT index (idx) and the configuration information (ci) used by the mul or mac intrinsic.
   - Get the scaling factors from input stream port 1 and write them into a scaling factor buffer.
2. Load CP_LUT(idx) and CP_LUT(idx+1) from memory into the CP buffer.
3. Pass ci as a parameter to configure the mul intrinsic, which multiplies the CP coefficients selected from the CP buffer by the scaling factor for the first peak from the scaling factor buffer.
4. For each of the remaining N-1 peaks, repeat steps 1 to 3 using the mac intrinsic instead of mul, to obtain the accumulated result for all N peaks.
5. Move the accumulated result through the SRS (shift-round-saturate) unit to the output stream.
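The steps above can be sketched as a scalar C++ reference model. This is not AIE kernel code: cint16, cacc48, Lane8, srs16 and cancel8 are hypothetical names. The sketch assumes that each CP_LUT entry holds 8 coefficients (one 256-bit half of the CP buffer), reads the 5 configuration bits of each metadata word as rev_xstart, models SRS as round-half-up followed by saturation to int16, and does not model 48-bit accumulator wraparound.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical scalar stand-ins for the AIE lane types.
struct cint16 { int16_t re; int16_t im; };
struct cacc48 { int64_t re; int64_t im; };  // 48-bit lane held in 64 bits (no wraparound)

using Lane8 = std::array<cint16, 8>;  // assumed size of one CP_LUT entry

// Assumed SRS behavior: shift right with round-half-up, then saturate to int16.
int16_t srs16(int64_t v, int shift) {
    if (shift > 0) v = (v + (int64_t{1} << (shift - 1))) >> shift;
    if (v > INT16_MAX) v = INT16_MAX;
    if (v < INT16_MIN) v = INT16_MIN;
    return static_cast<int16_t>(v);
}

// Steps 1-5 for one 8-sample output vector and N = meta.size() peaks.
std::array<cint16, 8> cancel8(const std::vector<Lane8>& cp_lut,
                              const std::vector<int32_t>& meta,
                              const std::vector<cint16>& scale, int shift) {
    assert(meta.size() == scale.size());
    std::array<cacc48, 8> acc{};  // zeroed, so peak 0 behaves like mul and later peaks like mac
    for (std::size_t p = 0; p < meta.size(); ++p) {
        // Step 1: split(meta, 5, idx, ci): idx = bits 5..31, ci = bits 0..4.
        const int32_t idx = meta[p] >> 5;
        const int ci = meta[p] & 0x1F;
        // Step 2: CP buffer = CP_LUT(idx) followed by CP_LUT(idx + 1).
        std::array<cint16, 16> cp{};
        for (int i = 0; i < 8; ++i) {
            cp[i] = cp_lut.at(static_cast<std::size_t>(idx))[i];
            cp[8 + i] = cp_lut.at(static_cast<std::size_t>(idx) + 1)[i];
        }
        // Steps 3-4: ci acts as rev_xstart (start = bits 0..3, backwards + conj = bit 4).
        const int start = ci & 0xF;
        const bool reverse = (ci & 0x10) != 0;
        const cint16 z = scale[p];
        for (int lane = 0; lane < 8; ++lane) {
            const int k = reverse ? start - lane : start + lane;
            assert(k >= 0 && k <= 15);
            const int64_t cre = cp[k].re;
            const int64_t cim = reverse ? -int64_t{cp[k].im} : int64_t{cp[k].im};
            acc[lane].re += z.re * cre - z.im * cim;
            acc[lane].im += z.re * cim + z.im * cre;
        }
    }
    // Step 5: SRS each lane back to cint16.
    std::array<cint16, 8> out{};
    for (int lane = 0; lane < 8; ++lane)
        out[lane] = cint16{srs16(acc[lane].re, shift), srs16(acc[lane].im, shift)};
    return out;
}
```

Starting from a zeroed accumulator lets one loop body cover both the mul (first peak) and mac (remaining peaks) cases; on the AIE these are two distinct intrinsics.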

Functions
void	split (int a, unsigned n, int &d0, unsigned &d1)
	Intrinsic used to split the 32 bit input data into two resulting variables at the n-th bit.

CFR Multiplication Intrinsics
v8cacc48	mul8_cfr (v16cint16 xbufa, v16cint16 xbufb, int rev_xstart, int xrot, v8cint16 zbuf, unsigned int zstart)
	Complex multiply intrinsic function for cancellation signal calculations in peak-cancellation crest factor reduction algorithm.

v8cacc48	mac8_cfr (v8cacc48 acc, v16cint16 xbufa, v16cint16 xbufb, int rev_xstart, int xrot, v8cint16 zbuf, unsigned int zstart)
	Complex multiply-accumulate intrinsic function for cancellation signal calculations in peak-cancellation crest factor reduction algorithm.

Function Documentation

v8cacc48 mac8_cfr	(	v8cacc48	acc,
		v16cint16	xbufa,
		v16cint16	xbufb,
		int	rev_xstart,
		int	xrot,
		v8cint16	zbuf,
		unsigned int	zstart
	)

Complex multiply-accumulate intrinsic function for cancellation signal calculations in peak-cancellation crest factor reduction algorithm.

Parameters

acc	Running accumulation vector (8 x cint48 lanes). Only in the mac variant. | Valid bits: All.
xbufa	First input buffer of 16 complex samples of type cint16. | Valid bits: All.
xbufb	Second input buffer of 16 complex samples of type cint16. | Valid bits: All.
rev_xstart	MSB: flag for backwards (conjugated) input selection; 4 LSBs: starting point within the input data. | Valid bits: 5b LSB.
xrot	Selects which 256-bit halves (8 complex samples each) of xbufa and xbufb to use. This must be a compile-time constant. | Valid bits: 2b LSB.
zbuf	Buffer of scaling factors, one for each qualified peak. | Valid bits: All.
zstart	Selects which of the 8 scaling factor values is used. This must be a compile-time constant. | Valid bits: 3b LSB.

Returns: Resulting accumulation vector (8 x cint48 lanes) | Valid bits: All.

Note: Parameters 'xrot' and 'zstart' must be compile time constants.

The input data provided by xbufa and xbufb can be viewed as the concatenation of 8 cancellation pulse (CP) coefficients of type cint16 from xbufa followed by the next 8 coefficients from xbufb, with the halves of each buffer selected by xrot. These 16 samples are referred to as "CP" in this document. The CP coefficients are loaded into xbufa and xbufb from memory. zbuf contains 8 scaling factor values, one for each qualified peak, and zstart selects the scaling factor used by each mul or mac operation.

DATA SELECTORS

xrot:

Selects the first or second set of 8 values to be used from each of xbufa and xbufb:

xrot value	selection in xbufa	selection in xbufb
0x0	coefficients 0 to 7	coefficients 0 to 7
0x1	coefficients 8 to 15	coefficients 0 to 7
0x2	coefficients 0 to 7	coefficients 8 to 15
0x3	coefficients 8 to 15	coefficients 8 to 15

Examples:

If you have previously updated xbufa with upd_w(0) (values 0 to 7 have been replaced) and xbufb with upd_w(1) (values 8 to 15), you would choose xrot=0x2:

CP(0)	CP(1)	CP(2)	CP(3)	CP(4)	CP(5)	CP(6)	CP(7)	CP(8)	CP(9)	CP(10)	CP(11)	CP(12)	CP(13)	CP(14)	CP(15)
xbufa(0)	xbufa(1)	xbufa(2)	xbufa(3)	xbufa(4)	xbufa(5)	xbufa(6)	xbufa(7)	xbufb(8)	xbufb(9)	xbufb(10)	xbufb(11)	xbufb(12)	xbufb(13)	xbufb(14)	xbufb(15)

With xrot=0x1:

CP(0)	CP(1)	CP(2)	CP(3)	CP(4)	CP(5)	CP(6)	CP(7)	CP(8)	CP(9)	CP(10)	CP(11)	CP(12)	CP(13)	CP(14)	CP(15)
xbufa(8)	xbufa(9)	xbufa(10)	xbufa(11)	xbufa(12)	xbufa(13)	xbufa(14)	xbufa(15)	xbufb(0)	xbufb(1)	xbufb(2)	xbufb(3)	xbufb(4)	xbufb(5)	xbufb(6)	xbufb(7)

It is standard practice to use only upd_w(0) and leave xrot at 0x0 unless your application can benefit from this option.
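The xrot table can be captured in a few lines of scalar C++. cint16 and cp_view are hypothetical stand-ins, not AIE types or intrinsics, and reading bit 0 of xrot as "upper half of xbufa" and bit 1 as "upper half of xbufb" is inferred from the table rather than stated in it.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical scalar stand-in for one cint16 lane (not the AIE type).
struct cint16 {
    int16_t re;
    int16_t im;
};

using v16 = std::array<cint16, 16>;

// Builds the 16-entry CP view: CP(0..7) come from xbufa, CP(8..15) from xbufb,
// and xrot picks which half of each buffer is used (see the xrot table).
v16 cp_view(const v16& xbufa, const v16& xbufb, int xrot) {
    assert(xrot >= 0 && xrot <= 3);
    const int a_off = (xrot & 0x1) ? 8 : 0;  // bit 0: upper half of xbufa
    const int b_off = (xrot & 0x2) ? 8 : 0;  // bit 1: upper half of xbufb
    v16 cp{};
    for (int i = 0; i < 8; ++i) {
        cp[i] = xbufa[a_off + i];
        cp[8 + i] = xbufb[b_off + i];
    }
    return cp;
}
```

With xrot=0x1 this reproduces the second example: CP(0) is xbufa(8) and CP(8) is xbufb(0).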

zstart:

Selects which of the 8 scaling factor values in zbuf is used for the multiply operation. Valid values are 0x0 to 0x7.

rev_xstart:

The 4 LSBs select the starting point within the 16 CP values, since only 8 of them are used by each mul or mac operation.

Once the starting point is selected, the MSB of rev_xstart (bit 4 of the 5 valid bits) determines the direction in which the CP values are read, and whether they are conjugated. This flag improves memory efficiency for conjugate-symmetric CPs, since only half of the CP coefficients need to be stored in memory.
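The memory-efficiency argument can be checked numerically. The sketch assumes a hypothetical conjugate-symmetric pulse, pulse(t) with pulse(-t) == conj(pulse(t)), stores only its right half pulse(0..15) as the 16-entry CP view, and emulates a backwards conjugated read (MSB of rev_xstart set) using std::complex<double> instead of cint16. pulse, stored_half and reverse_conj_window are illustrative names only.

```cpp
#include <array>
#include <cassert>
#include <complex>

using Cplx = std::complex<double>;

// Hypothetical conjugate-symmetric pulse: real part even, imaginary part odd,
// so pulse(-t) == conj(pulse(t)).
Cplx pulse(int t) { return Cplx(1.0 / (1.0 + t * t), 0.1 * t); }

// Only the right half pulse(0..15) is kept in memory, as the 16-entry CP view.
std::array<Cplx, 16> stored_half() {
    std::array<Cplx, 16> cp{};
    for (int k = 0; k < 16; ++k) cp[k] = pulse(k);
    return cp;
}

// Emulates the selection with the rev_xstart MSB set and start point s:
// lanes are conj(CP(s)), conj(CP(s-1)), ..., conj(CP(s-7)).
std::array<Cplx, 8> reverse_conj_window(const std::array<Cplx, 16>& cp, int s) {
    assert(s >= 7 && s <= 15);
    std::array<Cplx, 8> w{};
    for (int lane = 0; lane < 8; ++lane) w[lane] = std::conj(cp[s - lane]);
    return w;
}
```

With start point 8, lane i equals conj(pulse(8 - i)) = pulse(i - 8), so the 8 lanes deliver pulse(-8) to pulse(-1) in ascending time order even though the left half of the pulse was never stored.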

Example:

If the 4 LSBs are set to 0x7, CP(7) is selected as the starting point. The MSB of rev_xstart then determines how the operation proceeds:

Set to 0: CP(7) up to CP(14) are used

CP(0)	CP(1)	CP(2)	CP(3)	CP(4)	CP(5)	CP(6)	CP(7)	CP(8)	CP(9)	CP(10)	CP(11)	CP(12)	CP(13)	CP(14)	CP(15)
							start	--->	--->	--->	--->	--->	--->	end

Set to 1: the complex conjugates of CP(7) down to CP(0) are used

CP(0)	CP(1)	CP(2)	CP(3)	CP(4)	CP(5)	CP(6)	CP(7)	CP(8)	CP(9)	CP(10)	CP(11)	CP(12)	CP(13)	CP(14)	CP(15)
end	<---	<---	<---	<---	<---	<---	start
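The selection rule can be written as a small decoder of the 5 valid bits of rev_xstart. decode_rev_xstart and XSelect are hypothetical helpers, not part of the intrinsic API. The documentation does not say what happens if the window would run past CP(15) or below CP(0), so the sketch rejects those cases with an assert.

```cpp
#include <array>
#include <cassert>

// Decoded form of the 5 valid bits of rev_xstart (hypothetical helper type).
struct XSelect {
    std::array<int, 8> idx;  // CP index feeding each output lane 0..7
    bool conj;               // true when conjugated CP values are used
};

// Bits 0..3: start point; bit 4 (MSB of the 5 valid bits): read backwards and conjugate.
XSelect decode_rev_xstart(int rev_xstart) {
    const int start = rev_xstart & 0xF;
    const bool reverse = (rev_xstart & 0x10) != 0;
    XSelect sel{};
    sel.conj = reverse;
    for (int lane = 0; lane < 8; ++lane) {
        sel.idx[lane] = reverse ? start - lane : start + lane;
        assert(sel.idx[lane] >= 0 && sel.idx[lane] <= 15);  // window must stay inside CP(0..15)
    }
    return sel;
}
```

For the example above, decode_rev_xstart(0x07) yields indices 7 to 14 without conjugation, and decode_rev_xstart(0x17) yields indices 7 down to 0 with conjugation.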

FULL EXAMPLES

For both examples, the CP values are assumed to have been loaded into the lower halves of xbufa and xbufb before they are passed to the function, so xrot can be left at 0x0.

The mac variant of this intrinsic is similar, but adds the products to acc instead of overwriting the result.
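Combining the three selectors, a scalar reference model of both variants might look like the following. The names mac8_cfr_ref and mul8_cfr_ref and the struct types are hypothetical stand-ins for the AIE intrinsics and vector types; the 48-bit accumulator lanes are held in int64_t, and overflow behavior is not modeled.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

struct cint16 { int16_t re; int16_t im; };
struct cacc48 { int64_t re; int64_t im; };  // 48-bit lane held in 64 bits

using v16cint16 = std::array<cint16, 16>;
using v8cint16 = std::array<cint16, 8>;
using v8cacc48 = std::array<cacc48, 8>;

// Scalar reference model of mac8_cfr (hypothetical; not the AIE intrinsic).
v8cacc48 mac8_cfr_ref(v8cacc48 acc, const v16cint16& xbufa, const v16cint16& xbufb,
                      int rev_xstart, int xrot, const v8cint16& zbuf, unsigned zstart) {
    assert(xrot >= 0 && xrot <= 3 && zstart <= 7);
    const int a_off = (xrot & 0x1) ? 8 : 0;         // xrot bit 0: upper half of xbufa
    const int b_off = (xrot & 0x2) ? 8 : 0;         // xrot bit 1: upper half of xbufb
    const int start = rev_xstart & 0xF;             // 4 LSBs: starting point
    const bool reverse = (rev_xstart & 0x10) != 0;  // MSB: backwards + conjugate
    const cint16 z = zbuf[zstart];                  // scaling factor for this peak
    for (int lane = 0; lane < 8; ++lane) {
        const int k = reverse ? start - lane : start + lane;  // CP index for this lane
        assert(k >= 0 && k <= 15);
        const cint16 c = (k < 8) ? xbufa[a_off + k] : xbufb[b_off + k - 8];
        const int64_t cre = c.re;
        const int64_t cim = reverse ? -int64_t{c.im} : int64_t{c.im};  // conj() when reversed
        acc[lane].re += int64_t{z.re} * cre - int64_t{z.im} * cim;
        acc[lane].im += int64_t{z.re} * cim + int64_t{z.im} * cre;
    }
    return acc;
}

// mul8_cfr is the same operation starting from a zeroed accumulator.
v8cacc48 mul8_cfr_ref(const v16cint16& xbufa, const v16cint16& xbufb,
                      int rev_xstart, int xrot, const v8cint16& zbuf, unsigned zstart) {
    return mac8_cfr_ref(v8cacc48{}, xbufa, xbufb, rev_xstart, xrot, zbuf, zstart);
}
```

For example, with CP(k) = k + j loaded into the lower halves and zbuf(2) = 2, mul8_cfr_ref(xbufa, xbufb, 0x04, 0x0, zbuf, 2) gives lane 0 = 8 + 2j, matching acc(0) = zbuf(2) * CP(4) in Example 1.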

Example 1 (MSB of rev_xstart=0)

Command: mul8_cfr(xbufa, xbufb, 0x04, 0x0, zbuf, 0x2)

zstart is 0x2, so the third scaling factor value of zbuf is used.

The 4 LSBs of rev_xstart are 0x4, so the starting point within the 16 available CP values is CP(4).

Resulting operation:

acc(0) = zbuf(2) * CP(4)
acc(1) = zbuf(2) * CP(5)
acc(2) = zbuf(2) * CP(6)
acc(3) = zbuf(2) * CP(7)
acc(4) = zbuf(2) * CP(8)
acc(5) = zbuf(2) * CP(9)
acc(6) = zbuf(2) * CP(10)
acc(7) = zbuf(2) * CP(11)

Example 2 (MSB of rev_xstart=1)

Command: mul8_cfr(xbufa, xbufb, 0x19, 0x0, zbuf, 0x3)

zstart is 0x3, so the fourth scaling factor value of zbuf is used.

The 4 LSBs of rev_xstart are 0x9, so the starting point within the 16 available CP values is CP(9).

The MSB of rev_xstart (bit 4, value 0x10) is set to 1, so the CP values go from CP(9) down to CP(2), and their complex conjugates are used.

Resulting operation:

acc(0) = zbuf(3) * conj(CP(9))
acc(1) = zbuf(3) * conj(CP(8))
acc(2) = zbuf(3) * conj(CP(7))
acc(3) = zbuf(3) * conj(CP(6))
acc(4) = zbuf(3) * conj(CP(5))
acc(5) = zbuf(3) * conj(CP(4))
acc(6) = zbuf(3) * conj(CP(3))
acc(7) = zbuf(3) * conj(CP(2))
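Because the direction flag is bit 4 of the 5 valid bits, the rev_xstart value for a given start point and direction can be formed as below. make_rev_xstart is a hypothetical helper. A backwards, conjugated selection starting at CP(9), as in Example 2, corresponds to 0x19; the value 0x09 has bit 4 clear and would request a forward selection starting at CP(9).

```cpp
#include <cassert>

// Hypothetical helper: packs a start point (0..15) and the direction flag
// into the 5 valid bits of rev_xstart.
int make_rev_xstart(int start, bool reverse) {
    assert(start >= 0 && start <= 15);
    return (reverse ? 0x10 : 0x00) | start;
}
```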

v8cacc48 mul8_cfr	(	v16cint16	xbufa,
		v16cint16	xbufb,
		int	rev_xstart,
		int	xrot,
		v8cint16	zbuf,
		unsigned int	zstart
	)

Complex multiply intrinsic function for cancellation signal calculations in peak-cancellation crest factor reduction algorithm.

Parameters

xbufa	First input buffer of 16 complex samples of type cint16. | Valid bits: All.
xbufb	Second input buffer of 16 complex samples of type cint16. | Valid bits: All.
rev_xstart	MSB: flag for backwards (conjugated) input selection; 4 LSBs: starting point within the input data. | Valid bits: 5b LSB.
xrot	Selects which 256-bit halves (8 complex samples each) of xbufa and xbufb to use. This must be a compile-time constant. | Valid bits: 2b LSB.
zbuf	Buffer of scaling factors, one for each qualified peak. | Valid bits: All.
zstart	Selects which of the 8 scaling factor values is used. This must be a compile-time constant. | Valid bits: 3b LSB.

Returns: Resulting accumulation vector (8 x cint48 lanes) | Valid bits: All.

Note: Parameters 'xrot' and 'zstart' must be compile time constants.

The data selectors (xrot, zstart, rev_xstart) and the full examples are the same as described for mac8_cfr, except that mul8_cfr writes the products to the result instead of accumulating into acc.

void split	(	int	a,
		unsigned	n,
		int &	d0,
		unsigned &	d1
	)

Intrinsic used to split the 32 bit input data into two resulting variables at the n-th bit.

In CFR applications, split separates the 32 bits of a into index information used to update the CP LUT pointers and configuration information for the subsequent mul or mac operation. In DPD applications, a variant of this intrinsic prepares the magnitude values for further processing.

Parameters

a	Input data as a 32bit signed integer.
n	Number of LSBs that end up in d1. This must be a compile-time constant.
d0	Output variable that contains bits n to 31 of the input. It is intended as an index and is a signed (sign-extended) number.
d1	Output variable that contains bits 0 to n-1 of the input.

Example:

Command: split(data, 6, out0, out1)

Assume data = 0x44FA, which gives the following operation:

data = 0100 0100 11|11 1010 (split after the n-th LSB, which is the 6th in this example)

This gives:

out0 = 0000 0001 0001 0011 (0x0113)
out1 = 0000 0000 0011 1010 (0x003A)

Crest Factor Reduction Application

For peak cancellation CFR, one of the two input streams into an AI Engine is dedicated to communicating 32-bit metadata samples. The 27 MSBs of a metadata sample are used for cancellation pulse (CP) LUT indexing, and the 5 LSBs provide configuration information for the subsequent mul or mac operation, that is, split(metadata, 5, idx, ci). For how these 5 LSBs configure the mul and mac operations (as rev_xstart), see:

v8cacc48 mul8_cfr(v16cint16 xbufa, v16cint16 xbufb, int rev_xstart, int xrot, v8cint16 zbuf, unsigned int zstart)
v8cacc48 mac8_cfr(v8cacc48 acc, v16cint16 xbufa, v16cint16 xbufb, int rev_xstart, int xrot, v8cint16 zbuf, unsigned int zstart)

Digital Pre-Distortion Application

The split intrinsic used in DPD applications is slightly different and has an additional parameter:

void split(int mag, int frac_bits, int lut_width, int& idx, unsigned& frac)
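The basic split(a, n, d0, d1) behavior documented above (not the DPD variant) can be modeled in scalar C++ to check the worked example and the CFR metadata layout. split_ref is a hypothetical name, not the intrinsic, and the sketch relies on >> of a negative int32_t being an arithmetic shift, which C++20 guarantees and mainstream compilers already implemented before that.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical scalar model of split(a, n, d0, d1):
// d0 = bits n..31 of a (sign-extended), d1 = bits 0..n-1 of a.
void split_ref(int32_t a, unsigned n, int32_t& d0, uint32_t& d1) {
    assert(n >= 1 && n <= 31);
    d0 = a >> n;                                                // arithmetic shift keeps the sign
    d1 = static_cast<uint32_t>(a) & ((uint32_t{1} << n) - 1u);  // low n bits
}
```

With n = 5, d0 receives the 27-bit CP LUT index and d1 the 5 configuration bits that feed rev_xstart.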
