AI Engine Intrinsics User Guide  (AIE) v(2024.1)

Overview

These intrinsic functions are used to implement a DPD (digital pre-distortion) forward path. The AI Engine uses LUTs to store the products of basis functions and coefficients. For implementation details, please read the application-specific documentation on Digital Pre-distortion on AI Engine.

Typically the DPD forward path comprises the following steps:

Macros

#define PMX_CFG(a00, a01, a02, a03, a04, a05, a06, a07, a08, a09, a10, a11, a12, a13, a14, a15)   { a00,a01,a02,a03,a04,a05,(a06)&3,(a06)>>2,a07,a08,a09,a10,a11,(a12)&0xf,(a12)>>4,a13,a14,a15 }
 This macro is deprecated and is defined only for legacy code. DPD configuration should be done in user space. The macro was used to create a pmx_cfg struct from the values passed in.
 

Functions

class [[deprecated("This type is deprecated and it is defined only for legacy. DPD configuration should be done in user space.")]] pmx_idx
 This type is deprecated and it is defined only for legacy. DPD configuration should be done in user space. This macro was used to create a pmx_cfg struct from the values passed in.
 
v8cacc48 dpd_ipol (v32cint16 xbuf, pmx_idx loffs, pmx_idx roffs, v16int16 zbuf, unsigned int zoffs, unsigned int zoffs_hi, int shft)
 LUT interpolation and post-add intrinsic for 8 complex valued output lanes.
 
pmx_idx const set_pmx_idx (pmx_cfg const &pmx)
 Set permutation control for left and right buffer.
 
void split (int a, unsigned n, unsigned const w, int &msb, unsigned &lsb)
 Intrinsic used by DPD to split the magnitude into index and fraction for LUT interpolation.
 
void split2 (int a, unsigned n, unsigned const w, int &msb_lo, int &msb_hi, unsigned &lsb)
 Similar to doing two separate splits on the lower and upper half of input a.
 

DPD multiplication intrinsics

v8cacc48 dpd (v8cacc48 acc, int rot, v16cint16 lut, v8cint16 data, unsigned int zoffs)
 Intrinsic multiplying the LUT values with the data values and shifting the outputs through the output delay line.
 
v8cacc48 dpd (v8cacc48 acc, v4cacc48 scd, int rot, v16cint16 lut, v8cint16 data, unsigned int zoffs)
 Intrinsic multiplying the LUT values with the data values and shifting the outputs through the output delay line.
 
v8cacc48 dpd (v8cacc48 acc, int rot, v16cint16 lut, v16int16 data, unsigned int zoffs, unsigned int zoffs_hi)
 Intrinsic multiplying the LUT values with the data values and shifting the outputs through the output delay line.
 
v8cacc48 dpd (v8cacc48 acc, v4cacc48 scd, int rot, v16cint16 lut, v16int16 data, unsigned int zoffs, unsigned int zoffs_hi)
 Intrinsic multiplying the LUT values with the data values and shifting the outputs through the output delay line.
 
v8cacc48 mac4_preadd_rot (v8cacc48 acc, v4cacc48 scd, int rot, v16cint16 xbuff, int xstart, unsigned int xoffsets, int xstep, int ystart, int ystepmult, v8cint16 zbuff, unsigned int zstart, unsigned int zoffsets, int zstep)
 Multiply-accumulate with pre-adding and accumulator rotation.
 
v8cacc48 mac4_preadd_rot (v8cacc48 acc, int rot, v16cint16 xbuff, int xstart, unsigned int xoffsets, int xstep, int ystart, int ystepmult, v8cint16 zbuff, unsigned int zstart, unsigned int zoffsets, int zstep)
 Multiply-accumulate with pre-adding and accumulator rotation.
 
v8cacc48 mac4_preadd_rot (v8cacc48 acc, v4cacc48 scd, int rot, v32cint16 xbuff, int xstart, unsigned int xoffsets, int xstep, int ystart, int ystepmult, v8cint16 zbuff, unsigned int zstart, unsigned int zoffsets, int zstep)
 Multiply-accumulate with pre-adding and accumulator rotation.
 
v8cacc48 mac4_preadd_rot (v8cacc48 acc, int rot, v32cint16 xbuff, int xstart, unsigned int xoffsets, int xstep, int ystart, int ystepmult, v8cint16 zbuff, unsigned int zstart, unsigned int zoffsets, int zstep)
 Multiply-accumulate with pre-adding and accumulator rotation.
 
v8cacc48 mac4_rot (v8cacc48 acc, v4cacc48 scd, int rot, v16cint16 xbuff, int xstart, unsigned int xoffsets, int xstep, v8cint16 zbuff, unsigned int zstart, unsigned int zoffsets, int zstep)
 Multiply-accumulate with accumulator rotation.
 
v8cacc48 mac4_rot (v8cacc48 acc, v4cacc48 scd, int rot, v16cint16 xbuff, int xstart, unsigned int xoffsets, int xstep, v16int16 zbuff, unsigned int zstart, unsigned int zoffsets, int zstep)
 Multiply-accumulate with accumulator rotation.
 
v8cacc48 mac4_rot (v8cacc48 acc, int rot, v16cint16 xbuff, int xstart, unsigned int xoffsets, int xstep, v8cint16 zbuff, unsigned int zstart, unsigned int zoffsets, int zstep)
 Multiply-accumulate with accumulator rotation.
 
v8cacc48 mac4_rot (v8cacc48 acc, int rot, v16cint16 xbuff, int xstart, unsigned int xoffsets, int xstep, v16int16 zbuff, unsigned int zstart, unsigned int zoffsets, int zstep)
 Multiply-accumulate with accumulator rotation.
 
v8cacc48 mac4_rot (v8cacc48 acc, v4cacc48 scd, int rot, v32cint16 xbuff, int xstart, unsigned int xoffsets, int xstep, v8cint16 zbuff, unsigned int zstart, unsigned int zoffsets, int zstep)
 Multiply-accumulate with accumulator rotation.
 
v8cacc48 mac4_rot (v8cacc48 acc, v4cacc48 scd, int rot, v32cint16 xbuff, int xstart, unsigned int xoffsets, int xstep, v16int16 zbuff, unsigned int zstart, unsigned int zoffsets, int zstep)
 Multiply-accumulate with accumulator rotation.
 
v8cacc48 mac4_rot (v8cacc48 acc, int rot, v32cint16 xbuff, int xstart, unsigned int xoffsets, int xstep, v8cint16 zbuff, unsigned int zstart, unsigned int zoffsets, int zstep)
 Multiply-accumulate with accumulator rotation.
 
v8cacc48 mac4_rot (v8cacc48 acc, int rot, v32cint16 xbuff, int xstart, unsigned int xoffsets, int xstep, v16int16 zbuff, unsigned int zstart, unsigned int zoffsets, int zstep)
 Multiply-accumulate with accumulator rotation.
 

Macro Definition Documentation

#define PMX_CFG (   a00,
  a01,
  a02,
  a03,
  a04,
  a05,
  a06,
  a07,
  a08,
  a09,
  a10,
  a11,
  a12,
  a13,
  a14,
  a15 
)    { a00,a01,a02,a03,a04,a05,(a06)&3,(a06)>>2,a07,a08,a09,a10,a11,(a12)&0xf,(a12)>>4,a13,a14,a15 }

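This macro is deprecated and is defined only for legacy code. DPD configuration should be done in user space. The macro was used to create a pmx_cfg struct from the values passed in. Note that a06 is split into two fields, (a06)&3 and (a06)>>2, and a12 is split into (a12)&0xf and (a12)>>4, so the 16 arguments produce 18 initializer values.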
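To make the field splitting concrete, the following sketch copies the macro and expands it into a plain array. The layout of the real pmx_cfg struct is not described on this page, so the std::array target and the helper name expand_example are assumptions made purely for illustration.

```cpp
#include <array>
#include <cassert>

// Copy of the legacy macro from this page, reproduced for illustration only.
#define PMX_CFG(a00, a01, a02, a03, a04, a05, a06, a07, a08, a09, a10, a11, a12, a13, a14, a15) \
    { a00, a01, a02, a03, a04, a05, (a06) & 3, (a06) >> 2, a07, a08, a09, a10, a11,            \
      (a12) & 0xf, (a12) >> 4, a13, a14, a15 }

// Hypothetical helper: expands the macro into 18 plain integers
// instead of the (undocumented) pmx_cfg struct.
inline std::array<int, 18> expand_example() {
    std::array<int, 18> fields =
        PMX_CFG(0, 8, 16, 24, 1, 9, 17, 25, 2, 10, 18, 26, 3, 11, 19, 27);
    // The two pieces of a06 (here 17) recombine to the original value.
    assert(fields[6] + 4 * fields[7] == 17);
    return fields;
}
```

In this expansion the argument a06 = 17 lands in two neighbouring fields (1 and 4), and a12 = 3 lands in the fields 3 and 0.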

Function Documentation

union deprecated ( "This type is deprecated and it is defined only for legacy. DPD configuration should be done in user space."  )

This type is deprecated and it is defined only for legacy. DPD configuration should be done in user space. This macro was used to create a pmx_cfg struct from the values passed in.

This type is deprecated and is defined only for legacy code. DPD configuration should be done in user space. This type was used internally by the function set_pmx_idx.

This type is deprecated and is defined only for legacy code. DPD configuration should be done in user space. It is the permutation configuration which was used in setting up the permutation control.

v8cacc48 dpd ( v8cacc48  acc,
int  rot,
v16cint16  lut,
v8cint16  data,
unsigned int  zoffs 
)

Intrinsic multiplying the LUT values with the data values and shifting the outputs through the output delay line.

The dpd intrinsic wraps up the DPD. First it permutes the input
data vector to ensure that the correct data values are multiplied
with their corresponding LUT values. Then it executes the 8 complex/complex
multiplications. After this, neighbouring values are added in a
post-add step and finally the results are added onto a v8cacc48
delay line which is then shifted by rot values. The reason that
the output delay accumulator is 8 values wide is that we want to
output 4 output lanes at a time. Since the highest three entries
in the delay line are intermediate values they are not ready for
output.
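A small scalar simulation can illustrate why only the lower lanes of the delay line are ready for output. The sketch below is not the intrinsic: it assumes real-valued lanes, rot = 1, no cascade input, and a constant contribution of 1 to each upper lane per call. The names rotate_step and run_delay_line are hypothetical.

```cpp
#include <array>
#include <cassert>

// One rotate-and-accumulate step on an 8-lane delay line (real-valued model).
// Lanes 4..7 receive the new sums; zeros are shifted in from the top.
inline std::array<int, 8> rotate_step(const std::array<int, 8>& acc,
                                      const std::array<int, 4>& sum, int rot) {
    assert(rot == 1 || rot == 2 || rot == 4);
    std::array<int, 8> out{};
    for (int i = 0; i < 4; ++i)       out[i] = acc[i + rot];
    for (int i = 4; i < 8 - rot; ++i) out[i] = acc[i + rot] + sum[i - 4];
    for (int i = 8 - rot; i < 8; ++i) out[i] = sum[i - 4];  // shifted-in zeros + sum
    return out;
}

// Apply `calls` steps with rot = 1, adding 1 to every upper lane on each call.
inline std::array<int, 8> run_delay_line(int calls) {
    std::array<int, 8> acc{};
    const std::array<int, 4> ones{1, 1, 1, 1};
    for (int c = 0; c < calls; ++c) acc = rotate_step(acc, ones, 1);
    return acc;
}
```

In this model, once the line has filled, lanes 0 to 4 have received all four contributions while lanes 5, 6 and 7 hold only 3, 2 and 1 contributions, which corresponds to the three intermediate entries described above.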

The complete operation can be described by the following pseudo-code:
sum[0] = lut[0]*data[zoffs[0]] + lut[1]*data[zoffs[1]];
sum[1] = lut[2]*data[zoffs[2]] + lut[3]*data[zoffs[3]];
sum[2] = lut[4]*data[zoffs[4]] + lut[5]*data[zoffs[5]];
sum[3] = lut[6]*data[zoffs[6]] + lut[7]*data[zoffs[7]];
// no cascade input: zeros are shifted in
v4cacc48 acc_in = null_v4cacc48();
// accumulate and rotate
for (auto i = 0; i < 4; i++)
    acc[i] = acc[i+rot];
for (auto i = 4; i < (8-rot); i++)
    acc[i] = acc[i+rot] + sum[i-4];
for (auto i = (8-rot); i < 8; i++)
    acc[i] = acc_in[i+rot-8] + sum[i-4];
Parameters
acc    Output accumulator which holds the output delay line.
rot    Rotation value by which the delay line is shifted. Valid values are 1, 2 and 4. This must be a compile-time constant.
lut    Vector of 16 complex LUT values to be multiplied with the data values.
data   Vector of 8 complex data values.
zoffs  Permutation control for the input data values. Each index occupies 4 bits of this 32-bit integer.
##Example##

Assume that the eight complex input terms are ordered in lut as follows:

n|Term 1<br>m r|Term 2<br>m r|Lane<br>#|DD<br>m-n
:|:-----------:|:-----------:|:-------:|:-------:
0|0     0      |0     2      |0        |0
0|0     1      |0     3      |1        |0
1|1     1      |1     2      |2        |0
1|1     2      |1     4      |3        |0
2|2     2      |2     2      |4        |0
2|2     3      |2     5      |5        |0
3|3     3      |3     2      |6        |0
3|3     4      |3     6      |7        |0

Term 1 and Term 2 denote the two terms that were added into one value by the post-add step during interpolation. Neighbouring pairs share the same n and can therefore be added in the post-add step. The data delay (DD) column gives the values that have to be passed in the zoffs parameter. Since this overload takes no cascade input, the function call in this case looks like this:
int const zoffs = 0x00000000;
dpd_result = dpd(dpd_result,1,lut,data,zoffs);
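The pseudo-code for this operation can also be written as a scalar reference model in portable C++, which may help when checking expected results on a host machine. This is only a sketch under stated assumptions: std::complex<double> stands in for the cint16 and cacc48 lane types, entry k of zoffs is assumed to be the k-th 4-bit nibble counted from the least significant bit, and the names dpd_model and nibble are hypothetical. Passing an all-zero acc_in corresponds to the variant without a cascade input.

```cpp
#include <array>
#include <cassert>
#include <complex>
#include <cstdint>

using cplx = std::complex<double>;  // stand-in for cint16 / cacc48 lanes

// Assumed packing: entry k of an offset word is its k-th nibble from the LSB.
inline unsigned nibble(std::uint32_t word, int k) { return (word >> (4 * k)) & 0xFu; }

// Scalar model of the pseudo-code: multiply, post-add neighbouring products,
// then rotate the delay line by `rot` lanes and accumulate into lanes 4..7.
inline std::array<cplx, 8> dpd_model(const std::array<cplx, 8>& acc,
                                     const std::array<cplx, 4>& acc_in,
                                     int rot,
                                     const std::array<cplx, 16>& lut,
                                     const std::array<cplx, 8>& data,
                                     std::uint32_t zoffs) {
    assert(rot == 1 || rot == 2 || rot == 4);
    std::array<cplx, 4> sum{};
    for (int j = 0; j < 4; ++j) {
        // "& 7u" keeps the index inside the 8-entry data vector (model assumption).
        sum[j] = lut[2 * j] * data[nibble(zoffs, 2 * j) & 7u]
               + lut[2 * j + 1] * data[nibble(zoffs, 2 * j + 1) & 7u];
    }
    std::array<cplx, 8> out{};
    for (int i = 0; i < 4; ++i)       out[i] = acc[i + rot];
    for (int i = 4; i < 8 - rot; ++i) out[i] = acc[i + rot] + sum[i - 4];
    for (int i = 8 - rot; i < 8; ++i) out[i] = acc_in[i + rot - 8] + sum[i - 4];
    return out;
}
```

The model returns a new array instead of updating acc in place, which avoids any dependence on the order in which lanes are overwritten.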
v8cacc48 dpd ( v8cacc48  acc,
v4cacc48  scd,
int  rot,
v16cint16  lut,
v8cint16  data,
unsigned int  zoffs 
)

Intrinsic multiplying the LUT values with the data values and shifting the outputs through the output delay line.

The dpd intrinsic wraps up the DPD. First it permutes the input
data vector to ensure that the correct data values are multiplied
with their corresponding LUT values. Then it executes the 8 complex/complex
multiplications. After this, neighbouring values are added in a
post-add step and finally the results are added onto a v8cacc48
delay line which is then shifted by rot values. The reason that
the output delay accumulator is 8 values wide is that we want to
output 4 output lanes at a time. Since the highest three entries
in the delay line are intermediate values they are not ready for
output.

The complete operation can be described by the following pseudo-code:
sum[0] = lut[0]*data[zoffs[0]] + lut[1]*data[zoffs[1]];
sum[1] = lut[2]*data[zoffs[2]] + lut[3]*data[zoffs[3]];
sum[2] = lut[4]*data[zoffs[4]] + lut[5]*data[zoffs[5]];
sum[3] = lut[6]*data[zoffs[6]] + lut[7]*data[zoffs[7]];
// input from cascade
v4cacc48 acc_in = scd;
// accumulate and rotate
for (auto i = 0; i < 4; i++)
    acc[i] = acc[i+rot];
for (auto i = 4; i < (8-rot); i++)
    acc[i] = acc[i+rot] + sum[i-4];
for (auto i = (8-rot); i < 8; i++)
    acc[i] = acc_in[i+rot-8] + sum[i-4];
Parameters
acc    Output accumulator which holds the output delay line.
scd    Input accumulator from the SCD cascade stream. Optional.
rot    Rotation value by which the delay line is shifted. Valid values are 1, 2 and 4. This must be a compile-time constant.
lut    Vector of 16 complex LUT values to be multiplied with the data values.
data   Vector of 8 complex data values.
zoffs  Permutation control for the input data values. Each index occupies 4 bits of this 32-bit integer.
##Example##

Assume that the eight complex input terms are ordered in lut as follows:

n|Term 1<br>m r|Term 2<br>m r|Lane<br>#|DD<br>m-n
:|:-----------:|:-----------:|:-------:|:-------:
0|0     0      |0     2      |0        |0
0|0     1      |0     3      |1        |0
1|1     1      |1     2      |2        |0
1|1     2      |1     4      |3        |0
2|2     2      |2     2      |4        |0
2|2     3      |2     5      |5        |0
3|3     3      |3     2      |6        |0
3|3     4      |3     6      |7        |0

Term 1 and Term 2 denote the two terms that were added into one value by the post-add step during interpolation. Neighbouring pairs share the same n and can therefore be added in the post-add step. The data delay (DD) column gives the values that have to be passed in the zoffs parameter. The function call in this case looks like this:
int const zoffs = 0x00000000;
dpd_result = dpd(dpd_result,get_scd(),1,lut,data,zoffs);
v8cacc48 dpd ( v8cacc48  acc,
int  rot,
v16cint16  lut,
v16int16  data,
unsigned int  zoffs,
unsigned int  zoffs_hi 
)

Intrinsic multiplying the LUT values with the data values and shifting the outputs through the output delay line.

The dpd intrinsic wraps up the DPD. First it permutes the input
data vector to ensure that the correct data values are multiplied
with their corresponding LUT values. Then it executes the 16 complex/real
multiplications. After this, neighbouring values are added in a
post-add step and finally the results are added onto a v8cacc48
delay line which is then shifted by rot values. The reason that
the output delay accumulator is 8 values wide is that we want to
output 4 output lanes at a time. Since the highest three entries
in the delay line are intermediate values they are not ready for
output.

The complete operation can be described by the following pseudo-code:
sum[0] = lut[0]*data[zoffs[0]] + lut[1]*data[zoffs[1]] + lut[8]*data[zoffs_hi[0]] + lut[9]*data[zoffs_hi[1]];
sum[1] = lut[2]*data[zoffs[2]] + lut[3]*data[zoffs[3]] + lut[10]*data[zoffs_hi[2]] + lut[11]*data[zoffs_hi[3]];
sum[2] = lut[4]*data[zoffs[4]] + lut[5]*data[zoffs[5]] + lut[12]*data[zoffs_hi[4]] + lut[13]*data[zoffs_hi[5]];
sum[3] = lut[6]*data[zoffs[6]] + lut[7]*data[zoffs[7]] + lut[14]*data[zoffs_hi[6]] + lut[15]*data[zoffs_hi[7]];
// no cascade input: zeros are shifted in
v4cacc48 acc_in = null_v4cacc48();
// accumulate and rotate
for (auto i = 0; i < 4; i++)
    acc[i] = acc[i+rot];
for (auto i = 4; i < (8-rot); i++)
    acc[i] = acc[i+rot] + sum[i-4];
for (auto i = (8-rot); i < 8; i++)
    acc[i] = acc_in[i+rot-8] + sum[i-4];
Parameters
acc       Output accumulator which holds the output delay line.
rot       Rotation value by which the delay line is shifted. Valid values are 1, 2 and 4. This must be a compile-time constant.
lut       Vector of 16 complex LUT values to be multiplied with the data values.
data      Vector of 16 real data values.
zoffs     Permutation control for the input data values. Each index occupies 4 bits of this 32-bit integer. Applies to columns 0 and 1.
zoffs_hi  Permutation control for the input data values. Each index occupies 4 bits of this 32-bit integer. Applies to columns 2 and 3.
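For the complex/real variant, each of the four sums combines four products, two selected by zoffs and two by zoffs_hi. The sketch below models only that multiplication and post-add stage. It assumes that entry k of each offset word is its k-th nibble from the least significant bit, and the names dpd_real_sums and nibble are hypothetical; it is not the intrinsic itself.

```cpp
#include <array>
#include <cassert>
#include <complex>
#include <cstdint>

using cplx = std::complex<double>;  // stand-in for complex LUT lanes

// Assumed packing: entry k of an offset word is its k-th nibble from the LSB.
inline unsigned nibble(std::uint32_t word, int k) { return (word >> (4 * k)) & 0xFu; }

// Four-term sums of the complex/real variant: columns 0 and 1 are addressed
// through zoffs, columns 2 and 3 (lut[8..15]) through zoffs_hi.
inline std::array<cplx, 4> dpd_real_sums(const std::array<cplx, 16>& lut,
                                         const std::array<double, 16>& data,
                                         std::uint32_t zoffs, std::uint32_t zoffs_hi) {
    std::array<cplx, 4> sum{};
    for (int j = 0; j < 4; ++j) {
        sum[j] = lut[2 * j]     * data[nibble(zoffs, 2 * j)]
               + lut[2 * j + 1] * data[nibble(zoffs, 2 * j + 1)]
               + lut[2 * j + 8] * data[nibble(zoffs_hi, 2 * j)]
               + lut[2 * j + 9] * data[nibble(zoffs_hi, 2 * j + 1)];
    }
    return sum;
}
```

The resulting sums would then enter the same rotate-and-accumulate step as in the complex/complex variant.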
v8cacc48 dpd ( v8cacc48  acc,
v4cacc48  scd,
int  rot,
v16cint16  lut,
v16int16  data,
unsigned int  zoffs,
unsigned int  zoffs_hi 
)

Intrinsic multiplying the LUT values with the data values and shifting the outputs through the output delay line.

The dpd intrinsic wraps up the DPD. First it permutes the input
data vector to ensure that the correct data values are multiplied
with their corresponding LUT values. Then it executes the 16 complex/real
multiplications. After this, neighbouring values are added in a
post-add step and finally the results are added onto a v8cacc48
delay line which is then shifted by rot values. The reason that
the output delay accumulator is 8 values wide is that we want to
output 4 output lanes at a time. Since the highest three entries
in the delay line are intermediate values they are not ready for
output.

The complete operation can be described by the following pseudo-code:
sum[0] = lut[0]*data[zoffs[0]] + lut[1]*data[zoffs[1]] + lut[8]*data[zoffs_hi[0]] + lut[9]*data[zoffs_hi[1]];
sum[1] = lut[2]*data[zoffs[2]] + lut[3]*data[zoffs[3]] + lut[10]*data[zoffs_hi[2]] + lut[11]*data[zoffs_hi[3]];
sum[2] = lut[4]*data[zoffs[4]] + lut[5]*data[zoffs[5]] + lut[12]*data[zoffs_hi[4]] + lut[13]*data[zoffs_hi[5]];
sum[3] = lut[6]*data[zoffs[6]] + lut[7]*data[zoffs[7]] + lut[14]*data[zoffs_hi[6]] + lut[15]*data[zoffs_hi[7]];
// input from cascade
v4cacc48 acc_in = scd;
// accumulate and rotate
for (auto i = 0; i < 4; i++)
    acc[i] = acc[i+rot];
for (auto i = 4; i < (8-rot); i++)
    acc[i] = acc[i+rot] + sum[i-4];
for (auto i = (8-rot); i < 8; i++)
    acc[i] = acc_in[i+rot-8] + sum[i-4];
Parameters
acc       Output accumulator which holds the output delay line.
scd       Input accumulator from the SCD cascade stream. Optional.
rot       Rotation value by which the delay line is shifted. Valid values are 1, 2 and 4. This must be a compile-time constant.
lut       Vector of 16 complex LUT values to be multiplied with the data values.
data      Vector of 16 real data values.
zoffs     Permutation control for the input data values. Each index occupies 4 bits of this 32-bit integer. Applies to columns 0 and 1.
zoffs_hi  Permutation control for the input data values. Each index occupies 4 bits of this 32-bit integer. Applies to columns 2 and 3.
v8cacc48 dpd_ipol ( v32cint16  xbuf,
pmx_idx  loffs,
pmx_idx  roffs,
v16int16  zbuf,
unsigned int  zoffs,
unsigned int  zoffs_hi,
int  shft 
)

LUT interpolation and post-add intrinsic for 8 complex valued output lanes.

The dpd_ipol intrinsic calculates 8 complex values, each one being the sum of two interpolated complex LUT values.
In detail it performs the following operation:

~~~~~~~~
for (i=0; i<16; i++) {
    lbuf(i) = xbuf(loffs(i));
    rbuf(i) = xbuf(roffs(i));
    frac(i) = zbuf((i>=8) ? zoffs_hi(i-8) : zoffs(i));
}

for (i=0; i<8; i++) {
    out(i) = (lbuf(2*i  )<<shft) + (rbuf(2*i  )-lbuf(2*i  ))*frac(2*i  );
    out(i)+= (lbuf(2*i+1)<<shft) + (rbuf(2*i+1)-lbuf(2*i+1))*frac(2*i+1);
}
~~~~~~~~

The following parameters are required by the function:
Parameters
xbuf      Buffer of 32 complex LUT values of type cint16.
loffs     Permutation control determining which values end up in lbuf.
roffs     Permutation control determining which values end up in rbuf.
zbuf      Buffer of 16 real fractional values of type int16.
zoffs     Permutation control determining which values end up in frac (lanes 0 to 7).
zoffs_hi  Permutation control determining which values end up in frac (lanes 8 to 15).
shft      Value by which the non-multiplied part of the interpolation is shifted in order to compensate for the scaling that happens when multiplying with frac.
In the first loop a mapping from xbuf to lbuf/rbuf and from zbuf to frac takes place.

The user controls the mapping through the four parameters loffs, roffs, zoffs and zoffs_hi. The following criteria must be fulfilled to ensure correct functionality of the DPD:

1. The pre-adder (rbuf(2*i)-lbuf(2*i) in the code example) subtracts corresponding pairs in lbuf and rbuf. This means that if the value at a given index in lbuf holds the LUT(idx) value, then the value at the same index in rbuf must hold the LUT(idx+1) value.
2. The post-adder (adding up the two values in the second loop) adds neighbouring interpolated values. It must therefore be ensured that these neighbouring values are meant to be added in the further course of the application. For example, in a DPD they must have the same m value, where (m,r) is the term tuple.
3. If used within a DPD application, the next instruction further reduces the number of values from eight to four in a post-add step. Again, this takes two neighbouring values and adds them together, which has to be taken into consideration at this point as well. They have to belong to the same n value, meaning they have to correspond to the same time in the output delay line.
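The interpolation and post-add described above can be sketched as a scalar reference model in portable C++. Several points are assumptions rather than documented behaviour: the opaque pmx_idx arguments are replaced by plain arrays of 16 indices, lanes 8 to 15 are assumed to read nibbles 0 to 7 of zoffs_hi, the left shift is modelled as a multiplication by 2^shft, and the names dpd_ipol_model and nibble are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <complex>
#include <cstdint>

using cplx = std::complex<double>;  // stand-in for cint16 / cacc48 lanes

// Assumed packing: entry k of an offset word is its k-th nibble from the LSB.
inline unsigned nibble(std::uint32_t word, int k) { return (word >> (4 * k)) & 0xFu; }

// Scalar model: interpolate 16 lanes between a left and a right LUT entry,
// then post-add lanes 2k and 2k+1 into output lane k.
inline std::array<cplx, 8> dpd_ipol_model(const std::array<cplx, 32>& xbuf,
                                          const std::array<int, 16>& loffs,
                                          const std::array<int, 16>& roffs,
                                          const std::array<double, 16>& zbuf,
                                          std::uint32_t zoffs, std::uint32_t zoffs_hi,
                                          int shft) {
    assert(shft >= 0 && shft < 31);
    const double scale = static_cast<double>(1u << shft);  // models "<< shft"
    std::array<cplx, 8> out{};
    for (int i = 0; i < 16; ++i) {
        const cplx l = xbuf[loffs[i]];
        const cplx r = xbuf[roffs[i]];
        const double f = zbuf[i >= 8 ? nibble(zoffs_hi, i - 8) : nibble(zoffs, i)];
        out[i / 2] += l * scale + (r - l) * f;  // pre-add (r - l), multiply, post-add
    }
    return out;
}
```

With frac equal to zero the model returns the scaled left LUT values only, and with frac equal to 2^shft it returns the scaled right values, which is the usual behaviour of a linear interpolation.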

##DPD Example##

Let us assume that we want to call the intrinsic for a DPD with the following terms:

r\\m|3|2|1|0
:-:|:|:|:|:
6  |x|-|-|-
5  |x|x|-|-
4  |x|x|x|-
3  |x|x|x|x
2  |-|x|x|x
1  |-|-|x|x
0  |-|-|-|x

And the LUT values have been loaded to xbuf in the following pattern:

LUT|m|r|Lane<br>idx|Lane<br>idx+1
:-:|:|:|:-------:|:---------:
0  |0|0|0        |4
0  |1|1|1        |5
0  |2|2|2        |6
0  |3|3|3        |7
1  |0|1|8        |12
1  |1|2|9        |13
1  |2|3|10       |14
1  |3|4|11       |15
2  |0|2|16       |20
2  |1|3|17       |21
2  |2|4|18       |22
2  |3|5|19       |23
3  |0|3|24       |28
3  |1|4|25       |29
3  |2|5|26       |30
3  |3|6|27       |31

In order to meet the criteria mentioned above we have to map this
to the following pattern in lbuf and rbuf. The PermXX values
represent the indices into the old pattern and therefore the values
that have to be in the control parameters.

n|m|r |Lane|PermIL|PermIR|PermIF
:|:|::|:--:|:----:|:----:|:-----
0|0|0 |0   |0     |4     |0
0|0|1 |1   |8     |12    |1
0|0|2 |2   |16    |20    |2
0|0|3 |3   |24    |28    |3
1|1|1 |4   |1     |5     |0
1|1|2 |5   |9     |13    |1
1|1|3 |6   |17    |21    |2
1|1|4 |7   |25    |29    |3
2|2|2 |8   |2     |6     |0
2|2|3 |9   |10    |14    |1
2|2|4 |10  |18    |22    |2
2|2|5 |11  |26    |30    |3
3|3|3 |12  |3     |7     |0
3|3|4 |13  |11    |15    |1
3|3|5 |14  |19    |23    |2
3|3|6 |15  |27    |31    |3

Therefore one can prepare the permutation configuration using the following structure and call the intrinsic:

~~~~~~~~

pmx_cfg left  = PMX_CFG ( 0, 8,16,24, 1, 9,17,25, 2,10,18,26, 3,11,19,27);
pmx_cfg right = PMX_CFG ( 4,12,20,28, 5,13,21,29, 6,14,22,30, 7,15,23,31);

unsigned int zoffs    = 0x32103210;
unsigned int zoffs_hi = 0x32103210;

v8cacc48 ipol_result = dpd_ipol(lut,
                                set_pmx_idx(left), set_pmx_idx(right),
                                frac, zoffs, zoffs_hi,
                                mag_scale);
~~~~~~~~
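The permutation values in the tables above follow a regular pattern: the LUT with index k for memory term m sits at lane 8*k + m of xbuf (with its idx+1 partner four lanes later) and is moved to lane 4*m + k of lbuf and rbuf. The sketch below regenerates the values from that pattern. The struct Perm and the function build_perm are hypothetical helpers, and packing the fraction index as one nibble per lane, least significant nibble first, is an assumption.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical container for the values passed to PMX_CFG and dpd_ipol.
struct Perm {
    std::array<int, 16> left;   // PermIL column
    std::array<int, 16> right;  // PermIR column
    std::uint32_t zoffs;        // PermIF for lanes 0..7, one nibble per lane
    std::uint32_t zoffs_hi;     // PermIF for lanes 8..15, one nibble per lane
};

inline Perm build_perm() {
    Perm p{};
    for (int m = 0; m < 4; ++m) {
        for (int lut = 0; lut < 4; ++lut) {
            const int lane = 4 * m + lut;     // target lane in lbuf/rbuf
            p.left[lane]  = 8 * lut + m;      // "Lane idx" of the load pattern
            p.right[lane] = 8 * lut + m + 4;  // "Lane idx+1" of the load pattern
            // The pre-adder pairs LUT(idx) with LUT(idx+1), four lanes apart here.
            assert(p.right[lane] - p.left[lane] == 4);
            std::uint32_t& word = (lane < 8) ? p.zoffs : p.zoffs_hi;
            word |= static_cast<std::uint32_t>(lut) << (4 * (lane % 8));
        }
    }
    return p;
}
```

For the term selection of this example the generated arrays reproduce the arguments of the two PMX_CFG invocations and the two offset words shown above.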
v8cacc48 mac4_preadd_rot ( v8cacc48  acc,
v4cacc48  scd,
int  rot,
v16cint16  xbuff,
int  xstart,
unsigned int  xoffsets,
int  xstep,
int  ystart,
int  ystepmult,
v8cint16  zbuff,
unsigned int  zstart,
unsigned int  zoffsets,
int  zstep 
)

Multiply-accumulate with pre-adding and accumulator rotation.

// rotate accumulator and shift in values from scd
v8cacc48 acc_tmp = acc[rot:7]::scd[0:rot-1];
acc_tmp[4] += (xbuff[xstart+xoffsets[0]] + xbuff[ystart+xoffsets[0]]) * zbuff[zstart+zoffsets[0]]
            + (xbuff[xstart+xoffsets[0]+xstep] + xbuff[ystart+xoffsets[0]+xstep*ystepmult]) * zbuff[zstart+zoffsets[0]+zstep];
acc_tmp[5] += (xbuff[xstart+xoffsets[1]] + xbuff[ystart+xoffsets[1]]) * zbuff[zstart+zoffsets[1]]
            + (xbuff[xstart+xoffsets[1]+xstep] + xbuff[ystart+xoffsets[1]+xstep*ystepmult]) * zbuff[zstart+zoffsets[1]+zstep];
acc_tmp[6] += (xbuff[xstart+xoffsets[2]] + xbuff[ystart+xoffsets[2]]) * zbuff[zstart+zoffsets[2]]
            + (xbuff[xstart+xoffsets[2]+xstep] + xbuff[ystart+xoffsets[2]+xstep*ystepmult]) * zbuff[zstart+zoffsets[2]+zstep];
acc_tmp[7] += (xbuff[xstart+xoffsets[3]] + xbuff[ystart+xoffsets[3]]) * zbuff[zstart+zoffsets[3]]
            + (xbuff[xstart+xoffsets[3]+xstep] + xbuff[ystart+xoffsets[3]+xstep*ystepmult]) * zbuff[zstart+zoffsets[3]+zstep];
return acc_tmp;
Parameters
[in] acc        Previous accumulator.
[in] scd        Vector to be "shifted in". This must come from a call to get_scd().
[in] rot        Number of lanes to be rotated. Can be 1, 2 or 4. Must be a compile-time constant.
[in] xbuff      Data buffer for first multiplier input (LUT terms in the DPD context).
[in] xstart     Start index in xbuff for first pre-addition summand.
[in] xoffsets   Row-dependent offsets in xbuff for first pre-addition summand.
[in] xstep      Offset between columns for xbuff for first pre-addition summand.
[in] ystart     Start index in xbuff for second pre-addition summand.
[in] ystepmult  Column offset multiplier in xbuff for second pre-addition summand. Can be 0, 1, 2, 4, 8, -1, -2 or -4. Must be a compile-time constant.
[in] zbuff      Data buffer for second multiplier input (data values in the DPD context).
[in] zstart     Start index in zbuff. Must be a compile-time constant.
[in] zoffsets   Row-dependent offsets in zbuff.
[in] zstep      Offset between columns for zbuff.
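The indexing in the pseudo-code can be checked on a host machine with a scalar model such as the one below. It is a sketch, not the intrinsic: std::complex<double> replaces the vector lane types, entry k of xoffsets and zoffsets is assumed to be the k-th nibble from the least significant bit, indices are assumed to wrap around the buffer length, and the names mac4_preadd_rot_model, nibble and at are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <complex>
#include <cstddef>
#include <cstdint>

using cplx = std::complex<double>;  // stand-in for cint16 / cacc48 lanes

// Assumed packing: entry k of an offset word is its k-th nibble from the LSB.
inline int nibble(std::uint32_t word, int k) {
    return static_cast<int>((word >> (4 * k)) & 0xFu);
}

// Assumed wrap-around: indices are reduced modulo the buffer length.
template <std::size_t N>
inline cplx at(const std::array<cplx, N>& buf, int idx) {
    const int n = static_cast<int>(N);
    return buf[static_cast<std::size_t>(((idx % n) + n) % n)];
}

inline std::array<cplx, 8> mac4_preadd_rot_model(
    const std::array<cplx, 8>& acc, const std::array<cplx, 4>& scd, int rot,
    const std::array<cplx, 16>& xbuff, int xstart, std::uint32_t xoffsets, int xstep,
    int ystart, int ystepmult,
    const std::array<cplx, 8>& zbuff, int zstart, std::uint32_t zoffsets, int zstep) {
    assert(rot == 1 || rot == 2 || rot == 4);
    std::array<cplx, 8> out{};
    // acc[rot:7]::scd[0:rot-1]: drop the lowest `rot` lanes, append `rot` lanes of scd.
    for (int i = 0; i < 8; ++i)
        out[i] = (i + rot < 8) ? acc[i + rot] : scd[i + rot - 8];
    for (int k = 0; k < 4; ++k) {
        const int xo = nibble(xoffsets, k);
        const int zo = nibble(zoffsets, k);
        // Two columns per lane, each a pre-added pair multiplied by one z value.
        out[4 + k] += (at(xbuff, xstart + xo) + at(xbuff, ystart + xo))
                          * at(zbuff, zstart + zo)
                    + (at(xbuff, xstart + xo + xstep)
                       + at(xbuff, ystart + xo + xstep * ystepmult))
                          * at(zbuff, zstart + zo + zstep);
    }
    return out;
}
```

Passing an all-zero scd array reproduces the variant that shifts in zeros.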
v8cacc48 mac4_preadd_rot ( v8cacc48  acc,
int  rot,
v16cint16  xbuff,
int  xstart,
unsigned int  xoffsets,
int  xstep,
int  ystart,
int  ystepmult,
v8cint16  zbuff,
unsigned int  zstart,
unsigned int  zoffsets,
int  zstep 
)

Multiply-accumulate with pre-adding and accumulator rotation.

// rotate accumulator and shift in zeros
v8cacc48 acc_tmp = acc[rot:7]::zeros(rot,1);
acc_tmp[4] += (xbuff[xstart+xoffsets[0]] + xbuff[ystart+xoffsets[0]]) * zbuff[zstart+zoffsets[0]]
            + (xbuff[xstart+xoffsets[0]+xstep] + xbuff[ystart+xoffsets[0]+xstep*ystepmult]) * zbuff[zstart+zoffsets[0]+zstep];
acc_tmp[5] += (xbuff[xstart+xoffsets[1]] + xbuff[ystart+xoffsets[1]]) * zbuff[zstart+zoffsets[1]]
            + (xbuff[xstart+xoffsets[1]+xstep] + xbuff[ystart+xoffsets[1]+xstep*ystepmult]) * zbuff[zstart+zoffsets[1]+zstep];
acc_tmp[6] += (xbuff[xstart+xoffsets[2]] + xbuff[ystart+xoffsets[2]]) * zbuff[zstart+zoffsets[2]]
            + (xbuff[xstart+xoffsets[2]+xstep] + xbuff[ystart+xoffsets[2]+xstep*ystepmult]) * zbuff[zstart+zoffsets[2]+zstep];
acc_tmp[7] += (xbuff[xstart+xoffsets[3]] + xbuff[ystart+xoffsets[3]]) * zbuff[zstart+zoffsets[3]]
            + (xbuff[xstart+xoffsets[3]+xstep] + xbuff[ystart+xoffsets[3]+xstep*ystepmult]) * zbuff[zstart+zoffsets[3]+zstep];
return acc_tmp;
Parameters
[in] acc        Previous accumulator.
[in] rot        Number of lanes to be rotated. Can be 1, 2 or 4. Must be a compile-time constant.
[in] xbuff      Data buffer for first multiplier input (LUT terms in the DPD context).
[in] xstart     Start index in xbuff for first pre-addition summand.
[in] xoffsets   Row-dependent offsets in xbuff for first pre-addition summand.
[in] xstep      Offset between columns for xbuff for first pre-addition summand.
[in] ystart     Start index in xbuff for second pre-addition summand.
[in] ystepmult  Column offset multiplier in xbuff for second pre-addition summand. Can be 0, 1, 2, 4, 8, -1, -2 or -4. Must be a compile-time constant.
[in] zbuff      Data buffer for second multiplier input (data values in the DPD context).
[in] zstart     Start index in zbuff. Must be a compile-time constant.
[in] zoffsets   Row-dependent offsets in zbuff.
[in] zstep      Offset between columns for zbuff.
v8cacc48 mac4_preadd_rot ( v8cacc48  acc,
v4cacc48  scd,
int  rot,
v32cint16  xbuff,
int  xstart,
unsigned int  xoffsets,
int  xstep,
int  ystart,
int  ystepmult,
v8cint16  zbuff,
unsigned int  zstart,
unsigned int  zoffsets,
int  zstep 
)

Multiply-accumulate with pre-adding and accumulator rotation.

// rotate accumulator and shift in values from scd
v8cacc48 acc_tmp = acc[rot:7]::scd[0:rot-1];
acc_tmp[4] += (xbuff[xstart+xoffsets[0]] + xbuff[ystart+xoffsets[0]]) * zbuff[zstart+zoffsets[0]]
            + (xbuff[xstart+xoffsets[0]+xstep] + xbuff[ystart+xoffsets[0]+xstep*ystepmult]) * zbuff[zstart+zoffsets[0]+zstep];
acc_tmp[5] += (xbuff[xstart+xoffsets[1]] + xbuff[ystart+xoffsets[1]]) * zbuff[zstart+zoffsets[1]]
            + (xbuff[xstart+xoffsets[1]+xstep] + xbuff[ystart+xoffsets[1]+xstep*ystepmult]) * zbuff[zstart+zoffsets[1]+zstep];
acc_tmp[6] += (xbuff[xstart+xoffsets[2]] + xbuff[ystart+xoffsets[2]]) * zbuff[zstart+zoffsets[2]]
            + (xbuff[xstart+xoffsets[2]+xstep] + xbuff[ystart+xoffsets[2]+xstep*ystepmult]) * zbuff[zstart+zoffsets[2]+zstep];
acc_tmp[7] += (xbuff[xstart+xoffsets[3]] + xbuff[ystart+xoffsets[3]]) * zbuff[zstart+zoffsets[3]]
            + (xbuff[xstart+xoffsets[3]+xstep] + xbuff[ystart+xoffsets[3]+xstep*ystepmult]) * zbuff[zstart+zoffsets[3]+zstep];
return acc_tmp;
Parameters
[in] acc        Previous accumulator.
[in] scd        Vector to be "shifted in". This must come from a call to get_scd().
[in] rot        Number of lanes to be rotated. Can be 1, 2 or 4. Must be a compile-time constant.
[in] xbuff      Data buffer for first multiplier input (LUT terms in the DPD context).
[in] xstart     Start index in xbuff for first pre-addition summand.
[in] xoffsets   Row-dependent offsets in xbuff for first pre-addition summand.
[in] xstep      Offset between columns for xbuff for first pre-addition summand.
[in] ystart     Start index in xbuff for second pre-addition summand.
[in] ystepmult  Column offset multiplier in xbuff for second pre-addition summand. Can be 0, 1, 2, 4, 8, -1, -2 or -4. Must be a compile-time constant.
[in] zbuff      Data buffer for second multiplier input (data values in the DPD context).
[in] zstart     Start index in zbuff. Must be a compile-time constant.
[in] zoffsets   Row-dependent offsets in zbuff.
[in] zstep      Offset between columns for zbuff.
v8cacc48 mac4_preadd_rot ( v8cacc48  acc,
int  rot,
v32cint16  xbuff,
int  xstart,
unsigned int  xoffsets,
int  xstep,
int  ystart,
int  ystepmult,
v8cint16  zbuff,
unsigned int  zstart,
unsigned int  zoffsets,
int  zstep 
)

Multiply-accumulate with pre-adding and accumulator rotation.

// rotate accumulator and shift in zeros
v8cacc48 acc_tmp = acc[rot:7]::zeros(rot,1);
acc_tmp[4] += (xbuff[xstart+xoffsets[0]] + xbuff[ystart+xoffsets[0]]) * zbuff[zstart+zoffsets[0]]
            + (xbuff[xstart+xoffsets[0]+xstep] + xbuff[ystart+xoffsets[0]+xstep*ystepmult]) * zbuff[zstart+zoffsets[0]+zstep];
acc_tmp[5] += (xbuff[xstart+xoffsets[1]] + xbuff[ystart+xoffsets[1]]) * zbuff[zstart+zoffsets[1]]
            + (xbuff[xstart+xoffsets[1]+xstep] + xbuff[ystart+xoffsets[1]+xstep*ystepmult]) * zbuff[zstart+zoffsets[1]+zstep];
acc_tmp[6] += (xbuff[xstart+xoffsets[2]] + xbuff[ystart+xoffsets[2]]) * zbuff[zstart+zoffsets[2]]
            + (xbuff[xstart+xoffsets[2]+xstep] + xbuff[ystart+xoffsets[2]+xstep*ystepmult]) * zbuff[zstart+zoffsets[2]+zstep];
acc_tmp[7] += (xbuff[xstart+xoffsets[3]] + xbuff[ystart+xoffsets[3]]) * zbuff[zstart+zoffsets[3]]
            + (xbuff[xstart+xoffsets[3]+xstep] + xbuff[ystart+xoffsets[3]+xstep*ystepmult]) * zbuff[zstart+zoffsets[3]+zstep];
return acc_tmp;
Parameters
[in] acc        Previous accumulator.
[in] rot        Number of lanes to be rotated. Can be 1, 2 or 4. Must be a compile-time constant.
[in] xbuff      Data buffer for first multiplier input (LUT terms in the DPD context).
[in] xstart     Start index in xbuff for first pre-addition summand.
[in] xoffsets   Row-dependent offsets in xbuff for first pre-addition summand.
[in] xstep      Offset between columns for xbuff for first pre-addition summand.
[in] ystart     Start index in xbuff for second pre-addition summand.
[in] ystepmult  Column offset multiplier in xbuff for second pre-addition summand. Can be 0, 1, 2, 4, 8, -1, -2 or -4. Must be a compile-time constant.
[in] zbuff      Data buffer for second multiplier input (data values in the DPD context).
[in] zstart     Start index in zbuff. Must be a compile-time constant.
[in] zoffsets   Row-dependent offsets in zbuff.
[in] zstep      Offset between columns for zbuff.
v8cacc48 mac4_rot ( v8cacc48  acc,
v4cacc48  scd,
int  rot,
v16cint16  xbuff,
int  xstart,
unsigned int  xoffsets,
int  xstep,
v8cint16  zbuff,
unsigned int  zstart,
unsigned int  zoffsets,
int  zstep 
)

Multiply-accumulate with accumulator rotation.

// rotate accumulator and shift in values from scd
v8cacc48 acc_tmp = acc[rot:7]::scd[0:rot-1];
acc_tmp[4] += xbuff[xstart+xoffsets[0]] * zbuff[zstart+zoffsets[0]] + xbuff[xstart+xoffsets[0]+xstep] * zbuff[zstart+zoffsets[0]+zstep];
acc_tmp[5] += xbuff[xstart+xoffsets[1]] * zbuff[zstart+zoffsets[1]] + xbuff[xstart+xoffsets[1]+xstep] * zbuff[zstart+zoffsets[1]+zstep];
acc_tmp[6] += xbuff[xstart+xoffsets[2]] * zbuff[zstart+zoffsets[2]] + xbuff[xstart+xoffsets[2]+xstep] * zbuff[zstart+zoffsets[2]+zstep];
acc_tmp[7] += xbuff[xstart+xoffsets[3]] * zbuff[zstart+zoffsets[3]] + xbuff[xstart+xoffsets[3]+xstep] * zbuff[zstart+zoffsets[3]+zstep];
return acc_tmp;
Parameters
[in] acc       Previous accumulator.
[in] scd       Vector to be "shifted in". This must come from a call to get_scd().
[in] rot       Number of lanes to be rotated. Can be 1, 2 or 4. Must be a compile-time constant.
[in] xbuff     Data buffer for first multiplier input (LUT terms in the DPD context).
[in] xstart    Start index in xbuff.
[in] xoffsets  Row-dependent offsets in xbuff.
[in] xstep     Offset between columns for xbuff.
[in] zbuff     Data buffer for second multiplier input (data values in the DPD context).
[in] zstart    Start index in zbuff. Must be a compile-time constant.
[in] zoffsets  Row-dependent offsets in zbuff.
[in] zstep     Offset between columns for zbuff.
v8cacc48 mac4_rot ( v8cacc48  acc,
v4cacc48  scd,
int  rot,
v16cint16  xbuff,
int  xstart,
unsigned int  xoffsets,
int  xstep,
v16int16  zbuff,
unsigned int  zstart,
unsigned int  zoffsets,
int  zstep 
)

Multiply-accumulate with accumulator rotation.

// rotate accumulator and shift in values from scd
v8cacc48 acc_tmp = acc[rot:7]::scd[0:rot-1];
acc_tmp[4] += xbuff[xstart+xoffsets[0]] * zbuff[zstart+zoffsets[0]] + xbuff[xstart+xoffsets[0]+xstep] * zbuff[zstart+zoffsets[0]+zstep] + xbuff[xstart+xoffsets[0]+2*xstep] * zbuff[zstart+zoffsets[0]+2*zstep] + xbuff[xstart+xoffsets[0]+3*xstep] * zbuff[zstart+zoffsets[0]+3*zstep];
acc_tmp[5] += xbuff[xstart+xoffsets[1]] * zbuff[zstart+zoffsets[1]] + xbuff[xstart+xoffsets[1]+xstep] * zbuff[zstart+zoffsets[1]+zstep] + xbuff[xstart+xoffsets[1]+2*xstep] * zbuff[zstart+zoffsets[1]+2*zstep] + xbuff[xstart+xoffsets[1]+3*xstep] * zbuff[zstart+zoffsets[1]+3*zstep];
acc_tmp[6] += xbuff[xstart+xoffsets[2]] * zbuff[zstart+zoffsets[2]] + xbuff[xstart+xoffsets[2]+xstep] * zbuff[zstart+zoffsets[2]+zstep] + xbuff[xstart+xoffsets[2]+2*xstep] * zbuff[zstart+zoffsets[2]+2*zstep] + xbuff[xstart+xoffsets[2]+3*xstep] * zbuff[zstart+zoffsets[2]+3*zstep];
acc_tmp[7] += xbuff[xstart+xoffsets[3]] * zbuff[zstart+zoffsets[3]] + xbuff[xstart+xoffsets[3]+xstep] * zbuff[zstart+zoffsets[3]+zstep] + xbuff[xstart+xoffsets[3]+2*xstep] * zbuff[zstart+zoffsets[3]+2*zstep] + xbuff[xstart+xoffsets[3]+3*xstep] * zbuff[zstart+zoffsets[3]+3*zstep];
return acc_tmp;
Parameters
[in] acc       Previous accumulator
[in] scd       Vector to be "shifted in". This must come from a call to get_scd()
[in] rot       Number of lanes to be rotated. Can be 1, 2 or 4. Must be a compile time constant
[in] xbuff     Data buffer for first multiplier input (LUT terms in the DPD context)
[in] xstart    Start index in xbuff
[in] xoffsets  Row-dependent offsets in xbuff
[in] xstep     Offset between columns for xbuff
[in] zbuff     Data buffer for second multiplier input (data values in the DPD context)
[in] zstart    Start index in zbuff. Must be a compile time constant
[in] zoffsets  Row-dependent offsets in zbuff
[in] zstep     Offset between columns for zbuff
v8cacc48 mac4_rot ( v8cacc48  acc,
int  rot,
v16cint16  xbuff,
int  xstart,
unsigned int  xoffsets,
int  xstep,
v8cint16  zbuff,
unsigned int  zstart,
unsigned int  zoffsets,
int  zstep 
)

Multiply-accumulate with accumulator rotation.

// rotate accumulator and shift in zeros
v8cacc48 acc_tmp = acc[rot:7]::zeros(rot,1);
acc_tmp[4] += xbuff[xstart+xoffsets[0]] * zbuff[zstart+zoffsets[0]] + xbuff[xstart+xoffsets[0]+xstep] * zbuff[zstart+zoffsets[0]+zstep];
acc_tmp[5] += xbuff[xstart+xoffsets[1]] * zbuff[zstart+zoffsets[1]] + xbuff[xstart+xoffsets[1]+xstep] * zbuff[zstart+zoffsets[1]+zstep];
acc_tmp[6] += xbuff[xstart+xoffsets[2]] * zbuff[zstart+zoffsets[2]] + xbuff[xstart+xoffsets[2]+xstep] * zbuff[zstart+zoffsets[2]+zstep];
acc_tmp[7] += xbuff[xstart+xoffsets[3]] * zbuff[zstart+zoffsets[3]] + xbuff[xstart+xoffsets[3]+xstep] * zbuff[zstart+zoffsets[3]+zstep];
return acc_tmp;
Parameters
[in] acc       Previous accumulator
[in] rot       Number of lanes to be rotated. Can be 1, 2 or 4. Must be a compile time constant
[in] xbuff     Data buffer for first multiplier input (LUT terms in the DPD context)
[in] xstart    Start index in xbuff for first pre-addition summand
[in] xoffsets  Row-dependent offsets in xbuff
[in] xstep     Offset between columns for xbuff
[in] zbuff     Data buffer for second multiplier input (data values in the DPD context)
[in] zstart    Start index in zbuff. Must be a compile time constant
[in] zoffsets  Row-dependent offsets in zbuff
[in] zstep     Offset between columns for zbuff
v8cacc48 mac4_rot ( v8cacc48  acc,
int  rot,
v16cint16  xbuff,
int  xstart,
unsigned int  xoffsets,
int  xstep,
v16int16  zbuff,
unsigned int  zstart,
unsigned int  zoffsets,
int  zstep 
)

Multiply-accumulate with accumulator rotation.

// rotate accumulator and shift in zeros
v8cacc48 acc_tmp = acc[rot:7]::zeros(rot,1);
acc_tmp[4] += xbuff[xstart+xoffsets[0]] * zbuff[zstart+zoffsets[0]] + xbuff[xstart+xoffsets[0]+xstep] * zbuff[zstart+zoffsets[0]+zstep] + xbuff[xstart+xoffsets[0]+2*xstep] * zbuff[zstart+zoffsets[0]+2*zstep] + xbuff[xstart+xoffsets[0]+3*xstep] * zbuff[zstart+zoffsets[0]+3*zstep];
acc_tmp[5] += xbuff[xstart+xoffsets[1]] * zbuff[zstart+zoffsets[1]] + xbuff[xstart+xoffsets[1]+xstep] * zbuff[zstart+zoffsets[1]+zstep] + xbuff[xstart+xoffsets[1]+2*xstep] * zbuff[zstart+zoffsets[1]+2*zstep] + xbuff[xstart+xoffsets[1]+3*xstep] * zbuff[zstart+zoffsets[1]+3*zstep];
acc_tmp[6] += xbuff[xstart+xoffsets[2]] * zbuff[zstart+zoffsets[2]] + xbuff[xstart+xoffsets[2]+xstep] * zbuff[zstart+zoffsets[2]+zstep] + xbuff[xstart+xoffsets[2]+2*xstep] * zbuff[zstart+zoffsets[2]+2*zstep] + xbuff[xstart+xoffsets[2]+3*xstep] * zbuff[zstart+zoffsets[2]+3*zstep];
acc_tmp[7] += xbuff[xstart+xoffsets[3]] * zbuff[zstart+zoffsets[3]] + xbuff[xstart+xoffsets[3]+xstep] * zbuff[zstart+zoffsets[3]+zstep] + xbuff[xstart+xoffsets[3]+2*xstep] * zbuff[zstart+zoffsets[3]+2*zstep] + xbuff[xstart+xoffsets[3]+3*xstep] * zbuff[zstart+zoffsets[3]+3*zstep];
return acc_tmp;
Parameters
[in] acc       Previous accumulator
[in] rot       Number of lanes to be rotated. Can be 1, 2 or 4. Must be a compile time constant
[in] xbuff     Data buffer for first multiplier input (LUT terms in the DPD context)
[in] xstart    Start index in xbuff for first pre-addition summand
[in] xoffsets  Row-dependent offsets in xbuff
[in] xstep     Offset between columns for xbuff
[in] zbuff     Data buffer for second multiplier input (data values in the DPD context)
[in] zstart    Start index in zbuff. Must be a compile time constant
[in] zoffsets  Row-dependent offsets in zbuff
[in] zstep     Offset between columns for zbuff
v8cacc48 mac4_rot ( v8cacc48  acc,
v4cacc48  scd,
int  rot,
v32cint16  xbuff,
int  xstart,
unsigned int  xoffsets,
int  xstep,
v8cint16  zbuff,
unsigned int  zstart,
unsigned int  zoffsets,
int  zstep 
)

Multiply-accumulate with accumulator rotation.

// rotate accumulator and shift in values from scd
v8cacc48 acc_tmp = acc[rot:7]::scd[0:rot-1];
acc_tmp[4] += xbuff[xstart+xoffsets[0]] * zbuff[zstart+zoffsets[0]] + xbuff[xstart+xoffsets[0]+xstep] * zbuff[zstart+zoffsets[0]+zstep];
acc_tmp[5] += xbuff[xstart+xoffsets[1]] * zbuff[zstart+zoffsets[1]] + xbuff[xstart+xoffsets[1]+xstep] * zbuff[zstart+zoffsets[1]+zstep];
acc_tmp[6] += xbuff[xstart+xoffsets[2]] * zbuff[zstart+zoffsets[2]] + xbuff[xstart+xoffsets[2]+xstep] * zbuff[zstart+zoffsets[2]+zstep];
acc_tmp[7] += xbuff[xstart+xoffsets[3]] * zbuff[zstart+zoffsets[3]] + xbuff[xstart+xoffsets[3]+xstep] * zbuff[zstart+zoffsets[3]+zstep];
return acc_tmp;
Parameters
[in] acc       Previous accumulator
[in] scd       Vector to be "shifted in". This must come from a call to get_scd()
[in] rot       Number of lanes to be rotated. Can be 1, 2 or 4. Must be a compile time constant
[in] xbuff     Data buffer for first multiplier input (LUT terms in the DPD context)
[in] xstart    Start index in xbuff for first pre-addition summand
[in] xoffsets  Row-dependent offsets in xbuff
[in] xstep     Offset between columns for xbuff
[in] zbuff     Data buffer for second multiplier input (data values in the DPD context)
[in] zstart    Start index in zbuff. Must be a compile time constant
[in] zoffsets  Row-dependent offsets in zbuff
[in] zstep     Offset between columns for zbuff
v8cacc48 mac4_rot ( v8cacc48  acc,
v4cacc48  scd,
int  rot,
v32cint16  xbuff,
int  xstart,
unsigned int  xoffsets,
int  xstep,
v16int16  zbuff,
unsigned int  zstart,
unsigned int  zoffsets,
int  zstep 
)

Multiply-accumulate with accumulator rotation.

// rotate accumulator and shift in values from scd
v8cacc48 acc_tmp = acc[rot:7]::scd[0:rot-1];
acc_tmp[4] += xbuff[xstart+xoffsets[0]] * zbuff[zstart+zoffsets[0]] + xbuff[xstart+xoffsets[0]+xstep] * zbuff[zstart+zoffsets[0]+zstep] + xbuff[xstart+xoffsets[0]+2*xstep] * zbuff[zstart+zoffsets[0]+2*zstep] + xbuff[xstart+xoffsets[0]+3*xstep] * zbuff[zstart+zoffsets[0]+3*zstep];
acc_tmp[5] += xbuff[xstart+xoffsets[1]] * zbuff[zstart+zoffsets[1]] + xbuff[xstart+xoffsets[1]+xstep] * zbuff[zstart+zoffsets[1]+zstep] + xbuff[xstart+xoffsets[1]+2*xstep] * zbuff[zstart+zoffsets[1]+2*zstep] + xbuff[xstart+xoffsets[1]+3*xstep] * zbuff[zstart+zoffsets[1]+3*zstep];
acc_tmp[6] += xbuff[xstart+xoffsets[2]] * zbuff[zstart+zoffsets[2]] + xbuff[xstart+xoffsets[2]+xstep] * zbuff[zstart+zoffsets[2]+zstep] + xbuff[xstart+xoffsets[2]+2*xstep] * zbuff[zstart+zoffsets[2]+2*zstep] + xbuff[xstart+xoffsets[2]+3*xstep] * zbuff[zstart+zoffsets[2]+3*zstep];
acc_tmp[7] += xbuff[xstart+xoffsets[3]] * zbuff[zstart+zoffsets[3]] + xbuff[xstart+xoffsets[3]+xstep] * zbuff[zstart+zoffsets[3]+zstep] + xbuff[xstart+xoffsets[3]+2*xstep] * zbuff[zstart+zoffsets[3]+2*zstep] + xbuff[xstart+xoffsets[3]+3*xstep] * zbuff[zstart+zoffsets[3]+3*zstep];
return acc_tmp;
Parameters
[in] acc       Previous accumulator
[in] scd       Vector to be "shifted in". This must come from a call to get_scd()
[in] rot       Number of lanes to be rotated. Can be 1, 2 or 4. Must be a compile time constant
[in] xbuff     Data buffer for first multiplier input (LUT terms in the DPD context)
[in] xstart    Start index in xbuff for first pre-addition summand
[in] xoffsets  Row-dependent offsets in xbuff
[in] xstep     Offset between columns for xbuff
[in] zbuff     Data buffer for second multiplier input (data values in the DPD context)
[in] zstart    Start index in zbuff. Must be a compile time constant
[in] zoffsets  Row-dependent offsets in zbuff
[in] zstep     Offset between columns for zbuff
v8cacc48 mac4_rot ( v8cacc48  acc,
int  rot,
v32cint16  xbuff,
int  xstart,
unsigned int  xoffsets,
int  xstep,
v8cint16  zbuff,
unsigned int  zstart,
unsigned int  zoffsets,
int  zstep 
)

Multiply-accumulate with accumulator rotation.

// rotate accumulator and shift in zeros
v8cacc48 acc_tmp = acc[rot:7]::zeros(rot,1);
acc_tmp[4] += xbuff[xstart+xoffsets[0]] * zbuff[zstart+zoffsets[0]] + xbuff[xstart+xoffsets[0]+xstep] * zbuff[zstart+zoffsets[0]+zstep];
acc_tmp[5] += xbuff[xstart+xoffsets[1]] * zbuff[zstart+zoffsets[1]] + xbuff[xstart+xoffsets[1]+xstep] * zbuff[zstart+zoffsets[1]+zstep];
acc_tmp[6] += xbuff[xstart+xoffsets[2]] * zbuff[zstart+zoffsets[2]] + xbuff[xstart+xoffsets[2]+xstep] * zbuff[zstart+zoffsets[2]+zstep];
acc_tmp[7] += xbuff[xstart+xoffsets[3]] * zbuff[zstart+zoffsets[3]] + xbuff[xstart+xoffsets[3]+xstep] * zbuff[zstart+zoffsets[3]+zstep];
return acc_tmp;
Parameters
[in] acc       Previous accumulator
[in] rot       Number of lanes to be rotated. Can be 1, 2 or 4. Must be a compile time constant
[in] xbuff     Data buffer for first multiplier input (LUT terms in the DPD context)
[in] xstart    Start index in xbuff for first pre-addition summand
[in] xoffsets  Row-dependent offsets in xbuff
[in] xstep     Offset between columns for xbuff
[in] zbuff     Data buffer for second multiplier input (data values in the DPD context)
[in] zstart    Start index in zbuff. Must be a compile time constant
[in] zoffsets  Row-dependent offsets in zbuff
[in] zstep     Offset between columns for zbuff
v8cacc48 mac4_rot ( v8cacc48  acc,
int  rot,
v32cint16  xbuff,
int  xstart,
unsigned int  xoffsets,
int  xstep,
v16int16  zbuff,
unsigned int  zstart,
unsigned int  zoffsets,
int  zstep 
)

Multiply-accumulate with accumulator rotation.

// rotate accumulator and shift in zeros
v8cacc48 acc_tmp = acc[rot:7]::zeros(rot,1);
acc_tmp[4] += xbuff[xstart+xoffsets[0]] * zbuff[zstart+zoffsets[0]] + xbuff[xstart+xoffsets[0]+xstep] * zbuff[zstart+zoffsets[0]+zstep] + xbuff[xstart+xoffsets[0]+2*xstep] * zbuff[zstart+zoffsets[0]+2*zstep] + xbuff[xstart+xoffsets[0]+3*xstep] * zbuff[zstart+zoffsets[0]+3*zstep];
acc_tmp[5] += xbuff[xstart+xoffsets[1]] * zbuff[zstart+zoffsets[1]] + xbuff[xstart+xoffsets[1]+xstep] * zbuff[zstart+zoffsets[1]+zstep] + xbuff[xstart+xoffsets[1]+2*xstep] * zbuff[zstart+zoffsets[1]+2*zstep] + xbuff[xstart+xoffsets[1]+3*xstep] * zbuff[zstart+zoffsets[1]+3*zstep];
acc_tmp[6] += xbuff[xstart+xoffsets[2]] * zbuff[zstart+zoffsets[2]] + xbuff[xstart+xoffsets[2]+xstep] * zbuff[zstart+zoffsets[2]+zstep] + xbuff[xstart+xoffsets[2]+2*xstep] * zbuff[zstart+zoffsets[2]+2*zstep] + xbuff[xstart+xoffsets[2]+3*xstep] * zbuff[zstart+zoffsets[2]+3*zstep];
acc_tmp[7] += xbuff[xstart+xoffsets[3]] * zbuff[zstart+zoffsets[3]] + xbuff[xstart+xoffsets[3]+xstep] * zbuff[zstart+zoffsets[3]+zstep] + xbuff[xstart+xoffsets[3]+2*xstep] * zbuff[zstart+zoffsets[3]+2*zstep] + xbuff[xstart+xoffsets[3]+3*xstep] * zbuff[zstart+zoffsets[3]+3*zstep];
return acc_tmp;
Parameters
[in] acc       Previous accumulator
[in] rot       Number of lanes to be rotated. Can be 1, 2 or 4. Must be a compile time constant
[in] xbuff     Data buffer for first multiplier input (LUT terms in the DPD context)
[in] xstart    Start index in xbuff for first pre-addition summand
[in] xoffsets  Row-dependent offsets in xbuff
[in] xstep     Offset between columns for xbuff
[in] zbuff     Data buffer for second multiplier input (data values in the DPD context)
[in] zstart    Start index in zbuff. Must be a compile time constant
[in] zoffsets  Row-dependent offsets in zbuff
[in] zstep     Offset between columns for zbuff
pmx_idx const set_pmx_idx ( pmx_cfg const &  pmx)

Set permutation control for left and right buffer.

This intrinsic is used in DPD to set the permutation control for the left and right buffers. The control word is composed of 16 5-bit values.

void split ( int  a,
unsigned  n,
unsigned const  w,
int &  msb,
unsigned &  lsb 
)

Intrinsic used by DPD to split the magnitude into index and fraction for LUT interpolation.

The split intrinsic prepares the magnitude values for further processing in the DPD. The parameters are the following:

Parameters
a    Input magnitude as a 32-bit integer
n    Number of LSBs that shall end up in the fraction. This must be a compile-time constant.
w    Width of the LUT in bytes on a binary logarithmic scale. The index will be shifted to the left by this value. This must be a compile-time constant.
msb  Output index into the LUT by reference
lsb  Output fraction for interpolation between msb and msb+1 in the LUT by reference

The intrinsic performs the following operation:

msb = (a >> n) << w;
lsb = a & ((1 << n) - 1);

To compute the index into the LUT, the intrinsic first shifts the magnitude right by n, discarding the fractional bits, and then left by w, appending w zero bits so that the result addresses whole LUT entries. For the fraction, the magnitude is masked so that only the n fractional bits remain.

Example

Let us assume we want to split a magnitude value under the following requirements:

  1. Each LUT entry contains four cint16 values.
  2. We want to interpolate with a 7-bit fractional value.

The first requirement dictates that w must be log2(4*sizeof(cint16)) = log2(16) = 4, since a cint16 occupies 4 bytes. The second requirement gives n = 7.

Therefore in this case a call of the split function could look like this:

split(a,7,4,msb,lsb);
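
For checking expected values away from the target, the documented operation can be mirrored by a plain scalar function. This is a minimal sketch, not the intrinsic: split_ref, log2_exact, CInt16Model and kEntryWidthLog2 are hypothetical names introduced for illustration, n and w are ordinary runtime arguments here, and the input is assumed to be a non-negative magnitude.

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for cint16, used only to compute the entry size
// (ASSUMPTION: 16-bit real part plus 16-bit imaginary part).
struct CInt16Model {
    std::int16_t re;
    std::int16_t im;
};

// Integer log2 for exact powers of two, usable at compile time.
constexpr unsigned log2_exact(unsigned v)
{
    return (v <= 1u) ? 0u : 1u + log2_exact(v >> 1);
}

// Hypothetical scalar model of the documented operation (not the intrinsic).
inline void split_ref(int a, unsigned n, unsigned w, int &msb, unsigned &lsb)
{
    assert(a >= 0 && n < 31 && w < 31);  // non-negative magnitude, shifts stay defined
    msb = (a >> n) << w;                                // index, scaled by the entry width
    lsb = static_cast<unsigned>(a) & ((1u << n) - 1u);  // n fractional bits
}

// w for a LUT entry holding four cint16 values: log2(4 * 4 bytes) = 4.
constexpr unsigned kEntryWidthLog2 = log2_exact(4u * sizeof(CInt16Model));
static_assert(kEntryWidthLog2 == 4u, "four cint16 values occupy 16 bytes");
```

With a = 1000, n = 7 and w = 4, this model yields msb = 112 (entry 7, scaled by 16) and lsb = 104.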
void split2 ( int  a,
unsigned  n,
unsigned const  w,
int &  msb_lo,
int &  msb_hi,
unsigned &  lsb 
)

Similar to doing two separate splits on the lower and upper half of input a.

See Also
split(int a, unsigned n, unsigned const w, int& msb, unsigned& lsb)
Parameters
a       Input as two lanes of 16 bits in a 32-bit integer
n       Number of LSBs that shall end up in the fraction. This must be a compile-time constant.
w       Width of the LUT in bytes on a binary logarithmic scale. The index will be shifted to the left by this value. This must be a compile-time constant.
msb_lo  Output index into the LUT for the lower 16 bits of a, by reference
msb_hi  Output index into the LUT for the upper 16 bits of a, by reference
lsb     Output fractions for LUT interpolation by reference (two lanes of 16 bits each, one per half of a)

The intrinsic performs the following operations:

int a_lo = a & 0xFFFF;
int a_hi = (a >> 16) & 0xFFFF;
unsigned lsb_lo, lsb_hi;
split(a_lo,n,w,msb_lo,lsb_lo);
split(a_hi,n,w,msb_hi,lsb_hi);
lsb = (lsb_lo & 0xFFFF) | (lsb_hi << 16);
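
The two-lane behaviour can likewise be mirrored on a host. The sketch below is a hypothetical scalar model (split_ref and split2_ref are illustrative names, not part of the intrinsics API); it follows the operations listed above and assumes that each 16-bit half is treated as a non-negative magnitude.

```cpp
#include <cassert>

// Hypothetical scalar model of split (not the intrinsic).
inline void split_ref(int a, unsigned n, unsigned w, int &msb, unsigned &lsb)
{
    assert(a >= 0 && n < 31 && w < 31);  // non-negative magnitude, shifts stay defined
    msb = (a >> n) << w;
    lsb = static_cast<unsigned>(a) & ((1u << n) - 1u);
}

// Hypothetical scalar model of split2: one split per 16-bit half of a.
inline void split2_ref(int a, unsigned n, unsigned w,
                       int &msb_lo, int &msb_hi, unsigned &lsb)
{
    const int a_lo = a & 0xFFFF;          // lower lane
    const int a_hi = (a >> 16) & 0xFFFF;  // upper lane
    unsigned lsb_lo = 0;
    unsigned lsb_hi = 0;
    split_ref(a_lo, n, w, msb_lo, lsb_lo);
    split_ref(a_hi, n, w, msb_hi, lsb_hi);
    lsb = (lsb_lo & 0xFFFFu) | (lsb_hi << 16);  // pack both fractions into one word
}
```

For an input whose lower lane is 1000 and upper lane is 300, with n = 7 and w = 4, the model gives msb_lo = 112, msb_hi = 32, and a packed lsb holding 104 in the lower lane and 44 in the upper lane.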