These intrinsics perform vector multiply-and-accumulate operations between data from two buffers, the X and Z buffers; the other parameters and options provide flexibility (data selection within the vectors, number of output lanes) and optional features (different input data sizes, pre-adding, etc.). An additional input buffer, the Y buffer, holds values that can be pre-added to those from the X buffer before the multiplication occurs. The result of the intrinsic is added to an accumulator.

This diagram gives a functional overview of how these intrinsics work. For users who are familiar with FIR filters, in this scheme X and Y can be used for data and symmetric data respectively and Z for the coefficients when implementing a symmetric FIR filter for example.

The operation can be described using "lanes" and "columns". The number of lanes corresponds to the number of output values that will be generated from the intrinsic call. The number of columns is the number of multiplications that will be done per output lane, with each of the multiplication results being added together.

Example :

acc0 += z00*(x00 + y00) + z01*(x01 + y01) + z02*(x02 + y02) + z03*(x03 + y03)
acc1 += z10*(x10 + y10) + z11*(x11 + y11) + z12*(x12 + y12) + z13*(x13 + y13)
acc2 += z20*(x20 + y20) + z21*(x21 + y21) + z22*(x22 + y22) + z23*(x23 + y23)
acc3 += z30*(x30 + y30) + z31*(x31 + y31) + z32*(x32 + y32) + z33*(x33 + y33)

In this case, we are generating 4 outputs, so 4 lanes, and 4 columns for each with pre-adding from the X and Y buffers.
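
As a rough illustration of the lane/column view, here is a minimal Python sketch of the arithmetic only. The function name `mac_preadd` and the toy operand values are hypothetical, and the operands are assumed to be already selected per lane and column.

```python
def mac_preadd(acc, x, y, z):
    """acc[r] += sum over columns c of z[r][c] * (x[r][c] + y[r][c])."""
    lanes, cols = len(z), len(z[0])
    return [acc[r] + sum(z[r][c] * (x[r][c] + y[r][c]) for c in range(cols))
            for r in range(lanes)]

# 4 lanes x 4 columns of already-selected operands (toy values).
x = [[r + c for c in range(4)] for r in range(4)]
y = [[1] * 4 for _ in range(4)]
z = [[2] * 4 for _ in range(4)]
result = mac_preadd([0, 0, 0, 0], x, y, z)
```

Each entry of `result` corresponds to one output lane; the inner sum runs over the columns of that lane.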

Intrinsic naming convention

The general naming convention for the vector MAC intrinsics is shown below. Optional characteristics are shown with [] and mandatory ones with {} :

  [l]{mac|msc|mul|negmul}{2|4|8|16}[_abs|_max|_min|_maxdiff][_conj][{_sym|_antisym}[_ct|_uct]][_c|_cc|_cn|_nc]

Every operation is either a multiplication, which initializes an accumulator, or a MAC operation, which accumulates into a running accumulator, with 2, 4, 8 or 16 lanes.

Optional specifications :

'l' denotes that an accumulator with 80 bit lanes is used for the operation
'sym' and 'antisym' indicates the use of pre-adding and pre-subtraction respectively
'max','min' and 'maxdiff' indicates the pre-selection of lanes in the xbuff based on the maximum, minimum or maximum difference value
'abs' indicates the pre-computation of the absolute value in the xbuff
'ct' is used for partial pre-adding and pre-subtraction (separate selection for the data input from X for the final column)
'uct' is used for upshifting, returning a wide accumulator. The first four lanes are the expected output of the intrinsic, and the next four are a column of data selected from the Y buffer and upshifted
'n' and 'c' are used to indicate that the complex conjugate will be used for one of the input buffers with complex values
- 'c' : the only complex input buffer will be conjugated
- 'cn' : complex conjugate of X (or XY if pre-adding is used) buffer
- 'nc' : complex conjugate of Z buffer
- 'cc' : complex conjugate of both X (or XY if pre-adding is used) and Z buffers
'conj' indicates that the complex conjugate of Z will be used when multiplying the data input from Y
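
The naming grammar can be transcribed into a regular expression to split a name into its parts. This is only a sketch: the group names and the helper `parse_name` are hypothetical, and matching the grammar does not mean that a given combination actually exists as an intrinsic.

```python
import re

# Pattern transcribed from the naming convention above; group names are ours.
NAME_RE = re.compile(
    r"(?P<wide>l)?"
    r"(?P<op>mac|msc|mul|negmul)"
    r"(?P<lanes>16|2|4|8)"
    r"(?P<pre>_abs|_maxdiff|_max|_min)?"
    r"(?P<conj>_conj)?"
    r"(?:(?P<sym>_sym|_antisym)(?P<tap>_ct|_uct)?)?"
    r"(?P<cplx>_cc|_cn|_nc|_c)?"
)

def parse_name(name):
    """Return the non-empty name components, or None if the grammar does not match."""
    m = NAME_RE.fullmatch(name)
    return None if m is None else {k: v for k, v in m.groupdict().items() if v}

parts = parse_name("mul4_sym_ct_c")
```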

Data selection

The parameters of the intrinsics allow for flexible data selection from the different input buffers for each lane and column, all following the same pattern of parameters. A starting point in the buffer is given by the (x/y/z)start parameter which selects the first element for the first row and first column. To allow flexibility for each lane, (x/y/z)offsets provides an offset value for each lane that will be added to the starting point. Finally, the (x/y/z)step parameter defines the step in data selection between each column based on the previous position. Note that when ystep is not specified in the intrinsic, it mirrors xstep: the Y selection steps by minus xstep.

If pre-adding or pre-subtraction is used (including with conjugation, upshifting or partial pre-adding), the Y buffer supplies the additional input data. The selection works the same way, except that the step is subtracted rather than added between columns. This also applies when vector MAC operations are combined with comparisons. With partial pre-adding or pre-subtraction, the final column is not pre-added; its data is selected from the X buffer with the ctap parameter.

Data offsetting for more than 8 output lanes

When the output has more than 8 lanes (e.g. 16), there are extra offset parameters. In addition to the usual 'offsets' parameter, an 'offsets_hi' parameter covers the extra lanes: it selects the data that will be placed into the upper input lanes (8 to 15) of the multiplier.

xstart/zstart restrictions

Permute granularity for the x/y and z buffers is 32b and 16b, respectively. The start and step values, which are expressed in samples, have to conform to the permute granularity (e.g., xstart cannot take odd values for int16 data samples and must be a multiple of 4 for int8 data samples). The lower-level selection of the data samples is carried out by the mini permute, which is controlled by the square parameter.

Square parameters

When both input buffers are real buffers with 16-bit or narrower samples, there are extra selection parameters. In addition to the usual offsets parameters, a 'square' parameter selects between elements of the input buffer. Additionally, if the coefficient buffer (usually called zbuff) is an 8-bit real buffer, it also has a square parameter.

Offset computation

Data	Coefficient	Complex Data	Complex Coefficient	has xysquare	has zsquare	xstart restrictions	xstep restrictions	zstart restrictions	zstep restrictions	Data scheme	Coefficient scheme
all others	all others	any	any	no	no	signed 32b	signed 6b	4b	signed 6b	General	General
16-bit	16-bit	no	no	yes	no	multiple of 2 / signed 32b	multiple of 2 / signed 6b	4b	signed 6b	16b x 16b data	General
16-bit	8-bit	no	no	yes	yes	multiple of 2 / signed 32b	multiple of 2 / signed 6b	multiple of 2 / 4b	multiple of 2 / signed 6b	16b x 16b data	16b x8b coefficient
8-bit	8-bit	no	no	yes	yes	multiple of 4 / signed 32b	multiple of 4 / signed 6b	multiple of 2 / 4b	multiple of 2 / signed 6b	8b x 8b data	8b x 8b coefficient

Converting from an index to a row and column pair
For any index i, the column c ranges from 0 to cols-1 and the row r from 0 to rows-1 (rows being the number of lanes in the output vector):

    for i = 0 ; i < rows * cols ; i++
      c = i % cols
      r = i / cols

r and c can be used to compute the offset for the corresponding index i

Note: If you want to do the opposite of the above and convert from a row and col pair to an index then you can do: i = c + (r * cols)
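
The two conversions can be written as a pair of small Python helpers (the function names are hypothetical):

```python
def index_to_row_col(i, cols):
    """Return (r, c) for a flat index i, with r = i // cols and c = i % cols."""
    return divmod(i, cols)

def row_col_to_index(r, c, cols):
    """Inverse mapping: i = c + r * cols."""
    return c + r * cols

pair = index_to_row_col(5, 4)
```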

Schemes

    //lanes
    lanes = (number of elements in output vector)

    //multiplications
    int m=1;
    if (data_size  == 32) m*=2;
    if (coeff_size == 32) m*=2;
    if (data_complex)     m*=2;
    if (coeff_complex)    m*=2;

    //rows and cols
    rows = lanes
    cols = 32/(m*lanes)
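
A runnable rendering of the computation above, as a sketch. The function names and the `budget` parameter (the constant 32, which some 8-bit schemes replace with 64 or 128) are hypothetical labels.

```python
def mult_weight(data_bits, coeff_bits, data_complex, coeff_complex):
    """Factor m: doubles for each 32-bit operand and each complex operand."""
    m = 1
    if data_bits == 32:
        m *= 2
    if coeff_bits == 32:
        m *= 2
    if data_complex:
        m *= 2
    if coeff_complex:
        m *= 2
    return m

def columns(lanes, m, budget=32):
    """Number of columns per lane: budget / (m * lanes)."""
    return budget // (m * lanes)

cols_real16_8_lanes = columns(8, mult_weight(16, 16, False, False))
cols_real16_16_lanes = columns(16, mult_weight(16, 16, False, False))
cols_cint16_cint16_4_lanes = columns(4, mult_weight(16, 16, True, True))
```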

General scheme

    for i = 0 ; i < rows * cols ; i++
      c = i % cols
      r = i / cols

      idx[i] = ( start + offs[r] + step*c ) % (#samples in buffer)

Note: For a v64int16 vector the number of samples would be 64. Whereas for a v4cint16 vector the number of samples would be 4.
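
The general scheme can be modelled in a few lines of Python. This sketch assumes the offsets are packed as one 4-bit field per lane with the least significant bits applying to the first lane, as described in the parameter tables of the examples; `general_indices` is a hypothetical helper name.

```python
def general_indices(start, offsets, step, lanes, cols, n_samples):
    """idx[r][c] = (start + offs[r] + step * c) mod n_samples."""
    idx = []
    for r in range(lanes):
        off = (offsets >> (4 * r)) & 0xF   # 4-bit offset of lane r
        idx.append([(start + off + step * c) % n_samples for c in range(cols)])
    return idx

plain = general_indices(0, 0x3210, 1, lanes=4, cols=2, n_samples=32)
wrapped = general_indices(30, 0x3210, 1, lanes=4, cols=2, n_samples=32)
```

The second call shows the modulo: selections that run past the end of the buffer wrap around to its start.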

16b x 16b data scheme

    for i = 0 ; i < rows * cols ; i++
      c = i % cols
      r = i / cols

      if (r % 2 == 0):
        offset = offs[r]*2
      else:
        offset = offs[r]*2 + (offs[r-1] + 1)*2

      xstep =   c/2*xstep + c%2
      ystep = -(c/2*xstep - c%2)

      idx[i] = ( xstart + offset + xstep ) % (#samples in buffer)

Note: The offsets for the odd rows are relative to the previous even row offset. If a Y buffer exists, the index computation is the same but uses the ystep computation above.

Once all indexes have been computed the square parameter is applied to each 2x2 matrix where the square parameter chooses the index from 0 to 3 - read as 4b parameters - ( increasing left to right, top to bottom ):

    idx[x  ][y]  idx[x  ][y+1]   < = > 0 1
    idx[x+1][y]  idx[x+1][y+1]   < = > 2 3

The 4 LSB for the square parameter corresponds to the first lane. The above would be represented by 0x3210. Any combination is allowed. For instance, square = 0x2130 would result in:

    idx[x  ][y]   idx[x+1][y+1]   < = > 0 3
    idx[x  ][y+1] idx[x+1][y  ]   < = > 1 2
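
The square permutation can be sketched in Python as follows. `apply_square` is a hypothetical helper that takes the already computed indexes as a list of rows and applies the same four 4-bit selectors to every 2x2 block.

```python
def apply_square(idx, square):
    """Permute each 2x2 block of idx using the four 4-bit selectors in square."""
    sel = [(square >> (4 * k)) & 0xF for k in range(4)]   # LSB selector first
    out = [row[:] for row in idx]
    for r in range(0, len(idx), 2):
        for c in range(0, len(idx[0]), 2):
            block = [idx[r][c], idx[r][c + 1], idx[r + 1][c], idx[r + 1][c + 1]]
            out[r][c], out[r][c + 1] = block[sel[0]], block[sel[1]]
            out[r + 1][c], out[r + 1][c + 1] = block[sel[2]], block[sel[3]]
    return out

block = [[0, 1], [2, 3]]
identity = apply_square(block, 0x3210)
example = apply_square(block, 0x2130)
broadcast = apply_square(block, 0x2110)
```

With `0x2110` the second row reuses element 1 of the block, which is the kind of broadcast a sliding-window FIR needs.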

16b x 8b coefficient scheme

    cols = 64/(m*lanes)

    for i = 0 ; i < rows * cols ; i++
      c = i % cols
      r = i / cols

      offset = offs[r]*2

      step = c/2*zstep + c%2

      idx[i] = ( zstart + offset + step ) % (#samples in buffer)

Once all indexes have been computed the square parameter is applied to each 2x2 matrix where the square parameter chooses the index from 0 to 3 - read as 4b parameters - ( increasing left to right, top to bottom ):

    idx[x  ][y]  idx[x  ][y+1]   < = > 0 1
    idx[x+1][y]  idx[x+1][y+1]   < = > 2 3

The 4 LSB for the square parameter corresponds to the first lane. The above would be represented by 0x3210. Any combination is allowed. For instance, square = 0x2130 would result in:

    idx[x  ][y]   idx[x+1][y+1]   < = > 0 3
    idx[x  ][y+1] idx[x+1][y  ]   < = > 1 2

8b x 8b data scheme

    cols = 128/(m*lanes)

    for i = 0 ; i < rows * cols ; i++
      c = i % cols
      r = i / cols

      rx = r / 2
      rr = r % 4
      if      rr == 0:
        offset = offs[rx]*4
      else if rr == 1:
        offset = offs[rx]*4 + 1
      else if rr == 2:
        offset = offs[rx]*4 + ( offs[rx-1] + 1 ) * 4
      else if rr == 3:
        offset = offs[rx]*4 + ( offs[rx-1] + 1 ) * 4 + 1

      xstep =     c/2*xstep + (c%2)*2
      ystep = - ( c/2*xstep - (c%2)*2 )

      idx[i] = ( xstart + offset + xstep ) % (#samples in buffer)

Note: If an Y buffer exists then the index computation is the same but uses the ystep computation above.

Once all indexes have been computed the square parameter is applied to each 4x2 matrix where the square parameter chooses the index from 0 to 3 - read as 4b parameters - ( increasing left to right, top to bottom ). In this case, because the matrix is 4x2 and only 4 indexes exist, the indexes from the even rows are duplicated for each odd row:

    idx[x  ][y]  idx[x  ][y+1]   < = > 0 1
    idx[x+1][y]  idx[x+1][y+1]   < = > 0 1
    idx[x+2][y]  idx[x+2][y+1]   < = > 2 3
    idx[x+3][y]  idx[x+3][y+1]   < = > 2 3

The 4 LSB for the square parameter corresponds to the first lane, first column and second lane, first column. The above would be represented by 0x3210. Any combination is allowed. For instance, square = 0x2130 would result in:

    idx[x  ][y]   idx[x+2][y+1]   < = > 0 3
    idx[x+1][y]   idx[x+3][y+1]   < = > 0 3
    idx[x  ][y+1] idx[x+2][y  ]   < = > 1 2
    idx[x+1][y+1] idx[x+3][y  ]   < = > 1 2

8b x 8b coefficient scheme

    for i = 0 ; i < rows * cols ; i++
      c = i % cols
      r = i / cols

      rz = ( r / 4 ) * 2 + ( r % 2 )
      offset = offs[rz]*2

      step = c/2*zstep + (c%2)

      idx[i] = ( zstart + offset + step ) % (#samples in buffer)

Once all indexes have been computed the square parameter is applied to each 4x2 matrix where the square parameter chooses the index from 0 to 3 - read as 4b parameters - ( increasing left to right, top to bottom ). In this case, because the matrix is 4x2 and only 4 indexes exist, the 2x2 indexes are simply replicated twice:

    idx[x  ][y]  idx[x  ][y+1]   < = > 0 1
    idx[x+1][y]  idx[x+1][y+1]   < = > 2 3
    idx[x+2][y]  idx[x+2][y+1]   < = > 0 1
    idx[x+3][y]  idx[x+3][y+1]   < = > 2 3

The 4 LSB for the square parameter corresponds to the first lane, first column and second lane, first column. The above would be represented by 0x3210. Any combination is allowed. For instance, square = 0x2130 would result in:

    idx[x  ][y]   idx[x+2][y+1]   < = > 0 3
    idx[x+1][y]   idx[x+3][y+1]   < = > 1 2
    idx[x  ][y+1] idx[x+2][y  ]   < = > 0 3
    idx[x+1][y+1] idx[x+3][y  ]   < = > 1 2

Example:

One use of the 'square' parameter is broadcasting data to multiple multiplier input lanes. A practical case is a 16-bit real x real FIR application, where we require the following pattern:

    acc[0] = xbuff[0]*coef[0] + xbuff[1]*coef[1]
    acc[1] = xbuff[1]*coef[0] + xbuff[2]*coef[1]

We will consider for this example this mul intrinsic:

v16acc48 mul16 (v32int16 xbuff, int xstart, unsigned int xoffsets, int xoffsets_hi, int xysquare, v16int16 zbuff, int zstart, int zoffsets, int zoffsets_hi, int zstep)

To obtain the required pattern we would load lanes 0, 1, 2 and 3 in the lowest byte of the 'offsets' parameter by setting it to '0x00' and set the square parameter to '0x2110'. Additionally, the coefficient buffer needs to be accessed in the offset '0x00' for the first two lanes and have a step parameter of 1 (to place the coefficient one in the second column). Thus we would call the intrinsic in the following way:

acc = mul16 (xbuff, 0, 0x03020100, 0x47362514 , 0x2110, coef, 0, 0x00000000, 0x00000000, 1)

The call would then result in the following:

    acc[0]  = xbuff[0]  * coef[0] + xbuff[1]  * coef[1]
    acc[1]  = xbuff[1]  * coef[0] + xbuff[2]  * coef[1]
    acc[2]  = xbuff[2]  * coef[0] + xbuff[3]  * coef[1]
    acc[3]  = xbuff[3]  * coef[0] + xbuff[4]  * coef[1]
    acc[4]  = xbuff[4]  * coef[0] + xbuff[5]  * coef[1]
    acc[5]  = xbuff[5]  * coef[0] + xbuff[6]  * coef[1]
    acc[6]  = xbuff[6]  * coef[0] + xbuff[7]  * coef[1]
    acc[7]  = xbuff[7]  * coef[0] + xbuff[8]  * coef[1]
    acc[8]  = xbuff[8]  * coef[0] + xbuff[9]  * coef[1]
    acc[9]  = xbuff[9]  * coef[0] + xbuff[12] * coef[1]
    acc[10] = xbuff[10] * coef[0] + xbuff[11] * coef[1]
    acc[11] = xbuff[11] * coef[0] + xbuff[16] * coef[1]
    acc[12] = xbuff[12] * coef[0] + xbuff[13] * coef[1]
    acc[13] = xbuff[13] * coef[0] + xbuff[20] * coef[1]
    acc[14] = xbuff[14] * coef[0] + xbuff[15] * coef[1]
    acc[15] = xbuff[15] * coef[0] + xbuff[24] * coef[1]
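
The X selection of this call can be reproduced with a small Python model of the 16b x 16b data scheme. This is a sketch under assumptions: `mul16_x_indices` is a hypothetical helper, lanes 8 to 15 are assumed to take their 4-bit offsets from xoffsets_hi, and the alternative value `0x07060504` is derived here from the offset formula rather than taken from the source.

```python
def nibbles(word, count):
    """Split a packed word into `count` 4-bit fields, least significant first."""
    return [(word >> (4 * k)) & 0xF for k in range(count)]

def mul16_x_indices(xstart, xoffsets, xoffsets_hi, xysquare, n_samples=32):
    """X indexes for 16 lanes x 2 columns, 16b x 16b data scheme plus square."""
    offs = nibbles(xoffsets, 8) + nibbles(xoffsets_hi, 8)
    sq = nibbles(xysquare, 4)
    rows = []
    for r in range(16):
        offset = offs[r] * 2
        if r % 2 == 1:                       # odd lanes are relative to the even lane
            offset += (offs[r - 1] + 1) * 2
        base = xstart + offset
        rows.append([base % n_samples, (base + 1) % n_samples])   # columns 0 and 1
    out = []
    for r in range(0, 16, 2):                # apply the square to each 2x2 block
        block = rows[r] + rows[r + 1]
        out.append([block[sq[0]], block[sq[1]]])
        out.append([block[sq[2]], block[sq[3]]])
    return out

as_documented = mul16_x_indices(0, 0x03020100, 0x47362514, 0x2110)
regular = mul16_x_indices(0, 0x03020100, 0x07060504, 0x2110)
```

Under this model, `as_documented` reproduces the listing above, including the irregular second column in lanes 9, 11, 13 and 15, while `regular` gives the sliding window `[r, r+1]` in every lane.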

Example, square parameter with more than 2 columns:

If we extend the previous example with 4 tap filter, and keep the same data precision (16bit real x real FIR) we would require the following pattern:

    acc[0] = xbuff[0]*coef[0] + xbuff[1]*coef[1] + xbuff[2]*coef[2] + xbuff[3]*coef[3]
    acc[1] = xbuff[1]*coef[0] + xbuff[2]*coef[1] + xbuff[3]*coef[2] + xbuff[4]*coef[3]
    ...

In order to know which multiplication we can use we need to calculate the output lanes, using the inverse formula for calculating cols.

  m=1
  if (data_size  == 32) m*=2;
  if (coeff_size == 32) m*=2;
  if (data_complex)     m*=2;
  if (coeff_complex)    m*=2;

  cols = 32/(m*lanes)

  We already know that we want 4 columns, so, with m=1, lanes has to be 8.

We will therefore consider this mul intrinsic:

v8acc48   mul8 (v64int16 xbuff, int xstart, unsigned int xoffsets, int xstep, unsigned int xsquare, v16int16 zbuff, int zstart, unsigned int zoffsets, int zstep)

Selecting parameters for x buffer

We will use the 16bx16b data selection scheme. In the next table we show the indexes of the x vector that we want. Our aim is now to find a combination of xstart, xoffset, xstep and xsquare to get this addressing.

	0	1	2	3
lane
acc[0]	x[0]	x[1]	x[2]	x[3]
acc[1]	x[1]	x[2]	x[3]	x[4]
acc[2]	x[2]	x[3]	x[4]	x[5]
acc[3]	x[3]	x[4]	x[5]	x[6]
acc[4]	x[4]	x[5]	x[6]	x[7]
acc[5]	x[5]	x[6]	x[7]	x[8]
acc[6]	x[6]	x[7]	x[8]	x[9]
acc[7]	x[7]	x[8]	x[9]	x[10]

We realize that we need an "x_start" of 0 because it's the only way to have a 0 at the cell of coordinates (acc_0,col_0).

If we take a closer look we see that the cell with coordinates (acc_1,col_0) requires a "1". By inspecting the 16bx16b data selection scheme we find that this is not possible because

  Formula is:

  if (r % 2 == 0):
        offset = offs[r]*2
      else:
        offset = offs[r]*2 + (offs[r-1] + 1)*2

  idx[r,c] = start + offset + c/2*xstep + c%2

  By substituting start = 0, c = 0, r=1

  idx[r,c] = offs[r]*2 + (offs[r-1] + 1)*2

And we realize that since everything is multiplied by 2, we will never get a 1 in this cell.
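
The parity argument can also be checked exhaustively. The short sketch below enumerates every pair of 4-bit offsets for lanes 0 and 1, with start = 0 and c = 0 as in the substitution above; the variable names are hypothetical.

```python
# idx[1,0] = offs[1]*2 + (offs[0] + 1)*2 for every possible 4-bit offset pair.
reachable = {offs1 * 2 + (offs0 + 1) * 2
             for offs0 in range(16) for offs1 in range(16)}

has_odd = any(v % 2 for v in reachable)
smallest = min(reachable)
```

No odd index is reachable, and the smallest reachable value is 2, so the cell can never hold index 1 before the square permutation is applied.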

Thankfully, we can correct this with the xsquare parameter, which permutes 4 elements at a time as explained in the 16b x 16b data scheme. It works on 2x2 blocks, starting at the corner (acc_0,col_0) and moving with a stride of 2 in the x and y directions.

With this idea in mind we try to achieve the following pattern using xstart,xoffsets and xstep:

	0	1	2	3
lane
acc[0]	x[0]	x[1]	x[2]	x[3]
acc[1]	x[2]	x[3]	x[4]	x[5]
acc[2]	x[2]	x[3]	x[4]	x[5]
acc[3]	x[4]	x[5]	x[6]	x[7]
acc[4]	x[4]	x[5]	x[6]	x[7]
acc[5]	x[6]	x[7]	x[8]	x[9]
acc[6]	x[6]	x[7]	x[8]	x[9]
acc[7]	x[8]	x[9]	x[10]	x[11]

And then we apply a square of 0x2110 such that


A	B
C	D

Becomes this:


A	B
B	C

Thus achieving the following index addressing:

	0	1	2	3
lane
acc[0]	x[0]	x[1]	x[2]	x[3]
acc[1]	x[1]	x[2]	x[3]	x[4]
acc[2]	x[2]	x[3]	x[4]	x[5]
acc[3]	x[3]	x[4]	x[5]	x[6]
acc[4]	x[4]	x[5]	x[6]	x[7]
acc[5]	x[5]	x[6]	x[7]	x[8]
acc[6]	x[6]	x[7]	x[8]	x[9]
acc[7]	x[7]	x[8]	x[9]	x[10]

Remember that xoffset,xstart and xstep participate only in the first permutation, hence we choose them to create the first table. The chosen ones are:

    xstart = 0
    xstep = 2
    xoffset = 0x03020100

Try substituting those parameters into the 16b x 16b data scheme to check that you get the same result as in the example.

Selecting parameters for z buffer

Z buffer follows the general scheme. What we want is the following addressing:

	0	1	2	3
lane
acc[0]	z[0]	z[1]	z[2]	z[3]
acc[1]	z[0]	z[1]	z[2]	z[3]
acc[2]	z[0]	z[1]	z[2]	z[3]
acc[3]	z[0]	z[1]	z[2]	z[3]
acc[4]	z[0]	z[1]	z[2]	z[3]
acc[5]	z[0]	z[1]	z[2]	z[3]
acc[6]	z[0]	z[1]	z[2]	z[3]
acc[7]	z[0]	z[1]	z[2]	z[3]

And hence we will choose

    zstart = 0
    zstep = 1
    zoffset = 0x0

The intrinsic call

acc = mul8 (xbuff, 0, 0x03020100, 2, 0x2110, coef, 0, 0x00000000, 1)

The call would then result in the following:

    acc[0]  = xbuff[0]  * coef[0] + xbuff[1]  * coef[1] + xbuff[2]  * coef[2] + xbuff[3]   * coef[3]
    acc[1]  = xbuff[1]  * coef[0] + xbuff[2]  * coef[1] + xbuff[3]  * coef[2] + xbuff[4]   * coef[3]
    acc[2]  = xbuff[2]  * coef[0] + xbuff[3]  * coef[1] + xbuff[4]  * coef[2] + xbuff[5]   * coef[3]
    acc[3]  = xbuff[3]  * coef[0] + xbuff[4]  * coef[1] + xbuff[5]  * coef[2] + xbuff[6]   * coef[3]
    acc[4]  = xbuff[4]  * coef[0] + xbuff[5]  * coef[1] + xbuff[6]  * coef[2] + xbuff[7]   * coef[3]
    acc[5]  = xbuff[5]  * coef[0] + xbuff[6]  * coef[1] + xbuff[7]  * coef[2] + xbuff[8]   * coef[3]
    acc[6]  = xbuff[6]  * coef[0] + xbuff[7]  * coef[1] + xbuff[8]  * coef[2] + xbuff[9]   * coef[3]
    acc[7]  = xbuff[7]  * coef[0] + xbuff[8]  * coef[1] + xbuff[9]  * coef[2] + xbuff[10]  * coef[3]

Center-Tap Modes

For symmetric FIR filter implementations that need to compute symmetric terms and a center tap in the same instruction, a special mode where the uct_col variable is used is required. The center tap terms will be selected from the pre-computed column offsets in the Y buffer using the uct_col parameter (immediate parameter). For instance, if a given intrinsic takes 4 columns of the X and Y buffers:

The data selection using the start and offset parameters will occur;
The uct_col will select one of these columns (between 0-3 in this case, where 0 is the first column);
The data selected with uct_col will be fed to the up-shifting hardware which will place the result on the upper accumulator lanes.

Optional parameters

Besides the distinctions above, there are also a few other variations of the intrinsics which don't change their functionality but allow for more flexibility with buffer sizes, types and data selection.

There are two versions of the intrinsics without pre-adding/subtraction, which use two different vector sizes for the X buffer (a small and a large X buffer variant).

For intrinsics using input from both an X and Y buffer, there is also a second version using just an X buffer with twice the size. In this case the data selection for X and Y are done from this single buffer.

Full examples

Here are a few detailed examples with increasing complexity using the intrinsics :

16 bit complex by 16 bit complex multiplication
16 bit complex by 16 bit real multiplication with pre-adding from X and Y buffers
16 bit complex by 16 bit real multiplication with partial pre-adding from Y buffer with X buffer conjugation

Lane selection example 1: 16 bit complex by 16 bit complex multiplication

The basic mul4 carries out the following operations and generates 4 complex outputs in parallel:

acc0 = z00*x00 + z01*x01
acc1 = z10*x10 + z11*x11
acc2 = z20*x20 + z21*x21
acc3 = z30*x30 + z31*x31

This equation shows the way data is selected from the X and Z buffers, where xN,M denotes the Mth X buffer term in output lane N (N=0,...,3 and M=0,1). The same format applies to input from the Z buffer. The input/output parameters of the intrinsic function are as follows.

Parameters

acc	Running accumulation vector (4 x cint48 lanes) | Valid bits: All.
xbuff	Input buffer of 32 elements of type cint16 | Valid bits: All.
xstart	Starting position offset applied to all lanes of input from X buffer | Valid bits: 5b LSB.
xoffsets	4b offset for each lane in the xbuffer. LSB apply to first lane | Valid bits: 16b LSB.
xstep	Step between each column for selection in the xbuffer | Valid bits: 4b LSB.
zbuff	Input buffer of 8 elements of type cint16 | Valid bits: All.
zstart	Starting position offset applied to all lanes for input from Z buffer | Valid bits: 3b LSB.
zoffsets	4b offset for each lane, applied to input from Z buffer. LSB apply to first lane | Valid bits: 16b LSB.
zstep	Step between each column for selection in the zbuffer | Valid bits: 4b LSB.

The input data is read from the X and Z buffers. The indices used to access the X buffer are determined by xstart, xoffsets and xstep. Let lN,M denote the buffer index for xN,M in the equation above:

BUFFER SELECTORS:

xoffsets = x3x2x1x0

l0,0	l0,1
(x0+xstart) mod 32	(x0+xstart+xstep) mod 32

l1,0	l1,1
(x1+xstart) mod 32	(x1+xstart+xstep) mod 32

l2,0	l2,1
(x2+xstart) mod 32	(x2+xstart+xstep) mod 32

l3,0	l3,1
(x3+xstart) mod 32	(x3+xstart+xstep) mod 32

The same selection method applies to the Z buffer with zstart/zoffsets/zstep and a modulo 8.

Example: 6-tap complex FIR filter

For implementing a filter, xbuff can be used to store the data while zbuff stores the coefficients.

xbuff (contains DXs of type cint16)

buffer
xbuff (HI)	D16	D17	D18	D19	D20	D21	D22	D23	D24	D25	D26	D27	D28	D29	D30	D31
xbuff (LO)	D0	D1	D2	D3	D4	D5	D6	D7	D8	D9	D10	D11	D12	D13	D14	D15

zbuff (contains CXs of type cint16)

coeffs
zbuff	C0	C1	C2	C3	C4	C5	0	0

acc = mul4(xbuff, 0, 0x3210, 1, zbuff, 0, 0x0000, 1)
According to the configuration above, the data and coefficient parameters are set as follows:

xstart	x3	x2	x1	x0	xstep	zstart	z3	z2	z1	z0	zstep
0	3	2	1	0	1	0	0	0	0	0	1

In this case mul4 calculates the following data selectors for the X buffer:

l0,0	l0,1
0	1

l1,0	l1,1
1	2

l2,0	l2,1
2	3

l3,0	l3,1
3	4

Using the buffer selectors, mul4 accesses the data terms and coefficients in xbuff and zbuff buffers, respectively, and generates the contribution of the first two complex coefficients for the output :

acc0 = C0*D0 + C1*D1
acc1 = C0*D1 + C1*D2
acc2 = C0*D2 + C1*D3
acc3 = C0*D3 + C1*D4

We could then use the corresponding mac4 intrinsic to accumulate into the returned register and compute the next terms of this filter for the first 4 output lanes :

acc = mac4(acc, xbuff, 2, 0x3210, 1, zbuff, 2, 0x0000, 1) :

acc0 += C2*D2 + C3*D3
acc1 += C2*D3 + C3*D4
acc2 += C2*D4 + C3*D5
acc3 += C2*D5 + C3*D6

acc = mac4(acc, xbuff, 4, 0x3210, 1, zbuff, 4, 0x0000, 1) :

acc0 += C4*D4 + C5*D5
acc1 += C4*D5 + C5*D6
acc2 += C4*D6 + C5*D7
acc3 += C4*D7 + C5*D8
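
The three calls can be checked against a direct FIR sum with a small Python model. This is a sketch, not the intrinsic itself: `select` and `mac4_model` are hypothetical stand-ins, the data and coefficient values are toy numbers, and Python complex numbers replace cint16.

```python
def select(start, offsets, step, lanes, cols, size):
    """General scheme: (start + 4-bit lane offset + step * column) mod size."""
    return [[(start + ((offsets >> (4 * r)) & 0xF) + step * c) % size
             for c in range(cols)] for r in range(lanes)]

def mac4_model(acc, xbuff, xstart, xoffsets, xstep, zbuff, zstart, zoffsets, zstep):
    """Stand-in for mac4 on cint16 x cint16: 4 lanes, 2 columns."""
    xi = select(xstart, xoffsets, xstep, 4, 2, len(xbuff))
    zi = select(zstart, zoffsets, zstep, 4, 2, len(zbuff))
    return [acc[r] + sum(xbuff[xi[r][c]] * zbuff[zi[r][c]] for c in range(2))
            for r in range(4)]

D = [complex(n, -n) for n in range(32)]                 # toy data D0..D31
C = [complex(k + 1, 1) for k in range(6)] + [0j, 0j]    # C0..C5, zero padded

acc = mac4_model([0j] * 4, D, 0, 0x3210, 1, C, 0, 0x0000, 1)   # plays the role of mul4
acc = mac4_model(acc, D, 2, 0x3210, 1, C, 2, 0x0000, 1)
acc = mac4_model(acc, D, 4, 0x3210, 1, C, 4, 0x0000, 1)

# Direct 6-tap sum for the first 4 outputs: sum over k of C[k] * D[n + k].
reference = [sum(C[k] * D[n + k] for k in range(6)) for n in range(4)]
```

Starting from a zero accumulator makes the first call equivalent to the mul4 step; the two following calls add the contributions of C2, C3 and C4, C5.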

Lane selection example 2: 16 bit complex by 16 bit real multiplication with pre-adding from X and Y buffers

The mul4_sym intrinsic carries out the following operations and generates 4 complex outputs in parallel:

acc0 = z00*(x00 + y00) + z01*(x01 + y01) + z02*(x02 + y02) + z03*(x03 + y03)
acc1 = z10*(x10 + y10) + z11*(x11 + y11) + z12*(x12 + y12) + z13*(x13 + y13)
acc2 = z20*(x20 + y20) + z21*(x21 + y21) + z22*(x22 + y22) + z23*(x23 + y23)
acc3 = z30*(x30 + y30) + z31*(x31 + y31) + z32*(x32 + y32) + z33*(x33 + y33)

Parameters

acc	Running accumulation vector (4 x cint48 lanes) | Valid bits: All.
xbuff	Input buffer of 16 elements of type cint16 | Valid bits: All.
xstart	Starting position offset applied to all lanes of input from X buffer | Valid bits: 4b LSB.
xyoffsets	4b offset for each lane, applied to both x and y buffers. LSB apply to first lane | Valid bits: 16b LSB.
xystep	Step between each column for selection in the x and y buffers | Valid bits: 4b LSB.
ybuff	Right input buffer of 16 elements of type cint16 | Valid bits: All.
ystart	Starting position offset applied to all lanes for input from Y buffer | Valid bits: 4b LSB.
zbuff	Input buffer of 16 elements of type int16 | Valid bits: All.
zstart	Starting position offset applied to all lanes for input from Z buffer | Valid bits: 4b LSB.
zoffsets	4b offset for each lane, applied to input from Z buffer. LSB apply to first lane | Valid bits: 16b LSB.
zstep	Step between each column for selection in the zbuffer | Valid bits: 4b LSB.

Example: 16-tap symmetric FIR filter with real coefficients

For implementing a symmetric filter, pre-adding can be used to select data from the X and Y input buffers, with the Z buffer holding the coefficients.

xbuff and ybuff (contain DXs of type cint16)

buffer
xbuff	D0	D1	D2	D3	D4	D5	D6	D7	0	0	0	0	0	0	0	0
ybuff	D8	D9	D10	D11	D12	D13	D14	D15	0	0	0	0	0	0	0	0

zbuff (contains CXs of type int16)

coeffs
zbuff	C0	C1	C2	C3	C4	C5	C6	C7	0	0	0	0	0	0	0	0

acc = mul4_sym(xbuff, 0, 0x3210, 1, ybuff, 7, zbuff, 0, 0x0000, 1)
According to the configuration above, the data and coefficient parameters are set as follows:

xstart	xy3	xy2	xy1	xy0	xystep	ystart	zstart	z3	z2	z1	z0	zstep
0	3	2	1	0	1	7	0	0	0	0	0	1

Using the buffer selectors, mul4_sym pre-adds data from the X and Y buffers, accesses the coefficients in the Z buffer, and generates the contribution of the first 4 real coefficients for the output :

acc0 = C0*(D0+D15) + C1*(D1+D14) + C2*(D2+D13) + C3*(D3+D12)
acc1 = C0*(D1+D16) + C1*(D2+D15) + C2*(D3+D14) + C3*(D4+D13)
acc2 = C0*(D2+D17) + C1*(D3+D16) + C2*(D4+D15) + C3*(D5+D14)
acc3 = C0*(D3+D18) + C1*(D4+D17) + C2*(D5+D16) + C3*(D6+D15)

We could then use the corresponding mac4_sym intrinsic to accumulate into the returned register and compute the remaining terms of this filter for the first 4 output lanes :

acc = mac4_sym(acc, xbuff, 4, 0x3210, 1, ybuff, 3, zbuff, 4, 0x0000, 1) :

acc0 += C4*(D4+D11) + C5*(D5+D10) + C6*(D6+D9 ) + C7*(D7 +D8 )
acc1 += C4*(D5+D12) + C5*(D6+D11) + C6*(D7+D10) + C7*(D8 +D9 )
acc2 += C4*(D6+D13) + C5*(D7+D12) + C6*(D8+D11) + C7*(D9 +D10)
acc3 += C4*(D7+D14) + C5*(D8+D13) + C6*(D9+D12) + C7*(D10+D11)

This same functionality can be found for pre-subtraction.
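
The mirrored X/Y selection can be sketched in Python as below. `sym_pairs` is a hypothetical helper returning (X index, Y index) pairs for one lane; it assumes the lane offset is added to both buffers and that the Y selection steps by minus xystep, as described in the data selection section.

```python
def sym_pairs(xstart, xyoffsets, xystep, ystart, lane, cols=4, size=16):
    """(x index, y index) per column for one lane of a pre-adding intrinsic."""
    off = (xyoffsets >> (4 * lane)) & 0xF
    return [((xstart + off + xystep * c) % size,
             (ystart + off - xystep * c) % size) for c in range(cols)]

first_call = sym_pairs(0, 0x3210, 1, 7, lane=0)    # xstart=0, ystart=7
second_call = sym_pairs(4, 0x3210, 1, 3, lane=0)   # xstart=4, ystart=3

# With xbuff[k] = D(k) and ybuff[k] = D(8 + k), map the pairs to data numbers.
first_data = [(x, 8 + y) for x, y in first_call]
second_data = [(x, 8 + y) for x, y in second_call]
```

For lane 0 this gives the pairings (D0, D15) through (D3, D12) for the first call and (D4, D11) through (D7, D8) for the second.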

Lane selection example 3: 16 bit complex by 16 bit real multiplication with partial pre-adding from Y buffer with X buffer conjugation

This example illustrates more complex data selection with an intrinsic using partial pre-adding.

Desired operation :

acc0 = C0*( conj(D0)+conj(D25) ) + C2*( conj(D1)+conj(D24) ) + C4*( conj(D2)+conj(D23) ) + C6*conj(D15)
acc1 = C1*( conj(D2)+conj(D27) ) + C3*( conj(D3)+conj(D26) ) + C5*( conj(D4)+conj(D25) ) + C7*conj(D17)
acc2 = C3*( conj(D4)+conj(D29) ) + C5*( conj(D5)+conj(D28) ) + C7*( conj(D6)+conj(D27) ) + C9*conj(D19)
acc3 = C3*( conj(D6)+conj(D31) ) + C5*( conj(D7)+conj(D30) ) + C7*( conj(D8)+conj(D29) ) + C9*conj(D21)

The intrinsic used to achieve this will be mul4_sym_ct_c.

Parameters

acc	Running accumulation vector (4 x cint48 lanes) | Valid bits: All.
xbuff	Input buffer of 32 elements of type cint16 | Valid bits: All.
xstart	Starting position offset applied to all lanes of input from X buffer | Valid bits: 5b LSB.
xyoffsets	4b offset for each lane, applied to both x and y buffers. LSB apply to first lane | Valid bits: 16b LSB.
xystep	Step between each column for selection in the x and y buffers | Valid bits: 4b LSB.
ystart	Starting position offset applied to all lanes for input from Y buffer | Valid bits: 5b LSB.
ctap	Selector for partial pre-subtraction | Valid bits: 4b LSB.
zbuff	Input buffer of 16 elements of type int16 | Valid bits: All.
zstart	Starting position offset applied to all lanes for input from Z buffer | Valid bits: 4b LSB.
zoffsets	4b offset for each lane, applied to input from Z buffer. LSB apply to first lane | Valid bits: 16b LSB.
zstep	Step between each column for selection in the zbuffer | Valid bits: 4b LSB.

The contents of the input buffers will be considered as follows :

xbuff (contains DXs of type cint16)

buffer
xbuff (HI)	D16	D17	D18	D19	D20	D21	D22	D23	D24	D25	D26	D27	D28	D29	D30	D31
xbuff (LO)	D0	D1	D2	D3	D4	D5	D6	D7	D8	D9	D10	D11	D12	D13	D14	D15

zbuff (contains CXs of type int16)

coeffs
zbuff	C0	C1	C2	C3	C4	C5	C6	C7	C8	C9	C10	C11	C12	C13	C14	C15

For this given equation :

- The pattern for the coefficients from the Z buffer is 0x3310 so that will be the zoffsets value. Similarly, it will be 0x6420 for the xyoffsets.
- The first value from zbuff is C0 so zstart=0 and similarly xstart=0.
- For the pre-add we start at D25 so ystart=25.
- The step between columns is 2 for zbuff values and 1 for xbuff and ybuff values so zstep=2 and xystep=1.
- Finally, we want to start at D15 for the final column so ctap will be set to 15.

This gives this usage of the intrinsic :

acc = mul4_sym_ct_c(xbuff, 0, 0x6420, 1, 25, 15, zbuff, 0, 0x3310, 2);
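
The parameter choice can be checked against the desired operation with a Python model of the selection only. `sym_ct_terms` is a hypothetical helper; it assumes that the Y selection steps by minus xystep and that the lane offset is also added to ctap for the final column, both of which are consistent with the desired operation above. Conjugation does not affect the indexes and is left out.

```python
def sym_ct_terms(xstart, xyoffsets, xystep, ystart, ctap, zstart, zoffsets, zstep,
                 lanes=4, cols=4, xsize=32, zsize=16):
    """Per lane, a list of (z index, x index, y index or None) for each column."""
    terms = []
    for r in range(lanes):
        xo = (xyoffsets >> (4 * r)) & 0xF
        zo = (zoffsets >> (4 * r)) & 0xF
        row = []
        for c in range(cols):
            zi = (zstart + zo + zstep * c) % zsize
            if c < cols - 1:                  # pre-added columns use X and Y
                xi = (xstart + xo + xystep * c) % xsize
                yi = (ystart + xo - xystep * c) % xsize
                row.append((zi, xi, yi))
            else:                             # final column: X only, selected by ctap
                row.append((zi, (ctap + xo) % xsize, None))
        terms.append(row)
    return terms

terms = sym_ct_terms(0, 0x6420, 1, 25, 15, 0, 0x3310, 2)
```

Each tuple reads as C[z] * (conj(D[x]) + conj(D[y])), or C[z] * conj(D[x]) for the final column, so `terms[0]` corresponds to the acc0 line of the desired operation.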