Overview

Note: Lookup table functionality is only available from AIE-ML

Two abstractions are provided to represent lookup tables on AIE architectures:

aie::parallel_lookup which provides a direct lookup
aie::linear_approx which provides a linear approximation for non-linear functions

The primary purpose of these abstractions is to leverage hardware support for parallel accesses on certain AIE architectures.

Both of these abstractions are built upon the aie::lut type, which encapsulates the raw LUT data. This encapsulation helps ensure that the data layout is correct for a given lookup type. Specifically, to achieve a given level of access parallelism, the LUT values must follow a specific layout in memory, which depends on the required number of parallel loads. For details on the memory layout requirements, see the aie::lut documentation.

Example implementations of parallel lookup and linear approximation functions are given below:

template <typename Value>
void parallel_lookup(const int8* pIn, Value* pOut, const aie::lut<4, Value>& my_lut,
                     int samples, int step_bits, int bias, int LUT_elems)
{
    aie::parallel_lookup<int8, aie::lut<4, Value>> lookup(my_lut, step_bits, bias);
 
    auto it_in  = aie::begin_vector<32>(pIn);
    auto it_out = aie::begin_vector<32>(pOut);
 
    for (unsigned l = 0; l < samples / 32; ++l) 
        *it_out++ = lookup.fetch(*it_in++);
}
 
template <typename OffsetType, typename SlopeType>
void linear_approx(const int8* pIn, OffsetType* pOut, const aie::lut<4, OffsetType, SlopeType>& my_lut,
                   int samples, int step_bits, int bias, int LUT_elems, int shift_offset, int shift_out)
{
    aie::linear_approx<int8, aie::lut<4, OffsetType, SlopeType>> lin_approx(my_lut, step_bits, bias, shift_offset);
 
    auto it_in  = aie::begin_vector<32>(pIn);
    auto it_out = aie::begin_vector<32>(pOut);
 
    for (unsigned l = 0; l < samples / 32; ++l) 
        *it_out++ = lin_approx.compute(*it_in++).template to_vector<OffsetType>(shift_out);
}

Classes
struct	aie::linear_approx< T, MyLUT >

struct	aie::lut< ParallelAccesses, OffsetType, SlopeType >
	Abstraction to represent a LUT that is stored in memory, instantiated with pointer(s) to the already appropriately populated memory and the number of elements. More...

struct	aie::parallel_lookup< T, MyLUT, oor_policy >

Class Documentation

◆ aie::linear_approx

struct aie::linear_approx

template<typename T, ParallelLUT MyLUT>
requires (arch::is(arch::AIE_ML))
struct aie::linear_approx< T, MyLUT >

Note: Linear approximation functionality is only available from AIE-ML

Type to support a linear approximation via interpolation with slope/offset values stored in a lookup table.

The offset values are simply the samples of the function to be approximated. The slope values, which are the slopes of the function at the corresponding sample, are used in conjunction with the input to more accurately estimate the function value between sample points.

The logical steps of the computation for an integer based linear approximation are:

index = (input >> step_bits) + bias
slope/offset pair read from LUT based on index
output = slope * (input & ((1 << step_bits) - 1)) + (offset << shift_offset)

while the steps for a floating point based approximation are:

index = (int(floor(input)) >> step_bits) + bias
slope/offset pair read from LUT based on index
output = slope * input + offset
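The integer steps listed above can be modelled in plain scalar C++. The sketch below is a host-side reference model only; the names SlopeOffset and linear_approx_ref are hypothetical and are not part of the AIE API. It assumes the computed index falls inside the table and that right shifts of negative values are arithmetic (guaranteed from C++20).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical scalar reference model; not part of the AIE API.
struct SlopeOffset {
    std::int32_t slope;
    std::int32_t offset;
};

inline std::int64_t linear_approx_ref(std::int32_t input, const std::vector<SlopeOffset>& lut,
                                      unsigned step_bits, int bias, int shift_offset)
{
    assert(step_bits < 31 && shift_offset >= 0 && shift_offset < 31);

    // index = (input >> step_bits) + bias
    // (arithmetic shift assumed for negative inputs; guaranteed from C++20)
    const int index = (input >> step_bits) + bias;
    assert(index >= 0 && static_cast<std::size_t>(index) < lut.size());

    // slope/offset pair read from the LUT based on index
    const SlopeOffset& entry = lut[static_cast<std::size_t>(index)];

    // remainder = input & ((1 << step_bits) - 1), always in [0, 1 << step_bits)
    const std::int32_t remainder = input & ((std::int32_t{1} << step_bits) - 1);

    // output = slope * remainder + (offset << shift_offset)
    // The offset is scaled by multiplication to avoid shifting a negative value.
    return std::int64_t{entry.slope} * remainder
         + std::int64_t{entry.offset} * (std::int64_t{1} << shift_offset);
}
```

For a table that samples y = 2x every 4 inputs with step_bits = 2 and bias = 2, an input of -5 selects entry 0 (offset -16), leaves a remainder of 3, and yields -10.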

Note that for integer based linear approximations, the slope is multiplied by an integer value in the range [0, 1 << step_bits) and therefore tweaking of the LUT values or linear_approx parameters may be required to ensure that offset[i] + slope[i] * ((1 << step_bits) - 1) approximately equals offset[i+1].
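One way to check the condition in the note above before deploying a table is to measure the largest mismatch between the end of each segment and the start of the next. The helper below is a hypothetical host-side sketch (max_segment_gap is not an AIE API function) and assumes shift_offset is 0, so that offsets and outputs share the same scale.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical host-side check; not part of the AIE API.
// Returns the largest |offset[i] + slope[i] * ((1 << step_bits) - 1) - offset[i + 1]|.
inline std::int64_t max_segment_gap(const std::vector<std::int32_t>& slope,
                                    const std::vector<std::int32_t>& offset,
                                    unsigned step_bits)
{
    assert(slope.size() == offset.size() && step_bits < 31);

    // Largest value the slope is ever multiplied by
    const std::int64_t last = (std::int64_t{1} << step_bits) - 1;

    std::int64_t worst = 0;
    for (std::size_t i = 0; i + 1 < offset.size(); ++i) {
        const std::int64_t end_of_segment = offset[i] + slope[i] * last;
        const std::int64_t diff = end_of_segment - offset[i + 1];
        const std::int64_t gap = diff < 0 ? -diff : diff;
        if (gap > worst) worst = gap;
    }
    return worst;
}
```

A gap close to one slope step (for example, 2 for a table with slope 2) is what an exact sampling of a straight line produces; larger gaps suggest that the LUT values or the step_bits parameter need adjusting.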

The slope and offset values are expected to be placed adjacent in memory. The number of achieved lookups per cycle is determined by the aie::lut object that encapsulates the contents of the lookup table. The following example shows the memory layout of a 128b bank width lookup table with 16b values and slopes, which achieves 4 lookups per cycle:

constexpr unsigned size = 8;
const int16 lut_ab[size*2*2] = {slope0, offset0, slope1, offset1, slope2, offset2, slope3, offset3,
                                slope0, offset0, slope1, offset1, slope2, offset2, slope3, offset3, //note 128b duplication
                                slope4, offset4, slope5, offset5, slope6, offset6, slope7, offset7,
                                slope4, offset4, slope5, offset5, slope6, offset6, slope7, offset7};
const int16 lut_cd[size*2*2] = {slope0, offset0, slope1, offset1, slope2, offset2, slope3, offset3,
                                slope0, offset0, slope1, offset1, slope2, offset2, slope3, offset3,
                                slope4, offset4, slope5, offset5, slope6, offset6, slope7, offset7,
                                slope4, offset4, slope5, offset5, slope6, offset6, slope7, offset7};
aie::lut<4, int16, int16> lookup_table(size, lut_ab, lut_cd);
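Writing the duplicated layout by hand is error-prone, so it may be convenient to generate it from separate slope and offset arrays. The following is a minimal host-side sketch under the assumptions of the example above (16b slopes and offsets, 128b bank width); pack_slope_offset_lut is a hypothetical helper, not an AIE API function.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper; not part of the AIE API.
// Interleaves slope/offset pairs and repeats each 128b group twice.
inline std::vector<std::int16_t> pack_slope_offset_lut(const std::vector<std::int16_t>& slopes,
                                                       const std::vector<std::int16_t>& offsets)
{
    constexpr std::size_t bank_bytes = 16;  // 128b bank width, as in the example above
    constexpr std::size_t pairs_per_bank = bank_bytes / (2 * sizeof(std::int16_t));  // 4 pairs

    assert(slopes.size() == offsets.size());
    assert(slopes.size() % pairs_per_bank == 0);

    std::vector<std::int16_t> out;
    out.reserve(slopes.size() * 4);  // 2 values per pair, 2 copies of each group

    for (std::size_t base = 0; base < slopes.size(); base += pairs_per_bank) {
        for (int copy = 0; copy < 2; ++copy) {  // 128b duplication
            for (std::size_t i = base; i < base + pairs_per_bank; ++i) {
                out.push_back(slopes[i]);
                out.push_back(offsets[i]);
            }
        }
    }
    return out;
}
```

The same packed contents would be used for both lut_ab and lut_cd; placing the two arrays in different memory banks is left to the build flow and is not shown here.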

Supported linear approximation types
Input	Offset	Slope	Accumulator type	Lanes	Minimum step_bits required
int8	int8	int8	acc32	32	2
int16	int16	int16	acc64	16	3
int16	int32	int32	acc64	16	4
bfloat16	float	bfloat16	accfloat	16	0

Note that while the floating point linear approx requires the offset data to be 32b floats, the slope data is required to be bfloat16. However, it is required that all values in the LUT be 32b to ensure the LUT is correctly aligned. While it is safe to use floats as the storage type for the lookup table, it is required that the low 16 mantissa bits of the floating point slope value be zero.
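When slopes are stored as 32b floats, the zero-mantissa requirement above can be met by clearing the low 16 bits of each slope while the table is generated on the host. The sketch below assumes 32b IEEE-754 floats; both function names are hypothetical and not part of the AIE API. Note that it truncates toward zero rather than rounding to nearest.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helpers; not part of the AIE API.

// True if the low 16 bits of the 32b float encoding are zero,
// i.e. the value is exactly representable as a bfloat16.
inline bool low_mantissa_bits_are_zero(float x)
{
    static_assert(sizeof(float) == sizeof(std::uint32_t), "expects 32b floats");
    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof(bits));
    return (bits & 0x0000FFFFu) == 0;
}

// Clears the low 16 bits of the encoding (truncation, not round-to-nearest).
inline float truncate_to_bfloat16_precision(float x)
{
    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof(bits));
    bits &= 0xFFFF0000u;
    float out;
    std::memcpy(&out, &bits, sizeof(out));
    assert(low_mantissa_bits_are_zero(out));
    return out;
}
```

Slopes that need better accuracy could be rounded to the nearest bfloat16 value instead; truncation is used here only because it is the simplest way to satisfy the stated constraint.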

Template Parameters

T	Type of the input vector, containing values used to index the lookup table.
MyLUT	Definition of the LUT type, using the lut type.

Public Member Functions
	linear_approx (const MyLUT &l, unsigned step_bits, int bias=0, int shift_offset=0)
	Constructor, configures aspects of how the approximation is performed.

template<Vector Vec>
auto	compute (const Vec &input)
	Performs a linear approximation for the input values with the configured lookup table.

Constructor & Destructor Documentation

◆ linear_approx()

template<typename T , ParallelLUT MyLUT>

aie::linear_approx< T, MyLUT >::linear_approx	(	const MyLUT &	l,
		unsigned	step_bits,
		int	bias = `0`,
		int	shift_offset = `0`
	)

inline

Constructor, configures aspects of how the approximation is performed.

Parameters

l	LUT containing the stored slope/offset value pairs used for the linear approximation. Each value in the LUT has the slope in the LSB, the offset in the MSB.
step_bits	Lower bits that won't be used from the input to index the LUT. For integer input, these will be the remainder multiplied by the slope value at each point. For float values, the input values are used directly in the multiplication.
bias	Optional offset added to the input values used to index, for example to center on 0 by adding half the number of LUT elements.
shift_offset	Optional scaling factor applied to the offset before adding it (to avoid loss of precision).

Member Function Documentation

◆ compute()

template<typename T , ParallelLUT MyLUT>

template<Vector Vec>

auto aie::linear_approx< T, MyLUT >::compute ( const Vec & input )

inline

Performs a linear approximation for the input values with the configured lookup table.

An accumulator of the same number of elements as the input is returned.
Input values are interpreted from MSB to LSB: headroom | LUT elements | step_bits
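The field split described above can be illustrated with a small scalar sketch. It assumes an unsigned 16b input and a table with 1 << index_bits elements; InputFields and split_input are hypothetical names used for illustration and are not part of the AIE API.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical illustration; not part of the AIE API.
struct InputFields {
    std::uint32_t headroom;   // bits above the LUT index
    std::uint32_t lut_index;  // bits that select the LUT element
    std::uint32_t step;       // low step_bits bits
};

inline InputFields split_input(std::uint16_t input, unsigned step_bits, unsigned index_bits)
{
    assert(step_bits + index_bits <= 16);

    const std::uint32_t value = input;
    const std::uint32_t step_mask = (std::uint32_t{1} << step_bits) - 1;
    const std::uint32_t index_mask = (std::uint32_t{1} << index_bits) - 1;

    InputFields fields;
    fields.step = value & step_mask;
    fields.lut_index = (value >> step_bits) & index_mask;
    fields.headroom = value >> (step_bits + index_bits);
    return fields;
}
```

With input 0x2B6, step_bits = 3 and index_bits = 4, the fields are headroom 5, LUT index 6 and step 6, since 5 * 128 + 6 * 8 + 6 = 694 = 0x2B6.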

Parameters

input Vector of input values that are used to index the look-up table.

◆ aie::lut

struct aie::lut

template<unsigned ParallelAccesses, typename OffsetType, typename SlopeType = OffsetType>
requires (arch::is(arch::AIE_ML))
struct aie::lut< ParallelAccesses, OffsetType, SlopeType >

Abstraction to represent a LUT that is stored in memory, instantiated with pointer(s) to the already appropriately populated memory and the number of elements.

The requirement on memory layout is that for degree N parallel accesses, N copies of the LUT data are required; i.e.

For a single load without parallelism, the values are required to be stored linearly in memory.
For 2 loads in parallel, the LUT needs to have 2 copies of the LUT values with repetition every bank width. For example with 32b values and a 128b bank width, in memory we would have the first 4 values (128b), then the same 4 again, then the next 4, which then repeat, etc.
For 4 loads in parallel, we require the same layout as for 2 loads, but two distinct copies in this layout, placed in different memory banks.

Currently the only supported implementation on this architecture is for 4 parallel accesses.
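The repetition pattern described above can be generated from a linear array of values. The sketch below is a hypothetical host-side helper (duplicate_per_bank is not an AIE API function) that assumes 32b values and takes the bank width in bytes as a parameter, defaulting to 128b.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper; not part of the AIE API.
// Emits each bank-width group of values twice, back to back.
inline std::vector<std::int32_t> duplicate_per_bank(const std::vector<std::int32_t>& values,
                                                    std::size_t bank_bytes = 16)
{
    const std::size_t per_bank = bank_bytes / sizeof(std::int32_t);
    assert(per_bank > 0 && values.size() % per_bank == 0);

    std::vector<std::int32_t> out;
    out.reserve(values.size() * 2);

    for (std::size_t base = 0; base < values.size(); base += per_bank) {
        const auto first = values.begin() + static_cast<std::ptrdiff_t>(base);
        const auto last = first + static_cast<std::ptrdiff_t>(per_bank);
        out.insert(out.end(), first, last);  // first copy of the group
        out.insert(out.end(), first, last);  // repeated copy of the same group
    }
    return out;
}
```

For 2 parallel loads, one such array would be used; for 4 parallel loads, two arrays with these contents would be placed in different memory banks, which depends on the build flow and is not shown here.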

Template Parameters

ParallelAccesses	Defines how many parallel accesses will be done in a single LUT access, possibilities depend on the hardware available for the given architecture
OffsetType	Type of values stored within the lookup table.
SlopeType	Optional template parameter, only needed in certain cases of linear approximation where the offset/slope value pair uses two different types.

Public Types
using	lut_impl = detail::lut< ParallelAccesses, OffsetType, SlopeType >

using	offset_type = OffsetType

using	slope_type = SlopeType

Public Member Functions
	lut (unsigned LUT_elems, const void *LUT_a)
	Constructor for singular access.

	lut (unsigned LUT_elems, const void *LUT_ab)
	Constructor for two parallel accesses.

	lut (unsigned LUT_elems, const void *LUT_ab, const void *LUT_cd)
	Constructor for 4 parallel accesses.

Member Typedef Documentation

◆ lut_impl

template<unsigned ParallelAccesses, typename OffsetType , typename SlopeType = OffsetType>

using aie::lut< ParallelAccesses, OffsetType, SlopeType >::lut_impl = detail::lut<ParallelAccesses, OffsetType, SlopeType>

◆ offset_type

template<unsigned ParallelAccesses, typename OffsetType , typename SlopeType = OffsetType>

using aie::lut< ParallelAccesses, OffsetType, SlopeType >::offset_type = OffsetType

◆ slope_type

template<unsigned ParallelAccesses, typename OffsetType , typename SlopeType = OffsetType>

using aie::lut< ParallelAccesses, OffsetType, SlopeType >::slope_type = SlopeType

Constructor & Destructor Documentation

◆ lut() [1/3]

template<unsigned ParallelAccesses, typename OffsetType , typename SlopeType = OffsetType>

aie::lut< ParallelAccesses, OffsetType, SlopeType >::lut	(	unsigned	LUT_elems,
		const void *	LUT_ab,
		const void *	LUT_cd
	)

inline

Constructor for 4 parallel accesses.

Each pointer refers to an equivalent copy of the LUT in which the values are repeated twice, interleaved at bank-width granularity. In total, the same values must be present 4 times in memory to allow the 4 parallel accesses.

For example, with a 128b bank width:

constexpr unsigned size = 8;
const int32 lut_ab[size*2] = {value0, value1, value2, value3,
                              value0, value1, value2, value3, //note 128b duplication
                              value4, value5, value6, value7,
                              value4, value5, value6, value7};
const int32 lut_cd[size*2] = {value0, value1, value2, value3,
                              value0, value1, value2, value3,
                              value4, value5, value6, value7,
                              value4, value5, value6, value7};
aie::lut<4, int32> lookup_table(size, lut_ab, lut_cd);

Parameters

LUT_elems	Number of elements in the LUT (not accounting for repetition).
LUT_ab	First two copies of the data, with the values repeated and interleaved at bank width granularity.
LUT_cd	Next two copies of the data, with the values repeated and interleaved at bank width granularity.

◆ lut() [2/3]

template<unsigned ParallelAccesses, typename OffsetType , typename SlopeType = OffsetType>

aie::lut< ParallelAccesses, OffsetType, SlopeType >::lut	(	unsigned	LUT_elems,
		const void *	LUT_ab
	)

inline

Constructor for two parallel accesses.

For example, with a 128b bank width:

constexpr unsigned size = 8;
const int32 lut_ab[size*2] = {value0, value1, value2, value3,
                              value0, value1, value2, value3, //note 128b duplication
                              value4, value5, value6, value7,
                              value4, value5, value6, value7};
aie::lut<2, int32> lookup_table(size, lut_ab);

Parameters

LUT_elems	Number of elements in the LUT (not accounting for repetition).
LUT_ab	Two copies of the data, with the values interleaved at bank width granularity.

◆ lut() [3/3]

template<unsigned ParallelAccesses, typename OffsetType , typename SlopeType = OffsetType>

aie::lut< ParallelAccesses, OffsetType, SlopeType >::lut	(	unsigned	LUT_elems,
		const void *	LUT_a
	)

inline

Constructor for singular access.

For example,

constexpr unsigned size = 8;
const int32 lut_a[size] = {value0, value1, value2, value3,
                           value4, value5, value6, value7};
aie::lut<1, int32> lookup_table(size, lut_a);

Parameters

LUT_elems	Number of elements in the LUT.
LUT_a	Pointer to the LUT values.

◆ aie::parallel_lookup

struct aie::parallel_lookup

template<typename T, ParallelLUT MyLUT, lut_oor_policy oor_policy = lut_oor_policy::saturate>
requires (arch::is(arch::AIE_ML))
struct aie::parallel_lookup< T, MyLUT, oor_policy >

Note: Parallel lookup functionality is only available from AIE-ML

Type with functionality to directly index a LUT based on input vector of values. The number of achieved lookups per cycle is determined by the aie::lut object that encapsulates the contents of the lookup table. Refer to aie::lut for more details.

Real signed and unsigned integer types (>=8b) are supported as indices. All types (>=8b) are supported as value types, including bfloat16, real, and complex types.

Note: 8b value type lookups require the data to be stored in the lookup tables as 16b values due to the granularity of the memory accesses.

Template Parameters

T	Type of the input vector, containing values used to index the lookup table.
MyLUT	Definition of the LUT type, using the lut type
oor_policy	Defines the "out of range policy" for when index values on the input go beyond the size of the LUT. It can either saturate, taking on the min/max valid index, or truncate, retaining the lower bits for unsigned indices or wrapping in the interval [-bias,lut_size-bias) for signed indices. Saturating is the default behaviour, but for certain non-linear functions that repeat after an interval, truncation may be required.
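The difference between the two policies can be sketched with a scalar model. This is one interpretation of the description above, not the hardware behaviour: it assumes the table position is the index plus the bias, and that truncation is equivalent to wrapping that position modulo the table size (which matches "retaining the lower bits" when the size is a power of two). The names oor_policy_ref and resolve_lut_position are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical scalar model; not part of the AIE API.
enum class oor_policy_ref { saturate, truncate };

inline std::size_t resolve_lut_position(long long index, long long bias, std::size_t lut_size,
                                        oor_policy_ref policy)
{
    assert(lut_size > 0);
    const long long size = static_cast<long long>(lut_size);
    const long long position = index + bias;

    if (policy == oor_policy_ref::saturate) {
        // Clamp to the min/max valid position
        return static_cast<std::size_t>(std::clamp(position, 0LL, size - 1));
    }

    // Truncate: wrap the position into [0, size), which corresponds to the
    // index wrapping within [-bias, size - bias) (assumed interpretation)
    return static_cast<std::size_t>(((position % size) + size) % size);
}
```

With 8 elements and a bias of 4, an index of 5 saturates to position 7 but wraps to position 1, which is the behaviour a periodic function would typically need.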

Public Member Functions
template<typename U = T> requires (std::is_unsigned_v<T>)
	parallel_lookup (const MyLUT &l, unsigned step_bits=0)
	Constructor for unsigned input types, configures aspects of how the lookup is performed.

template<typename U = T> requires (std::is_signed_v<T>)
	parallel_lookup (const MyLUT &l, unsigned step_bits=0, unsigned bias=0)
	Constructor for signed input types, configures aspects of how the lookup is performed.

template<Vector Vec, unsigned N = Vec::size()>
vector< typename MyLUT::offset_type, N >	fetch (const Vec &input)
	Accesses the lookup table based on the provided input values, will return a vector of the same number of elements as the input vector.

template<unsigned N, Vector Vec>
vector< typename MyLUT::offset_type, N >	fetch (const Vec &input)
	Accesses the lookup table based on the provided input values.

Constructor & Destructor Documentation

◆ parallel_lookup() [1/2]

template<typename T , ParallelLUT MyLUT, lut_oor_policy oor_policy = lut_oor_policy::saturate>

template<typename U = T>
requires (std::is_signed_v<T>)

aie::parallel_lookup< T, MyLUT, oor_policy >::parallel_lookup	(	const MyLUT &	l,
		unsigned	step_bits = `0`,
		unsigned	bias = `0`
	)

inline

Constructor for signed input types, configures aspects of how the lookup is performed.

Note that usage of step_bits requires either:

The rounding mode is set to the default aie::rounding_mode::floor
The lowest step_bits of the index are zero

Parameters

l	LUT containing the stored values used for the lookup.
step_bits	Optional lower bits that will be ignored for indexing the LUT.
bias	Optional offset added to the input values used to index, for example to center on 0 by adding half the number of LUT elements. This value, if supplied, must be a power of 2.

◆ parallel_lookup() [2/2]

template<typename T , ParallelLUT MyLUT, lut_oor_policy oor_policy = lut_oor_policy::saturate>

template<typename U = T>
requires (std::is_unsigned_v<T>)

aie::parallel_lookup< T, MyLUT, oor_policy >::parallel_lookup	(	const MyLUT &	l,
		unsigned	step_bits = `0`
	)

inline

Constructor for unsigned input types, configures aspects of how the lookup is performed.

Note that usage of step_bits requires either:

The rounding mode is set to the default aie::rounding_mode::floor
The lowest step_bits of the index are zero

Parameters

l	LUT containing the stored values used for the lookup.
step_bits	Optional lower bits that will be ignored for indexing the LUT.

Member Function Documentation

◆ fetch() [1/2]

template<typename T , ParallelLUT MyLUT, lut_oor_policy oor_policy = lut_oor_policy::saturate>

template<Vector Vec, unsigned N = Vec::size()>

vector< typename MyLUT::offset_type, N > aie::parallel_lookup< T, MyLUT, oor_policy >::fetch ( const Vec & input )

inline

Accesses the lookup table based on the provided input values, will return a vector of the same number of elements as the input vector.

Input values are interpreted from MSB to LSB: headroom | LUT elements | step_bits

Note the step_bits are required to be zeroed if the rounding mode is set to anything other than aie::rounding_mode::floor.

Parameters

input Vector of input values that are used to index the look-up table.

◆ fetch() [2/2]

template<typename T , ParallelLUT MyLUT, lut_oor_policy oor_policy = lut_oor_policy::saturate>

template<unsigned N, Vector Vec>

vector< typename MyLUT::offset_type, N > aie::parallel_lookup< T, MyLUT, oor_policy >::fetch ( const Vec & input )

inline

Accesses the lookup table based on the provided input values.

This overload allows the size of the returned vector to be specified as a template parameter. This may be required when mapping small index types to large value types as a direct mapping may not be valid. For example, mapping int8 to cint32 on a given architecture may require input to be 16 elements. fetch(input) would therefore deduce a return type of aie::vector<cint32, 16>, which may be unsupported. However, returning aie::vector<cint32, 8> by calling fetch<8>(input) may be valid.
Input values are interpreted from MSB to LSB: headroom | LUT elements | step_bits

Note the step_bits are required to be zeroed if the rounding mode is set to anything other than aie::rounding_mode::floor.

Template Parameters

N	The number of elements to lookup, which may be less than the input vector size

Parameters

input Vector of input values that are used to index the look-up table.
