AI Engine API User Guide (AIE) 2023.2
Lookup Tables

Overview

Note
Lookup table functionality is only available from AIE-ML

Two abstractions are provided to represent lookup tables on AIE architectures:

  1. aie::parallel_lookup which provides a direct lookup
  2. aie::linear_approx which provides a linear approximation for non-linear functions

The primary purpose of these abstractions is to leverage hardware support for parallel accesses on certain AIE architectures.

Both abstractions are built on the aie::lut type, which encapsulates the raw LUT data. The encapsulation exists to ensure the correct data layout for a given lookup type: to achieve a given level of access parallelism, the LUT values must follow a specific layout in memory, which depends on the required number of parallel loads. For details on the memory layout requirements, see the aie::lut documentation.

Example implementations of parallel lookup and linear approximation functions are given below:

template <typename Value>
void parallel_lookup(const int8* pIn, Value* pOut, const aie::lut<4, Value>& my_lut,
                     int samples, int step_bits, int bias, int LUT_elems)
{
    // Configure a direct lookup indexed by int8 input values
    aie::parallel_lookup<int8, aie::lut<4, Value>> lookup(my_lut, step_bits, bias);

    auto it_in  = aie::begin_vector<32>(pIn);
    auto it_out = aie::begin_vector<32>(pOut);

    // Each iteration looks up 32 samples
    for (unsigned l = 0; l < samples / 32; ++l)
        *it_out++ = lookup.fetch(*it_in++);
}

template <typename OffsetType, typename SlopeType>
void linear_approx(const int8* pIn, OffsetType* pOut, const aie::lut<4, OffsetType, SlopeType>& my_lut,
                   int samples, int step_bits, int bias, int LUT_elems, int shift_offset, int shift_out)
{
    // Configure a linear approximation using slope/offset pairs from the LUT
    aie::linear_approx<int8, aie::lut<4, OffsetType, SlopeType>> lin_approx(my_lut, step_bits, bias, shift_offset);

    auto it_in  = aie::begin_vector<32>(pIn);
    auto it_out = aie::begin_vector<32>(pOut);

    // compute() returns an accumulator; convert it to the output type with a downshift of shift_out
    for (unsigned l = 0; l < samples / 32; ++l)
        *it_out++ = lin_approx.compute(*it_in++).template to_vector<OffsetType>(shift_out);
}

Classes

struct  aie::linear_approx< T, MyLUT >
 
struct  aie::lut< ParallelAccesses, OffsetType, SlopeType >
 Abstraction to represent a LUT that is stored in memory, instantiated with pointer(s) to the already appropriately populated memory and the number of elements.
 
struct  aie::parallel_lookup< T, MyLUT, oor_policy >
 

Class Documentation

◆ aie::linear_approx

struct aie::linear_approx
template<typename T, ParallelLUT MyLUT>
requires (arch::is(arch::AIE_ML))
struct aie::linear_approx< T, MyLUT >
Note
Linear approximation functionality is only available from AIE-ML

Type to support a linear approximation via interpolation with slope/offset values stored in a lookup table.

The offset values are simply the samples of the function to be approximated. The slope values, which are the slopes of the function at the corresponding sample, are used in conjunction with the input to more accurately estimate the function value between sample points.

The logical steps of the computation for an integer based linear approximation are:

  • index = (input >> step_bits) + bias
  • slope/offset pair read from LUT based on index
  • output = slope * (input & ((1 << step_bits) - 1)) + (offset << shift_offset)

while the steps for a floating point based approximation are:

  • index = (int(floor(input)) >> step_bits) + bias
  • slope/offset pair read from LUT based on index
  • output = slope * input + offset
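The two sets of steps above can be modelled on a host with scalar code. The following is a minimal sketch, not the AIE implementation: the struct and function names are hypothetical, the LUT is a plain std::vector of slope/offset pairs rather than an aie::lut, and the computed index is assumed to be in range.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical host-side reference types; not part of the AIE API.
struct SlopeOffset  { int32_t slope; int32_t offset; };
struct SlopeOffsetF { float   slope; float   offset; };

// Integer steps: compute the index, read the pair, then
// output = slope * remainder + (offset << shift_offset).
inline int64_t linear_approx_int_ref(int32_t input, const std::vector<SlopeOffset>& lut,
                                     unsigned step_bits, int bias, unsigned shift_offset)
{
    // Assumes an arithmetic right shift for negative inputs (guaranteed from C++20).
    const int32_t index = (input >> step_bits) + bias;
    assert(index >= 0 && static_cast<std::size_t>(index) < lut.size());
    const SlopeOffset& e = lut[static_cast<std::size_t>(index)];

    // The low step_bits of the input form the remainder multiplied by the slope.
    const int32_t remainder = input & ((int32_t{1} << step_bits) - 1);

    // Multiplying by a power of two stands in for the left shift of the offset.
    return int64_t{e.slope} * remainder + int64_t{e.offset} * (int64_t{1} << shift_offset);
}

// Floating point steps: the index comes from floor(input), and the input
// itself (not a remainder) is multiplied by the slope.
inline float linear_approx_float_ref(float input, const std::vector<SlopeOffsetF>& lut,
                                     unsigned step_bits, int bias)
{
    const int index = (static_cast<int>(std::floor(input)) >> step_bits) + bias;
    assert(index >= 0 && static_cast<std::size_t>(index) < lut.size());
    const SlopeOffsetF& e = lut[static_cast<std::size_t>(index)];
    return e.slope * input + e.offset;
}
```

For example, with four pairs {1,-8}, {1,-4}, {1,0}, {1,4}, step_bits = 2 and bias = 2, the integer model maps an input of 5 to index 3 and remainder 1, giving 1 * 1 + 4 = 5.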

Note that for integer based linear approximations, the slope is multiplied by an integer value in the range [0, 1 << step_bits) and therefore tweaking of the LUT values or linear_approx parameters may be required to ensure that offset[i] + slope[i] * ((1 << step_bits) - 1) approximately equals offset[i+1].
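One way to check this condition before committing a table to memory is to measure the gap at each segment boundary on the host. The sketch below is only illustrative: the helper names are hypothetical, and deriving slopes as differences of consecutive offsets assumes shift_offset == step_bits, so that the shifted offset and the slope * remainder term share one scale. With shift_offset = 0 the gap function reduces to the condition stated above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <vector>

// Hypothetical helper: slope[i] = offset[i+1] - offset[i].
// The last segment reuses the previous slope because it has no successor.
inline std::vector<int32_t> slopes_from_offsets(const std::vector<int32_t>& offset)
{
    assert(offset.size() >= 2);
    std::vector<int32_t> slope(offset.size());
    for (std::size_t i = 0; i + 1 < offset.size(); ++i)
        slope[i] = offset[i + 1] - offset[i];
    slope.back() = slope[offset.size() - 2];
    return slope;
}

// Hypothetical helper: largest absolute difference between the value reached
// at the end of segment i (remainder == (1 << step_bits) - 1) and the start
// of segment i + 1, both in the scaled (offset << shift_offset) domain.
inline int64_t max_segment_gap(const std::vector<int32_t>& slope,
                               const std::vector<int32_t>& offset,
                               unsigned step_bits, unsigned shift_offset)
{
    assert(slope.size() == offset.size());
    const int64_t last_remainder = (int64_t{1} << step_bits) - 1;
    const int64_t scale = int64_t{1} << shift_offset;
    int64_t worst = 0;
    for (std::size_t i = 0; i + 1 < offset.size(); ++i) {
        const int64_t end_value  = offset[i] * scale + int64_t{slope[i]} * last_remainder;
        const int64_t next_start = offset[i + 1] * scale;
        worst = std::max(worst, static_cast<int64_t>(std::abs(end_value - next_start)));
    }
    return worst;
}
```

With offsets {0, 10, 30}, step_bits = 2 and shift_offset = 2, the derived slopes are {10, 20, 20} and the largest gap is 20 in the scaled domain, which is 5 output units after shifting down by 2 bits.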

The slope and offset values are expected to be placed adjacent in memory. The number of achieved lookups per cycle is determined by the aie::lut object that encapsulates the contents of the lookup table. The following example shows the memory layout of a 128b bank width lookup table with 16b values and slopes, which achieves 4 lookups per cycle:

constexpr unsigned size = 8;
const int16 lut_ab[size*2*2] = {slope0, offset0, slope1, offset1, slope2, offset2, slope3, offset3,
slope0, offset0, slope1, offset1, slope2, offset2, slope3, offset3, //note 128b duplication
slope4, offset4, slope5, offset5, slope6, offset6, slope7, offset7,
slope4, offset4, slope5, offset5, slope6, offset6, slope7, offset7};
const int16 lut_cd[size*2*2] = {slope0, offset0, slope1, offset1, slope2, offset2, slope3, offset3,
slope0, offset0, slope1, offset1, slope2, offset2, slope3, offset3,
slope4, offset4, slope5, offset5, slope6, offset6, slope7, offset7,
slope4, offset4, slope5, offset5, slope6, offset6, slope7, offset7};
aie::lut<4, int16, int16> lookup_table(size, lut_ab, lut_cd);
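Writing the duplicated layout by hand is error prone for larger tables. As a rough sketch, the same arrangement could be generated on the host from separate slope and offset arrays. The function name is hypothetical, the 128b bank width and 16b element size are taken from the example above, and placing the two resulting buffers in different memory banks is outside the scope of this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical host-side helper; not part of the AIE API.
// Produces slope/offset pairs, with every 128b group of 4 pairs written twice.
inline std::vector<int16_t> make_slope_offset_bank_layout(const std::vector<int16_t>& slope,
                                                          const std::vector<int16_t>& offset)
{
    constexpr std::size_t pairs_per_bank_word = 4;  // 128b / (2 * 16b)
    assert(slope.size() == offset.size());
    assert(slope.size() % pairs_per_bank_word == 0);

    std::vector<int16_t> out;
    out.reserve(slope.size() * 2 * 2);  // 2 values per pair, 2 copies of each group
    for (std::size_t base = 0; base < slope.size(); base += pairs_per_bank_word) {
        for (int copy = 0; copy < 2; ++copy) {
            for (std::size_t i = base; i < base + pairs_per_bank_word; ++i) {
                out.push_back(slope[i]);   // slope first
                out.push_back(offset[i]);  // offset second
            }
        }
    }
    return out;
}
```

Calling the helper twice with the same inputs would yield the contents of both lut_ab and lut_cd, since the two buffers hold identical data.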
Supported linear approximation types

Input     Offset  Slope     Accumulator type  Lanes  Minimum step_bits required
int8      int8    int8      acc32             32     2
int16     int16   int16     acc64             16     3
int16     int32   int32     acc64             16     4
bfloat16  float   bfloat16  accfloat          16     0

Note that the floating point linear approximation requires the offset data to be 32b floats and the slope data to be bfloat16. However, every value in the LUT must occupy 32b to ensure the LUT is correctly aligned. It is therefore safe to use float as the storage type for the whole lookup table, provided that the low 16 mantissa bits of each floating point slope value are zero.
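A slope value stored as a float can be brought into this form by clearing the low 16 bits of its bit pattern. The following is a minimal sketch with hypothetical helper names; it truncates toward zero rather than rounding to nearest, which may not be the preferred conversion for every table.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper: clears the low 16 mantissa bits of a float so that the
// value is exactly representable as a bfloat16 (truncation, not rounding).
inline float truncate_to_bfloat16_precision(float value)
{
    std::uint32_t bits;
    std::memcpy(&bits, &value, sizeof bits);  // well-defined type punning
    bits &= 0xFFFF0000u;                      // keep sign, exponent and top 7 mantissa bits
    std::memcpy(&value, &bits, sizeof value);
    return value;
}

// Hypothetical helper: checks whether a float already satisfies the requirement.
inline bool low_mantissa_bits_are_zero(float value)
{
    std::uint32_t bits;
    std::memcpy(&bits, &value, sizeof bits);
    return (bits & 0x0000FFFFu) == 0;
}
```

A check such as low_mantissa_bits_are_zero could be run over every slope entry when the table is generated, so that violations are caught before the data reaches the device.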

Template Parameters
T      Type of the input vector, containing values used to index the lookup table.
MyLUT  Definition of the LUT type, using the lut type.

Public Member Functions

 linear_approx (const MyLUT &l, unsigned step_bits, int bias=0, int shift_offset=0)
 Constructor, configures aspects of how the approximation is performed.
 
template<Vector Vec>
auto compute (const Vec &input)
 Performs a linear approximation for the input values with the configured lookup table.
 

Constructor & Destructor Documentation

◆ linear_approx()

template<typename T , ParallelLUT MyLUT>
aie::linear_approx< T, MyLUT >::linear_approx ( const MyLUT &  l,
unsigned  step_bits,
int  bias = 0,
int  shift_offset = 0 
)
inline

Constructor, configures aspects of how the approximation is performed.

Parameters
l             LUT containing the stored slope/offset value pairs used for the linear approximation. Each value in the LUT has the slope in the LSB, the offset in the MSB.
step_bits     Lower bits that won't be used from the input to index the LUT. For integer input, these will be the remainder multiplied by the slope value at each point. For float values, the input values are used directly in the multiplication.
bias          Optional offset added to the input values used to index, for example to center on 0 by adding half the number of LUT elements.
shift_offset  Optional scaling factor applied to the offset before adding it (to avoid loss of precision).

Member Function Documentation

◆ compute()

template<typename T , ParallelLUT MyLUT>
template<Vector Vec>
auto aie::linear_approx< T, MyLUT >::compute ( const Vec &  input)
inline

Performs a linear approximation for the input values with the configured lookup table.

An accumulator of the same number of elements as the input is returned.

Input values are interpreted from MSB to LSB: headroom | LUT elements | step_bits
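The field layout can be illustrated by splitting a scalar input into its three parts. This is a sketch under stated assumptions: the names are hypothetical, the input is treated as unsigned, and the LUT is assumed to hold a power-of-two number of elements so that index_bits bits select an entry.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical host-side view of one input value; not part of the AIE API.
struct InputFields {
    std::uint32_t headroom;   // bits above the LUT index
    std::uint32_t lut_index;  // bits that select the LUT element
    std::uint32_t step;       // low step_bits, not used for indexing
};

inline InputFields split_input(std::uint32_t input, unsigned index_bits, unsigned step_bits)
{
    assert(index_bits + step_bits < 32);
    const std::uint32_t step      = input & ((1u << step_bits) - 1u);
    const std::uint32_t lut_index = (input >> step_bits) & ((1u << index_bits) - 1u);
    const std::uint32_t headroom  = input >> (step_bits + index_bits);
    return {headroom, lut_index, step};
}
```

For an input of 0b110110 with index_bits = 3 and step_bits = 2, the fields are headroom = 1, lut_index = 5 and step = 2.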

Parameters
input  Vector of input values that are used to index the look-up table.

◆ aie::lut

struct aie::lut
template<unsigned ParallelAccesses, typename OffsetType, typename SlopeType = OffsetType>
requires (arch::is(arch::AIE_ML))
struct aie::lut< ParallelAccesses, OffsetType, SlopeType >

Abstraction to represent a LUT that is stored in memory, instantiated with pointer(s) to the already appropriately populated memory and the number of elements.

The requirement on memory layout is that for degree N parallel accesses, N copies of the LUT data are required; i.e.

  • For a single load without parallelism, the values are required to be stored linearly in memory.
  • For 2 loads in parallel, the LUT needs to have 2 copies of the LUT values with repetition every bank width. For example with 32b values and a 128b bank width, in memory we would have the first 4 values (128b), then the same 4 again, then the next 4, which then repeat, etc.
  • For 4 loads in parallel, we require the same layout as for 2 loads, but two distinct copies in this layout, placed in different memory banks.

Currently the only supported implementation on this architecture is for 4 parallel accesses.
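The repetition at bank-width granularity described above can be produced programmatically. The following is a minimal host-side sketch, assuming a 128b (16 byte) bank width by default; the function name is hypothetical, and placing buffers in specific memory banks is not covered here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper; not part of the AIE API.
// Writes each bank-width group of values twice in succession, which is the
// layout described for 2 parallel loads.
template <typename T>
std::vector<T> duplicate_per_bank_word(const std::vector<T>& values, std::size_t bank_bytes = 16)
{
    assert(bank_bytes % sizeof(T) == 0);
    const std::size_t per_word = bank_bytes / sizeof(T);
    assert(values.size() % per_word == 0);

    std::vector<T> out;
    out.reserve(values.size() * 2);
    for (std::size_t base = 0; base < values.size(); base += per_word) {
        const auto first = values.begin() + static_cast<std::ptrdiff_t>(base);
        const auto last  = first + static_cast<std::ptrdiff_t>(per_word);
        for (int copy = 0; copy < 2; ++copy)
            out.insert(out.end(), first, last);  // same group, written twice
    }
    return out;
}
```

For 4 parallel loads, two buffers with this content would be needed, one per memory bank, matching the lut_ab and lut_cd arguments of the corresponding constructor.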

Template Parameters
ParallelAccesses  Defines how many parallel accesses will be done in a single LUT access; the possibilities depend on the hardware available for the given architecture.
OffsetType        Type of values stored within the lookup table.
SlopeType         Optional template parameter, only needed in certain cases of linear approximation where the offset/slope value pair uses two different types.

Public Types

using lut_impl = detail::lut< ParallelAccesses, OffsetType, SlopeType >
 
using offset_type = OffsetType
 
using slope_type = SlopeType
 

Public Member Functions

 lut (unsigned LUT_elems, const void *LUT_a)
 Constructor for singular access.
 
 lut (unsigned LUT_elems, const void *LUT_ab)
 Constructor for two parallel accesses.
 
 lut (unsigned LUT_elems, const void *LUT_ab, const void *LUT_cd)
 Constructor for 4 parallel accesses.
 

Member Typedef Documentation

◆ lut_impl

template<unsigned ParallelAccesses, typename OffsetType , typename SlopeType = OffsetType>
using aie::lut< ParallelAccesses, OffsetType, SlopeType >::lut_impl = detail::lut<ParallelAccesses, OffsetType, SlopeType>

◆ offset_type

template<unsigned ParallelAccesses, typename OffsetType , typename SlopeType = OffsetType>
using aie::lut< ParallelAccesses, OffsetType, SlopeType >::offset_type = OffsetType

◆ slope_type

template<unsigned ParallelAccesses, typename OffsetType , typename SlopeType = OffsetType>
using aie::lut< ParallelAccesses, OffsetType, SlopeType >::slope_type = SlopeType

Constructor & Destructor Documentation

◆ lut() [1/3]

template<unsigned ParallelAccesses, typename OffsetType , typename SlopeType = OffsetType>
aie::lut< ParallelAccesses, OffsetType, SlopeType >::lut ( unsigned  LUT_elems,
const void *  LUT_ab,
const void *  LUT_cd 
)
inline

Constructor for 4 parallel accesses.

Each pointer refers to an identically populated LUT in which the values are repeated twice, interleaved at bank width granularity. In total, the same values need to be present 4 times in memory to allow for the 4 parallel accesses.

For example, with a 128b bank width:

constexpr unsigned size = 8;
const int32 lut_ab[size*2] = {value0, value1, value2, value3,
value0, value1, value2, value3, //note 128b duplication
value4, value5, value6, value7,
value4, value5, value6, value7};
const int32 lut_cd[size*2] = {value0, value1, value2, value3,
value0, value1, value2, value3,
value4, value5, value6, value7,
value4, value5, value6, value7};
aie::lut<4, int32> lookup_table(size, lut_ab, lut_cd);
Parameters
LUT_elems  Number of elements in the LUT (not accounting for repetition).
LUT_ab     First two copies of the data, with the values repeated and interleaved at bank width granularity.
LUT_cd     Next two copies of the data, with the values repeated and interleaved at bank width granularity.

◆ lut() [2/3]

template<unsigned ParallelAccesses, typename OffsetType , typename SlopeType = OffsetType>
aie::lut< ParallelAccesses, OffsetType, SlopeType >::lut ( unsigned  LUT_elems,
const void *  LUT_ab 
)
inline

Constructor for two parallel accesses.

For example, with a 128b bank width:

constexpr unsigned size = 8;
const int32 lut_ab[size*2] = {value0, value1, value2, value3,
value0, value1, value2, value3, //note 128b duplication
value4, value5, value6, value7,
value4, value5, value6, value7};
aie::lut<2, int32> lookup_table(size, lut_ab);
Parameters
LUT_elems  Number of elements in the LUT (not accounting for repetition).
LUT_ab     Two copies of the data, with the values interleaved at bank width granularity.

◆ lut() [3/3]

template<unsigned ParallelAccesses, typename OffsetType , typename SlopeType = OffsetType>
aie::lut< ParallelAccesses, OffsetType, SlopeType >::lut ( unsigned  LUT_elems,
const void *  LUT_a 
)
inline

Constructor for singular access.

For example,

constexpr unsigned size = 8;
const int32 lut_a[size] = {value0, value1, value2, value3,
value4, value5, value6, value7};
aie::lut<1, int32> lookup_table(size, lut_a);
Parameters
LUT_elems  Number of elements in the LUT.
LUT_a      Pointer to the LUT values.

◆ aie::parallel_lookup

struct aie::parallel_lookup
template<typename T, ParallelLUT MyLUT, lut_oor_policy oor_policy = lut_oor_policy::saturate>
requires (arch::is(arch::AIE_ML))
struct aie::parallel_lookup< T, MyLUT, oor_policy >
Note
Parallel lookup functionality is only available from AIE-ML

Type with functionality to directly index a LUT based on input vector of values. The number of achieved lookups per cycle is determined by the aie::lut object that encapsulates the contents of the lookup table. Refer to aie::lut for more details.

Real signed and unsigned integer types (>=8b) are supported as indices. All types (>=8b) are supported as value types, including bfloat16, real, and complex types.

Note
8b value type lookups require the data to be stored in the lookup tables as 16b values due to the granularity of the memory accesses.
Template Parameters
T           Type of the input vector, containing values used to index the lookup table.
MyLUT       Definition of the LUT type, using the lut type.
oor_policy  Defines the "out of range policy" for when index values on the input go beyond the size of the LUT. It can either saturate, taking on the min/max valid index, or truncate, retaining the lower bits for unsigned indices or wrapping in the interval [-bias,lut_size-bias) for signed indices. Saturating is the default behaviour, but truncation may be required for certain non-linear functions that repeat after an interval.
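The difference between the two policies can be seen in a scalar model of the index computation for signed inputs. This is a sketch only: the enum and function are hypothetical stand-ins rather than the aie::lut_oor_policy implementation, and wrapping is modelled as a mathematical modulo over the LUT size, which is an assumption about the truncate behaviour.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the out of range policy; not the AIE type.
enum class OorPolicy { saturate, truncate };

// Resolves a signed input to a valid LUT position under the chosen policy.
inline std::size_t resolve_signed_index(std::int32_t input, unsigned step_bits, std::int32_t bias,
                                        std::size_t lut_size, OorPolicy policy)
{
    assert(lut_size > 0);
    const std::int64_t size = static_cast<std::int64_t>(lut_size);

    // Assumes an arithmetic right shift for negative inputs (guaranteed from C++20).
    const std::int64_t raw = std::int64_t{input >> step_bits} + bias;

    if (policy == OorPolicy::saturate)
        return static_cast<std::size_t>(std::clamp<std::int64_t>(raw, 0, size - 1));

    // Truncate: wrapping the unbiased index in [-bias, lut_size - bias) is
    // equivalent to wrapping the biased index in [0, lut_size).
    const std::int64_t wrapped = ((raw % size) + size) % size;
    return static_cast<std::size_t>(wrapped);
}
```

With lut_size = 8, bias = 4 and step_bits = 0, an input of 10 resolves to position 7 when saturating and to position 6 when truncating.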

Public Member Functions

template<typename U = T>
requires (std::is_unsigned_v<T>)
 parallel_lookup (const MyLUT &l, unsigned step_bits=0)
 Constructor for unsigned input types, configures aspects of how the lookup is performed.
 
template<typename U = T>
requires (std::is_signed_v<T>)
 parallel_lookup (const MyLUT &l, unsigned step_bits=0, unsigned bias=0)
 Constructor for signed input types, configures aspects of how the lookup is performed.
 
template<Vector Vec, unsigned N = Vec::size()>
vector< typename MyLUT::offset_type, N > fetch (const Vec &input)
 Accesses the lookup table based on the provided input values, will return a vector of the same number of elements as the input vector.
 
template<unsigned N, Vector Vec>
vector< typename MyLUT::offset_type, N > fetch (const Vec &input)
 Accesses the lookup table based on the provided input values.
 

Constructor & Destructor Documentation

◆ parallel_lookup() [1/2]

template<typename T , ParallelLUT MyLUT, lut_oor_policy oor_policy = lut_oor_policy::saturate>
template<typename U = T>
requires (std::is_signed_v<T>)
aie::parallel_lookup< T, MyLUT, oor_policy >::parallel_lookup ( const MyLUT &  l,
unsigned  step_bits = 0,
unsigned  bias = 0 
)
inline

Constructor for signed input types, configures aspects of how the lookup is performed.

Note that usage of step_bits requires either:

  • The rounding mode is set to the default aie::rounding_mode::floor
  • The lowest step_bits of the index are zero
Parameters
l          LUT containing the stored values used for the lookup.
step_bits  Optional lower bits that will be ignored for indexing the LUT.
bias       Optional offset added to the input values used to index, for example to center on 0 by adding half the number of LUT elements. This value, if supplied, must be a power of 2.

◆ parallel_lookup() [2/2]

template<typename T , ParallelLUT MyLUT, lut_oor_policy oor_policy = lut_oor_policy::saturate>
template<typename U = T>
requires (std::is_unsigned_v<T>)
aie::parallel_lookup< T, MyLUT, oor_policy >::parallel_lookup ( const MyLUT &  l,
unsigned  step_bits = 0 
)
inline

Constructor for unsigned input types, configures aspects of how the lookup is performed.

Note that usage of step_bits requires either:

  • The rounding mode is set to the default aie::rounding_mode::floor
  • The lowest step_bits of the index are zero
Parameters
l          LUT containing the stored values used for the lookup.
step_bits  Optional lower bits that will be ignored for indexing the LUT.

Member Function Documentation

◆ fetch() [1/2]

template<typename T , ParallelLUT MyLUT, lut_oor_policy oor_policy = lut_oor_policy::saturate>
template<Vector Vec, unsigned N = Vec::size()>
vector< typename MyLUT::offset_type, N > aie::parallel_lookup< T, MyLUT, oor_policy >::fetch ( const Vec &  input)
inline

Accesses the lookup table based on the provided input values, will return a vector of the same number of elements as the input vector.


Input values are interpreted from MSB to LSB: headroom | LUT elements | step_bits

Note the step_bits are required to be zeroed if the rounding mode is set to anything other than aie::rounding_mode::floor.
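The reason for this requirement can be illustrated with scalar arithmetic: discarding step_bits under a non-floor rounding mode can move a value to the next LUT entry, whereas values whose low step_bits are already zero give the same index either way. The sketch below uses hypothetical helpers, and its round-half-up shift is a generic stand-in for a non-floor mode rather than a model of any specific aie::rounding_mode value.

```cpp
#include <cstdint>

// Hypothetical helper: clears the lowest step_bits of an input value.
inline std::int32_t clear_step_bits(std::int32_t input, unsigned step_bits)
{
    return input & ~((std::int32_t{1} << step_bits) - 1);
}

// Shift with floor behaviour (assumes an arithmetic right shift, guaranteed from C++20).
inline std::int32_t shift_floor(std::int32_t value, unsigned shift)
{
    return value >> shift;
}

// Shift that rounds halves upwards; a generic example of a non-floor mode.
inline std::int32_t shift_round_half_up(std::int32_t value, unsigned shift)
{
    if (shift == 0)
        return value;
    return (value + (std::int32_t{1} << (shift - 1))) >> shift;
}
```

For an input of 7 with step_bits = 2, the floor shift yields index 1 while the rounding shift yields index 2; after clear_step_bits the input becomes 4 and both shifts yield index 1.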

Parameters
input  Vector of input values that are used to index the look-up table.

◆ fetch() [2/2]

template<typename T , ParallelLUT MyLUT, lut_oor_policy oor_policy = lut_oor_policy::saturate>
template<unsigned N, Vector Vec>
vector< typename MyLUT::offset_type, N > aie::parallel_lookup< T, MyLUT, oor_policy >::fetch ( const Vec &  input)
inline

Accesses the lookup table based on the provided input values.

This overload allows the size of the returned vector to be specified as a template parameter. This may be required when mapping small index types to large value types as a direct mapping may not be valid. For example, mapping int8 to cint32 on a given architecture may require input to be 16 elements. fetch(input) would therefore deduce a return type of aie::vector<cint32, 16>, which may be unsupported. However, returning aie::vector<cint32, 8> by calling fetch<8>(input) may be valid.

Input values are interpreted from MSB to LSB: headroom | LUT elements | step_bits

Note the step_bits are required to be zeroed if the rounding mode is set to anything other than aie::rounding_mode::floor.

Template Parameters
N  The number of elements to look up, which may be less than the input vector size.
Parameters
input  Vector of input values that are used to index the look-up table.