AI Engine API User Guide (AIE-API) 2024.1
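Two abstractions are provided to represent lookup tables on AIE architectures: aie::parallel_lookup, which directly indexes a LUT with a vector of input values, and aie::linear_approx, which performs a linear approximation via interpolation with slope/offset values stored in a LUT.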
The primary purpose of these abstractions is to leverage hardware support for parallel accesses on certain AIE architectures.
Both of these abstractions are built upon the aie::lut type, which encapsulates the raw LUT data. The encapsulation is intended to ensure the correct data layout for a given lookup type. Specifically, to achieve a given level of access parallelism, the LUT values must have a specific layout in memory, which depends on the required number of parallel loads. For details on the memory layout requirements, see the aie::lut documentation.
Example implementations of parallel lookup and linear approximation functions are given below:
Classes | |
struct | aie::linear_approx< T, MyLUT > |
struct | aie::lut< ParallelAccesses, OffsetType, SlopeType > |
Abstraction to represent a LUT that is stored in memory, instantiated with pointer(s) to the already appropriately populated memory and the number of elements. | |
struct | aie::parallel_lookup< T, MyLUT, oor_policy > |
struct aie::linear_approx |
Type to support a linear approximation via interpolation with slope/offset values stored in a lookup table.
The offset values are simply the samples of the function to be approximated. The slope values, which are the slopes of the function at the corresponding sample, are used in conjunction with the input to more accurately estimate the function value between sample points.
The logical steps of the computation for an integer based linear approximation are:
while the steps for a floating point based approximation are:
Note that for integer based linear approximations, the slope is multiplied by an integer value in the range [0, 1 << step_bits) and therefore tweaking of the LUT values or linear_approx parameters may be required to ensure that offset[i] + slope[i] * ((1 << step_bits) - 1) approximately equals offset[i+1].
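The integer case can be illustrated with a scalar model. This is a minimal sketch in plain C++, not AIE API code: `LutEntry`, `linear_approx_model` and `interval_end_value` are hypothetical names, and the index and remainder arithmetic is an assumption inferred from the descriptions of step_bits, bias and shift_offset in this section. It also assumes that right-shifting a negative value is an arithmetic shift, which holds on mainstream compilers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical scalar model of one slope/offset pair; not an AIE API type.
struct LutEntry {
    std::int32_t offset; // sample of the function at the start of the interval
    std::int32_t slope;  // increment per unit of the low step_bits remainder
};

// Assumed semantics (inferred from the parameter descriptions):
//   index     = (input >> step_bits) + bias
//   remainder = input & ((1 << step_bits) - 1)
//   result    = (offset << shift_offset) + slope * remainder
inline std::int64_t linear_approx_model(const std::vector<LutEntry>& lut,
                                        std::int32_t input, unsigned step_bits,
                                        int bias = 0, int shift_offset = 0)
{
    const std::int32_t index = (input >> step_bits) + bias;
    assert(index >= 0 && static_cast<std::size_t>(index) < lut.size());
    const std::int32_t remainder = input & ((std::int32_t{1} << step_bits) - 1);
    const LutEntry& e = lut[static_cast<std::size_t>(index)];
    // Multiply instead of shifting so that negative offsets are handled portably.
    const std::int64_t scaled_offset =
        static_cast<std::int64_t>(e.offset) * (std::int64_t{1} << shift_offset);
    return scaled_offset + static_cast<std::int64_t>(e.slope) * remainder;
}

// Value reached at the largest remainder of interval i. The note above asks
// for this to land close to lut[i + 1].offset.
inline std::int64_t interval_end_value(const std::vector<LutEntry>& lut,
                                       std::size_t i, unsigned step_bits)
{
    assert(i < lut.size());
    return lut[i].offset +
           static_cast<std::int64_t>(lut[i].slope) * ((std::int64_t{1} << step_bits) - 1);
}
```

For a table sampling f(x) = 2x every four inputs (offsets 0, 8, 16, 24, slope 2, step_bits = 2), `interval_end_value` for the first interval is 6, one slope step short of the next offset of 8, which is the relationship the note above asks the LUT values to satisfy.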
The slope and offset values are expected to be placed adjacent in memory. The number of achieved lookups per cycle is determined by the aie::lut object that encapsulates the contents of the lookup table. The following example shows the memory layout of a 128b bank width lookup table with 16b values and slopes, which achieves 4 lookups per cycle:
Input | Offset | Slope | Accumulator type | Lanes | Minimum step_bits required |
---|---|---|---|---|---|
int8 | int8 | int8 | acc32 | 32 | 2 |
int16 | int16 | int16 | acc64 | 16 | 3 |
int16 | int32 | int32 | acc64 | 16 | 4 |
bfloat16 | float | bfloat16 | accfloat | 16 | 0 |
Note that while the floating point linear approx requires the offset data to be 32b floats, the slope data is required to be bfloat16. However, it is required that all values in the LUT be 32b to ensure the LUT is correctly aligned. While it is safe to use floats as the storage type for the lookup table, it is required that the low 16 mantissa bits of the floating point slope value be zero.
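One way to satisfy the zero-mantissa requirement when slopes are stored as 32b floats is to clear the low 16 bits of each slope before writing it to the table. The following is a minimal sketch in plain C++; the helper names are hypothetical and not part of the AIE API, and it truncates toward zero rather than rounding to nearest, which may not be the preferred conversion for every table.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

static_assert(sizeof(float) == sizeof(std::uint32_t), "expects 32b IEEE-754 float");

// Returns the raw 32b encoding of a float.
inline std::uint32_t float_bits(float value)
{
    std::uint32_t bits;
    std::memcpy(&bits, &value, sizeof bits);
    return bits;
}

// True if the low 16 mantissa bits are zero, i.e. the value is exactly
// representable as bfloat16.
inline bool low_mantissa_bits_are_zero(float value)
{
    return (float_bits(value) & 0x0000FFFFu) == 0;
}

// Hypothetical helper: clears the low 16 bits of the encoding (truncation),
// keeping the sign, the exponent and the top 7 mantissa bits.
inline float truncate_to_bfloat16_precision(float value)
{
    const std::uint32_t bits = float_bits(value) & 0xFFFF0000u;
    float result;
    std::memcpy(&result, &bits, sizeof result);
    assert(low_mantissa_bits_are_zero(result));
    return result;
}
```

A table generator could apply `truncate_to_bfloat16_precision` to each slope and leave the offsets untouched, since only the slope is required to be bfloat16.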
T | Type of the input vector, containing values used to index the lookup table. |
MyLUT | Definition of the LUT type, using the lut type. |
Public Member Functions | |
linear_approx (const MyLUT &l, unsigned step_bits, int bias=0, int shift_offset=0) | |
Constructor, configures aspects of how the approximation is performed. | |
template<Vector Vec> | |
auto | compute (const Vec &input) |
Performs a linear approximation for the input values with the configured lookup table. | |
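aie::linear_approx< T, MyLUT >::linear_approx (const MyLUT &l, unsigned step_bits, int bias=0, int shift_offset=0) |
inline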
Constructor, configures aspects of how the approximation is performed.
l | LUT containing the stored slope/offset value pairs used for the linear approximation. Each entry in the LUT holds the slope in its least significant bits and the offset in its most significant bits. |
step_bits | Number of low-order input bits that are not used to index the LUT. For integer input, these bits form the remainder that is multiplied by the slope value at each point. For float input, the input values are used directly in the multiplication. |
bias | Optional offset added to the input values used to index, for example to center on 0 by adding half the number of LUT elements. |
shift_offset | Optional scaling factor applied to the offset before adding it (to avoid loss of precision). |
template<Vector Vec> auto aie::linear_approx< T, MyLUT >::compute (const Vec &input) |
inline
Performs a linear approximation for the input values with the configured lookup table.
An accumulator of the same number of elements as the input is returned.
Input values are interpreted from MSB to LSB: headroom | LUT elements | step_bits
input | Vector of input values that are used to index the look-up table. |
struct aie::lut |
Abstraction to represent a LUT that is stored in memory, instantiated with pointer(s) to the already appropriately populated memory and the number of elements.
The requirement on memory layout is that for degree N parallel accesses, N copies of the LUT data are required; i.e.
Currently the only supported implementation on this architecture is for 4 parallel accesses.
ParallelAccesses | Defines how many parallel accesses are performed in a single LUT access. The possible values depend on the hardware available for the given architecture. |
OffsetType | Type of values stored within the lookup table. |
SlopeType | Optional template parameter, only needed in certain cases of linear approximation where the offset/slope value pair uses two different types. |
Public Types | |
using | lut_impl = detail::lut< ParallelAccesses, OffsetType, SlopeType > |
using | offset_type = OffsetType |
using | slope_type = SlopeType |
Public Member Functions | |
lut (unsigned LUT_elems, const void *LUT_a) | |
Constructor for singular access. | |
lut (unsigned LUT_elems, const void *LUT_ab) | |
Constructor for two parallel accesses. | |
lut (unsigned LUT_elems, const void *LUT_ab, const void *LUT_cd) | |
Constructor for 4 parallel accesses. | |
using aie::lut< ParallelAccesses, OffsetType, SlopeType >::lut_impl = detail::lut<ParallelAccesses, OffsetType, SlopeType> |
using aie::lut< ParallelAccesses, OffsetType, SlopeType >::offset_type = OffsetType |
using aie::lut< ParallelAccesses, OffsetType, SlopeType >::slope_type = SlopeType |
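aie::lut< ParallelAccesses, OffsetType, SlopeType >::lut (unsigned LUT_elems, const void *LUT_ab, const void *LUT_cd) |
inline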
Constructor for 4 parallel accesses.
Each pointer refers to an equivalent LUT in which the values are repeated twice, interleaved at bank-width granularity. In total, the same values must be present four times in memory to allow for the 4 parallel accesses.
For example, with a 128b bank width:
LUT_elems | Number of elements in the LUT (not accounting for repetition). |
LUT_ab | First two copies of the data, with the values repeated and interleaved at bank width granularity. |
LUT_cd | Next two copies of the data, with the values repeated and interleaved at bank width granularity. |
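The repeated, interleaved layout can be produced on the host when generating the table. The sketch below is plain C++ under stated assumptions: `duplicate_per_bank` is a hypothetical helper, not part of the AIE API, and it reads "repeated and interleaved at bank width granularity" as emitting each bank-width chunk of values twice in a row. It does not handle the alignment of the resulting buffer; consult the aie::lut memory layout requirements before relying on it.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: emits every bank-width chunk of `values` twice in a row.
// With a 128b (16 byte) bank and 32b values, a chunk holds 4 values:
//   v0 v1 v2 v3 | v0 v1 v2 v3 | v4 v5 v6 v7 | v4 v5 v6 v7 | ...
template <typename T>
std::vector<T> duplicate_per_bank(const std::vector<T>& values,
                                  std::size_t bank_bytes = 16)
{
    const std::size_t per_bank = bank_bytes / sizeof(T);
    assert(per_bank > 0);
    std::vector<T> out;
    out.reserve(values.size() * 2);
    for (std::size_t start = 0; start < values.size(); start += per_bank) {
        const std::size_t end = std::min(start + per_bank, values.size());
        const auto first = values.begin() + static_cast<std::ptrdiff_t>(start);
        const auto last = values.begin() + static_cast<std::ptrdiff_t>(end);
        for (int copy = 0; copy < 2; ++copy) {
            out.insert(out.end(), first, last);
        }
    }
    return out;
}
```

Under this reading, the buffers behind LUT_ab and LUT_cd would each hold the output of `duplicate_per_bank` for the same source values, giving four copies in total, while LUT_elems remains the size of the original, unrepeated table.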
aie::lut< ParallelAccesses, OffsetType, SlopeType >::lut (unsigned LUT_elems, const void *LUT_ab) |
inline
Constructor for two parallel accesses.
For example, with a 128b bank width:
LUT_elems | Number of elements in the LUT (not accounting for repetition). |
LUT_ab | Two copies of the data, with the values interleaved at bank width granularity. |
aie::lut< ParallelAccesses, OffsetType, SlopeType >::lut (unsigned LUT_elems, const void *LUT_a) |
inline
Constructor for singular access.
For example,
LUT_elems | Number of elements in the LUT. |
LUT_a | Pointer to the LUT values. |
struct aie::parallel_lookup |
Type with functionality to directly index a LUT based on input vector of values. The number of achieved lookups per cycle is determined by the aie::lut object that encapsulates the contents of the lookup table. Refer to aie::lut for more details.
Real signed and unsigned integer types (>=8b) are supported as indices. All types (>=8b) are supported as value types, including bfloat16, real, and complex types.
T | Type of the input vector, containing values used to index the lookup table. |
MyLUT | Definition of the LUT type, using the lut type. |
oor_policy | Defines the "out of range policy" for when index values on the input go beyond the size of the LUT. It can either saturate, taking on the min/max valid index, or truncate, retaining the lower bits for unsigned indices or wrapping in the interval [-bias,lut_size-bias) for signed indices. Saturating is the default behaviour, but for certain non-linear functions which repeat after an interval, truncation may be required. |
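The difference between the two policies can be shown with a scalar model. This is a minimal sketch in plain C++: `OorPolicy` and `resolve_index` are hypothetical stand-ins rather than AIE API names, and the index arithmetic (shift by step_bits, then add bias) is an assumption based on the parameter descriptions in this section. Right-shifting a negative value is assumed to be an arithmetic shift.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the out-of-range policy; not the AIE API type.
enum class OorPolicy { Saturate, Truncate };

// Maps an input value to a position in [0, lut_size).
inline std::int32_t resolve_index(std::int32_t input, std::int32_t lut_size,
                                  unsigned step_bits, std::int32_t bias,
                                  OorPolicy policy)
{
    assert(lut_size > 0);
    const std::int32_t position = (input >> step_bits) + bias;
    if (policy == OorPolicy::Saturate) {
        // Clamp to the first or last valid entry.
        return std::max<std::int32_t>(0, std::min<std::int32_t>(position, lut_size - 1));
    }
    // Truncate: wrap around the table. For a power-of-two lut_size and an
    // unsigned index this is the same as keeping the lower bits.
    const std::int32_t wrapped = position % lut_size;
    return wrapped < 0 ? wrapped + lut_size : wrapped;
}
```

With lut_size = 8 and bias = 4, an input of 10 resolves to entry 7 when saturating and to entry 6 when truncating, which is why truncation suits periodic functions where an out-of-range input should map back into the stored interval.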
Public Member Functions | |
template<typename U = T> requires (std::is_unsigned_v<T>) | |
parallel_lookup (const MyLUT &l, unsigned step_bits=0) | |
Constructor for unsigned input types, configures aspects of how the lookup is performed. | |
template<typename U = T> requires (std::is_signed_v<T>) | |
parallel_lookup (const MyLUT &l, unsigned step_bits=0, unsigned bias=0) | |
Constructor for signed input types, configures aspects of how the lookup is performed. | |
template<Vector Vec, unsigned N = Vec::size()> | |
vector< typename MyLUT::offset_type, N > | fetch (const Vec &input) |
Accesses the lookup table based on the provided input values, will return a vector of the same number of elements as the input vector. | |
template<unsigned N, Vector Vec> | |
vector< typename MyLUT::offset_type, N > | fetch (const Vec &input) |
Accesses the lookup table based on the provided input values. | |
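template<typename U = T> requires (std::is_signed_v<T>) aie::parallel_lookup< T, MyLUT, oor_policy >::parallel_lookup (const MyLUT &l, unsigned step_bits=0, unsigned bias=0) |
inline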
Constructor for signed input types, configures aspects of how the lookup is performed.
Note that usage of step_bits requires either the low step_bits bits of the input values to be zeroed or the rounding mode to be set to aie::rounding_mode::floor.
l | LUT containing the stored values used for the lookup. |
step_bits | Optional lower bits that will be ignored for indexing the LUT. |
bias | Optional offset added to the input values used to index, for example to center on 0 by adding half the number of LUT elements. This value, if supplied, must be a power of 2. |
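template<typename U = T> requires (std::is_unsigned_v<T>) aie::parallel_lookup< T, MyLUT, oor_policy >::parallel_lookup (const MyLUT &l, unsigned step_bits=0) |
inline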
Constructor for unsigned input types, configures aspects of how the lookup is performed.
Note that usage of step_bits requires either the low step_bits bits of the input values to be zeroed or the rounding mode to be set to aie::rounding_mode::floor.
l | LUT containing the stored values used for the lookup. |
step_bits | Optional lower bits that will be ignored for indexing the LUT. |
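template<Vector Vec, unsigned N = Vec::size()> vector< typename MyLUT::offset_type, N > aie::parallel_lookup< T, MyLUT, oor_policy >::fetch (const Vec &input) |
inline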
Accesses the lookup table based on the provided input values, will return a vector of the same number of elements as the input vector.
Input values are interpreted from MSB to LSB: headroom | LUT elements | step_bits
Note that the step_bits are required to be zeroed if the rounding mode is set to anything other than aie::rounding_mode::floor.
input | Vector of input values that are used to index the look-up table. |
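template<unsigned N, Vector Vec> vector< typename MyLUT::offset_type, N > aie::parallel_lookup< T, MyLUT, oor_policy >::fetch (const Vec &input) |
inline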
Accesses the lookup table based on the provided input values.
This overload allows the size of the returned vector to be specified as a template parameter. This may be required when mapping small index types to large value types, as a direct mapping may not be valid. For example, mapping int8 to cint32 on a given architecture may require input to be 16 elements. fetch(input) would therefore deduce a return type of aie::vector<cint32, 16>, which may be unsupported. However, returning aie::vector<cint32, 8> by calling fetch<8>(input) may be valid.
Input values are interpreted from MSB to LSB: headroom | LUT elements | step_bits
Note that the step_bits are required to be zeroed if the rounding mode is set to anything other than aie::rounding_mode::floor.
N | The number of elements to look up, which may be less than the input vector size. |
input | Vector of input values that are used to index the look-up table. |