AI Engine API User Guide (AIE) 2022.1
Fast Fourier Transform (FFT)

Overview

The AIE API offers the aie::fft_dit class template that provides the building blocks to implement all the stages of the FFT decimation-in-time computation. This class template is parametrized with the data type, the radix to be used and the vectorization of the FFT stage to be computed. The resulting class defines a stage iterator that contains the state of the computation (pointers to the input and twiddle arrays), and a function that performs the decimation in time.

By default the input and output type are the same, but you can optionally specify a second type for a different output type.

The following snippet shows the sample implementation of a radix-2 FFT stage for any vectorization value.

template <unsigned Vectorization>
void radix2_dit(const cint32 * __restrict x,
const cint16 * __restrict tw,
unsigned n, unsigned shift_tw, unsigned shift, bool inv,
cint32 * __restrict y)
{
using FFT = aie::fft_dit<Vectorization, 2, cint32>; // radix = 2, input_type = cint32, output_type = cint32 (output defaults to input)
FFT fft;
auto it_stage = fft.begin_stage(x, tw);
auto it_out0 = aie::begin_vector<FFT::out_vector_size>(y);
auto it_out1 = aie::begin_vector<FFT::out_vector_size>(y + n / 2);
for (int j = 0; j < n / (2 * FFT::out_vector_size); ++j)
chess_prepare_for_pipelining
chess_loop_range(1,)
{
const auto out = fft.dit(*it_stage++, shift_tw, shift, inv);
*it_out0++ = out[0];
*it_out1++ = out[1];
}
}
auto inv(E a)
Definition: aie.hpp:7513
shift_bits< T, type_bits_v< T >, Elems > shift
Definition: shift.hpp:119
Definition: aie_declaration.hpp:85

Typedefs

template<unsigned Vectorization, unsigned Radix, typename T1 , typename T2 = T1>
using aie::fft_dit = detail::fft_dit< Vectorization, detail::fft_get_stage< Vectorization, Radix, T1, T2 >(), Radix, T1, T2 >
  More...
 

Supported Fast Fourier Transform Modes

Supported FFT/IFFT modes
Data typeRadix
c16b 2
c32b 2
c32b 4
cfloat 2
For a given radix R with n number of points, there will be R number of outputs spaced by n/R. The above example shows 2 outputs for a radix 2, for radix 4 we would have:
template <unsigned Vectorization>
void radix4_dit(const cint32 * __restrict x,
const cint16 * __restrict tw1,
const cint16 * __restrict tw2,
unsigned n, unsigned shift_tw, unsigned shift, bool inv,
cint32 * __restrict y)
{
FFT fft;
auto it_stage = fft.begin_stage(x, tw1, tw2);
auto it_out0 = aie::begin_restrict_vector<FFT::out_vector_size>(y);
auto it_out1 = aie::begin_restrict_vector<FFT::out_vector_size>(y + n / 4);
auto it_out2 = aie::begin_restrict_vector<FFT::out_vector_size>(y + 2*n / 4);
auto it_out3 = aie::begin_restrict_vector<FFT::out_vector_size>(y + 3*n / 4);
chess_report(FFT::out_vector_size);
for (int j = 0; j < n / (4 * FFT::out_vector_size); ++j)
chess_prepare_for_pipelining
chess_loop_range(1,)
{
const auto out = fft.dit(*it_stage++, shift_tw, shift, inv);
*it_out0++ = out[0];
*it_out1++ = out[1];
*it_out2++ = out[2];
*it_out3++ = out[3];
}
}
However you can also use less with different incrementation schemes to achieve the same result. This approach is currently recommended for optimal performance on c32b radix 4 for Vectorization > 1 due to internal pointer requirements:
template <unsigned Vectorization>
void radix4_dit(const cint32 * __restrict x,
const cint16 * __restrict tw1,
const cint16 * __restrict tw2,
unsigned n, unsigned shift_tw, unsigned shift, bool inv,
cint32 * __restrict y)
{
FFT fft;
auto it_stage = fft.begin_stage(x, tw1, tw2);
auto it_out0 = aie::begin_restrict_vector<FFT::out_vector_size>(y);
auto it_out1 = aie::begin_restrict_vector<FFT::out_vector_size>(y + n / 2);
for (int j = 0; j < n / (4 * FFT::out_vector_size); ++j)
chess_prepare_for_pipelining
chess_loop_range(1,)
{
const auto out = fft.dit(*it_stage++, shift_tw, shift, inv);
*it_out0 = out[0]; it_out0 += n/16;
*it_out0 = out[1]; it_out0 += -(int)(n/16)+1;
*it_out1 = out[2]; it_out1 += n/16;
*it_out1 = out[3]; it_out1 += -(int)(n/16)+1;
}
}
Going to/from cint16/cint32 is also supported. Here is an example of a radix 2 downscaling from cint32 to cint16:
template <unsigned Vectorization>
void radix2_down_dit(const cint32_t * __restrict x,
const cint16_t * __restrict tw,
int n, int shift_tw, int shift, bool inv,
cint16_t * __restrict y)
{
FFT fft;
auto it_stage = fft.begin_stage(x, tw);
auto it_out0 = aie::begin_vector<FFT::out_vector_size>(y);
auto it_out1 = aie::begin_vector<FFT::out_vector_size>(y + n / 2);
for (int j = 0; j < n / (2 * FFT::out_vector_size); ++j)
chess_prepare_for_pipelining
chess_loop_range(1,)
{
const auto out = fft.dit(*it_stage++, shift_tw, shift, inv);
*it_out0++ = out[0];
*it_out1++ = out[1];
}
}
cint16 cint16_t
Definition: types.hpp:197
cint32 cint32_t
Definition: types.hpp:198

Typedef Documentation

◆ fft_dit

template<unsigned Vectorization, unsigned Radix, typename T1 , typename T2 = T1>
using aie::fft_dit = typedef detail::fft_dit<Vectorization, detail::fft_get_stage<Vectorization, Radix, T1, T2>(), Radix, T1, T2>

Type that encapsulates the functionality for decimation-in-time FFTs.

Template Parameters
VectorizationVectorization of the FFT stage
RadixNumber which selects the FFT radix.
T1Type of the input elements.
T2Type of the output elements, defaults to input type.