Overview

The AIE API offers the aie::fft_dit class template that provides the building blocks to implement all the stages of the FFT decimation-in-time computation. This class template is parametrized with the data type, the radix to be used and the vectorization of the FFT stage to be computed. The resulting class defines a stage iterator that contains the state of the computation (pointers to the input and twiddle arrays), and a function that performs the decimation in time.

By default the input and output type are the same, but you can optionally specify a second type for a different output type.

The following snippet shows the sample implementation of a radix-2 FFT stage for any vectorization value.

template <unsigned Vectorization>
void radix2_dit(const cint32 * __restrict x,
                const cint16 * __restrict tw,
                unsigned n,  unsigned shift_tw, unsigned shift, bool inv,
                cint32 * __restrict y)
{
    using FFT = aie::fft_dit<Vectorization, 2, cint32>; // radix = 2, input_type = cint32, output_type = cint32 (output defaults to input)
 
    FFT fft;
 
    auto it_stage  = fft.begin_stage(x, tw);
    auto it_out0 = aie::begin_vector<FFT::out_vector_size>(y);
    auto it_out1 = aie::begin_vector<FFT::out_vector_size>(y + n / 2);
 
    for (int j = 0; j < n / (2 * FFT::out_vector_size); ++j)
        chess_prepare_for_pipelining
        chess_loop_range(1,)
    {
        const auto out = fft.dit(*it_stage++, shift_tw, shift, inv);
        *it_out0++ = out[0];
        *it_out1++ = out[1];
    }
}

Typedefs
template<unsigned Vectorization, unsigned Radix, typename T1 , typename T2 = T1>
using	aie::fft_dit = detail::fft_dit< Vectorization, detail::fft_get_stage< Vectorization, Radix, T1, T2 >(), Radix, T1, T2 >
	More...

Supported Fast Fourier Transform Modes

Supported FFT/IFFT modes
Data type	Radix
c16b	2
c32b	2
c32b	4
cfloat	2

For a given radix R with n number of points, there will be R number of outputs spaced by n/R. The above example shows 2 outputs for a radix 2, for radix 4 we would have:

template <unsigned Vectorization>
void radix4_dit(const cint32 * __restrict x,
                const cint16 * __restrict tw1,
                const cint16 * __restrict tw2,
                unsigned n, unsigned shift_tw, unsigned shift, bool inv,
                cint32 * __restrict y)
{
    using FFT = aie::fft_dit<Vectorization, 4, cint32>;
 
    FFT fft;
 
    auto it_stage  = fft.begin_stage(x, tw1, tw2);
 
    auto it_out0 = aie::begin_restrict_vector<FFT::out_vector_size>(y);
    auto it_out1 = aie::begin_restrict_vector<FFT::out_vector_size>(y +   n / 4);
    auto it_out2 = aie::begin_restrict_vector<FFT::out_vector_size>(y + 2*n / 4);
    auto it_out3 = aie::begin_restrict_vector<FFT::out_vector_size>(y + 3*n / 4);
 
    chess_report(FFT::out_vector_size);
    for (int j = 0; j < n / (4 * FFT::out_vector_size); ++j)
        chess_prepare_for_pipelining
        chess_loop_range(1,)
    {
        const auto out = fft.dit(*it_stage++, shift_tw, shift, inv);
        *it_out0++ = out[0];
        *it_out1++ = out[1];
        *it_out2++ = out[2];
        *it_out3++ = out[3];
    }
}

However you can also use less with different incrementation schemes to achieve the same result. This approach is currently recommended for optimal performance on c32b radix 4 for Vectorization > 1 due to internal pointer requirements:

template <unsigned Vectorization>
void radix4_dit(const cint32 * __restrict x,
                const cint16 * __restrict tw1,
                const cint16 * __restrict tw2,
                unsigned n, unsigned shift_tw, unsigned shift, bool inv,
                cint32 * __restrict y)
{
    using FFT = aie::fft_dit<Vectorization, 4, cint32>;
 
    FFT fft;
 
    auto it_stage  = fft.begin_stage(x, tw1, tw2);
 
    auto it_out0 = aie::begin_restrict_vector<FFT::out_vector_size>(y);
    auto it_out1 = aie::begin_restrict_vector<FFT::out_vector_size>(y + n / 2);
 
    for (int j = 0; j < n / (4 * FFT::out_vector_size); ++j)
        chess_prepare_for_pipelining
        chess_loop_range(1,)
    {
        const auto out = fft.dit(*it_stage++, shift_tw, shift, inv);
        *it_out0 = out[0]; it_out0 +=        n/16;
        *it_out0 = out[1]; it_out0 += -(int)(n/16)+1;
        *it_out1 = out[2]; it_out1 +=        n/16;
        *it_out1 = out[3]; it_out1 += -(int)(n/16)+1;
    }
}

Going to/from cint16/cint32 is also supported. Here is an example of a radix 2 downscaling from cint32 to cint16:

template <unsigned Vectorization>
void radix2_down_dit(const cint32_t * __restrict x,
                     const cint16_t * __restrict tw,
                     int n, int shift_tw, int shift, bool inv,
                     cint16_t * __restrict y)
{
    using FFT = aie::fft_dit<Vectorization, 2, cint32, cint16>;
 
    FFT fft;
 
    auto it_stage  = fft.begin_stage(x, tw);
    auto it_out0 = aie::begin_vector<FFT::out_vector_size>(y);
    auto it_out1 = aie::begin_vector<FFT::out_vector_size>(y + n / 2);
 
    for (int j = 0; j < n / (2 * FFT::out_vector_size); ++j)
        chess_prepare_for_pipelining
        chess_loop_range(1,)
    {
        const auto out = fft.dit(*it_stage++, shift_tw, shift, inv);
        *it_out0++ = out[0];
        *it_out1++ = out[1];
    }
}

Typedef Documentation

◆ fft_dit

template<unsigned Vectorization, unsigned Radix, typename T1 , typename T2 = T1>

using aie::fft_dit = typedef detail::fft_dit<Vectorization, detail::fft_get_stage<Vectorization, Radix, T1, T2>(), Radix, T1, T2>

Type that encapsulates the functionality for decimation-in-time FFTs.

Template Parameters

Vectorization	Vectorization of the FFT stage
Radix	Number which selects the FFT radix.
T1	Type of the input elements.
T2	Type of the output elements, defaults to input type.

Overview

Typedefs

Supported Fast Fourier Transform Modes

Typedef Documentation

◆ fft_dit