AI Engine-ML Intrinsics User Guide (v2024.2)
Elementwise multiplication and matrix multiplication of fp32 data, emulated on the bfloat16 datapath. Two options are available: these intrinsics can be used with or without a preceding set_rnd(0), which selects truncation. Define the AIE_FP32_EMULATION_SET_RND_MODE flag to set the rounding mode to truncation. For an explanation of how these operations work, see Multiply Accumulate.
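The flag-based selection described on this page can be pictured as a compile-time dispatch. The following is a minimal host-side sketch, not the actual AIE header code: the `sketch` namespace and the label-returning stand-in functions are hypothetical, and only the macro names are taken from this page. The page does not say what happens when both flags are defined, so checking the FAST flag first is an arbitrary assumption.

```cpp
#include <cassert>
#include <string>

namespace sketch {

// Hypothetical stand-ins for the three accuracy variants. Each returns a
// label instead of a vector so the dispatch can be observed on a host.
inline std::string mul_elem_16_accuracy_safe() { return "safe"; }
inline std::string mul_elem_16_accuracy_fast() { return "fast"; }
inline std::string mul_elem_16_accuracy_low()  { return "low"; }

// Generic name: maps to one variant depending on which flag is defined.
// With no flag defined, the accuracy_safe variant is used.
inline std::string mul_elem_16() {
#if defined(AIE_FP32_EMULATION_ACCURACY_FAST)
    return mul_elem_16_accuracy_fast();   // assumed precedence, see lead-in
#elif defined(AIE_FP32_EMULATION_ACCURACY_LOW)
    return mul_elem_16_accuracy_low();
#else
    return mul_elem_16_accuracy_safe();
#endif
}

}  // namespace sketch
```

Compiled without either flag, `sketch::mul_elem_16()` returns `"safe"`; adding `-DAIE_FP32_EMULATION_ACCURACY_FAST` or `-DAIE_FP32_EMULATION_ACCURACY_LOW` to the compiler command line switches the mapping.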
Element-wise multiplication using bf16 data-path | |
v16accfloat | mul_elem_16 (v16float v1, v16float v2) |
Elementwise multiplication of fp32 data elements (emulation using bf16 datapath). By default, mul_elem_16 behaves the same as mul_elem_16_accuracy_safe (all mantissa bits are used and the MAC outputs of the least significant terms are not discarded). Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map mul_elem_16 to mul_elem_16_accuracy_fast. Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map mul_elem_16 to mul_elem_16_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate. | |
v16accfloat | mul_elem_16_accuracy_low (v16float v1, v16float v2) |
Elementwise multiplication of fp32 data elements (emulation using bf16 datapath). The fp32 mantissa is extracted as 2 bfloat16 numbers, giving 4 MAC operations per output lane; the last MAC operation, involving the LSBs, is skipped to improve cycle count. Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map mul_elem_16 to mul_elem_16_accuracy_low. | |
v16accfloat | mul_elem_16_accuracy_fast (v16float v1, v16float v2) |
Elementwise multiplication of fp32 data elements (emulation using bf16 datapath). Each fp32 input is split into 3 bfloat16 numbers, giving 9 MAC operations (3*3) per output lane; the results of the 3 least significant MAC operations are ignored to reduce cycle count. Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map mul_elem_16 to mul_elem_16_accuracy_fast. | |
v16accfloat | mul_elem_16_accuracy_safe (v16float v1, v16float v2) |
Elementwise multiplication of fp32 data elements (emulation using bf16 datapath). Each fp32 input is split into 3 bfloat16 numbers, giving 9 MAC operations (3*3) per lane (v1.lane0 * v2.lane0). By default, mul_elem_16 behaves the same as mul_elem_16_accuracy_safe. | |
v8caccfloat | mul_elem_8 (v8float v1, v8cfloat v2) |
Elementwise multiplication of fp32 and cfloat data elements (emulation using bf16 datapath). By default, mul_elem_8 behaves the same as mul_elem_8_accuracy_safe (all mantissa bits are used and the MAC outputs of the least significant terms are not discarded). Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map mul_elem_8 to mul_elem_8_accuracy_fast. Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map mul_elem_8 to mul_elem_8_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate. | |
v8caccfloat | mul_elem_8 (v8cfloat v1, v8float v2) |
Elementwise multiplication of fp32 and cfloat data elements (emulation using bf16 datapath). By default, mul_elem_8 behaves the same as mul_elem_8_accuracy_safe (all mantissa bits are used and the MAC outputs of the least significant terms are not discarded). Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map mul_elem_8 to mul_elem_8_accuracy_fast. Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map mul_elem_8 to mul_elem_8_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate. | |
v8caccfloat | mul_elem_8 (v8cfloat v1, v8cfloat v2) |
Elementwise multiplication of cfloat data elements (emulation using bf16 datapath). By default, mul_elem_8 behaves the same as mul_elem_8_accuracy_safe (all mantissa bits are used and the MAC outputs of the least significant terms are not discarded). Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map mul_elem_8 to mul_elem_8_accuracy_fast. Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map mul_elem_8 to mul_elem_8_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate. | |
v8caccfloat | mul_elem_8_accuracy_low (v8float v1, v8cfloat v2) |
Elementwise multiplication of fp32 and cfloat data elements (emulation using bf16 datapath). The fp32 mantissa is extracted as 2 bfloat16 numbers, giving 4 MAC operations per output lane; the last MAC operation, involving the LSBs, is skipped to improve cycle count. Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map mul_elem_8 to mul_elem_8_accuracy_low. | |
v8caccfloat | mul_elem_8_accuracy_low (v8cfloat v1, v8float v2) |
Elementwise multiplication of fp32 and cfloat data elements (emulation using bf16 datapath), low-accuracy variant. Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map mul_elem_8 to mul_elem_8_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate. | |
v8caccfloat | mul_elem_8_accuracy_low (v8cfloat v1, v8cfloat v2) |
Elementwise multiplication of cfloat data elements (emulation using bf16 datapath), low-accuracy variant. Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map mul_elem_8 to mul_elem_8_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate. | |
v8caccfloat | mul_elem_8_accuracy_fast (v8float v1, v8cfloat v2) |
Elementwise multiplication of fp32 and cfloat data elements (emulation using bf16 datapath). Each fp32 input is split into 3 bfloat16 numbers, giving 9 MAC operations (3*3) per output lane; the results of the 3 least significant MAC operations are ignored to reduce cycle count. Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map mul_elem_8 to mul_elem_8_accuracy_fast. | |
v8caccfloat | mul_elem_8_accuracy_fast (v8cfloat v1, v8float v2) |
Elementwise multiplication of fp32 and cfloat data elements (emulation using bf16 datapath), fast variant. Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map mul_elem_8 to mul_elem_8_accuracy_fast. For an explanation of how these operations work, see Multiply Accumulate. | |
v8caccfloat | mul_elem_8_accuracy_fast (v8cfloat v1, v8cfloat v2) |
Elementwise multiplication of cfloat data elements (emulation using bf16 datapath), fast variant. Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map mul_elem_8 to mul_elem_8_accuracy_fast. For an explanation of how these operations work, see Multiply Accumulate. | |
v8caccfloat | mul_elem_8_accuracy_safe (v8float v1, v8cfloat v2) |
Elementwise multiplication of fp32 and cfloat data elements (emulation using bf16 datapath). Each fp32 input is split into 3 bfloat16 numbers, giving 9 MAC operations (3*3) per lane (v1.lane0 * v2.lane0). By default, mul_elem_8 behaves the same as mul_elem_8_accuracy_safe. | |
v8caccfloat | mul_elem_8_accuracy_safe (v8cfloat v1, v8float v2) |
Elementwise multiplication of fp32 and cfloat data elements (emulation using bf16 datapath), accuracy-safe variant: all mantissa bits are used and the MAC outputs of the least significant terms are not discarded. By default, mul_elem_8 behaves the same as mul_elem_8_accuracy_safe. For an explanation of how these operations work, see Multiply Accumulate. | |
v8caccfloat | mul_elem_8_accuracy_safe (v8cfloat v1, v8cfloat v2) |
Elementwise multiplication of cfloat data elements (emulation using bf16 datapath), accuracy-safe variant: all mantissa bits are used and the MAC outputs of the least significant terms are not discarded. By default, mul_elem_8 behaves the same as mul_elem_8_accuracy_safe. For an explanation of how these operations work, see Multiply Accumulate. | |
v16accfloat | negmul_elem_16 (v16float v1, v16float v2) |
Elementwise multiplication of fp32 data elements (emulation using bf16 datapath) and negation of the result. By default, negmul_elem_16 behaves the same as neg(mul_elem_16_accuracy_safe). Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map negmul_elem_16 to negmul_elem_16_accuracy_fast. Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map negmul_elem_16 to negmul_elem_16_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate. | |
v8caccfloat | negmul_elem_8 (v8float v1, v8cfloat v2) |
Elementwise multiplication of fp32 and cfloat data elements (emulation using bf16 datapath) and negation of the result. By default, negmul_elem_8 behaves the same as neg(mul_elem_8_accuracy_safe). Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map negmul_elem_8 to negmul_elem_8_accuracy_fast. Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map negmul_elem_8 to negmul_elem_8_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate. | |
v8caccfloat | negmul_elem_8 (v8cfloat v1, v8float v2) |
Elementwise multiplication of fp32 and cfloat data elements (emulation using bf16 datapath) and negation of the result. By default, negmul_elem_8 behaves the same as neg(mul_elem_8_accuracy_safe). Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map negmul_elem_8 to negmul_elem_8_accuracy_fast. Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map negmul_elem_8 to negmul_elem_8_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate. | |
v8caccfloat | negmul_elem_8 (v8cfloat v1, v8cfloat v2) |
Elementwise multiplication of cfloat data elements (emulation using bf16 datapath) and negation of the result. By default, negmul_elem_8 behaves the same as neg(mul_elem_8_accuracy_safe). Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map negmul_elem_8 to negmul_elem_8_accuracy_fast. Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map negmul_elem_8 to negmul_elem_8_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate. | |
v16accfloat | negmul_elem_16_accuracy_low (v16float v1, v16float v2) |
Elementwise multiplication of fp32 data elements (emulation using bf16 datapath) and negation of the result. The fp32 mantissa is extracted as 2 bfloat16 numbers, giving 4 MAC operations per output lane; the last MAC operation, involving the LSBs, is skipped to improve cycle count. Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map negmul_elem_16 to negmul_elem_16_accuracy_low. | |
v8caccfloat | negmul_elem_8_accuracy_low (v8float v1, v8cfloat v2) |
Elementwise multiplication of fp32 and cfloat data elements (emulation using bf16 datapath) and negation of the result. The fp32 mantissa is extracted as 2 bfloat16 numbers, giving 4 MAC operations per output lane; the last MAC operation, involving the LSBs, is skipped to improve cycle count. Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map negmul_elem_8 to negmul_elem_8_accuracy_low. | |
v8caccfloat | negmul_elem_8_accuracy_low (v8cfloat v1, v8float v2) |
Elementwise multiplication of fp32 and cfloat data elements (emulation using bf16 datapath) and negation of the result. The fp32 mantissa is extracted as 2 bfloat16 numbers, giving 4 MAC operations per output lane; the last MAC operation, involving the LSBs, is skipped to improve cycle count. Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map negmul_elem_8 to negmul_elem_8_accuracy_low. | |
v8caccfloat | negmul_elem_8_accuracy_low (v8cfloat v1, v8cfloat v2) |
Elementwise multiplication of cfloat data elements (emulation using bf16 datapath) and negation of the result. The fp32 mantissa is extracted as 2 bfloat16 numbers, giving 4 MAC operations per output lane; the last MAC operation, involving the LSBs, is skipped to improve cycle count. Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map negmul_elem_8 to negmul_elem_8_accuracy_low. | |
v16accfloat | negmul_elem_16_accuracy_fast (v16float v1, v16float v2) |
Elementwise multiplication of fp32 data elements (emulation using bf16 datapath) and negation of the result. Each fp32 input is split into 3 bfloat16 numbers, giving 9 MAC operations (3*3) per output lane; the results of the 3 least significant MAC operations are ignored to reduce cycle count. Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map negmul_elem_16 to negmul_elem_16_accuracy_fast. | |
v8caccfloat | negmul_elem_8_accuracy_fast (v8float v1, v8cfloat v2) |
Elementwise multiplication of fp32 and cfloat data elements (emulation using bf16 datapath) and negation of the result. Each fp32 input is split into 3 bfloat16 numbers, giving 9 MAC operations (3*3) per output lane; the results of the 3 least significant MAC operations are ignored to reduce cycle count. Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map negmul_elem_8 to negmul_elem_8_accuracy_fast. | |
v8caccfloat | negmul_elem_8_accuracy_fast (v8cfloat v1, v8float v2) |
Elementwise multiplication of fp32 and cfloat data elements (emulation using bf16 datapath) and negation of the result. Each fp32 input is split into 3 bfloat16 numbers, giving 9 MAC operations (3*3) per output lane; the results of the 3 least significant MAC operations are ignored to reduce cycle count. Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map negmul_elem_8 to negmul_elem_8_accuracy_fast. | |
v8caccfloat | negmul_elem_8_accuracy_fast (v8cfloat v1, v8cfloat v2) |
Elementwise multiplication of cfloat data elements (emulation using bf16 datapath) and negation of the result. Each fp32 input is split into 3 bfloat16 numbers, giving 9 MAC operations (3*3) per output lane; the results of the 3 least significant MAC operations are ignored to reduce cycle count. Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map negmul_elem_8 to negmul_elem_8_accuracy_fast. | |
v16accfloat | negmul_elem_16_accuracy_safe (v16float v1, v16float v2) |
Elementwise multiplication of fp32 data elements (emulation using bf16 datapath) and negation of the result. Each fp32 input is split into 3 bfloat16 numbers, giving 9 MAC operations (3*3) per lane (v1.lane0 * v2.lane0). By default, negmul_elem_16 behaves the same as negmul_elem_16_accuracy_safe. | |
v8caccfloat | negmul_elem_8_accuracy_safe (v8float v1, v8cfloat v2) |
Elementwise multiplication of fp32 and cfloat data elements (emulation using bf16 datapath) and negation of the result. Each fp32 input is split into 3 bfloat16 numbers, giving 9 MAC operations (3*3) per lane (v1.lane0 * v2.lane0). By default, negmul_elem_8 behaves the same as negmul_elem_8_accuracy_safe. | |
v8caccfloat | negmul_elem_8_accuracy_safe (v8cfloat v1, v8float v2) |
Elementwise multiplication of fp32 and cfloat data elements (emulation using bf16 datapath) and negation of the result. Each fp32 input is split into 3 bfloat16 numbers, giving 9 MAC operations (3*3) per lane (v1.lane0 * v2.lane0). By default, negmul_elem_8 behaves the same as negmul_elem_8_accuracy_safe. | |
v8caccfloat | negmul_elem_8_accuracy_safe (v8cfloat v1, v8cfloat v2) |
Elementwise multiplication of cfloat data elements (emulation using bf16 datapath) and negation of the result. Each fp32 input is split into 3 bfloat16 numbers, giving 9 MAC operations (3*3) per lane (v1.lane0 * v2.lane0). By default, negmul_elem_8 behaves the same as negmul_elem_8_accuracy_safe. | |
v16accfloat | mac_elem_16 (v16float v1, v16float v2, v16accfloat acc) |
Elementwise multiply-accumulate of fp32 data elements (emulation using bf16 datapath): the product is added to acc. By default, mac_elem_16 behaves the same as mac_elem_16_accuracy_safe. Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map mac_elem_16 to mac_elem_16_accuracy_fast. Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map mac_elem_16 to mac_elem_16_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate. | |
v8caccfloat | mac_elem_8 (v8float v1, v8cfloat v2, v8caccfloat acc) |
Elementwise multiply-accumulate of fp32 and cfloat data elements (emulation using bf16 datapath): the product is added to acc. By default, mac_elem_8 behaves the same as mac_elem_8_accuracy_safe. Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map mac_elem_8 to mac_elem_8_accuracy_fast. Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map mac_elem_8 to mac_elem_8_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate. | |
v8caccfloat | mac_elem_8 (v8cfloat v1, v8float v2, v8caccfloat acc) |
Elementwise multiply-accumulate of fp32 and cfloat data elements (emulation using bf16 datapath): the product is added to acc. By default, mac_elem_8 behaves the same as mac_elem_8_accuracy_safe. Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map mac_elem_8 to mac_elem_8_accuracy_fast. Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map mac_elem_8 to mac_elem_8_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate. | |
v8caccfloat | mac_elem_8 (v8cfloat v1, v8cfloat v2, v8caccfloat acc) |
Elementwise multiply-accumulate of cfloat data elements (emulation using bf16 datapath): the product is added to acc. By default, mac_elem_8 behaves the same as mac_elem_8_accuracy_safe. Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map mac_elem_8 to mac_elem_8_accuracy_fast. Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map mac_elem_8 to mac_elem_8_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate. | |
v16accfloat | mac_elem_16_accuracy_safe (v16float v1, v16float v2, v16accfloat acc) |
Elementwise multiply-accumulate of fp32 data elements (emulation using bf16 datapath). By default, mac_elem_16 behaves the same as mac_elem_16_accuracy_safe. For an explanation of how these operations work, see Multiply Accumulate. | |
v8caccfloat | mac_elem_8_accuracy_safe (v8float v1, v8cfloat v2, v8caccfloat acc) |
Elementwise multiply-accumulate of fp32 and cfloat data elements (emulation using bf16 datapath). By default, mac_elem_8 behaves the same as mac_elem_8_accuracy_safe. For an explanation of how these operations work, see Multiply Accumulate. | |
v8caccfloat | mac_elem_8_accuracy_safe (v8cfloat v1, v8float v2, v8caccfloat acc) |
Elementwise multiply-accumulate of fp32 and cfloat data elements (emulation using bf16 datapath). By default, mac_elem_8 behaves the same as mac_elem_8_accuracy_safe. For an explanation of how these operations work, see Multiply Accumulate. | |
v8caccfloat | mac_elem_8_accuracy_safe (v8cfloat v1, v8cfloat v2, v8caccfloat acc) |
Elementwise multiply-accumulate of cfloat data elements (emulation using bf16 datapath). By default, mac_elem_8 behaves the same as mac_elem_8_accuracy_safe. For an explanation of how these operations work, see Multiply Accumulate. | |
v16accfloat | mac_elem_16_accuracy_fast (v16float v1, v16float v2, v16accfloat acc) |
Elementwise multiply-accumulate of fp32 data elements (emulation using bf16 datapath). Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map mac_elem_16 to mac_elem_16_accuracy_fast. For an explanation of how these operations work, see Multiply Accumulate. | |
v8caccfloat | mac_elem_8_accuracy_fast (v8float v1, v8cfloat v2, v8caccfloat acc) |
Elementwise multiply-accumulate of fp32 and cfloat data elements (emulation using bf16 datapath). Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map mac_elem_8 to mac_elem_8_accuracy_fast. For an explanation of how these operations work, see Multiply Accumulate. | |
v8caccfloat | mac_elem_8_accuracy_fast (v8cfloat v1, v8float v2, v8caccfloat acc) |
Elementwise multiply-accumulate of fp32 and cfloat data elements (emulation using bf16 datapath). Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map mac_elem_8 to mac_elem_8_accuracy_fast. For an explanation of how these operations work, see Multiply Accumulate. | |
v8caccfloat | mac_elem_8_accuracy_fast (v8cfloat v1, v8cfloat v2, v8caccfloat acc) |
Elementwise multiply-accumulate of cfloat data elements (emulation using bf16 datapath). Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map mac_elem_8 to mac_elem_8_accuracy_fast. For an explanation of how these operations work, see Multiply Accumulate. | |
v16accfloat | mac_elem_16_accuracy_low (v16float v1, v16float v2, v16accfloat acc) |
Elementwise multiply-accumulate of fp32 data elements (emulation using bf16 datapath). Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map mac_elem_16 to mac_elem_16_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate. | |
v8caccfloat | mac_elem_8_accuracy_low (v8float v1, v8cfloat v2, v8caccfloat acc) |
Elementwise multiply-accumulate of fp32 and cfloat data elements (emulation using bf16 datapath). Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map mac_elem_8 to mac_elem_8_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate. | |
v8caccfloat | mac_elem_8_accuracy_low (v8cfloat v1, v8float v2, v8caccfloat acc) |
Elementwise multiply-accumulate of fp32 and cfloat data elements (emulation using bf16 datapath). Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map mac_elem_8 to mac_elem_8_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate. | |
v8caccfloat | mac_elem_8_accuracy_low (v8cfloat v1, v8cfloat v2, v8caccfloat acc) |
Elementwise multiply-accumulate of cfloat data elements (emulation using bf16 datapath). Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map mac_elem_8 to mac_elem_8_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate. | |
v16accfloat | addmac_elem_16 (v16float v1, v16float v2, v16accfloat acc1, v16accfloat acc2) |
Elementwise multiplication of fp32 data elements (emulation using bf16 datapath), with the product added to the sum of acc1 and acc2. By default, addmac_elem_16 behaves the same as addmac_elem_16_accuracy_safe. Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map addmac_elem_16 to addmac_elem_16_accuracy_fast. Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map addmac_elem_16 to addmac_elem_16_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate. | |
v16accfloat | addmac_elem_16_accuracy_safe (v16float v1, v16float v2, v16accfloat acc1, v16accfloat acc2) |
Elementwise multiplication of fp32 data elements (emulation using bf16 datapath), with the product added to the sum of acc1 and acc2. By default, addmac_elem_16 behaves the same as addmac_elem_16_accuracy_safe. For an explanation of how these operations work, see Multiply Accumulate. | |
v16accfloat | addmac_elem_16_accuracy_fast (v16float v1, v16float v2, v16accfloat acc1, v16accfloat acc2) |
Elementwise multiplication of fp32 data elements (emulation using bf16 datapath), with the product added to the sum of acc1 and acc2. Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map addmac_elem_16 to addmac_elem_16_accuracy_fast. For an explanation of how these operations work, see Multiply Accumulate. | |
v16accfloat | addmac_elem_16_accuracy_low (v16float v1, v16float v2, v16accfloat acc1, v16accfloat acc2) |
Elementwise multiplication of fp32 data elements (emulation using bf16 datapath), with the product added to the sum of acc1 and acc2. Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map addmac_elem_16 to addmac_elem_16_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate. | |
v16accfloat | msc_elem_16 (v16float v1, v16float v2, v16accfloat acc) |
Elementwise multiply-subtract of fp32 data elements (emulation using bf16 datapath): the product is subtracted from acc. By default, msc_elem_16 behaves the same as msc_elem_16_accuracy_safe. Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map msc_elem_16 to msc_elem_16_accuracy_fast. Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map msc_elem_16 to msc_elem_16_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate. | |
v8caccfloat | msc_elem_8 (v8float v1, v8cfloat v2, v8caccfloat acc) |
Elementwise multiply-subtract of fp32 and cfloat data elements (emulation using bf16 datapath): the product is subtracted from acc. By default, msc_elem_8 behaves the same as msc_elem_8_accuracy_safe. Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map msc_elem_8 to msc_elem_8_accuracy_fast. Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map msc_elem_8 to msc_elem_8_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate. | |
v8caccfloat | msc_elem_8 (v8cfloat v1, v8float v2, v8caccfloat acc) |
Elementwise multiply-subtract of fp32 and cfloat data elements (emulation using bf16 datapath): the product is subtracted from acc. By default, msc_elem_8 behaves the same as msc_elem_8_accuracy_safe. Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map msc_elem_8 to msc_elem_8_accuracy_fast. Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map msc_elem_8 to msc_elem_8_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate. | |
v8caccfloat | msc_elem_8 (v8cfloat v1, v8cfloat v2, v8caccfloat acc) |
Elementwise multiply-subtract of cfloat data elements (emulation using bf16 datapath): the product is subtracted from acc. By default, msc_elem_8 behaves the same as msc_elem_8_accuracy_safe. Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map msc_elem_8 to msc_elem_8_accuracy_fast. Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map msc_elem_8 to msc_elem_8_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate. | |
v16accfloat | msc_elem_16_accuracy_safe (v16float v1, v16float v2, v16accfloat acc) |
Elementwise multiply-subtract of fp32 data elements (emulation using bf16 datapath). By default, msc_elem_16 behaves the same as msc_elem_16_accuracy_safe. For an explanation of how these operations work, see Multiply Accumulate. | |
v8caccfloat | msc_elem_8_accuracy_safe (v8float v1, v8cfloat v2, v8caccfloat acc) |
Elementwise multiply-subtract of fp32 and cfloat data elements (emulation using bf16 datapath). By default, msc_elem_8 behaves the same as msc_elem_8_accuracy_safe. For an explanation of how these operations work, see Multiply Accumulate. | |
v8caccfloat | msc_elem_8_accuracy_safe (v8cfloat v1, v8float v2, v8caccfloat acc) |
Elementwise multiply-subtract of fp32 and cfloat data elements (emulation using bf16 datapath). By default, msc_elem_8 behaves the same as msc_elem_8_accuracy_safe. For an explanation of how these operations work, see Multiply Accumulate. | |
v8caccfloat | msc_elem_8_accuracy_safe (v8cfloat v1, v8cfloat v2, v8caccfloat acc) |
Elementwise multiply-subtract of cfloat data elements (emulation using bf16 datapath). By default, msc_elem_8 behaves the same as msc_elem_8_accuracy_safe. For an explanation of how these operations work, see Multiply Accumulate. | |
v16accfloat | msc_elem_16_accuracy_fast (v16float v1, v16float v2, v16accfloat acc) |
Elementwise multiply-subtract of fp32 data elements (emulation using bf16 datapath). Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map msc_elem_16 to msc_elem_16_accuracy_fast. For an explanation of how these operations work, see Multiply Accumulate. | |
v8caccfloat | msc_elem_8_accuracy_fast (v8float v1, v8cfloat v2, v8caccfloat acc) |
Elementwise multiply-subtract of fp32 and cfloat data elements (emulation using bf16 datapath). Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map msc_elem_8 to msc_elem_8_accuracy_fast. For an explanation of how these operations work, see Multiply Accumulate. | |
v8caccfloat | msc_elem_8_accuracy_fast (v8cfloat v1, v8float v2, v8caccfloat acc) |
Elementwise multiply-subtract of fp32 and cfloat data elements (emulation using bf16 datapath). Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map msc_elem_8 to msc_elem_8_accuracy_fast. For an explanation of how these operations work, see Multiply Accumulate. | |
v8caccfloat | msc_elem_8_accuracy_fast (v8cfloat v1, v8cfloat v2, v8caccfloat acc) |
Elementwise multiply-subtract of cfloat data elements (emulation using bf16 datapath). Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map msc_elem_8 to msc_elem_8_accuracy_fast. For an explanation of how these operations work, see Multiply Accumulate. | |
v16accfloat | msc_elem_16_accuracy_low (v16float v1, v16float v2, v16accfloat acc) |
Elementwise multiply-subtract of fp32 data elements (emulation using bf16 datapath). Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map msc_elem_16 to msc_elem_16_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate. | |
v8caccfloat | msc_elem_8_accuracy_low (v8float v1, v8cfloat v2, v8caccfloat acc) |
Elementwise multiply-subtract of fp32 and cfloat data elements (emulation using bf16 datapath). Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map msc_elem_8 to msc_elem_8_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate. | |
v8caccfloat | msc_elem_8_accuracy_low (v8cfloat v1, v8float v2, v8caccfloat acc) |
Elementwise multiply-subtract of fp32 and cfloat data elements (emulation using bf16 datapath). Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map msc_elem_8 to msc_elem_8_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate. | |
v8caccfloat | msc_elem_8_accuracy_low (v8cfloat v1, v8cfloat v2, v8caccfloat acc) |
Elementwise multiply-subtract of cfloat data elements (emulation using bf16 datapath). Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map msc_elem_8 to msc_elem_8_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate. | |
v16accfloat | addmsc_elem_16 (v16float v1, v16float v2, v16accfloat acc1, v16accfloat acc2) |
Elementwise multiplication of fp32 data elements (emulation using bf16 datapath), with the product subtracted from the sum of acc1 and acc2. By default, addmsc_elem_16 behaves the same as addmsc_elem_16_accuracy_safe. Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map addmsc_elem_16 to addmsc_elem_16_accuracy_fast. Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map addmsc_elem_16 to addmsc_elem_16_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate. | |
v16accfloat | addmsc_elem_16_accuracy_safe (v16float v1, v16float v2, v16accfloat acc1, v16accfloat acc2) |
Elementwise multiplication of fp32 data elements (emulation using bf16 datapath), with the product subtracted from the sum of acc1 and acc2. By default, addmsc_elem_16 behaves the same as addmsc_elem_16_accuracy_safe. For an explanation of how these operations work, see Multiply Accumulate. | |
v16accfloat | addmsc_elem_16_accuracy_fast (v16float v1, v16float v2, v16accfloat acc1, v16accfloat acc2) |
Elementwise multiplication of fp32 data elements (emulation using bf16 datapath), with the product subtracted from the sum of acc1 and acc2. Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map addmsc_elem_16 to addmsc_elem_16_accuracy_fast. For an explanation of how these operations work, see Multiply Accumulate. | |
v16accfloat | addmsc_elem_16_accuracy_low (v16float v1, v16float v2, v16accfloat acc1, v16accfloat acc2) |
Elementwise multiplication of fp32 data elements (emulation using bf16 datapath), with the product subtracted from the sum of acc1 and acc2. Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map addmsc_elem_16 to addmsc_elem_16_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate. | |
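The entries above describe splitting each fp32 input into bfloat16 terms and then keeping all or only some of the resulting partial products. The following host-side sketch illustrates that arithmetic under stated assumptions; it is not the intrinsic implementation. The names `truncate_to_bf16`, `split_into_3_bf16`, `emulated_mul` and `kept_terms` are hypothetical, the split uses truncation, and partial products are summed in double precision (the hardware accumulates differently) so that only the effect of the dropped terms is visible. Treating the low-accuracy variant as the two most significant terms of the same split is also an assumption.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Keep only the top 16 bits of an fp32 value (sign, exponent, 7 mantissa
// bits), i.e. truncate it to a value exactly representable in bfloat16.
inline float truncate_to_bf16(float x) {
    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof bits);
    bits &= 0xFFFF0000u;
    float out;
    std::memcpy(&out, &bits, sizeof out);
    return out;
}

// Split x into three bf16-representable terms, most significant first.
// For normal, finite inputs away from underflow they sum back to x exactly.
inline std::array<float, 3> split_into_3_bf16(float x) {
    const float hi  = truncate_to_bf16(x);
    const float mid = truncate_to_bf16(x - hi);
    const float lo  = truncate_to_bf16(x - hi - mid);
    return {hi, mid, lo};
}

// Number of partial products p[i] * q[j] with i + j <= max_order.
inline int kept_terms(int max_order) {
    int n = 0;
    for (int i = 0; i < 3; ++i)
        for (int j = 0; j < 3; ++j)
            if (i + j <= max_order) ++n;
    return n;
}

// Sum the partial products whose significance index i + j is at most
// max_order: 4 keeps all 9 terms, 2 keeps 6 of 9, 1 keeps 3 (the 4 terms
// of a two-way split minus the least significant one).
inline double emulated_mul(float x, float y, int max_order) {
    const auto p = split_into_3_bf16(x);
    const auto q = split_into_3_bf16(y);
    double acc = 0.0;
    for (int i = 0; i < 3; ++i)
        for (int j = 0; j < 3; ++j)
            if (i + j <= max_order)
                acc += static_cast<double>(p[i]) * static_cast<double>(q[j]);
    return acc;
}
```

With all 9 terms kept, `emulated_mul(x, y, 4)` reproduces the exact product of the two fp32 inputs; lowering `max_order` to 2 or 1 drops the least significant partial products, trading accuracy for fewer multiply operations in the same spirit as the fast and low variants.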
Matrix multiplication using bf16 data-path | |
v16accfloat | mul_4x8_8x4 (v32float v1, v32float v2) |
Matrix multiplication of fp32 data elements (emulation using bf16 datapath). The default behavior of mul_4x8_8x4 is the same as mul_4x8_8x4_accuracy_safe (slower, but more accurate). Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map mul_4x8_8x4 to mul_4x8_8x4_accuracy_fast (improved performance at the risk of a slight reduction in accuracy). Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map mul_4x8_8x4 to mul_4x8_8x4_accuracy_low (best performance at the risk of accuracy loss). For an explanation of how these operations work, see Multiply Accumulate. | |
v4caccfloat | mul_2x8_8x2 (v16float v1, v16cfloat v2) |
Matrix multiplication of fp32 data elements (emulation using bf16 datapath). The default behavior of mul_2x8_8x2 is the same as mul_2x8_8x2_accuracy_safe (slower, but more accurate). Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map mul_2x8_8x2 to mul_2x8_8x2_accuracy_fast (improved performance at the risk of a slight reduction in accuracy). Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map mul_2x8_8x2 to mul_2x8_8x2_accuracy_low (best performance at the risk of accuracy loss). For an explanation of how these operations work, see Multiply Accumulate. | |
v16accfloat | mul_4x8_8x4_accuracy_safe (v32float v1, v32float v2) |
Matrix multiplication of fp32 data elements (emulation using bf16 datapath). The default behavior of mul_4x8_8x4 is the same as mul_4x8_8x4_accuracy_safe. | |
v4caccfloat | mul_2x8_8x2_accuracy_safe (v16float v1, v16cfloat v2) |
Matrix multiplication of fp32 data elements (emulation using bf16 datapath). The default behavior of mul_2x8_8x2 is the same as mul_2x8_8x2_accuracy_safe (slower, but more accurate). Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map mul_2x8_8x2 to mul_2x8_8x2_accuracy_fast (improved performance at the risk of a slight reduction in accuracy). Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map mul_2x8_8x2 to mul_2x8_8x2_accuracy_low (best performance at the risk of accuracy loss). For an explanation of how these operations work, see Multiply Accumulate. | |
v16accfloat | mul_4x8_8x4_accuracy_fast (v32float v1, v32float v2) |
Matrix multiplication of fp32 data elements (emulation using bf16 datapath). Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map mul_4x8_8x4 to mul_4x8_8x4_accuracy_fast. | |
v4caccfloat | mul_2x8_8x2_accuracy_fast (v16float v1, v16cfloat v2) |
Matrix multiplication of fp32 data elements (emulation using bf16 datapath). The default behavior of mul_2x8_8x2 is the same as mul_2x8_8x2_accuracy_safe (slower, but more accurate). Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map mul_2x8_8x2 to mul_2x8_8x2_accuracy_fast (improved performance at the risk of a slight reduction in accuracy). Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map mul_2x8_8x2 to mul_2x8_8x2_accuracy_low (best performance at the risk of accuracy loss). For an explanation of how these operations work, see Multiply Accumulate. | |
v16accfloat | mul_4x8_8x4_accuracy_low (v32float v1, v32float v2) |
Matrix multiplication of fp32 data elements (emulation using bf16 datapath; 16 bits of the mantissa are used). Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map mul_4x8_8x4 to mul_4x8_8x4_accuracy_low. | |
v4caccfloat | mul_2x8_8x2_accuracy_low (v16float v1, v16cfloat v2) |
Matrix multiplication of fp32 data elements (emulation using bf16 datapath). The default behavior of mul_2x8_8x2 is the same as mul_2x8_8x2_accuracy_safe (slower, but more accurate). Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map mul_2x8_8x2 to mul_2x8_8x2_accuracy_fast (improved performance at the risk of a slight reduction in accuracy). Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map mul_2x8_8x2 to mul_2x8_8x2_accuracy_low (best performance at the risk of accuracy loss). For an explanation of how these operations work, see Multiply Accumulate. | |
v16accfloat | negmul_4x8_8x4 (v32float v1, v32float v2) |
Matrix multiplication of fp32 data elements (emulation using bf16 datapath). The default behavior of negmul_4x8_8x4 is the same as negmul_4x8_8x4_accuracy_safe (slower, but more accurate). Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map negmul_4x8_8x4 to negmul_4x8_8x4_accuracy_fast (improved performance at the risk of a slight reduction in accuracy). Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map negmul_4x8_8x4 to negmul_4x8_8x4_accuracy_low (best performance at the risk of accuracy loss). For an explanation of how these operations work, see Multiply Accumulate. | |
v16accfloat | negmul_4x8_8x4_accuracy_safe (v32float v1, v32float v2) |
Matrix multiplication of fp32 data elements (emulation using bf16 datapath). The default behavior of negmul_4x8_8x4 is the same as negmul_4x8_8x4_accuracy_safe. | |
v16accfloat | negmul_4x8_8x4_accuracy_fast (v32float v1, v32float v2) |
Matrix multiplication of fp32 data elements (emulation using bf16 datapath). Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map negmul_4x8_8x4 to negmul_4x8_8x4_accuracy_fast. | |
v16accfloat | negmul_4x8_8x4_accuracy_low (v32float v1, v32float v2) |
Matrix multiplication of fp32 data elements (emulation using bf16 datapath). Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map negmul_4x8_8x4 to negmul_4x8_8x4_accuracy_low. | |
v16accfloat | mac_4x8_8x4 (v32float v1, v32float v2, v16accfloat acc) |
Matrix multiplication of fp32 data elements (emulation using bf16 datapath). The default behavior of mac_4x8_8x4 is the same as mac_4x8_8x4_accuracy_safe (slower, but more accurate). Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map mac_4x8_8x4 to mac_4x8_8x4_accuracy_fast (improved performance at the risk of a slight reduction in accuracy). Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map mac_4x8_8x4 to mac_4x8_8x4_accuracy_low (best performance at the risk of accuracy loss). For an explanation of how these operations work, see Multiply Accumulate. | |
v16accfloat | mac_4x8_8x4_accuracy_safe (v32float v1, v32float v2, v16accfloat acc) |
Matrix multiplication of fp32 data elements (emulation using bf16 datapath). The default behavior of mac_4x8_8x4 is the same as mac_4x8_8x4_accuracy_safe. | |
v16accfloat | mac_4x8_8x4_accuracy_fast (v32float v1, v32float v2, v16accfloat acc) |
Matrix multiplication of fp32 data elements (emulation using bf16 datapath). Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map mac_4x8_8x4 to mac_4x8_8x4_accuracy_fast. | |
v16accfloat | mac_4x8_8x4_accuracy_low (v32float v1, v32float v2, v16accfloat acc) |
Matrix multiplication of fp32 data elements (emulation using bf16 datapath). Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map mac_4x8_8x4 to mac_4x8_8x4_accuracy_low. | |
v16accfloat | addmac_4x8_8x4 (v32float v1, v32float v2, v16accfloat acc1, v16accfloat acc2) |
Matrix multiplication of fp32 data elements (emulation using bf16 datapath). The default behavior of addmac_4x8_8x4 is the same as addmac_4x8_8x4_accuracy_safe (slower, but more accurate). Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map addmac_4x8_8x4 to addmac_4x8_8x4_accuracy_fast (improved performance at the risk of a slight reduction in accuracy). Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map addmac_4x8_8x4 to addmac_4x8_8x4_accuracy_low (best performance at the risk of accuracy loss). For an explanation of how these operations work, see Multiply Accumulate. | |
v16accfloat | addmac_4x8_8x4_accuracy_safe (v32float v1, v32float v2, v16accfloat acc1, v16accfloat acc2) |
Matrix multiplication of fp32 data elements (emulation using bf16 datapath). The default behavior of addmac_4x8_8x4 is the same as addmac_4x8_8x4_accuracy_safe. | |
v16accfloat | addmac_4x8_8x4_accuracy_fast (v32float v1, v32float v2, v16accfloat acc1, v16accfloat acc2) |
Matrix multiplication of fp32 data elements (emulation using bf16 datapath). Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map addmac_4x8_8x4 to addmac_4x8_8x4_accuracy_fast. | |
v16accfloat | addmac_4x8_8x4_accuracy_low (v32float v1, v32float v2, v16accfloat acc1, v16accfloat acc2) |
Matrix multiplication of fp32 data elements (emulation using bf16 datapath). Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map addmac_4x8_8x4 to addmac_4x8_8x4_accuracy_low. | |
v16accfloat | msc_4x8_8x4 (v32float v1, v32float v2, v16accfloat acc) |
Matrix multiplication of fp32 data elements (emulation using bf16 datapath). The default behavior of msc_4x8_8x4 is the same as msc_4x8_8x4_accuracy_safe (slower, but more accurate). Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map msc_4x8_8x4 to msc_4x8_8x4_accuracy_fast (improved performance at the risk of a slight reduction in accuracy). Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map msc_4x8_8x4 to msc_4x8_8x4_accuracy_low (best performance at the risk of accuracy loss). For an explanation of how these operations work, see Multiply Accumulate. | |
v16accfloat | msc_4x8_8x4_accuracy_safe (v32float v1, v32float v2, v16accfloat acc) |
Matrix multiplication of fp32 data elements (emulation using bf16 datapath). The default behavior of msc_4x8_8x4 is the same as msc_4x8_8x4_accuracy_safe. | |
v16accfloat | msc_4x8_8x4_accuracy_fast (v32float v1, v32float v2, v16accfloat acc) |
Matrix multiplication of fp32 data elements (emulation using bf16 datapath). Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map msc_4x8_8x4 to msc_4x8_8x4_accuracy_fast. | |
v16accfloat | msc_4x8_8x4_accuracy_low (v32float v1, v32float v2, v16accfloat acc) |
Matrix multiplication of fp32 data elements (emulation using bf16 datapath). Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map msc_4x8_8x4 to msc_4x8_8x4_accuracy_low. | |
v16accfloat | addmsc_4x8_8x4 (v32float v1, v32float v2, v16accfloat acc1, v16accfloat acc2) |
Matrix multiplication of fp32 data elements (emulation using bf16 datapath). The default behavior of addmsc_4x8_8x4 is the same as addmsc_4x8_8x4_accuracy_safe (slower, but more accurate). Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map addmsc_4x8_8x4 to addmsc_4x8_8x4_accuracy_fast (improved performance at the risk of a slight reduction in accuracy). Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map addmsc_4x8_8x4 to addmsc_4x8_8x4_accuracy_low (best performance at the risk of accuracy loss). For an explanation of how these operations work, see Multiply Accumulate. | |
v16accfloat | addmsc_4x8_8x4_accuracy_safe (v32float v1, v32float v2, v16accfloat acc1, v16accfloat acc2) |
Matrix multiplication of fp32 data elements (emulation using bf16 datapath). The default behavior of addmsc_4x8_8x4 is the same as addmsc_4x8_8x4_accuracy_safe. | |
v16accfloat | addmsc_4x8_8x4_accuracy_fast (v32float v1, v32float v2, v16accfloat acc1, v16accfloat acc2) |
Matrix multiplication of fp32 data elements (emulation using bf16 datapath). Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map addmsc_4x8_8x4 to addmsc_4x8_8x4_accuracy_fast. | |
v16accfloat | addmsc_4x8_8x4_accuracy_low (v32float v1, v32float v2, v16accfloat acc1, v16accfloat acc2) |
Matrix multiplication of fp32 data elements (emulation using bf16 datapath). Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map addmsc_4x8_8x4 to addmsc_4x8_8x4_accuracy_low. | |
v16accfloat addmac_4x8_8x4 | ( | v32float | v1, |
v32float | v2, | ||
v16accfloat | acc1, | ||
v16accfloat | acc2 | ||
) |
Matrix multiplication of fp32 data elements (emulation using bf16 datapath). The default behavior of addmac_4x8_8x4 is the same as addmac_4x8_8x4_accuracy_safe (slower, but more accurate). Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map addmac_4x8_8x4 to addmac_4x8_8x4_accuracy_fast (improved performance at the risk of a slight reduction in accuracy). Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map addmac_4x8_8x4 to addmac_4x8_8x4_accuracy_low (best performance at the risk of accuracy loss). For an explanation of how these operations work, see Multiply Accumulate.
v1 | Vector v1 |
v2 | Vector v2 |
acc1 | Accumulator 1 input |
acc2 | Accumulator 2 input |
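As a plain C++ illustration of the arithmetic only, the sketch below models addmac_4x8_8x4 on a host machine. It is not the intrinsic and does not emulate the bf16 datapath. It assumes that v1 holds a 4x8 matrix and v2 an 8x4 matrix, both stored row-major, that the 16 accumulator lanes hold the 4x4 result row-major, and that the operation computes acc1 + acc2 + v1*v2. None of these layout or sign conventions is stated in this section; `Vec32`, `Acc16`, and `ref_addmac_4x8_8x4` are hypothetical names.

```cpp
#include <array>
#include <cstddef>

// Hypothetical host-side stand-ins for v32float and v16accfloat.
using Vec32 = std::array<float, 32>;
using Acc16 = std::array<float, 16>;

// Assumed semantics: out = acc1 + acc2 + A * B, with A (4x8) and B (8x4)
// stored row-major. The layout is an assumption, not taken from the guide.
Acc16 ref_addmac_4x8_8x4(const Vec32& a, const Vec32& b,
                         const Acc16& acc1, const Acc16& acc2) {
    Acc16 out{};
    for (std::size_t r = 0; r < 4; ++r) {
        for (std::size_t c = 0; c < 4; ++c) {
            // Start from the sum of both accumulator inputs for this lane.
            float sum = acc1[r * 4 + c] + acc2[r * 4 + c];
            // Add the dot product of row r of A and column c of B.
            for (std::size_t k = 0; k < 8; ++k) {
                sum += a[r * 8 + k] * b[k * 4 + c];
            }
            out[r * 4 + c] = sum;
        }
    }
    return out;
}
```

A model like this can serve as a golden reference when comparing the safe, fast, and low variants against full-precision results.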
v16accfloat addmac_4x8_8x4_accuracy_fast | ( | v32float | v1, |
v32float | v2, | ||
v16accfloat | acc1, | ||
v16accfloat | acc2 | ||
) |
Matrix multiplication of fp32 data elements (emulation using bf16 datapath). Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map addmac_4x8_8x4 to addmac_4x8_8x4_accuracy_fast.
v1 | Vector v1 |
v2 | Vector v2 |
acc1 | Accumulator 1 input |
acc2 | Accumulator 2 input |
v16accfloat addmac_4x8_8x4_accuracy_low | ( | v32float | v1, |
v32float | v2, | ||
v16accfloat | acc1, | ||
v16accfloat | acc2 | ||
) |
Matrix multiplication of fp32 data elements (emulation using bf16 datapath). Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map addmac_4x8_8x4 to addmac_4x8_8x4_accuracy_low.
v1 | Vector v1 |
v2 | Vector v2 |
acc1 | Accumulator 1 input |
acc2 | Accumulator 2 input |
v16accfloat addmac_4x8_8x4_accuracy_safe | ( | v32float | v1, |
v32float | v2, | ||
v16accfloat | acc1, | ||
v16accfloat | acc2 | ||
) |
Matrix multiplication of fp32 data elements (emulation using bf16 datapath). The default behavior of addmac_4x8_8x4 is the same as addmac_4x8_8x4_accuracy_safe.
v1 | Vector v1 |
v2 | Vector v2 |
acc1 | Accumulator 1 input |
acc2 | Accumulator 2 input |
v16accfloat addmac_elem_16 | ( | v16float | v1, |
v16float | v2, | ||
v16accfloat | acc1, | ||
v16accfloat | acc2 | ||
) |
Elementwise multiplication of fp32 data elements (emulation using bf16 datapath). The default behavior of addmac_elem_16 is the same as addmac_elem_16_accuracy_safe. Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map addmac_elem_16 to addmac_elem_16_accuracy_fast. Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map addmac_elem_16 to addmac_elem_16_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate.
acc1 | accumulator 1 input |
acc2 | accumulator 2 input |
v1 | Vector v1 |
v2 | Vector v2 |
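The flag-based mapping described for addmac_elem_16 can be pictured as a compile-time selection. The sketch below is only an illustration of that idea, not the header's actual implementation: the three stand-in functions and `selected_addmac_variant` are hypothetical, and the precedence of FAST over LOW when both flags are defined is an assumption made for this example.

```cpp
#include <string>

// Hypothetical stand-ins for the three accuracy variants. Each returns its
// own name so that the selected mapping can be observed on a host machine.
inline std::string addmac_variant_safe() { return "addmac_elem_16_accuracy_safe"; }
inline std::string addmac_variant_fast() { return "addmac_elem_16_accuracy_fast"; }
inline std::string addmac_variant_low()  { return "addmac_elem_16_accuracy_low"; }

// Compile-time selection keyed on the flags named in the guide.
// Assumption: FAST takes precedence if both flags are defined.
inline std::string selected_addmac_variant() {
#if defined(AIE_FP32_EMULATION_ACCURACY_FAST)
    return addmac_variant_fast();
#elif defined(AIE_FP32_EMULATION_ACCURACY_LOW)
    return addmac_variant_low();
#else
    return addmac_variant_safe();  // default: the accuracy_safe variant
#endif
}
```

With neither flag defined, `selected_addmac_variant()` reports the safe variant, matching the default described above. Defining one of the flags before compilation, for example through a preprocessor definition on the compiler command line, switches the selection.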
v16accfloat addmac_elem_16_accuracy_fast | ( | v16float | v1, |
v16float | v2, | ||
v16accfloat | acc1, | ||
v16accfloat | acc2 | ||
) |
Elementwise multiplication of fp32 data elements (emulation using bf16 datapath). Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map addmac_elem_16 to addmac_elem_16_accuracy_fast. For an explanation of how these operations work, see Multiply Accumulate.
acc1 | accumulator 1 input |
acc2 | accumulator 2 input |
v1 | Vector v1 |
v2 | Vector v2 |
v16accfloat addmac_elem_16_accuracy_low | ( | v16float | v1, |
v16float | v2, | ||
v16accfloat | acc1, | ||
v16accfloat | acc2 | ||
) |
Elementwise multiplication of fp32 data elements (emulation using bf16 datapath). Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map addmac_elem_16 to addmac_elem_16_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate.
acc1 | accumulator 1 input |
acc2 | accumulator 2 input |
v1 | Vector v1 |
v2 | Vector v2 |
v16accfloat addmac_elem_16_accuracy_safe | ( | v16float | v1, |
v16float | v2, | ||
v16accfloat | acc1, | ||
v16accfloat | acc2 | ||
) |
Elementwise multiplication of fp32 data elements (emulation using bf16 datapath). The default behavior of addmac_elem_16 is the same as addmac_elem_16_accuracy_safe. For an explanation of how these operations work, see Multiply Accumulate.
acc1 | accumulator 1 input |
acc2 | accumulator 2 input |
v1 | Vector v1 |
v2 | Vector v2 |
v16accfloat addmsc_4x8_8x4 | ( | v32float | v1, |
v32float | v2, | ||
v16accfloat | acc1, | ||
v16accfloat | acc2 | ||
) |
Matrix multiplication of fp32 data elements (emulation using bf16 datapath). The default behavior of addmsc_4x8_8x4 is the same as addmsc_4x8_8x4_accuracy_safe (slower, but more accurate). Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map addmsc_4x8_8x4 to addmsc_4x8_8x4_accuracy_fast (improved performance at the risk of a slight reduction in accuracy). Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map addmsc_4x8_8x4 to addmsc_4x8_8x4_accuracy_low (best performance at the risk of accuracy loss). For an explanation of how these operations work, see Multiply Accumulate.
v1 | Vector v1 |
v2 | Vector v2 |
acc1 | Accumulator 1 input |
acc2 | Accumulator 2 input |
v16accfloat addmsc_4x8_8x4_accuracy_fast | ( | v32float | v1, |
v32float | v2, | ||
v16accfloat | acc1, | ||
v16accfloat | acc2 | ||
) |
Matrix multiplication of fp32 data elements (emulation using bf16 datapath). Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map addmsc_4x8_8x4 to addmsc_4x8_8x4_accuracy_fast.
v1 | Vector v1 |
v2 | Vector v2 |
acc1 | Accumulator 1 input |
acc2 | Accumulator 2 input |
v16accfloat addmsc_4x8_8x4_accuracy_low | ( | v32float | v1, |
v32float | v2, | ||
v16accfloat | acc1, | ||
v16accfloat | acc2 | ||
) |
Matrix multiplication of fp32 data elements (emulation using bf16 datapath). Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map addmsc_4x8_8x4 to addmsc_4x8_8x4_accuracy_low.
v1 | Vector v1 |
v2 | Vector v2 |
acc1 | Accumulator 1 input |
acc2 | Accumulator 2 input |
v16accfloat addmsc_4x8_8x4_accuracy_safe | ( | v32float | v1, |
v32float | v2, | ||
v16accfloat | acc1, | ||
v16accfloat | acc2 | ||
) |
Matrix multiplication of fp32 data elements (emulation using bf16 datapath). The default behavior of addmsc_4x8_8x4 is the same as addmsc_4x8_8x4_accuracy_safe.
v1 | Vector v1 |
v2 | Vector v2 |
acc1 | Accumulator 1 input |
acc2 | Accumulator 2 input |
v16accfloat addmsc_elem_16 | ( | v16float | v1, |
v16float | v2, | ||
v16accfloat | acc1, | ||
v16accfloat | acc2 | ||
) |
Elementwise multiplication of fp32 data elements (emulation using bf16 datapath). The default behavior of addmsc_elem_16 is the same as addmsc_elem_16_accuracy_safe. Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map addmsc_elem_16 to addmsc_elem_16_accuracy_fast. Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map addmsc_elem_16 to addmsc_elem_16_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate.
acc1 | accumulator 1 input |
acc2 | accumulator 2 input |
v1 | Vector v1 |
v2 | Vector v2 |
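For orientation, a host-side scalar model of addmsc_elem_16 might look like the sketch below. It assumes that the operation computes acc1 + acc2 - v1*v2 independently in each of the 16 lanes; that sign convention is inferred from the intrinsic's name and is not stated in this section. `Vec16` and `ref_addmsc_elem_16` are hypothetical names, and the bf16 emulation itself is not modeled.

```cpp
#include <array>
#include <cstddef>

// Hypothetical host-side stand-in for v16float and v16accfloat.
using Vec16 = std::array<float, 16>;

// Assumed semantics per lane: out[i] = acc1[i] + acc2[i] - v1[i] * v2[i].
Vec16 ref_addmsc_elem_16(const Vec16& v1, const Vec16& v2,
                         const Vec16& acc1, const Vec16& acc2) {
    Vec16 out{};
    for (std::size_t i = 0; i < out.size(); ++i) {
        out[i] = acc1[i] + acc2[i] - v1[i] * v2[i];
    }
    return out;
}
```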
v16accfloat addmsc_elem_16_accuracy_fast | ( | v16float | v1, |
v16float | v2, | ||
v16accfloat | acc1, | ||
v16accfloat | acc2 | ||
) |
Elementwise multiplication of fp32 data elements (emulation using bf16 datapath). Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map addmsc_elem_16 to addmsc_elem_16_accuracy_fast. For an explanation of how these operations work, see Multiply Accumulate.
acc1 | accumulator 1 input |
acc2 | accumulator 2 input |
v1 | Vector v1 |
v2 | Vector v2 |
v16accfloat addmsc_elem_16_accuracy_low | ( | v16float | v1, |
v16float | v2, | ||
v16accfloat | acc1, | ||
v16accfloat | acc2 | ||
) |
Elementwise multiplication of fp32 data elements (emulation using bf16 datapath). Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map addmsc_elem_16 to addmsc_elem_16_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate.
acc1 | accumulator 1 input |
acc2 | accumulator 2 input |
v1 | Vector v1 |
v2 | Vector v2 |
v16accfloat addmsc_elem_16_accuracy_safe | ( | v16float | v1, |
v16float | v2, | ||
v16accfloat | acc1, | ||
v16accfloat | acc2 | ||
) |
Elementwise multiplication of fp32 data elements (emulation using bf16 datapath). The default behavior of addmsc_elem_16 is the same as addmsc_elem_16_accuracy_safe. For an explanation of how these operations work, see Multiply Accumulate.
acc1 | accumulator 1 input |
acc2 | accumulator 2 input |
v1 | Vector v1 |
v2 | Vector v2 |
v16accfloat mac_4x8_8x4 | ( | v32float | v1, |
v32float | v2, | ||
v16accfloat | acc | ||
) |
Matrix multiplication of fp32 data elements (emulation using bf16 datapath). The default behavior of mac_4x8_8x4 is the same as mac_4x8_8x4_accuracy_safe (slower, but more accurate). Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map mac_4x8_8x4 to mac_4x8_8x4_accuracy_fast (improved performance at the risk of a slight reduction in accuracy). Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map mac_4x8_8x4 to mac_4x8_8x4_accuracy_low (best performance at the risk of accuracy loss). For an explanation of how these operations work, see Multiply Accumulate.
v1 | Vector v1 |
v2 | Vector v2 |
acc | acc input |
v16accfloat mac_4x8_8x4_accuracy_fast | ( | v32float | v1, |
v32float | v2, | ||
v16accfloat | acc | ||
) |
Matrix multiplication of fp32 data elements (emulation using bf16 datapath). Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map mac_4x8_8x4 to mac_4x8_8x4_accuracy_fast.
v1 | Vector v1 |
v2 | Vector v2 |
acc | acc input |
v16accfloat mac_4x8_8x4_accuracy_low | ( | v32float | v1, |
v32float | v2, | ||
v16accfloat | acc | ||
) |
Matrix multiplication of fp32 data elements (emulation using bf16 datapath). Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map mac_4x8_8x4 to mac_4x8_8x4_accuracy_low.
v1 | Vector v1 |
v2 | Vector v2 |
acc | acc input |
v16accfloat mac_4x8_8x4_accuracy_safe | ( | v32float | v1, |
v32float | v2, | ||
v16accfloat | acc | ||
) |
Matrix multiplication of fp32 data elements (emulation using bf16 datapath). The default behavior of mac_4x8_8x4 is the same as mac_4x8_8x4_accuracy_safe.
v1 | Vector v1 |
v2 | Vector v2 |
acc | acc input |
v16accfloat mac_elem_16 | ( | v16float | v1, |
v16float | v2, | ||
v16accfloat | acc | ||
) |
Elementwise multiplication of fp32 data elements (emulation using bf16 datapath). The default behavior of mac_elem_16 is the same as mac_elem_16_accuracy_safe. Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map mac_elem_16 to mac_elem_16_accuracy_fast. Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map mac_elem_16 to mac_elem_16_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate.
acc | accumulator input |
v1 | Vector v1 |
v2 | Vector v2 |
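The accuracy trade-off between the three mac_elem_16 variants can be explored with a scalar host model. The sketch below follows the partial-product counts given for the mul_elem_16 variants earlier in this guide (9 products for safe, 6 for fast, 3 of 4 for low) and assumes the mac variants behave the same way. It splits each fp32 input into bf16 terms by truncation and keeps only the most significant partial products. Which products are dropped, the fp32 accumulation order, and all names (`truncate_to_bf16`, `split3`, `emulated_mac`, `kept_products`, `Accuracy`) are assumptions for illustration; hardware rounding behavior is not modeled.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <cstring>

static_assert(sizeof(float) == 4, "sketch assumes 32-bit IEEE-754 float");

// Truncate an fp32 value to bfloat16 precision by keeping its top 16 bits.
inline float truncate_to_bf16(float x) {
    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof bits);
    bits &= 0xFFFF0000u;
    std::memcpy(&x, &bits, sizeof x);
    return x;
}

// Split x into three bf16-representable terms (most significant first).
// For normal finite inputs the three terms sum back to x exactly.
inline std::array<float, 3> split3(float x) {
    const float hi = truncate_to_bf16(x);
    const float mid = truncate_to_bf16(x - hi);
    const float lo = truncate_to_bf16((x - hi) - mid);
    return {hi, mid, lo};
}

enum class Accuracy { Safe, Fast, Low };

// Number of bf16 terms taken from each input (assumed: 2 for Low, else 3).
inline std::size_t term_count(Accuracy mode) {
    return mode == Accuracy::Low ? 2 : 3;
}

// A partial product a[i]*b[j] is kept when i + j <= max_rank(mode).
inline std::size_t max_rank(Accuracy mode) {
    if (mode == Accuracy::Safe) return 4;  // all 9 products
    if (mode == Accuracy::Fast) return 2;  // 6 products, 3 smallest dropped
    return 1;                              // 3 of 4 products
}

// Count the partial products kept per output lane in a given mode.
inline int kept_products(Accuracy mode) {
    int count = 0;
    for (std::size_t i = 0; i < term_count(mode); ++i)
        for (std::size_t j = 0; j < term_count(mode); ++j)
            if (i + j <= max_rank(mode)) ++count;
    return count;
}

// Scalar model of acc + a*b built from the kept partial products,
// accumulated in fp32.
inline float emulated_mac(float a, float b, float acc, Accuracy mode) {
    const std::array<float, 3> pa = split3(a);
    const std::array<float, 3> pb = split3(b);
    float sum = acc;
    for (std::size_t i = 0; i < term_count(mode); ++i)
        for (std::size_t j = 0; j < term_count(mode); ++j)
            if (i + j <= max_rank(mode)) sum += pa[i] * pb[j];
    return sum;
}
```

Comparing `emulated_mac` against a double-precision reference for each mode gives a rough feel for how much accuracy each variant trades for fewer partial products.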
v16accfloat mac_elem_16_accuracy_fast | ( | v16float | v1, |
v16float | v2, | ||
v16accfloat | acc | ||
) |
Elementwise multiplication of fp32 data elements (emulation using bf16 datapath). Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map mac_elem_16 to mac_elem_16_accuracy_fast. For an explanation of how these operations work, see Multiply Accumulate.
acc | accumulator input |
v1 | Vector v1 |
v2 | Vector v2 |
v16accfloat mac_elem_16_accuracy_low | ( | v16float | v1, |
v16float | v2, | ||
v16accfloat | acc | ||
) |
Elementwise multiplication of fp32 data elements (emulation using bf16 datapath). Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map mac_elem_16 to mac_elem_16_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate.
acc | accumulator input |
v1 | Vector v1 |
v2 | Vector v2 |
v16accfloat mac_elem_16_accuracy_safe | ( | v16float | v1, |
v16float | v2, | ||
v16accfloat | acc | ||
) |
Elementwise multiplication of fp32 data elements (emulation using bf16 datapath). The default behavior of mac_elem_16 is the same as mac_elem_16_accuracy_safe. For an explanation of how these operations work, see Multiply Accumulate.
acc | accumulator input |
v1 | Vector v1 |
v2 | Vector v2 |
v8caccfloat mac_elem_8 | ( | v8cfloat | v1, |
v8cfloat | v2, | ||
v8caccfloat | acc | ||
) |
Elementwise multiplication of fp32 and cfloat data elements (emulation using bf16 datapath). The default behavior of mac_elem_8 is the same as mac_elem_8_accuracy_safe. Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map mac_elem_8 to mac_elem_8_accuracy_fast. Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map mac_elem_8 to mac_elem_8_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate.
acc | accumulator input |
v1 | Vector v1 |
v2 | Vector v2 |
v8caccfloat mac_elem_8 | ( | v8cfloat | v1, |
v8float | v2, | ||
v8caccfloat | acc | ||
) |
Elementwise multiplication of fp32 and cfloat data elements (emulation using bf16 datapath). The default behavior of mac_elem_8 is the same as mac_elem_8_accuracy_safe. Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map mac_elem_8 to mac_elem_8_accuracy_fast. Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map mac_elem_8 to mac_elem_8_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate.
acc | accumulator input |
v1 | Vector v1 |
v2 | Vector v2 |
v8caccfloat mac_elem_8 | ( | v8float | v1, |
v8cfloat | v2, | ||
v8caccfloat | acc | ||
) |
Elementwise multiplication of fp32 and cfloat data elements (emulation using bf16 datapath). The default behavior of mac_elem_8 is the same as mac_elem_8_accuracy_safe. Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map mac_elem_8 to mac_elem_8_accuracy_fast. Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map mac_elem_8 to mac_elem_8_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate.
acc | accumulator input |
v1 | Vector v1 |
v2 | Vector v2 |
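The mac_elem_8 overloads mix complex and real operands. As a rough host-side illustration, the sketch below models two of them with `std::complex<float>`, assuming each of the 8 lanes computes acc + v1*v2 using ordinary complex arithmetic. `CVec8`, `RVec8`, and `ref_mac_elem_8` are hypothetical names; the real intrinsics operate on v8cfloat, v8float, and v8caccfloat, and the bf16 emulation is not modeled here.

```cpp
#include <array>
#include <complex>
#include <cstddef>

// Hypothetical host-side stand-ins for v8cfloat / v8caccfloat and v8float.
using CVec8 = std::array<std::complex<float>, 8>;
using RVec8 = std::array<float, 8>;

// Assumed semantics (complex x complex): out[i] = acc[i] + v1[i] * v2[i].
CVec8 ref_mac_elem_8(const CVec8& v1, const CVec8& v2, const CVec8& acc) {
    CVec8 out{};
    for (std::size_t i = 0; i < out.size(); ++i) {
        out[i] = acc[i] + v1[i] * v2[i];
    }
    return out;
}

// Assumed semantics (complex x real): the real operand scales both the
// real and imaginary parts of v1 before accumulation.
CVec8 ref_mac_elem_8(const CVec8& v1, const RVec8& v2, const CVec8& acc) {
    CVec8 out{};
    for (std::size_t i = 0; i < out.size(); ++i) {
        out[i] = acc[i] + v1[i] * v2[i];
    }
    return out;
}
```

The real-times-complex overload would follow the same pattern with the operand types swapped.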
v8caccfloat mac_elem_8_accuracy_fast | ( | v8cfloat | v1, |
v8cfloat | v2, | ||
v8caccfloat | acc | ||
) |
Elementwise multiplication of fp32 and cfloat data elements (emulation using bf16 datapath). For an explanation of how these operations work, see Multiply Accumulate.
acc | accumulator input |
v1 | Vector v1 |
v2 | Vector v2 |
v8caccfloat mac_elem_8_accuracy_fast | ( | v8cfloat | v1, |
v8float | v2, | ||
v8caccfloat | acc | ||
) |
Elementwise multiplication of fp32 and cfloat data elements (emulation using bf16 datapath). For an explanation of how these operations work, see Multiply Accumulate.
acc | accumulator input |
v1 | Vector v1 |
v2 | Vector v2 |
v8caccfloat mac_elem_8_accuracy_fast | ( | v8float | v1, |
v8cfloat | v2, | ||
v8caccfloat | acc | ||
) |
Elementwise multiplication of fp32 and cfloat data elements (emulation using bf16 datapath). For an explanation of how these operations work, see Multiply Accumulate.
acc | accumulator input |
v1 | Vector v1 |
v2 | Vector v2 |
v8caccfloat mac_elem_8_accuracy_low | ( | v8cfloat | v1, |
v8cfloat | v2, | ||
v8caccfloat | acc | ||
) |
Elementwise multiplication of fp32 and cfloat data elements (emulation using bf16 datapath). For an explanation of how these operations work, see Multiply Accumulate.
acc | accumulator input |
v1 | Vector v1 |
v2 | Vector v2 |
v8caccfloat mac_elem_8_accuracy_low | ( | v8cfloat | v1, |
v8float | v2, | ||
v8caccfloat | acc | ||
) |
Elementwise multiplication of fp32 and cfloat data elements (emulation using bf16 datapath). For an explanation of how these operations work, see Multiply Accumulate.
acc | accumulator input |
v1 | Vector v1 |
v2 | Vector v2 |
v8caccfloat mac_elem_8_accuracy_low | ( | v8float | v1, |
v8cfloat | v2, | ||
v8caccfloat | acc | ||
) |
Elementwise multiplication of fp32 and cfloat data elements (emulation using bf16 datapath). For an explanation of how these operations work, see Multiply Accumulate.
acc | accumulator input |
v1 | Vector v1 |
v2 | Vector v2 |
v8caccfloat mac_elem_8_accuracy_safe | ( | v8cfloat | v1, |
v8cfloat | v2, | ||
v8caccfloat | acc | ||
) |
Elementwise multiplication of fp32 and cfloat data elements (emulation using bf16 datapath). For an explanation of how these operations work, see Multiply Accumulate.
acc | accumulator input |
v1 | Vector v1 |
v2 | Vector v2 |
v8caccfloat mac_elem_8_accuracy_safe | ( | v8cfloat | v1, |
v8float | v2, | ||
v8caccfloat | acc | ||
) |
Elementwise multiplication of fp32 and cfloat data elements (emulation using bf16 datapath). For an explanation of how these operations work, see Multiply Accumulate.
acc | accumulator input |
v1 | Vector v1 |
v2 | Vector v2 |
v8caccfloat mac_elem_8_accuracy_safe | ( | v8float | v1, |
v8cfloat | v2, | ||
v8caccfloat | acc | ||
) |
Elementwise multiplication of fp32 and cfloat data elements (emulation using bf16 datapath). For an explanation of how these operations work, see Multiply Accumulate.
acc | accumulator input |
v1 | Vector v1 |
v2 | Vector v2 |
v16accfloat msc_4x8_8x4 | ( | v32float | v1, |
v32float | v2, | ||
v16accfloat | acc | ||
) |
Matrix multiplication of fp32 data elements (emulation using bf16 datapath). The default behavior of msc_4x8_8x4 is the same as msc_4x8_8x4_accuracy_safe (slower, but more accurate). Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map msc_4x8_8x4 to msc_4x8_8x4_accuracy_fast (improved performance at the risk of a slight reduction in accuracy). Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map msc_4x8_8x4 to msc_4x8_8x4_accuracy_low (best performance at the risk of accuracy loss). For an explanation of how these operations work, see Multiply Accumulate.
v1 | Vector v1 |
v2 | Vector v2 |
acc | acc input |
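A host-side scalar model of msc_4x8_8x4 can be sketched in the same spirit. The version below assumes that the operation subtracts the 4x8 by 8x4 matrix product from the accumulator (acc - v1*v2) and that all operands are stored row-major. Both the sign convention, inferred from the intrinsic's name, and the layout are assumptions; `Mat32`, `Mat16`, and `ref_msc_4x8_8x4` are hypothetical names, and the bf16 emulation is not modeled.

```cpp
#include <array>
#include <cstddef>

// Hypothetical host-side stand-ins for v32float and v16accfloat.
using Mat32 = std::array<float, 32>;
using Mat16 = std::array<float, 16>;

// Assumed semantics: out = acc - A * B, with A (4x8), B (8x4), and the
// 4x4 result all stored row-major.
Mat16 ref_msc_4x8_8x4(const Mat32& a, const Mat32& b, const Mat16& acc) {
    Mat16 out{};
    for (std::size_t r = 0; r < 4; ++r) {
        for (std::size_t c = 0; c < 4; ++c) {
            // Dot product of row r of A and column c of B.
            float dot = 0.0f;
            for (std::size_t k = 0; k < 8; ++k) {
                dot += a[r * 8 + k] * b[k * 4 + c];
            }
            out[r * 4 + c] = acc[r * 4 + c] - dot;
        }
    }
    return out;
}
```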
v16accfloat msc_4x8_8x4_accuracy_fast | ( | v32float | v1, |
v32float | v2, | ||
v16accfloat | acc | ||
) |
Matrix multiplication of fp32 data elements (emulation using bf16 datapath). Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map msc_4x8_8x4 to msc_4x8_8x4_accuracy_fast.
v1 | Vector v1 |
v2 | Vector v2 |
acc | acc input |
v16accfloat msc_4x8_8x4_accuracy_low | ( | v32float | v1, |
v32float | v2, | ||
v16accfloat | acc | ||
) |
Matrix multiplication of fp32 data elements (emulation using bf16 datapath). Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map msc_4x8_8x4 to msc_4x8_8x4_accuracy_low.
v1 | Vector v1 |
v2 | Vector v2 |
acc | acc input |
v16accfloat msc_4x8_8x4_accuracy_safe | ( | v32float | v1, |
v32float | v2, | ||
v16accfloat | acc | ||
) |
Matrix multiplication of fp32 data elements (emulation using bf16 datapath). The default behavior of msc_4x8_8x4 is the same as msc_4x8_8x4_accuracy_safe.
v1 | Vector v1 |
v2 | Vector v2 |
acc | acc input |
v16accfloat msc_elem_16 | ( | v16float | v1, |
v16float | v2, | ||
v16accfloat | acc | ||
) |
Elementwise multiplication of fp32 data elements (emulation using bf16 datapath). The default behavior of msc_elem_16 is the same as msc_elem_16_accuracy_safe. Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map msc_elem_16 to msc_elem_16_accuracy_fast. Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map msc_elem_16 to msc_elem_16_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate.
acc | accumulator input |
v1 | Vector v1 |
v2 | Vector v2 |
v16accfloat msc_elem_16_accuracy_fast | ( | v16float | v1, |
v16float | v2, | ||
v16accfloat | acc | ||
) |
Elementwise multiplication of fp32 data elements (emulation using bf16 datapath). Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map msc_elem_16 to msc_elem_16_accuracy_fast. For an explanation of how these operations work, see Multiply Accumulate.
acc | accumulator input |
v1 | Vector v1 |
v2 | Vector v2 |
v16accfloat msc_elem_16_accuracy_low | ( | v16float | v1, |
v16float | v2, | ||
v16accfloat | acc | ||
) |
Elementwise multiplication of fp32 data elements (emulation using bf16 datapath). Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map msc_elem_16 to msc_elem_16_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate.
acc | accumulator input |
v1 | Vector v1 |
v2 | Vector v2 |
v16accfloat msc_elem_16_accuracy_safe | ( | v16float | v1, |
v16float | v2, | ||
v16accfloat | acc | ||
) |
Elementwise multiplication of fp32 data elements (emulation using bf16 datapath). The default behavior of msc_elem_16 is the same as msc_elem_16_accuracy_safe. For an explanation of how these operations work, see Multiply Accumulate.
acc | accumulator input |
v1 | Vector v1 |
v2 | Vector v2 |
v8caccfloat msc_elem_8 | ( | v8cfloat | v1, |
v8cfloat | v2, | ||
v8caccfloat | acc | ||
) |
Elementwise multiplication of fp32 and cfloat data elements (emulation using bf16 datapath). By default, the msc_elem_8 intrinsic behaves the same as the msc_elem_8_accuracy_safe intrinsic. Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map the msc_elem_8 intrinsic to msc_elem_8_accuracy_fast. Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map the msc_elem_8 intrinsic to msc_elem_8_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate.
acc | accumulator input |
v1 | Vector v1 |
v2 | Vector v2 |
v8caccfloat msc_elem_8 | ( | v8cfloat | v1, |
v8float | v2, | ||
v8caccfloat | acc | ||
) |
Elementwise multiplication of fp32 and cfloat data elements (emulation using bf16 datapath). By default, the msc_elem_8 intrinsic behaves the same as the msc_elem_8_accuracy_safe intrinsic. Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map the msc_elem_8 intrinsic to msc_elem_8_accuracy_fast. Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map the msc_elem_8 intrinsic to msc_elem_8_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate.
acc | accumulator input |
v1 | Vector v1 |
v2 | Vector v2 |
v8caccfloat msc_elem_8 | ( | v8float | v1, |
v8cfloat | v2, | ||
v8caccfloat | acc | ||
) |
Elementwise multiplication of fp32 and cfloat data elements (emulation using bf16 datapath). By default, the msc_elem_8 intrinsic behaves the same as the msc_elem_8_accuracy_safe intrinsic. Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map the msc_elem_8 intrinsic to msc_elem_8_accuracy_fast. Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map the msc_elem_8 intrinsic to msc_elem_8_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate.
acc | accumulator input |
v1 | Vector v1 |
v2 | Vector v2 |
v8caccfloat msc_elem_8_accuracy_fast | ( | v8cfloat | v1, |
v8cfloat | v2, | ||
v8caccfloat | acc | ||
) |
Elementwise multiplication of fp32 and cfloat data elements (emulation using bf16 datapath). Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map the msc_elem_8 intrinsic to msc_elem_8_accuracy_fast. For an explanation of how these operations work, see Multiply Accumulate.
acc | accumulator input |
v1 | Vector v1 |
v2 | Vector v2 |
v8caccfloat msc_elem_8_accuracy_fast | ( | v8cfloat | v1, |
v8float | v2, | ||
v8caccfloat | acc | ||
) |
Elementwise multiplication of fp32 and cfloat data elements (emulation using bf16 datapath). Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map the msc_elem_8 intrinsic to msc_elem_8_accuracy_fast. For an explanation of how these operations work, see Multiply Accumulate.
acc | accumulator input |
v1 | Vector v1 |
v2 | Vector v2 |
v8caccfloat msc_elem_8_accuracy_fast | ( | v8float | v1, |
v8cfloat | v2, | ||
v8caccfloat | acc | ||
) |
Elementwise multiplication of fp32 and cfloat data elements (emulation using bf16 datapath). Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map the msc_elem_8 intrinsic to msc_elem_8_accuracy_fast. For an explanation of how these operations work, see Multiply Accumulate.
acc | accumulator input |
v1 | Vector v1 |
v2 | Vector v2 |
v8caccfloat msc_elem_8_accuracy_low | ( | v8cfloat | v1, |
v8cfloat | v2, | ||
v8caccfloat | acc | ||
) |
Elementwise multiplication of fp32 and cfloat data elements (emulation using bf16 datapath). Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map the msc_elem_8 intrinsic to msc_elem_8_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate.
acc | accumulator input |
v1 | Vector v1 |
v2 | Vector v2 |
v8caccfloat msc_elem_8_accuracy_low | ( | v8cfloat | v1, |
v8float | v2, | ||
v8caccfloat | acc | ||
) |
Elementwise multiplication of fp32 and cfloat data elements (emulation using bf16 datapath). Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map the msc_elem_8 intrinsic to msc_elem_8_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate.
acc | accumulator input |
v1 | Vector v1 |
v2 | Vector v2 |
v8caccfloat msc_elem_8_accuracy_low | ( | v8float | v1, |
v8cfloat | v2, | ||
v8caccfloat | acc | ||
) |
Elementwise multiplication of fp32 and cfloat data elements (emulation using bf16 datapath). Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map the msc_elem_8 intrinsic to msc_elem_8_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate.
acc | accumulator input |
v1 | Vector v1 |
v2 | Vector v2 |
v8caccfloat msc_elem_8_accuracy_safe | ( | v8cfloat | v1, |
v8cfloat | v2, | ||
v8caccfloat | acc | ||
) |
Elementwise multiplication of fp32 and cfloat data elements (emulation using bf16 datapath). By default, the msc_elem_8 intrinsic behaves the same as msc_elem_8_accuracy_safe. For an explanation of how these operations work, see Multiply Accumulate.
acc | accumulator input |
v1 | Vector v1 |
v2 | Vector v2 |
v8caccfloat msc_elem_8_accuracy_safe | ( | v8cfloat | v1, |
v8float | v2, | ||
v8caccfloat | acc | ||
) |
Elementwise multiplication of fp32 and cfloat data elements (emulation using bf16 datapath). By default, the msc_elem_8 intrinsic behaves the same as msc_elem_8_accuracy_safe. For an explanation of how these operations work, see Multiply Accumulate.
acc | accumulator input |
v1 | Vector v1 |
v2 | Vector v2 |
v8caccfloat msc_elem_8_accuracy_safe | ( | v8float | v1, |
v8cfloat | v2, | ||
v8caccfloat | acc | ||
) |
Elementwise multiplication of fp32 and cfloat data elements (emulation using bf16 datapath). By default, the msc_elem_8 intrinsic behaves the same as msc_elem_8_accuracy_safe. For an explanation of how these operations work, see Multiply Accumulate.
acc | accumulator input |
v1 | Vector v1 |
v2 | Vector v2 |
v4caccfloat mul_2x8_8x2 | ( | v16float | v1, |
v16cfloat | v2 | ||
) |
Matrix multiplication of fp32 and cfloat data elements (emulation using bf16 datapath). By default, mul_2x8_8x2 behaves the same as mul_2x8_8x2_accuracy_safe (slower, but more accurate). Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map the mul_2x8_8x2 intrinsic to mul_2x8_8x2_accuracy_fast (improved performance at the risk of a slight reduction in accuracy). Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map the mul_2x8_8x2 intrinsic to mul_2x8_8x2_accuracy_low (best performance at the risk of accuracy loss). For an explanation of how these operations work, see Multiply Accumulate.
v1 | Vector v1 |
v2 | Vector v2 |
v4caccfloat mul_2x8_8x2_accuracy_fast | ( | v16float | v1, |
v16cfloat | v2 | ||
) |
Matrix multiplication of fp32 and cfloat data elements (emulation using bf16 datapath). Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map the mul_2x8_8x2 intrinsic to mul_2x8_8x2_accuracy_fast (improved performance at the risk of a slight reduction in accuracy). For an explanation of how these operations work, see Multiply Accumulate.
v1 | Vector v1 |
v2 | Vector v2 |
v4caccfloat mul_2x8_8x2_accuracy_low | ( | v16float | v1, |
v16cfloat | v2 | ||
) |
Matrix multiplication of fp32 and cfloat data elements (emulation using bf16 datapath). Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map the mul_2x8_8x2 intrinsic to mul_2x8_8x2_accuracy_low (best performance at the risk of accuracy loss). For an explanation of how these operations work, see Multiply Accumulate.
v1 | Vector v1 |
v2 | Vector v2 |
v4caccfloat mul_2x8_8x2_accuracy_safe | ( | v16float | v1, |
v16cfloat | v2 | ||
) |
Matrix multiplication of fp32 and cfloat data elements (emulation using bf16 datapath). By default, mul_2x8_8x2 behaves the same as mul_2x8_8x2_accuracy_safe (slower, but more accurate). For an explanation of how these operations work, see Multiply Accumulate.
v1 | Vector v1 |
v2 | Vector v2 |
v16accfloat mul_4x8_8x4 | ( | v32float | v1, |
v32float | v2 | ||
) |
Matrix multiplication of fp32 data elements (emulation using bf16 datapath). By default, mul_4x8_8x4 behaves the same as mul_4x8_8x4_accuracy_safe (slower, but more accurate). Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map the mul_4x8_8x4 intrinsic to mul_4x8_8x4_accuracy_fast (improved performance at the risk of a slight reduction in accuracy). Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map the mul_4x8_8x4 intrinsic to mul_4x8_8x4_accuracy_low (best performance at the risk of accuracy loss). For an explanation of how these operations work, see Multiply Accumulate.
v1 | Vector v1 |
v2 | Vector v2 |
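For orientation, a plain scalar reference of a 4x8 by 8x4 product might look like the sketch below. It assumes both operands are stored row-major in their 32-element vectors and that the 16 results form a row-major 4x4 matrix; the actual lane layout of the intrinsic is not stated here, so treat the indexing as an assumption. Mat32, Mat16, filled32 and mul_4x8_8x4_ref are hypothetical names, and the bf16 emulation is not modelled.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

using Mat32 = std::array<float, 32>;  // 4x8 or 8x4 operand (row-major, assumed)
using Mat16 = std::array<float, 16>;  // 4x4 result (row-major, assumed)

// Hypothetical helper: a 32-element operand with every entry set to value.
Mat32 filled32(float value) {
    Mat32 m;
    m.fill(value);
    return m;
}

// Scalar reference: c[row][col] = sum over k of a[row][k] * b[k][col].
Mat16 mul_4x8_8x4_ref(const Mat32& a, const Mat32& b) {
    Mat16 c{};
    for (std::size_t row = 0; row < 4; ++row) {
        for (std::size_t col = 0; col < 4; ++col) {
            float sum = 0.0f;
            for (std::size_t k = 0; k < 8; ++k) {
                sum += a[row * 8 + k] * b[k * 4 + col];
            }
            c[row * 4 + col] = sum;
        }
    }
    return c;
}
```

Each output element accumulates 8 products, so an all-ones matrix times an all-twos matrix gives 16 in every position.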
v16accfloat mul_4x8_8x4_accuracy_fast | ( | v32float | v1, |
v32float | v2 | ||
) |
Matrix multiplication of fp32 data elements (emulation using bf16 datapath). Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map the mul_4x8_8x4 intrinsic to the mul_4x8_8x4_accuracy_fast intrinsic.
v1 | Vector v1 |
v2 | Vector v2 |
v16accfloat mul_4x8_8x4_accuracy_low | ( | v32float | v1, |
v32float | v2 | ||
) |
Matrix multiplication of fp32 data elements (emulation using bf16 datapath; 16 bits of the mantissa are used). Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map the mul_4x8_8x4 intrinsic to mul_4x8_8x4_accuracy_low.
v1 | Vector v1 |
v2 | Vector v2 |
v16accfloat mul_4x8_8x4_accuracy_safe | ( | v32float | v1, |
v32float | v2 | ||
) |
Matrix multiplication of fp32 data elements (emulation using bf16 datapath). By default, the mul_4x8_8x4 intrinsic behaves the same as the mul_4x8_8x4_accuracy_safe intrinsic.
v1 | Vector v1 |
v2 | Vector v2 |
v16accfloat mul_elem_16 | ( | v16float | v1, |
v16float | v2 | ||
) |
Elementwise multiplication of fp32 data elements (emulation using bf16 datapath). By default, the mul_elem_16 intrinsic behaves the same as the mul_elem_16_accuracy_safe intrinsic (all mantissa bits are used and the MAC outputs of the least significant terms are not discarded). Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map the mul_elem_16 intrinsic to mul_elem_16_accuracy_fast. Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map the mul_elem_16 intrinsic to mul_elem_16_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate.
v1 | Vector v1 |
v2 | Vector v2 |
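The flag-based mapping described above can be pictured with a small preprocessor sketch. This is not the header's real implementation: the enum Variant and the function selected_variant are hypothetical, and the precedence when both flags are defined (FAST is checked first here) is an assumption.

```cpp
#include <cassert>

// Hypothetical tag for the three accuracy variants named in this guide.
enum class Variant { Safe, Fast, Low };

// Sketch of compile-time selection: with neither flag defined, the default
// is the accuracy_safe variant, as the descriptions above state.
constexpr Variant selected_variant() {
#if defined(AIE_FP32_EMULATION_ACCURACY_FAST)
    return Variant::Fast;   // assumption: FAST wins if both flags are set
#elif defined(AIE_FP32_EMULATION_ACCURACY_LOW)
    return Variant::Low;
#else
    return Variant::Safe;
#endif
}
```

In a real build the flags would typically be passed on the compiler command line (for example as -DAIE_FP32_EMULATION_ACCURACY_FAST) so that every translation unit sees the same mapping.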
v16accfloat mul_elem_16_accuracy_fast | ( | v16float | v1, |
v16float | v2 | ||
) |
Elementwise multiplication of fp32 data elements (emulation using bf16 datapath). Each input fp32 number is split into 3 bfloat16 numbers, which gives 9 MAC operations (3*3) per output lane. Of these, the 3 least significant MAC results are ignored in the implementation to improve cycle count. Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map the mul_elem_16 intrinsic to mul_elem_16_accuracy_fast.
v1 | Vector v1 |
v2 | Vector v2 |
v16accfloat mul_elem_16_accuracy_low | ( | v16float | v1, |
v16float | v2 | ||
) |
Elementwise multiplication of fp32 data elements (emulation using bf16 datapath). The fp32 mantissa is extracted as 2 bfloat16 numbers, which gives 4 MAC operations per output lane. Of these, the last MAC operation, which involves only the LSBs, is ignored to improve cycle count. Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map the mul_elem_16 intrinsic to mul_elem_16_accuracy_low.
v1 | Vector v1 |
v2 | Vector v2 |
v16accfloat mul_elem_16_accuracy_safe | ( | v16float | v1, |
v16float | v2 | ||
) |
Elementwise multiplication of fp32 data elements (emulation using bf16 datapath). Each input fp32 number is split into 3 bfloat16 numbers, which gives 9 MAC operations (3*3) per lane (for example, v1.lane0 * v2.lane0). By default, the mul_elem_16 intrinsic behaves the same as the mul_elem_16_accuracy_safe intrinsic.
v1 | Vector v1 |
v2 | Vector v2 |
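The split into bfloat16 pieces and the resulting partial products can be illustrated for a single lane in plain C++. The sketch below is a numerical model under stated assumptions, not the hardware algorithm: the split uses truncation, partial products are accumulated in double, and the names to_bf16_trunc, split3, Accuracy and emulated_mul are hypothetical. The Safe mode sums all 9 terms, Fast drops the 3 least significant terms, and Low keeps 3 of the 4 terms of a 2-way split.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Truncate an fp32 value to bfloat16 precision by keeping its top 16 bits.
static float to_bf16_trunc(float x) {
    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof bits);
    bits &= 0xFFFF0000u;
    std::memcpy(&x, &bits, sizeof x);
    return x;
}

struct Bf16x3 { float hi, mid, lo; };

// Split x into three bfloat16-representable pieces with hi + mid + lo == x
// (for normal, finite inputs).
static Bf16x3 split3(float x) {
    Bf16x3 s;
    s.hi = to_bf16_trunc(x);
    float rest = x - s.hi;        // exact: only low mantissa bits remain
    s.mid = to_bf16_trunc(rest);
    rest -= s.mid;                // exact for the same reason
    s.lo = to_bf16_trunc(rest);
    return s;
}

enum class Accuracy { Safe, Fast, Low };

// One-lane model of the emulated product (accumulation in double is an
// assumption made to isolate the effect of the dropped terms).
double emulated_mul(float x, float y, Accuracy mode) {
    const Bf16x3 a = split3(x);
    const Bf16x3 b = split3(y);
    auto p = [](float u, float v) {
        return static_cast<double>(u) * static_cast<double>(v);
    };
    // Low: hi/mid split only, mid*mid (the LSB term) is skipped.
    double sum = p(a.hi, b.hi) + p(a.hi, b.mid) + p(a.mid, b.hi);
    if (mode == Accuracy::Low) return sum;
    // Fast: 6 of the 9 terms of the 3-way split.
    sum += p(a.hi, b.lo) + p(a.mid, b.mid) + p(a.lo, b.hi);
    if (mode == Accuracy::Fast) return sum;
    // Safe: the 3 least significant terms are included as well.
    sum += p(a.mid, b.lo) + p(a.lo, b.mid) + p(a.lo, b.lo);
    return sum;
}
```

Comparing emulated_mul against the double-precision product of the same inputs gives a feel for how much accuracy each mode trades for fewer MAC operations.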
v8caccfloat mul_elem_8 | ( | v8cfloat | v1, |
v8cfloat | v2 | ||
) |
Elementwise multiplication of fp32 and cfloat data elements (emulation using bf16 datapath). By default, the mul_elem_8 intrinsic behaves the same as the mul_elem_8_accuracy_safe intrinsic (all mantissa bits are used and the MAC outputs of the least significant terms are not discarded). Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map the mul_elem_8 intrinsic to mul_elem_8_accuracy_fast. Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map the mul_elem_8 intrinsic to mul_elem_8_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate.
v1 | Vector v1 |
v2 | Vector v2 |
v8caccfloat mul_elem_8 | ( | v8cfloat | v1, |
v8float | v2 | ||
) |
Elementwise multiplication of fp32 and cfloat data elements (emulation using bf16 datapath). By default, the mul_elem_8 intrinsic behaves the same as the mul_elem_8_accuracy_safe intrinsic (all mantissa bits are used and the MAC outputs of the least significant terms are not discarded). Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map the mul_elem_8 intrinsic to mul_elem_8_accuracy_fast. Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map the mul_elem_8 intrinsic to mul_elem_8_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate.
v1 | Vector v1 |
v2 | Vector v2 |
v8caccfloat mul_elem_8 | ( | v8float | v1, |
v8cfloat | v2 | ||
) |
Elementwise multiplication of fp32 and cfloat data elements (emulation using bf16 datapath). By default, the mul_elem_8 intrinsic behaves the same as the mul_elem_8_accuracy_safe intrinsic (all mantissa bits are used and the MAC outputs of the least significant terms are not discarded). Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map the mul_elem_8 intrinsic to mul_elem_8_accuracy_fast. Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map the mul_elem_8 intrinsic to mul_elem_8_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate.
v1 | Vector v1 |
v2 | Vector v2 |
v8caccfloat mul_elem_8_accuracy_fast | ( | v8cfloat | v1, |
v8cfloat | v2 | ||
) |
Elementwise multiplication of fp32 and cfloat data elements (emulation using bf16 datapath). Each input fp32 number is split into 3 bfloat16 numbers, which gives 9 MAC operations (3*3) per output lane. Of these, the 3 least significant MAC results are ignored in the implementation to improve cycle count. Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map the mul_elem_8 intrinsic to mul_elem_8_accuracy_fast.
v1 | Vector v1 |
v2 | Vector v2 |
v8caccfloat mul_elem_8_accuracy_fast | ( | v8cfloat | v1, |
v8float | v2 | ||
) |
Elementwise multiplication of fp32 and cfloat data elements (emulation using bf16 datapath). Each input fp32 number is split into 3 bfloat16 numbers, which gives 9 MAC operations (3*3) per output lane. Of these, the 3 least significant MAC results are ignored in the implementation to improve cycle count. Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map the mul_elem_8 intrinsic to mul_elem_8_accuracy_fast.
v1 | Vector v1 |
v2 | Vector v2 |
v8caccfloat mul_elem_8_accuracy_fast | ( | v8float | v1, |
v8cfloat | v2 | ||
) |
Elementwise multiplication of fp32 and cfloat data elements (emulation using bf16 datapath). Each input fp32 number is split into 3 bfloat16 numbers, which gives 9 MAC operations (3*3) per output lane. Of these, the 3 least significant MAC results are ignored in the implementation to improve cycle count. Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map the mul_elem_8 intrinsic to mul_elem_8_accuracy_fast.
v1 | Vector v1 |
v2 | Vector v2 |
v8caccfloat mul_elem_8_accuracy_low | ( | v8cfloat | v1, |
v8cfloat | v2 | ||
) |
Elementwise multiplication of fp32 and cfloat data elements (emulation using bf16 datapath). The fp32 mantissa is extracted as 2 bfloat16 numbers, which gives 4 MAC operations per output lane. Of these, the last MAC operation, which involves only the LSBs, is ignored to improve cycle count. Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map the mul_elem_8 intrinsic to mul_elem_8_accuracy_low.
v1 | Vector v1 |
v2 | Vector v2 |
v8caccfloat mul_elem_8_accuracy_low | ( | v8cfloat | v1, |
v8float | v2 | ||
) |
Elementwise multiplication of fp32 and cfloat data elements (emulation using bf16 datapath). The fp32 mantissa is extracted as 2 bfloat16 numbers, which gives 4 MAC operations per output lane. Of these, the last MAC operation, which involves only the LSBs, is ignored to improve cycle count. Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map the mul_elem_8 intrinsic to mul_elem_8_accuracy_low.
v1 | Vector v1 |
v2 | Vector v2 |
v8caccfloat mul_elem_8_accuracy_low | ( | v8float | v1, |
v8cfloat | v2 | ||
) |
Elementwise multiplication of fp32 and cfloat data elements (emulation using bf16 datapath). The fp32 mantissa is extracted as 2 bfloat16 numbers, which gives 4 MAC operations per output lane. Of these, the last MAC operation, which involves only the LSBs, is ignored to improve cycle count. Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map the mul_elem_8 intrinsic to mul_elem_8_accuracy_low.
v1 | Vector v1 |
v2 | Vector v2 |
v8caccfloat mul_elem_8_accuracy_safe | ( | v8cfloat | v1, |
v8cfloat | v2 | ||
) |
Elementwise multiplication of fp32 and cfloat data elements (emulation using bf16 datapath). Each input fp32 number is split into 3 bfloat16 numbers, which gives 9 MAC operations (3*3) per lane (for example, v1.lane0 * v2.lane0). By default, the mul_elem_8 intrinsic behaves the same as the mul_elem_8_accuracy_safe intrinsic.
v1 | Vector v1 |
v2 | Vector v2 |
v8caccfloat mul_elem_8_accuracy_safe | ( | v8cfloat | v1, |
v8float | v2 | ||
) |
Elementwise multiplication of fp32 and cfloat data elements (emulation using bf16 datapath). Each input fp32 number is split into 3 bfloat16 numbers, which gives 9 MAC operations (3*3) per lane (for example, v1.lane0 * v2.lane0). By default, the mul_elem_8 intrinsic behaves the same as the mul_elem_8_accuracy_safe intrinsic.
v1 | Vector v1 |
v2 | Vector v2 |
v8caccfloat mul_elem_8_accuracy_safe | ( | v8float | v1, |
v8cfloat | v2 | ||
) |
Elementwise multiplication of fp32 and cfloat data elements (emulation using bf16 datapath). Each input fp32 number is split into 3 bfloat16 numbers, which gives 9 MAC operations (3*3) per lane (for example, v1.lane0 * v2.lane0). By default, the mul_elem_8 intrinsic behaves the same as the mul_elem_8_accuracy_safe intrinsic.
v1 | Vector v1 |
v2 | Vector v2 |
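The three mul_elem_8 overloads differ only in which operand is complex. A per-lane scalar model, written with std::complex<float> as a stand-in for one cfloat lane, might look like the sketch below. The alias cfloat_ref and the function mul_lane are hypothetical names, and the bf16 emulation is not modelled here.

```cpp
#include <cassert>
#include <complex>

// Stand-in for a single cfloat lane (assumption: real and imaginary fp32 parts).
using cfloat_ref = std::complex<float>;

// Complex-by-complex lane: (ar + i*ai) * (br + i*bi).
cfloat_ref mul_lane(cfloat_ref a, cfloat_ref b) {
    return cfloat_ref(a.real() * b.real() - a.imag() * b.imag(),
                      a.real() * b.imag() + a.imag() * b.real());
}

// Complex-by-real lane: the real factor scales both components.
cfloat_ref mul_lane(cfloat_ref a, float b) {
    return cfloat_ref(a.real() * b, a.imag() * b);
}

// Real-by-complex lane: same result with the operands swapped.
cfloat_ref mul_lane(float a, cfloat_ref b) {
    return cfloat_ref(a * b.real(), a * b.imag());
}
```

Applying mul_lane to each of the 8 lanes of v1 and v2 gives a reference vector to compare against the intrinsic's output.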
v16accfloat negmul_4x8_8x4 | ( | v32float | v1, |
v32float | v2 | ||
) |
Matrix multiplication of fp32 data elements (emulation using bf16 datapath) and negation of result. By default, negmul_4x8_8x4 behaves the same as negmul_4x8_8x4_accuracy_safe (slower, but more accurate). Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map the negmul_4x8_8x4 intrinsic to negmul_4x8_8x4_accuracy_fast (improved performance at the risk of a slight reduction in accuracy). Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map the negmul_4x8_8x4 intrinsic to negmul_4x8_8x4_accuracy_low (best performance at the risk of accuracy loss). For an explanation of how these operations work, see Multiply Accumulate.
v1 | Vector v1 |
v2 | Vector v2 |
v16accfloat negmul_4x8_8x4_accuracy_fast | ( | v32float | v1, |
v32float | v2 | ||
) |
Matrix multiplication of fp32 data elements (emulation using bf16 datapath) and negation of result. Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map the negmul_4x8_8x4 intrinsic to the negmul_4x8_8x4_accuracy_fast intrinsic.
v1 | Vector v1 |
v2 | Vector v2 |
v16accfloat negmul_4x8_8x4_accuracy_low | ( | v32float | v1, |
v32float | v2 | ||
) |
Matrix multiplication of fp32 data elements (emulation using bf16 datapath) and negation of result. Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map the negmul_4x8_8x4 intrinsic to the negmul_4x8_8x4_accuracy_low intrinsic.
v1 | Vector v1 |
v2 | Vector v2 |
v16accfloat negmul_4x8_8x4_accuracy_safe | ( | v32float | v1, |
v32float | v2 | ||
) |
Matrix multiplication of fp32 data elements (emulation using bf16 datapath) and negation of result. By default, the negmul_4x8_8x4 intrinsic behaves the same as the negmul_4x8_8x4_accuracy_safe intrinsic.
v1 | Vector v1 |
v2 | Vector v2 |
v16accfloat negmul_elem_16 | ( | v16float | v1, |
v16float | v2 | ||
) |
Elementwise multiplication of fp32 data elements (emulation using bf16 datapath) and negation of result. By default, the negmul_elem_16 intrinsic behaves the same as neg(mul_elem_16_accuracy_safe). Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map the negmul_elem_16 intrinsic to negmul_elem_16_accuracy_fast. Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map the negmul_elem_16 intrinsic to negmul_elem_16_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate.
v1 | Vector v1 |
v2 | Vector v2 |
v16accfloat negmul_elem_16_accuracy_fast | ( | v16float | v1, |
v16float | v2 | ||
) |
Elementwise multiplication of fp32 data elements (emulation using bf16 datapath) and negation of result. Each input fp32 number is split into 3 bfloat16 numbers, which gives 9 MAC operations (3*3) per output lane. Of these, the 3 least significant MAC results are ignored in the implementation to improve cycle count. Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map the negmul_elem_16 intrinsic to negmul_elem_16_accuracy_fast.
v1 | Vector v1 |
v2 | Vector v2 |
v16accfloat negmul_elem_16_accuracy_low | ( | v16float | v1, |
v16float | v2 | ||
) |
Elementwise multiplication of fp32 data elements (emulation using bf16 datapath) and negation of result. The fp32 mantissa is extracted as 2 bfloat16 numbers, which gives 4 MAC operations per output lane. Of these, the last MAC operation, which involves only the LSBs, is ignored to improve cycle count. Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map the negmul_elem_16 intrinsic to negmul_elem_16_accuracy_low.
v1 | Vector v1 |
v2 | Vector v2 |
v16accfloat negmul_elem_16_accuracy_safe | ( | v16float | v1, |
v16float | v2 | ||
) |
Elementwise multiplication of fp32 data elements (emulation using bf16 datapath) and negation of result. Each input fp32 number is split into 3 bfloat16 numbers, which gives 9 MAC operations (3*3) per lane (for example, v1.lane0 * v2.lane0). By default, the negmul_elem_16 intrinsic behaves the same as the negmul_elem_16_accuracy_safe intrinsic.
v1 | Vector v1 |
v2 | Vector v2 |
v8caccfloat negmul_elem_8 | ( | v8cfloat | v1, |
v8cfloat | v2 | ||
) |
Elementwise multiplication of cfloat data elements (emulation using bf16 datapath) and negation of result. By default, the negmul_elem_8 intrinsic behaves the same as neg(mul_elem_8_accuracy_safe). Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map the negmul_elem_8 intrinsic to negmul_elem_8_accuracy_fast. Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map the negmul_elem_8 intrinsic to negmul_elem_8_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate.
v1 | Vector v1 |
v2 | Vector v2 |
v8caccfloat negmul_elem_8 | ( | v8cfloat | v1, |
v8float | v2 | ||
) |
Elementwise multiplication of fp32 and cfloat data elements (emulation using bf16 datapath) and negation of result. By default, the negmul_elem_8 intrinsic behaves the same as neg(mul_elem_8_accuracy_safe). Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map the negmul_elem_8 intrinsic to negmul_elem_8_accuracy_fast. Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map the negmul_elem_8 intrinsic to negmul_elem_8_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate.
v1 | Vector v1 |
v2 | Vector v2 |
v8caccfloat negmul_elem_8 | ( | v8float | v1, |
v8cfloat | v2 | ||
) |
Elementwise multiplication of fp32 and cfloat data elements (emulation using bf16 datapath) and negation of result. By default, the negmul_elem_8 intrinsic behaves the same as neg(mul_elem_8_accuracy_safe). Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map the negmul_elem_8 intrinsic to negmul_elem_8_accuracy_fast. Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map the negmul_elem_8 intrinsic to negmul_elem_8_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate.
v1 | Vector v1 |
v2 | Vector v2 |
v8caccfloat negmul_elem_8_accuracy_fast | ( | v8cfloat | v1, |
v8cfloat | v2 | ||
) |
Elementwise multiplication of fp32 and cfloat data elements (emulation using bf16 datapath) and negation of result. Each input fp32 number is split into 3 bfloat16 numbers, which gives 9 MAC operations (3*3) per output lane. Of these, the 3 least significant MAC results are ignored in the implementation to improve cycle count. Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map the negmul_elem_8 intrinsic to negmul_elem_8_accuracy_fast.
v1 | Vector v1 |
v2 | Vector v2 |
v8caccfloat negmul_elem_8_accuracy_fast | ( | v8cfloat | v1, |
v8float | v2 | ||
) |
Elementwise multiplication of fp32 and cfloat data elements (emulation using bf16 datapath) and negation of result. Each input fp32 number is split into 3 bfloat16 numbers, which gives 9 MAC operations (3*3) per output lane. Of these, the 3 least significant MAC results are ignored in the implementation to improve cycle count. Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map the negmul_elem_8 intrinsic to negmul_elem_8_accuracy_fast.
v1 | Vector v1 |
v2 | Vector v2 |
v8caccfloat negmul_elem_8_accuracy_fast | ( | v8float | v1, |
v8cfloat | v2 | ||
) |
Elementwise multiplication of fp32 and cfloat data elements (emulation using bf16 datapath) and negation of result. Each input fp32 number is split into 3 bfloat16 numbers, which gives 9 MAC operations (3*3) per output lane. Of these, the 3 least significant MAC results are ignored in the implementation to improve cycle count. Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map the negmul_elem_8 intrinsic to negmul_elem_8_accuracy_fast.
v1 | Vector v1 |
v2 | Vector v2 |
v8caccfloat negmul_elem_8_accuracy_low | ( | v8cfloat | v1, |
v8cfloat | v2 | ||
) |
Elementwise multiplication of fp32 and cfloat data elements (emulation using bf16 datapath) and negation of result. The fp32 mantissa is extracted as 2 bfloat16 numbers, which gives 4 MAC operations per output lane. Of these, the last MAC operation, which involves only the LSBs, is ignored to improve cycle count. Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map the negmul_elem_8 intrinsic to negmul_elem_8_accuracy_low.
v1 | Vector v1 |
v2 | Vector v2 |
v8caccfloat negmul_elem_8_accuracy_low | ( | v8cfloat | v1, |
v8float | v2 | ||
) |
Elementwise multiplication of fp32 and cfloat data elements (emulation using bf16 datapath) and negation of result. The fp32 mantissa is extracted as 2 bfloat16 numbers, which gives 4 MAC operations per output lane. Of these, the last MAC operation, which involves only the LSBs, is ignored to improve cycle count. Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map the negmul_elem_8 intrinsic to negmul_elem_8_accuracy_low.
v1 | Vector v1 |
v2 | Vector v2 |
v8caccfloat negmul_elem_8_accuracy_low | ( | v8float | v1, |
v8cfloat | v2 | ||
) |
Elementwise multiplication of fp32 and cfloat data elements (emulation using bf16 datapath) and negation of result. The fp32 mantissa is extracted as 2 bfloat16 numbers, which gives 4 MAC operations per output lane. Of these, the last MAC operation, which involves only the LSBs, is ignored to improve cycle count. Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map the negmul_elem_8 intrinsic to negmul_elem_8_accuracy_low.
v1 | Vector v1 |
v2 | Vector v2 |
v8caccfloat negmul_elem_8_accuracy_safe | ( | v8cfloat | v1, |
v8cfloat | v2 | ||
) |
Elementwise multiplication of cfloat data elements (emulation using bf16 datapath) and negation of result. Each input fp32 number is split into 3 bfloat16 numbers, which gives 9 MAC operations (3*3) per lane (for example, v1.lane0 * v2.lane0). By default, the negmul_elem_8 intrinsic behaves the same as the negmul_elem_8_accuracy_safe intrinsic.
v1 | Vector v1 |
v2 | Vector v2 |
v8caccfloat negmul_elem_8_accuracy_safe | ( | v8cfloat | v1, |
v8float | v2 | ||
) |
Elementwise multiplication of fp32 and cfloat data elements (emulation using bf16 datapath) and negation of result. Each input fp32 number is split into 3 bfloat16 numbers, which gives 9 MAC operations (3*3) per lane (for example, v1.lane0 * v2.lane0). By default, the negmul_elem_8 intrinsic behaves the same as the negmul_elem_8_accuracy_safe intrinsic.
v1 | Vector v1 |
v2 | Vector v2 |
v8caccfloat negmul_elem_8_accuracy_safe | ( | v8float | v1, |
v8cfloat | v2 | ||
) |
Elementwise multiplication of fp32 and cfloat data elements (emulation using bf16 datapath) and negation of result. Each input fp32 number is split into 3 bfloat16 numbers, which gives 9 MAC operations (3*3) per lane (for example, v1.lane0 * v2.lane0). By default, the negmul_elem_8 intrinsic behaves the same as the negmul_elem_8_accuracy_safe intrinsic.
v1 | Vector v1 |
v2 | Vector v2 |