Overview

Elementwise multiplication and matrix multiplication of fp32 data, emulated on the bfloat16 (bf16) datapath. Two options are available: the intrinsics can be used with or without a preceding set_rnd(0) call, which sets the rounding mode to truncation. Define the AIE_FP32_EMULATION_SET_RND_MODE flag to set the rounding mode to truncation. For an explanation of how these operations work, see Multiply Accumulate.

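The following C++ sketch runs on an ordinary CPU and illustrates why truncation suits this kind of emulation. It assumes that a bfloat16 value is the upper 16 bits of an IEEE-754 binary32 value and that set_rnd(0) selects truncation (round toward zero), as described above. The helpers to_bf16_trunc, to_bf16_rne and split3 are hypothetical illustrations, not AIE intrinsics. With truncation, every remainder in the split is exact in fp32, so three bfloat16 pieces reproduce a normal fp32 value exactly.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// bfloat16 is modeled as an fp32 value whose low 16 bits are zero.
static float bits_to_float(std::uint32_t u) {
    float f;
    std::memcpy(&f, &u, sizeof f);
    return f;
}

static std::uint32_t float_to_bits(float f) {
    std::uint32_t u;
    std::memcpy(&u, &f, sizeof u);
    return u;
}

// Truncation (round toward zero): drop the low 16 bits of the encoding.
float to_bf16_trunc(float x) {
    return bits_to_float(float_to_bits(x) & 0xFFFF0000u);
}

// Round to nearest, ties to even (NaN handling omitted in this sketch).
float to_bf16_rne(float x) {
    std::uint32_t u = float_to_bits(x);
    u += 0x7FFFu + ((u >> 16) & 1u);
    return bits_to_float(u & 0xFFFF0000u);
}

// Split x into hi + mid + lo, each exactly representable in bfloat16.
// Each subtraction is exact because a truncated value keeps the sign and
// the leading bits of its input (Sterbenz lemma).
void split3(float x, float& hi, float& mid, float& lo) {
    hi = to_bf16_trunc(x);
    float r = x - hi;
    mid = to_bf16_trunc(r);
    lo = to_bf16_trunc(r - mid);  // drops nothing unless x is close to the subnormal range
}
```

For example, split3(1.2345678f, hi, mid, lo) yields three values that are each representable in bfloat16 and whose sum hi + mid + lo equals 1.2345678f exactly. Round-to-nearest-even can also be used for the split, but a piece can then have the opposite sign to the value being split.
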
Element-wise multiplication using bf16 data-path
v16accfloat	mul_elem_16 (v16float v1, v16float v2)
	Elementwise multiplication of fp32 data elements (emulation using the bf16 datapath). By default, mul_elem_16 behaves like mul_elem_16_accuracy_safe: all mantissa bits are used and the MAC outputs of the least significant terms are not discarded. Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map mul_elem_16 to mul_elem_16_accuracy_fast, or the AIE_FP32_EMULATION_ACCURACY_LOW flag to map it to mul_elem_16_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate.

v16accfloat	mul_elem_16_accuracy_low (v16float v1, v16float v2)
	Elementwise multiplication of fp32 data elements (emulation using the bf16 datapath). The fp32 mantissa is extracted as two bfloat16 numbers, which gives 4 MAC operations per output lane; the last of these, involving the LSBs, is skipped to improve cycle count. Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map mul_elem_16 to mul_elem_16_accuracy_low.

v16accfloat	mul_elem_16_accuracy_fast (v16float v1, v16float v2)
	Elementwise multiplication of fp32 data elements (emulation using the bf16 datapath). Each fp32 input is split into three bfloat16 numbers, which gives 9 (3 × 3) MAC operations per output lane; the results of the 3 least significant MAC operations are ignored to improve cycle count. Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map mul_elem_16 to mul_elem_16_accuracy_fast.

v16accfloat	mul_elem_16_accuracy_safe (v16float v1, v16float v2)
	Elementwise multiplication of fp32 data elements (emulation using the bf16 datapath). Each fp32 input is split into three bfloat16 numbers, which gives 9 (3 × 3) MAC operations per output lane (for example, v1.lane0 × v2.lane0). By default, mul_elem_16 behaves like mul_elem_16_accuracy_safe.

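The three accuracy levels can be compared with a scalar C++ model. This is a sketch under stated assumptions, not the library implementation: it splits each fp32 operand with truncation, treats the partial product a[i] * b[j] as having order i + j, takes the "3 least significant" terms of the fast level to be those with i + j ≥ 3, and accumulates in fp32 starting from the least significant order. The function names (mul_safe, mul_fast, mul_low, mul) are hypothetical; only the AIE_FP32_EMULATION_ACCURACY_* flag names come from this page.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// bfloat16 modeled as an fp32 value with the low 16 bits cleared (truncation).
static float bf16_trunc(float x) {
    std::uint32_t u;
    std::memcpy(&u, &x, sizeof u);
    u &= 0xFFFF0000u;
    std::memcpy(&x, &u, sizeof x);
    return x;
}

// Split x into n bfloat16 pieces, most significant first.
// With n == 2 the final remainder (the low 8 significand bits) is discarded.
static void split(float x, float* p, int n) {
    for (int i = 0; i < n; ++i) {
        p[i] = bf16_trunc(x);
        x -= p[i];  // exact remainder
    }
}

// Sum the partial products a[i] * b[j] whose order i + j is at most
// max_order, from the least significant order upward, in an fp32 accumulator.
static float emulated_mul(float x, float y, int pieces, int max_order) {
    float a[3], b[3];
    split(x, a, pieces);
    split(y, b, pieces);
    float acc = 0.0f;
    for (int order = 2 * (pieces - 1); order >= 0; --order) {
        if (order > max_order) continue;  // drop the least significant terms
        for (int i = 0; i < pieces; ++i) {
            int j = order - i;
            if (j >= 0 && j < pieces) acc += a[i] * b[j];  // bf16 x bf16 is exact in fp32
        }
    }
    return acc;
}

// Hypothetical scalar stand-ins for the three accuracy levels.
float mul_safe(float x, float y) { return emulated_mul(x, y, 3, 4); }  // 9 products
float mul_fast(float x, float y) { return emulated_mul(x, y, 3, 2); }  // 6 products
float mul_low(float x, float y) { return emulated_mul(x, y, 2, 1); }   // 3 products

// Compile-time selection in the spirit of the flag mapping described above.
#if defined(AIE_FP32_EMULATION_ACCURACY_FAST)
float mul(float x, float y) { return mul_fast(x, y); }
#elif defined(AIE_FP32_EMULATION_ACCURACY_LOW)
float mul(float x, float y) { return mul_low(x, y); }
#else
float mul(float x, float y) { return mul_safe(x, y); }
#endif
```

In this model, the relative error against a double-precision product is bounded by roughly 1e-7 for mul_safe and below 1e-6 for mul_fast, while the bound for mul_low is on the order of 1e-4, because it keeps only 16 of the 24 significand bits of each operand and also drops one cross term.
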
v8caccfloat	mul_elem_8 (v8float v1, v8cfloat v2)
	Elementwise multiplication of fp32 and cfloat data elements (emulation using the bf16 datapath). By default, mul_elem_8 behaves like mul_elem_8_accuracy_safe: all mantissa bits are used and the MAC outputs of the least significant terms are not discarded. Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map mul_elem_8 to mul_elem_8_accuracy_fast, or the AIE_FP32_EMULATION_ACCURACY_LOW flag to map it to mul_elem_8_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate.

v8caccfloat	mul_elem_8 (v8cfloat v1, v8float v2)
	Elementwise multiplication of cfloat and fp32 data elements (emulation using the bf16 datapath). By default, mul_elem_8 behaves like mul_elem_8_accuracy_safe: all mantissa bits are used and the MAC outputs of the least significant terms are not discarded. Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map mul_elem_8 to mul_elem_8_accuracy_fast, or the AIE_FP32_EMULATION_ACCURACY_LOW flag to map it to mul_elem_8_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate.

v8caccfloat	mul_elem_8 (v8cfloat v1, v8cfloat v2)
	Elementwise multiplication of cfloat data elements (emulation using the bf16 datapath). By default, mul_elem_8 behaves like mul_elem_8_accuracy_safe: all mantissa bits are used and the MAC outputs of the least significant terms are not discarded. Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map mul_elem_8 to mul_elem_8_accuracy_fast, or the AIE_FP32_EMULATION_ACCURACY_LOW flag to map it to mul_elem_8_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate.

v8caccfloat	mul_elem_8_accuracy_low (v8float v1, v8cfloat v2)
	Elementwise multiplication of fp32 and cfloat data elements (emulation using the bf16 datapath). The fp32 mantissa is extracted as two bfloat16 numbers, which gives 4 MAC operations per output lane; the last of these, involving the LSBs, is skipped to improve cycle count. Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map mul_elem_8 to mul_elem_8_accuracy_low.

v8caccfloat	mul_elem_8_accuracy_low (v8cfloat v1, v8float v2)
	Elementwise multiplication of cfloat and fp32 data elements (emulation using the bf16 datapath). The fp32 mantissa is extracted as two bfloat16 numbers, which gives 4 MAC operations per output lane; the last of these, involving the LSBs, is skipped to improve cycle count. Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map mul_elem_8 to mul_elem_8_accuracy_low.

v8caccfloat	mul_elem_8_accuracy_low (v8cfloat v1, v8cfloat v2)
	Elementwise multiplication of cfloat data elements (emulation using the bf16 datapath). The fp32 mantissa is extracted as two bfloat16 numbers, which gives 4 MAC operations per output lane; the last of these, involving the LSBs, is skipped to improve cycle count. Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map mul_elem_8 to mul_elem_8_accuracy_low.

v8caccfloat	mul_elem_8_accuracy_fast (v8float v1, v8cfloat v2)
	Elementwise multiplication of fp32 and cfloat data elements (emulation using the bf16 datapath). Each fp32 input value is split into three bfloat16 numbers, which gives 9 (3 × 3) MAC operations per output lane; the results of the 3 least significant MAC operations are ignored to improve cycle count. Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map mul_elem_8 to mul_elem_8_accuracy_fast.

v8caccfloat	mul_elem_8_accuracy_fast (v8cfloat v1, v8float v2)
	Elementwise multiplication of cfloat and fp32 data elements (emulation using the bf16 datapath). Each fp32 input value is split into three bfloat16 numbers, which gives 9 (3 × 3) MAC operations per output lane; the results of the 3 least significant MAC operations are ignored to improve cycle count. Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map mul_elem_8 to mul_elem_8_accuracy_fast.

v8caccfloat	mul_elem_8_accuracy_fast (v8cfloat v1, v8cfloat v2)
	Elementwise multiplication of cfloat data elements (emulation using the bf16 datapath). Each fp32 input value is split into three bfloat16 numbers, which gives 9 (3 × 3) MAC operations per output lane; the results of the 3 least significant MAC operations are ignored to improve cycle count. Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map mul_elem_8 to mul_elem_8_accuracy_fast.

v8caccfloat	mul_elem_8_accuracy_safe (v8float v1, v8cfloat v2)
	Elementwise multiplication of fp32 and cfloat data elements (emulation using the bf16 datapath). Each fp32 input value is split into three bfloat16 numbers, which gives 9 (3 × 3) MAC operations per output lane (for example, v1.lane0 × v2.lane0). By default, mul_elem_8 behaves like mul_elem_8_accuracy_safe.

v8caccfloat	mul_elem_8_accuracy_safe (v8cfloat v1, v8float v2)
	Elementwise multiplication of cfloat and fp32 data elements (emulation using the bf16 datapath). Each fp32 input value is split into three bfloat16 numbers, which gives 9 (3 × 3) MAC operations per output lane (for example, v1.lane0 × v2.lane0). By default, mul_elem_8 behaves like mul_elem_8_accuracy_safe.

v8caccfloat	mul_elem_8_accuracy_safe (v8cfloat v1, v8cfloat v2)
	Elementwise multiplication of cfloat data elements (emulation using the bf16 datapath). Each fp32 input value is split into three bfloat16 numbers, which gives 9 (3 × 3) MAC operations per output lane (for example, v1.lane0 × v2.lane0). By default, mul_elem_8 behaves like mul_elem_8_accuracy_safe.

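The fp32 × cfloat and cfloat × cfloat overloads can be reasoned about through real products. The C++ sketch below assumes that each complex product is built from real emulated products in the textbook way, (a + ib)(c + id) = (ac - bd) + i(ad + bc), and uses std::complex<float> as a stand-in for cfloat. The names cmul_model and mul_safe are hypothetical, and the hardware's actual lane mapping and accumulation order may differ.

```cpp
#include <cassert>
#include <complex>
#include <cstdint>
#include <cstring>

using cfloat = std::complex<float>;  // stand-in for the AIE cfloat type

// bfloat16 modeled as an fp32 value with the low 16 bits cleared (truncation).
static float bf16_trunc(float x) {
    std::uint32_t u;
    std::memcpy(&u, &x, sizeof u);
    u &= 0xFFFF0000u;
    std::memcpy(&x, &u, sizeof x);
    return x;
}

// Scalar model of the safe level: three-way bf16 split, all 9 products kept.
static float mul_safe(float x, float y) {
    float a[3], b[3];
    for (int i = 0; i < 3; ++i) {
        a[i] = bf16_trunc(x); x -= a[i];
        b[i] = bf16_trunc(y); y -= b[i];
    }
    float acc = 0.0f;
    for (int i = 2; i >= 0; --i)
        for (int j = 2; j >= 0; --j)
            acc += a[i] * b[j];  // starting from the least significant pieces
    return acc;
}

// fp32 x cfloat: two real products per lane.
cfloat cmul_model(float x, cfloat c) {
    return {mul_safe(x, c.real()), mul_safe(x, c.imag())};
}

// cfloat x fp32: the same two products with the operands swapped.
cfloat cmul_model(cfloat c, float x) { return cmul_model(x, c); }

// cfloat x cfloat: (a + ib)(c + id) = (ac - bd) + i(ad + bc), four real products.
cfloat cmul_model(cfloat u, cfloat v) {
    return {mul_safe(u.real(), v.real()) - mul_safe(u.imag(), v.imag()),
            mul_safe(u.real(), v.imag()) + mul_safe(u.imag(), v.real())};
}
```

Under this assumption, a real × complex lane costs two emulated real products and a complex × complex lane costs four, which is one way to see why the cfloat overloads process 8 lanes rather than 16.
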
v16accfloat	negmul_elem_16 (v16float v1, v16float v2)
	Elementwise multiplication of fp32 data elements (emulation using the bf16 datapath), with the result negated. By default, negmul_elem_16 behaves like neg(mul_elem_16_accuracy_safe). Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map negmul_elem_16 to negmul_elem_16_accuracy_fast, or the AIE_FP32_EMULATION_ACCURACY_LOW flag to map it to negmul_elem_16_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate.

v8caccfloat	negmul_elem_8 (v8float v1, v8cfloat v2)
	Elementwise multiplication of fp32 and cfloat data elements (emulation using the bf16 datapath), with the result negated. By default, negmul_elem_8 behaves like neg(mul_elem_8_accuracy_safe). Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map negmul_elem_8 to negmul_elem_8_accuracy_fast, or the AIE_FP32_EMULATION_ACCURACY_LOW flag to map it to negmul_elem_8_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate.

v8caccfloat	negmul_elem_8 (v8cfloat v1, v8float v2)
	Elementwise multiplication of cfloat and fp32 data elements (emulation using the bf16 datapath), with the result negated. By default, negmul_elem_8 behaves like neg(mul_elem_8_accuracy_safe). Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map negmul_elem_8 to negmul_elem_8_accuracy_fast, or the AIE_FP32_EMULATION_ACCURACY_LOW flag to map it to negmul_elem_8_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate.

v8caccfloat	negmul_elem_8 (v8cfloat v1, v8cfloat v2)
	Elementwise multiplication of cfloat data elements (emulation using the bf16 datapath), with the result negated. By default, negmul_elem_8 behaves like neg(mul_elem_8_accuracy_safe). Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map negmul_elem_8 to negmul_elem_8_accuracy_fast, or the AIE_FP32_EMULATION_ACCURACY_LOW flag to map it to negmul_elem_8_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate.

v16accfloat	negmul_elem_16_accuracy_low (v16float v1, v16float v2)
	Elementwise multiplication of fp32 data elements (emulation using the bf16 datapath), with the result negated. The fp32 mantissa is extracted as two bfloat16 numbers, which gives 4 MAC operations per output lane; the last of these, involving the LSBs, is skipped to improve cycle count. Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map negmul_elem_16 to negmul_elem_16_accuracy_low.

v8caccfloat	negmul_elem_8_accuracy_low (v8float v1, v8cfloat v2)
	Elementwise multiplication of fp32 and cfloat data elements (emulation using the bf16 datapath), with the result negated. The fp32 mantissa is extracted as two bfloat16 numbers, which gives 4 MAC operations per output lane; the last of these, involving the LSBs, is skipped to improve cycle count. Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map negmul_elem_8 to negmul_elem_8_accuracy_low.

v8caccfloat	negmul_elem_8_accuracy_low (v8cfloat v1, v8float v2)
	Elementwise multiplication of cfloat and fp32 data elements (emulation using the bf16 datapath), with the result negated. The fp32 mantissa is extracted as two bfloat16 numbers, which gives 4 MAC operations per output lane; the last of these, involving the LSBs, is skipped to improve cycle count. Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map negmul_elem_8 to negmul_elem_8_accuracy_low.

v8caccfloat	negmul_elem_8_accuracy_low (v8cfloat v1, v8cfloat v2)
	Elementwise multiplication of cfloat data elements (emulation using the bf16 datapath), with the result negated. The fp32 mantissa is extracted as two bfloat16 numbers, which gives 4 MAC operations per output lane; the last of these, involving the LSBs, is skipped to improve cycle count. Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map negmul_elem_8 to negmul_elem_8_accuracy_low.

v16accfloat	negmul_elem_16_accuracy_fast (v16float v1, v16float v2)
	Elementwise multiplication of fp32 data elements (emulation using the bf16 datapath), with the result negated. Each fp32 input is split into three bfloat16 numbers, which gives 9 (3 × 3) MAC operations per output lane; the results of the 3 least significant MAC operations are ignored to improve cycle count. Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map negmul_elem_16 to negmul_elem_16_accuracy_fast.

v8caccfloat	negmul_elem_8_accuracy_fast (v8float v1, v8cfloat v2)
	Elementwise multiplication of fp32 and cfloat data elements (emulation using the bf16 datapath), with the result negated. Each fp32 input value is split into three bfloat16 numbers, which gives 9 (3 × 3) MAC operations per output lane; the results of the 3 least significant MAC operations are ignored to improve cycle count. Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map negmul_elem_8 to negmul_elem_8_accuracy_fast.

v8caccfloat	negmul_elem_8_accuracy_fast (v8cfloat v1, v8float v2)
	Elementwise multiplication of cfloat and fp32 data elements (emulation using the bf16 datapath), with the result negated. Each fp32 input value is split into three bfloat16 numbers, which gives 9 (3 × 3) MAC operations per output lane; the results of the 3 least significant MAC operations are ignored to improve cycle count. Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map negmul_elem_8 to negmul_elem_8_accuracy_fast.

v8caccfloat	negmul_elem_8_accuracy_fast (v8cfloat v1, v8cfloat v2)
	Elementwise multiplication of cfloat data elements (emulation using the bf16 datapath), with the result negated. Each fp32 input value is split into three bfloat16 numbers, which gives 9 (3 × 3) MAC operations per output lane; the results of the 3 least significant MAC operations are ignored to improve cycle count. Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map negmul_elem_8 to negmul_elem_8_accuracy_fast.

v16accfloat	negmul_elem_16_accuracy_safe (v16float v1, v16float v2)
	Elementwise multiplication of fp32 data elements (emulation using the bf16 datapath), with the result negated. Each fp32 input is split into three bfloat16 numbers, which gives 9 (3 × 3) MAC operations per output lane (for example, v1.lane0 × v2.lane0). By default, negmul_elem_16 behaves like negmul_elem_16_accuracy_safe.

v8caccfloat	negmul_elem_8_accuracy_safe (v8float v1, v8cfloat v2)
	Elementwise multiplication of fp32 and cfloat data elements (emulation using the bf16 datapath), with the result negated. Each fp32 input value is split into three bfloat16 numbers, which gives 9 (3 × 3) MAC operations per output lane (for example, v1.lane0 × v2.lane0). By default, negmul_elem_8 behaves like negmul_elem_8_accuracy_safe.

v8caccfloat	negmul_elem_8_accuracy_safe (v8cfloat v1, v8float v2)
	Elementwise multiplication of cfloat and fp32 data elements (emulation using the bf16 datapath), with the result negated. Each fp32 input value is split into three bfloat16 numbers, which gives 9 (3 × 3) MAC operations per output lane (for example, v1.lane0 × v2.lane0). By default, negmul_elem_8 behaves like negmul_elem_8_accuracy_safe.

v8caccfloat	negmul_elem_8_accuracy_safe (v8cfloat v1, v8cfloat v2)
	Elementwise multiplication of cfloat data elements (emulation using the bf16 datapath), with the result negated. Each fp32 input value is split into three bfloat16 numbers, which gives 9 (3 × 3) MAC operations per output lane (for example, v1.lane0 × v2.lane0). By default, negmul_elem_8 behaves like negmul_elem_8_accuracy_safe.

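A small C++ sketch of the negmul behavior, using the same hypothetical scalar model of the safe level (truncating three-way split, fp32 accumulation). Because truncation and IEEE round-to-nearest-even are both symmetric about zero, negating the emulated product in this model gives exactly the same bits as negating either operand before multiplying. The names mul_safe and negmul_safe are hypothetical, and this property is shown for the sketch only, not asserted for the hardware.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// bfloat16 modeled as an fp32 value with the low 16 bits cleared (truncation).
static float bf16_trunc(float x) {
    std::uint32_t u;
    std::memcpy(&u, &x, sizeof u);
    u &= 0xFFFF0000u;
    std::memcpy(&x, &u, sizeof x);
    return x;
}

// Scalar model of the safe level: three-way bf16 split, all 9 products kept.
static float mul_safe(float x, float y) {
    float a[3], b[3];
    for (int i = 0; i < 3; ++i) {
        a[i] = bf16_trunc(x); x -= a[i];
        b[i] = bf16_trunc(y); y -= b[i];
    }
    float acc = 0.0f;
    for (int i = 2; i >= 0; --i)
        for (int j = 2; j >= 0; --j)
            acc += a[i] * b[j];
    return acc;
}

// Hypothetical scalar model of negmul: the emulated product, negated.
float negmul_safe(float x, float y) { return -mul_safe(x, y); }
```
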
v16accfloat	mac_elem_16 (v16float v1, v16float v2, v16accfloat acc)
	Elementwise multiply-accumulate of fp32 data elements (emulation using the bf16 datapath): the elementwise product of v1 and v2 is added to acc. By default, mac_elem_16 behaves like mac_elem_16_accuracy_safe. Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map mac_elem_16 to mac_elem_16_accuracy_fast, or the AIE_FP32_EMULATION_ACCURACY_LOW flag to map it to mac_elem_16_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate.

v8caccfloat	mac_elem_8 (v8float v1, v8cfloat v2, v8caccfloat acc)
	Elementwise multiply-accumulate of fp32 and cfloat data elements (emulation using the bf16 datapath): the elementwise product of v1 and v2 is added to acc. By default, mac_elem_8 behaves like mac_elem_8_accuracy_safe. Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map mac_elem_8 to mac_elem_8_accuracy_fast, or the AIE_FP32_EMULATION_ACCURACY_LOW flag to map it to mac_elem_8_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate.

v8caccfloat	mac_elem_8 (v8cfloat v1, v8float v2, v8caccfloat acc)
	Elementwise multiply-accumulate of cfloat and fp32 data elements (emulation using the bf16 datapath): the elementwise product of v1 and v2 is added to acc. By default, mac_elem_8 behaves like mac_elem_8_accuracy_safe. Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map mac_elem_8 to mac_elem_8_accuracy_fast, or the AIE_FP32_EMULATION_ACCURACY_LOW flag to map it to mac_elem_8_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate.

v8caccfloat	mac_elem_8 (v8cfloat v1, v8cfloat v2, v8caccfloat acc)
	Elementwise multiply-accumulate of cfloat data elements (emulation using the bf16 datapath): the elementwise product of v1 and v2 is added to acc. By default, mac_elem_8 behaves like mac_elem_8_accuracy_safe. Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map mac_elem_8 to mac_elem_8_accuracy_fast, or the AIE_FP32_EMULATION_ACCURACY_LOW flag to map it to mac_elem_8_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate.

v16accfloat	mac_elem_16_accuracy_safe (v16float v1, v16float v2, v16accfloat acc)
	Elementwise multiply-accumulate of fp32 data elements (emulation using the bf16 datapath). By default, mac_elem_16 behaves like mac_elem_16_accuracy_safe. For an explanation of how these operations work, see Multiply Accumulate.

v8caccfloat	mac_elem_8_accuracy_safe (v8float v1, v8cfloat v2, v8caccfloat acc)
	Elementwise multiply-accumulate of fp32 and cfloat data elements (emulation using the bf16 datapath). By default, mac_elem_8 behaves like mac_elem_8_accuracy_safe. For an explanation of how these operations work, see Multiply Accumulate.

v8caccfloat	mac_elem_8_accuracy_safe (v8cfloat v1, v8float v2, v8caccfloat acc)
	Elementwise multiply-accumulate of cfloat and fp32 data elements (emulation using the bf16 datapath). By default, mac_elem_8 behaves like mac_elem_8_accuracy_safe. For an explanation of how these operations work, see Multiply Accumulate.

v8caccfloat	mac_elem_8_accuracy_safe (v8cfloat v1, v8cfloat v2, v8caccfloat acc)
	Elementwise multiply-accumulate of cfloat data elements (emulation using the bf16 datapath). By default, mac_elem_8 behaves like mac_elem_8_accuracy_safe. For an explanation of how these operations work, see Multiply Accumulate.

v16accfloat	mac_elem_16_accuracy_fast (v16float v1, v16float v2, v16accfloat acc)
	Elementwise multiply-accumulate of fp32 data elements (emulation using the bf16 datapath). Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map mac_elem_16 to mac_elem_16_accuracy_fast. For an explanation of how these operations work, see Multiply Accumulate.

v8caccfloat	mac_elem_8_accuracy_fast (v8float v1, v8cfloat v2, v8caccfloat acc)
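	Elementwise multiply-accumulate of fp32 and cfloat data elements (emulation using the bf16 datapath). Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map mac_elem_8 to mac_elem_8_accuracy_fast. For an explanation of how these operations work, see Multiply Accumulate.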

v8caccfloat	mac_elem_8_accuracy_fast (v8cfloat v1, v8float v2, v8caccfloat acc)
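	Elementwise multiply-accumulate of cfloat and fp32 data elements (emulation using the bf16 datapath). Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map mac_elem_8 to mac_elem_8_accuracy_fast. For an explanation of how these operations work, see Multiply Accumulate.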

v8caccfloat	mac_elem_8_accuracy_fast (v8cfloat v1, v8cfloat v2, v8caccfloat acc)
	Elementwise multiply-accumulate of cfloat data elements (emulation using the bf16 datapath). Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map mac_elem_8 to mac_elem_8_accuracy_fast. For an explanation of how these operations work, see Multiply Accumulate.

v16accfloat	mac_elem_16_accuracy_low (v16float v1, v16float v2, v16accfloat acc)
	Elementwise multiply-accumulate of fp32 data elements (emulation using the bf16 datapath). Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map mac_elem_16 to mac_elem_16_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate.

v8caccfloat	mac_elem_8_accuracy_low (v8float v1, v8cfloat v2, v8caccfloat acc)
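	Elementwise multiply-accumulate of fp32 and cfloat data elements (emulation using the bf16 datapath). Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map mac_elem_8 to mac_elem_8_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate.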

v8caccfloat	mac_elem_8_accuracy_low (v8cfloat v1, v8float v2, v8caccfloat acc)
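	Elementwise multiply-accumulate of cfloat and fp32 data elements (emulation using the bf16 datapath). Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map mac_elem_8 to mac_elem_8_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate.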

v8caccfloat	mac_elem_8_accuracy_low (v8cfloat v1, v8cfloat v2, v8caccfloat acc)
	Elementwise multiply-accumulate of cfloat data elements (emulation using the bf16 datapath). Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map mac_elem_8 to mac_elem_8_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate.

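A lane-wise C++ model of the default (safe) mac_elem_16 behavior, assuming that each of the 9 bfloat16 partial products per lane is added directly into the running fp32 accumulator. std::array<float, 16> stands in for both v16float and v16accfloat, and mac_model is a hypothetical name; the actual accumulation order on the hardware is not specified on this page.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

using v16f = std::array<float, 16>;  // stand-in for v16float / v16accfloat

// bfloat16 modeled as an fp32 value with the low 16 bits cleared (truncation).
static float bf16_trunc(float x) {
    std::uint32_t u;
    std::memcpy(&u, &x, sizeof u);
    u &= 0xFFFF0000u;
    std::memcpy(&x, &u, sizeof x);
    return x;
}

// Hypothetical model of mac_elem_16 at the safe level: for every lane, split
// both operands into three bf16 pieces and add all 9 partial products to acc.
v16f mac_model(const v16f& v1, const v16f& v2, v16f acc) {
    for (std::size_t lane = 0; lane < acc.size(); ++lane) {
        float x = v1[lane], y = v2[lane];
        float a[3], b[3];
        for (int i = 0; i < 3; ++i) {
            a[i] = bf16_trunc(x); x -= a[i];
            b[i] = bf16_trunc(y); y -= b[i];
        }
        for (int i = 2; i >= 0; --i)
            for (int j = 2; j >= 0; --j)
                acc[lane] += a[i] * b[j];  // each partial product is exact in fp32
    }
    return acc;
}
```

Adding the partial products straight into acc, rather than forming v1 × v2 first and adding it afterwards, can round differently in the last bit; which order the intrinsic uses is an open assumption in this sketch.
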
v16accfloat	addmac_elem_16 (v16float v1, v16float v2, v16accfloat acc1, v16accfloat acc2)
	Elementwise multiply-accumulate of fp32 data elements with two accumulator inputs, acc1 and acc2 (emulation using the bf16 datapath). By default, addmac_elem_16 behaves like addmac_elem_16_accuracy_safe. Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map addmac_elem_16 to addmac_elem_16_accuracy_fast, or the AIE_FP32_EMULATION_ACCURACY_LOW flag to map it to addmac_elem_16_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate.

v16accfloat	addmac_elem_16_accuracy_safe (v16float v1, v16float v2, v16accfloat acc1, v16accfloat acc2)
	Elementwise multiply-accumulate of fp32 data elements with two accumulator inputs, acc1 and acc2 (emulation using the bf16 datapath). By default, addmac_elem_16 behaves like addmac_elem_16_accuracy_safe. For an explanation of how these operations work, see Multiply Accumulate.

v16accfloat	addmac_elem_16_accuracy_fast (v16float v1, v16float v2, v16accfloat acc1, v16accfloat acc2)
	Elementwise multiply-accumulate of fp32 data elements with two accumulator inputs, acc1 and acc2 (emulation using the bf16 datapath). Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map addmac_elem_16 to addmac_elem_16_accuracy_fast. For an explanation of how these operations work, see Multiply Accumulate.

v16accfloat	addmac_elem_16_accuracy_low (v16float v1, v16float v2, v16accfloat acc1, v16accfloat acc2)
	Elementwise multiply-accumulate of fp32 data elements with two accumulator inputs, acc1 and acc2 (emulation using the bf16 datapath). Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map addmac_elem_16 to addmac_elem_16_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate.

v16accfloat	msc_elem_16 (v16float v1, v16float v2, v16accfloat acc)
	Elementwise multiply-subtract of fp32 data elements (emulation using the bf16 datapath): the elementwise product of v1 and v2 is subtracted from acc. By default, msc_elem_16 behaves like msc_elem_16_accuracy_safe. Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map msc_elem_16 to msc_elem_16_accuracy_fast, or the AIE_FP32_EMULATION_ACCURACY_LOW flag to map it to msc_elem_16_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate.

v8caccfloat	msc_elem_8 (v8float v1, v8cfloat v2, v8caccfloat acc)
	Elementwise multiply-subtract of fp32 and cfloat data elements (emulation using the bf16 datapath): the elementwise product of v1 and v2 is subtracted from acc. By default, msc_elem_8 behaves like msc_elem_8_accuracy_safe. Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map msc_elem_8 to msc_elem_8_accuracy_fast, or the AIE_FP32_EMULATION_ACCURACY_LOW flag to map it to msc_elem_8_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate.

v8caccfloat	msc_elem_8 (v8cfloat v1, v8float v2, v8caccfloat acc)
	Elementwise multiply-subtract of cfloat and fp32 data elements (emulation using the bf16 datapath): the elementwise product of v1 and v2 is subtracted from acc. By default, msc_elem_8 behaves like msc_elem_8_accuracy_safe. Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map msc_elem_8 to msc_elem_8_accuracy_fast, or the AIE_FP32_EMULATION_ACCURACY_LOW flag to map it to msc_elem_8_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate.

v8caccfloat	msc_elem_8 (v8cfloat v1, v8cfloat v2, v8caccfloat acc)
	Elementwise multiply-subtract of cfloat data elements (emulation using the bf16 datapath): the elementwise product of v1 and v2 is subtracted from acc. By default, msc_elem_8 behaves like msc_elem_8_accuracy_safe. Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map msc_elem_8 to msc_elem_8_accuracy_fast, or the AIE_FP32_EMULATION_ACCURACY_LOW flag to map it to msc_elem_8_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate.

v16accfloat	msc_elem_16_accuracy_safe (v16float v1, v16float v2, v16accfloat acc)
	Elementwise multiply-subtract of fp32 data elements (emulation using the bf16 datapath). By default, msc_elem_16 behaves like msc_elem_16_accuracy_safe. For an explanation of how these operations work, see Multiply Accumulate.

v8caccfloat	msc_elem_8_accuracy_safe (v8float v1, v8cfloat v2, v8caccfloat acc)
	Elementwise multiply-subtract of fp32 and cfloat data elements (emulation using the bf16 datapath). By default, msc_elem_8 behaves like msc_elem_8_accuracy_safe. For an explanation of how these operations work, see Multiply Accumulate.

v8caccfloat	msc_elem_8_accuracy_safe (v8cfloat v1, v8float v2, v8caccfloat acc)
	Elementwise multiply-subtract of cfloat and fp32 data elements (emulation using the bf16 datapath). By default, msc_elem_8 behaves like msc_elem_8_accuracy_safe. For an explanation of how these operations work, see Multiply Accumulate.

v8caccfloat	msc_elem_8_accuracy_safe (v8cfloat v1, v8cfloat v2, v8caccfloat acc)
	Elementwise multiply-subtract of cfloat data elements (emulation using the bf16 datapath). By default, msc_elem_8 behaves like msc_elem_8_accuracy_safe. For an explanation of how these operations work, see Multiply Accumulate.

v16accfloat	msc_elem_16_accuracy_fast (v16float v1, v16float v2, v16accfloat acc)
	Elementwise multiply-subtract of fp32 data elements (emulation using the bf16 datapath). Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map msc_elem_16 to msc_elem_16_accuracy_fast. For an explanation of how these operations work, see Multiply Accumulate.

v8caccfloat	msc_elem_8_accuracy_fast (v8float v1, v8cfloat v2, v8caccfloat acc)
	Elementwise multiply-subtract of fp32 and cfloat data elements (emulation using the bf16 datapath). Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map msc_elem_8 to msc_elem_8_accuracy_fast. For an explanation of how these operations work, see Multiply Accumulate.

v8caccfloat	msc_elem_8_accuracy_fast (v8cfloat v1, v8float v2, v8caccfloat acc)
	Elementwise multiply-subtract of cfloat and fp32 data elements (emulation using the bf16 datapath). Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map msc_elem_8 to msc_elem_8_accuracy_fast. For an explanation of how these operations work, see Multiply Accumulate.

v8caccfloat	msc_elem_8_accuracy_fast (v8cfloat v1, v8cfloat v2, v8caccfloat acc)
	Elementwise multiply-subtract of cfloat data elements (emulation using the bf16 datapath). Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map msc_elem_8 to msc_elem_8_accuracy_fast. For an explanation of how these operations work, see Multiply Accumulate.

v16accfloat	msc_elem_16_accuracy_low (v16float v1, v16float v2, v16accfloat acc)
	Elementwise multiply-subtract of fp32 data elements (emulation using the bf16 datapath). Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map msc_elem_16 to msc_elem_16_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate.

v8caccfloat	msc_elem_8_accuracy_low (v8float v1, v8cfloat v2, v8caccfloat acc)
	Elementwise multiply-subtract of fp32 and cfloat data elements (emulation using the bf16 datapath). Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map msc_elem_8 to msc_elem_8_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate.

v8caccfloat	msc_elem_8_accuracy_low (v8cfloat v1, v8float v2, v8caccfloat acc)
	Elementwise multiply-subtract of cfloat and fp32 data elements (emulation using the bf16 datapath). Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map msc_elem_8 to msc_elem_8_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate.

v8caccfloat	msc_elem_8_accuracy_low (v8cfloat v1, v8cfloat v2, v8caccfloat acc)
	Elementwise multiply-subtract of cfloat data elements (emulation using the bf16 datapath). Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map msc_elem_8 to msc_elem_8_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate.

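A scalar C++ sketch of msc (multiply-subtract) next to mac, assuming the same truncating three-way split and fp32 accumulation as the earlier models. Because the split and fp32 rounding are symmetric about zero, msc_model(x, y, acc) in this sketch returns exactly the same value as mac_model(-x, y, acc). Both function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// bfloat16 modeled as an fp32 value with the low 16 bits cleared (truncation).
static float bf16_trunc(float x) {
    std::uint32_t u;
    std::memcpy(&u, &x, sizeof u);
    u &= 0xFFFF0000u;
    std::memcpy(&x, &u, sizeof x);
    return x;
}

// Three-way truncating split, most significant piece first.
static void split3(float x, float* p) {
    for (int i = 0; i < 3; ++i) {
        p[i] = bf16_trunc(x);
        x -= p[i];  // exact remainder
    }
}

// Hypothetical scalar model of mac: acc + x * y, one partial product at a time.
float mac_model(float x, float y, float acc) {
    float a[3], b[3];
    split3(x, a);
    split3(y, b);
    for (int i = 2; i >= 0; --i)
        for (int j = 2; j >= 0; --j)
            acc += a[i] * b[j];
    return acc;
}

// Hypothetical scalar model of msc: acc - x * y, one partial product at a time.
float msc_model(float x, float y, float acc) {
    float a[3], b[3];
    split3(x, a);
    split3(y, b);
    for (int i = 2; i >= 0; --i)
        for (int j = 2; j >= 0; --j)
            acc -= a[i] * b[j];
    return acc;
}
```
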
v16accfloat	addmsc_elem_16 (v16float v1, v16float v2, v16accfloat acc1, v16accfloat acc2)
	Elementwise multiply-subtract of fp32 data elements with two accumulator inputs, acc1 and acc2 (emulation using the bf16 datapath). By default, addmsc_elem_16 behaves like addmsc_elem_16_accuracy_safe. Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map addmsc_elem_16 to addmsc_elem_16_accuracy_fast, or the AIE_FP32_EMULATION_ACCURACY_LOW flag to map it to addmsc_elem_16_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate.

v16accfloat	addmsc_elem_16_accuracy_safe (v16float v1, v16float v2, v16accfloat acc1, v16accfloat acc2)
	Elementwise multiply-subtract of fp32 data elements with two accumulator inputs, acc1 and acc2 (emulation using the bf16 datapath). By default, addmsc_elem_16 behaves like addmsc_elem_16_accuracy_safe. For an explanation of how these operations work, see Multiply Accumulate.

v16accfloat	addmsc_elem_16_accuracy_fast (v16float v1, v16float v2, v16accfloat acc1, v16accfloat acc2)
	Elementwise multiply-subtract of fp32 data elements with two accumulator inputs, acc1 and acc2 (emulation using the bf16 datapath). Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map addmsc_elem_16 to addmsc_elem_16_accuracy_fast. For an explanation of how these operations work, see Multiply Accumulate.

v16accfloat	addmsc_elem_16_accuracy_low (v16float v1, v16float v2, v16accfloat acc1, v16accfloat acc2)
	Elementwise multiply-subtract of fp32 data elements with two accumulator inputs, acc1 and acc2 (emulation using the bf16 datapath). Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map addmsc_elem_16 to addmsc_elem_16_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate.

Matrix multiplication using bf16 data-path
v16accfloat	mul_4x8_8x4 (v32float v1, v32float v2)
	Matrix multiplication of fp32 data elements (emulation using the bf16 datapath): a 4x8 matrix v1 is multiplied by an 8x4 matrix v2, producing a 4x4 result. By default, mul_4x8_8x4 behaves like mul_4x8_8x4_accuracy_safe (slower, but more accurate). Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map mul_4x8_8x4 to mul_4x8_8x4_accuracy_fast (better performance, at the risk of a slight loss of accuracy), or the AIE_FP32_EMULATION_ACCURACY_LOW flag to map it to mul_4x8_8x4_accuracy_low (best performance, at the risk of a larger loss of accuracy). For an explanation of how these operations work, see Multiply Accumulate.

v4caccfloat	mul_2x8_8x2 (v16float v1, v16cfloat v2)
	Matrix multiplication of fp32 (2x8) and cfloat (8x2) data elements (emulation using bf16 datapath). The default behavior of mul_2x8_8x2 is the same as mul_2x8_8x2_accuracy_safe (slower, but more accurate). Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map the mul_2x8_8x2 intrinsic onto mul_2x8_8x2_accuracy_fast (better performance at the risk of a slight reduction in accuracy). Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map the mul_2x8_8x2 intrinsic onto mul_2x8_8x2_accuracy_low (best performance at the risk of accuracy loss). For an explanation of how these operations work, see Multiply Accumulate.
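Going by its signature, mul_2x8_8x2 multiplies a real 2x8 matrix (16 fp32 values in v1) by a complex 8x2 matrix (16 cfloat values in v2), giving a 2x2 complex result (the 4 lanes of a v4caccfloat). The reference model below sketches that arithmetic in plain C++, assuming row-major packing of both operands and of the result; the actual lane order is not stated in this section, and ref_mul_2x8_8x2 is a hypothetical name. It models the math only, not the bf16 emulation.

```cpp
#include <array>
#include <complex>

using cplx = std::complex<float>;

// Reference model: C (2x2, complex) = A (2x8, real) * B (8x2, complex).
// Assumption: A, B and C are packed row-major.
std::array<cplx, 4> ref_mul_2x8_8x2(const std::array<float, 16>& a,
                                    const std::array<cplx, 16>& b) {
    std::array<cplx, 4> c{};
    for (int i = 0; i < 2; ++i) {
        for (int j = 0; j < 2; ++j) {
            cplx sum{0.0f, 0.0f};
            for (int k = 0; k < 8; ++k) {
                sum += a[i * 8 + k] * b[k * 2 + j];  // real * complex term
            }
            c[i * 2 + j] = sum;
        }
    }
    return c;
}
```

With every element of A equal to 1 and every element of B equal to 1 + 2i, each of the four outputs is the sum of eight identical terms, 8 + 16i.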

v16accfloat	mul_4x8_8x4_accuracy_safe (v32float v1, v32float v2)
	Matrix multiplication of fp32 data elements (emulation using bf16 datapath). The default behavior of the mul_4x8_8x4 intrinsic is the same as mul_4x8_8x4_accuracy_safe.

v4caccfloat	mul_2x8_8x2_accuracy_safe (v16float v1, v16cfloat v2)
	Matrix multiplication of fp32 (2x8) and cfloat (8x2) data elements (emulation using bf16 datapath). The default behavior of the mul_2x8_8x2 intrinsic is the same as mul_2x8_8x2_accuracy_safe (slower, but more accurate). For an explanation of how these operations work, see Multiply Accumulate.

v16accfloat	mul_4x8_8x4_accuracy_fast (v32float v1, v32float v2)
	Matrix multiplication of fp32 data elements (emulation using bf16 datapath). Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map the mul_4x8_8x4 intrinsic onto mul_4x8_8x4_accuracy_fast.

v4caccfloat	mul_2x8_8x2_accuracy_fast (v16float v1, v16cfloat v2)
	Matrix multiplication of fp32 (2x8) and cfloat (8x2) data elements (emulation using bf16 datapath). Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map the mul_2x8_8x2 intrinsic onto mul_2x8_8x2_accuracy_fast (better performance at the risk of a slight reduction in accuracy). For an explanation of how these operations work, see Multiply Accumulate.

v16accfloat	mul_4x8_8x4_accuracy_low (v32float v1, v32float v2)
	Matrix multiplication of fp32 data elements (emulation using bf16 datapath; 16 bits of the mantissa are used). Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map the mul_4x8_8x4 intrinsic onto mul_4x8_8x4_accuracy_low.

v4caccfloat	mul_2x8_8x2_accuracy_low (v16float v1, v16cfloat v2)
	Matrix multiplication of fp32 (2x8) and cfloat (8x2) data elements (emulation using bf16 datapath). Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map the mul_2x8_8x2 intrinsic onto mul_2x8_8x2_accuracy_low (best performance at the risk of accuracy loss). For an explanation of how these operations work, see Multiply Accumulate.

v16accfloat	negmul_4x8_8x4 (v32float v1, v32float v2)
	Negated matrix multiplication of fp32 data elements (emulation using bf16 datapath). The default behavior of negmul_4x8_8x4 is the same as negmul_4x8_8x4_accuracy_safe (slower, but more accurate). Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map the negmul_4x8_8x4 intrinsic onto negmul_4x8_8x4_accuracy_fast (better performance at the risk of a slight reduction in accuracy). Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map the negmul_4x8_8x4 intrinsic onto negmul_4x8_8x4_accuracy_low (best performance at the risk of accuracy loss). For an explanation of how these operations work, see Multiply Accumulate.

v16accfloat	negmul_4x8_8x4_accuracy_safe (v32float v1, v32float v2)
	Negated matrix multiplication of fp32 data elements (emulation using bf16 datapath). The default behavior of the negmul_4x8_8x4 intrinsic is the same as negmul_4x8_8x4_accuracy_safe.

v16accfloat	negmul_4x8_8x4_accuracy_fast (v32float v1, v32float v2)
	Negated matrix multiplication of fp32 data elements (emulation using bf16 datapath). Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map the negmul_4x8_8x4 intrinsic onto negmul_4x8_8x4_accuracy_fast.

v16accfloat	negmul_4x8_8x4_accuracy_low (v32float v1, v32float v2)
	Negated matrix multiplication of fp32 data elements (emulation using bf16 datapath). Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map the negmul_4x8_8x4 intrinsic onto negmul_4x8_8x4_accuracy_low.

v16accfloat	mac_4x8_8x4 (v32float v1, v32float v2, v16accfloat acc)
	Matrix multiply-accumulate of fp32 data elements (emulation using bf16 datapath). The default behavior of mac_4x8_8x4 is the same as mac_4x8_8x4_accuracy_safe (slower, but more accurate). Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map the mac_4x8_8x4 intrinsic onto mac_4x8_8x4_accuracy_fast (better performance at the risk of a slight reduction in accuracy). Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map the mac_4x8_8x4 intrinsic onto mac_4x8_8x4_accuracy_low (best performance at the risk of accuracy loss). For an explanation of how these operations work, see Multiply Accumulate.

v16accfloat	mac_4x8_8x4_accuracy_safe (v32float v1, v32float v2, v16accfloat acc)
	Matrix multiply-accumulate of fp32 data elements (emulation using bf16 datapath). The default behavior of the mac_4x8_8x4 intrinsic is the same as mac_4x8_8x4_accuracy_safe.

v16accfloat	mac_4x8_8x4_accuracy_fast (v32float v1, v32float v2, v16accfloat acc)
	Matrix multiply-accumulate of fp32 data elements (emulation using bf16 datapath). Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map the mac_4x8_8x4 intrinsic onto mac_4x8_8x4_accuracy_fast.

v16accfloat	mac_4x8_8x4_accuracy_low (v32float v1, v32float v2, v16accfloat acc)
	Matrix multiply-accumulate of fp32 data elements (emulation using bf16 datapath). Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map the mac_4x8_8x4 intrinsic onto mac_4x8_8x4_accuracy_low.

v16accfloat	addmac_4x8_8x4 (v32float v1, v32float v2, v16accfloat acc1, v16accfloat acc2)
	Matrix multiply-accumulate of fp32 data elements with two accumulator inputs (emulation using bf16 datapath). The default behavior of addmac_4x8_8x4 is the same as addmac_4x8_8x4_accuracy_safe (slower, but more accurate). Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map the addmac_4x8_8x4 intrinsic onto addmac_4x8_8x4_accuracy_fast (better performance at the risk of a slight reduction in accuracy). Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map the addmac_4x8_8x4 intrinsic onto addmac_4x8_8x4_accuracy_low (best performance at the risk of accuracy loss). For an explanation of how these operations work, see Multiply Accumulate.

v16accfloat	addmac_4x8_8x4_accuracy_safe (v32float v1, v32float v2, v16accfloat acc1, v16accfloat acc2)
	Matrix multiply-accumulate of fp32 data elements with two accumulator inputs (emulation using bf16 datapath). The default behavior of the addmac_4x8_8x4 intrinsic is the same as addmac_4x8_8x4_accuracy_safe.

v16accfloat	addmac_4x8_8x4_accuracy_fast (v32float v1, v32float v2, v16accfloat acc1, v16accfloat acc2)
	Matrix multiply-accumulate of fp32 data elements with two accumulator inputs (emulation using bf16 datapath). Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map the addmac_4x8_8x4 intrinsic onto addmac_4x8_8x4_accuracy_fast.

v16accfloat	addmac_4x8_8x4_accuracy_low (v32float v1, v32float v2, v16accfloat acc1, v16accfloat acc2)
	Matrix multiply-accumulate of fp32 data elements with two accumulator inputs (emulation using bf16 datapath). Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map the addmac_4x8_8x4 intrinsic onto addmac_4x8_8x4_accuracy_low.

v16accfloat	msc_4x8_8x4 (v32float v1, v32float v2, v16accfloat acc)
	Matrix multiply-subtract of fp32 data elements (emulation using bf16 datapath). The default behavior of msc_4x8_8x4 is the same as msc_4x8_8x4_accuracy_safe (slower, but more accurate). Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map the msc_4x8_8x4 intrinsic onto msc_4x8_8x4_accuracy_fast (better performance at the risk of a slight reduction in accuracy). Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map the msc_4x8_8x4 intrinsic onto msc_4x8_8x4_accuracy_low (best performance at the risk of accuracy loss). For an explanation of how these operations work, see Multiply Accumulate.

v16accfloat	msc_4x8_8x4_accuracy_safe (v32float v1, v32float v2, v16accfloat acc)
	Matrix multiply-subtract of fp32 data elements (emulation using bf16 datapath). The default behavior of the msc_4x8_8x4 intrinsic is the same as msc_4x8_8x4_accuracy_safe.

v16accfloat	msc_4x8_8x4_accuracy_fast (v32float v1, v32float v2, v16accfloat acc)
	Matrix multiply-subtract of fp32 data elements (emulation using bf16 datapath). Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map the msc_4x8_8x4 intrinsic onto msc_4x8_8x4_accuracy_fast.

v16accfloat	msc_4x8_8x4_accuracy_low (v32float v1, v32float v2, v16accfloat acc)
	Matrix multiply-subtract of fp32 data elements (emulation using bf16 datapath). Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map the msc_4x8_8x4 intrinsic onto msc_4x8_8x4_accuracy_low.

v16accfloat	addmsc_4x8_8x4 (v32float v1, v32float v2, v16accfloat acc1, v16accfloat acc2)
	Matrix multiply-subtract of fp32 data elements with two accumulator inputs (emulation using bf16 datapath). The default behavior of addmsc_4x8_8x4 is the same as addmsc_4x8_8x4_accuracy_safe (slower, but more accurate). Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map the addmsc_4x8_8x4 intrinsic onto addmsc_4x8_8x4_accuracy_fast (better performance at the risk of a slight reduction in accuracy). Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map the addmsc_4x8_8x4 intrinsic onto addmsc_4x8_8x4_accuracy_low (best performance at the risk of accuracy loss). For an explanation of how these operations work, see Multiply Accumulate.

v16accfloat	addmsc_4x8_8x4_accuracy_safe (v32float v1, v32float v2, v16accfloat acc1, v16accfloat acc2)
	Matrix multiply-subtract of fp32 data elements with two accumulator inputs (emulation using bf16 datapath). The default behavior of the addmsc_4x8_8x4 intrinsic is the same as addmsc_4x8_8x4_accuracy_safe.

v16accfloat	addmsc_4x8_8x4_accuracy_fast (v32float v1, v32float v2, v16accfloat acc1, v16accfloat acc2)
	Matrix multiply-subtract of fp32 data elements with two accumulator inputs (emulation using bf16 datapath). Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map the addmsc_4x8_8x4 intrinsic onto addmsc_4x8_8x4_accuracy_fast.

v16accfloat	addmsc_4x8_8x4_accuracy_low (v32float v1, v32float v2, v16accfloat acc1, v16accfloat acc2)
	Matrix multiply-subtract of fp32 data elements with two accumulator inputs (emulation using bf16 datapath). Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map the addmsc_4x8_8x4 intrinsic onto addmsc_4x8_8x4_accuracy_low.
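The matrix intrinsics in this group differ mainly in how the 4x8 × 8x4 product P is combined with the accumulator inputs. Reading the names together with the formula documented for addmsc_elem_16 (acc1 + acc2 - mul_out), a plausible per-lane summary is: mul gives P, negmul gives -P, mac gives acc + P, msc gives acc - P, addmac gives acc1 + acc2 + P, and addmsc gives acc1 + acc2 - P. The sketch below encodes that reading for a single output lane; the MatOp enum and combine function are hypothetical and model only the arithmetic, not the accuracy tiers or rounding.

```cpp
#include <cassert>

// Hypothetical labels for the intrinsic families listed above.
enum class MatOp { Mul, NegMul, Mac, Msc, AddMac, AddMsc };

// Per-lane model: p is one element of the 4x8 x 8x4 product; acc1 and acc2
// are the matching accumulator lanes (ignored by operations that do not use them).
float combine(MatOp op, float p, float acc1 = 0.0f, float acc2 = 0.0f) {
    switch (op) {
        case MatOp::Mul:    return p;
        case MatOp::NegMul: return -p;
        case MatOp::Mac:    return acc1 + p;
        case MatOp::Msc:    return acc1 - p;
        case MatOp::AddMac: return acc1 + acc2 + p;
        case MatOp::AddMsc: return acc1 + acc2 - p;
    }
    assert(false && "unknown MatOp");
    return 0.0f;
}
```

For a product lane p = 6 with acc1 = 10 and acc2 = 1, this model gives 6, -6, 16, 4, 17 and 5 for mul, negmul, mac, msc, addmac and addmsc respectively.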

Function Documentation

◆ addmac_4x8_8x4()

v16accfloat addmac_4x8_8x4	(	v32float	v1,
		v32float	v2,
		v16accfloat	acc1,
		v16accfloat	acc2
	)

Matrix multiply-accumulate of fp32 data elements with two accumulator inputs (emulation using bf16 datapath). The default behavior of addmac_4x8_8x4 is the same as addmac_4x8_8x4_accuracy_safe (slower, but more accurate). Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map the addmac_4x8_8x4 intrinsic onto addmac_4x8_8x4_accuracy_fast (better performance at the risk of a slight reduction in accuracy). Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map the addmac_4x8_8x4 intrinsic onto addmac_4x8_8x4_accuracy_low (best performance at the risk of accuracy loss). For an explanation of how these operations work, see Multiply Accumulate.

Parameters

v1	Vector v1
v2	Vector v2
acc1	Accumulator 1 input
acc2	Accumulator 2 input

Returns: Result of operation

◆ addmac_4x8_8x4_accuracy_fast()

v16accfloat addmac_4x8_8x4_accuracy_fast	(	v32float	v1,
		v32float	v2,
		v16accfloat	acc1,
		v16accfloat	acc2
	)

Matrix multiply-accumulate of fp32 data elements with two accumulator inputs (emulation using bf16 datapath). Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map the addmac_4x8_8x4 intrinsic onto addmac_4x8_8x4_accuracy_fast.

Parameters

v1	Vector v1
v2	Vector v2
acc1	Accumulator 1 input
acc2	Accumulator 2 input

Returns: Result of operation

◆ addmac_4x8_8x4_accuracy_low()

v16accfloat addmac_4x8_8x4_accuracy_low	(	v32float	v1,
		v32float	v2,
		v16accfloat	acc1,
		v16accfloat	acc2
	)

Matrix multiply-accumulate of fp32 data elements with two accumulator inputs (emulation using bf16 datapath). Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map the addmac_4x8_8x4 intrinsic onto addmac_4x8_8x4_accuracy_low.

Parameters

v1	Vector v1
v2	Vector v2
acc1	Accumulator 1 input
acc2	Accumulator 2 input

Returns: Result of operation

◆ addmac_4x8_8x4_accuracy_safe()

v16accfloat addmac_4x8_8x4_accuracy_safe	(	v32float	v1,
		v32float	v2,
		v16accfloat	acc1,
		v16accfloat	acc2
	)

Matrix multiply-accumulate of fp32 data elements with two accumulator inputs (emulation using bf16 datapath). The default behavior of the addmac_4x8_8x4 intrinsic is the same as addmac_4x8_8x4_accuracy_safe.

Parameters

v1	Vector v1
v2	Vector v2
acc1	Accumulator 1 input
acc2	Accumulator 2 input

Returns: Result of operation

◆ addmac_elem_16()

v16accfloat addmac_elem_16	(	v16float	v1,
		v16float	v2,
		v16accfloat	acc1,
		v16accfloat	acc2
	)

Elementwise multiply-accumulate of fp32 data elements with two accumulator inputs (emulation using bf16 datapath). The default behavior of the addmac_elem_16 intrinsic is the same as addmac_elem_16_accuracy_safe. Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map the addmac_elem_16 intrinsic onto addmac_elem_16_accuracy_fast. Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map the addmac_elem_16 intrinsic onto addmac_elem_16_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate.

Parameters

acc1	accumulator 1 input
acc2	accumulator 2 input
v1	Vector v1
v2	Vector v2

Returns: Elementwise multiply-accumulate result (acc1 + acc2 + v1 * v2)

◆ addmac_elem_16_accuracy_fast()

v16accfloat addmac_elem_16_accuracy_fast	(	v16float	v1,
		v16float	v2,
		v16accfloat	acc1,
		v16accfloat	acc2
	)

Elementwise multiply-accumulate of fp32 data elements with two accumulator inputs (emulation using bf16 datapath). Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map the addmac_elem_16 intrinsic onto addmac_elem_16_accuracy_fast. For an explanation of how these operations work, see Multiply Accumulate.

Parameters

acc1	accumulator 1 input
acc2	accumulator 2 input
v1	Vector v1
v2	Vector v2

Returns: Elementwise multiply-accumulate result (acc1 + acc2 + v1 * v2)

◆ addmac_elem_16_accuracy_low()

v16accfloat addmac_elem_16_accuracy_low	(	v16float	v1,
		v16float	v2,
		v16accfloat	acc1,
		v16accfloat	acc2
	)

Elementwise multiply-accumulate of fp32 data elements with two accumulator inputs (emulation using bf16 datapath). Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map the addmac_elem_16 intrinsic onto addmac_elem_16_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate.

Parameters

acc1	accumulator 1 input
acc2	accumulator 2 input
v1	Vector v1
v2	Vector v2

Returns: Elementwise multiply-accumulate result (acc1 + acc2 + v1 * v2)

◆ addmac_elem_16_accuracy_safe()

v16accfloat addmac_elem_16_accuracy_safe	(	v16float	v1,
		v16float	v2,
		v16accfloat	acc1,
		v16accfloat	acc2
	)

Elementwise multiply-accumulate of fp32 data elements with two accumulator inputs (emulation using bf16 datapath). The default behavior of the addmac_elem_16 intrinsic is the same as addmac_elem_16_accuracy_safe. For an explanation of how these operations work, see Multiply Accumulate.

Parameters

acc1	accumulator 1 input
acc2	accumulator 2 input
v1	Vector v1
v2	Vector v2

Returns: Elementwise multiply-accumulate result (acc1 + acc2 + v1 * v2)

◆ addmsc_4x8_8x4()

v16accfloat addmsc_4x8_8x4	(	v32float	v1,
		v32float	v2,
		v16accfloat	acc1,
		v16accfloat	acc2
	)

Matrix multiply-subtract of fp32 data elements with two accumulator inputs (emulation using bf16 datapath). The default behavior of addmsc_4x8_8x4 is the same as addmsc_4x8_8x4_accuracy_safe (slower, but more accurate). Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map the addmsc_4x8_8x4 intrinsic onto addmsc_4x8_8x4_accuracy_fast (better performance at the risk of a slight reduction in accuracy). Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map the addmsc_4x8_8x4 intrinsic onto addmsc_4x8_8x4_accuracy_low (best performance at the risk of accuracy loss). For an explanation of how these operations work, see Multiply Accumulate.

Parameters

v1	Vector v1
v2	Vector v2
acc1	Accumulator 1 input
acc2	Accumulator 2 input

Returns: Result of operation

◆ addmsc_4x8_8x4_accuracy_fast()

v16accfloat addmsc_4x8_8x4_accuracy_fast	(	v32float	v1,
		v32float	v2,
		v16accfloat	acc1,
		v16accfloat	acc2
	)

Matrix multiply-subtract of fp32 data elements with two accumulator inputs (emulation using bf16 datapath). Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map the addmsc_4x8_8x4 intrinsic onto addmsc_4x8_8x4_accuracy_fast.

Parameters

v1	Vector v1
v2	Vector v2
acc1	Accumulator 1 input
acc2	Accumulator 2 input

Returns: Result of operation

◆ addmsc_4x8_8x4_accuracy_low()

v16accfloat addmsc_4x8_8x4_accuracy_low	(	v32float	v1,
		v32float	v2,
		v16accfloat	acc1,
		v16accfloat	acc2
	)

Matrix multiply-subtract of fp32 data elements with two accumulator inputs (emulation using bf16 datapath). Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map the addmsc_4x8_8x4 intrinsic onto addmsc_4x8_8x4_accuracy_low.

Parameters

v1	Vector v1
v2	Vector v2
acc1	Accumulator 1 input
acc2	Accumulator 2 input

Returns: Result of operation

◆ addmsc_4x8_8x4_accuracy_safe()

v16accfloat addmsc_4x8_8x4_accuracy_safe	(	v32float	v1,
		v32float	v2,
		v16accfloat	acc1,
		v16accfloat	acc2
	)

Matrix multiply-subtract of fp32 data elements with two accumulator inputs (emulation using bf16 datapath). The default behavior of the addmsc_4x8_8x4 intrinsic is the same as addmsc_4x8_8x4_accuracy_safe.

Parameters

v1	Vector v1
v2	Vector v2
acc1	Accumulator 1 input
acc2	Accumulator 2 input

Returns: Result of operation

◆ addmsc_elem_16()

v16accfloat addmsc_elem_16	(	v16float	v1,
		v16float	v2,
		v16accfloat	acc1,
		v16accfloat	acc2
	)

Elementwise multiply-subtract of fp32 data elements with two accumulator inputs (emulation using bf16 datapath). The default behavior of the addmsc_elem_16 intrinsic is the same as addmsc_elem_16_accuracy_safe. Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map the addmsc_elem_16 intrinsic onto addmsc_elem_16_accuracy_fast. Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map the addmsc_elem_16 intrinsic onto addmsc_elem_16_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate.

Parameters

acc1	accumulator 1 input
acc2	accumulator 2 input
v1	Vector v1
v2	Vector v2

Returns: Elementwise multiply-subtract result; the product v1 * v2 is subtracted from the sum of the accumulators (acc1 + acc2 - v1 * v2)
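The return value above can be written lane by lane as out[i] = acc1[i] + acc2[i] - v1[i] * v2[i] for i = 0..15. The sketch below is a plain C++ reference model of that formula, using std::array<float, 16> as a hypothetical stand-in for both v16float and v16accfloat; it ignores the bf16 emulation and only shows what is computed, which makes it handy for checking kernel output on a host.

```cpp
#include <array>
#include <cstddef>

using lanes16 = std::array<float, 16>;

// Reference model of addmsc_elem_16: acc1 + acc2 - v1 * v2, per lane.
lanes16 ref_addmsc_elem_16(const lanes16& v1, const lanes16& v2,
                           const lanes16& acc1, const lanes16& acc2) {
    lanes16 out{};
    for (std::size_t i = 0; i < out.size(); ++i) {
        out[i] = acc1[i] + acc2[i] - v1[i] * v2[i];
    }
    return out;
}
```

With v1[i] = i, v2[i] = 2, acc1[i] = 10 and acc2[i] = 1, lane i of the result is 11 - 2i, so lane 0 is 11 and lane 15 is -19.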

◆ addmsc_elem_16_accuracy_fast()

v16accfloat addmsc_elem_16_accuracy_fast	(	v16float	v1,
		v16float	v2,
		v16accfloat	acc1,
		v16accfloat	acc2
	)

Elementwise multiply-subtract of fp32 data elements with two accumulator inputs (emulation using bf16 datapath). Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map the addmsc_elem_16 intrinsic onto addmsc_elem_16_accuracy_fast. For an explanation of how these operations work, see Multiply Accumulate.

Parameters

acc1	accumulator 1 input
acc2	accumulator 2 input
v1	Vector v1
v2	Vector v2

Returns: Elementwise multiply-subtract result; the product v1 * v2 is subtracted from the sum of the accumulators (acc1 + acc2 - v1 * v2)

◆ addmsc_elem_16_accuracy_low()

v16accfloat addmsc_elem_16_accuracy_low	(	v16float	v1,
		v16float	v2,
		v16accfloat	acc1,
		v16accfloat	acc2
	)

Elementwise multiply-subtract of fp32 data elements with two accumulator inputs (emulation using bf16 datapath). Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map the addmsc_elem_16 intrinsic onto addmsc_elem_16_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate.

Parameters

acc1	accumulator 1 input
acc2	accumulator 2 input
v1	Vector v1
v2	Vector v2

Returns: Elementwise multiply-subtract result; the product v1 * v2 is subtracted from the sum of the accumulators (acc1 + acc2 - v1 * v2)

◆ addmsc_elem_16_accuracy_safe()

v16accfloat addmsc_elem_16_accuracy_safe	(	v16float	v1,
		v16float	v2,
		v16accfloat	acc1,
		v16accfloat	acc2
	)

Elementwise multiply-subtract of fp32 data elements with two accumulator inputs (emulation using bf16 datapath). The default behavior of the addmsc_elem_16 intrinsic is the same as addmsc_elem_16_accuracy_safe. For an explanation of how these operations work, see Multiply Accumulate.

Parameters

acc1	accumulator 1 input
acc2	accumulator 2 input
v1	Vector v1
v2	Vector v2

Returns: Elementwise multiply-subtract result; the product v1 * v2 is subtracted from the sum of the accumulators (acc1 + acc2 - v1 * v2)

◆ mac_4x8_8x4()

v16accfloat mac_4x8_8x4	(	v32float	v1,
		v32float	v2,
		v16accfloat	acc
	)

Matrix multiply-accumulate of fp32 data elements (emulation using bf16 datapath). The default behavior of mac_4x8_8x4 is the same as mac_4x8_8x4_accuracy_safe (slower, but more accurate). Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map the mac_4x8_8x4 intrinsic onto mac_4x8_8x4_accuracy_fast (better performance at the risk of a slight reduction in accuracy). Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map the mac_4x8_8x4 intrinsic onto mac_4x8_8x4_accuracy_low (best performance at the risk of accuracy loss). For an explanation of how these operations work, see Multiply Accumulate.

Parameters

v1	Vector v1
v2	Vector v2
acc	acc input

Returns: Result of operation
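Judging from its name and signature, mac_4x8_8x4 treats v1 (32 values) as a 4x8 matrix and v2 (32 values) as an 8x4 matrix and adds their 4x4 product to the 16 accumulator lanes. The model below sketches that in plain C++, assuming row-major packing of v1, v2 and the result; the actual lane order is an assumption here, and ref_mac_4x8_8x4 is a hypothetical name that models the math only, not the bf16 emulation.

```cpp
#include <array>

using v32 = std::array<float, 32>;
using v16 = std::array<float, 16>;

// Reference model: out (4x4) = acc (4x4) + A (4x8) * B (8x4), all row-major (assumed).
v16 ref_mac_4x8_8x4(const v32& a, const v32& b, const v16& acc) {
    v16 out = acc;
    for (int i = 0; i < 4; ++i) {
        for (int j = 0; j < 4; ++j) {
            for (int k = 0; k < 8; ++k) {
                out[i * 4 + j] += a[i * 8 + k] * b[k * 4 + j];
            }
        }
    }
    return out;
}
```

For instance, with A[i][k] = i + k, every element of B equal to 1 and every accumulator lane equal to 0.5, row i of the result is 0.5 + 8i + 28 in every column: 28.5 for row 0 and 52.5 for row 3.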

◆ mac_4x8_8x4_accuracy_fast()

v16accfloat mac_4x8_8x4_accuracy_fast	(	v32float	v1,
		v32float	v2,
		v16accfloat	acc
	)

Matrix multiply-accumulate of fp32 data elements (emulation using bf16 datapath). Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map the mac_4x8_8x4 intrinsic onto mac_4x8_8x4_accuracy_fast.

Parameters

v1	Vector v1
v2	Vector v2
acc	acc input

Returns: Result of operation

◆ mac_4x8_8x4_accuracy_low()

v16accfloat mac_4x8_8x4_accuracy_low	(	v32float	v1,
		v32float	v2,
		v16accfloat	acc
	)

Matrix multiply-accumulate of fp32 data elements (emulation using bf16 datapath). Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map the mac_4x8_8x4 intrinsic onto mac_4x8_8x4_accuracy_low.

Parameters

v1	Vector v1
v2	Vector v2
acc	acc input

Returns: Result of operation

◆ mac_4x8_8x4_accuracy_safe()

v16accfloat mac_4x8_8x4_accuracy_safe	(	v32float	v1,
		v32float	v2,
		v16accfloat	acc
	)

Matrix multiply-accumulate of fp32 data elements (emulation using bf16 datapath). The default behavior of the mac_4x8_8x4 intrinsic is the same as mac_4x8_8x4_accuracy_safe.

Parameters

v1	Vector v1
v2	Vector v2
acc	acc input

Returns: Result of operation

◆ mac_elem_16()

v16accfloat mac_elem_16	(	v16float	v1,
		v16float	v2,
		v16accfloat	acc
	)

Elementwise multiply-accumulate of fp32 data elements (emulation using bf16 datapath). The default behavior of the mac_elem_16 intrinsic is the same as mac_elem_16_accuracy_safe. Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map the mac_elem_16 intrinsic onto mac_elem_16_accuracy_fast. Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map the mac_elem_16 intrinsic onto mac_elem_16_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate.

Parameters

acc	accumulator input
v1	Vector v1
v2	Vector v2

Returns: Elementwise multiply-accumulate result (acc + v1 * v2)
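Because the mapping is controlled by preprocessor flags, the variant is chosen at compile time for the whole translation unit rather than per call. The sketch below mimics that selection with a hypothetical function that only reports which variant the generic name would resolve to; it illustrates the #if pattern and does not reproduce the library's headers. Which flag takes precedence when both are defined is not documented here, so checking the LOW flag first is an arbitrary assumption.

```cpp
#include <string>

// Hypothetical stand-in: report which variant mac_elem_16 would map to.
std::string mac_elem_16_variant() {
#if defined(AIE_FP32_EMULATION_ACCURACY_LOW)
    return "mac_elem_16_accuracy_low";
#elif defined(AIE_FP32_EMULATION_ACCURACY_FAST)
    return "mac_elem_16_accuracy_fast";
#else
    return "mac_elem_16_accuracy_safe";  // default when no flag is defined
#endif
}
```

Such a flag is typically defined on the compiler command line or before the relevant header is included; the exact option for passing a preprocessor define depends on the toolchain.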

◆ mac_elem_16_accuracy_fast()

v16accfloat mac_elem_16_accuracy_fast	(	v16float	v1,
		v16float	v2,
		v16accfloat	acc
	)

Elementwise multiply-accumulate of fp32 data elements (emulation using bf16 datapath). Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map the mac_elem_16 intrinsic onto mac_elem_16_accuracy_fast. For an explanation of how these operations work, see Multiply Accumulate.

Parameters

acc	accumulator input
v1	Vector v1
v2	Vector v2

Returns: Elementwise multiply-accumulate result (acc + v1 * v2)

◆ mac_elem_16_accuracy_low()

v16accfloat mac_elem_16_accuracy_low	(	v16float	v1,
		v16float	v2,
		v16accfloat	acc
	)

Elementwise multiply-accumulate of fp32 data elements (emulation using bf16 datapath). Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map the mac_elem_16 intrinsic onto mac_elem_16_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate.

Parameters

acc	accumulator input
v1	Vector v1
v2	Vector v2

Returns: Elementwise multiply-accumulate result (acc + v1 * v2)

◆ mac_elem_16_accuracy_safe()

v16accfloat mac_elem_16_accuracy_safe	(	v16float	v1,
		v16float	v2,
		v16accfloat	acc
	)

Elementwise multiply-accumulate of fp32 data elements (emulation using bf16 datapath). The default behavior of the mac_elem_16 intrinsic is the same as mac_elem_16_accuracy_safe. For an explanation of how these operations work, see Multiply Accumulate.

Parameters

acc	accumulator input
v1	Vector v1
v2	Vector v2

Returns: Elementwise multiply-accumulate result (acc + v1 * v2)

◆ mac_elem_8() [1/3]

v8caccfloat mac_elem_8	(	v8cfloat	v1,
		v8cfloat	v2,
		v8caccfloat	acc
	)

Elementwise multiply-accumulate of cfloat data elements (emulation using bf16 datapath). The default behavior of the mac_elem_8 intrinsic is the same as mac_elem_8_accuracy_safe. Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map the mac_elem_8 intrinsic onto mac_elem_8_accuracy_fast. Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map the mac_elem_8 intrinsic onto mac_elem_8_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate.

Parameters

acc	accumulator input
v1	Vector v1
v2	Vector v2

Returns: Elementwise multiply-accumulate result (acc + v1 * v2)

◆ mac_elem_8() [2/3]

v8caccfloat mac_elem_8	(	v8cfloat	v1,
		v8float	v2,
		v8caccfloat	acc
	)

Elementwise multiply-accumulate of cfloat and fp32 data elements (emulation using bf16 datapath). The default behavior of the mac_elem_8 intrinsic is the same as mac_elem_8_accuracy_safe. Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map the mac_elem_8 intrinsic onto mac_elem_8_accuracy_fast. Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map the mac_elem_8 intrinsic onto mac_elem_8_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate.

Parameters

acc	accumulator input
v1	Vector v1
v2	Vector v2

Returns: Elementwise multiply-accumulate result (acc + v1 * v2)

◆ mac_elem_8() [3/3]

v8caccfloat mac_elem_8	(	v8float	v1,
		v8cfloat	v2,
		v8caccfloat	acc
	)

Elementwise multiply-accumulate of fp32 and cfloat data elements (emulation using bf16 datapath). The default behavior of the mac_elem_8 intrinsic is the same as mac_elem_8_accuracy_safe. Define the AIE_FP32_EMULATION_ACCURACY_FAST flag to map the mac_elem_8 intrinsic onto mac_elem_8_accuracy_fast. Define the AIE_FP32_EMULATION_ACCURACY_LOW flag to map the mac_elem_8 intrinsic onto mac_elem_8_accuracy_low. For an explanation of how these operations work, see Multiply Accumulate.

Parameters

acc	accumulator input
v1	Vector v1
v2	Vector v2

Returns: Elementwise multiply-accumulate result (acc + v1 * v2)
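The three mac_elem_8 overloads differ only in which operands are complex; per lane, each computes acc[i] + v1[i] * v2[i], using complex arithmetic where needed. The sketch below models all three with one function template, using std::complex<float> and std::array as hypothetical stand-ins for v8cfloat, v8float and v8caccfloat. It shows the arithmetic only, not the bf16 emulation.

```cpp
#include <array>
#include <complex>
#include <cstddef>

using cf  = std::complex<float>;
using v8c = std::array<cf, 8>;
using v8f = std::array<float, 8>;

// Per-lane multiply-accumulate; T1 and T2 may each be float or
// std::complex<float>, mirroring the (cfloat, cfloat), (cfloat, float)
// and (float, cfloat) overloads.
template <typename T1, typename T2>
v8c ref_mac_elem_8(const std::array<T1, 8>& v1, const std::array<T2, 8>& v2,
                   const v8c& acc) {
    v8c out{};
    for (std::size_t i = 0; i < 8; ++i) {
        out[i] = acc[i] + v1[i] * v2[i];
    }
    return out;
}
```

For example, with v1 lanes equal to 1 + 2i, v2 lanes equal to 3 + 4i and acc lanes equal to 1 + i, every lane is (1 + 2i)(3 + 4i) + (1 + i) = -4 + 11i; replacing v2 with the real value 2 gives 3 + 5i.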