MMX register (64-bit) instructions are omitted.
S1=SSE S2=SSE2 S3=SSE3 SS3=SSSE3 S4.1=SSE4.1 S4.2=SSE4.2 V1=AVX V2=AVX2 V5=AVX512 #=64-bit mode only
Instructions marked with * become scalar instructions (only the lowest element is calculated) when PS/PD/DQ is changed to SS/SD/SI.
C/C++ intrinsic names are written below each instruction in blue.
This document is intended to help you find the correct name of an instruction you are not sure of, so that you can look it up in the manuals. Refer to the manuals before coding.
Intel's manuals -> https://software.intel.com/en-us/articles/intel-sdm
If you find any errors or other problems, please post them via the feedback form or email me at the address at the bottom of this page.
Integer | Floating-Point | YMM lane (128-bit) | ||||||
---|---|---|---|---|---|---|---|---|
QWORD | DWORD | WORD | BYTE | Double | Single | Half | ||
?MM whole ?MM/mem |
MOVDQA (S2 _mm_load_si128 _mm_store_si128 MOVDQU (S2 _mm_loadu_si128 _mm_storeu_si128 |
MOVAPD (S2 _mm_load_pd _mm_loadr_pd _mm_store_pd _mm_storer_pd MOVUPD (S2 _mm_loadu_pd _mm_storeu_pd |
MOVAPS (S1 _mm_load_ps _mm_loadr_ps _mm_store_ps _mm_storer_ps MOVUPS (S1 _mm_loadu_ps _mm_storeu_ps |
|||||
VMOVDQA64 (V5... _mm_mask_load_epi64 _mm_mask_store_epi64 etc VMOVDQU64 (V5... _mm_mask_loadu_epi64 _mm_mask_storeu_epi64 etc |
VMOVDQA32 (V5... _mm_mask_load_epi32 _mm_mask_store_epi32 etc VMOVDQU32 (V5... _mm_mask_loadu_epi32 _mm_mask_storeu_epi32 etc |
VMOVDQU16 (V5+BW... _mm_mask_loadu_epi16 _mm_mask_storeu_epi16 etc |
VMOVDQU8 (V5+BW... _mm_mask_loadu_epi8 _mm_mask_storeu_epi8 etc |
|||||
XMM upper half mem |
MOVHPD (S2 _mm_loadh_pd _mm_storeh_pd |
MOVHPS (S1 _mm_loadh_pi _mm_storeh_pi |
||||||
XMM upper half XMM lower half |
MOVHLPS (S1 _mm_movehl_ps MOVLHPS (S1 _mm_movelh_ps |
|||||||
XMM lower half mem |
MOVQ (S2 _mm_loadl_epi64 _mm_storel_epi64 |
MOVLPD (S2 _mm_loadl_pd _mm_storel_pd |
MOVLPS (S1 _mm_loadl_pi _mm_storel_pi |
|||||
XMM lowest 1 elem r/m |
MOVQ (S2# _mm_cvtsi64_si128 _mm_cvtsi128_si64 |
MOVD (S2 _mm_cvtsi32_si128 _mm_cvtsi128_si32 |
VMOVW (V5+FP16 _mm_cvtsi16_si128 _mm_cvtsi128_si16 |
|||||
XMM lowest 1 elem XMM/mem |
MOVQ (S2 _mm_move_epi64 |
MOVSD (S2 _mm_load_sd _mm_store_sd _mm_move_sd |
MOVSS (S1 _mm_load_ss _mm_store_ss _mm_move_ss |
VMOVSH (V5+FP16 _mm_load_sh _mm_store_sh _mm_move_sh |
||||
XMM whole 1 elem |
TIP 2 _mm_set1_epi64x VPBROADCASTQ (V2 _mm_broadcastq_epi64 |
TIP 2 _mm_set1_epi32 VPBROADCASTD (V2 _mm_broadcastd_epi32 |
TIP 2 _mm_set1_epi16 VPBROADCASTW (V2 _mm_broadcastw_epi16 |
_mm_set1_epi8 VPBROADCASTB (V2 _mm_broadcastb_epi8 |
TIP 2 _mm_set1_pd _mm_load1_pd MOVDDUP (S3 _mm_movedup_pd _mm_loaddup_pd |
TIP 2 _mm_set1_ps _mm_load1_ps VBROADCASTSS from mem (V1 from XMM (V2 _mm_broadcast_ss |
||
YMM / ZMM whole 1 elem |
VPBROADCASTQ (V2 _mm256_broadcastq_epi64 |
VPBROADCASTD (V2 _mm256_broadcastd_epi32 |
VPBROADCASTW (V2 _mm256_broadcastw_epi16 |
VPBROADCASTB (V2 _mm256_broadcastb_epi8 |
VBROADCASTSD from mem (V1 from XMM (V2 _mm256_broadcast_sd |
VBROADCASTSS from mem (V1 from XMM (V2 _mm256_broadcast_ss |
VBROADCASTF128 (V1 _mm256_broadcast_ps _mm256_broadcast_pd VBROADCASTI128 (V2 _mm256_broadcastsi128_si256 |
|
YMM / ZMM whole 2/4/8 elems |
VBROADCASTI64X2 (V5+DQ... _mm512_broadcast_i64x2 VBROADCASTI64X4 (V5 _mm512_broadcast_i64x4 |
VBROADCASTI32X2 (V5+DQ... _mm512_broadcast_i32x2 VBROADCASTI32X4 (V5... _mm512_broadcast_i32x4 VBROADCASTI32X8 (V5+DQ _mm512_broadcast_i32x8 |
VBROADCASTF64X2 (V5+DQ... _mm512_broadcast_f64x2 VBROADCASTF64X4 (V5 _mm512_broadcast_f64x4 |
VBROADCASTF32X2 (V5+DQ... _mm512_broadcast_f32x2 VBROADCASTF32X4 (V5... _mm512_broadcast_f32x4 VBROADCASTF32X8 (V5+DQ _mm512_broadcast_f32x8 |
||||
?MM multiple elems |
_mm_set_epi64x _mm_setr_epi64x |
_mm_set_epi32 _mm_setr_epi32 |
_mm_set_epi16 _mm_setr_epi16 |
_mm_set_epi8 _mm_setr_epi8 |
_mm_set_pd _mm_setr_pd |
_mm_set_ps _mm_setr_ps |
||
?MM whole zero |
TIP 1 _mm_setzero_si128 |
TIP 1 _mm_setzero_pd |
TIP 1 _mm_setzero_ps |
|||||
extract |
PEXTRQ (S4.1# _mm_extract_epi64 |
PEXTRD (S4.1 _mm_extract_epi32 |
PEXTRW to r (S2 PEXTRW to r/m (S4.1 _mm_extract_epi16 |
PEXTRB (S4.1 _mm_extract_epi8 |
->MOVHPD (S2 _mm_loadh_pd _mm_storeh_pd ->MOVLPD (S2 _mm_loadl_pd _mm_storel_pd |
EXTRACTPS (S4.1 _mm_extract_ps |
VEXTRACTF128 (V1 _mm256_extractf128_ps _mm256_extractf128_pd _mm256_extractf128_si256 VEXTRACTI128 (V2 _mm256_extracti128_si256 |
|
VEXTRACTI64X2 (V5+DQ... _mm512_extracti64x2_epi64 VEXTRACTI64X4 (V5 _mm512_extracti64x4_epi64 |
VEXTRACTI32X4 (V5... _mm512_extracti32x4_epi32 VEXTRACTI32X8 (V5+DQ _mm512_extracti32x8_epi32 |
VEXTRACTF64X2 (V5+DQ... _mm512_extractf64x2_pd VEXTRACTF64X4 (V5 _mm512_extractf64x4_pd |
VEXTRACTF32X4 (V5... _mm512_extractf32x4_ps VEXTRACTF32X8 (V5+DQ _mm512_extractf32x8_ps |
|||||
insert |
PINSRQ (S4.1# _mm_insert_epi64 |
PINSRD (S4.1 _mm_insert_epi32 |
PINSRW (S2 _mm_insert_epi16 |
PINSRB (S4.1 _mm_insert_epi8 |
->MOVHPD (S2 _mm_loadh_pd _mm_storeh_pd ->MOVLPD (S2 _mm_loadl_pd _mm_storel_pd |
INSERTPS (S4.1 _mm_insert_ps |
VINSERTF128 (V1 _mm256_insertf128_ps _mm256_insertf128_pd _mm256_insertf128_si256 VINSERTI128 (V2 _mm256_inserti128_si256 |
|
VINSERTI64X2 (V5+DQ... _mm512_inserti64x2 VINSERTI64X4 (V5... _mm512_inserti64x4 |
VINSERTI32X4 (V5... _mm512_inserti32x4 VINSERTI32X8 (V5+DQ _mm512_inserti32x8 |
VINSERTF64X2 (V5+DQ... _mm512_insertf64x2 VINSERTF64X4 (V5 _mm512_insertf64x4 |
VINSERTF32X4 (V5... _mm512_insertf32x4 VINSERTF32X8 (V5+DQ _mm512_insertf32x8 |
|||||
unpack |
PUNPCKHQDQ (S2 _mm_unpackhi_epi64 PUNPCKLQDQ (S2 _mm_unpacklo_epi64 |
PUNPCKHDQ (S2 _mm_unpackhi_epi32 PUNPCKLDQ (S2 _mm_unpacklo_epi32 |
PUNPCKHWD (S2 _mm_unpackhi_epi16 PUNPCKLWD (S2 _mm_unpacklo_epi16 |
PUNPCKHBW (S2 _mm_unpackhi_epi8 PUNPCKLBW (S2 _mm_unpacklo_epi8 |
UNPCKHPD (S2 _mm_unpackhi_pd UNPCKLPD (S2 _mm_unpacklo_pd |
UNPCKHPS (S1 _mm_unpackhi_ps UNPCKLPS (S1 _mm_unpacklo_ps |
||
shuffle/permute |
VPERMQ (V2 _mm256_permute4x64_epi64 VPERMI2Q (V5... _mm_permutex2var_epi64 VPERMT2Q (V5... _mm_permutex2var_epi64 |
PSHUFD (S2 _mm_shuffle_epi32 VPERMD (V2 _mm256_permutevar8x32_epi32 _mm256_permutexvar_epi32 VPERMI2D (V5... _mm_permutex2var_epi32 VPERMT2D (V5... _mm_permutex2var_epi32 |
PSHUFHW (S2 _mm_shufflehi_epi16 PSHUFLW (S2 _mm_shufflelo_epi16 VPERMW (V5+BW... _mm_permutexvar_epi16 VPERMI2W (V5+BW... _mm_permutex2var_epi16 VPERMT2W (V5+BW... _mm_permutex2var_epi16 |
PSHUFB (SS3 _mm_shuffle_epi8 VPERMB (V5+VBMI... _mm_permutexvar_epi8 VPERMI2B (V5+VBMI... _mm_permutex2var_epi8 VPERMT2B (V5+VBMI... _mm_permutex2var_epi8 |
SHUFPD (S2 _mm_shuffle_pd VPERMILPD (V1 _mm_permute_pd _mm_permutevar_pd VPERMPD (V2 _mm256_permute4x64_pd VPERMI2PD (V5... _mm_permutex2var_pd VPERMT2PD (V5... _mm_permutex2var_pd |
SHUFPS (S1 _mm_shuffle_ps VPERMILPS (V1 _mm_permute_ps _mm_permutevar_ps VPERMPS (V2 _mm256_permutevar8x32_ps VPERMI2PS (V5... _mm_permutex2var_ps VPERMT2PS (V5... _mm_permutex2var_ps |
VPERM2F128 (V1 _mm256_permute2f128_ps _mm256_permute2f128_pd _mm256_permute2f128_si256 VPERM2I128 (V2 _mm256_permute2x128_si256 |
|
VSHUFI64X2 (V5... _mm512_shuffle_i64x2 |
VSHUFI32X4 (V5... _mm512_shuffle_i32x4 |
VSHUFF64X2 (V5... _mm512_shuffle_f64x2 |
VSHUFF32X4 (V5... _mm512_shuffle_f32x4 |
|||||
blend |
VPBLENDMQ (V5... _mm_mask_blend_epi64 |
VPBLENDD (V2 _mm_blend_epi32 VPBLENDMD (V5... _mm_mask_blend_epi32 |
PBLENDW (S4.1 _mm_blend_epi16 VPBLENDMW (V5+BW... _mm_mask_blend_epi16 |
PBLENDVB (S4.1 _mm_blendv_epi8 VPBLENDMB (V5+BW... _mm_mask_blend_epi8 |
BLENDPD (S4.1 _mm_blend_pd BLENDVPD (S4.1 _mm_blendv_pd VBLENDMPD (V5... _mm_mask_blend_pd |
BLENDPS (S4.1 _mm_blend_ps BLENDVPS (S4.1 _mm_blendv_ps VBLENDMPS (V5... _mm_mask_blend_ps |
||
move and duplicate |
MOVDDUP (S3 _mm_movedup_pd _mm_loaddup_pd |
MOVSHDUP (S3 _mm_movehdup_ps MOVSLDUP (S3 _mm_moveldup_ps |
||||||
mask move |
VPMASKMOVQ (V2 _mm_maskload_epi64 _mm_maskstore_epi64 |
VPMASKMOVD (V2 _mm_maskload_epi32 _mm_maskstore_epi32 |
VMASKMOVPD (V1 _mm_maskload_pd _mm_maskstore_pd |
VMASKMOVPS (V1 _mm_maskload_ps _mm_maskstore_ps |
||||
extract highest bit |
PMOVMSKB (S2 _mm_movemask_epi8 |
MOVMSKPD (S2 _mm_movemask_pd |
MOVMSKPS (S1 _mm_movemask_ps |
|||||
VPMOVQ2M (V5+DQ... _mm_movepi64_mask |
VPMOVD2M (V5+DQ... _mm_movepi32_mask |
VPMOVW2M (V5+BW... _mm_movepi16_mask |
VPMOVB2M (V5+BW... _mm_movepi8_mask |
|||||
gather |
VPGATHERDQ (V2 _mm_i32gather_epi64 _mm_mask_i32gather_epi64 VPGATHERQQ (V2 _mm_i64gather_epi64 _mm_mask_i64gather_epi64 |
VPGATHERDD (V2 _mm_i32gather_epi32 _mm_mask_i32gather_epi32 VPGATHERQD (V2 _mm_i64gather_epi32 _mm_mask_i64gather_epi32 |
VGATHERDPD (V2 _mm_i32gather_pd _mm_mask_i32gather_pd VGATHERQPD (V2 _mm_i64gather_pd _mm_mask_i64gather_pd |
VGATHERDPS (V2 _mm_i32gather_ps _mm_mask_i32gather_ps VGATHERQPS (V2 _mm_i64gather_ps _mm_mask_i64gather_ps |
||||
scatter |
VPSCATTERDQ (V5... _mm_i32scatter_epi64 _mm_mask_i32scatter_epi64 VPSCATTERQQ (V5... _mm_i64scatter_epi64 _mm_mask_i64scatter_epi64 |
VPSCATTERDD (V5... _mm_i32scatter_epi32 _mm_mask_i32scatter_epi32 VPSCATTERQD (V5... _mm_i64scatter_epi32 _mm_mask_i64scatter_epi32 |
VSCATTERDPD (V5... _mm_i32scatter_pd _mm_mask_i32scatter_pd VSCATTERQPD (V5... _mm_i64scatter_pd _mm_mask_i64scatter_pd |
VSCATTERDPS (V5... _mm_i32scatter_ps _mm_mask_i32scatter_ps VSCATTERQPS (V5... _mm_i64scatter_ps _mm_mask_i64scatter_ps |
||||
compress |
VPCOMPRESSQ (V5... _mm_mask_compress_epi64 _mm_mask_compressstoreu_epi64 |
VPCOMPRESSD (V5... _mm_mask_compress_epi32 _mm_mask_compressstoreu_epi32 |
VPCOMPRESSW (V5+VBMI2... _mm_mask_compress_epi16 _mm_mask_compressstoreu_epi16 |
VPCOMPRESSB (V5+VBMI2... _mm_mask_compress_epi8 _mm_mask_compressstoreu_epi8 |
VCOMPRESSPD (V5... _mm_mask_compress_pd _mm_mask_compressstoreu_pd |
VCOMPRESSPS (V5... _mm_mask_compress_ps _mm_mask_compressstoreu_ps |
||
expand |
VPEXPANDQ (V5... _mm_mask_expand_epi64 _mm_mask_expandloadu_epi64 |
VPEXPANDD (V5... _mm_mask_expand_epi32 _mm_mask_expandloadu_epi32 |
VPEXPANDW (V5+VBMI2... _mm_mask_expand_epi16 _mm_mask_expandloadu_epi16 |
VPEXPANDB (V5+VBMI2... _mm_mask_expand_epi8 _mm_mask_expandloadu_epi8 |
VEXPANDPD (V5... _mm_mask_expand_pd _mm_mask_expandloadu_pd |
VEXPANDPS (V5... _mm_mask_expand_ps _mm_mask_expandloadu_ps |
||
align right |
VALIGNQ (V5... _mm_alignr_epi64 |
VALIGND (V5... _mm_alignr_epi32 |
PALIGNR (SS3 _mm_alignr_epi8 |
|||||
expand Opmask bits |
VPMOVM2Q (V5+DQ... _mm_movm_epi64 |
VPMOVM2D (V5+DQ... _mm_movm_epi32 |
VPMOVM2W (V5+BW... _mm_movm_epi16 |
VPMOVM2B (V5+BW... _mm_movm_epi8 |
from \ to | Integer | Floating-Point | ||||||
---|---|---|---|---|---|---|---|---|
QWORD | DWORD | WORD | BYTE | Double | Single | Half | ||
Integer | QWORD |
VPMOVQD (V5... _mm_cvtepi64_epi32 VPMOVSQD (V5... _mm_cvtsepi64_epi32 VPMOVUSQD (V5... _mm_cvtusepi64_epi32 |
VPMOVQW (V5... _mm_cvtepi64_epi16 VPMOVSQW (V5... _mm_cvtsepi64_epi16 VPMOVUSQW (V5... _mm_cvtusepi64_epi16 |
VPMOVQB (V5... _mm_cvtepi64_epi8 VPMOVSQB (V5... _mm_cvtsepi64_epi8 VPMOVUSQB (V5... _mm_cvtusepi64_epi8 |
CVTSI2SD (S2# scalar only _mm_cvtsi64_sd VCVTQQ2PD* (V5+DQ... _mm_cvtepi64_pd VCVTUQQ2PD* (V5+DQ... _mm_cvtepu64_pd |
CVTSI2SS (S1# scalar only _mm_cvtsi64_ss VCVTQQ2PS* (V5+DQ... _mm_cvtepi64_ps VCVTUQQ2PS* (V5+DQ... _mm_cvtepu64_ps |
VCVTQQ2PH* (V5+FP16... _mm_cvtepi64_ph VCVTUQQ2PH* (V5+FP16... _mm_cvtepu64_ph |
|
DWORD |
TIP 3 PMOVSXDQ (S4.1 _mm_cvtepi32_epi64 PMOVZXDQ (S4.1 _mm_cvtepu32_epi64 |
PACKSSDW (S2 _mm_packs_epi32 PACKUSDW (S4.1 _mm_packus_epi32 VPMOVDW (V5... _mm_cvtepi32_epi16 VPMOVSDW (V5... _mm_cvtsepi32_epi16 VPMOVUSDW (V5... _mm_cvtusepi32_epi16 |
VPMOVDB (V5... _mm_cvtepi32_epi8 VPMOVSDB (V5... _mm_cvtsepi32_epi8 VPMOVUSDB (V5... _mm_cvtusepi32_epi8 |
CVTDQ2PD* (S2 _mm_cvtepi32_pd VCVTUDQ2PD* (V5... _mm_cvtepu32_pd |
CVTDQ2PS* (S2 _mm_cvtepi32_ps VCVTUDQ2PS* (V5... _mm_cvtepu32_ps |
VCVTDQ2PH* (V5+FP16... _mm_cvtepi32_ph VCVTUDQ2PH* (V5+FP16... _mm_cvtepu32_ph |
||
WORD |
PMOVSXWQ (S4.1 _mm_cvtepi16_epi64 PMOVZXWQ (S4.1 _mm_cvtepu16_epi64 |
TIP 3 PMOVSXWD (S4.1 _mm_cvtepi16_epi32 PMOVZXWD (S4.1 _mm_cvtepu16_epi32 |
PACKSSWB (S2 _mm_packs_epi16 PACKUSWB (S2 _mm_packus_epi16 VPMOVWB (V5+BW... _mm_cvtepi16_epi8 VPMOVSWB (V5+BW... _mm_cvtsepi16_epi8 VPMOVUSWB (V5+BW... _mm_cvtusepi16_epi8 |
VCVTW2PH* (V5+FP16... _mm_cvtepi16_ph VCVTUW2PH* (V5+FP16... _mm_cvtepu16_ph |
||||
BYTE |
PMOVSXBQ (S4.1 _mm_cvtepi8_epi64 PMOVZXBQ (S4.1 _mm_cvtepu8_epi64 |
PMOVSXBD (S4.1 _mm_cvtepi8_epi32 PMOVZXBD (S4.1 _mm_cvtepu8_epi32 |
TIP 3 PMOVSXBW (S4.1 _mm_cvtepi8_epi16 PMOVZXBW (S4.1 _mm_cvtepu8_epi16 |
|||||
Floating-Point | Double |
CVTSD2SI /
CVTTSD2SI (S2# scalar only _mm_cvtsd_si64 / _mm_cvttsd_si64 VCVTPD2QQ* / VCVTTPD2QQ* (V5+DQ... _mm_cvtpd_epi64 / _mm_cvttpd_epi64 VCVTPD2UQQ* / VCVTTPD2UQQ* (V5+DQ... _mm_cvtpd_epu64 / _mm_cvttpd_epu64 right ones are with truncation |
CVTPD2DQ* /
CVTTPD2DQ* (S2 _mm_cvtpd_epi32 / _mm_cvttpd_epi32 VCVTPD2UDQ* / VCVTTPD2UDQ* (V5... _mm_cvtpd_epu32 / _mm_cvttpd_epu32 right ones are with truncation |
CVTPD2PS* (S2 _mm_cvtpd_ps |
VCVTPD2PH* (V5+FP16... _mm_cvtpd_ph |
|||
Single |
CVTSS2SI /
CVTTSS2SI (S1# scalar only _mm_cvtss_si64 / _mm_cvttss_si64 VCVTPS2QQ* / VCVTTPS2QQ* (V5+DQ... _mm_cvtps_epi64 / _mm_cvttps_epi64 VCVTPS2UQQ* / VCVTTPS2UQQ* (V5+DQ... _mm_cvtps_epu64 / _mm_cvttps_epu64 right ones are with truncation |
CVTPS2DQ* /
CVTTPS2DQ* (S2 _mm_cvtps_epi32 / _mm_cvttps_epi32 VCVTPS2UDQ* / VCVTTPS2UDQ* (V5... _mm_cvtps_epu32 / _mm_cvttps_epu32 right ones are with truncation |
CVTPS2PD* (S2 _mm_cvtps_pd |
VCVTPS2PH (F16C _mm_cvtps_ph VCVTPS2PHX* (V5+FP16... _mm_cvtxps_ph |
||||
Half |
VCVTPH2QQ* /
VCVTTPH2QQ* (V5+FP16... _mm_cvtph_epi64 / _mm_cvttph_epi64 VCVTPH2UQQ* / VCVTTPH2UQQ*(V5+FP16... _mm_cvtph_epu64 / _mm_cvttph_epu64 right ones are with truncation |
VCVTPH2DQ* /
VCVTTPH2DQ* (V5+FP16... _mm_cvtph_epi32 / _mm_cvttph_epi32 VCVTPH2UDQ* / VCVTTPH2UDQ* (V5+FP16... _mm_cvtph_epu32 / _mm_cvttph_epu32 right ones are with truncation |
VCVTPH2W* /
VCVTTPH2W*(V5+FP16... _mm_cvtph_epi16 / _mm_cvttph_epi16 VCVTPH2UW* / VCVTTPH2UW*(V5+FP16... _mm_cvtph_epu16 / _mm_cvttph_epu16 right ones are with truncation |
VCVTPH2PD* (V5+FP16... _mm_cvtph_pd |
VCVTPH2PS (F16C _mm_cvtph_ps VCVTPH2PSX* (V5+FP16... _mm_cvtxph_ps |
Integer | Floating-Point | ||||||
---|---|---|---|---|---|---|---|
QWORD | DWORD | WORD | BYTE | Double | Single | Half | |
add |
PADDQ (S2 _mm_add_epi64 |
PADDD (S2 _mm_add_epi32 |
PADDW (S2 _mm_add_epi16 PADDSW (S2 _mm_adds_epi16 PADDUSW (S2 _mm_adds_epu16 |
PADDB (S2 _mm_add_epi8 PADDSB (S2 _mm_adds_epi8 PADDUSB (S2 _mm_adds_epu8 |
ADDPD* (S2 _mm_add_pd |
ADDPS* (S1 _mm_add_ps |
VADDPH* (V5+FP16... _mm_add_ph |
sub |
PSUBQ (S2 _mm_sub_epi64 |
PSUBD (S2 _mm_sub_epi32 |
PSUBW (S2 _mm_sub_epi16 PSUBSW (S2 _mm_subs_epi16 PSUBUSW (S2 _mm_subs_epu16 |
PSUBB (S2 _mm_sub_epi8 PSUBSB (S2 _mm_subs_epi8 PSUBUSB (S2 _mm_subs_epu8 |
SUBPD* (S2 _mm_sub_pd |
SUBPS* (S1 _mm_sub_ps |
VSUBPH* (V5+FP16... _mm_sub_ph |
mul |
VPMULLQ (V5+DQ... _mm_mullo_epi64 |
PMULDQ (S4.1 _mm_mul_epi32 PMULUDQ (S2 _mm_mul_epu32 PMULLD (S4.1 _mm_mullo_epi32 |
PMULHW (S2 _mm_mulhi_epi16 PMULHUW (S2 _mm_mulhi_epu16 PMULLW (S2 _mm_mullo_epi16 |
MULPD* (S2 _mm_mul_pd |
MULPS* (S1 _mm_mul_ps |
VMULPH* (V5+FP16... _mm_mul_ph |
|
div |
DIVPD* (S2 _mm_div_pd |
DIVPS* (S1 _mm_div_ps |
VDIVPH* (V5+FP16... _mm_div_ph |
||||
reciprocal |
VRCP14PD* (V5... _mm_rcp14_pd |
RCPPS* (S1 _mm_rcp_ps VRCP14PS* (V5... _mm_rcp14_ps |
VRCPPH* (V5+FP16... _mm_rcp_ph |
||||
square root |
SQRTPD* (S2 _mm_sqrt_pd |
SQRTPS* (S1 _mm_sqrt_ps |
VSQRTPH* (V5+FP16... _mm_sqrt_ph |
||||
reciprocal of square root |
VRSQRT14PD* (V5... _mm_rsqrt14_pd |
RSQRTPS* (S1 _mm_rsqrt_ps VRSQRT14PS* (V5... _mm_rsqrt14_ps |
VRSQRTPH* (V5+FP16... _mm_rsqrt_ph |
||||
multiply nth power of 2 |
VSCALEFPD* (V5... _mm_scalef_pd |
VSCALEFPS* (V5... _mm_scalef_ps |
VSCALEFPH* (V5+FP16... _mm_scalef_ph |
||||
max |
TIP 8 VPMAXSQ (V5... _mm_max_epi64 VPMAXUQ (V5... _mm_max_epu64 |
TIP 8 PMAXSD (S4.1 _mm_max_epi32 PMAXUD (S4.1 _mm_max_epu32 |
PMAXSW (S2 _mm_max_epi16 PMAXUW (S4.1 _mm_max_epu16 |
TIP 8 PMAXSB (S4.1 _mm_max_epi8 PMAXUB (S2 _mm_max_epu8 |
TIP 8 MAXPD* (S2 _mm_max_pd |
TIP 8 MAXPS* (S1 _mm_max_ps |
VMAXPH* (V5+FP16... _mm_max_ph |
min |
TIP 8 VPMINSQ (V5... _mm_min_epi64 VPMINUQ (V5... _mm_min_epu64 |
TIP 8 PMINSD (S4.1 _mm_min_epi32 PMINUD (S4.1 _mm_min_epu32 |
PMINSW (S2 _mm_min_epi16 PMINUW (S4.1 _mm_min_epu16 |
TIP 8 PMINSB (S4.1 _mm_min_epi8 PMINUB (S2 _mm_min_epu8 |
TIP 8 MINPD* (S2 _mm_min_pd |
TIP 8 MINPS* (S1 _mm_min_ps |
VMINPH* (V5+FP16... _mm_min_ph |
average |
PAVGW (S2 _mm_avg_epu16 |
PAVGB (S2 _mm_avg_epu8 |
|||||
absolute |
TIP 4 VPABSQ (V5... _mm_abs_epi64 |
TIP 4 PABSD (SS3 _mm_abs_epi32 |
TIP 4 PABSW (SS3 _mm_abs_epi16 |
TIP 4 PABSB (SS3 _mm_abs_epi8 |
TIP 5 | TIP 5 | |
sign operation |
PSIGND (SS3 _mm_sign_epi32 |
PSIGNW (SS3 _mm_sign_epi16 |
PSIGNB (SS3 _mm_sign_epi8 |
||||
population count |
VPOPCNTQ (V5+VPOPCNTDQ... _mm_popcnt_epi64 |
VPOPCNTD (V5+VPOPCNTDQ... _mm_popcnt_epi32 |
VPOPCNTW (V5+BITALG... _mm_popcnt_epi16 |
VPOPCNTB (V5+BITALG... _mm_popcnt_epi8 |
|||
round |
ROUNDPD* (S4.1 _mm_round_pd _mm_floor_pd _mm_ceil_pd VRNDSCALEPD* (V5... _mm_roundscale_pd |
ROUNDPS* (S4.1 _mm_round_ps _mm_floor_ps _mm_ceil_ps VRNDSCALEPS* (V5... _mm_roundscale_ps |
VRNDSCALEPH* (V5+FP16... _mm_roundscale_ph |
||||
difference from rounded value |
VREDUCEPD* (V5+DQ... _mm_reduce_pd |
VREDUCEPS* (V5+DQ... _mm_reduce_ps |
VREDUCEPH* (V5+FP16... _mm_reduce_ph |
||||
add / sub |
ADDSUBPD (S3 _mm_addsub_pd |
ADDSUBPS (S3 _mm_addsub_ps |
|||||
horizontal add |
PHADDD (SS3 _mm_hadd_epi32 |
PHADDW (SS3 _mm_hadd_epi16 PHADDSW (SS3 _mm_hadds_epi16 |
HADDPD (S3 _mm_hadd_pd |
HADDPS (S3 _mm_hadd_ps |
|||
horizontal sub |
PHSUBD (SS3 _mm_hsub_epi32 |
PHSUBW (SS3 _mm_hsub_epi16 PHSUBSW (SS3 _mm_hsubs_epi16 |
HSUBPD (S3 _mm_hsub_pd |
HSUBPS (S3 _mm_hsub_ps |
|||
dot product / multiply and add |
PMADDWD (S2 _mm_madd_epi16 VPDPWSSD (AVX_VNNI _mm_dpwssd_avx_epi32 VPDPWSSD (V5+VNNI... _mm_dpwssd_epi32 VPDPWSSDS (AVX_VNNI _mm_dpwssds_avx_epi32 VPDPWSSDS (V5+VNNI... _mm_dpwssds_epi32 |
PMADDUBSW (SS3 _mm_maddubs_epi16 VPDPBUSD (AVX_VNNI _mm_dpbusd_avx_epi32 VPDPBUSD (V5+VNNI... _mm_dpbusd_epi32 VPDPBUSDS (AVX_VNNI _mm_dpbusds_avx_epi32 VPDPBUSDS (V5+VNNI... _mm_dpbusds_epi32 |
DPPD (S4.1 _mm_dp_pd |
DPPS (S4.1 _mm_dp_ps |
|||
fused multiply and add / sub |
VFMADDxxxPD*
(FMA _mm_fmadd_pd VFMSUBxxxPD* (FMA _mm_fmsub_pd VFMADDSUBxxxPD (FMA _mm_fmaddsub_pd VFMSUBADDxxxPD (FMA _mm_fmsubadd_pd VFNMADDxxxPD* (FMA _mm_fnmadd_pd VFNMSUBxxxPD* (FMA _mm_fnmsub_pd xxx=132/213/231 |
VFMADDxxxPS*
(FMA _mm_fmadd_ps VFMSUBxxxPS* (FMA _mm_fmsub_ps VFMADDSUBxxxPS (FMA _mm_fmaddsub_ps VFMSUBADDxxxPS (FMA _mm_fmsubadd_ps VFNMADDxxxPS* (FMA _mm_fnmadd_ps VFNMSUBxxxPS* (FMA _mm_fnmsub_ps xxx=132/213/231 |
VFMADDxxxPH*
(V5+FP16... _mm_fmadd_ph VFMSUBxxxPH* (V5+FP16... _mm_fmsub_ph VFMADDSUBxxxPH (V5+FP16... _mm_fmaddsub_ph VFMSUBADDxxxPH (V5+FP16... _mm_fmsubadd_ph VFNMADDxxxPH* (V5+FP16... _mm_fnmadd_ph VFNMSUBxxxPH* (V5+FP16... _mm_fnmsub_ph xxx=132/213/231 |
Integer | ||||
---|---|---|---|---|
QWORD | DWORD | WORD | BYTE | |
compare for == |
PCMPEQQ (S4.1 _mm_cmpeq_epi64 _mm_cmpeq_epi64_mask (V5... VPCMPUQ (0) (V5... _mm_cmpeq_epu64_mask |
PCMPEQD (S2 _mm_cmpeq_epi32 _mm_cmpeq_epi32_mask (V5... VPCMPUD (0) (V5... _mm_cmpeq_epu32_mask |
PCMPEQW (S2 _mm_cmpeq_epi16 _mm_cmpeq_epi16_mask (V5+BW... VPCMPUW (0) (V5+BW... _mm_cmpeq_epu16_mask |
PCMPEQB (S2 _mm_cmpeq_epi8 _mm_cmpeq_epi8_mask (V5+BW... VPCMPUB (0) (V5+BW... _mm_cmpeq_epu8_mask |
compare for < |
VPCMPQ (1) (V5... _mm_cmplt_epi64_mask VPCMPUQ (1) (V5... _mm_cmplt_epu64_mask |
VPCMPD (1) (V5... _mm_cmplt_epi32_mask VPCMPUD (1) (V5... _mm_cmplt_epu32_mask |
VPCMPW (1) (V5+BW... _mm_cmplt_epi16_mask VPCMPUW (1) (V5+BW... _mm_cmplt_epu16_mask |
VPCMPB (1) (V5+BW... _mm_cmplt_epi8_mask VPCMPUB (1) (V5+BW... _mm_cmplt_epu8_mask |
compare for <= |
VPCMPQ (2) (V5... _mm_cmple_epi64_mask VPCMPUQ (2) (V5... _mm_cmple_epu64_mask |
VPCMPD (2) (V5... _mm_cmple_epi32_mask VPCMPUD (2) (V5... _mm_cmple_epu32_mask |
VPCMPW (2) (V5+BW... _mm_cmple_epi16_mask VPCMPUW (2) (V5+BW... _mm_cmple_epu16_mask |
VPCMPB (2) (V5+BW... _mm_cmple_epi8_mask VPCMPUB (2) (V5+BW... _mm_cmple_epu8_mask |
compare for > |
PCMPGTQ (S4.2 _mm_cmpgt_epi64 VPCMPQ (6) (V5... _mm_cmpgt_epi64_mask VPCMPUQ (6) (V5... _mm_cmpgt_epu64_mask |
PCMPGTD (S2 _mm_cmpgt_epi32 VPCMPD (6) (V5... _mm_cmpgt_epi32_mask VPCMPUD (6) (V5... _mm_cmpgt_epu32_mask |
PCMPGTW (S2 _mm_cmpgt_epi16 VPCMPW (6) (V5+BW... _mm_cmpgt_epi16_mask VPCMPUW (6) (V5+BW... _mm_cmpgt_epu16_mask |
PCMPGTB (S2 _mm_cmpgt_epi8 VPCMPB (6) (V5+BW... _mm_cmpgt_epi8_mask VPCMPUB (6) (V5+BW... _mm_cmpgt_epu8_mask |
compare for >= |
VPCMPQ (5) (V5... _mm_cmpge_epi64_mask VPCMPUQ (5) (V5... _mm_cmpge_epu64_mask |
VPCMPD (5) (V5... _mm_cmpge_epi32_mask VPCMPUD (5) (V5... _mm_cmpge_epu32_mask |
VPCMPW (5) (V5+BW... _mm_cmpge_epi16_mask VPCMPUW (5) (V5+BW... _mm_cmpge_epu16_mask |
VPCMPB (5) (V5+BW... _mm_cmpge_epi8_mask VPCMPUB (5) (V5+BW... _mm_cmpge_epu8_mask |
compare for != |
VPCMPQ (4) (V5... _mm_cmpneq_epi64_mask VPCMPUQ (4) (V5... _mm_cmpneq_epu64_mask |
VPCMPD (4) (V5... _mm_cmpneq_epi32_mask VPCMPUD (4) (V5... _mm_cmpneq_epu32_mask |
VPCMPW (4) (V5+BW... _mm_cmpneq_epi16_mask VPCMPUW (4) (V5+BW... _mm_cmpneq_epu16_mask |
VPCMPB (4) (V5+BW... _mm_cmpneq_epi8_mask VPCMPUB (4) (V5+BW... _mm_cmpneq_epu8_mask |
Floating-Point | |||||||||
---|---|---|---|---|---|---|---|---|---|
Double | Single | Half | |||||||
when either (or both) is NaN | condition unmet | condition met | condition unmet | condition met |
Exception on QNaN | YES | NO | YES | NO | YES | NO | YES | NO | |
compare for == |
VCMPEQ_OSPD* (V1 _mm_cmp_pd |
CMPEQPD* (S2 _mm_cmpeq_pd |
VCMPEQ_USPD* (V1 _mm_cmp_pd |
VCMPEQ_UQPD* (V1 _mm_cmp_pd |
VCMPEQ_OSPS* (V1 _mm_cmp_ps |
CMPEQPS* (S1 _mm_cmpeq_ps |
VCMPEQ_USPS* (V1 _mm_cmp_ps |
VCMPEQ_UQPS* (V1 _mm_cmp_ps |
VCMPPH* (V5+FP16... _mm_cmp_ph |
compare for < |
CMPLTPD* (S2 _mm_cmplt_pd |
VCMPLT_OQPD* (V1 _mm_cmp_pd |
CMPLTPS* (S1 _mm_cmplt_ps |
VCMPLT_OQPS* (V1 _mm_cmp_ps |
|||||
compare for <= |
CMPLEPD* (S2 _mm_cmple_pd |
VCMPLE_OQPD* (V1 _mm_cmp_pd |
CMPLEPS* (S1 _mm_cmple_ps |
VCMPLE_OQPS* (V1 _mm_cmp_ps |
|||||
compare for > |
VCMPGTPD* (V1 _mm_cmpgt_pd (S2 |
VCMPGT_OQPD* (V1 _mm_cmp_pd |
VCMPGTPS* (V1 _mm_cmpgt_ps (S1 |
VCMPGT_OQPS* (V1 _mm_cmp_ps |
|||||
compare for >= |
VCMPGEPD* (V1 _mm_cmpge_pd (S2 |
VCMPGE_OQPD* (V1 _mm_cmp_pd |
VCMPGEPS* (V1 _mm_cmpge_ps (S1 |
VCMPGE_OQPS* (V1 _mm_cmp_ps |
|||||
compare for != |
VCMPNEQ_OSPD* (V1 _mm_cmp_pd |
VCMPNEQ_OQPD* (V1 _mm_cmp_pd |
VCMPNEQ_USPD* (V1 _mm_cmp_pd |
CMPNEQPD* (S2 _mm_cmpneq_pd |
VCMPNEQ_OSPS* (V1 _mm_cmp_ps |
VCMPNEQ_OQPS* (V1 _mm_cmp_ps |
VCMPNEQ_USPS* (V1 _mm_cmp_ps |
CMPNEQPS* (S1 _mm_cmpneq_ps |
|
compare for !< |
CMPNLTPD* (S2 _mm_cmpnlt_pd |
VCMPNLT_UQPD* (V1 _mm_cmp_pd |
CMPNLTPS* (S1 _mm_cmpnlt_ps |
VCMPNLT_UQPS* (V1 _mm_cmp_ps |
|||||
compare for !<= |
CMPNLEPD* (S2 _mm_cmpnle_pd |
VCMPNLE_UQPD* (V1 _mm_cmp_pd |
CMPNLEPS* (S1 _mm_cmpnle_ps |
VCMPNLE_UQPS* (V1 _mm_cmp_ps |
|||||
compare for !> |
VCMPNGTPD* (V1 _mm_cmpngt_pd (S2 |
VCMPNGT_UQPD* (V1 _mm_cmp_pd |
VCMPNGTPS* (V1 _mm_cmpngt_ps (S1 |
VCMPNGT_UQPS* (V1 _mm_cmp_ps |
|||||
compare for !>= |
VCMPNGEPD* (V1 _mm_cmpnge_pd (S2 |
VCMPNGE_UQPD* (V1 _mm_cmp_pd |
VCMPNGEPS* (V1 _mm_cmpnge_ps (S1 |
VCMPNGE_UQPS* (V1 _mm_cmp_ps |
|||||
compare for ordered |
VCMPORD_SPD* (V1 _mm_cmp_pd |
CMPORDPD* (S2 _mm_cmpord_pd |
VCMPORD_SPS* (V1 _mm_cmp_ps |
CMPORDPS* (S1 _mm_cmpord_ps |
|||||
compare for unordered |
VCMPUNORD_SPD* (V1 _mm_cmp_pd |
CMPUNORDPD* (S2 _mm_cmpunord_pd |
VCMPUNORD_SPS* (V1 _mm_cmp_ps |
CMPUNORDPS* (S1 _mm_cmpunord_ps |
|||||
TRUE |
VCMPTRUE_USPD* (V1 _mm_cmp_pd |
VCMPTRUEPD* (V1 _mm_cmp_pd |
VCMPTRUE_USPS* (V1 _mm_cmp_ps |
VCMPTRUEPS* (V1 _mm_cmp_ps |
|||||
FALSE |
VCMPFALSE_OSPD* (V1 _mm_cmp_pd |
VCMPFALSEPD* (V1 _mm_cmp_pd |
VCMPFALSE_OSPS* (V1 _mm_cmp_ps |
VCMPFALSEPS* (V1 _mm_cmp_ps |
Floating-Point | |||
---|---|---|---|
Double | Single | Half | |
compare scalar values to set flag register |
COMISD (S2 _mm_comieq_sd _mm_comilt_sd _mm_comile_sd _mm_comigt_sd _mm_comige_sd _mm_comineq_sd UCOMISD (S2 _mm_ucomieq_sd _mm_ucomilt_sd _mm_ucomile_sd _mm_ucomigt_sd _mm_ucomige_sd _mm_ucomineq_sd |
COMISS (S1 _mm_comieq_ss _mm_comilt_ss _mm_comile_ss _mm_comigt_ss _mm_comige_ss _mm_comineq_ss UCOMISS (S1 _mm_ucomieq_ss _mm_ucomilt_ss _mm_ucomile_ss _mm_ucomigt_ss _mm_ucomige_ss _mm_ucomineq_ss |
VCOMISH (V5+FP16 _mm_comieq_sh _mm_comilt_sh _mm_comile_sh _mm_comigt_sh _mm_comige_sh _mm_comineq_sh VUCOMISH (V5+FP16 _mm_ucomieq_sh _mm_ucomilt_sh _mm_ucomile_sh _mm_ucomigt_sh _mm_ucomige_sh _mm_ucomineq_sh |
Integer | Floating-Point | ||||||
---|---|---|---|---|---|---|---|
QWORD | DWORD | WORD | BYTE | Double | Single | Half | |
and |
PAND (S2 _mm_and_si128 |
ANDPD (S2 _mm_and_pd |
ANDPS (S1 _mm_and_ps |
||||
VPANDQ (V5... _mm512_and_epi64 etc |
VPANDD (V5... _mm512_and_epi32 etc |
||||||
and not |
PANDN (S2 _mm_andnot_si128 |
ANDNPD (S2 _mm_andnot_pd |
ANDNPS (S1 _mm_andnot_ps |
||||
VPANDNQ (V5... _mm512_andnot_epi64 etc |
VPANDND (V5... _mm512_andnot_epi32 etc |
||||||
or |
POR (S2 _mm_or_si128 |
ORPD (S2 _mm_or_pd |
ORPS (S1 _mm_or_ps |
||||
VPORQ (V5... _mm512_or_epi64 etc |
VPORD (V5... _mm512_or_epi32 etc |
||||||
xor |
PXOR (S2 _mm_xor_si128 |
XORPD (S2 _mm_xor_pd |
XORPS (S1 _mm_xor_ps |
||||
VPXORQ (V5... _mm512_xor_epi64 etc |
VPXORD (V5... _mm512_xor_epi32 etc |
||||||
test |
PTEST (S4.1 _mm_testz_si128 _mm_testc_si128 _mm_testnzc_si128 |
VTESTPD (V1 _mm_testz_pd _mm_testc_pd _mm_testnzc_pd |
VTESTPS (V1 _mm_testz_ps _mm_testc_ps _mm_testnzc_ps |
||||
VPTESTMQ (V5... _mm_test_epi64_mask VPTESTNMQ (V5... _mm_testn_epi64_mask |
VPTESTMD (V5... _mm_test_epi32_mask VPTESTNMD (V5... _mm_testn_epi32_mask |
VPTESTMW (V5+BW... _mm_test_epi16_mask VPTESTNMW (V5+BW... _mm_testn_epi16_mask |
VPTESTMB (V5+BW... _mm_test_epi8_mask VPTESTNMB (V5+BW... _mm_testn_epi8_mask |
||||
ternary operation |
VPTERNLOGQ (V5... _mm_ternarylogic_epi64 |
VPTERNLOGD (V5... _mm_ternarylogic_epi32 |
Integer | ||||
---|---|---|---|---|
QWORD | DWORD | WORD | BYTE | |
shift left logical |
PSLLQ (S2 _mm_slli_epi64 _mm_sll_epi64 |
PSLLD (S2 _mm_slli_epi32 _mm_sll_epi32 |
PSLLW (S2 _mm_slli_epi16 _mm_sll_epi16 |
|
VPSLLVQ (V2 _mm_sllv_epi64 |
VPSLLVD (V2 _mm_sllv_epi32 |
VPSLLVW (V5+BW... _mm_sllv_epi16 |
||
shift right logical |
PSRLQ (S2 _mm_srli_epi64 _mm_srl_epi64 |
PSRLD (S2 _mm_srli_epi32 _mm_srl_epi32 |
PSRLW (S2 _mm_srli_epi16 _mm_srl_epi16 |
|
VPSRLVQ (V2 _mm_srlv_epi64 |
VPSRLVD (V2 _mm_srlv_epi32 |
VPSRLVW (V5+BW... _mm_srlv_epi16 |
||
shift right arithmetic |
VPSRAQ (V5... _mm_srai_epi64 _mm_sra_epi64 |
PSRAD (S2 _mm_srai_epi32 _mm_sra_epi32 |
PSRAW (S2 _mm_srai_epi16 _mm_sra_epi16 |
|
VPSRAVQ (V5... _mm_srav_epi64 |
VPSRAVD (V2 _mm_srav_epi32 |
VPSRAVW (V5+BW... _mm_srav_epi16 |
||
rotate left |
VPROLQ (V5... _mm_rol_epi64 |
VPROLD (V5... _mm_rol_epi32 |
||
VPROLVQ (V5... _mm_rolv_epi64 |
VPROLVD (V5... _mm_rolv_epi32 |
|||
rotate right |
VPRORQ (V5... _mm_ror_epi64 |
VPRORD (V5... _mm_ror_epi32 |
||
VPRORVQ (V5... _mm_rorv_epi64 |
VPRORVD (V5... _mm_rorv_epi32 |
|||
shift left logical double |
VPSHLDQ (V5+VBMI2... _mm_shldi_epi64 |
VPSHLDD (V5+VBMI2... _mm_shldi_epi32 |
VPSHLDW (V5+VBMI2... _mm_shldi_epi16 |
|
VPSHLDVQ (V5+VBMI2... _mm_shldv_epi64 |
VPSHLDVD (V5+VBMI2... _mm_shldv_epi32 |
VPSHLDVW (V5+VBMI2... _mm_shldv_epi16 |
||
shift right logical double |
VPSHRDQ (V5+VBMI2... _mm_shrdi_epi64 |
VPSHRDD (V5+VBMI2... _mm_shrdi_epi32 |
VPSHRDW (V5+VBMI2... _mm_shrdi_epi16 |
|
VPSHRDVQ (V5+VBMI2... _mm_shrdv_epi64 |
VPSHRDVD (V5+VBMI2... _mm_shrdv_epi32 |
VPSHRDVW (V5+VBMI2... _mm_shrdv_epi16 |
128-bit | |
---|---|
shift left logical |
PSLLDQ (S2 _mm_slli_si128 |
shift right logical |
PSRLDQ (S2 _mm_srli_si128 |
packed align right |
PALIGNR (SS3 _mm_alignr_epi8 |
explicit length | implicit length | |
---|---|---|
return index |
PCMPESTRI (S4.2 _mm_cmpestri _mm_cmpestra _mm_cmpestrc _mm_cmpestro _mm_cmpestrs _mm_cmpestrz |
PCMPISTRI (S4.2 _mm_cmpistri _mm_cmpistra _mm_cmpistrc _mm_cmpistro _mm_cmpistrs _mm_cmpistrz |
return mask |
PCMPESTRM (S4.2 _mm_cmpestrm _mm_cmpestra _mm_cmpestrc _mm_cmpestro _mm_cmpestrs _mm_cmpestrz |
PCMPISTRM (S4.2 _mm_cmpistrm _mm_cmpistra _mm_cmpistrc _mm_cmpistro _mm_cmpistrs _mm_cmpistrz |
LDMXCSR (S1 _mm_setcsr |
Load MXCSR register |
STMXCSR (S1 _mm_getcsr |
Save MXCSR register state |
PSADBW (S2 _mm_sad_epu8 |
Compute sum of absolute differences |
MPSADBW (S4.1 _mm_mpsadbw_epu8 |
Performs eight 4-byte wide Sum of Absolute Differences operations to produce eight word integers. |
VDBPSADBW (V5+BW... _mm_dbsad_epu8 |
Double Block Packed Sum-Absolute-Differences (SAD) on Unsigned Bytes |
PMULHRSW (SS3 _mm_mulhrs_epi16 |
Packed Multiply High with Round and Scale |
PHMINPOSUW (S4.1 _mm_minpos_epu16 |
Finds the value and location of the minimum unsigned word from one of 8 horizontally packed unsigned words. The resulting value and location (offset within the source) are packed into the low dword of the destination XMM register. |
VPCONFLICTQ (V5+CD... _mm512_conflict_epi64 VPCONFLICTD (V5+CD... _mm512_conflict_epi32 |
Detect Conflicts Within a Vector of Packed Dword/Qword Values into Dense Memory/Register |
VP2INTERSECTQ (V5+VP2INTERSECT... _mm512_2intersect_epi64 VP2INTERSECTD (V5+VP2INTERSECT... _mm512_2intersect_epi32 |
Compute Intersection Between DWORDS/QUADWORDS to a Pair of Mask Registers |
VPLZCNTQ (V5+CD... _mm_lzcnt_epi64 VPLZCNTD (V5+CD... _mm_lzcnt_epi32 |
Count the Number of Leading Zero Bits for Packed Dword, Packed Qword Values |
VFIXUPIMMPD* (V5... _mm512_fixupimm_pd VFIXUPIMMPS* (V5... _mm512_fixupimm_ps |
Fix Up Special Packed Float64/32 Values |
VFPCLASSPD* (V5... _mm512_fpclass_pd_mask VFPCLASSPS* (V5... _mm512_fpclass_ps_mask VFPCLASSPH* (V5+FP16... _mm512_fpclass_ph_mask |
Tests Types Of a Packed Float64/32 Values |
VRANGEPD* (V5+DQ... _mm_range_pd VRANGEPS* (V5+DQ... _mm_range_ps |
Range Restriction Calculation For Packed Pairs of Float64/32 Values |
VGETEXPPD* (V5... _mm512_getexp_pd VGETEXPPS* (V5... _mm512_getexp_ps VGETEXPPH* (V5+FP16... _mm512_getexp_ph |
Convert Exponents of Packed DP/SP FP Values to FP Values |
VGETMANTPD* (V5... _mm512_getmant_pd VGETMANTPS* (V5... _mm512_getmant_ps VGETMANTPH* (V5+FP16... _mm512_getmant_ph |
Extract Float64/32 Vector of Normalized Mantissas from Float64/32 Vector |
AESDEC (AESNI _mm_aesdec_si128 |
Perform an AES decryption round using a 128-bit state and a round key |
AESDECLAST (AESNI _mm_aesdeclast_si128 |
Perform the last AES decryption round using a 128-bit state and a round key |
AESENC (AESNI _mm_aesenc_si128 |
Perform an AES encryption round using a 128-bit state and a round key |
AESENCLAST (AESNI _mm_aesenclast_si128 |
Perform the last AES encryption round using a 128-bit state and a round key |
AESIMC (AESNI _mm_aesimc_si128 |
Perform an inverse mix column transformation primitive |
AESKEYGENASSIST (AESNI _mm_aeskeygenassist_si128 |
Assist the creation of round keys with a key expansion schedule |
PCLMULQDQ (PCLMULQDQ _mm_clmulepi64_si128 |
Perform carryless multiplication of two 64-bit numbers |
SHA1RNDS4 (SHA _mm_sha1rnds4_epu32 |
Perform Four Rounds of SHA1 Operation |
SHA1NEXTE (SHA _mm_sha1nexte_epu32 |
Calculate SHA1 State Variable E after Four Rounds |
SHA1MSG1 (SHA _mm_sha1msg1_epu32 |
Perform an Intermediate Calculation for the Next Four SHA1 Message Dwords |
SHA1MSG2 (SHA _mm_sha1msg2_epu32 |
Perform a Final Calculation for the Next Four SHA1 Message Dwords |
SHA256RNDS2 (SHA _mm_sha256rnds2_epu32 |
Perform Two Rounds of SHA256 Operation |
SHA256MSG1 (SHA _mm_sha256msg1_epu32 |
Perform an Intermediate Calculation for the Next Four SHA256 Message Dwords |
SHA256MSG2 (SHA _mm_sha256msg2_epu32 |
Perform a Final Calculation for the Next Four SHA256 Message Dwords |
GF2P8AFFINEQB (GFNI... _mm_gf2p8affine_epi64_epi8 |
Galois Field Affine Transformation |
GF2P8AFFINEINVQB (GFNI... _mm_gf2p8affineinv_epi64_epi8 |
Galois Field Affine Transformation Inverse |
GF2P8MULB (GFNI... _mm_gf2p8mul_epi8 |
Galois Field Multiply Bytes |
VFMULCPH* (V5+FP16... _mm_fmul_pch _mm_mul_pch VFCMULCPH* (V5+FP16... _mm_fcmul_pch _mm_cmul_pch |
Complex Multiply FP16 Values |
VFMADDCPH* (V5+FP16... _mm_fmadd_pch VFCMADDCPH* (V5+FP16... _mm_fcmadd_pch |
Complex Multiply and Accumulate FP16 Values |
VCVTNE2PS2BF16 (V5+BF16... _mm_cvtne2ps_pbh |
Convert Two Packed Single Data to One Packed BF16 Data |
VCVTNEPS2BF16 (V5+BF16... _mm_cvtneps_pbh |
Convert Packed Single Data to Packed BF16 Data |
VDPBF16PS (V5+BF16... _mm_dpbf16_ps |
Dot Product of BF16 Pairs Accumulated Into Packed Single Precision |
VPMADD52HUQ (V5+IFMA... _mm_madd52hi_epu64 |
Packed Multiply of Unsigned 52-Bit Integers and Add the High 52-Bit Products to Qword Accumulators |
VPMADD52LUQ (V5+IFMA... _mm_madd52lo_epu64 |
Packed Multiply of Unsigned 52-Bit Integers and Add the Low 52-Bit Products to Qword Accumulators |
VPMULTISHIFTQB (V5+VBMI... _mm_multishift_epi64_epi8 |
Select Packed Unaligned Bytes From Quadword Sources |
VPSHUFBITQMB (V5+BITALG... _mm_bitshuffle_epi64_mask |
Shuffle Bits From Quadword Elements Using Byte Indexes Into Mask |
VPBROADCASTMB2Q (V5+CD... _mm_broadcastmb_epi64 VPBROADCASTMW2D (V5+CD... _mm_broadcastmw_epi32 |
Broadcast Mask to Vector Register |
VZEROALL (V1 _mm256_zeroall |
Zero all YMM registers |
VZEROUPPER (V1 _mm256_zeroupper |
Zero upper 128 bits of all YMM registers |
MOVNTPS (S1 _mm_stream_ps |
Non-temporal store of four packed single-precision floating-point values from an XMM register into memory |
MASKMOVDQU (S2 _mm_maskmoveu_si128 |
Non-temporal store of selected bytes from an XMM register into memory |
MOVNTPD (S2 _mm_stream_pd |
Non-temporal store of two packed double-precision floating-point values from an XMM register into memory |
MOVNTDQ (S2 _mm_stream_si128 |
Non-temporal store of double quadword from an XMM register into memory |
LDDQU (S3 _mm_lddqu_si128 |
Special 128-bit unaligned load designed to avoid cache line splits |
MOVNTDQA (S4.1 _mm_stream_load_si128 |
Provides a non-temporal hint that can cause adjacent 16-byte items within an aligned 64-byte region (a streaming line) to be fetched and held in a small set of temporary buffers ("streaming load buffers"). Subsequent streaming loads to other aligned 16-byte items in the same streaming line may be supplied from the streaming load buffer and can improve throughput. |
LDTILECFG (AMX-TILE# _tile_loadconfig |
Load Tile Configuration |
STTILECFG (AMX-TILE# _tile_storeconfig |
Store Tile Configuration |
TILELOADD (AMX-TILE# _tile_loadd TILELOADDT1 (AMX-TILE# _tile_stream_loadd |
Load Tile Data |
TILESTORED (AMX-TILE# _tile_stored |
Store Tile Data |
TDPBF16PS (AMX-BF16# _tile_dpbf16ps |
Dot Product of BF16 Tiles Accumulated into Packed Single Precision Tile |
TDPBSSD (AMX-INT8# _tile_dpbssd TDPBSUD (AMX-INT8# _tile_dpbsud TDPBUSD (AMX-INT8# _tile_dpbusd TDPBUUD (AMX-INT8# _tile_dpbuud |
Dot Product of Signed/Unsigned Bytes with Dword Accumulation |
TILEZERO (AMX-TILE# _tile_zero |
Zero Tile |
TILERELEASE (AMX-TILE# _tile_release |
Release Tile |
XOR works for both Integer and Floating-point.
Example: Zero all of 2 QWORDS / 4 DWORDS / 8 WORDS / 16 BYTES in XMM1
pxor xmm1, xmm1
Example: Set 0.0f to 4 floats in XMM1
xorps xmm1, xmm1
Example: Set 0.0 to 2 doubles in XMM1
xorpd xmm1, xmm1
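Example (intrinsics): Zero-initialize __m128i / __m128 / __m128d variables; compilers typically emit the XOR idiom above for these:
__m128i izero = _mm_setzero_si128(); // 2 QWORDS / 4 DWORDS / 8 WORDS / 16 BYTES of zero
__m128 fzero = _mm_setzero_ps();     // 4 x 0.0f
__m128d dzero = _mm_setzero_pd();    // 2 x 0.0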
Use a shuffle instruction.
Example: Copy the lowest float element to other 3 elements in XMM1.
shufps xmm1, xmm1, 0
Example: Copy the lowest WORD element to other 7 elements in XMM1
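One possible SSE2 sequence (a sketch; with AVX2, VPBROADCASTW does this in one instruction):
pshuflw xmm1, xmm1, 0 ; copy WORD[0] to the 4 lower WORDS
pshufd xmm1, xmm1, 0  ; copy DWORD[0] to all 4 DWORDS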
Example: Copy the lower QWORD element to the upper element in XMM1
pshufd xmm1, xmm1, 44h ; 01 00 01 00 B = 44h
Is this better?
punpcklqdq xmm1, xmm1
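Example (intrinsics, a sketch): the same broadcasts with the shuffle/unpack intrinsics from the tables above; here x is an assumed __m128 variable and v an assumed __m128i variable:
__m128 f4 = _mm_shuffle_ps(x, x, 0);   // copy the lowest float to all 4 elements
__m128i d4 = _mm_shuffle_epi32(v, 0);  // copy the lowest DWORD to all 4 elements
__m128i q2 = _mm_unpacklo_epi64(v, v); // copy the lower QWORD to the upper half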
Use an unpack instruction.
Example: Zero extend 8 WORDS in XMM1 to DWORDS in XMM1 (lower 4) and XMM2 (upper 4).
movdqa xmm2, xmm1    ; src data WORD[7] [6] [5] [4] [3] [2] [1] [0]
pxor xmm3, xmm3      ; upper 16-bit to attach to each WORD = all 0
punpcklwd xmm1, xmm3 ; lower 4 DWORDS: 0 [3] 0 [2] 0 [1] 0 [0]
punpckhwd xmm2, xmm3 ; upper 4 DWORDS: 0 [7] 0 [6] 0 [5] 0 [4]
Example: Sign extend 16 BYTES in XMM1 to WORDS in XMM1 (lower 8) and XMM2 (upper 8).
pxor xmm3, xmm3
movdqa xmm2, xmm1
pcmpgtb xmm3, xmm1   ; upper 8-bit to attach to each BYTE = src >= 0 ? 0 : -1
punpcklbw xmm1, xmm3 ; lower 8 WORDS
punpckhbw xmm2, xmm3 ; upper 8 WORDS
Example (intrinsics): Sign extend 8 WORDS in __m128i variable words8 to DWORDS in dwords4lo (lower 4) and dwords4hi (upper 4)
const __m128i izero = _mm_setzero_si128();
__m128i words8hi = _mm_cmpgt_epi16(izero, words8);
__m128i dwords4lo = _mm_unpacklo_epi16(words8, words8hi);
__m128i dwords4hi = _mm_unpackhi_epi16(words8, words8hi);
If an integer value is positive or zero, no action is required. Otherwise complement and add 1.
Example: Set absolute values of 8 signed WORDS in XMM1 to XMM1
;                      if src is positive or 0     if src is negative
pxor xmm2, xmm2
pcmpgtw xmm2, xmm1   ; xmm2 <- 0                 ; xmm2 <- -1
pxor xmm1, xmm2      ; xor with 0 (do nothing)   ; xor with -1 (complement all bits)
psubw xmm1, xmm2     ; subtract 0 (do nothing)   ; subtract -1 (add 1)
Example (intrinsics): Set absolute values of 4 DWORDS in __m128i variable dwords4 to dwords4
const __m128i izero = _mm_setzero_si128();
__m128i tmp = _mm_cmpgt_epi32(izero, dwords4);
dwords4 = _mm_xor_si128(dwords4, tmp);
dwords4 = _mm_sub_epi32(dwords4, tmp);
Floating-Points are not complemented so just clearing sign (the highest) bit makes the absolute value.
Example: Set absolute values of 4 floats in XMM1 to XMM1
; data
align 16
signoffmask dd 4 dup (7fffffffH) ; mask for clearing the highest bit
; code
andps xmm1, xmmword ptr signoffmask
Example (intrinsics): Set absolute values of 4 floats in __m128 variable floats4 to floats4
const __m128 signmask = _mm_set1_ps(-0.0f); // 0x80000000
floats4 = _mm_andnot_ps(signmask, floats4);
Signed/unsigned makes a difference only for the calculation of the upper part of the product. For the lower part, the same instruction can be used for both signed and unsigned.
unsigned WORD * unsigned WORD -> Upper WORD: PMULHUW, Lower WORD: PMULLW
signed WORD * signed WORD -> Upper WORD: PMULHW, Lower WORD: PMULLW
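For example, a full 16-bit x 16-bit -> 32-bit multiplication can be assembled from the two halves with an unpack (a sketch; a and b are assumed __m128i variables holding 8 WORDS each):
__m128i lo = _mm_mullo_epi16(a, b); // lower 16 bits of each product (same for signed/unsigned)
__m128i hi = _mm_mulhi_epi16(a, b); // upper 16 bits, signed; use _mm_mulhi_epu16 for unsigned
__m128i prod0_3 = _mm_unpacklo_epi16(lo, hi); // full 32-bit products of elements 0..3
__m128i prod4_7 = _mm_unpackhi_epi16(lo, hi); // full 32-bit products of elements 4..7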
Use bitwise operations on the all-ones/all-zeros mask produced by a compare instruction.
Example: Compare each signed DWORD in XMM1 and XMM2 and set smaller one to XMM1
; A=xmm1 B=xmm2        if A>B      if A<=B
movdqa xmm0, xmm1
pcmpgtd xmm1, xmm2   ; xmm1=-1   ; xmm1=0
pand xmm2, xmm1      ; xmm2=B    ; xmm2=0
pandn xmm1, xmm0     ; xmm1=0    ; xmm1=A
por xmm1, xmm2       ; xmm1=B    ; xmm1=A
Example (intrinsics): Compare each signed byte in __m128i variables a, b and set larger one to maxAB
__m128i mask = _mm_cmpgt_epi8(a, b);
__m128i selectedA = _mm_and_si128(mask, a);
__m128i selectedB = _mm_andnot_si128(mask, b);
__m128i maxAB = _mm_or_si128(selectedA, selectedB);
Use PCMPEQx.
Example: Set -1 to all of the 2 QWORDS / 4 DWORDS / 8 WORDS / 16 BYTES in XMM1.
pcmpeqb xmm1, xmm1
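Example (intrinsics, a sketch): the same idiom, comparing an initialized __m128i variable v with itself; _mm_set1_epi32(-1) is usually compiled to the same PCMPEQ idiom:
__m128i allones = _mm_cmpeq_epi8(v, v); // every element equals itself, so all bits become 1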
Division by a constant can be done fast by an integer multiplication with some adjustment operations.
Example (intrinsics): Divide each unsigned WORD in an __m128i variable by 7.
// unsigned WORD integer division by constant 7 (SSE2)
// __m128i dividends : 8 unsigned WORDs
__m128i t = _mm_mulhi_epu16(_mm_set1_epi16(9363), dividends);
t = _mm_add_epi16(t, _mm_srli_epi16(_mm_sub_epi16(dividends, t), 1));
__m128i quotients = _mm_srli_epi16(t, 2);
For each divisor constant, a magic number to multiply by is required. Detailed information about this algorithm, including how to calculate the magic number, is in the book "Hacker's Delight" by Henry S. Warren Jr.
Based on the book, I made a code generator for SSE/AVX intrinsics.