x86/x64 SIMD Instruction List (SSE to AVX512)

MMX register (64-bit) instructions are omitted.

S1=SSE  S2=SSE2 S3=SSE3 SS3=SSSE3 S4.1=SSE4.1 S4.2=SSE4.2 V1=AVX V2=AVX2 V5=AVX512 #=64-bit mode only

Instructions marked with * become scalar instructions (only the lowest element is calculated) when PS/PD/DQ is changed to SS/SD/SI.
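
For example (a minimal intrinsics sketch; values and variable names are only for illustration, header is <immintrin.h>): ADDPD operates on all packed elements, while its scalar form ADDSD operates only on the lowest one.

    __m128d a = _mm_set_pd(3.0, 1.0);      // a = { 1.0, 3.0 }  (element 0 is the low element)
    __m128d b = _mm_set_pd(30.0, 10.0);    // b = { 10.0, 30.0 }
    __m128d packed = _mm_add_pd(a, b);     // ADDPD : { 11.0, 33.0 }  both elements added
    __m128d scalar = _mm_add_sd(a, b);     // ADDSD : { 11.0, 3.0 }   only element 0 added, upper element copied from a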

C/C++ intrinsic name is written below each instruction in blue.

This document is intended to help you find the correct name of an instruction you are not sure of, so that you can look it up in the manuals. Refer to the manuals before coding.

Intel's manuals -> https://software.intel.com/en-us/articles/intel-sdm

If you find any errors, please post to the feedback form or email me at the address at the bottom of this page.

 


MOVE     ?MM = XMM / YMM / ZMM

  Integer Floating-Point YMM lane (128-bit)
QWORD DWORD WORD BYTE Double Single Half
?MM whole
from / to
?MM/mem
MOVDQA (S2
_mm_load_si128
_mm_store_si128
MOVDQU (S2
_mm_loadu_si128
_mm_storeu_si128
MOVAPD (S2
_mm_load_pd
_mm_loadr_pd
_mm_store_pd
_mm_storer_pd

MOVUPD (S2
_mm_loadu_pd
_mm_storeu_pd
MOVAPS (S1
_mm_load_ps
_mm_loadr_ps
_mm_store_ps
_mm_storer_ps

MOVUPS (S1
_mm_loadu_ps
_mm_storeu_ps
 
VMOVDQA64 (V5...
_mm_mask_load_epi64
_mm_mask_store_epi64
etc
VMOVDQU64 (V5...
_mm_mask_loadu_epi64
_mm_mask_storeu_epi64
etc
VMOVDQA32 (V5...
_mm_mask_load_epi32
_mm_mask_store_epi32
etc
VMOVDQU32 (V5...
_mm_mask_loadu_epi32
_mm_mask_storeu_epi32
etc
VMOVDQU16 (V5+BW...
_mm_mask_loadu_epi16
_mm_mask_storeu_epi16
etc
VMOVDQU8 (V5+BW...
_mm_mask_loadu_epi8
_mm_mask_storeu_epi8
etc
XMM upper half
from / to
mem
MOVHPD (S2
_mm_loadh_pd
_mm_storeh_pd
MOVHPS (S1
_mm_loadh_pi
_mm_storeh_pi
 
XMM upper half
from / to
XMM lower half
MOVHLPS (S1
_mm_movehl_ps
MOVLHPS (S1
_mm_movelh_ps
 
XMM lower half
from / to
mem
MOVQ (S2
_mm_loadl_epi64
_mm_storel_epi64
      MOVLPD (S2
_mm_loadl_pd
_mm_storel_pd
MOVLPS (S1
_mm_loadl_pi
_mm_storel_pi
   
XMM lowest 1 elem
from / to
r/m
MOVQ (S2#
_mm_cvtsi64_si128
_mm_cvtsi128_si64
MOVD (S2
_mm_cvtsi32_si128
_mm_cvtsi128_si32
VMOVW (V5+FP16
_mm_cvtsi16_si128
_mm_cvtsi128_si16
   
XMM lowest 1 elem
from / to
XMM/mem
MOVQ (S2
_mm_move_epi64
      MOVSD (S2
_mm_load_sd
_mm_store_sd
_mm_move_sd
MOVSS (S1
_mm_load_ss
_mm_store_ss
_mm_move_ss
VMOVSH (V5+FP16
_mm_load_sh
_mm_store_sh
_mm_move_sh
 
XMM whole
from
1 elem
TIP 2
_mm_set1_epi64x
VPBROADCASTQ (V2
_mm_broadcastq_epi64
TIP 2
_mm_set1_epi32
VPBROADCASTD (V2
_mm_broadcastd_epi32
TIP 2
_mm_set1_epi16
VPBROADCASTW (V2
_mm_broadcastw_epi16
_mm_set1_epi8
VPBROADCASTB (V2
_mm_broadcastb_epi8
TIP 2
_mm_set1_pd
_mm_load1_pd
MOVDDUP (S3
_mm_movedup_pd
_mm_loaddup_pd

TIP 2
_mm_set1_ps
_mm_load1_ps

VBROADCASTSS
from mem (V1
from XMM (V2
_mm_broadcast_ss
YMM / ZMM whole
from
1 elem
VPBROADCASTQ (V2
_mm256_broadcastq_epi64
VPBROADCASTD (V2
_mm256_broadcastd_epi32
VPBROADCASTW (V2
_mm256_broadcastw_epi16
VPBROADCASTB (V2
_mm256_broadcastb_epi8
VBROADCASTSD
 from mem (V1
 from XMM (V2
_mm256_broadcast_sd
VBROADCASTSS
 from mem (V1
 from XMM (V2
_mm256_broadcast_ss
  VBROADCASTF128 (V1
_mm256_broadcast_ps
_mm256_broadcast_pd

VBROADCASTI128 (V2
_mm256_broadcastsi128_si256
YMM / ZMM whole
from
2/4/8 elems
VBROADCASTI64X2 (V5+DQ...
_mm512_broadcast_i64x2
VBROADCASTI64X4 (V5
_mm512_broadcast_i64x4
VBROADCASTI32X2 (V5+DQ...
_mm512_broadcast_i32x2
VBROADCASTI32X4 (V5...
_mm512_broadcast_i32x4
VBROADCASTI32X8 (V5+DQ
_mm512_broadcast_i32x8
VBROADCASTF64X2 (V5+DQ...
_mm512_broadcast_f64x2
VBROADCASTF64X4 (V5
_mm512_broadcast_f64x4
VBROADCASTF32X2 (V5+DQ...
_mm512_broadcast_f32x2
VBROADCASTF32X4 (V5...
_mm512_broadcast_f32x4
VBROADCASTF32X8 (V5+DQ
_mm512_broadcast_f32x8
?MM
from
multiple elems
_mm_set_epi64x
_mm_setr_epi64x
_mm_set_epi32
_mm_setr_epi32
_mm_set_epi16
_mm_setr_epi16
_mm_set_epi8
_mm_setr_epi8
_mm_set_pd
_mm_setr_pd
_mm_set_ps
_mm_setr_ps
   
?MM whole
from
zero
TIP 1
_mm_setzero_si128
TIP 1
_mm_setzero_pd
TIP 1
_mm_setzero_ps
   
extract PEXTRQ (S4.1#
_mm_extract_epi64
PEXTRD (S4.1
_mm_extract_epi32
PEXTRW to r (S2
PEXTRW to r/m (S4.1
_mm_extract_epi16
PEXTRB (S4.1
_mm_extract_epi8
->MOVHPD (S2
_mm_loadh_pd
_mm_storeh_pd

->MOVLPD (S2
_mm_loadl_pd
_mm_storel_pd
EXTRACTPS (S4.1
_mm_extract_ps
  VEXTRACTF128 (V1
_mm256_extractf128_ps
_mm256_extractf128_pd
_mm256_extractf128_si256

VEXTRACTI128 (V2
_mm256_extracti128_si256
VEXTRACTI64X2 (V5+DQ...
_mm512_extracti64x2_epi64
VEXTRACTI64X4 (V5
_mm512_extracti64x4_epi64
VEXTRACTI32X4 (V5...
_mm512_extracti32x4_epi32
VEXTRACTI32X8 (V5+DQ
_mm512_extracti32x8_epi32
VEXTRACTF64X2 (V5+DQ...
_mm512_extractf64x2_pd
VEXTRACTF64X4 (V5
_mm512_extractf64x4_pd
VEXTRACTF32X4 (V5...
_mm512_extractf32x4_ps
VEXTRACTF32X8 (V5+DQ
_mm512_extractf32x8_ps
insert PINSRQ (S4.1#
_mm_insert_epi64
PINSRD (S4.1
_mm_insert_epi32
PINSRW (S2
_mm_insert_epi16
PINSRB (S4.1
_mm_insert_epi8
->MOVHPD (S2
_mm_loadh_pd
_mm_storeh_pd

->MOVLPD (S2
_mm_loadl_pd
_mm_storel_pd
INSERTPS (S4.1
_mm_insert_ps
  VINSERTF128 (V1
_mm256_insertf128_ps
_mm256_insertf128_pd
_mm256_insertf128_si256

VINSERTI128 (V2
_mm256_inserti128_si256
VINSERTI64X2 (V5+DQ...
_mm512_inserti64x2
VINSERTI64X4 (V5...
_mm512_inserti64x4
VINSERTI32X4 (V5...
_mm512_inserti32x4
VINSERTI32X8 (V5+DQ
_mm512_inserti32x8
VINSERTF64X2 (V5+DQ...
_mm512_insertf64x2
VINSERTF64X4 (V5
_mm512_insertf64x4
VINSERTF32X4 (V5...
_mm512_insertf32x4
VINSERTF32X8 (V5+DQ
_mm512_insertf32x8
unpack
PUNPCKHQDQ (S2
_mm_unpackhi_epi64
PUNPCKLQDQ (S2
_mm_unpacklo_epi64
PUNPCKHDQ (S2
_mm_unpackhi_epi32
PUNPCKLDQ (S2
_mm_unpacklo_epi32
PUNPCKHWD (S2
_mm_unpackhi_epi16
PUNPCKLWD (S2
_mm_unpacklo_epi16
PUNPCKHBW (S2
_mm_unpackhi_epi8
PUNPCKLBW (S2
_mm_unpacklo_epi8
UNPCKHPD (S2
_mm_unpackhi_pd
UNPCKLPD (S2
_mm_unpacklo_pd
UNPCKHPS (S1
_mm_unpackhi_ps
UNPCKLPS (S1
_mm_unpacklo_ps
   
shuffle/permute
VPERMQ (V2
_mm256_permute4x64_epi64
VPERMI2Q (V5...
_mm_permutex2var_epi64
VPERMT2Q (V5...
_mm_permutex2var_epi64
PSHUFD (S2
_mm_shuffle_epi32
VPERMD (V2
_mm256_permutevar8x32_epi32
_mm256_permutexvar_epi32
VPERMI2D (V5...
_mm_permutex2var_epi32
VPERMT2D (V5...
_mm_permutex2var_epi32
PSHUFHW (S2
_mm_shufflehi_epi16
PSHUFLW (S2
_mm_shufflelo_epi16
VPERMW (V5+BW...
_mm_permutexvar_epi16
VPERMI2W (V5+BW...
_mm_permutex2var_epi16
VPERMT2W (V5+BW...
_mm_permutex2var_epi16
PSHUFB (SS3
_mm_shuffle_epi8
VPERMB (V5+VBMI...
_mm_permutexvar_epi8
VPERMI2B (V5+VBMI...
_mm_permutex2var_epi8
VPERMT2B (V5+VBMI...
_mm_permutex2var_epi8
SHUFPD (S2
_mm_shuffle_pd
VPERMILPD (V1
_mm_permute_pd
_mm_permutevar_pd

VPERMPD (V2
_mm256_permute4x64_pd
VPERMI2PD (V5...
_mm_permutex2var_pd
VPERMT2PD (V5...
_mm_permutex2var_pd
SHUFPS (S1
_mm_shuffle_ps
VPERMILPS (V1
_mm_permute_ps
_mm_permutevar_ps

VPERMPS (V2
_mm256_permutevar8x32_ps
VPERMI2PS (V5...
_mm_permutex2var_ps
VPERMT2PS (V5...
_mm_permutex2var_ps
  VPERM2F128 (V1
_mm256_permute2f128_ps
_mm256_permute2f128_pd
_mm256_permute2f128_si256

VPERM2I128 (V2
_mm256_permute2x128_si256
VSHUFI64X2 (V5...
_mm512_shuffle_i64x2
VSHUFI32X4 (V5...
_mm512_shuffle_i32x4
VSHUFF64X2 (V5...
_mm512_shuffle_f64x2
VSHUFF32X4 (V5...
_mm512_shuffle_f32x4
blend
VPBLENDMQ (V5...
_mm_mask_blend_epi64
VPBLENDD (V2
_mm_blend_epi32
VPBLENDMD (V5...
_mm_mask_blend_epi32
PBLENDW (S4.1
_mm_blend_epi16
VPBLENDMW (V5+BW...
_mm_mask_blend_epi16
PBLENDVB (S4.1
_mm_blendv_epi8
VPBLENDMB (V5+BW...
_mm_mask_blend_epi8
BLENDPD (S4.1
_mm_blend_pd
BLENDVPD (S4.1
_mm_blendv_pd
VBLENDMPD (V5...
_mm_mask_blend_pd
BLENDPS (S4.1
_mm_blend_ps
BLENDVPS (S4.1
_mm_blendv_ps
VBLENDMPS (V5...
_mm_mask_blend_ps
   
move and duplicate MOVDDUP (S3
_mm_movedup_pd
_mm_loaddup_pd
MOVSHDUP (S3
_mm_movehdup_ps
MOVSLDUP (S3
_mm_moveldup_ps
 
mask move VPMASKMOVQ (V2
_mm_maskload_epi64
_mm_maskstore_epi64
VPMASKMOVD (V2
_mm_maskload_epi32
_mm_maskstore_epi32
    VMASKMOVPD (V1
_mm_maskload_pd
_mm_maskstore_pd
VMASKMOVPS (V1
_mm_maskload_ps
_mm_maskstore_ps
   
extract highest bit       PMOVMSKB (S2
_mm_movemask_epi8
MOVMSKPD (S2
_mm_movemask_pd
MOVMSKPS (S1
_mm_movemask_ps
   
VPMOVQ2M (V5+DQ...
_mm_movepi64_mask
VPMOVD2M (V5+DQ...
_mm_movepi32_mask
VPMOVW2M (V5+BW...
_mm_movepi16_mask
VPMOVB2M (V5+BW...
_mm_movepi8_mask
gather
VPGATHERDQ (V2
_mm_i32gather_epi64
_mm_mask_i32gather_epi64

VPGATHERQQ (V2
_mm_i64gather_epi64
_mm_mask_i64gather_epi64
VPGATHERDD (V2
_mm_i32gather_epi32
_mm_mask_i32gather_epi32

VPGATHERQD (V2
_mm_i64gather_epi32
_mm_mask_i64gather_epi32
    VGATHERDPD (V2
_mm_i32gather_pd
_mm_mask_i32gather_pd

VGATHERQPD (V2
_mm_i64gather_pd
_mm_mask_i64gather_pd
VGATHERDPS (V2
_mm_i32gather_ps
_mm_mask_i32gather_ps

VGATHERQPS (V2
_mm_i64gather_ps
_mm_mask_i64gather_ps
   
scatter
VPSCATTERDQ (V5...
_mm_i32scatter_epi64
_mm_mask_i32scatter_epi64

VPSCATTERQQ (V5...
_mm_i64scatter_epi64
_mm_mask_i64scatter_epi64
VPSCATTERDD (V5...
_mm_i32scatter_epi32
_mm_mask_i32scatter_epi32

VPSCATTERQD (V5...
_mm_i64scatter_epi32
_mm_mask_i64scatter_epi32
    VSCATTERDPD (V5...
_mm_i32scatter_pd
_mm_mask_i32scatter_pd

VSCATTERQPD (V5...
_mm_i64scatter_pd
_mm_mask_i64scatter_pd
VSCATTERDPS (V5...
_mm_i32scatter_ps
_mm_mask_i32scatter_ps

VSCATTERQPS (V5...
_mm_i64scatter_ps
_mm_mask_i64scatter_ps
   
compress
VPCOMPRESSQ (V5...
_mm_mask_compress_epi64
_mm_mask_compressstoreu_epi64
VPCOMPRESSD (V5...
_mm_mask_compress_epi32
_mm_mask_compressstoreu_epi32
VPCOMPRESSW (V5+VBMI2...
_mm_mask_compress_epi16
_mm_mask_compressstoreu_epi16
VPCOMPRESSB (V5+VBMI2...
_mm_mask_compress_epi8
_mm_mask_compressstoreu_epi8
VCOMPRESSPD (V5...
_mm_mask_compress_pd
_mm_mask_compressstoreu_pd
VCOMPRESSPS (V5...
_mm_mask_compress_ps
_mm_mask_compressstoreu_ps
expand
VPEXPANDQ (V5...
_mm_mask_expand_epi64
_mm_mask_expandloadu_epi64
VPEXPANDD (V5...
_mm_mask_expand_epi32
_mm_mask_expandloadu_epi32
VPEXPANDW (V5+VBMI2...
_mm_mask_expand_epi16
_mm_mask_expandloadu_epi16
VPEXPANDB (V5+VBMI2...
_mm_mask_expand_epi8
_mm_mask_expandloadu_epi8
VEXPANDPD (V5...
_mm_mask_expand_pd
_mm_mask_expandloadu_pd
VEXPANDPS (V5...
_mm_mask_expand_ps
_mm_mask_expandloadu_ps
align right VALIGNQ (V5...
_mm_alignr_epi64
VALIGND (V5...
_mm_alignr_epi32
PALIGNR (SS3
_mm_alignr_epi8
expand Opmask bits VPMOVM2Q (V5+DQ...
_mm_movm_epi64
VPMOVM2D (V5+DQ...
_mm_movm_epi32
VPMOVM2W (V5+BW...
_mm_movm_epi16
VPMOVM2B (V5+BW...
_mm_movm_epi8

 

Conversions

from \ to Integer Floating-Point
QWORD DWORD WORD BYTE Double Single Half
Integer QWORD VPMOVQD (V5...
_mm_cvtepi64_epi32
VPMOVSQD (V5...
_mm_cvtsepi64_epi32
VPMOVUSQD (V5...
_mm_cvtusepi64_epi32
VPMOVQW (V5...
_mm_cvtepi64_epi16
VPMOVSQW (V5...
_mm_cvtsepi64_epi16
VPMOVUSQW (V5...
_mm_cvtusepi64_epi16
VPMOVQB (V5...
_mm_cvtepi64_epi8
VPMOVSQB (V5...
_mm_cvtsepi64_epi8
VPMOVUSQB (V5...
_mm_cvtusepi64_epi8
CVTSI2SD (S2# scalar only
_mm_cvtsi64_sd
VCVTQQ2PD* (V5+DQ...
_mm_cvtepi64_pd
VCVTUQQ2PD* (V5+DQ...
_mm_cvtepu64_pd
CVTSI2SS (S1# scalar only
_mm_cvtsi64_ss
VCVTQQ2PS* (V5+DQ...
_mm_cvtepi64_ps
VCVTUQQ2PS* (V5+DQ...
_mm_cvtepu64_ps
VCVTQQ2PH* (V5+FP16...
_mm_cvtepi64_ph
VCVTUQQ2PH* (V5+FP16...
_mm_cvtepu64_ph
DWORD TIP 3
PMOVSXDQ (S4.1
_mm_cvtepi32_epi64
PMOVZXDQ (S4.1
_mm_cvtepu32_epi64
  PACKSSDW (S2
_mm_packs_epi32
PACKUSDW (S4.1
_mm_packus_epi32
VPMOVDW (V5...
_mm_cvtepi32_epi16
VPMOVSDW (V5...
_mm_cvtsepi32_epi16
VPMOVUSDW (V5...
_mm_cvtusepi32_epi16
VPMOVDB (V5...
_mm_cvtepi32_epi8
VPMOVSDB (V5...
_mm_cvtsepi32_epi8
VPMOVUSDB (V5...
_mm_cvtusepi32_epi8
CVTDQ2PD* (S2
_mm_cvtepi32_pd
VCVTUDQ2PD* (V5...
_mm_cvtepu32_pd
CVTDQ2PS* (S2
_mm_cvtepi32_ps
VCVTUDQ2PS* (V5...
_mm_cvtepu32_ps
VCVTDQ2PH* (V5+FP16...
_mm_cvtepi32_ph
VCVTUDQ2PH* (V5+FP16...
_mm_cvtepu32_ph
WORD PMOVSXWQ (S4.1
_mm_cvtepi16_epi64
PMOVZXWQ (S4.1
_mm_cvtepu16_epi64
TIP 3
PMOVSXWD (S4.1
_mm_cvtepi16_epi32
PMOVZXWD (S4.1
_mm_cvtepu16_epi32
PACKSSWB (S2
_mm_packs_epi16
PACKUSWB (S2
_mm_packus_epi16
VPMOVWB (V5+BW...
_mm_cvtepi16_epi8
VPMOVSWB (V5+BW...
_mm_cvtsepi16_epi8
VPMOVUSWB (V5+BW...
_mm_cvtusepi16_epi8
VCVTW2PH* (V5+FP16...
_mm_cvtepi16_ph
VCVTUW2PH* (V5+FP16...
_mm_cvtepu16_ph
BYTE PMOVSXBQ (S4.1
_mm_cvtepi8_epi64
PMOVZXBQ (S4.1
_mm_cvtepu8_epi64
PMOVSXBD (S4.1
_mm_cvtepi8_epi32
PMOVZXBD (S4.1
_mm_cvtepu8_epi32
TIP 3
PMOVSXBW (S4.1
_mm_cvtepi8_epi16
PMOVZXBW (S4.1
_mm_cvtepu8_epi16
Floating-Point Double CVTSD2SI / CVTTSD2SI (S2# scalar only
_mm_cvtsd_si64 / _mm_cvttsd_si64
VCVTPD2QQ* / VCVTTPD2QQ* (V5+DQ...
_mm_cvtpd_epi64 / _mm_cvttpd_epi64
VCVTPD2UQQ* / VCVTTPD2UQQ* (V5+DQ...
_mm_cvtpd_epu64 / _mm_cvttpd_epu64
the right-hand forms truncate (round toward zero)
CVTPD2DQ* / CVTTPD2DQ* (S2
_mm_cvtpd_epi32 / _mm_cvttpd_epi32
VCVTPD2UDQ* / VCVTTPD2UDQ* (V5...
_mm_cvtpd_epu32 / _mm_cvttpd_epu32
the right-hand forms truncate (round toward zero)
CVTPD2PS* (S2
_mm_cvtpd_ps
VCVTPD2PH* (V5+FP16...
_mm_cvtpd_ph
Single CVTSS2SI / CVTTSS2SI (S1# scalar only
_mm_cvtss_si64 / _mm_cvttss_si64
VCVTPS2QQ* / VCVTTPS2QQ* (V5+DQ...
_mm_cvtps_epi64 / _mm_cvttps_epi64
VCVTPS2UQQ* / VCVTTPS2UQQ* (V5+DQ...
_mm_cvtps_epu64 / _mm_cvttps_epu64
the right-hand forms truncate (round toward zero)
CVTPS2DQ* / CVTTPS2DQ* (S2
_mm_cvtps_epi32 / _mm_cvttps_epi32
VCVTPS2UDQ* / VCVTTPS2UDQ* (V5...
_mm_cvtps_epu32 / _mm_cvttps_epu32
the right-hand forms truncate (round toward zero)
  CVTPS2PD* (S2
_mm_cvtps_pd
VCVTPS2PH (F16C
_mm_cvtps_ph
VCVTPS2PHX* (V5+FP16...
_mm_cvtxps_ph
Half VCVTPH2QQ* / VCVTTPH2QQ* (V5+FP16...
_mm_cvtph_epi64 / _mm_cvttph_epi64
VCVTPH2UQQ* / VCVTTPH2UQQ* (V5+FP16...
_mm_cvtph_epu64 / _mm_cvttph_epu64
the right-hand forms truncate (round toward zero)
VCVTPH2DQ* / VCVTTPH2DQ* (V5+FP16...
_mm_cvtph_epi32 / _mm_cvttph_epi32
VCVTPH2UDQ* / VCVTTPH2UDQ* (V5+FP16...
_mm_cvtph_epu32 / _mm_cvttph_epu32
the right-hand forms truncate (round toward zero)
VCVTPH2W* / VCVTTPH2W* (V5+FP16...
_mm_cvtph_epi16 / _mm_cvttph_epi16
VCVTPH2UW* / VCVTTPH2UW* (V5+FP16...
_mm_cvtph_epu16 / _mm_cvttph_epu16
the right-hand forms truncate (round toward zero)
VCVTPH2PD* (V5+FP16...
_mm_cvtph_pd
VCVTPH2PS (F16C
_mm_cvtph_ps
VCVTPH2PSX* (V5+FP16...
_mm_cvtxph_ps

 

Arithmetic Operations

  Integer Floating-Point
QWORD DWORD WORD BYTE Double Single Half
add PADDQ (S2
_mm_add_epi64
PADDD (S2
_mm_add_epi32
PADDW (S2
_mm_add_epi16
PADDSW (S2
_mm_adds_epi16
PADDUSW (S2
_mm_adds_epu16
PADDB (S2
_mm_add_epi8
PADDSB (S2
_mm_adds_epi8
PADDUSB (S2
_mm_adds_epu8
ADDPD* (S2
_mm_add_pd
ADDPS* (S1
_mm_add_ps
VADDPH* (V5+FP16...
_mm_add_ph
sub PSUBQ (S2
_mm_sub_epi64
PSUBD (S2
_mm_sub_epi32
PSUBW (S2
_mm_sub_epi16
PSUBSW (S2
_mm_subs_epi16
PSUBUSW (S2
_mm_subs_epu16
PSUBB (S2
_mm_sub_epi8
PSUBSB (S2
_mm_subs_epi8
PSUBUSB (S2
_mm_subs_epu8
SUBPD* (S2
_mm_sub_pd
SUBPS* (S1
_mm_sub_ps
VSUBPH* (V5+FP16...
_mm_sub_ph
mul VPMULLQ (V5+DQ...
_mm_mullo_epi64
PMULDQ (S4.1
_mm_mul_epi32
PMULUDQ (S2
_mm_mul_epu32
PMULLD (S4.1
_mm_mullo_epi32
PMULHW (S2
_mm_mulhi_epi16
PMULHUW (S2
_mm_mulhi_epu16
PMULLW (S2
_mm_mullo_epi16
MULPD* (S2
_mm_mul_pd
MULPS* (S1
_mm_mul_ps
VMULPH* (V5+FP16...
_mm_mul_ph
div DIVPD* (S2
_mm_div_pd
DIVPS* (S1
_mm_div_ps
VDIVPH* (V5+FP16...
_mm_div_ph
reciprocal         VRCP14PD* (V5...
_mm_rcp14_pd
RCPPS* (S1
_mm_rcp_ps
VRCP14PS* (V5...
_mm_rcp14_ps
VRCPPH* (V5+FP16...
_mm_rcp_ph
square root         SQRTPD* (S2
_mm_sqrt_pd
SQRTPS* (S1
_mm_sqrt_ps
VSQRTPH* (V5+FP16...
_mm_sqrt_ph
reciprocal of square root         VRSQRT14PD* (V5...
_mm_rsqrt14_pd
RSQRTPS* (S1
_mm_rsqrt_ps
VRSQRT14PS* (V5...
_mm_rsqrt14_ps
VRSQRTPH* (V5+FP16...
_mm_rsqrt_ph
multiply nth power of 2 VSCALEFPD* (V5...
_mm_scalef_pd
VSCALEFPS* (V5...
_mm_scalef_ps
VSCALEFPH* (V5+FP16...
_mm_scalef_ph
max TIP 8
VPMAXSQ (V5...
_mm_max_epi64
VPMAXUQ (V5...
_mm_max_epu64
TIP 8
PMAXSD (S4.1
_mm_max_epi32
PMAXUD (S4.1
_mm_max_epu32
PMAXSW (S2
_mm_max_epi16
PMAXUW (S4.1
_mm_max_epu16
TIP 8
PMAXSB (S4.1
_mm_max_epi8
PMAXUB (S2
_mm_max_epu8
TIP 8
MAXPD* (S2
_mm_max_pd
TIP 8
MAXPS* (S1
_mm_max_ps
VMAXPH* (V5+FP16...
_mm_max_ph
min TIP 8
VPMINSQ (V5...
_mm_min_epi64
VPMINUQ (V5...
_mm_min_epu64
TIP 8
PMINSD (S4.1
_mm_min_epi32
PMINUD (S4.1
_mm_min_epu32
PMINSW (S2
_mm_min_epi16
PMINUW (S4.1
_mm_min_epu16
TIP 8
PMINSB (S4.1
_mm_min_epi8
PMINUB (S2
_mm_min_epu8
TIP 8
MINPD* (S2
_mm_min_pd
TIP 8
MINPS* (S1
_mm_min_ps
VMINPH* (V5+FP16...
_mm_min_ph
average     PAVGW (S2
_mm_avg_epu16
PAVGB (S2
_mm_avg_epu8
     
absolute TIP 4
VPABSQ (V5...
_mm_abs_epi64
TIP 4
PABSD (SS3
_mm_abs_epi32
TIP 4
PABSW (SS3
_mm_abs_epi16
TIP 4
PABSB (SS3
_mm_abs_epi8
TIP 5 TIP 5  
sign operation   PSIGND (SS3
_mm_sign_epi32
PSIGNW (SS3
_mm_sign_epi16
PSIGNB (SS3
_mm_sign_epi8
     
population count VPOPCNTQ (V5+VPOPCNTDQ...
_mm_popcnt_epi64
VPOPCNTD (V5+VPOPCNTDQ...
_mm_popcnt_epi32
VPOPCNTW (V5+BITALG...
_mm_popcnt_epi16
VPOPCNTB (V5+BITALG...
_mm_popcnt_epi8
     
round         ROUNDPD* (S4.1
_mm_round_pd
_mm_floor_pd
_mm_ceil_pd

VRNDSCALEPD* (V5...
_mm_roundscale_pd
ROUNDPS* (S4.1
_mm_round_ps
_mm_floor_ps
_mm_ceil_ps

VRNDSCALEPS* (V5...
_mm_roundscale_ps
VRNDSCALEPH* (V5+FP16...
_mm_roundscale_ph
difference from rounded value         VREDUCEPD* (V5+DQ...
_mm_reduce_pd
VREDUCEPS* (V5+DQ...
_mm_reduce_ps
VREDUCEPH* (V5+FP16...
_mm_reduce_ph
add / sub         ADDSUBPD (S3
_mm_addsub_pd
ADDSUBPS (S3
_mm_addsub_ps
 
horizontal add   PHADDD (SS3
_mm_hadd_epi32
PHADDW (SS3
_mm_hadd_epi16
PHADDSW (SS3
_mm_hadds_epi16
  HADDPD (S3
_mm_hadd_pd
HADDPS (S3
_mm_hadd_ps
 
horizontal sub   PHSUBD (SS3
_mm_hsub_epi32
PHSUBW (SS3
_mm_hsub_epi16
PHSUBSW (SS3
_mm_hsubs_epi16
  HSUBPD (S3
_mm_hsub_pd
HSUBPS (S3
_mm_hsub_ps
 
dot product /
multiply and add
PMADDWD (S2
_mm_madd_epi16
VPDPWSSD (AVX_VNNI
_mm_dpwssd_avx_epi32
VPDPWSSD (V5+VNNI...
_mm_dpwssd_epi32
VPDPWSSDS (AVX_VNNI
_mm_dpwssds_avx_epi32
VPDPWSSDS (V5+VNNI...
_mm_dpwssds_epi32
PMADDUBSW (SS3
_mm_maddubs_epi16
VPDPBUSD (AVX_VNNI
_mm_dpbusd_avx_epi32
VPDPBUSD (V5+VNNI...
_mm_dpbusd_epi32
VPDPBUSDS (AVX_VNNI
_mm_dpbusds_avx_epi32
VPDPBUSDS (V5+VNNI...
_mm_dpbusds_epi32
DPPD (S4.1
_mm_dp_pd
DPPS (S4.1
_mm_dp_ps
 
fused multiply and add / sub         VFMADDxxxPD* (FMA
_mm_fmadd_pd
VFMSUBxxxPD* (FMA
_mm_fmsub_pd
VFMADDSUBxxxPD (FMA
_mm_fmaddsub_pd
VFMSUBADDxxxPD (FMA
_mm_fmsubadd_pd
VFNMADDxxxPD* (FMA
_mm_fnmadd_pd
VFNMSUBxxxPD* (FMA
_mm_fnmsub_pd
xxx=132/213/231
VFMADDxxxPS* (FMA
_mm_fmadd_ps
VFMSUBxxxPS* (FMA
_mm_fmsub_ps
VFMADDSUBxxxPS (FMA
_mm_fmaddsub_ps
VFMSUBADDxxxPS (FMA
_mm_fmsubadd_ps
VFNMADDxxxPS* (FMA
_mm_fnmadd_ps
VFNMSUBxxxPS* (FMA
_mm_fnmsub_ps
xxx=132/213/231
VFMADDxxxPH* (V5+FP16...
_mm_fmadd_ph
VFMSUBxxxPH* (V5+FP16...
_mm_fmsub_ph
VFMADDSUBxxxPH (V5+FP16...
_mm_fmaddsub_ph
VFMSUBADDxxxPH (V5+FP16...
_mm_fmsubadd_ph
VFNMADDxxxPH* (V5+FP16...
_mm_fnmadd_ph
VFNMSUBxxxPH* (V5+FP16...
_mm_fnmsub_ph
xxx=132/213/231

 

Compare

  Integer
QWORD DWORD WORD BYTE
compare for == PCMPEQQ (S4.1
_mm_cmpeq_epi64
_mm_cmpeq_epi64_mask (V5...
VPCMPUQ (0) (V5...
_mm_cmpeq_epu64_mask
PCMPEQD (S2
_mm_cmpeq_epi32
_mm_cmpeq_epi32_mask (V5...
VPCMPUD (0) (V5...
_mm_cmpeq_epu32_mask
PCMPEQW (S2
_mm_cmpeq_epi16
_mm_cmpeq_epi16_mask (V5+BW...
VPCMPUW (0) (V5+BW...
_mm_cmpeq_epu16_mask
PCMPEQB (S2
_mm_cmpeq_epi8
_mm_cmpeq_epi8_mask (V5+BW...
VPCMPUB (0) (V5+BW...
_mm_cmpeq_epu8_mask
compare for < VPCMPQ (1) (V5...
_mm_cmplt_epi64_mask
VPCMPUQ (1) (V5...
_mm_cmplt_epu64_mask
VPCMPD (1) (V5...
_mm_cmplt_epi32_mask
VPCMPUD (1) (V5...
_mm_cmplt_epu32_mask
VPCMPW (1) (V5+BW...
_mm_cmplt_epi16_mask
VPCMPUW (1) (V5+BW...
_mm_cmplt_epu16_mask
VPCMPB (1) (V5+BW...
_mm_cmplt_epi8_mask
VPCMPUB (1) (V5+BW...
_mm_cmplt_epu8_mask
compare for <= VPCMPQ (2) (V5...
_mm_cmple_epi64_mask
VPCMPUQ (2) (V5...
_mm_cmple_epu64_mask
VPCMPD (2) (V5...
_mm_cmple_epi32_mask
VPCMPUD (2) (V5...
_mm_cmple_epu32_mask
VPCMPW (2) (V5+BW...
_mm_cmple_epi16_mask
VPCMPUW (2) (V5+BW...
_mm_cmple_epu16_mask
VPCMPB (2) (V5+BW...
_mm_cmple_epi8_mask
VPCMPUB (2) (V5+BW...
_mm_cmple_epu8_mask
compare for > PCMPGTQ (S4.2
_mm_cmpgt_epi64
VPCMPQ (6) (V5...
_mm_cmpgt_epi64_mask
VPCMPUQ (6) (V5...
_mm_cmpgt_epu64_mask
PCMPGTD (S2
_mm_cmpgt_epi32
VPCMPD (6) (V5...
_mm_cmpgt_epi32_mask
VPCMPUD (6) (V5...
_mm_cmpgt_epu32_mask
PCMPGTW (S2
_mm_cmpgt_epi16
VPCMPW (6) (V5+BW...
_mm_cmpgt_epi16_mask
VPCMPUW (6) (V5+BW...
_mm_cmpgt_epu16_mask
PCMPGTB (S2
_mm_cmpgt_epi8
VPCMPB (6) (V5+BW...
_mm_cmpgt_epi8_mask
VPCMPUB (6) (V5+BW...
_mm_cmpgt_epu8_mask
compare for >= VPCMPQ (5) (V5...
_mm_cmpge_epi64_mask
VPCMPUQ (5) (V5...
_mm_cmpge_epu64_mask
VPCMPD (5) (V5...
_mm_cmpge_epi32_mask
VPCMPUD (5) (V5...
_mm_cmpge_epu32_mask
VPCMPW (5) (V5+BW...
_mm_cmpge_epi16_mask
VPCMPUW (5) (V5+BW...
_mm_cmpge_epu16_mask
VPCMPB (5) (V5+BW...
_mm_cmpge_epi8_mask
VPCMPUB (5) (V5+BW...
_mm_cmpge_epu8_mask
compare for != VPCMPQ (4) (V5...
_mm_cmpneq_epi64_mask
VPCMPUQ (4) (V5...
_mm_cmpneq_epu64_mask
VPCMPD (4) (V5...
_mm_cmpneq_epi32_mask
VPCMPUD (4) (V5...
_mm_cmpneq_epu32_mask
VPCMPW (4) (V5+BW...
_mm_cmpneq_epi16_mask
VPCMPUW (4) (V5+BW...
_mm_cmpneq_epu16_mask
VPCMPB (4) (V5+BW...
_mm_cmpneq_epi8_mask
VPCMPUB (4) (V5+BW...
_mm_cmpneq_epu8_mask

 

Floating-Point
Double Single Half
when either (or both) is NaN condition unmet condition met condition unmet condition met
Exception on QNaN YES NO YES NO YES NO YES NO  
compare for == VCMPEQ_OSPD* (V1
_mm_cmp_pd
CMPEQPD* (S2
_mm_cmpeq_pd
VCMPEQ_USPD* (V1
_mm_cmp_pd
VCMPEQ_UQPD* (V1
_mm_cmp_pd
VCMPEQ_OSPS* (V1
_mm_cmp_ps
CMPEQPS* (S1
_mm_cmpeq_ps
VCMPEQ_USPS* (V1
_mm_cmp_ps
VCMPEQ_UQPS* (V1
_mm_cmp_ps
VCMPPH* (V5+FP16...
_mm_cmp_ph
compare for < CMPLTPD* (S2
_mm_cmplt_pd
VCMPLT_OQPD* (V1
_mm_cmp_pd
    CMPLTPS* (S1
_mm_cmplt_ps
VCMPLT_OQPS* (V1
_mm_cmp_ps
   
compare for <= CMPLEPD* (S2
_mm_cmple_pd
VCMPLE_OQPD* (V1
_mm_cmp_pd
CMPLEPS* (S1
_mm_cmple_ps
VCMPLE_OQPS* (V1
_mm_cmp_ps
compare for > VCMPGTPD* (V1
_mm_cmpgt_pd (S2
VCMPGT_OQPD* (V1
_mm_cmp_pd
    VCMPGTPS* (V1
_mm_cmpgt_ps (S1
VCMPGT_OQPS* (V1
_mm_cmp_ps
   
compare for >= VCMPGEPD* (V1
_mm_cmpge_pd (S2
VCMPGE_OQPD* (V1
_mm_cmp_pd
    VCMPGEPS* (V1
_mm_cmpge_ps (S1
VCMPGE_OQPS* (V1
_mm_cmp_ps
   
compare for != VCMPNEQ_OSPD* (V1
_mm_cmp_pd
VCMPNEQ_OQPD* (V1
_mm_cmp_pd
VCMPNEQ_USPD* (V1
_mm_cmp_pd
CMPNEQPD* (S2
_mm_cmpneq_pd
VCMPNEQ_OSPS* (V1
_mm_cmp_ps
VCMPNEQ_OQPS* (V1
_mm_cmp_ps
VCMPNEQ_USPS* (V1
_mm_cmp_ps
CMPNEQPS* (S1
_mm_cmpneq_ps
compare for ! < CMPNLTPD* (S2
_mm_cmpnlt_pd
VCMPNLT_UQPD* (V1
_mm_cmp_pd
CMPNLTPS* (S1
_mm_cmpnlt_ps
VCMPNLT_UQPS* (V1
_mm_cmp_ps
compare for ! <=     CMPNLEPD* (S2
_mm_cmpnle_pd
VCMPNLE_UQPD* (V1
_mm_cmp_pd
    CMPNLEPS* (S1
_mm_cmpnle_ps
VCMPNLE_UQPS* (V1
_mm_cmp_ps
compare for ! > VCMPNGTPD* (V1
_mm_cmpngt_pd (S2
VCMPNGT_UQPD* (V1
_mm_cmp_pd
VCMPNGTPS* (V1
_mm_cmpngt_ps (S1
VCMPNGT_UQPS* (V1
_mm_cmp_ps
compare for ! >=     VCMPNGEPD* (V1
_mm_cmpnge_pd (S2
VCMPNGE_UQPD* (V1
_mm_cmp_pd
    VCMPNGEPS* (V1
_mm_cmpnge_ps (S1
VCMPNGE_UQPS* (V1
_mm_cmp_ps
compare for ordered VCMPORD_SPD* (V1
_mm_cmp_pd
CMPORDPD* (S2
_mm_cmpord_pd
VCMPORD_SPS* (V1
_mm_cmp_ps
CMPORDPS* (S1
_mm_cmpord_ps
compare for unordered     VCMPUNORD_SPD* (V1
_mm_cmp_pd
CMPUNORDPD* (S2
_mm_cmpunord_pd
    VCMPUNORD_SPS* (V1
_mm_cmp_ps
CMPUNORDPS* (S1
_mm_cmpunord_ps
TRUE VCMPTRUE_USPD* (V1
_mm_cmp_pd
VCMPTRUEPD* (V1
_mm_cmp_pd
VCMPTRUE_USPS* (V1
_mm_cmp_ps
VCMPTRUEPS* (V1
_mm_cmp_ps
FALSE VCMPFALSE_OSPD* (V1
_mm_cmp_pd
VCMPFALSEPD* (V1
_mm_cmp_pd
    VCMPFALSE_OSPS* (V1
_mm_cmp_ps
VCMPFALSEPS* (V1
_mm_cmp_ps
   

 

  Floating-Point
Double Single Half
compare scalar values
to set flag register
COMISD (S2
_mm_comieq_sd
_mm_comilt_sd
_mm_comile_sd
_mm_comigt_sd
_mm_comige_sd
_mm_comineq_sd

UCOMISD (S2
_mm_ucomieq_sd
_mm_ucomilt_sd
_mm_ucomile_sd
_mm_ucomigt_sd
_mm_ucomige_sd
_mm_ucomineq_sd
COMISS (S1
_mm_comieq_ss
_mm_comilt_ss
_mm_comile_ss
_mm_comigt_ss
_mm_comige_ss
_mm_comineq_ss

UCOMISS (S1
_mm_ucomieq_ss
_mm_ucomilt_ss
_mm_ucomile_ss
_mm_ucomigt_ss
_mm_ucomige_ss
_mm_ucomineq_ss
VCOMISH (V5+FP16
_mm_comieq_sh
_mm_comilt_sh
_mm_comile_sh
_mm_comigt_sh
_mm_comige_sh
_mm_comineq_sh

VUCOMISH (V5+FP16
_mm_ucomieq_sh
_mm_ucomilt_sh
_mm_ucomile_sh
_mm_ucomigt_sh
_mm_ucomige_sh
_mm_ucomineq_sh

 

Bitwise Logical Operations

  Integer Floating-Point
QWORD DWORD WORD BYTE Double Single Half
and PAND (S2
_mm_and_si128
ANDPD (S2
_mm_and_pd
ANDPS (S1
_mm_and_ps
 
VPANDQ (V5...
_mm512_and_epi64
etc
VPANDD (V5...
_mm512_and_epi32
etc
and not PANDN (S2
_mm_andnot_si128
ANDNPD (S2
_mm_andnot_pd
ANDNPS (S1
_mm_andnot_ps
 
VPANDNQ (V5...
_mm512_andnot_epi64
etc
VPANDND (V5...
_mm512_andnot_epi32
etc
or POR (S2
_mm_or_si128
ORPD (S2
_mm_or_pd
ORPS (S1
_mm_or_ps
 
VPORQ (V5...
_mm512_or_epi64
etc
VPORD (V5...
_mm512_or_epi32
etc
xor PXOR (S2
_mm_xor_si128
XORPD (S2
_mm_xor_pd
XORPS (S1
_mm_xor_ps
VPXORQ (V5...
_mm512_xor_epi64
etc
VPXORD (V5...
_mm512_xor_epi32
etc
test PTEST (S4.1
_mm_testz_si128
_mm_testc_si128
_mm_testnzc_si128
VTESTPD (V1
_mm_testz_pd
_mm_testc_pd
_mm_testnzc_pd
VTESTPS (V1
_mm_testz_ps
_mm_testc_ps
_mm_testnzc_ps
 
VPTESTMQ (V5...
_mm_test_epi64_mask
VPTESTNMQ (V5...
_mm_testn_epi64_mask
VPTESTMD (V5...
_mm_test_epi32_mask
VPTESTNMD (V5...
_mm_testn_epi32_mask
VPTESTMW (V5+BW...
_mm_test_epi16_mask
VPTESTNMW (V5+BW...
_mm_testn_epi16_mask
VPTESTMB (V5+BW...
_mm_test_epi8_mask
VPTESTNMB (V5+BW...
_mm_testn_epi8_mask
ternary operation VPTERNLOGQ (V5...
_mm_ternarylogic_epi64
VPTERNLOGD (V5...
_mm_ternarylogic_epi32

 

Bit Shift / Rotate

  Integer
QWORD DWORD WORD BYTE
shift left logical PSLLQ (S2
_mm_slli_epi64
_mm_sll_epi64
PSLLD (S2
_mm_slli_epi32
_mm_sll_epi32
PSLLW (S2
_mm_slli_epi16
_mm_sll_epi16
 
VPSLLVQ (V2
_mm_sllv_epi64
VPSLLVD (V2
_mm_sllv_epi32
VPSLLVW (V5+BW...
_mm_sllv_epi16
 
shift right logical PSRLQ (S2
_mm_srli_epi64
_mm_srl_epi64
PSRLD (S2
_mm_srli_epi32
_mm_srl_epi32
PSRLW (S2
_mm_srli_epi16
_mm_srl_epi16
 
VPSRLVQ (V2
_mm_srlv_epi64
VPSRLVD (V2
_mm_srlv_epi32
VPSRLVW (V5+BW...
_mm_srlv_epi16
 
shift right arithmetic VPSRAQ (V5...
_mm_srai_epi64
_mm_sra_epi64
PSRAD (S2
_mm_srai_epi32
_mm_sra_epi32
PSRAW (S2
_mm_srai_epi16
_mm_sra_epi16
 
VPSRAVQ (V5...
_mm_srav_epi64
VPSRAVD (V2
_mm_srav_epi32
VPSRAVW (V5+BW...
_mm_srav_epi16
 
rotate left VPROLQ (V5...
_mm_rol_epi64
VPROLD (V5...
_mm_rol_epi32
VPROLVQ (V5...
_mm_rolv_epi64
VPROLVD (V5...
_mm_rolv_epi32
rotate right VPRORQ (V5...
_mm_ror_epi64
VPRORD (V5...
_mm_ror_epi32
VPRORVQ (V5...
_mm_rorv_epi64
VPRORVD (V5...
_mm_rorv_epi32
shift left logical double VPSHLDQ (V5+VBMI2...
_mm_shldi_epi64
VPSHLDD (V5+VBMI2...
_mm_shldi_epi32
VPSHLDW (V5+VBMI2...
_mm_shldi_epi16
VPSHLDVQ (V5+VBMI2...
_mm_shldv_epi64
VPSHLDVD (V5+VBMI2...
_mm_shldv_epi32
VPSHLDVW (V5+VBMI2...
_mm_shldv_epi16
shift right logical double VPSHRDQ (V5+VBMI2...
_mm_shrdi_epi64
VPSHRDD (V5+VBMI2...
_mm_shrdi_epi32
VPSHRDW (V5+VBMI2...
_mm_shrdi_epi16
VPSHRDVQ (V5+VBMI2...
_mm_shrdv_epi64
VPSHRDVD (V5+VBMI2...
_mm_shrdv_epi32
VPSHRDVW (V5+VBMI2...
_mm_shrdv_epi16

 

Byte Shift

128-bit
shift left logical PSLLDQ (S2
_mm_slli_si128
shift right logical PSRLDQ (S2
_mm_srli_si128
packed align right PALIGNR (SS3
_mm_alignr_epi8

 

Compare Strings

explicit length implicit length
return index PCMPESTRI (S4.2
_mm_cmpestri
_mm_cmpestra
_mm_cmpestrc
_mm_cmpestro
_mm_cmpestrs
_mm_cmpestrz
PCMPISTRI (S4.2
_mm_cmpistri
_mm_cmpistra
_mm_cmpistrc
_mm_cmpistro
_mm_cmpistrs
_mm_cmpistrz
return mask PCMPESTRM (S4.2
_mm_cmpestrm
_mm_cmpestra
_mm_cmpestrc
_mm_cmpestro
_mm_cmpestrs
_mm_cmpestrz
PCMPISTRM (S4.2
_mm_cmpistrm
_mm_cmpistra
_mm_cmpistrc
_mm_cmpistro
_mm_cmpistrs
_mm_cmpistrz

 

Others

LDMXCSR (S1
_mm_setcsr
Load MXCSR register
STMXCSR (S1
_mm_getcsr
Save MXCSR register state

PSADBW (S2
_mm_sad_epu8
Compute sum of absolute differences
MPSADBW (S4.1
_mm_mpsadbw_epu8
Performs eight 4-byte wide Sum of Absolute Differences operations to produce eight word integers.
VDBPSADBW (V5+BW...
_mm_dbsad_epu8
Double Block Packed Sum-Absolute-Differences (SAD) on Unsigned Bytes

PMULHRSW (SS3
_mm_mulhrs_epi16
Packed Multiply High with Round and Scale

PHMINPOSUW (S4.1
_mm_minpos_epu16
Finds the value and location of the minimum unsigned word from one of 8 horizontally packed unsigned words. The resulting value and location (offset within the source) are packed into the low dword of the destination XMM register.

VPCONFLICTQ (V5+CD...
_mm512_conflict_epi64
VPCONFLICTD (V5+CD...
_mm512_conflict_epi32
Detect Conflicts Within a Vector of Packed Dword/Qword Values into Dense Memory/Register

VP2INTERSECTQ (V5+VP2INTERSECT...
_mm512_2intersect_epi64
VP2INTERSECTD (V5+VP2INTERSECT...
_mm512_2intersect_epi32
Compute Intersection Between DWORDS/QUADWORDS to a Pair of Mask Registers

VPLZCNTQ (V5+CD...
_mm_lzcnt_epi64
VPLZCNTD (V5+CD...
_mm_lzcnt_epi32
Count the Number of Leading Zero Bits for Packed Dword, Packed Qword Values

VFIXUPIMMPD* (V5...
_mm512_fixupimm_pd
VFIXUPIMMPS* (V5...
_mm512_fixupimm_ps
Fix Up Special Packed Float64/32 Values
VFPCLASSPD* (V5...
_mm512_fpclass_pd_mask
VFPCLASSPS* (V5...
_mm512_fpclass_ps_mask
VFPCLASSPH* (V5+FP16...
_mm512_fpclass_ph_mask
Tests Types Of a Packed Float64/32 Values
VRANGEPD* (V5+DQ...
_mm_range_pd
VRANGEPS* (V5+DQ...
_mm_range_ps
Range Restriction Calculation For Packed Pairs of Float64/32 Values
VGETEXPPD* (V5...
_mm512_getexp_pd
VGETEXPPS* (V5...
_mm512_getexp_ps
VGETEXPPH* (V5+FP16...
_mm512_getexp_ph
Convert Exponents of Packed DP/SP FP Values to FP Values
VGETMANTPD* (V5...
_mm512_getmant_pd
VGETMANTPS* (V5...
_mm512_getmant_ps
VGETMANTPH* (V5+FP16...
_mm512_getmant_ph
Extract Float64/32 Vector of Normalized Mantissas from Float64/32 Vector

AESDEC (AESNI
_mm_aesdec_si128
Perform an AES decryption round using an 128-bit state and a round key
AESDECLAST (AESNI
_mm_aesdeclast_si128
Perform the last AES decryption round using an 128-bit state and a round key
AESENC (AESNI
_mm_aesenc_si128
Perform an AES encryption round using an 128-bit state and a round key
AESENCLAST (AESNI
_mm_aesenclast_si128
Perform the last AES encryption round using an 128-bit state and a round key
AESIMC (AESNI
_mm_aesimc_si128
Perform an inverse mix column transformation primitive
AESKEYGENASSIST (AESNI
_mm_aeskeygenassist_si128
Assist the creation of round keys with a key expansion schedule
PCLMULQDQ (PCLMULQDQ
_mm_clmulepi64_si128
Perform carryless multiplication of two 64-bit numbers

SHA1RNDS4 (SHA
_mm_sha1rnds4_epu32
Perform Four Rounds of SHA1 Operation
SHA1NEXTE (SHA
_mm_sha1nexte_epu32
Calculate SHA1 State Variable E after Four Rounds
SHA1MSG1 (SHA
_mm_sha1msg1_epu32
Perform an Intermediate Calculation for the Next Four SHA1 Message Dwords
SHA1MSG2 (SHA
_mm_sha1msg2_epu32
Perform a Final Calculation for the Next Four SHA1 Message Dwords
SHA256RNDS2 (SHA
_mm_sha256rnds2_epu32
Perform Two Rounds of SHA256 Operation
SHA256MSG1 (SHA
_mm_sha256msg1_epu32
Perform an Intermediate Calculation for the Next Four SHA256 Message
SHA256MSG2 (SHA
_mm_sha256msg2_epu32
Perform a Final Calculation for the Next Four SHA256 Message Dwords

GF2P8AFFINEQB (GFNI...
_mm_gf2p8affine_epi64_epi8
Galois Field Affine Transformation
GF2P8AFFINEINVQB (GFNI...
_mm_gf2p8affineinv_epi64_epi8
Galois Field Affine Transformation Inverse
GF2P8MULB (GFNI...
_mm_gf2p8mul_epi8
Galois Field Multiply Bytes

VFMULCPH* (V5+FP16...
_mm_fmul_pch
_mm_mul_pch
VFCMULCPH* (V5+FP16...
_mm_fcmul_pch
_mm_cmul_pch
Complex Multiply FP16 Values
VFMADDCPH* (V5+FP16...
_mm_fmadd_pch
VFCMADDCPH* (V5+FP16...
_mm_fcmadd_pch
Complex Multiply and Accumulate FP16 Values

VCVTNE2PS2BF16 (V5+BF16...
_mm_cvtne2ps_pbh
Convert Two Packed Single Data to One Packed BF16 Data
VCVTNEPS2BF16 (V5+BF16...
_mm_cvtneps_pbh
Convert Packed Single Data to Packed BF16 Data
VDPBF16PS (V5+BF16...
_mm_dpbf16_ps
Dot Product of BF16 Pairs Accumulated Into Packed Single Precision

VPMADD52HUQ (V5+IFMA...
_mm_madd52hi_epu64
Packed Multiply of Unsigned 52-Bit Integers and Add the High 52-Bit Products to Qword Accumulators
VPMADD52LUQ (V5+IFMA...
_mm_madd52lo_epu64
Packed Multiply of Unsigned 52-Bit Integers and Add the Low 52-Bit Products to Qword Accumulators

VPMULTISHIFTQB (V5+VBMI...
_mm_multishift_epi64_epi8
Select Packed Unaligned Bytes From Quadword Sources
VPSHUFBITQMB (V5+BITALG...
_mm_bitshuffle_epi64_mask
Shuffle Bits From Quadword Elements Using Byte Indexes Into Mask

VPBROADCASTMB2Q (V5+CD...
_mm_broadcastmb_epi64
VPBROADCASTMW2D (V5+CD...
_mm_broadcastmw_epi32
Broadcast Mask to Vector Register

VZEROALL (V1
_mm256_zeroall
Zero all YMM registers
VZEROUPPER (V1
_mm256_zeroupper
Zero upper 128 bits of all YMM registers

MOVNTPS (S1
_mm_stream_ps
Non-temporal store of four packed single-precision floating-point values from an XMM register into memory
MASKMOVDQU (S2
_mm_maskmoveu_si128
Non-temporal store of selected bytes from an XMM register into memory
MOVNTPD (S2
_mm_stream_pd
Non-temporal store of two packed double-precision floating-point values from an XMM register into memory
MOVNTDQ (S2
_mm_stream_si128
Non-temporal store of double quadword from an XMM register into memory
LDDQU (S3
_mm_lddqu_si128
Special 128-bit unaligned load designed to avoid cache line splits
MOVNTDQA (S4.1
_mm_stream_load_si128
Provides a non-temporal hint that can cause adjacent 16-byte items within an aligned 64-byte region (a streaming line) to be fetched and held in a small set of temporary buffers ("streaming load buffers"). Subsequent streaming loads to other aligned 16-byte items in the same streaming line may be supplied from the streaming load buffer and can improve throughput.

LDTILECFG (AMX-TILE#
_tile_loadconfig
Load Tile Configuration
STTILECFG (AMX-TILE#
_tile_storeconfig
Store Tile Configuration
TILELOADD (AMX-TILE#
_tile_loadd
TILELOADDT1 (AMX-TILE#
_tile_stream_loadd
Load Tile Data
TILESTORED (AMX-TILE#
_tile_stored
Store Tile Data
TDPBF16PS (AMX-BF16#
_tile_dpbf16ps
Dot Product of BF16 Tiles Accumulated into Packed Single Precision Tile
TDPBSSD (AMX-INT8#
_tile_dpbssd
TDPBSUD (AMX-INT8#
_tile_dpbsud
TDPBUSD (AMX-INT8#
_tile_dpbusd
TDPBUUD (AMX-INT8#
_tile_dpbuud
Dot Product of Signed/Unsigned Bytes with Dword Accumulation
TILEZERO (AMX-TILE#
_tile_zero
Zero Tile
TILERELEASE (AMX-TILE#
_tile_release
Release Tile

 

TIPS

TIP 1: Zero Clear

XOR works for both Integer and Floating-point.

Example: Zero all of 2 QWORDS / 4 DWORDS / 8 WORDS / 16 BYTES in XMM1

        pxor         xmm1, xmm1

Example: Set 0.0f to 4 floats in XMM1

        xorps        xmm1, xmm1

Example: Set 0.0 to 2 doubles in XMM1

        xorpd        xmm1, xmm1
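
Example (intrinsics, a minimal sketch): the corresponding _mm_setzero intrinsics; compilers typically emit the XOR idioms shown above.

    __m128i izero = _mm_setzero_si128();   // pxor  xmm, xmm
    __m128  fzero = _mm_setzero_ps();      // xorps xmm, xmm
    __m128d dzero = _mm_setzero_pd();      // xorpd xmm, xmm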

 

TIP 2: Copy the first element to the other elements in an XMM register

Use a shuffle instruction.

Example: Copy the lowest float element to the other 3 elements in XMM1.

        shufps       xmm1, xmm1, 0

Example: Copy the lowest WORD element to the other 7 elements in XMM1

        pshuflw       xmm1, xmm1, 0
        pshufd        xmm1, xmm1, 0

Example: Copy the lower QWORD element to the upper element in XMM1

        pshufd        xmm1, xmm1, 44h     ; 01 00 01 00 B = 44h

Is this better?

        punpcklqdq    xmm1, xmm1
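
Example (intrinsics, a minimal sketch with illustrative values): the same broadcasts as above. _mm_set1_ps / _mm_set1_epi16 usually generate equivalent shuffles when the value is already in a register.

    __m128 f4 = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);      // element 0 = 1.0f
    f4 = _mm_shuffle_ps(f4, f4, 0);                       // shufps xmm, xmm, 0 : all four = 1.0f

    __m128i w8 = _mm_set_epi16(7, 6, 5, 4, 3, 2, 1, 0);   // element 0 = 0
    w8 = _mm_shufflelo_epi16(w8, 0);                      // pshuflw : lower 4 WORDS = WORD 0
    w8 = _mm_shuffle_epi32(w8, 0);                        // pshufd  : all 8 WORDS = WORD 0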

 

TIP 3: Integer Sign Extension / Zero Extension

Use an unpack instruction.

Example: Zero extend 8 WORDS in XMM1 to DWORDS in XMM1 (lower 4) and XMM2 (upper 4).

        movdqa     xmm2, xmm1     ; src data WORD[7] [6] [5] [4] [3] [2] [1] [0]
        pxor       xmm3, xmm3     ; upper 16-bit to attach to each WORD = all 0
        punpcklwd  xmm1, xmm3     ; lower 4 DWORDS:  0 [3] 0 [2] 0 [1] 0 [0] 
        punpckhwd  xmm2, xmm3     ; upper 4 DWORDS:  0 [7] 0 [6] 0 [5] 0 [4]

Example: Sign extend 16 BYTES in XMM1 to WORDS in XMM1 (lower 8) and XMM2 (upper 8).

        pxor       xmm3, xmm3
        movdqa     xmm2, xmm1
        pcmpgtb    xmm3, xmm1     ; upper 8-bit to attach to each BYTE = src >= 0 ? 0 : -1
        punpcklbw  xmm1, xmm3     ; lower 8 WORDS
        punpckhbw  xmm2, xmm3     ; upper 8 WORDS

Example (intrinsics): Sign extend 8 WORDS in __m128i variable words8 to DWORDS in dwords4lo (lower 4) and dwords4hi (upper 4)

    const __m128i izero = _mm_setzero_si128();
    __m128i words8hi = _mm_cmpgt_epi16(izero, words8);
    __m128i dwords4lo = _mm_unpacklo_epi16(words8, words8hi);
    __m128i dwords4hi = _mm_unpackhi_epi16(words8, words8hi);

 

TIP 4: Absolute Values of Integers

If an integer value is positive or zero, no action is required. Otherwise complement and add 1.

Example: Set absolute values of 8 signed WORDS in XMM1 to XMM1

                                  ; if src is positive or 0; if src is negative
        pxor      xmm2, xmm2      
        pcmpgtw   xmm2, xmm1      ; xmm2 <- 0              ; xmm2 <- -1
        pxor      xmm1, xmm2      ; xor with 0(do nothing) ; xor with -1(complement all bits)
        psubw     xmm1, xmm2      ; subtract 0(do nothing) ; subtract -1(add 1)

Example (intrinsics): Set absolute values of 4 DWORDS in __m128i variable dwords4 to dwords4

    const __m128i izero = _mm_setzero_si128();
    __m128i tmp = _mm_cmpgt_epi32(izero, dwords4);
    dwords4 = _mm_xor_si128(dwords4, tmp);
    dwords4 = _mm_sub_epi32(dwords4, tmp);

 

TIP 5: Absolute Values of Floating-Point Values

Floating-point values are not stored in two's complement, so simply clearing the sign bit (the highest bit) yields the absolute value.

Example: Set absolute values of 4 floats in XMM1 to XMM1

; data
              align   16
signoffmask   dd      4 dup (7fffffffH)       ; mask for clearing the highest bit
        
; code
        andps   xmm1, xmmword ptr signoffmask        

Example (intrinsics): Set absolute values of 4 floats in __m128 variable floats4 to floats4

        const __m128 signmask = _mm_set1_ps(-0.0f); // 0x80000000

        floats4 = _mm_andnot_ps(signmask, floats4);

 

TIP 6: Lacking some integer MUL instructions?

Signed vs. unsigned makes a difference only when computing the upper part of the product. For the lower part, the same instruction can be used for both signed and unsigned.

unsigned WORD * unsigned WORD -> Upper WORD: PMULHUW, Lower WORD: PMULLW

signed WORD * signed WORD -> Upper WORD: PMULHW, Lower WORD: PMULLW
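
Example (intrinsics, a sketch assuming a and b are __m128i variables holding 8 unsigned WORDs each): a full 16-bit x 16-bit -> 32-bit unsigned multiply built from PMULLW and PMULHUW.

    __m128i lo = _mm_mullo_epi16(a, b);            // PMULLW : low 16 bits of each product
    __m128i hi = _mm_mulhi_epu16(a, b);            // PMULHUW: high 16 bits of each product
    __m128i prod03 = _mm_unpacklo_epi16(lo, hi);   // 32-bit products of elements 0..3
    __m128i prod47 = _mm_unpackhi_epi16(lo, hi);   // 32-bit products of elements 4..7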

 

TIP 8: max / min

Use bitwise operations on the mask produced by a compare instruction.

Example: Compare each signed DWORD in XMM1 and XMM2 and set smaller one to XMM1

; A=xmm1  B=xmm2                    ; if A>B        ; if A<=B
        movdqa      xmm0, xmm1
        pcmpgtd     xmm1, xmm2      ; xmm1=-1       ; xmm1=0
        pand        xmm2, xmm1      ; xmm2=B        ; xmm2=0
        pandn       xmm1, xmm0      ; xmm1=0        ; xmm1=A
        por         xmm1, xmm2      ; xmm1=B        ; xmm1=A

Example (intrinsics): Compare each signed byte in __m128i variables a, b and set larger one to maxAB

    __m128i mask = _mm_cmpgt_epi8(a, b);
    __m128i selectedA = _mm_and_si128(mask, a);
    __m128i selectedB = _mm_andnot_si128(mask, b);
    __m128i maxAB = _mm_or_si128(selectedA, selectedB);

 

TIP 10: Set all bits

Use PCMPEQx.

Example: set -1 to all of the 2 QWORDS / 4 DWORDS / 8 WORDS / 16 BYTES in XMM1.

        pcmpeqb         xmm1, xmm1
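
Example (intrinsics, a minimal sketch): comparing a register with itself sets every bit, whatever it contained.

    __m128i t = _mm_setzero_si128();           // any value works; zero is used only to initialize t
    __m128i allones = _mm_cmpeq_epi8(t, t);    // pcmpeqb : all 16 BYTES become -1 (every bit set)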

 

TIP 11: Integer Division by Constant

Division by a constant can be done quickly with an integer multiplication plus a few adjustment operations.

Example: divide each unsigned WORD in __m128i by 7.

// unsigned WORD integer division by constant 7 (SSE2)
// __m128i dividends 8 unsigned WORDs
__m128i t = _mm_mulhi_epu16(_mm_set1_epi16(9363), dividends);
t = _mm_add_epi16(t, _mm_srli_epi16(_mm_sub_epi16(dividends, t), 1));
__m128i quotients = _mm_srli_epi16(t, 2);

For each divisor constant, a magic number to multiply by is required. Detailed information about this algorithm, including how to calculate the magic number, can be found in the book "Hacker's Delight" by Henry S. Warren Jr.

Following the book, I made a code generator for SSE/AVX intrinsics.

 


ver 2023092500