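Instructions for the MMX registers (64-bit) are omitted.
S1=SSE S2=SSE2 S3=SSE3 SS3=SSSE3 S4.1=SSE4.1 S4.2=SSE4.2 V1=AVX V2=AVX2 V5=AVX512 #=64-bit mode only
For instructions marked *, replacing PS/PD/DQ with SS/SD/SI gives the scalar instruction (only the single lowest data element is computed).
The names listed after each instruction are the corresponding C/C++ intrinsics.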
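This table was made so that a half-remembered instruction name can be found and then looked up in the manual. When programming, do not rely on the contents of this table; always check the manual.
Intel manuals → http://www.intel.co.jp/content/www/jp/ja/processors/architectures-software-developer-manuals.html
If you notice anything, I would be grateful if you let me know through the feedback form or by email at the address at the end of this page.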
Integer | Floating point | YMM lane (128-bit) | ||||||
---|---|---|---|---|---|---|---|---|
QWORD | DWORD | WORD | BYTE | Double | Single | Half | ||
Whole ○MM ↑↓ ○MM/mem |
MOVDQA (S2 [512-bit: only the numbered instructions in the row below] _mm_load_si128 _mm_store_si128 MOVDQU (S2 _mm_loadu_si128 _mm_storeu_si128 |
MOVAPD (S2 _mm_load_pd _mm_loadr_pd _mm_store_pd _mm_storer_pd MOVUPD (S2 _mm_loadu_pd _mm_storeu_pd |
MOVAPS (S1 _mm_load_ps _mm_loadr_ps _mm_store_ps _mm_storer_ps MOVUPS (S1 _mm_loadu_ps _mm_storeu_ps |
|||||
VMOVDQA64 (V5… _mm_mask_load_epi64 _mm_mask_store_epi64 etc. VMOVDQU64 (V5… _mm_mask_loadu_epi64 _mm_mask_storeu_epi64 etc. |
VMOVDQA32 (V5… _mm_mask_load_epi32 _mm_mask_store_epi32 etc. VMOVDQU32 (V5… _mm_mask_loadu_epi32 _mm_mask_storeu_epi32 etc. |
VMOVDQU16 (V5+BW… _mm_mask_loadu_epi16 _mm_mask_storeu_epi16 etc. |
VMOVDQU8 (V5+BW… _mm_mask_loadu_epi8 _mm_mask_storeu_epi8 etc. |
|||||
XMM upper half ↑↓ mem |
MOVHPD (S2 _mm_loadh_pd _mm_storeh_pd |
MOVHPS (S1 _mm_loadh_pi _mm_storeh_pi |
||||||
XMM upper half ↑↓ XMM lower half |
MOVHLPS (S1 _mm_movehl_ps MOVLHPS (S1 _mm_movelh_ps |
|||||||
XMM lower half ↑↓ mem |
MOVQ (S2 _mm_loadl_epi64 _mm_storel_epi64 |
MOVLPD (S2 _mm_loadl_pd _mm_storel_pd |
MOVLPS (S1 _mm_loadl_pi _mm_storel_pi |
|||||
XMM lowest element ↑↓ r/m |
MOVQ (S2# _mm_cvtsi64_si128 _mm_cvtsi128_si64 |
MOVD (S2 _mm_cvtsi32_si128 _mm_cvtsi128_si32 |
VMOVW (V5+FP16 _mm_cvtsi16_si128 _mm_cvtsi128_si16 |
|||||
XMM lowest element ↑↓ XMM/mem |
MOVQ (S2 _mm_move_epi64 |
MOVSD (S2 _mm_load_sd _mm_store_sd _mm_move_sd |
MOVSS (S1 _mm_load_ss _mm_store_ss _mm_move_ss |
VMOVSH (V5+FP16 _mm_load_sh _mm_store_sh _mm_move_sh |
||||
Whole XMM ↑ one data element |
Tip 2 _mm_set1_epi64x VPBROADCASTQ (V2 _mm_broadcastq_epi64 |
Tip 2 _mm_set1_epi32 VPBROADCASTD (V2 _mm_broadcastd_epi32 |
Tip 2 _mm_set1_epi16 VPBROADCASTW (V2 _mm_broadcastw_epi16 |
_mm_set1_epi8 VPBROADCASTB (V2 _mm_broadcastb_epi8 |
Tip 2 _mm_set1_pd _mm_load1_pd MOVDDUP (S3 _mm_movedup_pd _mm_loaddup_pd |
Tip 2 _mm_set1_ps _mm_load1_ps VBROADCASTSS from mem (V1 from XMM (V2 _mm_broadcast_ss |
||
Whole YMM/ZMM ↑ one data element |
VPBROADCASTQ (V2 _mm256_broadcastq_epi64 |
VPBROADCASTD (V2 _mm256_broadcastd_epi32 |
VPBROADCASTW (V2 _mm256_broadcastw_epi16 |
VPBROADCASTB (V2 _mm256_broadcastb_epi8 |
VBROADCASTSD from mem (V1 from XMM (V2 _mm256_broadcast_sd |
VBROADCASTSS from mem (V1 from XMM (V2 _mm256_broadcast_ss |
VBROADCASTF128 (V1 _mm256_broadcast_ps _mm256_broadcast_pd VBROADCASTI128 (V2 _mm256_broadcastsi128_si256 |
|
Whole YMM/ZMM ↑ 2/4/8 data elements |
VBROADCASTI64X2 (V5+DQ… _mm512_broadcast_i64x2 VBROADCASTI64X4 (V5 _mm512_broadcast_i64x4 |
VBROADCASTI32X2 (V5+DQ… _mm512_broadcast_i32x2 VBROADCASTI32X4 (V5… _mm512_broadcast_i32x4 VBROADCASTI32X8 (V5+DQ _mm512_broadcast_i32x8 |
VBROADCASTF64X2 (V5+DQ… _mm512_broadcast_f64x2 VBROADCASTF64X4 (V5 _mm512_broadcast_f64x4 |
VBROADCASTF32X2 (V5+DQ… _mm512_broadcast_f32x2 VBROADCASTF32X4 (V5… _mm512_broadcast_f32x4 VBROADCASTF32X8 (V5+DQ _mm512_broadcast_f32x8 |
||||
○MM ↑ multiple data elements |
_mm_set_epi64x _mm_setr_epi64x |
_mm_set_epi32 _mm_setr_epi32 |
_mm_set_epi16 _mm_setr_epi16 |
_mm_set_epi8 _mm_setr_epi8 |
_mm_set_pd _mm_setr_pd |
_mm_set_ps _mm_setr_ps |
||
Whole ○MM ↑ zero |
Tip 1 _mm_setzero_si128 |
Tip 1 _mm_setzero_pd |
Tip 1 _mm_setzero_ps |
|||||
extract |
PEXTRQ (S4.1# _mm_extract_epi64 |
PEXTRD (S4.1 _mm_extract_epi32 |
PEXTRW rへ (S2 PEXTRW r/mへ (S4.1 _mm_extract_epi16 |
PEXTRB (S4.1 _mm_extract_epi8 |
→MOVHPD (S2 _mm_loadh_pd _mm_storeh_pd →MOVLPD (S2 _mm_loadl_pd _mm_storel_pd |
EXTRACTPS (S4.1 _mm_extract_ps |
VEXTRACTF128 (V1 _mm256_extractf128_ps _mm256_extractf128_pd _mm256_extractf128_si256 VEXTRACTI128 (V2 _mm256_extracti128_si256 |
|
VEXTRACTI64X2 (V5+DQ… _mm512_extracti64x2_epi64 VEXTRACTI64X4 (V5 _mm512_extracti64x4_epi64 |
VEXTRACTI32X4 (V5… _mm512_extracti32x4_epi32 VEXTRACTI32X8 (V5+DQ _mm512_extracti32x8_epi32 |
VEXTRACTF64X2 (V5+DQ… _mm512_extractf64x2_pd VEXTRACTF64X4 (V5 _mm512_extractf64x4_pd |
VEXTRACTF32X4 (V5… _mm512_extractf32x4_ps VEXTRACTF32X8 (V5+DQ _mm512_extractf32x8_ps |
|||||
insert |
PINSRQ (S4.1# _mm_insert_epi64 |
PINSRD (S4.1 _mm_insert_epi32 |
PINSRW (S2 _mm_insert_epi16 |
PINSRB (S4.1 _mm_insert_epi8 |
→MOVHPD (S2 _mm_loadh_pd _mm_storeh_pd →MOVLPD (S2 _mm_loadl_pd _mm_storel_pd |
INSERTPS (S4.1 _mm_insert_ps |
VINSERTF128 (V1 _mm256_insertf128_ps _mm256_insertf128_pd _mm256_insertf128_si256 VINSERTI128 (V2 _mm256_inserti128_si256 |
|
VINSERTI64X2 (V5+DQ… _mm512_inserti64x2 VINSERTI64X4 (V5… _mm512_inserti64x4 |
VINSERTI32X4 (V5… _mm512_inserti32x4 VINSERTI32X8 (V5+DQ _mm512_inserti32x8 |
VINSERTF64X2 (V5+DQ… _mm512_insertf64x2 VINSERTF64X4 (V5 _mm512_insertf64x4 |
VINSERTF32X4 (V5… _mm512_insertf32x4 VINSERTF32X8 (V5+DQ _mm512_insertf32x8 |
|||||
unpack |
PUNPCKHQDQ (S2 _mm_unpackhi_epi64 PUNPCKLQDQ (S2 _mm_unpacklo_epi64 |
PUNPCKHDQ (S2 _mm_unpackhi_epi32 PUNPCKLDQ (S2 _mm_unpacklo_epi32 |
PUNPCKHWD (S2 _mm_unpackhi_epi16 PUNPCKLWD (S2 _mm_unpacklo_epi16 |
PUNPCKHBW (S2 _mm_unpackhi_epi8 PUNPCKLBW (S2 _mm_unpacklo_epi8 |
UNPCKHPD (S2 _mm_unpackhi_pd UNPCKLPD (S2 _mm_unpacklo_pd |
UNPCKHPS (S1 _mm_unpackhi_ps UNPCKLPS (S1 _mm_unpacklo_ps |
||
shuffle/permute |
VPERMQ (V2 _mm256_permute4x64_epi64 VPERMI2Q (V5… _mm_permutex2var_epi64 VPERMT2Q (V5… _mm_permutex2var_epi64 |
PSHUFD (S2 _mm_shuffle_epi32 VPERMD (V2 _mm256_permutevar8x32_epi32 _mm256_permutexvar_epi32 VPERMI2D (V5… _mm_permutex2var_epi32 VPERMT2D (V5… _mm_permutex2var_epi32 |
PSHUFHW (S2 _mm_shufflehi_epi16 PSHUFLW (S2 _mm_shufflelo_epi16 VPERMW (V5+BW… _mm_permutexvar_epi16 VPERMI2W (V5+BW… _mm_permutex2var_epi16 VPERMT2W (V5+BW… _mm_permutex2var_epi16 |
PSHUFB (SS3 _mm_shuffle_epi8 VPERMB (V5+VBMI… _mm_permutexvar_epi8 VPERMI2B (V5+VBMI… _mm_permutex2var_epi8 VPERMT2B (V5+VBMI… _mm_permutex2var_epi8 |
SHUFPD (S2 _mm_shuffle_pd VPERMILPD (V1 _mm_permute_pd _mm_permutevar_pd VPERMPD (V2 _mm256_permute4x64_pd VPERMI2PD (V5… _mm_permutex2var_pd VPERMT2PD (V5… _mm_permutex2var_pd |
SHUFPS (S1 _mm_shuffle_ps VPERMILPS (V1 _mm_permute_ps _mm_permutevar_ps VPERMPS (V2 _mm256_permutevar8x32_ps VPERMI2PS (V5… _mm_permutex2var_ps VPERMT2PS (V5… _mm_permutex2var_ps |
VPERM2F128 (V1 _mm256_permute2f128_ps _mm256_permute2f128_pd _mm256_permute2f128_si256 VPERM2I128 (V2 _mm256_permute2x128_si256 |
|
VSHUFI64X2 (V5… _mm512_shuffle_i64x2 |
VSHUFI32X4 (V5… _mm512_shuffle_i32x4 |
VSHUFF64X2 (V5… _mm512_shuffle_f64x2 |
VSHUFF32X4 (V5… _mm512_shuffle_f32x4 |
|||||
blend |
VPBLENDMQ (V5… _mm_mask_blend_epi64 |
VPBLENDD (V2 _mm_blend_epi32 VPBLENDMD (V5… _mm_mask_blend_epi32 |
PBLENDW (S4.1 _mm_blend_epi16 VPBLENDMW (V5+BW… _mm_mask_blend_epi16 |
PBLENDVB (S4.1 _mm_blendv_epi8 VPBLENDMB (V5+BW… _mm_mask_blend_epi8 |
BLENDPD (S4.1 _mm_blend_pd BLENDVPD (S4.1 _mm_blendv_pd VBLENDMPD (V5… _mm_mask_blend_pd |
BLENDPS (S4.1 _mm_blend_ps BLENDVPS (S4.1 _mm_blendv_ps VBLENDMPS (V5… _mm_mask_blend_ps |
||
move and duplicate |
MOVDDUP (S3 _mm_movedup_pd _mm_loaddup_pd |
MOVSHDUP (S3 _mm_movehdup_ps MOVSLDUP (S3 _mm_moveldup_ps |
||||||
mask move |
VPMASKMOVQ (V2 _mm_maskload_epi64 _mm_maskstore_epi64 |
VPMASKMOVD (V2 _mm_maskload_epi32 _mm_maskstore_epi32 |
VMASKMOVPD (V1 _mm_maskload_pd _mm_maskstore_pd |
VMASKMOVPS (V1 _mm_maskload_ps _mm_maskstore_ps |
||||
extract most significant bits |
PMOVMSKB (S2 _mm_movemask_epi8 |
MOVMSKPD (S2 _mm_movemask_pd |
MOVMSKPS (S1 _mm_movemask_ps |
|||||
VPMOVQ2M (V5+DQ… _mm_movepi64_mask |
VPMOVD2M (V5+DQ… _mm_movepi32_mask |
VPMOVW2M (V5+BW… _mm_movepi16_mask |
VPMOVB2M (V5+BW… _mm_movepi8_mask |
|||||
gather |
VPGATHERDQ (V2 _mm_i32gather_epi64 _mm_mask_i32gather_epi64 VPGATHERQQ (V2 _mm_i64gather_epi64 _mm_mask_i64gather_epi64 |
VPGATHERDD (V2 _mm_i32gather_epi32 _mm_mask_i32gather_epi32 VPGATHERQD (V2 _mm_i64gather_epi32 _mm_mask_i64gather_epi32 |
VGATHERDPD (V2 _mm_i32gather_pd _mm_mask_i32gather_pd VGATHERQPD (V2 _mm_i64gather_pd _mm_mask_i64gather_pd |
VGATHERDPS (V2 _mm_i32gather_ps _mm_mask_i32gather_ps VGATHERQPS (V2 _mm_i64gather_ps _mm_mask_i64gather_ps |
||||
scatter |
VPSCATTERDQ (V5… _mm_i32scatter_epi64 _mm_mask_i32scatter_epi64 VPSCATTERQQ (V5… _mm_i64scatter_epi64 _mm_mask_i64scatter_epi64 |
VPSCATTERDD (V5… _mm_i32scatter_epi32 _mm_mask_i32scatter_epi32 VPSCATTERQD (V5… _mm_i64scatter_epi32 _mm_mask_i64scatter_epi32 |
VSCATTERDPD (V5… _mm_i32scatter_pd _mm_mask_i32scatter_pd VSCATTERQPD (V5… _mm_i64scatter_pd _mm_mask_i64scatter_pd |
VSCATTERDPS (V5… _mm_i32scatter_ps _mm_mask_i32scatter_ps VSCATTERQPS (V5… _mm_i64scatter_ps _mm_mask_i64scatter_ps |
||||
compress |
VPCOMPRESSQ (V5… _mm_mask_compress_epi64 _mm_mask_compressstoreu_epi64 |
VPCOMPRESSD (V5… _mm_mask_compress_epi32 _mm_mask_compressstoreu_epi32 |
VPCOMPRESSW (V5+VBMI2… _mm_mask_compress_epi16 _mm_mask_compressstoreu_epi16 |
VPCOMPRESSB (V5+VBMI2… _mm_mask_compress_epi8 _mm_mask_compressstoreu_epi8 |
VCOMPRESSPD (V5… _mm_mask_compress_pd _mm_mask_compressstoreu_pd |
VCOMPRESSPS (V5… _mm_mask_compress_ps _mm_mask_compressstoreu_ps |
||
expand |
VPEXPANDQ (V5… _mm_mask_expand_epi64 _mm_mask_expandloadu_epi64 |
VPEXPANDD (V5… _mm_mask_expand_epi32 _mm_mask_expandloadu_epi32 |
VPEXPANDW (V5+VBMI2… _mm_mask_expand_epi16 _mm_mask_expandloadu_epi16 |
VPEXPANDB (V5+VBMI2… _mm_mask_expand_epi8 _mm_mask_expandloadu_epi8 |
VEXPANDPD (V5… _mm_mask_expand_pd _mm_mask_expandloadu_pd |
VEXPANDPS (V5… _mm_mask_expand_ps _mm_mask_expandloadu_ps |
||
align right |
VALIGNQ (V5… _mm_alignr_epi64 |
VALIGND (V5… _mm_alignr_epi32 |
PALIGNR (SS3 _mm_alignr_epi8 |
|||||
expand opmask bits |
VPMOVM2Q (V5+DQ… _mm_movm_epi64 |
VPMOVM2D (V5+DQ… _mm_movm_epi32 |
VPMOVM2W (V5+BW… _mm_movm_epi16 |
VPMOVM2B (V5+BW… _mm_movm_epi8 |
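
As a rough illustration of how entries in the data movement table combine, the sketch below uses only SSE2 intrinsics from the table (`_mm_loadu_si128`, `_mm_shuffle_epi32`, `_mm_storeu_si128`) to reverse four DWORDs. The function name `reverse4_epi32` is a hypothetical helper introduced here, not something from the original page.

```c
#include <emmintrin.h>  /* SSE2 intrinsics (also provides _MM_SHUFFLE) */
#include <stdint.h>

/* Hypothetical helper: reverse the order of four 32-bit integers. */
void reverse4_epi32(const int32_t in[4], int32_t out[4])
{
    /* MOVDQU: unaligned 128-bit load */
    __m128i v = _mm_loadu_si128((const __m128i *)in);
    /* PSHUFD: each 2-bit field of the immediate selects a source element;
       _MM_SHUFFLE(0, 1, 2, 3) puts element 3 first and element 0 last */
    __m128i r = _mm_shuffle_epi32(v, _MM_SHUFFLE(0, 1, 2, 3));
    /* MOVDQU: unaligned 128-bit store */
    _mm_storeu_si128((__m128i *)out, r);
}
```

Calling it on `{10, 20, 30, 40}` should leave `{40, 30, 20, 10}` in `out`. The unaligned load and store are used so that the caller does not have to guarantee 16-byte alignment.
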
From \ To | Integer | Floating point | ||||||
---|---|---|---|---|---|---|---|---|
QWORD | DWORD | WORD | BYTE | Double | Single | Half | ||
Integer | QWORD |
VPMOVQD (V5… _mm_cvtepi64_epi32 VPMOVSQD (V5… _mm_cvtsepi64_epi32 VPMOVUSQD (V5… _mm_cvtusepi64_epi32 |
VPMOVQW (V5… _mm_cvtepi64_epi16 VPMOVSQW (V5… _mm_cvtsepi64_epi16 VPMOVUSQW (V5… _mm_cvtusepi64_epi16 |
VPMOVQB (V5… _mm_cvtepi64_epi8 VPMOVSQB (V5… _mm_cvtsepi64_epi8 VPMOVUSQB (V5… _mm_cvtusepi64_epi8 |
CVTSI2SD (S2# scalar only _mm_cvtsi64_sd VCVTQQ2PD* (V5+DQ… _mm_cvtepi64_pd VCVTUQQ2PD* (V5+DQ… _mm_cvtepu64_pd |
CVTSI2SS (S1# scalar only _mm_cvtsi64_ss VCVTQQ2PS* (V5+DQ… _mm_cvtepi64_ps VCVTUQQ2PS* (V5+DQ… _mm_cvtepu64_ps |
VCVTQQ2PH* (V5+FP16… _mm_cvtepi64_ph VCVTUQQ2PH* (V5+FP16… _mm_cvtepu64_ph |
|
DWORD |
Tip 3 PMOVSXDQ (S4.1 _mm_cvtepi32_epi64 PMOVZXDQ (S4.1 _mm_cvtepu32_epi64 |
PACKSSDW (S2 _mm_packs_epi32 PACKUSDW (S4.1 _mm_packus_epi32 VPMOVDW (V5… _mm_cvtepi32_epi16 VPMOVSDW (V5… _mm_cvtsepi32_epi16 VPMOVUSDW (V5… _mm_cvtusepi32_epi16 |
VPMOVDB (V5… _mm_cvtepi32_epi8 VPMOVSDB (V5… _mm_cvtsepi32_epi8 VPMOVUSDB (V5… _mm_cvtusepi32_epi8 |
CVTDQ2PD* (S2 _mm_cvtepi32_pd VCVTUDQ2PD* (V5… _mm_cvtepu32_pd |
CVTDQ2PS* (S2 _mm_cvtepi32_ps VCVTUDQ2PS* (V5… _mm_cvtepu32_ps |
VCVTDQ2PH* (V5+FP16… _mm_cvtepi32_ph VCVTUDQ2PH* (V5+FP16… _mm_cvtepu32_ph |
||
WORD |
PMOVSXWQ (S4.1 _mm_cvtepi16_epi64 PMOVZXWQ (S4.1 _mm_cvtepu16_epi64 |
Tip 3 PMOVSXWD (S4.1 _mm_cvtepi16_epi32 PMOVZXWD (S4.1 _mm_cvtepu16_epi32 |
PACKSSWB (S2 _mm_packs_epi16 PACKUSWB (S2 _mm_packus_epi16 VPMOVWB (V5+BW… _mm_cvtepi16_epi8 VPMOVSWB (V5+BW… _mm_cvtsepi16_epi8 VPMOVUSWB (V5+BW… _mm_cvtusepi16_epi8 |
VCVTW2PH* (V5+FP16… _mm_cvtepi16_ph VCVTUW2PH* (V5+FP16… _mm_cvtepu16_ph |
||||
BYTE |
PMOVSXBQ (S4.1 _mm_cvtepi8_epi64 PMOVZXBQ (S4.1 _mm_cvtepu8_epi64 |
PMOVSXBD (S4.1 _mm_cvtepi8_epi32 PMOVZXBD (S4.1 _mm_cvtepu8_epi32 |
Tip 3 PMOVSXBW (S4.1 _mm_cvtepi8_epi16 PMOVZXBW (S4.1 _mm_cvtepu8_epi16 |
|||||
Floating point | Double |
CVTSD2SI / CVTTSD2SI (S2# scalar only _mm_cvtsd_si64 / _mm_cvttsd_si64 VCVTPD2QQ* / VCVTTPD2QQ* (V5+DQ… _mm_cvtpd_epi64 / _mm_cvttpd_epi64 VCVTPD2UQQ* / VCVTTPD2UQQ* (V5+DQ… _mm_cvtpd_epu64 / _mm_cvttpd_epu64 (the right-hand one of each pair truncates) |
CVTPD2DQ* / CVTTPD2DQ* (S2 _mm_cvtpd_epi32 / _mm_cvttpd_epi32 VCVTPD2UDQ* / VCVTTPD2UDQ* (V5… _mm_cvtpd_epu32 / _mm_cvttpd_epu32 (the right-hand one of each pair truncates) |
CVTPD2PS* (S2 _mm_cvtpd_ps |
VCVTPD2PH* (V5+FP16… _mm_cvtpd_ph |
|||
Single |
CVTSS2SI / CVTTSS2SI (S1# scalar only _mm_cvtss_si64 / _mm_cvttss_si64 VCVTPS2QQ* / VCVTTPS2QQ* (V5+DQ… _mm_cvtps_epi64 / _mm_cvttps_epi64 VCVTPS2UQQ* / VCVTTPS2UQQ* (V5+DQ… _mm_cvtps_epu64 / _mm_cvttps_epu64 (the right-hand one of each pair truncates) |
CVTPS2DQ* / CVTTPS2DQ* (S2 _mm_cvtps_epi32 / _mm_cvttps_epi32 VCVTPS2UDQ* / VCVTTPS2UDQ* (V5… _mm_cvtps_epu32 / _mm_cvttps_epu32 (the right-hand one of each pair truncates) |
CVTPS2PD* (S2 _mm_cvtps_pd |
VCVTPS2PH (F16C _mm_cvtps_ph VCVTPS2PHX* (V5+FP16… _mm_cvtxps_ph |
||||
Half |
VCVTPH2QQ* / VCVTTPH2QQ* (V5+FP16… _mm_cvtph_epi64 / _mm_cvttph_epi64 VCVTPH2UQQ* / VCVTTPH2UQQ* (V5+FP16… _mm_cvtph_epu64 / _mm_cvttph_epu64 (the right-hand one of each pair truncates) |
VCVTPH2DQ* / VCVTTPH2DQ* (V5+FP16… _mm_cvtph_epi32 / _mm_cvttph_epi32 VCVTPH2UDQ* / VCVTTPH2UDQ* (V5+FP16… _mm_cvtph_epu32 / _mm_cvttph_epu32 (the right-hand one of each pair truncates) |
VCVTPH2W* / VCVTTPH2W* (V5+FP16… _mm_cvtph_epi16 / _mm_cvttph_epi16 VCVTPH2UW* / VCVTTPH2UW* (V5+FP16… _mm_cvtph_epu16 / _mm_cvttph_epu16 (the right-hand one of each pair truncates) |
VCVTPH2PD* (V5+FP16… _mm_cvtph_pd |
VCVTPH2PS (F16C _mm_cvtph_ps VCVTPH2PSX* (V5+FP16… _mm_cvtxph_ps |
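
The difference between the paired conversions in the table (for example CVTPS2DQ and CVTTPS2DQ) can be sketched as follows. This is a minimal example that assumes the MXCSR rounding mode is at its usual default of round-to-nearest-even; the helper names `float_to_int_rounded` and `float_to_int_truncated` are made up for illustration.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: convert using the current MXCSR rounding mode
   (round-to-nearest-even unless the program changed it). */
void float_to_int_rounded(const float in[4], int32_t out[4])
{
    __m128 v = _mm_loadu_ps(in);                           /* MOVUPS */
    _mm_storeu_si128((__m128i *)out, _mm_cvtps_epi32(v));  /* CVTPS2DQ */
}

/* Hypothetical helper: convert with truncation toward zero,
   regardless of the MXCSR rounding mode. */
void float_to_int_truncated(const float in[4], int32_t out[4])
{
    __m128 v = _mm_loadu_ps(in);                            /* MOVUPS */
    _mm_storeu_si128((__m128i *)out, _mm_cvttps_epi32(v));  /* CVTTPS2DQ */
}
```

With the input `{1.5f, 2.5f, -1.5f, 3.7f}`, the rounded version should give `{2, 2, -2, 4}` under the default rounding mode and the truncated version `{1, 2, -1, 3}`.
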
Integer | Floating point | ||||||
---|---|---|---|---|---|---|---|
QWORD | DWORD | WORD | BYTE | Double | Single | Half | |
add |
PADDQ (S2 _mm_add_epi64 |
PADDD (S2 _mm_add_epi32 |
PADDW (S2 _mm_add_epi16 PADDSW (S2 _mm_adds_epi16 PADDUSW (S2 _mm_adds_epu16 |
PADDB (S2 _mm_add_epi8 PADDSB (S2 _mm_adds_epi8 PADDUSB (S2 _mm_adds_epu8 |
ADDPD* (S2 _mm_add_pd |
ADDPS* (S1 _mm_add_ps |
VADDPH* (V5+FP16… _mm_add_ph |
sub |
PSUBQ (S2 _mm_sub_epi64 |
PSUBD (S2 _mm_sub_epi32 |
PSUBW (S2 _mm_sub_epi16 PSUBSW (S2 _mm_subs_epi16 PSUBUSW (S2 _mm_subs_epu16 |
PSUBB (S2 _mm_sub_epi8 PSUBSB (S2 _mm_subs_epi8 PSUBUSB (S2 _mm_subs_epu8 |
SUBPD* (S2 _mm_sub_pd |
SUBPS* (S1 _mm_sub_ps |
VSUBPH* (V5+FP16… _mm_sub_ph |
mul |
VPMULLQ (V5+DQ… _mm_mullo_epi64 |
PMULDQ (S4.1 _mm_mul_epi32 PMULUDQ (S2 _mm_mul_epu32 PMULLD (S4.1 _mm_mullo_epi32 |
PMULHW (S2 _mm_mulhi_epi16 PMULHUW (S2 _mm_mulhi_epu16 PMULLW (S2 _mm_mullo_epi16 |
MULPD* (S2 _mm_mul_pd |
MULPS* (S1 _mm_mul_ps |
VMULPH* (V5+FP16… _mm_mul_ph |
|
div |
DIVPD* (S2 _mm_div_pd |
DIVPS* (S1 _mm_div_ps |
VDIVPH* (V5+FP16… _mm_div_ph |
||||
reciprocal |
VRCP14PD* (V5… _mm_rcp14_pd |
RCPPS* (S1 _mm_rcp_ps VRCP14PS* (V5… _mm_rcp14_ps |
VRCPPH* (V5+FP16… _mm_rcp_ph |
||||
square root |
SQRTPD* (S2 _mm_sqrt_pd |
SQRTPS* (S1 _mm_sqrt_ps |
VSQRTPH* (V5+FP16… _mm_sqrt_ph |
||||
reciprocal square root |
VRSQRT14PD* (V5… _mm_rsqrt14_pd |
RSQRTPS* (S1 _mm_rsqrt_ps VRSQRT14PS* (V5… _mm_rsqrt14_ps |
VRSQRTPH* (V5+FP16… _mm_rsqrt_ph |
||||
multiply by 2 to the power n |
VSCALEFPD* (V5… _mm_scalef_pd |
VSCALEFPS* (V5… _mm_scalef_ps |
VSCALEFPH* (V5+FP16… _mm_scalef_ph |
||||
max |
Tip 8 VPMAXSQ (V5… _mm_max_epi64 VPMAXUQ (V5… _mm_max_epu64 |
Tip 8 PMAXSD (S4.1 _mm_max_epi32 PMAXUD (S4.1 _mm_max_epu32 |
PMAXSW (S2 _mm_max_epi16 PMAXUW (S4.1 _mm_max_epu16 |
Tip 8 PMAXSB (S4.1 _mm_max_epi8 PMAXUB (S2 _mm_max_epu8 |
Tip 8 MAXPD* (S2 _mm_max_pd |
Tip 8 MAXPS* (S1 _mm_max_ps |
VMAXPH* (V5+FP16… _mm_max_ph |
min |
Tip 8 VPMINSQ (V5… _mm_min_epi64 VPMINUQ (V5… _mm_min_epu64 |
Tip 8 PMINSD (S4.1 _mm_min_epi32 PMINUD (S4.1 _mm_min_epu32 |
PMINSW (S2 _mm_min_epi16 PMINUW (S4.1 _mm_min_epu16 |
Tip 8 PMINSB (S4.1 _mm_min_epi8 PMINUB (S2 _mm_min_epu8 |
Tip 8 MINPD* (S2 _mm_min_pd |
Tip 8 MINPS* (S1 _mm_min_ps |
VMINPH* (V5+FP16… _mm_min_ph |
average |
PAVGW (S2 _mm_avg_epu16 |
PAVGB (S2 _mm_avg_epu8 |
|||||
absolute value |
Tip 4 VPABSQ (V5… _mm_abs_epi64 |
Tip 4 PABSD (SS3 _mm_abs_epi32 |
Tip 4 PABSW (SS3 _mm_abs_epi16 |
Tip 4 PABSB (SS3 _mm_abs_epi8 |
Tip 5 | Tip 5 | |
sign operation |
PSIGND (SS3 _mm_sign_epi32 |
PSIGNW (SS3 _mm_sign_epi16 |
PSIGNB (SS3 _mm_sign_epi8 |
||||
population count (number of set bits) |
VPOPCNTQ (V5+VPOPCNTDQ… _mm_popcnt_epi64 |
VPOPCNTD (V5+VPOPCNTDQ… _mm_popcnt_epi32 |
VPOPCNTW (V5+BITALG… _mm_popcnt_epi16 |
VPOPCNTB (V5+BITALG… _mm_popcnt_epi8 |
|||
rounding |
ROUNDPD* (S4.1 _mm_round_pd _mm_floor_pd _mm_ceil_pd VRNDSCALEPD* (V5… _mm_roundscale_pd |
ROUNDPS* (S4.1 _mm_round_ps _mm_floor_ps _mm_ceil_ps VRNDSCALEPS* (V5… _mm_roundscale_ps |
VRNDSCALEPH* (V5+FP16… _mm_roundscale_ph |
||||
difference from the rounded value |
VREDUCEPD* (V5+DQ… _mm_reduce_pd |
VREDUCEPS* (V5+DQ… _mm_reduce_ps |
VREDUCEPH* (V5+FP16… _mm_reduce_ph |
||||
add/sub |
ADDSUBPD (S3 _mm_addsub_pd |
ADDSUBPS (S3 _mm_addsub_ps |
|||||
horizontal add |
PHADDD (SS3 _mm_hadd_epi32 |
PHADDW (SS3 _mm_hadd_epi16 PHADDSW (SS3 _mm_hadds_epi16 |
HADDPD (S3 _mm_hadd_pd |
HADDPS (S3 _mm_hadd_ps |
|||
horizontal sub |
PHSUBD (SS3 _mm_hsub_epi32 |
PHSUBW (SS3 _mm_hsub_epi16 PHSUBSW (SS3 _mm_hsubs_epi16 |
HSUBPD (S3 _mm_hsub_pd |
HSUBPS (S3 _mm_hsub_ps |
|||
dot product / multiply and add |
PMADDWD (S2 _mm_madd_epi16 VPDPWSSD (AVX_VNNI _mm_dpwssd_avx_epi32 VPDPWSSD (V5+VNNI… _mm_dpwssd_epi32 VPDPWSSDS (AVX_VNNI _mm_dpwssds_avx_epi32 VPDPWSSDS (V5+VNNI… _mm_dpwssds_epi32 |
PMADDUBSW (SS3 _mm_maddubs_epi16 VPDPBUSD (AVX_VNNI _mm_dpbusd_avx_epi32 VPDPBUSD (V5+VNNI… _mm_dpbusd_epi32 VPDPBUSDS (AVX_VNNI _mm_dpbusds_avx_epi32 VPDPBUSDS (V5+VNNI… _mm_dpbusds_epi32 |
DPPD (S4.1 _mm_dp_pd |
DPPS (S4.1 _mm_dp_ps |
|||
fused multiply and add / sub |
VFMADDxxxPD* (FMA _mm_fmadd_pd VFMSUBxxxPD* (FMA _mm_fmsub_pd VFMADDSUBxxxPD (FMA _mm_fmaddsub_pd VFMSUBADDxxxPD (FMA _mm_fmsubadd_pd VFNMADDxxxPD* (FMA _mm_fnmadd_pd VFNMSUBxxxPD* (FMA _mm_fnmsub_pd xxx=132/213/231 |
VFMADDxxxPS* (FMA _mm_fmadd_ps VFMSUBxxxPS* (FMA _mm_fmsub_ps VFMADDSUBxxxPS (FMA _mm_fmaddsub_ps VFMSUBADDxxxPS (FMA _mm_fmsubadd_ps VFNMADDxxxPS* (FMA _mm_fnmadd_ps VFNMSUBxxxPS* (FMA _mm_fnmsub_ps xxx=132/213/231 |
VFMADDxxxPH* (V5+FP16… _mm_fmadd_ph VFMSUBxxxPH* (V5+FP16… _mm_fmsub_ph VFMADDSUBxxxPH (V5+FP16… _mm_fmaddsub_ph VFMSUBADDxxxPH (V5+FP16… _mm_fmsubadd_ph VFNMADDxxxPH* (V5+FP16… _mm_fnmadd_ph VFNMSUBxxxPH* (V5+FP16… _mm_fnmsub_ph xxx=132/213/231 |
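
The add row lists both wrapping and saturating forms for BYTE and WORD elements. The following sketch contrasts PADDB (`_mm_add_epi8`) with PADDUSB (`_mm_adds_epu8`) on unsigned bytes; the helper names `add_wrap_u8` and `add_sat_u8` are assumptions made for this example.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: 16 byte-wise additions that wrap around modulo 256. */
void add_wrap_u8(const uint8_t a[16], const uint8_t b[16], uint8_t out[16])
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_add_epi8(va, vb));   /* PADDB */
}

/* Hypothetical helper: 16 byte-wise additions that clamp at 255. */
void add_sat_u8(const uint8_t a[16], const uint8_t b[16], uint8_t out[16])
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_adds_epu8(va, vb));  /* PADDUSB */
}
```

For an element pair 200 and 100, the wrapping add should yield 44 (300 modulo 256) and the saturating add 255.
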
Integer | ||||
---|---|---|---|---|
QWORD | DWORD | WORD | BYTE | |
compare for == |
PCMPEQQ (S4.1 _mm_cmpeq_epi64 _mm_cmpeq_epi64_mask (V5… VPCMPUQ (0) (V5… _mm_cmpeq_epu64_mask |
PCMPEQD (S2 _mm_cmpeq_epi32 _mm_cmpeq_epi32_mask (V5… VPCMPUD (0) (V5… _mm_cmpeq_epu32_mask |
PCMPEQW (S2 _mm_cmpeq_epi16 _mm_cmpeq_epi16_mask (V5+BW… VPCMPUW (0) (V5+BW… _mm_cmpeq_epu16_mask |
PCMPEQB (S2 _mm_cmpeq_epi8 _mm_cmpeq_epi8_mask (V5+BW… VPCMPUB (0) (V5+BW… _mm_cmpeq_epu8_mask |
compare for < |
VPCMPQ (1) (V5… _mm_cmplt_epi64_mask VPCMPUQ (1) (V5… _mm_cmplt_epu64_mask |
VPCMPD (1) (V5… _mm_cmplt_epi32_mask VPCMPUD (1) (V5… _mm_cmplt_epu32_mask |
VPCMPW (1) (V5+BW… _mm_cmplt_epi16_mask VPCMPUW (1) (V5+BW… _mm_cmplt_epu16_mask |
VPCMPB (1) (V5+BW… _mm_cmplt_epi8_mask VPCMPUB (1) (V5+BW… _mm_cmplt_epu8_mask |
compare for <= |
VPCMPQ (2) (V5… _mm_cmple_epi64_mask VPCMPUQ (2) (V5… _mm_cmple_epu64_mask |
VPCMPD (2) (V5… _mm_cmple_epi32_mask VPCMPUD (2) (V5… _mm_cmple_epu32_mask |
VPCMPW (2) (V5+BW… _mm_cmple_epi16_mask VPCMPUW (2) (V5+BW… _mm_cmple_epu16_mask |
VPCMPB (2) (V5+BW… _mm_cmple_epi8_mask VPCMPUB (2) (V5+BW… _mm_cmple_epu8_mask |
compare for > |
PCMPGTQ (S4.2 _mm_cmpgt_epi64 VPCMPQ (6) (V5… _mm_cmpgt_epi64_mask VPCMPUQ (6) (V5… _mm_cmpgt_epu64_mask |
PCMPGTD (S2 _mm_cmpgt_epi32 VPCMPD (6) (V5… _mm_cmpgt_epi32_mask VPCMPUD (6) (V5… _mm_cmpgt_epu32_mask |
PCMPGTW (S2 _mm_cmpgt_epi16 VPCMPW (6) (V5+BW… _mm_cmpgt_epi16_mask VPCMPUW (6) (V5+BW… _mm_cmpgt_epu16_mask |
PCMPGTB (S2 _mm_cmpgt_epi8 VPCMPB (6) (V5+BW… _mm_cmpgt_epi8_mask VPCMPUB (6) (V5+BW… _mm_cmpgt_epu8_mask |
compare for >= |
VPCMPQ (5) (V5… _mm_cmpge_epi64_mask VPCMPUQ (5) (V5… _mm_cmpge_epu64_mask |
VPCMPD (5) (V5… _mm_cmpge_epi32_mask VPCMPUD (5) (V5… _mm_cmpge_epu32_mask |
VPCMPW (5) (V5+BW… _mm_cmpge_epi16_mask VPCMPUW (5) (V5+BW… _mm_cmpge_epu16_mask |
VPCMPB (5) (V5+BW… _mm_cmpge_epi8_mask VPCMPUB (5) (V5+BW… _mm_cmpge_epu8_mask |
compare for != |
VPCMPQ (4) (V5… _mm_cmpneq_epi64_mask VPCMPUQ (4) (V5… _mm_cmpneq_epu64_mask |
VPCMPD (4) (V5… _mm_cmpneq_epi32_mask VPCMPUD (4) (V5… _mm_cmpneq_epu32_mask |
VPCMPW (4) (V5+BW… _mm_cmpneq_epi16_mask VPCMPUW (4) (V5+BW… _mm_cmpneq_epu16_mask |
VPCMPB (4) (V5+BW… _mm_cmpneq_epi8_mask VPCMPUB (4) (V5+BW… _mm_cmpneq_epu8_mask |
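
The `_mask` intrinsics in this table return an opmask directly, whereas the older PCMPEQ/PCMPGT forms return a vector of all-ones or all-zeros elements. As a sketch of how a similar bitmask can be obtained with SSE2 alone, the example below combines PCMPGTD with MOVMSKPS. The function name `gt_mask_epi32` is hypothetical.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: bit i of the result is 1 when a[i] > b[i] (signed). */
int gt_mask_epi32(const int32_t a[4], const int32_t b[4])
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    /* PCMPGTD: each DWORD becomes 0xFFFFFFFF where a > b, else 0 */
    __m128i m = _mm_cmpgt_epi32(va, vb);
    /* MOVMSKPS: gather the top bit of each 32-bit element into bits 0..3 */
    return _mm_movemask_ps(_mm_castsi128_ps(m));
}
```

With `a = {1, 5, -3, 7}` and `b = {2, 4, -4, 7}`, elements 1 and 2 satisfy the comparison, so the result should be 6 (binary 0110).
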
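Floating point | |||||||||
---|---|---|---|---|---|---|---|---|---|
Double | Single | Half | |||||||
When either or both operands are NaN | Condition false | Condition treated as true | Condition false | Condition treated as true | |||||
Exception on QNaN | YES | NO | YES | NO | YES | NO | YES | NO | |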
compare for == |
VCMPEQ_OSPD* (V1 _mm_cmp_pd |
CMPEQPD* (S2 _mm_cmpeq_pd |
VCMPEQ_USPD* (V1 _mm_cmp_pd |
VCMPEQ_UQPD* (V1 _mm_cmp_pd |
VCMPEQ_OSPS* (V1 _mm_cmp_ps |
CMPEQPS* (S1 _mm_cmpeq_ps |
VCMPEQ_USPS* (V1 _mm_cmp_ps |
VCMPEQ_UQPS* (V1 _mm_cmp_ps |
VCMPPH* (V5+FP16… _mm_cmp_ph |
compare for < |
CMPLTPD* (S2 _mm_cmplt_pd |
VCMPLT_OQPD* (V1 _mm_cmp_pd |
CMPLTPS* (S1 _mm_cmplt_ps |
VCMPLT_OQPS* (V1 _mm_cmp_ps |
|||||
compare for <= |
CMPLEPD* (S2 _mm_cmple_pd |
VCMPLE_OQPD* (V1 _mm_cmp_pd |
CMPLEPS* (S1 _mm_cmple_ps |
VCMPLE_OQPS* (V1 _mm_cmp_ps |
|||||
compare for > |
VCMPGTPD* (V1 _mm_cmpgt_pd (S2 |
VCMPGT_OQPD* (V1 _mm_cmp_pd |
VCMPGTPS* (V1 _mm_cmpgt_ps (S1 |
VCMPGT_OQPS* (V1 _mm_cmp_ps |
|||||
compare for >= |
VCMPGEPD* (V1 _mm_cmpge_pd (S2 |
VCMPGE_OQPD* (V1 _mm_cmp_pd |
VCMPGEPS* (V1 _mm_cmpge_ps (S1 |
VCMPGE_OQPS* (V1 _mm_cmp_ps |
|||||
compare for != |
VCMPNEQ_OSPD* (V1 _mm_cmp_pd |
VCMPNEQ_OQPD* (V1 _mm_cmp_pd |
VCMPNEQ_USPD* (V1 _mm_cmp_pd |
CMPNEQPD* (S2 _mm_cmpneq_pd |
VCMPNEQ_OSPS* (V1 _mm_cmp_ps |
VCMPNEQ_OQPS* (V1 _mm_cmp_ps |
VCMPNEQ_USPS* (V1 _mm_cmp_ps |
CMPNEQPS* (S1 _mm_cmpneq_ps |
|
compare for ! < |
CMPNLTPD* (S2 _mm_cmpnlt_pd |
VCMPNLT_UQPD* (V1 _mm_cmp_pd |
CMPNLTPS* (S1 _mm_cmpnlt_ps |
VCMPNLT_UQPS* (V1 _mm_cmp_ps |
|||||
compare for ! <= |
CMPNLEPD* (S2 _mm_cmpnle_pd |
VCMPNLE_UQPD* (V1 _mm_cmp_pd |
CMPNLEPS* (S1 _mm_cmpnle_ps |
VCMPNLE_UQPS* (V1 _mm_cmp_ps |
|||||
compare for ! > |
VCMPNGTPD* (V1 _mm_cmpngt_pd (S2 |
VCMPNGT_UQPD* (V1 _mm_cmp_pd |
VCMPNGTPS* (V1 _mm_cmpngt_ps (S1 |
VCMPNGT_UQPS* (V1 _mm_cmp_ps |
|||||
compare for ! >= |
VCMPNGEPD* (V1 _mm_cmpnge_pd (S2 |
VCMPNGE_UQPD* (V1 _mm_cmp_pd |
VCMPNGEPS* (V1 _mm_cmpnge_ps (S1 |
VCMPNGE_UQPS* (V1 _mm_cmp_ps |
|||||
compare for ordered |
VCMPORD_SPD* (V1 _mm_cmp_pd |
CMPORDPD* (S2 _mm_cmpord_pd |
VCMPORD_SPS* (V1 _mm_cmp_ps |
CMPORDPS* (S1 _mm_cmpord_ps |
|||||
compare for unordered |
VCMPUNORD_SPD* (V1 _mm_cmp_pd |
CMPUNORDPD* (S2 _mm_cmpunord_pd |
VCMPUNORD_SPS* (V1 _mm_cmp_ps |
CMPUNORDPS* (S1 _mm_cmpunord_ps |
|||||
TRUE |
VCMPTRUE_USPD* (V1 _mm_cmp_pd |
VCMPTRUEPD* (V1 _mm_cmp_pd |
VCMPTRUE_USPS* (V1 _mm_cmp_ps |
VCMPTRUEPS* (V1 _mm_cmp_ps |
|||||
FALSE |
VCMPFALSE_OSPD* (V1 _mm_cmp_pd |
VCMPFALSEPD* (V1 _mm_cmp_pd |
VCMPFALSE_OSPS* (V1 _mm_cmp_ps |
VCMPFALSEPS* (V1 _mm_cmp_ps |
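
The NaN columns of this table can be checked with a small experiment. The sketch below uses the SSE2 forms CMPEQPD and CMPNEQPD together with MOVMSKPD; the helper names `eq_mask_pd` and `neq_mask_pd` are invented for the example, and it assumes the code is compiled without options such as `-ffast-math` that relax NaN handling.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <math.h>       /* NAN macro */

/* Hypothetical helper: bit i is 1 when element i of a equals element i of b.
   A NaN on either side makes the condition false (CMPEQPD). */
int eq_mask_pd(double a0, double a1, double b0, double b1)
{
    __m128d a = _mm_set_pd(a1, a0);  /* _mm_set_pd takes the high element first */
    __m128d b = _mm_set_pd(b1, b0);
    return _mm_movemask_pd(_mm_cmpeq_pd(a, b));
}

/* Hypothetical helper: bit i is 1 when the elements differ.
   A NaN on either side makes the condition true (CMPNEQPD). */
int neq_mask_pd(double a0, double a1, double b0, double b1)
{
    __m128d a = _mm_set_pd(a1, a0);
    __m128d b = _mm_set_pd(b1, b0);
    return _mm_movemask_pd(_mm_cmpneq_pd(a, b));
}
```

Comparing `(1.0, NAN)` with `(1.0, NAN)` should give 1 from `eq_mask_pd` (only the non-NaN element matches) and 2 from `neq_mask_pd` (only the NaN element counts as different).
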
Floating point | |||
---|---|---|---|
Double | Single | Half | |
Scalar compare, result set in the flags |
COMISD (S2 _mm_comieq_sd _mm_comilt_sd _mm_comile_sd _mm_comigt_sd _mm_comige_sd _mm_comineq_sd UCOMISD (S2 _mm_ucomieq_sd _mm_ucomilt_sd _mm_ucomile_sd _mm_ucomigt_sd _mm_ucomige_sd _mm_ucomineq_sd |
COMISS (S1 _mm_comieq_ss _mm_comilt_ss _mm_comile_ss _mm_comigt_ss _mm_comige_ss _mm_comineq_ss UCOMISS (S1 _mm_ucomieq_ss _mm_ucomilt_ss _mm_ucomile_ss _mm_ucomigt_ss _mm_ucomige_ss _mm_ucomineq_ss |
VCOMISH (V5+FP16 _mm_comieq_sh _mm_comilt_sh _mm_comile_sh _mm_comigt_sh _mm_comige_sh _mm_comineq_sh VUCOMISH (V5+FP16 _mm_ucomieq_sh _mm_ucomilt_sh _mm_ucomile_sh _mm_ucomigt_sh _mm_ucomige_sh _mm_ucomineq_sh |
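
At the intrinsics level the flag result of COMISD/UCOMISD is exposed as an `int` that is 0 or 1. A minimal sketch with ordinary (non-NaN) inputs follows; `lowest_less_than` and `lowest_equal` are hypothetical names.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: 1 when a < b, comparing only the lowest element (COMISD). */
int lowest_less_than(double a, double b)
{
    /* _mm_set_sd places the value in the lowest element and zeroes the upper one */
    return _mm_comilt_sd(_mm_set_sd(a), _mm_set_sd(b));
}

/* Hypothetical helper: 1 when a == b, comparing only the lowest element (UCOMISD). */
int lowest_equal(double a, double b)
{
    return _mm_ucomieq_sd(_mm_set_sd(a), _mm_set_sd(b));
}
```

For example, `lowest_less_than(1.0, 2.0)` should return 1 and `lowest_equal(1.0, 2.0)` should return 0.
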
Integer | Floating point | ||||||
---|---|---|---|---|---|---|---|
QWORD | DWORD | WORD | BYTE | Double | Single | Half | |
and |
PAND (S2 _mm_and_si128 |
ANDPD (S2 _mm_and_pd |
ANDPS (S1 _mm_and_ps |
||||
VPANDQ (V5… _mm512_and_epi64 etc. |
VPANDD (V5… _mm512_and_epi32 etc. |
||||||
and not |
PANDN (S2 _mm_andnot_si128 |
ANDNPD (S2 _mm_andnot_pd |
ANDNPS (S1 _mm_andnot_ps |
||||
VPANDNQ (V5… _mm512_andnot_epi64 etc. |
VPANDND (V5… _mm512_andnot_epi32 etc. |
||||||
or |
POR (S2 _mm_or_si128 |
ORPD (S2 _mm_or_pd |
ORPS (S1 _mm_or_ps |
||||
VPORQ (V5… _mm512_or_epi64 etc. |
VPORD (V5… _mm512_or_epi32 etc. |
||||||
xor |
PXOR (S2 _mm_xor_si128 |
XORPD (S2 _mm_xor_pd |
XORPS (S1 _mm_xor_ps |
||||
VPXORQ (V5… _mm512_xor_epi64 etc. |
VPXORD (V5… _mm512_xor_epi32 etc. |
||||||
test |
PTEST (S4.1 _mm_testz_si128 _mm_testc_si128 _mm_testnzc_si128 |
VTESTPD (V1 _mm_testz_pd _mm_testc_pd _mm_testnzc_pd |
VTESTPS (V1 _mm_testz_ps _mm_testc_ps _mm_testnzc_ps |
||||
VPTESTMQ (V5… _mm_test_epi64_mask VPTESTNMQ (V5… _mm_testn_epi64_mask |
VPTESTMD (V5… _mm_test_epi32_mask VPTESTNMD (V5… _mm_testn_epi32_mask |
VPTESTMW (V5+BW… _mm_test_epi16_mask VPTESTNMW (V5+BW… _mm_testn_epi16_mask |
VPTESTMB (V5+BW… _mm_test_epi8_mask VPTESTNMB (V5+BW… _mm_testn_epi8_mask |
||||
ternary logic operation |
VPTERNLOGQ (V5… _mm_ternarylogic_epi64 |
VPTERNLOGD (V5… _mm_ternarylogic_epi32 |
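
One detail worth illustrating for the and-not row: PANDN (`_mm_andnot_si128`) inverts its first operand, computing `(~a) & b`. The sketch below uses that to build a bitwise select from SSE2 logical operations; `bit_select_epi32` is a hypothetical helper name.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: for each bit, take a where mask is 1 and b where mask is 0. */
void bit_select_epi32(const uint32_t mask[4], const uint32_t a[4],
                      const uint32_t b[4], uint32_t out[4])
{
    __m128i m  = _mm_loadu_si128((const __m128i *)mask);
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    /* PAND keeps the bits of a under the mask;
       PANDN computes (~m) & vb, so the mask must be the first argument */
    __m128i r = _mm_or_si128(_mm_and_si128(m, va), _mm_andnot_si128(m, vb));
    _mm_storeu_si128((__m128i *)out, r);
}
```

With a mask element of `0xFF00FF00`, `a = 0x11111111` and `b = 0x22222222`, the selected result should be `0x11221122`.
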
Integer | ||||
---|---|---|---|---|
QWORD | DWORD | WORD | BYTE | |
shift left logical |
PSLLQ (S2 _mm_slli_epi64 _mm_sll_epi64 |
PSLLD (S2 _mm_slli_epi32 _mm_sll_epi32 |
PSLLW (S2 _mm_slli_epi16 _mm_sll_epi16 |
|
VPSLLVQ (V2 _mm_sllv_epi64 |
VPSLLVD (V2 _mm_sllv_epi32 |
VPSLLVW (V5+BW… _mm_sllv_epi16 |
||
shift right logical |
PSRLQ (S2 _mm_srli_epi64 _mm_srl_epi64 |
PSRLD (S2 _mm_srli_epi32 _mm_srl_epi32 |
PSRLW (S2 _mm_srli_epi16 _mm_srl_epi16 |
|
VPSRLVQ (V2 _mm_srlv_epi64 |
VPSRLVD (V2 _mm_srlv_epi32 |
VPSRLVW (V5+BW… _mm_srlv_epi16 |
||
shift right arithmetic |
VPSRAQ (V5… _mm_srai_epi64 _mm_sra_epi64 |
PSRAD (S2 _mm_srai_epi32 _mm_sra_epi32 |
PSRAW (S2 _mm_srai_epi16 _mm_sra_epi16 |
|
VPSRAVQ (V5… _mm_srav_epi64 |
VPSRAVD (V2 _mm_srav_epi32 |
VPSRAVW (V5+BW… _mm_srav_epi16 |
||
rotate left |
VPROLQ (V5… _mm_rol_epi64 |
VPROLD (V5… _mm_rol_epi32 |
||
VPROLVQ (V5… _mm_rolv_epi64 |
VPROLVD (V5… _mm_rolv_epi32 |
|||
rotate right |
VPRORQ (V5… _mm_ror_epi64 |
VPRORD (V5… _mm_ror_epi32 |
||
VPRORVQ (V5… _mm_rorv_epi64 |
VPRORVD (V5… _mm_rorv_epi32 |
|||
shift left logical double |
VPSHLDQ (V5+VBMI2… _mm_shldi_epi64 |
VPSHLDD (V5+VBMI2… _mm_shldi_epi32 |
VPSHLDW (V5+VBMI2… _mm_shldi_epi16 |
|
VPSHLDVQ (V5+VBMI2… _mm_shldv_epi64 |
VPSHLDVD (V5+VBMI2… _mm_shldv_epi32 |
VPSHLDVW (V5+VBMI2… _mm_shldv_epi16 |
||
shift right logical double |
VPSHRDQ (V5+VBMI2… _mm_shrdi_epi64 |
VPSHRDD (V5+VBMI2… _mm_shrdi_epi32 |
VPSHRDW (V5+VBMI2… _mm_shrdi_epi16 |
|
VPSHRDVQ (V5+VBMI2… _mm_shrdv_epi64 |
VPSHRDVD (V5+VBMI2… _mm_shrdv_epi32 |
VPSHRDVW (V5+VBMI2… _mm_shrdv_epi16 |
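
To show how the logical and arithmetic right shifts in this table differ on negative values, here is a small sketch using PSRAD (`_mm_srai_epi32`) and PSRLD (`_mm_srli_epi32`). The function name `shift_right_by_1` is made up for the example.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: shift four DWORDs right by one bit in both ways. */
void shift_right_by_1(const int32_t in[4], int32_t arith[4], int32_t logical[4])
{
    __m128i v = _mm_loadu_si128((const __m128i *)in);
    /* PSRAD: the sign bit is copied into the vacated top bit */
    _mm_storeu_si128((__m128i *)arith, _mm_srai_epi32(v, 1));
    /* PSRLD: a zero is shifted into the vacated top bit */
    _mm_storeu_si128((__m128i *)logical, _mm_srli_epi32(v, 1));
}
```

For the element -8 (`0xFFFFFFF8`), the arithmetic shift should give -4 while the logical shift gives `0x7FFFFFFC`; non-negative elements come out the same either way.
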
128bit | |
---|---|
shift left logical |
PSLLDQ (S2 _mm_slli_si128 |
shift right logical |
PSRLDQ (S2 _mm_srli_si128 |
packed align right |
PALIGNR (SS3 _mm_alignr_epi8 |
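
The 128-bit shifts move the whole register by a number of bytes, not bits. A minimal sketch with PSRLDQ (`_mm_srli_si128`) and PSLLDQ (`_mm_slli_si128`), shifting by 4 bytes so that the effect is visible as a move of whole DWORD elements; `shift_bytes_by_4` is a hypothetical name.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: shift the 128-bit value right and left by 4 bytes. */
void shift_bytes_by_4(const int32_t in[4], int32_t right[4], int32_t left[4])
{
    __m128i v = _mm_loadu_si128((const __m128i *)in);
    /* PSRLDQ: element 1 moves to element 0, zeros enter at the top */
    _mm_storeu_si128((__m128i *)right, _mm_srli_si128(v, 4));
    /* PSLLDQ: element 0 moves to element 1, zeros enter at the bottom */
    _mm_storeu_si128((__m128i *)left, _mm_slli_si128(v, 4));
}
```

For the input `{1, 2, 3, 4}` this should produce `{2, 3, 4, 0}` and `{0, 1, 2, 3}`. The byte count has to be a compile-time constant because it is encoded as an immediate.
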
explicit length | implicit length | |
---|---|---|
return index |
PCMPESTRI (S4.2 _mm_cmpestri _mm_cmpestra _mm_cmpestrc _mm_cmpestro _mm_cmpestrs _mm_cmpestrz |
PCMPISTRI (S4.2 _mm_cmpistri _mm_cmpistra _mm_cmpistrc _mm_cmpistro _mm_cmpistrs _mm_cmpistrz |
return mask |
PCMPESTRM (S4.2 _mm_cmpestrm _mm_cmpestra _mm_cmpestrc _mm_cmpestro _mm_cmpestrs _mm_cmpestrz |
PCMPISTRM (S4.2 _mm_cmpistrm _mm_cmpistra _mm_cmpistrc _mm_cmpistro _mm_cmpistrs _mm_cmpistrz |
LDMXCSR (S1 _mm_setcsr |
Load MXCSR register |
STMXCSR (S1 _mm_getcsr |
Save MXCSR register state |
PSADBW (S2 _mm_sad_epu8 |
Compute sum of absolute differences |
MPSADBW (S4.1 _mm_mpsadbw_epu8 |
Performs eight 4-byte wide Sum of Absolute Differences operations to produce eight word integers. |
VDBPSADBW (V5+BW… _mm_dbsad_epu8 |
Double Block Packed Sum-Absolute-Differences (SAD) on Unsigned Bytes |
PMULHRSW (SS3 _mm_mulhrs_epi16 |
Packed Multiply High with Round and Scale |
PHMINPOSUW (S4.1 _mm_minpos_epu16 |
Finds the value and location of the minimum unsigned word from one of 8 horizontally packed unsigned words. The resulting value and location (offset within the source) are packed into the low dword of the destination XMM register. |
VPCONFLICTQ (V5+CD… _mm512_conflict_epi64 VPCONFLICTD (V5+CD… _mm512_conflict_epi32 |
Detect Conflicts Within a Vector of Packed Dword/Qword Values into Dense Memory/ Register |
VP2INTERSECTQ (V5+VP2INTERSECT… _mm512_2intersect_epi64 VP2INTERSECTD (V5+VP2INTERSECT… _mm512_2intersect_epi32 |
Compute Intersection Between DWORDS/QUADWORDS to a Pair of Mask Registers |
VPLZCNTQ (V5+CD… _mm_lzcnt_epi64 VPLZCNTD (V5+CD… _mm_lzcnt_epi32 |
Count the Number of Leading Zero Bits for Packed Dword, Packed Qword Values |
VFIXUPIMMPD* (V5… _mm512_fixupimm_pd VFIXUPIMMPS* (V5… _mm512_fixupimm_ps |
Fix Up Special Packed Float64/32 Values |
VFPCLASSPD* (V5+DQ… _mm512_fpclass_pd_mask VFPCLASSPS* (V5+DQ… _mm512_fpclass_ps_mask VFPCLASSPH* (V5+FP16… _mm512_fpclass_ph_mask |
Tests Types of Packed Float64/32 Values |
VRANGEPD* (V5+DQ… _mm_range_pd VRANGEPS* (V5+DQ… _mm_range_ps |
Range Restriction Calculation For Packed Pairs of Float64/32 Values |
VGETEXPPD* (V5… _mm512_getexp_pd VGETEXPPS* (V5… _mm512_getexp_ps VGETEXPPH* (V5+FP16… _mm512_getexp_ph |
Convert Exponents of Packed DP/SP FP Values to FP Values |
VGETMANTPD* (V5… _mm512_getmant_pd VGETMANTPS* (V5… _mm512_getmant_ps VGETMANTPH* (V5+FP16… _mm512_getmant_ph |
Extract Float64/32 Vector of Normalized Mantissas from Float64/32 Vector |
AESDEC (AESNI _mm_aesdec_si128 |
Perform an AES decryption round using an 128-bit state and a round key |
AESDECLAST (AESNI _mm_aesdeclast_si128 |
Perform the last AES decryption round using an 128-bit state and a round key |
AESENC (AESNI _mm_aesenc_si128 |
Perform an AES encryption round using an 128-bit state and a round key |
AESENCLAST (AESNI _mm_aesenclast_si128 |
Perform the last AES encryption round using an 128-bit state and a round key |
AESIMC (AESNI _mm_aesimc_si128 |
Perform an inverse mix column transformation primitive |
AESKEYGENASSIST (AESNI _mm_aeskeygenassist_si128 |
Assist the creation of round keys with a key expansion schedule |
PCLMULQDQ (PCLMULQDQ _mm_clmulepi64_si128 |
Perform carryless multiplication of two 64-bit numbers |
SHA1RNDS4 (SHA _mm_sha1rnds4_epu32 |
Perform Four Rounds of SHA1 Operation |
SHA1NEXTE (SHA _mm_sha1nexte_epu32 |
Calculate SHA1 State Variable E after Four Rounds |
SHA1MSG1 (SHA _mm_sha1msg1_epu32 |
Perform an Intermediate Calculation for the Next Four SHA1 Message Dwords |
SHA1MSG2 (SHA _mm_sha1msg2_epu32 |
Perform a Final Calculation for the Next Four SHA1 Message Dwords |
SHA256RNDS2 (SHA _mm_sha256rnds2_epu32 |
Perform Two Rounds of SHA256 Operation |
SHA256MSG1 (SHA _mm_sha256msg1_epu32 |
Perform an Intermediate Calculation for the Next Four SHA256 Message Dwords |
SHA256MSG2 (SHA _mm_sha256msg2_epu32 |
Perform a Final Calculation for the Next Four SHA256 Message Dwords |
GF2P8AFFINEQB (GFNI… _mm_gf2p8affine_epi64_epi8 |
Galois Field Affine Transformation |
GF2P8AFFINEINVQB (GFNI… _mm_gf2p8affineinv_epi64_epi8 |
Galois Field Affine Transformation Inverse |
GF2P8MULB (GFNI… _mm_gf2p8mul_epi8 |
Galois Field Multiply Bytes |
VFMULCPH* (V5+FP16… _mm_fmul_pch _mm_mul_pch VFCMULCPH* (V5+FP16… _mm_fcmul_pch _mm_cmul_pch |
Complex Multiply FP16 Values |
VFMADDCPH* (V5+FP16… _mm_fmadd_pch VFCMADDCPH* (V5+FP16… _mm_fcmadd_pch |
Complex Multiply and Accumulate FP16 Values |
VCVTNE2PS2BF16 (V5+BF16… _mm_cvtne2ps_pbh |
Convert Two Packed Single Data to One Packed BF16 Data |
VCVTNEPS2BF16 (V5+BF16… _mm_cvtneps_pbh |
Convert Packed Single Data to Packed BF16 Data |
VDPBF16PS (V5+BF16… _mm_dpbf16_ps |
Dot Product of BF16 Pairs Accumulated Into Packed Single Precision |
VPMADD52HUQ (V5+IFMA… _mm_madd52hi_epu64 |
Packed Multiply of Unsigned 52-Bit Integers and Add the High 52-Bit Products to Qword Accumulators |
VPMADD52LUQ (V5+IFMA… _mm_madd52lo_epu64 |
Packed Multiply of Unsigned 52-Bit Integers and Add the Low 52-Bit Products to Qword Accumulators |
VPMULTISHIFTQB (V5+VBMI… _mm_multishift_epi64_epi8 |
Select Packed Unaligned Bytes From Quadword Sources |
VPSHUFBITQMB (V5+BITALG… _mm_bitshuffle_epi64_mask |
Shuffle Bits From Quadword Elements Using Byte Indexes Into Mask |
VPBROADCASTMB2Q (V5+CD… _mm_broadcastmb_epi64 VPBROADCASTMW2D (V5+CD… _mm_broadcastmw_epi32 |
Broadcast Mask to Vector Register |
VZEROALL (V1 _mm256_zeroall |
Zero all YMM registers |
VZEROUPPER (V1 _mm256_zeroupper |
Zero upper 128 bits of all YMM registers |
MOVNTPS (S1 _mm_stream_ps |
Non-temporal store of four packed single-precision floating-point values from an XMM register into memory |
MASKMOVDQU (S2 _mm_maskmoveu_si128 |
Non-temporal store of selected bytes from an XMM register into memory |
MOVNTPD (S2 _mm_stream_pd |
Non-temporal store of two packed double-precision floating-point values from an XMM register into memory |
MOVNTDQ (S2 _mm_stream_si128 |
Non-temporal store of double quadword from an XMM register into memory |
LDDQU (S3 _mm_lddqu_si128 |
Special 128-bit unaligned load designed to avoid cache line splits |
MOVNTDQA (S4.1 _mm_stream_load_si128 |
Provides a non-temporal hint that can cause adjacent 16-byte items within an aligned 64-byte region (a streaming line) to be fetched and held in a small set of temporary buffers (“streaming load buffers”). Subsequent streaming loads to other aligned 16-byte items in the same streaming line may be supplied from the streaming load buffer and can improve throughput. |
LDTILECFG (AMX-TILE# _tile_loadconfig |
Load Tile Configuration |
STTILECFG (AMX-TILE# _tile_storeconfig |
Store Tile Configuration |
TILELOADD (AMX-TILE# _tile_loadd TILELOADDT1 (AMX-TILE# _tile_stream_loadd |
Load Tile Data |
TILESTORED (AMX-TILE# _tile_stored |
Store Tile Data |
TDPBF16PS (AMX-BF16# _tile_dpbf16ps |
Dot Product of BF16 Tiles Accumulated into Packed Single Precision Tile |
TDPBSSD (AMX-INT8# _tile_dpbssd TDPBSUD (AMX-INT8# _tile_dpbsud TDPBUSD (AMX-INT8# _tile_dpbusd TDPBUUD (AMX-INT8# _tile_dpbuud |
Dot Product of Signed/Unsigned Bytes with Dword Accumulation |
TILEZERO (AMX-TILE# _tile_zero |
Zero Tile |
TILERELEASE (AMX-TILE# _tile_release |
Release Tile |
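Zeroing a register can be done with an XOR instruction. This works for floating-point types too.
Example: set both QWORDs (or all 4 DWORDs, 8 WORDs, or 16 BYTEs) of XMM1 to 0
pxor xmm1, xmm1
Example: set all 4 floats of XMM1 to 0.0f
xorps xmm1, xmm1
Example: set both doubles of XMM1 to 0.0
xorpd xmm1, xmm1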
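In C/C++ the same zeroing is usually written with the setzero intrinsics rather than an explicit XOR. A minimal sketch follows; the function names `zero_int`, `zero_float` and `zero_double` are made up for illustration, and which instruction the compiler actually emits is up to the compiler.

```c
#include <emmintrin.h>  /* SSE2 intrinsics (also provides the SSE ones) */

/* Each helper returns an all-zero 128-bit value of the given type.
   Compilers typically implement these with a register XORed with itself. */
static inline __m128i zero_int(void)    { return _mm_setzero_si128(); }
static inline __m128  zero_float(void)  { return _mm_setzero_ps(); }
static inline __m128d zero_double(void) { return _mm_setzero_pd(); }
```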
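Broadcasting one element to all the other elements can be done with a shuffle instruction.
Example: copy the float in the lowest 32 bits of XMM1 to the other three 32-bit elements of XMM1
shufps xmm1, xmm1, 0
Example: copy the WORD in the lowest 16 bits of XMM1 to the other seven 16-bit elements of XMM1
Example: copy the QWORD in the low 64 bits of XMM1 to the high 64 bits of XMM1
pshufd xmm1, xmm1, 44h ; 01 00 01 00 B = 44h
For this last case, the following is the better choice.
punpcklqdq xmm1, xmm1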
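The three broadcast cases can be sketched with SSE2 intrinsics as follows. The helper names are hypothetical. The WORD case is one possible SSE2 sequence (PSHUFLW followed by PSHUFD), not necessarily the only or best one.

```c
#include <emmintrin.h>  /* SSE2 intrinsics (also provides the SSE ones) */

/* Copy float element 0 to all four elements (SHUFPS with selector 0). */
static inline __m128 broadcast_low_float(__m128 v) {
    return _mm_shuffle_ps(v, v, 0);
}

/* Copy WORD element 0 to all eight elements: first fill the low four
   WORDs (PSHUFLW), then replicate the low DWORD (PSHUFD). */
static inline __m128i broadcast_low_word(__m128i v) {
    v = _mm_shufflelo_epi16(v, 0);
    return _mm_shuffle_epi32(v, 0);
}

/* Copy the low QWORD into the high QWORD (PUNPCKLQDQ). */
static inline __m128i broadcast_low_qword(__m128i v) {
    return _mm_unpacklo_epi64(v, v);
}
```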
Widening integers (zero extension or sign extension) can be done with the unpack instructions.
Example: zero-extend the 8 WORDs in XMM1 to DWORDs and put them in XMM1 (lower four) and XMM2 (upper four)
movdqa xmm2, xmm1    ; source data: WORD [7] [6] [5] [4] [3] [2] [1] [0]
pxor xmm3, xmm3      ; upper 16 bits to attach to each WORD = all 0
punpcklwd xmm1, xmm3 ; lower four: 0 [3] 0 [2] 0 [1] 0 [0]
punpckhwd xmm2, xmm3 ; upper four: 0 [7] 0 [6] 0 [5] 0 [4]
Example: sign-extend the 16 BYTEs in XMM1 to WORDs and put them in XMM1 (lower eight) and XMM2 (upper eight)
pxor xmm3, xmm3
movdqa xmm2, xmm1
pcmpgtb xmm3, xmm1   ; upper 8 bits to attach to each BYTE: 0 if non-negative, -1 if negative
punpcklbw xmm1, xmm3 ; lower eight
punpckhbw xmm2, xmm3 ; upper eight
Example (intrinsics): sign-extend the 8 WORDs in the __m128i variable words8 and put them in dwords4lo (lower four) and dwords4hi (upper four)
const __m128i izero = _mm_setzero_si128();
__m128i words8hi = _mm_cmpgt_epi16(izero, words8);
__m128i dwords4lo = _mm_unpacklo_epi16(words8, words8hi);
__m128i dwords4hi = _mm_unpackhi_epi16(words8, words8hi);
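For completeness, here is a sketch of the zero-extension case in intrinsics form, this time widening BYTEs to WORDs. The function name `zero_extend_bytes` and its output-pointer interface are assumptions made for this example.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Zero-extend 16 unsigned BYTEs to 16 WORDs. Interleaving each byte
   with a zero byte supplies the upper 8 bits of every WORD. */
static inline void zero_extend_bytes(__m128i bytes16, __m128i *lo8, __m128i *hi8) {
    const __m128i zero = _mm_setzero_si128();
    *lo8 = _mm_unpacklo_epi8(bytes16, zero);  /* bytes 0..7  as WORDs */
    *hi8 = _mm_unpackhi_epi8(bytes16, zero);  /* bytes 8..15 as WORDs */
}
```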
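Integers are stored in two's complement, so you get the absolute value by doing nothing when the value is positive or zero, and by inverting all bits and then adding 1 when it is negative.
Example: put the absolute values of the 8 signed WORDs in XMM1 into XMM1
                   ; source positive or 0   ; source negative
pxor xmm2, xmm2
pcmpgtw xmm2, xmm1 ; xmm2 <- 0              ; xmm2 <- -1
pxor xmm1, xmm2    ; xor with 0 (no change) ; xor with -1 (invert all bits)
psubw xmm1, xmm2   ; subtract 0 (no change) ; subtract -1 (add 1)
Example (intrinsics): put the absolute values of the 4 signed DWORDs in the __m128i variable dwords4 into dwords4
const __m128i izero = _mm_setzero_si128();
__m128i tmp = _mm_cmpgt_epi32(izero, dwords4);
dwords4 = _mm_xor_si128(dwords4, tmp);
dwords4 = _mm_sub_epi32(dwords4, tmp);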
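The WORD version of the same trick looks like this in intrinsics. This is only a sketch with a made-up function name. Note that the most negative value (-32768) has no positive counterpart in 16 bits and comes back unchanged. On CPUs with SSSE3, PABSW (`_mm_abs_epi16`) does the job in one instruction.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Absolute value of 8 signed WORDs using compare, XOR and subtract. */
static inline __m128i abs_words(__m128i words8) {
    const __m128i zero = _mm_setzero_si128();
    __m128i neg = _mm_cmpgt_epi16(zero, words8); /* -1 where negative, else 0 */
    words8 = _mm_xor_si128(words8, neg);         /* invert bits of negatives  */
    return _mm_sub_epi16(words8, neg);           /* subtract -1 => add 1      */
}
```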
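Floating-point values are not stored in complement form, so clearing just the sign (the most significant bit) gives the absolute value.
Example: put the absolute values of the 4 single-precision values in XMM1 into XMM1
; data
align 16
signoffmask dd 4 dup (7fffffffH) ; mask that clears only the most significant bit
; code
andps xmm1, xmmword ptr signoffmask
Example (intrinsics): put the absolute values of the 4 single-precision values in the __m128 variable floats4 into floats4
const __m128 signmask = _mm_set1_ps(-0.0f); // 0x80000000
floats4 = _mm_andnot_ps(signmask, floats4);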
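The double-precision case works the same way, with the sign in bit 63. A minimal sketch (the name `abs_doubles` is hypothetical):

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Absolute value of 2 doubles: clear only the sign bit of each element. */
static inline __m128d abs_doubles(__m128d doubles2) {
    const __m128d signmask = _mm_set1_pd(-0.0);  /* only bit 63 set */
    return _mm_andnot_pd(signmask, doubles2);    /* (~signmask) & doubles2 */
}
```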
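In integer multiplication, unsigned and signed differ only in the high half of the product. The low half is the same for both, so one instruction serves both.
Unsigned WORD × WORD → high half: PMULHUW, low half: PMULLW
Signed WORD × WORD → high half: PMULHW, low half: PMULLW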
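If the full 32-bit products are needed, the low and high halves can be interleaved with the unpack instructions. The following is a sketch for the signed case. The function name and output-pointer interface are assumptions for this example. For the unsigned case, `_mm_mulhi_epu16` would take the place of `_mm_mulhi_epi16`.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Signed WORD x WORD -> DWORD: eight full 32-bit products. */
static inline void mul_words_full(__m128i a, __m128i b, __m128i *lo4, __m128i *hi4) {
    __m128i low  = _mm_mullo_epi16(a, b);  /* PMULLW: low 16 bits of each product  */
    __m128i high = _mm_mulhi_epi16(a, b);  /* PMULHW: high 16 bits (signed)        */
    *lo4 = _mm_unpacklo_epi16(low, high);  /* products of elements 0..3 */
    *hi4 = _mm_unpackhi_epi16(low, high);  /* products of elements 4..7 */
}
```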
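The MOV-family instructions just copy a bit pattern as it is, so why is there a separate instruction for each type?
Do they behave differently? No. From the software point of view they behave identically. It is fine to movaps an arbitrary bit pattern that may not be a valid floating-point value. It is also fine to load an XMMWORD holding single-precision data into an XMM register with movdqa and then do single-precision arithmetic on it. The specification explicitly states that moving data with an instruction of a different type is allowed.
The difference lies in the internal operation of the CPU, which is not directly visible from outside. Using the instruction of the correct type lets the CPU operate consistently internally, so processing may be faster. When you know the type, it is best to use the instruction for that type.
When there is no way to know what type of data a register holds, for example when saving and restoring XMM registers at the entry and exit of a subroutine, moving with an instruction of a different type still works correctly.
The same applies to the bitwise instructions.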
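As a small illustration of that interchangeability, the sketch below loads float data with an integer load, flips the sign bits with an integer XOR, and then treats the result as floats again. The function name is made up, and the compiler is free to pick different instructions than the ones named in the comments. The point is only that the bit patterns survive unchanged.

```c
#include <emmintrin.h>  /* SSE2 intrinsics (also provides the SSE ones) */

/* Negate four floats using only integer-typed load and bitwise ops. */
static inline __m128 negate_via_integer_ops(const float *p) {
    /* Integer-typed unaligned load (MOVDQU) applied to float data. */
    __m128i bits = _mm_loadu_si128((const __m128i *)p);
    /* Bit pattern of -0.0f is 0x80000000: only the sign bit is set. */
    __m128i signbits = _mm_castps_si128(_mm_set1_ps(-0.0f));
    /* Integer-typed XOR (PXOR) flips the sign of each float. */
    bits = _mm_xor_si128(bits, signbits);
    /* Reinterpret as floats; the cast performs no conversion. */
    return _mm_castsi128_ps(bits);
}
```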
Selecting the minimum or maximum can be done by getting a mask with a compare instruction and then applying bitwise operations.
Example: do a signed comparison between the 4 DWORDs in XMM1 and the 4 DWORDs in XMM2 and put the smaller of each pair in XMM1
; A=xmm1 B=xmm2    ; when A>B  ; when A<=B
movdqa xmm0, xmm1
pcmpgtd xmm1, xmm2 ; xmm1=-1   ; xmm1=0
pand xmm2, xmm1    ; xmm2=B    ; xmm2=0
pandn xmm1, xmm0   ; xmm1=0    ; xmm1=A
por xmm1, xmm2     ; xmm1=B    ; xmm1=A
Example (intrinsics): do a signed comparison between the 16 bytes in each of the __m128i variables a and b and put the larger of each pair in maxAB
__m128i mask = _mm_cmpgt_epi8(a, b);
__m128i selectedA = _mm_and_si128(mask, a);
__m128i selectedB = _mm_andnot_si128(mask, b);
__m128i maxAB = _mm_or_si128(selectedA, selectedB);
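The signed DWORD minimum from the assembly example can be sketched in intrinsics like this (the name `min_dwords` is hypothetical). With SSE4.1 available, PMINSD (`_mm_min_epi32`) does the same thing in a single instruction, so this form is mainly of interest when targeting plain SSE2.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Signed minimum of 4 DWORD pairs using compare + and/andnot/or. */
static inline __m128i min_dwords(__m128i a, __m128i b) {
    __m128i a_gt_b = _mm_cmpgt_epi32(a, b);       /* -1 where a > b, else 0 */
    __m128i pick_b = _mm_and_si128(a_gt_b, b);    /* b where a > b          */
    __m128i pick_a = _mm_andnot_si128(a_gt_b, a); /* a where a <= b         */
    return _mm_or_si128(pick_a, pick_b);
}
```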
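With AVX, many instructions can use a destination register separate from the source registers, which obviously removes the need for some movs, but there are other differences from SSE as well.
In general, a 128-bit AVX instruction zeroes the upper 128 bits of the destination register (for example, specifying XMM0 as the destination sets the upper 128 bits of YMM0 to 0). SSE instructions leave the upper bits untouched.
In general, SSE instructions require memory operands to be 16-byte aligned, whereas AVX instructions can execute without 16-byte alignment (except for instructions that explicitly require alignment, such as VMOVDQA). For performance, though, aligning the data is still advisable.
Mixing 256-bit AVX instructions, 128-bit AVX instructions and 128-bit SSE instructions can apparently cause a severe performance drop in some cases. When using 256-bit and 128-bit AVX instructions, it may be best to avoid 128-bit SSE instructions (rewrite them as 128-bit AVX instructions).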
Setting all bits to 1 can be done with the PCMPEQx instructions.
Example: set both QWORDs (or all 4 DWORDs, 8 WORDs, or 16 BYTEs) of XMM1 to -1
pcmpeqb xmm1, xmm1
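In intrinsics, the same idea is to compare a value with itself. A sketch, with a made-up function name; writing `_mm_set1_epi32(-1)` is the simpler alternative, and compilers commonly turn either form into a PCMPEQ of a register with itself.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Return a value with all 128 bits set. */
static inline __m128i all_ones(void) {
    __m128i x = _mm_setzero_si128();
    return _mm_cmpeq_epi32(x, x);  /* every element equals itself => all bits 1 */
}
```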
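Integer division by a constant can be computed quickly with an integer multiply plus a few more instructions.
Example: divide the 8 unsigned WORDs in the __m128i variable dividends by 7 and store the quotients in quotients
// unsigned WORD integer division by constant 7 (SSE2)
// __m128i dividends 8 unsigned WORDs
__m128i t = _mm_mulhi_epu16(_mm_set1_epi16(9363), dividends);
t = _mm_add_epi16(t, _mm_srli_epi16(_mm_sub_epi16(dividends, t), 1));
__m128i quotients = _mm_srli_epi16(t, 2);
You have to multiply by a magic number that depends on the divisor. The details of this algorithm and how to compute the magic number are explained thoroughly in the book "Hacker's Delight" by Henry S. Warren Jr.
I made an SSE/AVX intrinsics code generator based on that book.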
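To see why the divide-by-7 sequence works, note that 9363 = ceil(2^19 / 7) - 2^16 = 74899 - 65536. The full multiplier 74899 does not fit in 16 bits, so the subtract, shift and add steps put the missing 2^16 back without overflowing. The sketch below wraps the same three steps in a function, adds a scalar model of one element, and checks both against ordinary division for every 16-bit input. All function names here are hypothetical, and this is a check of this one divisor only, not the book's general method.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Vector version: the same three steps as the divide-by-7 example. */
static inline __m128i div7_words(__m128i dividends) {
    __m128i t = _mm_mulhi_epu16(_mm_set1_epi16(9363), dividends);
    t = _mm_add_epi16(t, _mm_srli_epi16(_mm_sub_epi16(dividends, t), 1));
    return _mm_srli_epi16(t, 2);
}

/* Scalar model of a single element, step for step. */
static inline uint16_t div7_scalar(uint16_t n) {
    uint16_t t = (uint16_t)(((uint32_t)n * 9363u) >> 16); /* like PMULHUW */
    t = (uint16_t)(t + ((uint16_t)(n - t) >> 1));         /* t + (n - t) / 2 */
    return (uint16_t)(t >> 2);
}

/* Returns 1 if both versions equal n / 7 for every 16-bit n, else 0. */
static int div7_check_all(void) {
    for (uint32_t base = 0; base < 65536; base += 8) {
        uint16_t in[8], out[8];
        for (int i = 0; i < 8; i++) in[i] = (uint16_t)(base + i);
        __m128i q = div7_words(_mm_loadu_si128((const __m128i *)in));
        _mm_storeu_si128((__m128i *)out, q);
        for (int i = 0; i < 8; i++) {
            if (out[i] != in[i] / 7 || div7_scalar(in[i]) != in[i] / 7) return 0;
        }
    }
    return 1;
}
```

Calling `div7_check_all()` once in a unit test is a cheap way to validate a magic number before relying on it, since the 16-bit input space is small enough to test exhaustively.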