'''AVX-512''' are 512-bit extensions to the 256-bit [[Advanced Vector Extensions]] [[SIMD]] instructions for [[x86]] [[instruction set architecture]] (ISA) proposed by [[Intel Corporation|Intel]] in July 2013, and first implemented in the 2016 Intel [[Xeon Phi#Knights Landing|Xeon Phi x200]] (Knights Landing),<ref name="reinders512">{{cite web|url=https://software.intel.com/en-us/articles/intel-avx-512-instructions|title=AVX-512 Instructions|author=James Reinders|date=23 July 2013|publisher=[[Intel]]|access-date=20 August 2013}}</ref> and then later in a number of [[AMD]] and other Intel CPUs ([[#CPUs with AVX-512|see list below]]). AVX-512 consists of multiple extensions that may be implemented independently.{{sfn|Kusswurm|2022|p=223}} This policy is a departure from the historical requirement of implementing the entire instruction block. Only the core extension AVX-512F (AVX-512 Foundation) is required by all AVX-512 implementations.
Besides widening most 256-bit instructions, the extensions introduce various new operations, such as new data conversions, [[gather-scatter|scatter]] operations, and permutations.{{sfn|Kusswurm|2022|p=223}} The number of AVX registers is increased from 16 to 32, and eight new "mask registers" are added, which allow for variable selection and blending of the results of instructions. In CPUs with the vector length (VL) extension—included in most AVX-512-capable processors (see {{section link||CPUs with AVX-512}})—these instructions may also be used on the 128-bit and 256-bit vector sizes.
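The effect of an opmask register can be modelled in scalar code. The following C sketch (an illustration written for this explanation, with an invented function name, not Intel's definition) shows zero-masked addition over eight 64-bit lanes, as a 512-bit <code>VPADDQ</code> with a <code>{k}{z}</code> mask qualifier would perform it:

```c
#include <stdint.h>

/* Illustrative scalar model of a zero-masked 512-bit packed add
   (roughly VPADDQ zmm {k}{z}): lanes whose mask bit is set receive
   a[i] + b[i]; masked-off lanes are zeroed (zero-masking). With
   merge-masking, masked-off lanes would instead keep their old value. */
static void masked_add_epi64(int64_t dst[8], const int64_t a[8],
                             const int64_t b[8], uint8_t mask)
{
    for (int i = 0; i < 8; i++)
        dst[i] = ((mask >> i) & 1) ? a[i] + b[i] : 0;
}
```

The same masking model applies at 128-bit and 256-bit widths on processors with the VL extension, only with fewer lanes.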
AVX-512 is not the first 512-bit SIMD instruction set that Intel has introduced in processors: the earlier 512-bit SIMD instructions used in the first generation [[Xeon Phi]] coprocessors, derived from Intel's [[Larrabee (microarchitecture)|Larrabee]] project, are similar but not binary compatible and only partially source compatible.<ref name="reinders512" /> The successor to AVX-512 is [[Advanced Vector Extensions#AVX10|AVX10]], announced July 2023,<ref>{{cite web |url=https://www.anandtech.com/show/18975/intel-unveils-avx10-and-apx-isas-unifying-avx512-for-hybrid-architectures- |title=Intel Unveils AVX10 and APX Instruction Sets: Unifying AVX-512 For Hybrid Architectures |last=Bonshor |first=Gavin |date=2023-07-25 |publisher=[[AnandTech]] |access-date=2024-08-21}}</ref> which will work on both performance and [[Alder Lake#History|efficiency]] cores.
==Instruction set==
The AVX-512 instruction set consists of several separate sets, each with its own unique CPUID feature bit:
; F, CD, ER, PF: Introduced with [[Xeon Phi#Knights Landing|Xeon Phi x200 (Knights Landing)]] and Xeon Gold/Platinum ([[Skylake (microarchitecture)#Skylake-SP (14 nm) Scalable Performance|Skylake SP]] "Purley"), with the last two (ER and PF) being specific to Knights Landing.
:* ''AVX-512 Foundation (F)''{{snd}} expands most 32-bit and 64-bit based [[Advanced Vector Extensions|AVX]] instructions with the [[EVEX prefix|EVEX]] coding scheme to support 512-bit registers, operation masks, parameter broadcasting, and embedded rounding and exception control; implemented by Knights Landing and Skylake Xeon
:* ''AVX-512 Conflict Detection Instructions (CD)''{{snd}} efficient conflict detection to allow more loops to be [[Array programming|vectorized]]; implemented by Knights Landing<ref name="reinders512"/> and Skylake X
:* ''AVX-512 [[Exponential function|Exponential]] and [[Multiplicative inverse|Reciprocal]] Instructions (ER)''{{snd}} exponential and reciprocal operations designed to help implement [[Transcendental function|transcendental]] operations; implemented by Knights Landing<ref name="reinders512"/>
:* ''AVX-512 Prefetch Instructions (PF)''{{snd}} new prefetch capabilities; implemented by Knights Landing<ref name="reinders512"/>
; VL, DQ, BW: Introduced with Skylake X and [[Cannon Lake (microarchitecture)|Cannon Lake]].
:* ''AVX-512 Vector Length Extensions (VL)''{{snd}} extends most AVX-512 operations to also operate on XMM (128-bit) and YMM (256-bit) registers<ref name="reinders512b"/>
:* ''AVX-512 Doubleword and Quadword Instructions (DQ)''{{snd}} adds new 32-bit and 64-bit AVX-512 instructions<ref name="reinders512b"/>
:* ''AVX-512 Byte and Word Instructions (BW)''{{snd}} extends AVX-512 to cover 8-bit and 16-bit integer operations<ref name="reinders512b">{{cite web|url=https://software.intel.com/en-us/articles/additional-intel-avx-512-instructions|title=Additional AVX-512 instructions|author=James Reinders|date=17 July 2014|publisher=[[Intel]]|access-date=3 August 2014}}</ref>
; IFMA, VBMI: Introduced with [[Cannon Lake (microarchitecture)|Cannon Lake]].<ref>{{cite web|url=https://www.kitguru.net/components/cpu/anton-shilov/intel-skylake-processors-for-pcs-will-not-support-avx-512-instructions/|title=Intel 'Skylake' processors for PCs will not support AVX-512 instructions|author=Anton Shilov|website=Kitguru.net| access-date=2015-03-17}}</ref>
:* ''AVX-512 Integer [[Fused Multiply Add]] (IFMA)''{{snd}} fused multiply–add of integers using 52-bit precision
:* ''AVX-512 Vector Byte Manipulation Instructions (VBMI)''{{snd}} adds vector byte permutation instructions that were not present in AVX-512BW
; 4VNNIW, 4FMAPS: Introduced with [[Xeon Phi#Knights Mill|Knights Mill]].<ref>{{cite web|url=https://lemire.me/blog/2016/10/14/intel-will-add-deep-learning-instructions-to-its-processors/|title = Intel will add deep-learning instructions to its processors| date=14 October 2016 }}</ref><ref name="newisa"/>
:* ''AVX-512 Vector Neural Network Instructions Word variable precision (4VNNIW)''{{snd}} vector instructions for deep learning, enhanced word, variable precision
:* ''AVX-512 Fused Multiply Accumulation Packed Single precision (4FMAPS)''{{snd}} vector instructions for deep learning, floating point, single precision
; VPOPCNTDQ: Vector [[Hamming weight|population count]] instruction. Introduced with Knights Mill and [[Ice Lake (microarchitecture)|Ice Lake]].<ref name="iaiseaffpr"/>
; VNNI, VBMI2, BITALG: Introduced with Ice Lake.<ref name="iaiseaffpr"/>
:* ''AVX-512 Vector Neural Network Instructions (VNNI)''{{snd}} vector instructions for deep learning
:* ''AVX-512 Vector Byte Manipulation Instructions 2 (VBMI2)''{{snd}} byte/word load, store and concatenation with shift
:* ''AVX-512 Bit Algorithms (BITALG)''{{snd}} byte/word [[bit manipulation]] instructions expanding VPOPCNTDQ
; VP2INTERSECT: Introduced with Tiger Lake.
:* ''AVX-512 Vector Pair Intersection to a Pair of Mask Registers (VP2INTERSECT)''
; GFNI, VPCLMULQDQ, VAES: Introduced with Ice Lake.<ref name="iaiseaffpr"/>
:* These are not AVX-512 features per se. Together with AVX-512, they enable EVEX-encoded versions of GFNI, [[CLMUL instruction set|PCLMULQDQ]] and AES instructions.
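Because each subset has its own CPUID feature bit, software must test for the exact combination it needs rather than assume one monolithic "AVX-512" flag. The bit positions below are the documented CPUID leaf 7 (EBX) assignments; the decoding helper itself is a sketch with invented names, shown on a caller-supplied register value rather than executing <code>CPUID</code> directly:

```c
#include <stdint.h>

/* Selected AVX-512 feature bits reported in CPUID leaf 7
   (EAX=7, ECX=0), register EBX. */
#define AVX512F_BIT   (1u << 16)  /* Foundation (required by all) */
#define AVX512DQ_BIT  (1u << 17)
#define AVX512PF_BIT  (1u << 26)
#define AVX512ER_BIT  (1u << 27)
#define AVX512CD_BIT  (1u << 28)
#define AVX512BW_BIT  (1u << 30)
#define AVX512VL_BIT  (1u << 31)

/* Every AVX-512 implementation must provide the Foundation subset,
   so this is the minimum check before using any EVEX-encoded code. */
static int has_avx512f(uint32_t leaf7_ebx)
{
    return (leaf7_ebx & AVX512F_BIT) != 0;
}
```

On GCC and Clang, the same checks are available without manual bit twiddling through <code>__builtin_cpu_supports("avx512f")</code> and friends.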
==Encoding and features==
| SSE–SSE4.2
| xmm0–xmm15
| single floats<br>''from SSE2:''
|-
! scope="row" | AVX-128 (VEX)
| AVX, AVX2
| ymm0–ymm15
| single float and double float<br>''from AVX2:''
|-
! scope="row" | AVX-128 (EVEX)
| AVX-512VL
| xmm0–xmm31<br>(k0–k7)
| doublewords, quadwords, single float and double float<br>''with AVX512BW:''
|-
! scope="row" | AVX-256 (EVEX)
| AVX-512VL
| ymm0–ymm31<br>(k0–k7)
| doublewords, quadwords, single float and double float<br>''with AVX512BW:''
|-
! scope="row" | {{nowrap|AVX-512 (EVEX)}}
| AVX-512F
| {{nowrap|zmm0–zmm31}}<br>(k0–k7)
| doublewords, quadwords, single float and double float<br>''with AVX512BW:''
|}
|-
! scope="col" | Instruction
! scope="col" | Extension
! scope="col" | Description
|-
| Minimum of packed signed/unsigned quadword
|-
| <code>VPROLD</code>, <code>VPROLVD</code>, <code>VPROLQ</code>, <code>VPROLVQ</code>, <code>VPRORD</code>, <code>VPRORVD</code>, <code>VPRORQ</code>, <code>VPRORVQ</code>
| F
| Bit rotate left or right
|-
| <code>VPSCATTERDD</code>, <code>VPSCATTERDQ</code>, <code>VPSCATTERQD</code>, <code>VPSCATTERQQ</code>
| F
| Scatter packed doubleword/quadword with signed doubleword/quadword indices
|-
| <code>VSCATTERDPS</code>, <code>VSCATTERDPD</code>, <code>VSCATTERQPS</code>, <code>VSCATTERQPD</code>
| F
| Scatter packed float32/float64 with signed doubleword/quadword indices
|}
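The rotate instructions above operate per lane, with the rotate count taken modulo the element width. A single 32-bit lane of <code>VPROLD</code> can be modelled in portable C as follows (an illustrative sketch; the function name is invented):

```c
#include <stdint.h>

/* Scalar model of one doubleword lane of VPROLD (rotate left).
   The count is reduced modulo 32, matching the instruction's
   per-element behaviour; the guard avoids the undefined shift by 32. */
static uint32_t rol32(uint32_t x, unsigned n)
{
    n &= 31;
    return n ? (x << n) | (x >> (32 - n)) : x;
}
```

The vector instruction simply applies this to each of the 16 doubleword lanes of a ZMM register, under an optional opmask.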
! scope="col" | Description
|-
| <code>VPCONFLICTD</code>, <code>VPCONFLICTQ</code>
| Detect conflicts within a vector of packed doubleword or quadword values
| Compares each element in the first source to all elements at the same or earlier positions in the second source and forms a bit vector of the results
|-
| <code>VPLZCNTD</code>, <code>VPLZCNTQ</code>
| Count the number of leading zero bits for packed double- or quadword values
| Vectorized <code>LZCNT</code> instruction
|-
| <code>VPBROADCASTMB2Q</code>, <code>VPBROADCASTMW2D</code>
| Broadcast mask to vector register
| Either 8-bit mask to quadword vector, or 16-bit mask to doubleword vector
|-
| <code>VEXP2PD</code>, <code>VEXP2PS</code>
| Compute approximate 2<sup>x</sup> of packed double-precision/single-precision floating-point values, with maximum relative error 2<sup>−23</sup>
|-
| <code>VRCP28PD</code>, <code>VRCP28PS</code>
! scope="col" | Description
|-
| <code>VGATHERPF0DPS</code>, <code>VGATHERPF0QPS</code>, <code>VGATHERPF0DPD</code>, <code>VGATHERPF0QPD</code>
| Using signed dword/qword indices, prefetch sparse byte memory locations containing single/double-precision data using opmask k1 and T0 hint.
|-
| <code>VGATHERPF1DPS</code>, <code>VGATHERPF1QPS</code>, <code>VGATHERPF1DPD</code>, <code>VGATHERPF1QPD</code>
| Using signed dword/qword indices, prefetch sparse byte memory locations containing single/double-precision data using opmask k1 and T1 hint.
|-
| <code>VSCATTERPF0DPS</code>, <code>VSCATTERPF0QPS</code>, <code>VSCATTERPF0DPD</code>, <code>VSCATTERPF0QPD</code>
| Using signed dword/qword indices, prefetch sparse byte memory locations containing single/double-precision data using writemask k1 and T0 hint with intent to write.
|-
| <code>VSCATTERPF1DPS</code>, <code>VSCATTERPF1QPS</code>, <code>VSCATTERPF1DPD</code>, <code>VSCATTERPF1QPD</code>
| Using signed dword/qword indices, prefetch sparse byte memory locations containing single/double-precision data using writemask k1 and T1 hint with intent to write.
|}
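The access pattern these instructions are meant to accelerate — index-driven loads from sparse memory locations — can be sketched portably with the GCC/Clang <code>__builtin_prefetch</code> builtin (the function name and look-ahead distance below are illustrative choices, not part of any specification):

```c
/* Portable sketch of the pattern VGATHERPF0DPS targets: prefetching
   sparse, index-driven loads a few iterations ahead of their use.
   The third argument (3) requests high temporal locality, roughly
   corresponding to the T0 hint. */
#if !defined(__GNUC__)
#define __builtin_prefetch(addr, rw, loc) ((void)0)  /* no-op elsewhere */
#endif

static float sum_indexed(const float *data, const int *idx, int n)
{
    float s = 0.0f;
    for (int i = 0; i < n; i++) {
        if (i + 8 < n)                                /* look ahead 8 */
            __builtin_prefetch(&data[idx[i + 8]], 0, 3);
        s += data[idx[i]];
    }
    return s;
}
```

Like the instructions themselves, the prefetch is purely a hint: removing it changes performance, never the result.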
|-
! scope="col" | Instruction
! scope="col" | Extension
! scope="col" | Description
|-
| <code>V4FMADDPS</code>, <code>V4FMADDSS</code>
| 4FMAPS
| Packed/scalar single-precision floating-point fused multiply-add (4-iterations)
|-
| <code>V4FNMADDPS</code>, <code>V4FNMADDSS</code>
| 4FMAPS
| Packed/scalar single-precision floating-point fused multiply-add and negate (4-iterations)
|-
! scope="col" | Instruction
! scope="col" | Extension
! scope="col" | Description
|-
|-
! scope="col" | Instruction
! scope="col" | Extension
! scope="col" | Description
|-
|-
! scope="col" | Instruction
! scope="col" | Extension
! scope="col" | Description
|-
! scope="col" | Description
|-
| <code>VP2INTERSECTD</code>, <code>VP2INTERSECTQ</code>
| VP2INTERSECT
| Compute intersection between doublewords/quadwords to a pair of mask registers
| Compute square root of packed/scalar FP16 numbers.
|-
| <code>VFMADD{132, 213, 231}PH</code>, <code>VFMADD{132, 213, 231}SH</code>
| Multiply-add packed/scalar FP16 numbers.
|-
| <code>VFNMADD{132, 213, 231}PH</code>, <code>VFNMADD{132, 213, 231}SH</code>
| Negated multiply-add packed/scalar FP16 numbers.
|-
| <code>VFMSUB{132, 213, 231}PH</code>, <code>VFMSUB{132, 213, 231}SH</code>
| Multiply-subtract packed/scalar FP16 numbers.
|-
| <code>VFNMSUB{132, 213, 231}PH</code>, <code>VFNMSUB{132, 213, 231}SH</code>
| Negated multiply-subtract packed/scalar FP16 numbers.
|-
| Multiply-add (odd vector elements) or multiply-subtract (even vector elements) packed FP16 numbers.
|-
| {{nowrap|<code>VFMSUBADD{132, 213, 231}PH</code>}}
| Multiply-subtract (odd vector elements) or multiply-add (even vector elements) packed FP16 numbers.
|-
|-
| <code>VRCPPH</code>, <code>VRCPSH</code>
| Compute approximate reciprocal of the packed/scalar FP16 numbers. The maximum relative error of the approximation is less than {{nowrap|2<sup>−11</sup> + 2<sup>−14</sup>}}.
|-
| <code>VRSQRTPH</code>, <code>VRSQRTSH</code>
| Compute approximate reciprocal square root of the packed/scalar FP16 numbers. The maximum relative error of the approximation is less than 2<sup>−14</sup>.
|}
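A low-precision reciprocal estimate is still useful because one Newton–Raphson step roughly squares the relative error (so 2<sup>−11</sup> becomes about 2<sup>−22</sup>). The scalar sketch below illustrates the refinement step; <code>crude_rcp</code> stands in for the hardware estimate, and the function name is invented for this example:

```c
/* One Newton-Raphson refinement of a reciprocal estimate:
   given y ~ 1/x, compute y * (2 - x*y), which approximately
   squares the relative error of the estimate. */
static float refine_rcp(float crude_rcp, float x)
{
    return crude_rcp * (2.0f - x * crude_rcp);
}
```

This is the standard way compilers and libraries turn the approximate <code>VRCP*</code> results into nearly full-precision divisions at a fraction of the cost of <code>VDIVPS</code>.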
| Select the minimum of each vertical pair of the source packed/scalar FP16 numbers.
|-
| <code>VFPCLASSPH</code>, <code>VFPCLASSSH</code>
| Test packed/scalar FP16 numbers for special categories (NaN, infinity, negative zero, etc.) and store the result in a mask register.
|}
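The categories such a classification instruction tests for can be modelled portably with the C standard library. In the sketch below the bit assignments and names are illustrative only — they are not the instruction's immediate-operand encoding:

```c
#include <math.h>

/* Illustrative scalar model of floating-point classification in the
   spirit of VFPCLASSPH/VFPCLASSSH: map a value to a category bit.
   Bit positions here are invented, not the hardware encoding. */
#define CLS_NAN      (1u << 0)
#define CLS_POS_ZERO (1u << 1)
#define CLS_NEG_ZERO (1u << 2)
#define CLS_POS_INF  (1u << 3)
#define CLS_NEG_INF  (1u << 4)

static unsigned fpclass(float x)
{
    if (isnan(x))  return CLS_NAN;
    if (isinf(x))  return x > 0 ? CLS_POS_INF : CLS_NEG_INF;
    if (x == 0.0f) return signbit(x) ? CLS_NEG_ZERO : CLS_POS_ZERO;
    return 0;  /* ordinary finite, nonzero value */
}
```

The hardware instruction performs this test on every lane at once and writes the per-lane results into a mask register, where they can gate subsequent operations.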
** [[Tiger Lake (microarchitecture)|Tiger Lake]] (except Pentium and Celeron, although some reviewers have published CPU-Z screenshots showing the Celeron 6305 with AVX-512 support<ref>{{cite web |title=Intel Celeron 6305 Processor (4M Cache, 1.80 GHz, with IPU) Product Specifications |url=https://ark.intel.com/content/www/us/en/ark/products/208646/intel-celeron-6305-processor-4m-cache-1-80-ghz-with-ipu.html |url-status=live |access-date=2020-11-10 |website=ark.intel.com |language=en|archive-url=https://web.archive.org/web/20201018025359/https://ark.intel.com/content/www/us/en/ark/products/208646/intel-celeron-6305-processor-4m-cache-1-80-ghz-with-ipu.html |archive-date=2020-10-18 }}</ref><ref>{{Citation |title=Laptop Murah Kinerja Boleh Diadu {{!}} HP 14S DQ2518TU | date=18 June 2021 |url=https://www.youtube.com/watch?v=q0HvFnvjyb0&t=119s |language=en |access-date=2021-08-08}}</ref>):<ref name="gcc">{{cite web |url=https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html |title=Using the GNU Compiler Collection (GCC): x86 Options |access-date=2019-10-14 |publisher=GNU}}</ref> AVX-512 F, CD, VL, DQ, BW, IFMA, VBMI, VBMI2, VPOPCNTDQ, BITALG, VNNI, VPCLMULQDQ, GFNI, VAES, VP2INTERSECT
** [[Alder Lake]] (never officially supported by Intel, completely removed in newer CPUs{{ref|adl-avx512-note|Note 1}}):<ref name='anandalderreview'>{{cite web|first1=Ian|last1=Cutress|first2=Andrei|last2=Frumusanu|title=The Intel 12th Gen Core i9-12900K Review: Hybrid Performance Brings Hybrid Complexity |url=https://www.anandtech.com/show/17047/the-intel-12th-gen-core-i912900k-review-hybrid-performance-brings-hybrid-complexity/2 |website=www.anandtech.com |access-date=5 November 2021}}</ref><ref>{{cite web|url=https://www.phoronix.com/scan.php?page=article&item=alder-lake-avx512&num=1|title=Intel Core i9 12900K "Alder Lake" AVX-512 On Linux|website=www.phoronix.com|first1=Michael|last1=Larabel|access-date=2021-11-08}}</ref> AVX-512 F, CD, VL, DQ, BW, IFMA, VBMI, VBMI2, VPOPCNTDQ, BITALG, VNNI, VPCLMULQDQ, GFNI, VAES, BF16, VP2INTERSECT, FP16
** [[Sapphire Rapids (microprocessor)|Sapphire Rapids]]
* [[Centaur Technology]]
** "CNS" core (8c/8t):<ref>{{cite web | url=https://centtech.com/ai-technology/ | archive-url=https://web.archive.org/web/20191212134128/https://centtech.com/ai-technology/ | url-status=usurped | archive-date=December 12, 2019 | title=The industry's first high-performance x86 SOC with server-class CPUs and integrated AI coprocessor technology | date=2 August 2022 }}</ref><ref name="instlatx64">{{cite web |title=x86, x64 Instruction Latency, Memory Latency and CPUID dumps (instlatx64) |url=http://users.atw.hu/instlatx64/ |website=users.atw.hu}}</ref> AVX-512 F, CD, VL, DQ, BW, IFMA, VBMI
<div style="overflow-x: scroll">
{| class="wikitable"
! {{vert header|Subset}}
! {{vert header|F}}
! {{vert header|CD}}
! {{vert header|ER}}
! {{vert header|PF}}
! {{vert header|4FMAPS}}
! {{vert header|4VNNIW}}
! {{vert header|VPOPCNTDQ}}
! {{vert header|VL}}
! {{vert header|DQ}}
! {{vert header|BW}}
! {{vert header|IFMA}}
! {{vert header|VBMI}}
! {{vert header|VNNI}}
! {{vert header|BF16}}
! {{vert header|VBMI2}}
! {{vert header|BITALG}}
! {{vert header|VPCLMULQDQ}}
! {{vert header|GFNI}}
! {{vert header|VAES}}
! {{vert header|VP2INTERSECT}}
! {{vert header|FP16}}
|-
| [[Xeon Phi#Knights Landing|Knights Landing]] {{nowrap|(Xeon Phi x200, 2016)}}
| colspan="2" rowspan="9" {{Yes}}
| colspan="2" rowspan="2" {{Yes}}
| colspan="17" {{No}}
|-
| [[Xeon Phi#Knights Mill|Knights Mill]] {{nowrap|(Xeon Phi x205, 2017)}}
| colspan="3" {{Yes}}
| colspan="14" {{No}}
|-
| [[Skylake (microarchitecture)#Skylake-SP (14 nm) Scalable Performance|Skylake-SP]], {{nowrap|[[Skylake (microarchitecture)#Mainstream desktop processors|Skylake-X]] (2017)}}
| colspan="4" rowspan="11" {{No}}
| rowspan="4" {{No}}
| colspan="9" {{No}}
|-
| {{nowrap|[[Cascade Lake (microarchitecture)|Cascade Lake]] (2019)}}
| rowspan="2" colspan="2" {{No}}
| rowspan="2" {{Yes}}
|-
| [[Alder Lake]] (2021)
| colspan="2" {{Partial|Partial{{ref|adl-avx512-note|Note 1}}}}
| colspan="15" {{Partial|Partial{{ref|adl-avx512-note|Note 1}}}}
|-
| [[Zen 4]] (2022)
| colspan="2" {{No}}
|-
| [[Sapphire Rapids (microprocessor)|Sapphire Rapids]] (2023)
| {{No}}
| {{Yes}}
[[Intel Advisor|Intel Vectorization Advisor]] (starting from version 2017) supports native AVX-512 performance and vector code quality analysis (for "Core", Xeon and [[Knights Landing microarchitecture|Intel Xeon Phi]] processors). Along with the traditional hotspot profile, Advisor Recommendations, and "seamless" integration of Intel Compiler vectorization diagnostics, the Advisor Survey analysis also provides AVX-512 ISA metrics and new AVX-512-specific "traits", e.g. Scatter, Compress/Expand, and mask utilization.<ref>{{cite web|url=https://software.intel.com/en-us/articles/intel-advisor-xe-2016-update-3-what-s-new|title=Intel Advisor XE 2016 Update 3 What's new - Intel Software|website=Software.intel.com|access-date=2016-10-20}}</ref><ref>{{cite web|url=https://software.intel.com/en-us/intel-advisor-xe|title=Intel Advisor - Intel Software|website=Software.intel.com|access-date=2016-10-20}}</ref>
On some processors (mostly pre-[[Ice Lake (microprocessor)|Ice Lake]] Intel), AVX-512 instructions can cause greater frequency throttling than their predecessors, imposing a penalty on mixed workloads. The additional downclocking is triggered by the 512-bit width of vectors.
C/[[C++]] compilers also automatically handle [[loop unrolling]] and the prevention of [[Instruction pipelining#Pipeline bubble|stalls in the pipeline]] in order to use AVX-512 most effectively, which means that a programmer who uses language [[Intrinsic function|intrinsics]] to force the use of AVX-512 can sometimes obtain worse performance than the code the compiler generates for loops written plainly in the source.<ref>{{cite AV media |people=Matthew Kolbe |date=2023-10-10 |title=Lightning Talk: How to Leverage SIMD Intrinsics for Massive Slowdowns - Matthew Kolbe - CppNow 2023 |url=https://www.youtube.com/watch?v=GleC3SZ8gjU |access-date=2023-10-15| via=YouTube| publisher=C++Now}}</ref> In other cases, using AVX-512 intrinsics in C/C++ code can improve performance relative to plainly written C/C++.
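The kind of loop compilers auto-vectorize well is one written plainly, with no manual lane management. A sketch (the function name is the conventional BLAS-style "saxpy", used here only as an example); built with, e.g., <code>-O3 -march=skylake-avx512</code> on GCC or Clang, this loop is typically compiled to masked AVX-512 vector code with no source changes:

```c
/* A plainly written loop like this is exactly what auto-vectorizers
   target: the compiler chooses the vector width, unrolls, and uses
   AVX-512 masking for the remainder iterations. The restrict
   qualifiers tell it the arrays do not alias. */
static void saxpy(float *restrict y, const float *restrict x,
                  float a, int n)
{
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```

Keeping the scalar source as the single definition also means the same code remains correct on processors without AVX-512.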
==Reception==