Features deferred to V 64 bit instruction encoding

Statically encoding SEW and LMUL
Predicates
- Predicating instructions with the complement of v0
- Predicating instructions with a register other than v0
  - Note: for straightforward implementations, this feature adds another register-file read port (or a map-table read port for renamed implementations)
- Two-input predicates? Useful in SIMT emulation (aggressive, interleaving diverged)
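A minimal sketch of the first two predicate options, written as a Python model rather than as any proposed encoding. The names `masked_add`, `vm` and `invert` are hypothetical, the register file is modelled as a dict of lists, and leaving inactive destination elements unchanged is an assumption made only for this illustration.

```python
def masked_add(vregs, vd, vs1, vs2, vm=0, invert=False):
    """Element-wise add under a predicate taken from register `vm`.

    vm     -- which vector register supplies the mask (v0 by default)
    invert -- if True, predicate on the complement of the mask
    Inactive elements of vd keep their old value (an assumption of this sketch).
    """
    mask = vregs[vm]  # the extra register read mentioned in the note above
    for i in range(len(vregs[vd])):
        active = bool(mask[i] & 1) != invert
        if active:
            vregs[vd][i] = vregs[vs1][i] + vregs[vs2][i]
    return vregs[vd]


def fresh_regs():
    # Small illustrative register file: v0 and v4 hold masks.
    return {
        0: [1, 0, 1, 0],
        1: [10, 20, 30, 40],
        2: [1, 2, 3, 4],
        3: [0, 0, 0, 0],
        4: [0, 1, 1, 0],
    }
```

With `vm=0, invert=False` this matches ordinary v0 masking; the other two combinations are what the deferred features would add.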
Memory addressing modes
- Indexed memory accesses that implicitly scale the index by SEW/8
- Indexed memory accesses that decouple index width from data width
- BaseReg + scale * IndexReg + offset
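The three addressing options above can be sketched as one per-element address computation. This is only an illustration: `indexed_addresses` and its parameters are hypothetical names, and treating indices as zero-extended from `index_bits` is an assumption (sign extension would be equally plausible).

```python
def indexed_addresses(base, indices, sew_bits, index_bits=None, scale=None, offset=0):
    """Byte address of each element: base + scale * index[i] + offset.

    sew_bits   -- data element width in bits (SEW)
    index_bits -- index element width; defaults to SEW (the coupled case)
    scale      -- defaults to SEW/8, i.e. the index counts elements, not bytes
    """
    if index_bits is None:
        index_bits = sew_bits
    if scale is None:
        scale = sew_bits // 8  # implicit scaling by SEW/8
    mask = (1 << index_bits) - 1  # zero-extend the index (assumption)
    return [base + scale * (idx & mask) + offset for idx in indices]
```

For example, `indexed_addresses(0x1000, [0, 1, 2, 3], 32)` steps through memory in 4-byte strides because the index is scaled by SEW/8, while passing `scale` and `offset` explicitly models the full BaseReg + scale * IndexReg + offset form.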
Combinatorial explosion of operand types
This has historically been the biggest reason why I (Ag) want more than 32 bits of instruction for vectors: all of the following are fairly simple and could fit in the 32-bit instruction format, but there are just too many of them!
- Mixed width, widening
  - e.g. vs1.8[i] * vs2.16[i] =+ vd.32[i]
    - signed X signed, signed X unsigned, unsigned X unsigned
- DSP datatypes, with saturation
  - SS: saturate signed N bits --> signed M bits, M < N
  - UU: saturate unsigned N bits --> unsigned M bits, M < N
  - US: saturate unsigned N bits --> signed M bits, M < N
  - SU: saturate signed N bits --> unsigned M bits, M < N
    - this is ReLU, a common function in DL
    - although this particular saturation would mainly be used at the end of a dot product
      - e.g. in a reduction, or in an actual dot product
- New FP types, including instructions with mixed FP types
  - single X single =+ double
  - FP16, BFLOAT16
    - fp16 X fp16 =+ {single, fp16}
    - bfloat16 X bfloat16 =+ {single, bfloat16}
    - fp16 X single =+ single
    - bfloat16 X single =+ single
  - eight bit floating-point types...
    - emerging standards? e.g. se4m3
    - https://en.wikipedia.org/wiki/Minifloat
- Mixed integer/fixed/floating point instructions
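To make the integer cases above concrete, here is a small Python model of a mixed-width widening multiply-accumulate (the `vs1.8[i] * vs2.16[i] =+ vd.32[i]` example) and of the four saturation modes. All function names and parameters are hypothetical; wrapping the accumulator to a signed 32-bit value is an assumption of this sketch, not something the list specifies.

```python
def wrap_signed(x, bits):
    """Wrap an integer to a two's-complement value of the given width."""
    x &= (1 << bits) - 1
    return x - (1 << bits) if x >> (bits - 1) else x


def interpret(raw, bits, signed):
    """Read a raw bit pattern as a signed or unsigned integer."""
    raw &= (1 << bits) - 1
    return wrap_signed(raw, bits) if signed else raw


def widening_mac(vd, vs1, vs2, w1=8, w2=16, s1=True, s2=True, acc_bits=32):
    """vd[i] += vs1[i] * vs2[i], with w1-bit and w2-bit sources.

    s1/s2 select signed or unsigned interpretation of each source, which
    gives the signed X signed, signed X unsigned, ... variants.
    """
    out = []
    for d, a, b in zip(vd, vs1, vs2):
        prod = interpret(a, w1, s1) * interpret(b, w2, s2)
        out.append(wrap_signed(d + prod, acc_bits))
    return out


def saturate(x, m_bits, to_signed):
    """Clamp x into the M-bit signed or unsigned range (SS/UU/US/SU)."""
    if to_signed:
        lo, hi = -(1 << (m_bits - 1)), (1 << (m_bits - 1)) - 1
    else:
        lo, hi = 0, (1 << m_bits) - 1
    return max(lo, min(hi, x))
```

The SU case, `saturate(x, m_bits, to_signed=False)` applied to a signed `x`, clamps negatives to zero as ReLU does and additionally clamps at the M-bit maximum. Even this small model has five independent knobs per instruction, which is the encoding-space problem the paragraph above describes.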
unums ??
complex
- chunky or interleaved (re,im) vs (im,re)
- planar or SOA
  - most common for existing GPU and/or vectors without complex support
  - e.g. planar vector-vector ops like add need four inputs and two outputs
    - but doing it as one instruction rather than decomposing improves ratio of compute to data movement
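A rough Python sketch of the two complex layouts, to show where the four-input, two-output shape comes from. The function names are hypothetical and plain lists stand in for vector registers.

```python
def add_interleaved(a, b):
    """Interleaved (re, im) layout: [re0, im0, re1, im1, ...].

    Complex add is an ordinary element-wise add on this layout.
    """
    return [x + y for x, y in zip(a, b)]


def add_planar(a_re, a_im, b_re, b_im):
    """Planar (SOA) layout: four input vectors, two output vectors."""
    out_re = [x + y for x, y in zip(a_re, b_re)]
    out_im = [x + y for x, y in zip(a_im, b_im)]
    return out_re, out_im


def mul_planar(a_re, a_im, b_re, b_im):
    """Planar complex multiply: (a_re + i a_im) * (b_re + i b_im)."""
    out_re = [ar * br - ai * bi for ar, ai, br, bi in zip(a_re, a_im, b_re, b_im)]
    out_im = [ar * bi + ai * br for ar, ai, br, bi in zip(a_re, a_im, b_re, b_im)]
    return out_re, out_im
```

Decomposed into existing three-operand instructions, `mul_planar` takes several multiplies and adds that each re-read the same source registers; a single instruction with four sources and two destinations would read each input once.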
Improved "scalar" support in vector registers
- e.g. instead of having reductions always write vd[0] and "wasting" the rest of vd, specify which vector element the reduction "scalar" should be written to
  - both statically, and dynamically as determined by another scalar
- similarly for "large scalars" that occupy more than one vector element * LMUL max, as occurs in some crypto instruction proposals
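A minimal sketch of the reduction case, assuming a sum reduction and assuming that the other elements of vd are left unchanged. `reduce_sum_into` and `dest_index` are hypothetical names; `dest_index` stands in for either a static field in the instruction or a value read from a scalar register.

```python
def reduce_sum_into(vd, vs, dest_index):
    """Return vd with sum(vs) written to element dest_index.

    All other elements of vd are preserved (an assumption of this sketch),
    so several reductions can be packed into one vector register.
    """
    out = list(vd)
    out[dest_index] = sum(vs)
    return out


# Pack two independent reductions into elements 0 and 1 of the same register.
vd = [0, 0, 0, 0]
vd = reduce_sum_into(vd, [1, 2, 3, 4], 0)
vd = reduce_sum_into(vd, [10, 20, 30, 40], 1)
```

After the two calls `vd` holds both results side by side, instead of each reduction occupying element 0 of its own register.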
More instructions with three inputs, non-source-destroying
- vd := vs1*vs2 + vs3
- vector BitBlt funnel "shift"
  - to use vectors for block copies without misaligned memory accesses

         e.g. vd[i] := concat(vs1,vs2)[i+offset], i := 0..VLEN/SEW-1
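The funnel "shift" formula can be modelled directly, along with the block-copy use it is meant for. This is a sketch under stated assumptions: lists stand in for registers and memory, vs1 is taken to supply the lower-numbered elements of the concatenation, and `copy_from_misaligned` is a hypothetical helper that assumes the source buffer is padded so the final aligned load stays in bounds.

```python
def funnel_shift(vs1, vs2, offset):
    """vd[i] := concat(vs1, vs2)[i + offset], for i in 0 .. len(vs1)-1."""
    n = len(vs1)
    assert len(vs2) == n and 0 <= offset <= n
    cat = vs1 + vs2  # vs1 holds the lower elements (assumption)
    return [cat[i + offset] for i in range(n)]


def copy_from_misaligned(mem, src, count, vl):
    """Copy `count` elements starting at a misaligned index `src`,
    using only loads aligned to `vl` elements plus funnel shifts."""
    offset = src % vl
    base = src - offset
    out = []
    for k in range(0, count, vl):
        lo = mem[base + k : base + k + vl]           # aligned load
        hi = mem[base + k + vl : base + k + 2 * vl]  # next aligned load
        out.extend(funnel_shift(lo, hi, offset))
    return out[:count]
```

In hardware the second aligned load of one iteration would be reused as the first load of the next; the sketch reloads it only to keep the loop short.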
