This repository has been archived by the owner on Mar 20, 2024. It is now read-only.

Features deferred to a 64-bit V instruction encoding

Andy Glew edited this page Mar 11, 2020 · 1 revision
  • Statically encoding SEW and LMUL

  • Predicates

    • Predicating instructions with the complement of v0
    • Predicating instructions with a register other than v0

    Note: in straightforward implementations, this feature adds another register-file read port (or another map-table read port in renamed implementations).

    • Two-input predicates? Useful in SIMT emulation (aggressive implementations that interleave diverged paths)
  • Memory addressing modes

    • Indexed memory accesses that implicitly scale the index by SEW/8
    • Indexed memory accesses that decouple index width from data width
    • BaseReg + scale * IndexReg + offset
  • Combinatorial explosion of operand types. This has historically been the biggest reason why I (AG) want more than 32 bits of instruction encoding for vectors: each of the following is fairly simple and could fit in the RV32 format, but there are just too many of them!

    • Mixed width, widening
      • e.g. vs1.8[i] * vs2.16[i] =+ vd.32[i]
        • signed X signed, signed X unsigned, unsigned X unsigned
    • DSP datatypes, with saturation
      • SS: saturate signed N bits --> signed M bits, M < N
      • UU: saturate unsigned N bits --> unsigned M bits, M < N
      • US: saturate unsigned N bits --> signed M bits, M < N
      • SU: saturate signed N bits --> unsigned M bits, M < N
        • this is essentially ReLU (plus a clamp at the top of the M-bit range), a common function in deep learning (DL)
        • although this particular saturation would mainly be used at the end of a dot product
          • e.g. in a reduction, or in an actual dot product
    • New FP types, including instructions with mixed FP types
      • single X single =+ double
      • FP16, BFLOAT16
        • fp16 X fp16 =+ {single, fp16}
        • bfloat16 X bfloat16 =+ {single, bfloat16}
        • fp16 X single =+ single
        • bfloat16 X single =+ single
      • 8-bit floating-point types...
    • Mixed integer/fixed/floating point instructions
  • unums ??

  • Complex numbers

    • chunky or interleaved, with (re,im) or (im,re) ordering
    • planar, or SoA (structure of arrays)
      • most common on existing GPUs and/or vector ISAs without complex support
      • e.g. planar vector-vector ops like add need four inputs and two outputs
        • but doing it as one instruction rather than decomposing it improves the ratio of compute to data movement
  • Improved "scalar" support in vector registers

    • e.g. instead of having reductions always write vd[0] and "waste" the rest of vd, specify which vector element the reduction's "scalar" result should be written to
      • both statically (encoded in the instruction) and dynamically (determined by another scalar)
    • similarly for "large scalars" that occupy more than one vector element * LMUL max, as occurs in some crypto instruction proposals
  • More instructions with three inputs that do not destroy a source operand

    • vd := vs1*vs2 + vs3
    • vector BitBlt funnel "shift"
      • to use vectors for block copies without misaligned accesses
         e.g. vd[i] := concat(vs1,vs2)[i+offset], i := 0..VLEN/SEW-1
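A minimal Python sketch of the predication variants under "Predicates", assuming masks are lists of booleans (one per element) and that masked-off elements of vd are left undisturbed. The `mask` argument can come from v0 or from any other register, and `complement=True` models predicating on the complement of the mask. `masked_add` and `two_input_predicate` are hypothetical names, and combining two predicates with AND is only one possible choice.

```python
# Hypothetical model of predicated vector execution (not a real ISA API).
# Masks are lists of bools; masked-off elements of vd keep their old value.

def masked_add(vd, vs1, vs2, mask, complement=False):
    """vd[i] = vs1[i] + vs2[i] where the (optionally complemented) mask bit
    is set. `mask` may be read from v0 or from any other vector register."""
    out = list(vd)
    for i, m in enumerate(mask):
        active = (not m) if complement else m
        if active:
            out[i] = vs1[i] + vs2[i]
    return out


def two_input_predicate(mask_a, mask_b):
    """One possible two-input predicate: an element is active only if it is
    active in both masks (e.g. nested divergence in SIMT emulation).
    Using AND here is an assumption, not a specified behaviour."""
    return [a and b for a, b in zip(mask_a, mask_b)]


if __name__ == "__main__":
    vd, a, b = [0, 0, 0, 0], [1, 2, 3, 4], [10, 20, 30, 40]
    m = [True, False, True, False]
    print(masked_add(vd, a, b, m))                   # predicate on m
    print(masked_add(vd, a, b, m, complement=True))  # predicate on ~m
```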
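The memory addressing modes listed above can be sketched as a gather over a byte array. This assumes little-endian memory and unsigned elements; `indexed_load` is a hypothetical helper. Leaving `scale` unset scales the index by SEW/8 (so the index counts elements rather than bytes), while an explicit `scale` and `offset` give BaseReg + scale * IndexReg + offset. Because the indices are plain integers, their width is independent of the data width SEW.

```python
# Hypothetical model of indexed (gather) loads with a scaled index.

def indexed_load(memory, base, indices, sew, scale=None, offset=0):
    """Gather SEW-bit little-endian unsigned elements from `memory`.

    Address of element i = base + scale * indices[i] + offset.
    If scale is None it defaults to SEW/8, i.e. the index is implicitly
    scaled to count elements. The indices may come from a vector of any
    element width, so index width is decoupled from data width."""
    nbytes = sew // 8
    if scale is None:
        scale = nbytes
    result = []
    for idx in indices:
        addr = base + scale * idx + offset
        if addr < 0 or addr + nbytes > len(memory):
            raise IndexError(f"address {addr} out of range")
        result.append(int.from_bytes(memory[addr:addr + nbytes], "little"))
    return result


if __name__ == "__main__":
    mem = bytearray(range(32))
    print([hex(x) for x in indexed_load(mem, 0, [0, 2, 1], 32)])
    print([hex(x) for x in indexed_load(mem, 0, [0, 4], 16, scale=1, offset=1)])
```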
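The mixed-width widening example, vs1.8[i] * vs2.16[i] =+ vd.32[i], is sketched below with one signedness flag per source, which covers signed X signed, signed X unsigned, and unsigned X unsigned. Values are raw bit patterns, and the 32-bit accumulator is assumed to wrap modulo 2^32 rather than saturate; the function names are hypothetical.

```python
# Hypothetical model of an 8-bit x 16-bit -> 32-bit widening multiply-accumulate.

def to_signed(x, bits):
    """Interpret the low `bits` bits of x as a two's-complement integer."""
    x &= (1 << bits) - 1
    return x - (1 << bits) if x >> (bits - 1) else x


def widening_macc(vd32, vs1_8, vs2_16, signed1=True, signed2=True):
    """vd.32[i] += vs1.8[i] * vs2.16[i], returning raw 32-bit patterns.
    signed1/signed2 select how each source's bits are interpreted."""
    out = []
    for acc, a, b in zip(vd32, vs1_8, vs2_16):
        a = to_signed(a, 8) if signed1 else a & 0xFF
        b = to_signed(b, 16) if signed2 else b & 0xFFFF
        # The product always fits in 24 bits; only the sum can overflow.
        out.append((acc + a * b) & 0xFFFF_FFFF)  # wrap modulo 2^32
    return out


if __name__ == "__main__":
    print(hex(widening_macc([0], [0xFF], [2])[0]))                 # -1 * 2
    print(widening_macc([0], [0xFF], [2], signed1=False))          # 255 * 2
    print(widening_macc([10], [3], [0xFFFF], signed2=False))       # 3 * 65535
```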
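The four saturating narrowings (SS, UU, US, SU) differ only in how the N-bit source is interpreted and which M-bit range it is clamped to. A minimal sketch with a hypothetical `saturate` helper, operating on raw bit patterns:

```python
# Hypothetical model of saturating narrowing from N bits to M bits (M < N).

def saturate(x, n, m, src_signed, dst_signed):
    """Clamp the N-bit pattern x into the M-bit signed or unsigned range.
    Returns the raw M-bit pattern of the clamped value."""
    if not m < n:
        raise ValueError("saturating narrowing requires M < N")
    x &= (1 << n) - 1
    # Interpret the source as signed or unsigned.
    v = x - (1 << n) if (src_signed and x >> (n - 1)) else x
    # Pick the destination range.
    if dst_signed:
        lo, hi = -(1 << (m - 1)), (1 << (m - 1)) - 1
    else:
        lo, hi = 0, (1 << m) - 1
    v = max(lo, min(hi, v))
    return v & ((1 << m) - 1)


if __name__ == "__main__":
    print(hex(saturate(0x8000, 16, 8, True, True)))    # SS: -32768 -> -128
    print(saturate(0xFFFF, 16, 8, False, False))       # UU: 65535 -> 255
    print(hex(saturate(0xFFFF, 16, 8, False, True)))   # US: 65535 -> 127
    print(saturate(0xFFFF, 16, 8, True, False))        # SU: -1 -> 0
```

With `src_signed=True` and `dst_signed=False` (SU), negative inputs clamp to 0, which is the ReLU-like behaviour noted in the list, with an extra clamp at 2^M - 1.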
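A sketch of planar (SoA) complex arithmetic, assuming plain Python lists as the vectors and hypothetical helper names. `complex_add_planar` has the four-input, two-output shape described under complex numbers. `complex_mul_planar` shows why fusing matters even more for multiply: decomposed into ordinary vector ops, each element needs four multiplies and two adds/subtracts (or two multiplies and two FMAs). `interleave` converts to the chunky (re,im) layout.

```python
# Hypothetical model of planar (structure-of-arrays) complex vector ops.

def complex_add_planar(a_re, a_im, b_re, b_im):
    """Four input vectors, two output vectors."""
    c_re = [x + y for x, y in zip(a_re, b_re)]
    c_im = [x + y for x, y in zip(a_im, b_im)]
    return c_re, c_im


def complex_mul_planar(a_re, a_im, b_re, b_im):
    """Elementwise (a_re + i*a_im) * (b_re + i*b_im)."""
    quads = list(zip(a_re, a_im, b_re, b_im))
    c_re = [ar * br - ai * bi for ar, ai, br, bi in quads]
    c_im = [ar * bi + ai * br for ar, ai, br, bi in quads]
    return c_re, c_im


def interleave(re, im):
    """Planar -> chunky/interleaved (re, im) layout."""
    out = []
    for r, i in zip(re, im):
        out += [r, i]
    return out


if __name__ == "__main__":
    print(complex_mul_planar([1], [2], [3], [4]))  # (1+2i)(3+4i)
    print(interleave([1, 2], [3, 4]))
```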
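A sketch of the improved "scalar" support: a sum reduction whose result goes to a selectable element of vd instead of always vd[0]. `dest_index` stands in for either a static immediate or a dynamic value read from another scalar register; the helper name and the policy of leaving the other elements of vd undisturbed are assumptions.

```python
# Hypothetical model of a reduction that writes a chosen element of vd.

def sum_reduce_into(vd, vs, init, dest_index):
    """vd[dest_index] = init + sum(vs); all other elements of vd are kept."""
    if not 0 <= dest_index < len(vd):
        raise IndexError("dest_index outside the destination vector")
    out = list(vd)
    out[dest_index] = init + sum(vs)
    return out


if __name__ == "__main__":
    # Pack four independent row sums into one vector register
    # instead of using a separate vd (and wasting the rest) for each.
    rows = [[1, 2], [3, 4], [5, 6], [7, 8]]
    acc = [0, 0, 0, 0]
    for k, row in enumerate(rows):
        acc = sum_reduce_into(acc, row, 0, k)
    print(acc)
```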
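The BitBlt funnel "shift", vd[i] := concat(vs1,vs2)[i+offset], can be sketched element-wise, assuming `offset` counts elements (bytes when SEW = 8) and lies in 0..n, where n = VLEN/SEW. `funnel_shift` follows the formula in the list; `realign` is a hypothetical helper showing the block-copy use: each output vector is built from two adjacent aligned source vectors, so a misaligned copy needs only aligned loads.

```python
# Hypothetical model of a vector funnel shift used to realign a block copy.

def funnel_shift(vs1, vs2, offset):
    """vd[i] = concat(vs1, vs2)[i + offset] for i in 0..n-1, n = len(vs1)."""
    n = len(vs1)
    if len(vs2) != n or not 0 <= offset <= n:
        raise ValueError("need len(vs1) == len(vs2) and 0 <= offset <= n")
    both = list(vs1) + list(vs2)
    return both[offset:offset + n]


def realign(chunks, offset):
    """Given aligned chunks of a source stream, produce output chunks that
    start `offset` elements into the stream (one fewer chunk than input)."""
    return [funnel_shift(chunks[k], chunks[k + 1], offset)
            for k in range(len(chunks) - 1)]


if __name__ == "__main__":
    src = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]
    print(realign(src, 3))
```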