Monday, July 7, 2025

sass instructions: uniform registers & wide loading

Having predicates for operand sizes and properties for type/identification, we could write register tracking (well, at least up to sm90). But first we should familiarize ourselves with a couple of CUDA-specific things

Uniform registers

For a concise introduction you can read this paper, especially section 3.5.2:
Turing introduces a new feature intended to improve the maximum achievable arithmetic throughput of the main, floating-point capable datapaths, by adding a separate, integer-only, scalar datapath (named the uniform datapath) that operates in parallel with the main datapath

Regular instructions can access both uniform and regular registers. Uniform datapath instructions, instead, focus on uniform instructions almost exclusively

So, for example, on SM75 you have 255 regular registers and 63 uniform registers UR0-UR62 (and URZ, clearly mapped to RZ) + uniform predicates UP0-UP6. Given that they are "typically updating array indices, loop indices or pointers" and that VRAM size can be up to 192 GB, one would expect this to be a whole new set of 64-bit wide registers for accessing arrays > 4 GB
 
Well, reality is much more boring - they are just a virtual mapping of regular 32-bit registers. Proof:

S2R R3, SR_TID.X
S2UR UR4, SR_CTAID.Y

Here both SR_XX are so-called "special registers", each 32 bits wide. Also, EIATTR_MAXREG_COUNT (itself 16-bit) always contains the value 0xff. I saw curious cases where "nvdisasm --print-life-ranges" shows GPR 223 and UGPR 35. If I can use a calculator, 223 + 35 = 258, which is more than 0xff (255)
I have zero ideas how those URs are mapped to real registers (and uniform predicates to ordinary predicates) - at least there are no EIATTRs for such a mapping. Obviously they are not mapped 1:1:

S2R R10, SR_CTAID.Z ; R10 now contains value from special register SR_CTAID.Z
ULDC.64 UR10, c[0x0][0x118]
IMAD.WIDE R2, R10, R3, c[0x0][0x168] ; and here its value is still alive

It's also totally unclear how functions get the initial values for these URs. I've written a parser for EIATTR_PARAM_CBANK & EIATTR_KPARAM_INFO for my nvd, and it seems that they are often loaded from exactly nowhere:

/*30*/  ULDC.64 UR4,c[0][0x118];
 ; unknown cb off 118

/*40*/  IMAD.WIDE R2,PT,R7,R6,c[0][0x168] &req={0};
 ; cb in section 254, offset 168 - 160 = 8

As you can see, the kernel parameters in the const bank start at 0x160, while UR4 was loaded from offset 0x118, which lies below that
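The offset arithmetic from the listing above can be sketched in a few lines of Python. This is only an illustration, not the actual nvd code: the function name, the return labels and the default base 0x160 (taken from the listing, where it presumably comes from EIATTR_PARAM_CBANK) are assumptions.

```python
def classify_cbank_offset(offset, param_base=0x160, param_size=None):
    """Classify a c[0][offset] reference relative to the kernel parameter area.

    param_base is assumed to be the start offset reported by EIATTR_PARAM_CBANK
    (0x160 in the listing above); param_size is optional and, when given,
    is assumed to be the size of the parameter area in bytes.
    """
    if offset < param_base:
        # Below the parameter area: the "unknown cb off" case from the listing.
        return ("below_params", offset)
    rel = offset - param_base
    if param_size is not None and rel >= param_size:
        # Past the end of the declared parameters.
        return ("past_params", rel)
    # Inside the parameter area: rel is the offset within the parameters.
    return ("param", rel)


if __name__ == "__main__":
    for off in (0x118, 0x168):
        kind, value = classify_cbank_offset(off)
        print(f"c[0][{off:#x}] -> {kind} ({value:#x})")
```

With the two offsets from the listing, 0x168 resolves to parameter offset 8, while 0x118 falls below the parameter area, so the parameter attributes say nothing about what is stored there.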

Wide loading 

The last example should shake the imagination - we know that UR4 is 32 bits wide, but ULDC.64 pretty obviously loads a 64-bit value - so where is it actually stored? Well, let's check the MD for ULDC:

FORMAT PREDICATE @[!]UniformPredicate(UPT):UPg Opcode /SZ_U8_S8_U16_S16_32_64("32"):sz
UniformRegister:URd
','C:Sa[UImm(5/0*):Sa_bank]*   [SImm(17)*:Sa_addr]

IDEST_SIZE = (( sz <= 4 ) ? 1 : ( ( sz == 5 ) ? 2 : 4 ))*32;

It loads the value into a pair of registers - UR4 & UR5. There are tons of wide modifiers for each LDxx, so it is much easier to use the IDEST_SIZE predicate to detect the real size of destination operands. Some may recall the very similar LDP/STP instructions from arm64 - yes, it's the right analogy, but IDEST_SIZE can be up to 128 bits, so you can load/store up to 4 registers in one instruction

One thing is still unclear - given that we load a 64-bit address, it should be laid out in memory as addr_low, addr_high - so which parts of it will UR4 & UR5 hold? This clearly needs to be checked with cuda-gdb. I also noticed that MDs always list the pair of destination registers as Rd2, Rd - so probably they are mapped as is, and Rd2 will hold addr_low and Rd - addr_high
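To make the addr_low/addr_high layout concrete, here is a small sketch that splits a 64-bit value into the two 32-bit words as they would sit in little-endian memory. It says nothing about which register of the pair receives which word, since that is exactly the part that still needs checking in cuda-gdb. The function name and the sample address are made up for illustration.

```python
import struct


def split_le64(value):
    """Split a 64-bit value into (low, high) 32-bit words.

    The value is packed as a little-endian 64-bit integer and read back as
    two little-endian 32-bit integers, so the first word is the one stored
    at the lower memory address.
    """
    low, high = struct.unpack("<II", struct.pack("<Q", value))
    return low, high


if __name__ == "__main__":
    addr = 0x0000000712345678  # made-up 64-bit address
    low, high = split_le64(addr)
    print(f"addr_low={low:#010x} addr_high={high:#010x}")
```

Comparing these two words with the contents of UR4 and UR5 under the debugger would settle which of them holds addr_low.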

6 comments:

  1. From my tests, EIATTR_MAXREG_COUNT = n means that you can use registers R0 ~ R⟨n-3⟩. This is because URs are always usable and one UR corresponds to 1/WARP_SIZE = 1/32 of a general purpose register, so (n-2) + 64 * 1/32 = n, as expected.

  2. interesting note - in that case URs are not mapped to ordinary registers but borrowed from the general register pool, right? Then the GPU must anyway keep track somewhere of which URs are available to each warp

  3. On my GPU, registers and uniform registers are always allocated from one big shared pool (e.g. 65536 in total per SM). For example, if one launches 1024 threads for a kernel, you can use at most 62 GPRs in the SASS code.

  4. yes, this is in good agreement with the observed results
    The next open question is where it borrows the additional uniform predicate registers from - I can't find the capacity of the predicate pool anywhere - have you seen such info somewhere?

  5. Predicate (and barrier) registers seem to be independent of the GPR pool. I guess that is because there are so few of them (predicates = 8, barriers = 16, cheap enough to keep them all).

  6. unlike uniform predicates, there is a limit for barriers in EIATTR_NUM_BARRIERS
