Having predicates for operand sizes and properties for type/identification, we could write register tracking (well, at least up to sm90). But first we should familiarize ourselves with a couple of CUDA-specific things.
Uniform registers
Turing introduces a new feature intended to improve the maximum achievable arithmetic throughput of the main, floating-point capable datapaths, by adding a separate, integer-only, scalar datapath (named the uniform datapath) that operates in parallel with the main datapath
Regular instructions can access both uniform and regular registers. Uniform datapath instructions, instead, focus on uniform instructions almost exclusively
S2R R3, SR_TID.X
S2UR UR4, SR_CTAID.Y
S2R R10, SR_CTAID.Z ; R10 now contains value from special register SR_CTAID.Z
ULDC.64 UR10, c[0x0][0x118]
IMAD.WIDE R2, R10, R3, c[0x0][0x168] ; and here its value is still alive
/*30*/ ULDC.64 UR4,c[0][0x118];
; unknown cb off 118
/*40*/ IMAD.WIDE R2,PT,R7,R6,c[0][0x168] &req={0};
; cb in section 254, offset 168 - 160 = 8
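The listings above suggest that a register tracker has to treat R10 and UR10 as two different registers: writing UR10 must not kill the value held in R10. A minimal sketch in Python of how that could look is given here. The `RegTracker` class, the `parse_reg` helper and the naive operand pattern are all hypothetical and are not taken from the actual tool. Only the first destination register of each instruction is recorded.

```python
import re

# Hypothetical operand pattern: "R10" is a regular register, "UR10" a uniform one.
# Special names such as RZ, URZ or PT deliberately do not match.
_REG = re.compile(r"^(UR|R)(\d+)$")


def parse_reg(name):
    """Return (bank, index), e.g. 'UR10' -> ('UR', 10), or None if not a numbered register."""
    m = _REG.match(name)
    return (m.group(1), int(m.group(2))) if m else None


class RegTracker:
    """Remembers what was last written to each register, keyed by (bank, index)."""

    def __init__(self):
        self.values = {}

    def write(self, reg, what):
        key = parse_reg(reg)
        if key is not None:  # ignore RZ/URZ and anything unparsable
            self.values[key] = what

    def read(self, reg):
        return self.values.get(parse_reg(reg))


tracker = RegTracker()
tracker.write("R3", "SR_TID.X")        # S2R R3, SR_TID.X
tracker.write("UR4", "SR_CTAID.Y")     # S2UR UR4, SR_CTAID.Y
tracker.write("R10", "SR_CTAID.Z")     # S2R R10, SR_CTAID.Z
tracker.write("UR10", "c[0x0][0x118]") # ULDC.64 UR10, c[0x0][0x118]

# R10 still holds SR_CTAID.Z when IMAD.WIDE reads it
print(tracker.read("R10"))
```

Keying by the (bank, index) pair rather than by the bare index is the whole point: the two register files share index numbers but not storage.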
Wide loading
FORMAT PREDICATE @[!]UniformPredicate(UPT):UPg Opcode /SZ_U8_S8_U16_S16_32_64("32"):sz
UniformRegister:URd
','C:Sa[UImm(5/0*):Sa_bank]* [SImm(17)*:Sa_addr]
IDEST_SIZE = (( sz <= 4 ) ? 1 : ( ( sz == 5 ) ? 2 : 4 ))*32;
ULDC.64 UR4 loads a value into a pair of registers, UR4 and UR5. There are tons of width modifiers for each LDxx, so it is much easier to use the IDEST_SIZE predicate to detect the real size of the destination operands. Some may recall the very similar LDP/STP instructions from arm64. Yes, that is the right analogy, but IDEST_SIZE can be up to 128 bits, so you can load/store up to 4 registers in one instruction.
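To make the IDEST_SIZE rule concrete, here is a small Python sketch that evaluates the same expression and expands a destination into the registers it covers. The numeric values in `SZ` are an assumption (enum members numbered in the order they appear in the name SZ_U8_S8_U16_S16_32_64), and the function names are hypothetical.

```python
# Assumed encoding of the /SZ_U8_S8_U16_S16_32_64 modifier: members numbered in order.
SZ = {"U8": 0, "S8": 1, "U16": 2, "S16": 3, "32": 4, "64": 5}


def idest_size_bits(sz):
    """IDEST_SIZE = ((sz <= 4) ? 1 : ((sz == 5) ? 2 : 4)) * 32"""
    if sz <= 4:
        regs = 1
    elif sz == 5:
        regs = 2
    else:
        regs = 4
    return regs * 32


def dest_registers(bank, first, sz):
    """List the consecutive 32-bit registers covered by a destination operand."""
    count = idest_size_bits(sz) // 32
    return [f"{bank}{first + i}" for i in range(count)]


# ULDC.64 UR4, c[0][0x118] covers a register pair
print(dest_registers("UR", 4, SZ["64"]))
```

With this enum the four-register branch is never reached; it only matters for instructions whose size modifier includes a 128-bit variant.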
One thing is still unclear: given that we load a 64-bit address, it should be laid out in memory as addr_low, addr_high, so which parts of it will UR4 and UR5 hold? This clearly needs checking with cuda-gdb. I also noticed that MDs always list a pair of destination registers as Rd2, Rd, so they are probably mapped as is, with Rd2 holding addr_low and Rd holding addr_high.
In my tests, EIATTR_MAXREG_COUNT = n means that you can use registers R0 ~ R⟨n-3⟩. This is because URs are always usable and one UR corresponds to 1/WARP_SIZE = 1/32 of a general-purpose register, so (n-2) + 64 * 1/32 = n, as expected.
Interesting note. In that case URs are not mapped to ordinary registers but borrowed from the general register pool, right? Then the GPU must still keep track somewhere of which URs are available to each warp.
On my GPU, registers and uniform registers are always allocated from one big shared pool (e.g. 65536 in total per SM). For example, if one launches 1024 threads for a kernel, one can use at most 62 GPRs in the SASS code.
Yes, this is in good agreement with the observed results.
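The arithmetic in this thread can be replayed with a few lines of Python. This is only a sketch of the reasoning in the comments: the constants (64 uniform registers, a warp size of 32, a pool of 65536 registers per SM) are the ones quoted above, and the function names are made up for illustration.

```python
WARP_SIZE = 32     # threads per warp
UNIFORM_REGS = 64  # uniform registers per warp, as used in the comment above


def ur_cost_per_thread():
    """Uniform registers are per warp, so each costs 1/WARP_SIZE of a per-thread GPR."""
    return UNIFORM_REGS // WARP_SIZE


def usable_gprs(maxreg_count):
    """GPRs left per thread once the uniform registers are paid for."""
    return maxreg_count - ur_cost_per_thread()


def per_thread_budget(pool_size, threads):
    """Registers available per thread when the pool is split evenly."""
    return pool_size // threads


budget = per_thread_budget(65536, 1024)  # 64 registers per thread
print(usable_gprs(budget))               # GPRs usable with 1024 threads
print(usable_gprs(budget) - 1)           # index of the highest usable register, i.e. n-3
```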
The next open question is where it borrows the additional uniform predicate registers from. I can't find the capacity of the predicate pool by googling. Have you seen such info anywhere?
Predicate (and barrier) registers seem to be independent of the GPR pool. I guess that is because there are so few of them (predicates = 8, barriers = 16) that it is affordable to hold them all.
Unlike uniform predicates, barriers do have a limit, given in EIATTR_NUM_BARRIERS.