Monday, July 7, 2025

sass instructions: uniform registers & wide loading

With predicates for operand sizes and properties for type/identification, we could write register tracking (well, at least up to sm90). But first we should familiarize ourselves with a couple of CUDA-specific things

Uniform registers

As a concise introduction you can read this paper, especially paragraph 3.5.2:
Turing introduces a new feature intended to improve the maximum achievable arithmetic throughput of the main, floating-point capable datapaths, by adding a separate, integer-only, scalar datapath (named the uniform datapath) that operates in parallel with the main datapath

Regular instructions can access both uniform and regular registers. Uniform datapath instructions, instead, focus on uniform instructions almost exclusively

So, for example, on SM75 you have 255 regular registers and 63 uniform registers UR0-UR62 (plus URZ, clearly mapped to RZ) and uniform predicates UP0-UP6. Given that they are meant for "typically updating array indices, loop indices or pointers" and that VRAM can be as large as 192 GB, one would expect a whole new set of 64-bit registers, able to address arrays larger than 4 GB
 
Well, reality is much more boring - they are just a virtual mapping of regular 32-bit registers. Proof:

S2R R3, SR_TID.X
S2UR UR4, SR_CTAID.Y

Here both SR_XX are so-called "special registers", each 32 bits wide. Also, EIATTR_MAXREG_COUNT (itself 16 bits) always contains the value 0xff. I have seen curious cases where "nvdisasm --print-life-ranges" shows GPR 223 and UGPR 35. If my calculator is right, 223 + 35 = 258, which is more than 0xff (255)
I have no idea how those URs are mapped to real registers (and uniform predicates to ordinary predicates) - at least there are no EIATTRs for such a mapping. Obviously they are not mapped 1:1:

S2R R10, SR_CTAID.Z ; R10 now contains value from special register SR_CTAID.Z
ULDC.64 UR10, c[0x0][0x118]
IMAD.WIDE R2, R10, R3, c[0x0][0x168] ; and here its value is still alive

It is also totally unclear how functions get the initial values for these URs. I've written a parser for EIATTR_PARAM_CBANK & EIATTR_KPARAM_INFO for my nvd, and it seems that they are often loaded from exactly nowhere:

/*30*/  ULDC.64 UR4,c[0][0x118];
 ; unknown cb off 118

/*40*/  IMAD.WIDE R2,PT,R7,R6,c[0][0x168] &req={0};
 ; cb in section 254, offset 168 - 160 = 8

As you can see, the param const bank starts at 0x160, while UR4 was loaded from offset 0x118, which lies below it
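The arithmetic in those two comments can be written down as a tiny sketch. The base 0x160 is taken from the sample above and is only an assumption for other SMs or toolkit versions; the function name classify_cb_offset is hypothetical and is not part of nvd.

```python
# Assumption: the kernel parameter area of const bank 0 starts at 0x160,
# as in the sample above. Other SMs/toolkits may use a different base.
PARAM_BASE = 0x160


def classify_cb_offset(offset, param_base=PARAM_BASE):
    """Return ('param', offset relative to the base) or ('unknown', absolute offset)."""
    if offset >= param_base:
        return ("param", offset - param_base)
    # Below the parameter area: nothing in the param attributes describes it
    return ("unknown", offset)


print(classify_cb_offset(0x168))  # ('param', 8)
print(classify_cb_offset(0x118))  # ('unknown', 280)
```

So 0x168 resolves to byte 8 of the parameters, while 0x118 falls outside of anything the param attributes describe.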

Wide loading 

Thursday, July 3, 2025

sass instructions properties

I've already described the so-called predicates. Unfortunately they carry only the sizes of operands. Unlike predicates, properties also have types:

 IDEST_OPERAND_MAP = (1<<INDEX(Rd));
 IDEST_OPERAND_TYPE = (1<<IOPERAND_TYPE_GENERIC);
 IDEST2_OPERAND_MAP = (1<<IOPERAND_MAP_NON_EXISTENT_OPERAND);
 IDEST2_OPERAND_TYPE = (1<<IOPERAND_TYPE_NON_EXISTENT_OPERAND);
 ISRC_B_OPERAND_MAP = (1<<INDEX(Rb));
 ISRC_B_OPERAND_TYPE = (1<<IOPERAND_TYPE_GENERIC);
 ISRC_C_OPERAND_MAP = (1<<INDEX(Rc));
 ISRC_C_OPERAND_TYPE = (1<<IOPERAND_TYPE_TEX);
 ISRC_A_OPERAND_MAP = (1<<INDEX(Ra));
 ISRC_A_OPERAND_TYPE = (1<<IOPERAND_TYPE_SURFACE_COORDINATES); 

This sample is for the suatom instruction. Here the destination has a single operand, so DEST2 is marked with NON_EXISTENT_OPERAND. Unfortunately, properties have a couple of serious drawbacks:

1) they were cut out by a paranoid NVidia somewhere around version 12.7-12.8, so I ripped MDs with properties only up to sm90; sm100, sm101 & sm120 don't have them. I also tried to re-apply properties from sm90 to the remaining three, but this is very unreliable

2) they are not complete. Let's look at a couple of samples

Friday, June 27, 2025

curse of IMAD

I found a strange case while disassembling some forms of IMAD (btw, the raison d'être of the GPU). The official nvdisasm shows:

IMAD.WIDE R2, R7, R6, c[0x0][0x168] ; /* 0x00005a0007027625 */

my nvd:

; IMAD line 63362 n 1196 15 render items 1 missed: wide
 /*40*/  IMAD R2,P7,R7,R6,c[0][0x168] &req={0}; 

The problem here is not only the missing P7 in the nvdisasm output - at least that operand has a default value:

CLASS "imad_wide__RRC_RRC"
FORMAT PREDICATE @[!]Predicate(PT):Pg Opcode /WIDEONLY:wide /FMT("S32"):fmt
Register:Rd
','Predicate("PT"):Pu
','Register:Ra {/REUSE("noreuse"):reuse_src_a}
','Register:Rb {/REUSE("noreuse"):reuse_src_b}
',' [-] C:Sc[UImm(5/0*):Sc_bank]*   [SImm(17)*:Sc_addr]

Both P7 & PT have the same value 7 (and btw wide does not have a corresponding encoding field). The mask for this instruction ends with "011000100101" - 0x5

The main problem is that IMAD with form Reg, Reg, Reg has a different mask:

CLASS "imad__RRC_RRC"
FORMAT PREDICATE @[!]Predicate(PT):Pg Opcode /LOOnly("LO"):wide /FMT("S32"):fmt
Register:Rd
','Register:Ra {/REUSE("noreuse"):reuse_src_a}
','Register:Rb {/REUSE("noreuse"):reuse_src_b}
',' [-] C:Sc[UImm(5/0*):Sc_bank]*   [SImm(17)*:Sc_addr] 

mask ends with "011000100100" - 0x4

As you can see, the original instruction bytes are 0x00005a0007027625 - nvdisasm just produced incorrect output
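Comparing the bytes with the two mask tails is easy to do by hand, but here is a minimal sketch anyway. It assumes that "ends with" refers to the lowest 12 bits of the 64-bit word printed by nvdisasm; the helper name opcode_tail is made up for illustration.

```python
# Mask tails quoted above for the two IMAD classes
IMAD_WIDE_TAIL = "011000100101"  # imad_wide__RRC_RRC
IMAD_TAIL = "011000100100"       # imad__RRC_RRC


def opcode_tail(raw, bits=12):
    """Lowest `bits` bits of the instruction word as a zero-padded binary string."""
    return format(raw & ((1 << bits) - 1), "0{}b".format(bits))


raw = 0x00005a0007027625  # bytes from the nvdisasm comment
tail = opcode_tail(raw)
print(tail)                    # 011000100101
print(tail == IMAD_WIDE_TAIL)  # True
print(tail == IMAD_TAIL)       # False
```

The low bits 0x625 match only the imad_wide class, so the encoding itself is unambiguous even though the two text forms differ by one bit.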

Why does this happen? My hypothesis is that Nvidia just doesn't have its own official sass assembler, so the output of nvdisasm is never used/verified

Sunday, June 15, 2025

nvdisasm sass parser

Having a sass assembler, it seems like an easy task to make a parser for it. So I made a parser for nvdisasm output
 
Let's check some samples:
SHF.R.S32.HI R209, RZ, 0x2, R209 ;
This looks like an easy application of an LL(1) parser - you first select the instruction, then process its optional enums (separated by dots) and then just try to match operands separated by commas, right? Well, no - the grammar of sass is not regular and there are lots of quirky cases

Instruction names with '.'

It's perfectly legal to meet the instructions "UIADD3" & "UIADD3.64". They have different encodings and are not even marked as ALTERNATE
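One way to cope with dotted instruction names is to try the longest dotted prefix first. This is only a sketch of that idea, not the code of my parser; split_opcode and the tiny dictionary are made up for illustration.

```python
def split_opcode(text, known_opcodes):
    """Split 'NAME.mod1.mod2' into (name, modifiers), preferring the longest known name."""
    parts = text.split(".")
    # Try "UIADD3.64" before "UIADD3": the longest dotted prefix wins
    for n in range(len(parts), 0, -1):
        name = ".".join(parts[:n])
        if name in known_opcodes:
            return name, parts[n:]
    raise ValueError("unknown opcode: " + text)


# Hypothetical dictionary; a real one would come from the per-SM instruction list
known = {"UIADD3", "UIADD3.64", "SHF"}
print(split_opcode("UIADD3.64", known))     # ('UIADD3.64', [])
print(split_opcode("SHF.R.S32.HI", known))  # ('SHF', ['R', 'S32', 'HI'])
```

This handles only the opcode itself; enum values that contain '.' need the same kind of treatment on the operand side.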

Pseudo opcodes

We can observe a totally indistinguishable enum:
PSEUDO_OPCODE "nopseudo_opcode"=0 , "SHL"=0 , "ISCADD"=0 , "IADD"=0 , "MOV"=0;
 
and samples of using:
Opcode /LOOnly("LO"):wide /PSEUDO_OPCODE("nopseudo_opcode"):pseudo_opcode
 
Btw, the operand pseudo_opcode doesn't even have a corresponding encoding field. In essence, instructions like IMAD.IADD, IMAD.MOV & IMAD.SHL have exactly the same encoding form. I don't know how nvdisasm selects PSEUDO_OPCODE - probably they borrowed a hallucination generator from ChatGPT

Enums can contain '.' too

Yes - enum names can be something like SR_CTAID.X, SR_CTAID.Y & SR_CTAID.Z

Operands not always separated with ','

BRX R2 -0x110 (*"INDIRECT_CALL"*) 

nvdisasm can't show some fields

Especially batch & pm_pred. A typical instruction tail looks like:
$( { '&' REQ:req '=' BITSET(6/0x0000):req_bit_set } )$
$( { '&' RD:rd '=' UImm(3/0x7):src_rel_sb } )$
$( { '&' WR:wr '=' UImm(3/0x7):dst_wr_sb } )$
$( { '?' USCHED_INFO("DRAIN"):usched_info } )$
$( { '?' BATCH_T("NOP"):batch_t } )$
$( { '?' PM_PRED("PMN"):pm_pred } )$
and the nvdisasm output contains only &wr=0x1 for WR, &rd=0x2 for RD and ?something for USCHED_INFO
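Picking those tail fields out of a text line can be sketched with a single regular expression. The field spellings (&req, &rd, &wr, ?name) follow the samples in this blog; the sample line is glued together for illustration and parse_tail is a hypothetical helper, not part of my parser.

```python
import re

# &req={...}, &rd=0x.., &wr=0x.. or ?USCHED_NAME
TAIL_RE = re.compile(r"&(req|rd|wr)=(\{[^}]*\}|0x[0-9a-fA-F]+|\d+)|\?(\w+)")


def parse_tail(text):
    """Collect scheduling fields found in the tail of one disassembly line."""
    fields = {}
    for m in TAIL_RE.finditer(text):
        if m.group(1):
            fields[m.group(1)] = m.group(2)  # raw text of the value
        else:
            fields["usched_info"] = m.group(3)
    return fields


# Illustrative line only, not copied from real nvdisasm output
line = "IMAD.WIDE R2, R7, R6, c[0x0][0x168] &req={0} &wr=0x1 ?WAIT15_END_GROUP"
print(parse_tail(line))
# {'req': '{0}', 'wr': '0x1', 'usched_info': 'WAIT15_END_GROUP'}
```

Fields such as batch_t and pm_pred never show up in the text, so a parser has to fall back to their defaults.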

Results

SM     parsing rate   avg forms
5      1.0            1.0
55     1.0            1.0
57     1.0            1.0
70     1.0            1.002404
75     1.0            1.018318
86     1.0            1.0
90     1.0            1.001589
100    1.0            1.016845
120    1.0            1.000225

Source of ambiguity

Let's run pa with options -Ssv to dump the original text and all matched forms. We can see something like:
BAR.SYNC.DEFER_BLOCKING 0x0
2 forms:
 19342 @Pg.D(7) BAR .E:barmode .E:defer_blocking Sb:UImm E:Rc.D(255) req_bit_set:BITSET src_rel_sb:UImm(7) E:usched_info E:batch_t.D(0) E:pm_pred.D(0)
 19286 @Pg.D(7) BAR .E:barmode .E:defer_blocking Sb:UImm ,Sc:UImm req_bit_set:BITSET src_rel_sb:UImm(7) E:usched_info E:batch_t.D(0) E:pm_pred.D(0)

The first form has an additional register operand with default value 255, and the second has yet another UImm operand Sc with default value 0 (UImm(12/0)*:Sc) - so they cannot be distinguished
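A toy model of why the two forms collapse into the same text, assuming that an operand equal to its default is simply not printed. The render function and the (value, default) encoding are invented for this sketch and have nothing to do with the real form descriptions.

```python
def render(opcode, operands):
    """operands: (value, default) pairs; default None means 'always printed'."""
    shown = [hex(value) for value, default in operands
             if default is None or value != default]
    return opcode + " " + ", ".join(shown)


# Form 19342: Sb, then register Rc with default 255
form_19342 = [(0x0, None), (255, 255)]
# Form 19286: Sb, then UImm Sc with default 0
form_19286 = [(0x0, None), (0, 0)]

a = render("BAR.SYNC.DEFER_BLOCKING", form_19342)
b = render("BAR.SYNC.DEFER_BLOCKING", form_19286)
print(a)       # BAR.SYNC.DEFER_BLOCKING 0x0
print(a == b)  # True
```

Once both trailing operands take their defaults, nothing is left in the text to tell the forms apart.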

Friday, May 30, 2025

nvidia sass assembler

I am very skeptical about patching existing .cubin files - it requires too much book-keeping. Let's say we want to insert several additional instructions into some function - then we need to

  1. extend the section containing the code of that function by patching the section table
  2. patch the symbol table/relocs
  3. disassemble the whole function and build a code-flow graph for all of its instructions
  4. fix offsets for jumps
  5. fix attributes like EIATTR_INDIRECT_BRANCH_TARGETS & EIATTR_JUMPTABLE_RELOCS
  6. and so on

While points 1-2 can be implemented with ELF patching libraries like elftools, it is still too much tedious labour

For example, CuAssembler prefers to create new .cubin files from scratch. In any case we need some engine to generate sass instructions, and this task is perfectly achievable when you have a ready disassembler. So I added some basic code generation features to my sass disasm engine:

  • dictionary of all instructions for given SM - method INV_disasm::get_instrs
  • for each instruction add encoders describing how to put values for fields, tables, constant banks & scheduling

As an illustration I've implemented an interactive sass assembler (with some help from readline for auto-completion)

Sunday, May 4, 2025

nvidia sass latency tables

It seems that latency values are the best kept secret - I was able to find only one article on the internet, and its author didn't provide any code to decipher those tables. So

Disclaimer

All of the following are the shaky conclusions of my dark mind, almost certainly false and having no connection to reality

 

What they look like

Descriptions of latency tables are located in the files *_2.txt and look like this:
TABLE_OUTPUT(UGPR) : UDP_subset`{URd @URdRange,URd2 @URd2Range}
                      R2UR_S2UR`{URd @URdRange,URd2 @URd2Range}
                       OP_R2UR_COUPLED`{URd @URdRange,URd2 @URd2Range}
                        ULDC_VOTEU_UMOV_ULEPC`{URd @URdRange,URd2 @URd2Range}=
{
    UDP_subset`{URd @URdRange,URd2 @URd2Range} : 1 4 7 7
    R2UR_S2UR`{URd @URdRange,URd2 @URd2Range} : 1 1 1 1
    OP_R2UR_COUPLED`{URd @URdRange,URd2 @URd2Range} : 4 4 1 10
    ULDC_VOTEU_UMOV_ULEPC`{URd @URdRange,URd2 @URd2Range} : 1 4 1 1
};
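Reading the body of such a table into something usable is straightforward; here is a minimal sketch. It only splits each row into a name and a list of integers and makes no attempt to decide what the numbers mean. parse_latency_rows is a hypothetical helper, and the real files may contain constructs this sketch does not handle.

```python
def parse_latency_rows(body):
    """Map each row name of a TABLE_OUTPUT body to its list of integer cells."""
    rows = {}
    for line in body.splitlines():
        line = line.strip()
        if not line or line in ("{", "};"):
            continue
        # The name itself contains braces and commas, so split on the last " : "
        name, sep, values = line.rpartition(" : ")
        if not sep:
            continue
        rows[name.strip()] = [int(v) for v in values.split()]
    return rows


body = """{
    UDP_subset`{URd @URdRange,URd2 @URd2Range} : 1 4 7 7
    R2UR_S2UR`{URd @URdRange,URd2 @URd2Range} : 1 1 1 1
    OP_R2UR_COUPLED`{URd @URdRange,URd2 @URd2Range} : 4 4 1 10
    ULDC_VOTEU_UMOV_ULEPC`{URd @URdRange,URd2 @URd2Range} : 1 4 1 1
};"""

rows = parse_latency_rows(body)
print(rows["OP_R2UR_COUPLED`{URd @URdRange,URd2 @URd2Range}"])  # [4, 4, 1, 10]
```

Each row has four cells, the same number as the entries listed in the table header; that the columns follow the header order is my assumption.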

Friday, April 18, 2025

nvidia sass disassembler, part 7: dual issued instructions

Previous parts: 1, 2, 3, 4, 5 & 6

As you may have noticed, the genuine nvdisasm puts pairs of instructions in curly braces for old SMs (always 88 bits). I finally realized how those dual-issued instructions are selected - the first one must have USCHED_INFO equal to 0x10 (floxy2)

Interestingly, newer SMs (since 70) lack the value 0x10:

 W15EG=15,
 WAIT15_END_GROUP=15,
 W1=17,
 trans1=17,
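The selection rule for old SMs (the first instruction of a pair has USCHED_INFO equal to 0x10) can be sketched like this. The instruction texts are made up, and group_dual_issue is a hypothetical helper that only illustrates the pairing, not how my disassembler is actually organized.

```python
DUAL_ISSUE = 0x10  # USCHED_INFO value that marks the first instruction of a pair


def group_dual_issue(instrs):
    """instrs: list of (text, usched_info). Returns groups of one or two texts."""
    groups = []
    i = 0
    while i < len(instrs):
        text, usched = instrs[i]
        if usched == DUAL_ISSUE and i + 1 < len(instrs):
            # Pair this instruction with the next one, like nvdisasm's { ... }
            groups.append([text, instrs[i + 1][0]])
            i += 2
        else:
            groups.append([text])
            i += 1
    return groups


# Made-up sample stream
stream = [("FADD R0, R1, R2", 0x10), ("MOV R3, R4", 0x0), ("EXIT", 0x0)]
for group in group_dual_issue(stream):
    print("{ " + " ; ".join(group) + " }" if len(group) == 2 else group[0])
```

With the sample stream this prints the first two instructions inside one pair of braces and EXIT on its own line.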

results