windows deep internals

пятница, 18 июля 2025 г.

sass instructions: registers tracking

I've add tracking of registers to both nvd & pa - you can use -T option. And I have lots of bad news

nvdisasm lies

Yup, again. Let's check this innocent looking code:

CS2R R100, SRZ

You can assume that it stores value from special to single 32bit regular register. Actually it used wide loading and store value to R100 & R101. bcs MD of CS2R looks like

FORMAT PREDICATE @[!]Predicate(PT):Pg Opcode /QInteger("64"):sz Register:Rd ','SpecialRegister:SRa

PREDICATES IDEST_SIZE = 32 + ((sz==`QInteger@"64"))*32;

QInteger has default value "64" and so omitted in output. Ok, maybe this is bcs special registers are 64bit? Lets check MD for S2UR to store value to uniform register:

FORMAT PREDICATE @[!]UniformPredicate(UPT):UPg Opcode UniformRegister:URd

PREDICATES IDEST_SIZE = 32;

It does not have width modifier at all and destination size is simple 32. So cs2r by default is 64bit and s2ur is 32bit. Srsly? Is it should be obvious? Highly likely this person now inventing mnemonics names for Intel

lack of documentation

In essence we have this short list of instructions and chapter 8.7 from ancient "The CUDA Handbook" (btw published in 2013). Properties & predicates make it a little easier to understand. Unfortunately they contains only info about regular and uniform registers. And we have yet several classes of instructions working with another kind of registers

Predicates

MD for ISETP:

FORMAT PREDICATE @[!]Predicate(PT):Pg Opcode /ICmpAll:icmp /REDUX_SZ("S32"):fmt /Bop:bop /EXONLY:ex Predicate:Pu ','Predicate:Pv

PREDICATES IDEST_SIZE = 0; IDEST2_SIZE = 0;

and for HSETP2:

FORMAT PREDICATE @[!]Predicate(PT):Pg Opcode /OFMT_F16_V2_BF16_V2("F16_V2"):ofmt /FCMP:cmp /H_AND("noh_and"):h_and /FTZ("noftz"):ft z /Bop:bop Predicate:Pu ','Predicate:Pv

PREDICATES IDEST_SIZE = 0; IDEST2_SIZE = 0;

I don't know if they set their first predicate Pu only or both Pu & Pv. Btw famous IMAD has very curious MD for some forms:

FORMAT PREDICATE @[!]Predicate(PT):Pg Opcode /HIONLY:wide /FMT("S32"):fmt /XONLY:X
Register:Rd
','Predicate("PT"):Pu
','Register:Ra {/REUSE("noreuse"):reuse_src_a}
','Register:Rb {/REUSE("noreuse"):reuse_src_b}
',' [~] Register:Rc {/REUSE("noreuse"):reuse_src_c}
',' [!]Predicate:Pp

Usually IMAD means multiply and add, so Rd = Ra * Rb + Rc. But here we have two predicates, so should it have semantic Rd = Ra * Rb * Pu + Rc * Pp?

Barriers

MD for BMOV:

FORMAT PREDICATE @[!]Predicate(PT):Pg Opcode /ONLY32:sz BD:barReg ','CBU_STATE_NONBAR:cbu_state

PREDICATES IDEST_SIZE = 0; IDEST2_SIZE = 0;

Enum BD described as

BD "B10"=10 , "B11"=11 , "B14"=14 , "B4"=4 , "B5"=5 , "B6"=6 , "B7"=7 , "B0"=0 , "B1"=1 , "B2"=2 , "B3"=3 , "B15"=15 , "B12"=12 , "B8"=8 , "B9"=9 , "B13"=13;

I am sure that there are more...

ptxas produces code that is far from perfect

Дальше »

понедельник, 7 июля 2025 г.

sass instructions: uniform registers & wide loading

Having predicates for operands size and properties for type/identification we could write register tracking (well, at least up to sm90). But before we should familiarize yourself with couple of CUDA specific things

Uniform registers

As concise introduction you can read this paper, especially paragraph 3.5.2:

Turing introduces a new feature intended to improve the maximum achievable arithmetic throughput of the main, floating-point capable datapaths, by adding a separate, integer-only, scalar datapath (named the uniform datapath) that operates in parallel with the main datapath

Regular instructions can access both uniform and regular registers. Uniform datapath instructions, instead, focus on uniform instructions almost exclusively

So for example on SM75 you have 255 regular registers and 63 uniform registers UR0-UR62 (and URZ clearly mapped to RZ) + uniform predicates UP0-UP6. Given that they "typically updating array indices, loop indices or pointers" and size of VRAM can be up to 192Gb someone would expect that this is whole new set of registers with width 64bit to access arrays > 4Gb

Well, reality is much more boring - they are just virtual mapping of regular 32bit registers. Proof:

S2R R3, SR_TID.X S2UR UR4, SR_CTAID.Y

Here both SR_XX are so called "special registers" with width 32bit. Also EIATTR_MAXREG_COUNT (being itself 16bit) always contains value 0xff. I saw curious cases when "nvdisasm --print-life-ranges" shows GPR 223 and UGPR 35. If I can use calculator 223 + 35 = 258

I have zero ideas how those URs are mapped to real registers (and uniform predicates to ordinal predicates) - at least there is no EIATTRs for such mapping. Obviously they are not mapped 1:1:

S2R R10, SR_CTAID.Z ; R10 now contains value from special register SR_CTAID.Z ULDC.64 UR10, c[0x0][0x118] IMAD.WIDE R2, R10, R3, c[0x0][0x168] ; and here it's value is still alive

Also it's totally unclear how functions get initial values for this URs. I`ve wrote for my nvd parser for EIATTR_PARAM_CBANK & EIATTR_KPARAM_INFO and it seems that often they are loaded exactly from nowhere:

/*30*/ ULDC.64 UR4,c[0][0x118]; ; unknown cb off 118

/*40*/ IMAD.WIDE R2,PT,R7,R6,c[0][0x168] &req={0}; ; cb in section 254, offset 168 - 160 = 8

as you can see const bank starts from 0x160 and UR4 was loaded from offset 0x118

Wide loading

Дальше »

четверг, 3 июля 2025 г.

sass instructions properties

I've already described so called predicates. Unfortunately they have only size of operands. Unlike predicates properties also have types:

IDEST_OPERAND_MAP = (1<<INDEX(Rd)); IDEST_OPERAND_TYPE = (1<<IOPERAND_TYPE_GENERIC); IDEST2_OPERAND_MAP = (1<<IOPERAND_MAP_NON_EXISTENT_OPERAND); IDEST2_OPERAND_TYPE = (1<<IOPERAND_TYPE_NON_EXISTENT_OPERAND); ISRC_B_OPERAND_MAP = (1<<INDEX(Rb)); ISRC_B_OPERAND_TYPE = (1<<IOPERAND_TYPE_GENERIC); ISRC_C_OPERAND_MAP = (1<<INDEX(Rc)); ISRC_C_OPERAND_TYPE = (1<<IOPERAND_TYPE_TEX); ISRC_A_OPERAND_MAP = (1<<INDEX(Ra)); ISRC_A_OPERAND_TYPE = (1<<IOPERAND_TYPE_SURFACE_COORDINATES);

This sample for suatom instruction. Here destination has single operand so DEST2 marked with NON_EXISTENT_OPERAND. Unfortunately properties has couple of serious drawbacks:

1) they were cut out by paranoid NVidia somewhere in version 12.7-12.8, so I ripped MDs with properties up to sm90 - sm100, sm101 & sm120 don't have them. I also tried to re-apply properties from sm90 to 3 remained - but this is very unreliable

2) they are not complete. Lets see couple of samples

Дальше »

пятница, 27 июня 2025 г.

curse of IMAD

Found strange case while disassembly some forms of IMAD (btw raison d'être of GPU). Official nvdisasm shows:

IMAD.WIDE R2, R7, R6, c[0x0][0x168] ; /* 0x00005a0007027625 */

my nvd:

; IMAD line 63362 n 1196 15 render items 1 missed: wide /*40*/ IMAD R2,P7,R7,R6,c[0][0x168] &req={0};

Problem here not only missed P7 - at least it has default value:

CLASS "imad_wide__RRC_RRC" FORMAT PREDICATE @[!]Predicate(PT):Pg Opcode /WIDEONLY:wide /FMT("S32"):fmt Register:Rd ','Predicate("PT"):Pu ','Register:Ra {/REUSE("noreuse"):reuse_src_a} ','Register:Rb {/REUSE("noreuse"):reuse_src_b} ',' [-] C:Sc[UImm(5/0*):Sc_bank]* [SImm(17)*:Sc_addr]

Both P7 & PT has the same value 7 (and btw wide does not have corresponding encoding field). Mask for this instruction ends with "011000100101" - 0x5

Main problem is that IMAD with form Reg, Reg, Reg has another mask:

CLASS "imad__RRC_RRC" FORMAT PREDICATE @[!]Predicate(PT):Pg Opcode /LOOnly("LO"):wide /FMT("S32"):fmt Register:Rd ','Register:Ra {/REUSE("noreuse"):reuse_src_a} ','Register:Rb {/REUSE("noreuse"):reuse_src_b} ',' [-] C:Sc[UImm(5/0*):Sc_bank]* [SImm(17)*:Sc_addr]

mask ends with "011000100100" - 0x4

As you can see original instruction bytes is 0x00005a0007027625 - nvdisasm just produced incorrect output

Why this happens? I have hypothesis that Nvidia just don't have own official sass asm and so output of nvdisasm never used/verified

воскресенье, 15 июня 2025 г.

nvdisasm sass parser

Having sass assembler it seems like easy task to make parser for it. So I made parser of nvdisasm output

Lets check some samples:

SHF.R.S32.HI R209, RZ, 0x2, R209 ;

Looks like easy application of LL(1) parser - you first select instruction, then process it's optional enums (separated by dots) and then just try to match operands separated by commas, right? Hwell, no - grammar of sass is not regular and we can have lots of quirky cases

Instruction names with '.'

It's perfectly legal to meet instructions "UIADD3" & "UIADD3.64". And they have different encodings and even not marked as ALTERNATE

Pseudo opcodes

We can observe totally non-distinguishable enum

PSEUDO_OPCODE "nopseudo_opcode"=0 , "SHL"=0 , "ISCADD"=0 , "IADD"=0 , "MOV"=0;

and samples of using:

Opcode /LOOnly("LO"):wide /PSEUDO_OPCODE("nopseudo_opcode"):pseudo_opcode

Btw operand pseudo_opcode don't even have corresponding encoding field. In essence instructions like IMAD.IADD, IMAD.MOV & IMAD.SHL have exactly the same encoding form. I don't know how nvdisasm selects PSEUDO_OPCODE - probably they borrowed hallucination generator from chatgpt

Enums can contain '.' too

Yes - enum names can be something like SR_CTAID.X, SR_CTAID.Y & SR_CTAID.Z

Operands not always separated with ','

BRX R2 -0x110 (*"INDIRECT_CALL"*)

nvidasm can't show some fields

especially batch & pm_pred. Typical instructions tail looks like:

$( { '&' REQ:req '=' BITSET(6/0x0000):req_bit_set } )$
$( { '&' RD:rd '=' UImm(3/0x7):src_rel_sb } )$
$( { '&' WR:wr '=' UImm(3/0x7):dst_wr_sb } )$
$( { '?' USCHED_INFO("DRAIN"):usched_info } )$
$( { '?' BATCH_T("NOP"):batch_t } )$
$( { '?' PM_PRED("PMN"):pm_pred } )$

and nvdisasm output contains only &wr=0x1 for WR, &rd=0x2 for RD and ?something for USCHED_INFO

Results

SM	parsing rate	avg forms
5	1.0	1.0
55	1.0	1.0
57	1.0	1.0
70	1.0	1.002404
75	1.0	1.018318
86	1.0	1.0
90	1.0	1.001589
100	1.0	1.016845
120	1.0	1.000225

Source of ambiguity

Lets run pa with options -Ssv to dump original text and all matched forms. We can see something like:

BAR.SYNC.DEFER_BLOCKING 0x0 
2 forms:
 19342 @Pg.D(7) BAR .E:barmode .E:defer_blocking Sb:UImm E:Rc.D(255) req_bit_set:BITSET src_rel_sb:UImm(7) E:usched_info E:batch_t.D(0) E:pm_pred.D(0)
 19286 @Pg.D(7) BAR .E:barmode .E:defer_blocking Sb:UImm ,Sc:UImm req_bit_set:BITSET src_rel_sb:UImm(7) E:usched_info E:batch_t.D(0) E:pm_pred.D(0)

The first form has additional register operand with default value 255 and second has yet another UImm operand Sc with default value 0 (UImm(12/0)*:Sc) - so they cannot be distinguished

пятница, 30 мая 2025 г.

nvidia sass assembler

I am very skeptical about patching of existing .cubin files - it requires too much book-keeping. Let's say we want to insert several additional instructions into some function - then we need

extend section containing code for those function by patching sections table
patch symbols table/relocs
disasm whole function and build code-flow graph for all instructions in function
fix offsets for jumps
fix attributes like EIATTR_INDIRECT_BRANCH_TARGETS & EIATTR_JUMPTABLE_RELOCS
and so on

While points 1-2 can be implemented with ELF patching libraries like elftools it is anyway too much tedious labour

For example CuAssembler prefers to create new .cubin files from scratch. In any case we need some engine to generate sass instructions and this task is perfectly achieve-able when you have ready disassembler. So I add to my sass disasm engine some primary features for code generation:

dictionary of all instructions for given SM - method INV_disasm::get_instrs
for each instruction add encoders describing how to put values for fields, tables, constant banks & scheduling

As illustration I've implemented interactive sass assembler (with some help of readline for auto-completion)

Дальше »

воскресенье, 4 мая 2025 г.

nvidia sass latency tables

It seems that latency values are the best kept secret - I was able to find only article in internet and author didn't provided any code to decipher those tables. So

Disclaimer

All of the following are the shaky conclusions of my dark mind, almost certainly false and having no connection to reality

How they are look like

Descriptions of latency tables are located in files *_2.txt and look like

TABLE_OUTPUT(UGPR) : UDP_subset`{URd @URdRange,URd2 @URd2Range}
                      R2UR_S2UR`{URd @URdRange,URd2 @URd2Range}
                       OP_R2UR_COUPLED`{URd @URdRange,URd2 @URd2Range}
                        ULDC_VOTEU_UMOV_ULEPC`{URd @URdRange,URd2 @URd2Range}=
{
    UDP_subset`{URd @URdRange,URd2 @URd2Range} : 1 4 7 7
    R2UR_S2UR`{URd @URdRange,URd2 @URd2Range} : 1 1 1 1
    OP_R2UR_COUPLED`{URd @URdRange,URd2 @URd2Range} : 4 4 1 10
    ULDC_VOTEU_UMOV_ULEPC`{URd @URdRange,URd2 @URd2Range} : 1 4 1 1
};

Дальше »

пятница, 18 июля 2025 г.

sass instructions: registers tracking

nvdisasm lies

lack of documentation

ptxas produces code that is far from perfect

понедельник, 7 июля 2025 г.

sass instructions: uniform registers & wide loading

Uniform registers

Wide loading

четверг, 3 июля 2025 г.

sass instructions properties

пятница, 27 июня 2025 г.

curse of IMAD

воскресенье, 15 июня 2025 г.

nvdisasm sass parser

Instruction names with '.'

Pseudo opcodes

Enums can contain '.' too

Operands not always separated with ','

nvidasm can't show some fields

Results

Source of ambiguity

пятница, 30 мая 2025 г.

nvidia sass assembler

воскресенье, 4 мая 2025 г.

nvidia sass latency tables

Disclaimer

How they are look like

пятница, 18 июля 2025 г.

понедельник, 7 июля 2025 г.

четверг, 3 июля 2025 г.

пятница, 27 июня 2025 г.

воскресенье, 15 июня 2025 г.

пятница, 30 мая 2025 г.

воскресенье, 4 мая 2025 г.