Wednesday, July 23, 2025

ced: sed-like cubin editor

Unfortunately, the only sass assembler I know of has several drawbacks:

  • it has been inactive for the last couple of years. I emailed its author and he didn't reply. I hope he is well
  • it doesn't support modern sm architectures (sm1xx)
  • its matmul solver sometimes produces wrong instructions
  • and it doesn't support many EIATTRs

The last problem is not specific to CuAssembler - it is more general: it seems that nvdisasm produces output that cannot be used to assemble cubin files

Also we still don't know the format of some sections like SHT_CUDA_RELOCINFO. All this makes the task of rebuilding cubin files very hard

However, do we really need to rebuild cubin files? In my experience 99.9% of desired patches just set or remove some instruction attributes like register reuse, caching policy, wait groups for USCHED_INFO etc - just boring tuning to squeeze out the last couple of percent of performance

So my train of thought went something like this:

  • it would be good to make a plugin for a hex editor to disassemble a sass instruction at some known offset and show a GUI where I could patch some fields
  • I am talentless at creating GUIs - so perhaps it would be better to dump the instruction fields in text form and then just edit that
  • hey - if you can parse this text representation and patch it back to sass, you don't need a hex editor at all - you could just use a sed-like tool to patch instructions via a script

and so, being lazy and impatient, I wrote such a tool - it's called ced. The name's similarity to sed is no coincidence - it lets you run a text script to patch or replace some sass instructions inside cubin files
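
The core idea of patching one field in place can be sketched in a few lines of Python. This is only an illustration, not ced's actual code or script syntax: it assumes 128-bit little-endian instruction words (as on Volta and later), and the field position used in the example is hypothetical.

```python
INSN_SIZE = 16  # bytes per instruction; assumes the 128-bit encoding

def get_field(buf, insn_off: int, bit_pos: int, width: int) -> int:
    """Read `width` bits starting at `bit_pos` of the instruction at `insn_off`."""
    word = int.from_bytes(buf[insn_off:insn_off + INSN_SIZE], "little")
    return (word >> bit_pos) & ((1 << width) - 1)

def set_field(buf: bytearray, insn_off: int, bit_pos: int, width: int, value: int) -> None:
    """Overwrite one bit field in place, leaving all other bits untouched."""
    if value >> width:
        raise ValueError("value does not fit into field")
    mask = ((1 << width) - 1) << bit_pos
    word = int.from_bytes(buf[insn_off:insn_off + INSN_SIZE], "little")
    word = (word & ~mask) | (value << bit_pos)
    buf[insn_off:insn_off + INSN_SIZE] = word.to_bytes(INSN_SIZE, "little")

# Hypothetical example: a 4-bit field at bit 122 of the second instruction
section = bytearray(32)  # two zeroed instructions standing in for a code section
set_field(section, 16, 122, 4, 0b0101)
print(get_field(section, 16, 122, 4))
```

Everything a sed-like script needs on top of this is a table that maps field names to bit positions for each instruction form.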

Saturday, July 19, 2025

sass instructions: LUT operations

I was asked yesterday why I didn't transform the sample from my previous post

iadd r8, r2, r8 ; r8 = r2 + r8
iadd r8, r8, r8 ; r8 = r8 + r8
iadd r8, r8, ur4 ; r8 = r8 + ur4

to the simpler

imad r8, r8, 2, ur4 ; r8 = r8 * 2 + ur4

While this is technically correct, the problem here is that the ISA is non-orthogonal. You can use my ina to check the available forms of IMAD with uniform registers - and we suddenly discover that it has only 2 forms

  1. @Pg IMAD E:wide E:fmt E:Rd E:Pu E:Ra E:reuse_src_a E:Rb E:reuse_src_b -E:URc
  2. @Pg IMAD E:wide E:fmt E:Rd E:Pu E:Ra E:reuse_src_a E:URb -E:Rc E:reuse_src_c

And there are no forms with an immediate value for Ra/Rb. So you can generate only something like:

imad r8, r8, rXX, ur4

And for UIMAD with immediate values we have forms with uniform registers only:

  1. @UPg UIMAD E:wide E:fmt E:X E:URd E:UPu E:URa ,Sb ~E:URc !E:UPp
  2. @UPg UIMAD E:wide E:fmt E:URd E:UPu E:URa ,Sb -E:URc
  3. etc

But all this is just kids' games compared to LUT operations. In short - you can have 256 combinations of logical operations over 3 operands, selected by an 8-bit index. nvdisasm shows them like:

LOP3.LUT R0, R3, R0, RZ, 0x30, !PT 

Very informative, yeah. So I employed sympy to generate a table of simplified expressions - however I am too old and lazy to write python scripts. So the pretty obvious solution:

  • make a perl script to enumerate all possible combinations and generate a python script
  • which in turn generates a string table
  • and then sed adds quotes and commas

And now my disasm shows much clearer output:
LOP3.LUT PT,R0,R3,R0,RZ, 0x30,!PT &req={5}; LUT 30: a & ~b
So here a = R3, b = R0 and the result is R0 = R3 & ~R0

Friday, July 18, 2025

sass instructions: registers tracking

I've added register tracking to both nvd & pa - you can use the -T option. And I have lots of bad news

nvdisasm lies

Yup, again. Let's check this innocent looking code:
CS2R R100, SRZ
You might assume that it stores a value from a special register to a single 32bit regular register. Actually it uses wide loading and stores the value to R100 & R101, because the MD of CS2R looks like

FORMAT PREDICATE @[!]Predicate(PT):Pg Opcode /QInteger("64"):sz
Register:Rd
','SpecialRegister:SRa

PREDICATES
 IDEST_SIZE = 32 + ((sz==`QInteger@"64"))*32;

QInteger has the default value "64" and so is omitted in the output. Ok, maybe this is because special registers are 64bit? Let's check the MD for S2UR, which stores a value to a uniform register:

FORMAT PREDICATE @[!]UniformPredicate(UPT):UPg Opcode
UniformRegister:URd

PREDICATES
 IDEST_SIZE = 32;

It does not have a width modifier at all and the destination size is simply 32. So cs2r is 64bit by default and s2ur is 32bit. Srsly? Is that supposed to be obvious? Highly likely this person is now inventing mnemonic names for Intel
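
For illustration, the IDEST_SIZE predicate of CS2R can be evaluated by hand like this. It is a minimal Python sketch, not code from nvd or pa; the helper names are made up, and it assumes a wide destination occupies consecutive registers (as with R100 & R101 above).

```python
def cs2r_idest_size(sz: str = "64") -> int:
    """IDEST_SIZE = 32 + (sz == "64") * 32; sz defaults to "64" when omitted."""
    return 32 + (sz == "64") * 32

def dest_registers(rd: int, size_bits: int) -> list:
    """List the consecutive 32-bit registers covered by a destination."""
    return [f"R{rd + i}" for i in range(size_bits // 32)]

# CS2R R100, SRZ - the modifier is omitted, so the default "64" applies
print(dest_registers(100, cs2r_idest_size()))
# S2UR has a fixed IDEST_SIZE = 32, so only one register is written
print(dest_registers(4, 32))
```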

 

lack of documentation

In essence we have this short list of instructions and chapter 8.7 from the ancient "The CUDA Handbook" (btw published in 2013). Properties & predicates make it a little easier to understand. Unfortunately they contain only info about regular and uniform registers. And we still have several classes of instructions working with other kinds of registers

Predicates

MD for ISETP:

FORMAT PREDICATE @[!]Predicate(PT):Pg Opcode /ICmpAll:icmp /REDUX_SZ("S32"):fmt /Bop:bop /EXONLY:ex
Predicate:Pu
','Predicate:Pv

PREDICATES
 IDEST_SIZE = 0;
 IDEST2_SIZE = 0;

and for HSETP2:

FORMAT PREDICATE @[!]Predicate(PT):Pg Opcode /OFMT_F16_V2_BF16_V2("F16_V2"):ofmt /FCMP:cmp /H_AND("noh_and"):h_and /FTZ("noftz"):ftz /Bop:bop
Predicate:Pu
','Predicate:Pv

PREDICATES
 IDEST_SIZE = 0;
 IDEST2_SIZE = 0;

I don't know whether they set only their first predicate Pu or both Pu & Pv. Btw the famous IMAD has a very curious MD for some forms:

FORMAT PREDICATE @[!]Predicate(PT):Pg Opcode /HIONLY:wide /FMT("S32"):fmt /XONLY:X
Register:Rd
','Predicate("PT"):Pu
','Register:Ra {/REUSE("noreuse"):reuse_src_a}
','Register:Rb {/REUSE("noreuse"):reuse_src_b}
',' [~] Register:Rc {/REUSE("noreuse"):reuse_src_c}
',' [!]Predicate:Pp

Usually IMAD means multiply and add, so Rd = Ra * Rb + Rc. But here we have two predicates, so should its semantics be Rd = Ra * Rb * Pu + Rc * Pp?

Barriers

MD for BMOV: 

FORMAT PREDICATE @[!]Predicate(PT):Pg Opcode /ONLY32:sz
BD:barReg
','CBU_STATE_NONBAR:cbu_state

PREDICATES
 IDEST_SIZE = 0;
 IDEST2_SIZE = 0;

Enum BD is described as
BD "B10"=10 , "B11"=11 , "B14"=14 , "B4"=4 , "B5"=5 , "B6"=6 , "B7"=7 , "B0"=0 , "B1"=1 , "B2"=2 , "B3"=3 , "B15"=15 , "B12"=12 , "B8"=8 , "B9"=9 , "B13"=13;
 
I am sure that there are more...


ptxas produces code that is far from perfect

Monday, July 7, 2025

sass instructions: uniform registers & wide loading

Having predicates for operand sizes and properties for type/identification, we could write register tracking (well, at least up to sm90). But first we should familiarize ourselves with a couple of CUDA-specific things

Uniform registers

As a concise introduction you can read this paper, especially paragraph 3.5.2:
Turing introduces a new feature intended to improve the maximum achievable arithmetic throughput of the main, floating-point capable datapaths, by adding a separate, integer-only, scalar datapath (named the uniform datapath) that operates in parallel with the main datapath

Regular instructions can access both uniform and regular registers. Uniform datapath instructions, instead, focus on uniform instructions almost exclusively

So for example on SM75 you have 255 regular registers and 63 uniform registers UR0-UR62 (and URZ clearly mapped to RZ) + uniform predicates UP0-UP6. Given that they are "typically updating array indices, loop indices or pointers" and the size of VRAM can be up to 192Gb, one would expect this to be a whole new set of 64bit-wide registers to access arrays > 4Gb
 
Well, reality is much more boring - they are just a virtual mapping of regular 32bit registers. Proof:

S2R R3, SR_TID.X
S2UR UR4, SR_CTAID.Y

Here both SR_XX are so-called "special registers" with a width of 32 bits. Also EIATTR_MAXREG_COUNT (itself 16 bits wide) always contains the value 0xff. I saw curious cases where "nvdisasm --print-life-ranges" shows GPR 223 and UGPR 35. If I can use a calculator, 223 + 35 = 258, which is more than 0xff
I have no idea how those URs are mapped to real registers (and uniform predicates to ordinary predicates) - at least there are no EIATTRs for such a mapping. Obviously they are not mapped 1:1:

S2R R10, SR_CTAID.Z ; R10 now contains value from special register SR_CTAID.Z
ULDC.64 UR10, c[0x0][0x118]
IMAD.WIDE R2, R10, R3, c[0x0][0x168] ; and here its value is still alive

Also it's totally unclear how functions get initial values for these URs. For my nvd I wrote a parser for EIATTR_PARAM_CBANK & EIATTR_KPARAM_INFO, and it seems that they are often loaded from exactly nowhere:

/*30*/  ULDC.64 UR4,c[0][0x118];
 ; unknown cb off 118

/*40*/  IMAD.WIDE R2,PT,R7,R6,c[0][0x168] &req={0};
 ; cb in section 254, offset 168 - 160 = 8

as you can see const bank starts from 0x160 and UR4 was loaded from offset 0x118
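
The check behind those two comments is simple arithmetic, sketched below in Python. This is not nvd's code: the function name is made up, the base 0x160 is taken from the sample above, and the parameter area size 0x10 is an assumed value for illustration.

```python
def describe_cb_offset(offset: int, param_base: int, param_size: int) -> str:
    """Classify a c[0][offset] access relative to the kernel parameter area."""
    if param_base <= offset < param_base + param_size:
        rel = offset - param_base
        return f"param area, offset {offset:x} - {param_base:x} = {rel:x}"
    return f"unknown cb off {offset:x}"

PARAM_BASE = 0x160  # start of the parameter area, as in the sample above
PARAM_SIZE = 0x10   # assumed size, for illustration only

print(describe_cb_offset(0x168, PARAM_BASE, PARAM_SIZE))
print(describe_cb_offset(0x118, PARAM_BASE, PARAM_SIZE))
```

Anything below the base, like 0x118, falls outside the described parameter area, which is why it is reported as unknown.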

Wide loading 

Thursday, July 3, 2025

sass instructions properties

I've already described the so-called predicates. Unfortunately they only contain operand sizes. Unlike predicates, properties also have types:

 IDEST_OPERAND_MAP = (1<<INDEX(Rd));
 IDEST_OPERAND_TYPE = (1<<IOPERAND_TYPE_GENERIC);
 IDEST2_OPERAND_MAP = (1<<IOPERAND_MAP_NON_EXISTENT_OPERAND);
 IDEST2_OPERAND_TYPE = (1<<IOPERAND_TYPE_NON_EXISTENT_OPERAND);
 ISRC_B_OPERAND_MAP = (1<<INDEX(Rb));
 ISRC_B_OPERAND_TYPE = (1<<IOPERAND_TYPE_GENERIC);
 ISRC_C_OPERAND_MAP = (1<<INDEX(Rc));
 ISRC_C_OPERAND_TYPE = (1<<IOPERAND_TYPE_TEX);
 ISRC_A_OPERAND_MAP = (1<<INDEX(Ra));
 ISRC_A_OPERAND_TYPE = (1<<IOPERAND_TYPE_SURFACE_COORDINATES); 

This sample is for the suatom instruction. Here the destination has a single operand, so DEST2 is marked with NON_EXISTENT_OPERAND. Unfortunately properties have a couple of serious drawbacks:
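
Before going through the drawbacks, here is how such OPERAND_MAP masks can be read, as a minimal Python sketch. It assumes that INDEX(x) is the position of operand x in the instruction's operand list; the operand order below is made up for illustration and is not taken from the real MD.

```python
def operand_map(operands: list, name: str) -> int:
    """Mimic (1<<INDEX(name)): a one-bit mask at the operand's position."""
    return 1 << operands.index(name)

def decode_map(operands: list, mask: int) -> list:
    """Return the names of the operands whose bits are set in mask."""
    return [op for i, op in enumerate(operands) if (mask >> i) & 1]

ops = ["Rd", "Ra", "Rb", "Rc"]  # assumed operand order, for illustration only

idest_map = operand_map(ops, "Rd")
isrc_c_map = operand_map(ops, "Rc")
print(decode_map(ops, idest_map), decode_map(ops, isrc_c_map))
```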

1) they were cut out by paranoid NVidia somewhere around version 12.7-12.8, so I ripped MDs with properties only up to sm90 - sm100, sm101 & sm120 don't have them. I also tried to re-apply properties from sm90 to the remaining 3 - but this is very unreliable

2) they are not complete. Let's look at a couple of samples