Parts 1 & 2

Pseudo instructions

Surprise, surprise - some PTX instructions are not mapped 1:1 to the underlying SASS. Instead they expand into lots of other PTX code. I have already extracted their decrypted bodies, so it's time to describe how they are connected to specific PTX pseudo instructions

Somewhere deep inside ptxas there is a function which registers lots of handlers for dumping real PTX for pseudo instructions. The code registering a single item looks like

  mov     rdi, [rbx+250h] ; dictionary of pseudo-instructions
  lea     rdx, emit_multimem_ld_reduce ; handler
  lea     rsi, aMultimemLdRedu         ; "multimem.ld_reduce" - pseudo instruction name
  call    reg_sm_cb

There are 587 such handlers, although 473 of them have strange names like "1030557441". I don't know what they mean - most likely they are the product of yet another encryption somewhere inside the parser. At least each such string has exactly 1 reference

Let's look inside one of the handlers

  call    get_pool
  mov     rdi, [rax+18h]
  mov     esi, 0C350h ; 50000 bytes - they don't skimp on matches
  call    alloc_buf
  test    rax, rax
  mov     r12, rax ; r12 holds address of string buffer
  jz      loc_5626FC1E9D78 ; die in alloc_failed
loc_5626FC1E9733:            ; CODE XREF: emit_multimem_ld_reduce+67D↓j
  lea     rdx, [r13+1A5E95h] ; whut ?
  lea     rsi, aS_11         ; "%s"
  mov     rdi, r12           ; s
  xor     eax, eax
  call    _sprintf ; note: not even snprintf - security above all!
  lea     rdx, [r13+1A5E98h] ; whut again ?
  movsxd  rdi, eax ; store in rdi length of written string
  lea     rsi, aS_11         ; "%s"
  mov     rbx, rdi
  xor     eax, eax
  add     rdi, r12           ; s
  call    _sprintf

The debugger showed that R13 holds the address of the decrypted string pool in memory.

Just assess the level of paranoia - there is a huge encrypted blob of strings, 1.8 MB in size. Then they wrote 587 functions in which each string from that blob can be referenced only by offset - 21042 unique offsets! Nvidia definitely didn't want us to see its dirty secrets.

So I wrote some code to extract all emitters and then all string offsets - see the result. Now it would be good to link the offsets from each emitter to the real strings, right?

Nothing could be simpler - yet another Perl XS module to load a memory-mapped file plus a small Perl script - and finally we can see this
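The linking step can be sketched in a few lines. This is not the Perl XS module mentioned above, just a rough Python illustration under assumptions: the decrypted pool has been dumped to a file, every string in it is NUL-terminated (consistent with the "%s" sprintf calls in the handler), and the offsets per emitter are already extracted. The names cstr_at, resolve_emitters and the shape of emitter_offsets are hypothetical.

```python
import mmap

def cstr_at(pool, offset):
    """Return the NUL-terminated string starting at `offset` in `pool`."""
    end = pool.find(b"\x00", offset)
    if end == -1:              # no terminator found: take the rest of the pool
        end = len(pool)
    return pool[offset:end].decode("latin-1")

def resolve_emitters(pool_path, emitter_offsets):
    """Map each emitter name to the strings its offsets point at.

    `emitter_offsets` is assumed to look like
    {"emit_multimem_ld_reduce": [0x1A5E95, 0x1A5E98, ...], ...}
    """
    resolved = {}
    with open(pool_path, "rb") as f:
        # Memory-map the dump so a large blob is not copied into a Python object
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as pool:
            for name, offsets in emitter_offsets.items():
                resolved[name] = [cstr_at(pool, off) for off in offsets]
    return resolved
```

Concatenating the resolved strings in the order the handler uses them should give roughly the PTX text that the chain of sprintf calls writes into the buffer.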

Lexer brute-force

Grigory Evko has already done great work identifying some of the lex tokens returned from the yylex function. Unfortunately his analysis contains a lot of omissions - many states are described just as

single-valued keyword

so let's do some heavy lifting to identify as many lex tokens as we can

Setting breakpoint in right place

The yylex function has a single exit block (glory to SSA form & clang), and it looks like

  mov     rsi, [rsp+68h+var_40] ; stack canary
  xor     rsi, fs:28h
  mov     eax, r15d ; r15d holds returned token
  jnz     loc_5626FBB31BEF ; die in ___stack_chk_fail
  add     rsp, 38h ; so huge function has very small stack frame
  pop     rbx
  pop     rbp
  pop     r12
  pop     r13
  pop     r14
  pop     r15
  retn

I could also find the address of yytext (and yyleng for complete happiness), but fortunately the old value of RSI here holds a pointer to yytext (except at EOF, of course). So we can use a scriptable debugger like gdb/lldb/drgn and just collect a log in real time

Being lazy, I used IDA Pro - it allows attaching a simple IDC script to a breakpoint:

read RSI with get_reg_value
check resulting token in R15D
IDA Pro does not provide a convenient function for reading C-style strings, so I used the ancient trick of reading byte by byte via read_dbg_byte

As you can see - nothing special and it doesn't even look too nauseating
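For readers without IDA at hand, the logic of such a breakpoint handler can be sketched in plain Python. This is not the IDC script itself: read_reg and read_byte are hypothetical callbacks standing in for the debugger's register and per-byte memory reads, and treating token 0 as EOF is an assumption based on the usual (f)lex convention.

```python
def read_cstring(read_byte, addr, max_len=256):
    """Read a NUL-terminated string one byte at a time.

    `read_byte` is a hypothetical callback playing the role of the
    debugger's per-byte memory read (read_dbg_byte in the IDC script).
    `max_len` guards against running away on a bad pointer.
    """
    chars = []
    for i in range(max_len):
        b = read_byte(addr + i)
        if b == 0:
            break
        chars.append(chr(b))
    return "".join(chars)

def on_yylex_exit(read_reg, read_byte, log):
    """Breakpoint body: record (token, yytext) for one yylex return."""
    token = read_reg("r15") & 0xFFFFFFFF   # R15D is the low 32 bits of R15
    if token == 0:                         # assumed EOF: RSI is not yytext here
        return
    text = read_cstring(read_byte, read_reg("rsi"))
    log.append((token, text))
```

Keeping the register and memory access behind callbacks means the same body could be wired to gdb, lldb or any other scriptable debugger without changing the logging logic.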

Preparing data for brute-force

What are we trying to brute-force here? Well, obviously the indices of instruction attributes, located in the tabs sub-dir

So I added a new option -l to iptx.pl - with it the script reads the whole set of tables, does some filtering and produces an output PTX file with 595 patched strings (as usual, I found a couple of bad tables)

I always spit out sm101 - it seems that no lexer logic uses SM version checks

Results

Then we can finally run our specially crafted PTX file under the debugger and get a log. For the analysis I wrote yet another simple Perl script. So which attribute tokens do we know now:

259/0x103: .a .alloc .ashift .async .b .b1 .b2 .b4 .b4x16_p64 .b6x16_p32 .b8x16 .block_scale .clustershared .collector::a::discard .collector::a::fill .collector::a::lastuse .collector::a::use .collector::b0::discard .collector::b0::fill .collector::b0::lastuse .collector::b0::use .collector::b1::discard .collector::b1::fill .collector::b1::lastuse .collector::b1::use .collector::b2::discard .collector::b2::fill .collector::b2::lastuse .collector::b2::use .collector::b3::discard .collector::b3::fill .collector::b3::lastuse .collector::b3::use .collector::b::discard .collector::b::fill .collector::b::lastuse .collector::b::use .cp .cp_mask .cta1 .cta2 .cta_group::1 .cta_group::2 .dealloc .desc .e0m3 .e0m3x2 .e0m3x4 .e2m1 .e2m3 .e3m2 .e3m4 .e3m4x2 .e3m4x4 .enaq .f8k4 .fc .fc::2gb4 .frd .frel .g .gather4 .get_first_ctaid .get_first_ctaid::x .get_first_ctaid::y .get_first_ctaid::z .h8k4 .im2col .im2col::w .im2col::w::128 .im2col_no_offs .im2col_no_offs::w .is_canceled .keeprefcount .kind::f16 .kind::f8f6f4 .kind::fp16 .kind::fp4 .kind::fp8 .kind::i8 .kind::int8 .kind::mx8 .kind::mxf4 .kind::mxf4nvf4 .kind::mxf8f6f4 .kind::mxint8 .kind::nvf4 .kind::tf32 .kind::ti16 .launch_dependents .ld .matrix::A .matrix::D .matrix::scaleA .matrix::scaleB .matrix::sparseMetadata .mma .multicast .negAB .no_atexit .o128 .o4 .override::global_address .override::global_dim .override::global_dim::global_stride .override::global_dim_stride .pack::16b .packed_offsets .r .read .regoffset .release::ordered .s .s2m6 .s2m6x2 .s32k2 .satnarrow .scale_vec::1X .scale_vec::2X .scale_vec::4X .scaled::n1::ue8m0 .scaled::n2::ue8m0 .scatter4 .self .shift .st .sub a::a .tf32 .tile .tile::gather4 .tile::scatter4 .ue4m3 .ue5m3x2 .ue8m0 .unpack::16b .v8 .w .w128 .wait .warpx2::01_23 .warpx2::02_13 .warpx4 .write .xorsign
273/0x111: .weak
288/0x120: .rm .rmi .rn .rna .rni .rp .rpi .rs .rz .rzi
289/0x121: .finite .infinite .normal .notanumber .number .subnormal
290/0x122: .ca .cg .cs .cv .inv .invall .lu .wb .wt
291/0x123: .L1 .L2 .L2 .L2 .L2 .tensormap
292/0x124: .clamp .trap .wrap .zero
293/0x125: .shr15 .shr7
294/0x126: .po
295/0x127: .b4e .ecl .ecr .f4e .rc16 .rc8
296/0x128: .bfly .idx .up
299/0x12B: .footprint
300/0x12C: .coarse
308/0x134: .1d_buffer .2dms .a1d .a2d .a2dms .acube .cube
309/0x135: .addr_mode_0 .addr_mode_1 .addr_mode_2 .array_size .channel_data_type .channel_order .depth .filter_mode .force_unnormalized_coords .height .memory_layout .normalized_coords .num_mipmap_levels .num_samples .width
310/0x136: .1d .2d .3d .4d .5d
320/0x140: .eq .equ .ge .geu .gt .gtu .hi .hs .le .leu .lo .ls .lt .ltu .nan .ne .neu .num
322/0x142: .and .or .xor
323/0x143: .cas .dec .exch .inc .safeadd
324/0x144: .add .max .maxabs .min .popc
325/0x145: .uni
326/0x146: .conv .div .unanimous
327/0x147: .sync
328/0x148: .aligned
329/0x149: .all .any
330/0x14A: .dual
331/0x14B: .close
337/0x151: .cluster .cta .gl .gpu .sys
338/0x152: .v2 .v4
349/0x15D: .alias
352/0x160: .ballot
353/0x161: .approx
354/0x162: .relu
355/0x163: .ftz
356/0x164: .noftz
357/0x165: .sat
358/0x166: .satfinite
359/0x167: .cc
360/0x168: .shiftamt
361/0x169: .acq_rel .acquire .relaxed .release .sc .volatile
362/0x16A: .mmio
363/0x16B: .nc
365/0x16D: .NaN
366/0x16E: .bulk_group .mbarrier .mbarrier::arrive::one .mbarrier::complete_tx::bytes .mbarrier::meet_tx::bytes
367/0x16F: .down
368/0x170: .no_membermask_overlap
386/0x182: .128x128b .128x256b .16x128b .16x256b .16x32bx2 .16x64b .32x128b .32x32b .4x256b .64x128b .m16n16 .m16n16k16 .m16n16k8 .m16n8 .m16n8k128 .m16n8k16 .m16n8k256 .m16n8k32 .m16n8k4 .m16n8k64 .m16n8k8 .m32n8k16 .m64n104k16 .m64n104k32 .m64n104k64 .m64n104k8 .m64n112k16 .m64n112k256 .m64n112k32 .m64n112k64 .m64n112k8 .m64n120k16 .m64n120k32 .m64n120k64 .m64n120k8 .m64n128k16 .m64n128k256 .m64n128k32 .m64n128k64 .m64n128k8 .m64n136k16 .m64n136k32 .m64n136k64 .m64n136k8 .m64n144k16 .m64n144k256 .m64n144k32 .m64n144k64 .m64n144k8 .m64n152k16 .m64n152k32 .m64n152k64 .m64n152k8 .m64n160k16 .m64n160k256 .m64n160k32 .m64n160k64 .m64n160k8 .m64n168k16 .m64n168k32 .m64n168k64 .m64n168k8 .m64n16k16 .m64n16k256 .m64n16k32 .m64n16k64 .m64n16k8 .m64n176k16 .m64n176k256 .m64n176k32 .m64n176k64 .m64n176k8 .m64n184k16 .m64n184k32 .m64n184k64 .m64n184k8 .m64n192k16 .m64n192k256 .m64n192k32 .m64n192k64 .m64n192k8 .m64n200k16 .m64n200k32 .m64n200k64 .m64n200k8 .m64n208k16 .m64n208k256 .m64n208k32 .m64n208k64 .m64n208k8 .m64n216k16 .m64n216k32 .m64n216k64 .m64n216k8 .m64n224k16 .m64n224k256 .m64n224k32 .m64n224k64 .m64n224k8 .m64n232k16 .m64n232k32 .m64n232k64 .m64n232k8 .m64n240k16 .m64n240k256 .m64n240k32 .m64n240k64 .m64n240k8 .m64n248k16 .m64n248k32 .m64n248k64 .m64n248k8 .m64n24k16 .m64n24k256 .m64n24k32 .m64n24k64 .m64n24k8 .m64n256k16 .m64n256k256 .m64n256k32 .m64n256k64 .m64n256k8 .m64n32k16 .m64n32k256 .m64n32k32 .m64n32k64 .m64n32k8 .m64n40k16 .m64n40k32 .m64n40k64 .m64n40k8 .m64n48k16 .m64n48k256 .m64n48k32 .m64n48k64 .m64n48k8 .m64n56k16 .m64n56k32 .m64n56k64 .m64n56k8 .m64n64k16 .m64n64k256 .m64n64k32 .m64n64k64 .m64n64k8 .m64n72k16 .m64n72k32 .m64n72k64 .m64n72k8 .m64n80k16 .m64n80k256 .m64n80k32 .m64n80k64 .m64n80k8 .m64n88k16 .m64n88k32 .m64n88k64 .m64n88k8 .m64n8k16 .m64n8k256 .m64n8k32 .m64n8k64 .m64n8k8 .m64n96k16 .m64n96k256 .m64n96k32 .m64n96k64 .m64n96k8 .m8n16 .m8n32 .m8n32k16 .m8n64 .m8n8 .m8n8k128 .m8n8k16 .m8n8k32 .m8n8k4 .m8n8k64
387/0x183: .col .row
389/0x185: .128B .256B .64B
390/0x186: .exclusive
391/0x187: .transA
392/0x188: .negA
393/0x189: .transB
394/0x18A: .negB
395/0x18B: .ignoreC
396/0x18C: .ignoreC_pred
397/0x18D: .L1::evict_first .L1::evict_last .L1::evict_normal .L1::evict_unchanged .L1::no_allocate .L2::evict_first .L2::evict_last .L2::evict_normal .L2::evict_unchanged .L2::no_allocate
400/0x190: .sync_restrict::shared::cluster .sync_restrict::shared::cta
401/0x191: .x1 .x128 .x16 .x2 .x32 .x4 .x64 .x8
402/0x192: .trans
403/0x193: .pair .quad .thread
404/0x194: .bf16 .bf16x2 .s2 .s4 .u2 .u4
405/0x195: .lower::16b
406/0x196: .expand
407/0x197: .sp::ordered_metadata
410/0x19A: .noComplete
411/0x19B: .noinc
412/0x19C: .abs
415/0x19F: .asc::b32 .asc::b64
416/0x1A0: .acc::f16 .acc::f32
417/0x1A1: .box_dim .element_stride .elemtype .fill_mode .global_address .global_dim .global_stride .interleave_layout .rank .swizzle_atomicity .swizzle_mode
418/0x1A2: .b1024
419/0x1A3: .async::generic .tensormap::generic
420/0x1A4: ::after_thread_sync ::before_thread_sync
422/0x1A6: .block16 .block3

As you can see, the table still has some omissions, like 398, 399, 408, 409, 413, 414 etc.

I have zero ideas what attributes should be placed at those indices, because I used the whole set of tables from ptxas
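The post-processing behind such a table is simple enough to sketch. The snippet below is not the Perl script mentioned above, only a rough Python equivalent under the assumption that the debugger log has already been parsed into (token, text) pairs. The function names are hypothetical, and the output format just mimics the "decimal/hex: attributes" lines of the table.

```python
from collections import defaultdict

def group_by_token(pairs):
    """Collect the texts seen for every token id."""
    groups = defaultdict(list)
    for token, text in pairs:
        groups[token].append(text)
    return groups

def format_groups(groups):
    """Render lines like '273/0x111: .weak', ordered by token id."""
    lines = []
    for token in sorted(groups):
        texts = " ".join(sorted(groups[token]))
        lines.append(f"{token}/0x{token:X}: {texts}")
    return lines

def missing_ids(groups):
    """Token ids between the lowest and highest seen id that never showed up."""
    seen = set(groups)
    return [t for t in range(min(seen), max(seen) + 1) if t not in seen]
```

Running missing_ids over the grouped log is a quick way to list gaps such as 398 or 399 without eyeballing the table.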

PS: The hardest part of analyzing (f)lex-generated code is that it does not make lots of string comparison operations - instead it builds a DFA and all strings are fused into that DFA. So it's not easy to extract them back
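To make the point about fused strings concrete, here is a toy Python sketch. It is a deliberately simplified model, not the layout flex really emits (real scanners keep their transitions in compressed integer arrays): keywords are merged into one trie-shaped transition table, matching never compares whole strings, and getting the keywords back means walking the table. All names are hypothetical; the sample token ids are taken from the table above.

```python
def build_dfa(keywords):
    """Fuse keywords into one transition table (a trie-shaped DFA).

    `keywords` maps text -> token id. State 0 is the start state.
    """
    transitions = {}   # (state, char) -> next state
    accepting = {}     # state -> token id
    next_state = 1
    for text, token in keywords.items():
        state = 0
        for ch in text:
            key = (state, ch)
            if key not in transitions:
                transitions[key] = next_state
                next_state += 1
            state = transitions[key]
        accepting[state] = token
    return transitions, accepting

def match(transitions, accepting, text):
    """Run the DFA over `text`; return the token id or None."""
    state = 0
    for ch in text:
        state = transitions.get((state, ch))
        if state is None:
            return None
    return accepting.get(state)

def enumerate_keywords(transitions, accepting):
    """Recover every accepted string by walking the table from state 0."""
    outgoing = {}
    for (state, ch), nxt in transitions.items():
        outgoing.setdefault(state, []).append((ch, nxt))
    found = {}
    stack = [(0, "")]
    while stack:
        state, prefix = stack.pop()
        if state in accepting:
            found[prefix] = accepting[state]
        for ch, nxt in outgoing.get(state, []):
            stack.append((nxt, prefix + ch))
    return found
```

The walk terminates here only because a trie has no cycles. A real lexer also has looping states for identifiers and numbers plus compressed tables, which is presumably why feeding candidate strings to the live yylex is the more practical route.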

windows deep internals

Tuesday, June 2, 2026

RE of PTX grammar from ptxas, part 3
