Pseudo instructions
Surprise, surprise: some PTX instructions are not mapped 1:1 to the underlying SASS. Instead they expand into lots of other PTX code. I have already extracted their decrypted bodies, so it's time to describe how they are connected to specific PTX pseudo instructions
Somewhere deep inside ptxas there is a function that registers lots of handlers for dumping the real PTX of pseudo instructions. The code registering a single item looks like
mov rdi, [rbx+250h] ; dictionary of pseudo-instructions
lea rdx, emit_multimem_ld_reduce ; handler
lea rsi, aMultimemLdRedu ; "multimem.ld_reduce" - pseudo instruction name
call reg_sm_cb
There are 587 such handlers, although 473 of them have strange names like "1030557441". I don't know what they mean - most likely they are the product of yet another encryption layer somewhere inside the parser. At least each such string has exactly 1 reference
Let's look inside one of the handlers
call get_pool
mov rdi, [rax+18h]
mov esi, 0C350h ; 50000 bytes - they are not pinching pennies here
call alloc_buf
test rax, rax
mov r12, rax ; r12 holds address of string buffer
jz loc_5626FC1E9D78 ; die in alloc_failed
loc_5626FC1E9733: ; CODE XREF: emit_multimem_ld_reduce+67D↓j
lea rdx, [r13+1A5E95h] ; whut ?
lea rsi, aS_11 ; "%s"
mov rdi, r12 ; s
xor eax, eax
call _sprintf ; note: plain sprintf, not even snprintf - security above all!
lea rdx, [r13+1A5E98h] ; whut again ?
movsxd rdi, eax ; store in rdi length of written string
lea rsi, aS_11 ; "%s"
mov rbx, rdi
xor eax, eax
add rdi, r12 ; s
call _sprintf
The debugger showed that R13 holds the address of the decrypted string pool in memory.
Just assess the level of paranoia: there is a huge encrypted blob of strings, 1.8 MB in size. Then they wrote 587 functions in which every string from that blob is referenced only by offset - 21042 unique offsets! Nvidia definitely didn't want us to see its dirty secrets.
So I wrote some code to extract all emitters and then all string offsets - see result. Now it would be good to link the offsets from each emitter with the real strings, right?
Nothing could be simpler: yet another Perl XS module to load a memory-mapped file plus a small perl script, and finally we can see this
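The offset-to-string linking can be sketched in a few lines. This is a Python stand-in using the standard mmap module, not the Perl XS module mentioned above. The file name, the toy pool contents and the offsets in the demo are made up for illustration; the only thing taken from the listing is the idea that each emitter appends NUL-terminated strings found at fixed offsets from the pool base.

```python
import mmap
import os
import tempfile


def read_cstring(pool, offset):
    """Return the NUL-terminated string that starts at `offset` in `pool`."""
    end = pool.find(b"\x00", offset)
    if end == -1:          # no terminator: read up to the end of the pool
        end = len(pool)
    return pool[offset:end].decode("latin-1")


def resolve_offsets(pool_path, offsets):
    """Map a dumped string pool read-only and resolve each offset to its string."""
    with open(pool_path, "rb") as fh:
        with mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ) as pool:
            return {off: read_cstring(pool, off) for off in offsets}


def emit(pool_path, offsets):
    """Mimic an emitter: concatenate the strings in the order they are referenced."""
    strings = resolve_offsets(pool_path, offsets)
    return "".join(strings[off] for off in offsets)


if __name__ == "__main__":
    # Toy pool (hypothetical data, not the real ptxas blob).
    fd, path = tempfile.mkstemp()
    os.write(fd, b"ld\x00.global\x00.b32\x00")
    os.close(fd)
    print(emit(path, [0, 3, 11]))
    os.remove(path)
```

Mapping the file instead of reading it keeps the lookups cheap even when the same pool is queried for thousands of offsets.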
Lexer brute-force
Grigory Evko has already done great work identifying some of the lex tokens returned by the yylex function. Unfortunately his analysis contains a lot of omissions - many states are described just as
single-valued keyword
so let's do some heavy lifting to identify as many lex tokens as we can
Setting a breakpoint in the right place
The yylex function has a single exit block (glory to SSA forms & clang), and it looks like
mov rsi, [rsp+68h+var_40] ; stack canary
xor rsi, fs:28h
mov eax, r15d ; r15d holds returned token
jnz loc_5626FBB31BEF ; die in ___stack_chk_fail
add rsp, 38h ; such a huge function has a very small stack frame
pop rbx
pop rbp
pop r12
pop r13
pop r14
pop r15
retn
I could also find the address of yytext (and yyleng for complete happiness), but fortunately the old value of RSI, before the canary check overwrites it, holds a pointer to yytext here (except at EOF, of course). So we can use a scriptable debugger like gdb/lldb/drgn and simply get a log in real time
Being lazy, I used IDA Pro - it allows attaching a simple IDC script to a breakpoint:
- read RSI with get_reg_value
- check resulting token in R15D
- IDA Pro does not provide a convenient function for reading C-style strings, so I used the ancient trick of reading byte by byte via read_dbg_byte
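The three steps above can be sketched as follows. This is Python rather than IDC, and it is not the actual script: get_reg and read_byte are hypothetical callables standing in for get_reg_value and read_dbg_byte, the 256-byte limit is an arbitrary safety cap, and treating token 0 as EOF is the usual lex/yacc convention, assumed to hold here.

```python
def read_c_string(read_byte, addr, max_len=256):
    """Read a NUL-terminated string one byte at a time from debuggee memory."""
    data = bytearray()
    for i in range(max_len):
        b = read_byte(addr + i)
        if b == 0:
            break
        data.append(b)
    return data.decode("latin-1")


def on_yylex_exit(get_reg, read_byte, log):
    """Breakpoint handler: record (token, yytext) at the exit block of yylex."""
    token = get_reg("R15") & 0xFFFFFFFF   # R15D holds the returned token
    if token == 0:                        # assumed EOF: RSI is not a valid yytext
        log.append((0, "<EOF>"))
        return
    log.append((token, read_c_string(read_byte, get_reg("RSI"))))


if __name__ == "__main__":
    # Fake debuggee state for a dry run (addresses are made up).
    memory = {0x1000 + i: b for i, b in enumerate(b".sync\x00")}
    regs = {"RSI": 0x1000, "R15": 327}
    log = []
    on_yylex_exit(regs.__getitem__, lambda a: memory.get(a, 0), log)
    print(log)
```

Passing the register and memory readers in as parameters keeps the logic testable outside the debugger; inside one, they would be thin wrappers around whatever the debugger API offers.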
Preparing data for brute-force
What are we trying to brute-force here? Well, obviously the indices of instruction attributes, located in the tabs sub-dir
So I added a new option -l to iptx.pl - with it the script reads the whole set of tables, does some filtering and produces an output ptx file with 595 patched strings (as usual, I found a couple of bad tables)
I always spit out sm101 - it seems that no lexer logic uses SM version checks
Results
Then we can finally run our specially crafted PTX file under the debugger and get a log. For the analysis I wrote yet another simple perl script. So here are the attribute tokens we know now:
- 259/0x103: .a .alloc .ashift .async .b .b1 .b2 .b4 .b4x16_p64 .b6x16_p32 .b8x16 .block_scale .clustershared .collector::a::discard .collector::a::fill .collector::a::lastuse .collector::a::use .collector::b0::discard .collector::b0::fill .collector::b0::lastuse .collector::b0::use .collector::b1::discard .collector::b1::fill .collector::b1::lastuse .collector::b1::use .collector::b2::discard .collector::b2::fill .collector::b2::lastuse .collector::b2::use .collector::b3::discard .collector::b3::fill .collector::b3::lastuse .collector::b3::use .collector::b::discard .collector::b::fill .collector::b::lastuse .collector::b::use .cp .cp_mask .cta1 .cta2 .cta_group::1 .cta_group::2 .dealloc .desc .e0m3 .e0m3x2 .e0m3x4 .e2m1 .e2m3 .e3m2 .e3m4 .e3m4x2 .e3m4x4 .enaq .f8k4 .fc .fc::2gb4 .frd .frel .g .gather4 .get_first_ctaid .get_first_ctaid::x .get_first_ctaid::y .get_first_ctaid::z .h8k4 .im2col .im2col::w .im2col::w::128 .im2col_no_offs .im2col_no_offs::w .is_canceled .keeprefcount .kind::f16 .kind::f8f6f4 .kind::fp16 .kind::fp4 .kind::fp8 .kind::i8 .kind::int8 .kind::mx8 .kind::mxf4 .kind::mxf4nvf4 .kind::mxf8f6f4 .kind::mxint8 .kind::nvf4 .kind::tf32 .kind::ti16 .launch_dependents .ld .matrix::A .matrix::D .matrix::scaleA .matrix::scaleB .matrix::sparseMetadata .mma .multicast .negAB .no_atexit .o128 .o4 .override::global_address .override::global_dim .override::global_dim::global_stride .override::global_dim_stride .pack::16b .packed_offsets .r .read .regoffset .release::ordered .s .s2m6 .s2m6x2 .s32k2 .satnarrow .scale_vec::1X .scale_vec::2X .scale_vec::4X .scaled::n1::ue8m0 .scaled::n2::ue8m0 .scatter4 .self .shift .st .sub a::a .tf32 .tile .tile::gather4 .tile::scatter4 .ue4m3 .ue5m3x2 .ue8m0 .unpack::16b .v8 .w .w128 .wait .warpx2::01_23 .warpx2::02_13 .warpx4 .write .xorsign
- 273/0x111: .weak
- 288/0x120: .rm .rmi .rn .rna .rni .rp .rpi .rs .rz .rzi
- 289/0x121: .finite .infinite .normal .notanumber .number .subnormal
- 290/0x122: .ca .cg .cs .cv .inv .invall .lu .wb .wt
- 291/0x123: .L1 .L2 .L2 .L2 .L2 .tensormap
- 292/0x124: .clamp .trap .wrap .zero
- 293/0x125: .shr15 .shr7
- 294/0x126: .po
- 295/0x127: .b4e .ecl .ecr .f4e .rc16 .rc8
- 296/0x128: .bfly .idx .up
- 299/0x12B: .footprint
- 300/0x12C: .coarse
- 308/0x134: .1d_buffer .2dms .a1d .a2d .a2dms .acube .cube
- 309/0x135: .addr_mode_0 .addr_mode_1 .addr_mode_2 .array_size .channel_data_type .channel_order .depth .filter_mode .force_unnormalized_coords .height .memory_layout .normalized_coords .num_mipmap_levels .num_samples .width
- 310/0x136: .1d .2d .3d .4d .5d
- 320/0x140: .eq .equ .ge .geu .gt .gtu .hi .hs .le .leu .lo .ls .lt .ltu .nan .ne .neu .num
- 322/0x142: .and .or .xor
- 323/0x143: .cas .dec .exch .inc .safeadd
- 324/0x144: .add .max .maxabs .min .popc
- 325/0x145: .uni
- 326/0x146: .conv .div .unanimous
- 327/0x147: .sync
- 328/0x148: .aligned
- 329/0x149: .all .any
- 330/0x14A: .dual
- 331/0x14B: .close
- 337/0x151: .cluster .cta .gl .gpu .sys
- 338/0x152: .v2 .v4
- 349/0x15D: .alias
- 352/0x160: .ballot
- 353/0x161: .approx
- 354/0x162: .relu
- 355/0x163: .ftz
- 356/0x164: .noftz
- 357/0x165: .sat
- 358/0x166: .satfinite
- 359/0x167: .cc
- 360/0x168: .shiftamt
- 361/0x169: .acq_rel .acquire .relaxed .release .sc .volatile
- 362/0x16A: .mmio
- 363/0x16B: .nc
- 365/0x16D: .NaN
- 366/0x16E: .bulk_group .mbarrier .mbarrier::arrive::one .mbarrier::complete_tx::bytes .mbarrier::meet_tx::bytes
- 367/0x16F: .down
- 368/0x170: .no_membermask_overlap
- 386/0x182: .128x128b .128x256b .16x128b .16x256b .16x32bx2 .16x64b .32x128b .32x32b .4x256b .64x128b .m16n16 .m16n16k16 .m16n16k8 .m16n8 .m16n8k128 .m16n8k16 .m16n8k256 .m16n8k32 .m16n8k4 .m16n8k64 .m16n8k8 .m32n8k16 .m64n104k16 .m64n104k32 .m64n104k64 .m64n104k8 .m64n112k16 .m64n112k256 .m64n112k32 .m64n112k64 .m64n112k8 .m64n120k16 .m64n120k32 .m64n120k64 .m64n120k8 .m64n128k16 .m64n128k256 .m64n128k32 .m64n128k64 .m64n128k8 .m64n136k16 .m64n136k32 .m64n136k64 .m64n136k8 .m64n144k16 .m64n144k256 .m64n144k32 .m64n144k64 .m64n144k8 .m64n152k16 .m64n152k32 .m64n152k64 .m64n152k8 .m64n160k16 .m64n160k256 .m64n160k32 .m64n160k64 .m64n160k8 .m64n168k16 .m64n168k32 .m64n168k64 .m64n168k8 .m64n16k16 .m64n16k256 .m64n16k32 .m64n16k64 .m64n16k8 .m64n176k16 .m64n176k256 .m64n176k32 .m64n176k64 .m64n176k8 .m64n184k16 .m64n184k32 .m64n184k64 .m64n184k8 .m64n192k16 .m64n192k256 .m64n192k32 .m64n192k64 .m64n192k8 .m64n200k16 .m64n200k32 .m64n200k64 .m64n200k8 .m64n208k16 .m64n208k256 .m64n208k32 .m64n208k64 .m64n208k8 .m64n216k16 .m64n216k32 .m64n216k64 .m64n216k8 .m64n224k16 .m64n224k256 .m64n224k32 .m64n224k64 .m64n224k8 .m64n232k16 .m64n232k32 .m64n232k64 .m64n232k8 .m64n240k16 .m64n240k256 .m64n240k32 .m64n240k64 .m64n240k8 .m64n248k16 .m64n248k32 .m64n248k64 .m64n248k8 .m64n24k16 .m64n24k256 .m64n24k32 .m64n24k64 .m64n24k8 .m64n256k16 .m64n256k256 .m64n256k32 .m64n256k64 .m64n256k8 .m64n32k16 .m64n32k256 .m64n32k32 .m64n32k64 .m64n32k8 .m64n40k16 .m64n40k32 .m64n40k64 .m64n40k8 .m64n48k16 .m64n48k256 .m64n48k32 .m64n48k64 .m64n48k8 .m64n56k16 .m64n56k32 .m64n56k64 .m64n56k8 .m64n64k16 .m64n64k256 .m64n64k32 .m64n64k64 .m64n64k8 .m64n72k16 .m64n72k32 .m64n72k64 .m64n72k8 .m64n80k16 .m64n80k256 .m64n80k32 .m64n80k64 .m64n80k8 .m64n88k16 .m64n88k32 .m64n88k64 .m64n88k8 .m64n8k16 .m64n8k256 .m64n8k32 .m64n8k64 .m64n8k8 .m64n96k16 .m64n96k256 .m64n96k32 .m64n96k64 .m64n96k8 .m8n16 .m8n32 .m8n32k16 .m8n64 .m8n8 .m8n8k128 .m8n8k16 .m8n8k32 .m8n8k4 .m8n8k64
- 387/0x183: .col .row
- 389/0x185: .128B .256B .64B
- 390/0x186: .exclusive
- 391/0x187: .transA
- 392/0x188: .negA
- 393/0x189: .transB
- 394/0x18A: .negB
- 395/0x18B: .ignoreC
- 396/0x18C: .ignoreC_pred
- 397/0x18D: .L1::evict_first .L1::evict_last .L1::evict_normal .L1::evict_unchanged .L1::no_allocate .L2::evict_first .L2::evict_last .L2::evict_normal .L2::evict_unchanged .L2::no_allocate
- 400/0x190: .sync_restrict::shared::cluster .sync_restrict::shared::cta
- 401/0x191: .x1 .x128 .x16 .x2 .x32 .x4 .x64 .x8
- 402/0x192: .trans
- 403/0x193: .pair .quad .thread
- 404/0x194: .bf16 .bf16x2 .s2 .s4 .u2 .u4
- 405/0x195: .lower::16b
- 406/0x196: .expand
- 407/0x197: .sp::ordered_metadata
- 410/0x19A: .noComplete
- 411/0x19B: .noinc
- 412/0x19C: .abs
- 415/0x19F: .asc::b32 .asc::b64
- 416/0x1A0: .acc::f16 .acc::f32
- 417/0x1A1: .box_dim .element_stride .elemtype .fill_mode .global_address .global_dim .global_stride .interleave_layout .rank .swizzle_atomicity .swizzle_mode
- 418/0x1A2: .b1024
- 419/0x1A3: .async::generic .tensormap::generic
- 420/0x1A4: ::after_thread_sync ::before_thread_sync
- 422/0x1A6: .block16 .block3
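For reference, grouping such a log by token takes only a few lines. The sketch below is Python, not the perl script used for the list above, and the log format (one "token<TAB>yytext" pair per line) is an assumption. Unlike the list above, it also drops duplicate strings.

```python
from collections import defaultdict


def group_tokens(log_lines):
    """Collect the set of yytext strings seen for every token id."""
    groups = defaultdict(set)
    for line in log_lines:
        line = line.strip()
        if not line:
            continue
        token, _, text = line.partition("\t")
        groups[int(token)].add(text)
    return groups


def format_groups(groups):
    """Render groups as '- 288/0x120: .rn .rz' lines, sorted by token id."""
    return [
        f"- {token}/0x{token:X}: {' '.join(sorted(groups[token]))}"
        for token in sorted(groups)
    ]


if __name__ == "__main__":
    sample = ["288\t.rn", "288\t.rz", "273\t.weak", "288\t.rn"]
    for row in format_groups(group_tokens(sample)):
        print(row)
```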
As you can see, the table still has some omissions, like 398, 399, 408, 409, 413, 414 etc
I have zero ideas which attributes should be placed at those indices, because I used the whole set of tables from ptxas
PS: The hardest part of analyzing (f)lex-generated code is that it does not perform lots of string comparison operations - instead it builds a DFA and all strings are fused into that DFA. So it's not easy to extract them back
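Listing the gaps is easy to automate. A small Python sketch, fed with the token ids from the tail of the list above (386 to 422); a gap only means that no attribute string from the tables produced that token, nothing more.

```python
def missing_ids(known, lo=None, hi=None):
    """Return the token ids in [lo, hi] that never showed up in the log."""
    known = set(known)
    lo = min(known) if lo is None else lo
    hi = max(known) if hi is None else hi
    return [i for i in range(lo, hi + 1) if i not in known]


# Token ids observed between 386 and 422, copied from the list above.
SEEN_TAIL = [
    386, 387, 389, 390, 391, 392, 393, 394, 395, 396, 397,
    400, 401, 402, 403, 404, 405, 406, 407, 410, 411, 412,
    415, 416, 417, 418, 419, 420, 422,
]

if __name__ == "__main__":
    print(missing_ids(SEEN_TAIL))
    # prints [388, 398, 399, 408, 409, 413, 414, 421]
```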
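To illustrate what "fused into a DFA" means, here is a toy Python model: keywords become character transitions between numbered states, and only accepting states carry a token. It is a plain trie, far simpler than real flex output, which packs its automaton into compressed tables, so getting strings back from a real binary takes more work than the walk shown here. The keyword set is a tiny sample from the list above.

```python
def build_dfa(keywords):
    """Build (transitions, accepting) from a dict of keyword text -> token id."""
    transitions = [{}]        # state number -> {char: next state}
    accepting = {}            # state number -> token id
    for text, token in keywords.items():
        state = 0
        for ch in text:
            nxt = transitions[state].get(ch)
            if nxt is None:
                transitions.append({})
                nxt = len(transitions) - 1
                transitions[state][ch] = nxt
            state = nxt
        accepting[state] = token
    return transitions, accepting


def match(transitions, accepting, text):
    """Run the automaton over `text`; no string comparison is ever made."""
    state = 0
    for ch in text:
        state = transitions[state].get(ch)
        if state is None:
            return None
    return accepting.get(state)


def enumerate_strings(transitions, accepting):
    """Recover every accepted string by walking all paths from the start state."""
    found = {}
    stack = [(0, "")]
    while stack:
        state, prefix = stack.pop()
        if state in accepting:
            found[prefix] = accepting[state]
        for ch, nxt in transitions[state].items():
            stack.append((nxt, prefix + ch))
    return found


if __name__ == "__main__":
    kw = {".rn": 288, ".rna": 288, ".rz": 288, ".weak": 273}
    t, a = build_dfa(kw)
    print(match(t, a, ".rna"), match(t, a, ".r"))
    print(sorted(enumerate_strings(t, a)))
```

Note that none of the keywords exists as a contiguous string anywhere in the tables, which is why a plain strings dump of the binary does not reveal them.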