Tuesday, October 28, 2025

SASS disasm in Perl

As an illustration of how to use the modules presented in my previous post, I made yet another SASS disassembler, written entirely in Perl. It is an almost exact copy of my nvd, implemented in just 460 LoC. The only unsupported feature is register tracking, because I still haven't made a Perl binding for it. What it can do better than the original nvdisasm:

And most importantly: because it's based on Ced, you can patch any instruction from your script. You can also customize the output, save it somewhere such as a DB via Perl DBI, or add your own passes to reveal some dirty nvidia secrets

like

Barriers

A typical description of their GPUs tells you
  • memory size
  • SM count
  • L1 & L2 cache sizes 
  • CUDA version

and that's all. "Our GPUs are the greatest GPUs in the world!"

If you are a curious CUDA programmer, then with cudaGetDeviceProperties you can also extract things like

But what they never tell you: the count of available registers is not the only limiting factor. There are at least a couple more hardware limits: the number of predicate registers and barriers (and the most closely guarded top secret, the size of the cache for so-called "reused" registers)

Why can those numbers be important? Let's assume that you wrote your cute CUDA kernel and, using various dirty tricks, made sure it uses only part of the available registers, let's say only 64 per thread. Now you expect that you can keep regsPerMultiprocessor / 64 threads resident on each SM, right? Well, in general this can be false because of a shortage of predicate registers or barriers
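The arithmetic behind that expectation can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the function name `max_resident`, the resource name `resource_x` and every number are made up, since the real limits for predicate registers and barriers are exactly what is not documented. The point is just that the resource with the smallest quotient wins, not necessarily the register file.

```python
def max_resident(sm_limits, usage):
    """Return (count, limiting_resource): how many copies fit on one SM.

    sm_limits: resource name -> total available per SM (hypothetical values)
    usage:     resource name -> amount consumed by one copy of the kernel
    """
    best = None
    for name, total in sm_limits.items():
        used = usage.get(name, 0)
        if used == 0:
            continue  # the kernel does not consume this resource
        fit = total // used
        if best is None or fit < best[0]:
            best = (fit, name)
    return best


# All numbers below are invented for illustration only.
sm_limits = {"registers": 65536, "resource_x": 4096}
usage = {"registers": 64, "resource_x": 8}
print(max_resident(sm_limits, usage))
```

With these invented numbers the register file alone would allow 65536 / 64 = 1024 copies, but the second resource caps the result at 512.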

Anyway, returning to barriers: I was always confused about how only 6 barriers can guarantee synchronization for 256 registers. Even worse, the SASS disasm shows that barriers are placed in strange, seemingly random places

So I made separate logic (option -b for dg.pl) to track which instructions issue read/write barriers and what they have in common. After numerous trials I know the answer:

ptxas places read/write barriers (src_rel_sb/dst_wr_sb fields) only for instructions having non-zero MIN_WAIT_NEEDED property

Let's check this code sample:
; s2r_ line 91391 min_wait: 1
; stall 1 total 12 cword 731 B--:R-:W1:-:S1
/*30*/ S2R R9,SR_TID.X &wr=0x1 ?trans1 ;
...
; stall 5 total 13 cword FE5 B01:R-:W-:Y:S5
; wait 0 (W) at 10 stall diff 4
/*40*/ IMAD.LO.U32 R0,R9, 0x200,RZ &req={0} ?WAIT5_END_GROUP ;

Evidently ptxas tracks register liveness in a graph, and when a value is produced by an instruction with non-zero MIN_WAIT_NEEDED it places the appropriate read/write barrier on that instruction and sets req_bit_set with the barrier mask on the instruction consuming that value
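The bookkeeping needed to pair a waiting instruction with the instruction that armed the barrier can be sketched roughly like this. It is a Python illustration, not the actual code behind the -b option of dg.pl. The record layout (`addr`, `op`, `set`, `req`) is hypothetical, and the mnemonic at offset 0x10 is a placeholder because that instruction is elided in the listing.

```python
def track_barriers(instrs):
    """Pair every barrier wait with the instruction that armed the barrier.

    instrs: list of dicts with hypothetical keys
      addr - instruction offset
      op   - mnemonic
      set  - barrier index armed by the instruction (read or write), or None
      req  - set of barrier indexes the instruction waits on
    Returns a list of (waiter_addr, barrier_index, (setter_addr, setter_op)).
    """
    pending = {}  # barrier index -> (addr, op) of the instruction that armed it
    waits = []
    for ins in instrs:
        # Waits are resolved before the instruction itself arms anything.
        for b in sorted(ins.get("req", ())):
            if b in pending:
                waits.append((ins["addr"], b, pending.pop(b)))
        if ins.get("set") is not None:
            pending[ins["set"]] = (ins["addr"], ins["op"])
    return waits


# Hypothetical records loosely modelled on the listing above.
sample = [
    {"addr": 0x10, "op": "PRODUCER", "set": 0, "req": set()},
    {"addr": 0x30, "op": "S2R", "set": 1, "req": set()},
    {"addr": 0x40, "op": "IMAD", "set": None, "req": {0}},
]
for waiter, barrier, (setter, op) in track_barriers(sample):
    print(f"wait {barrier} at {waiter:X}: armed by {op} at {setter:X}")
```

A dictionary keyed by barrier index is enough here because a barrier armed again before anyone waits on it simply overwrites the previous entry.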

The opposite is not true

not every instruction with non-zero MIN_WAIT_NEEDED raises a barrier, if the produced value is used far enough away. The most subtle thing here is how "far" it must be. Please don't ask me; it clearly requires some additional research on your specific GPU. Or better, ask nvidia :-)
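One possible way to start that research is to measure, for every producer with non-zero MIN_WAIT_NEEDED, how far away its first consumer is, and then compare the distances of producers with and without a barrier. The Python sketch below assumes that distance can be approximated by summing stall counts between producer and consumer. The record layout (`addr`, `stall`, `dst`, `src`, `min_wait`, `barrier`) and all sample values are hypothetical.

```python
def producer_distances(instrs):
    """Return (addr, has_barrier, distance) for each producer with min_wait > 0.

    distance is the sum of stall counts from the producer up to (not
    including) the first instruction that reads the produced register.
    """
    out = []
    for i, p in enumerate(instrs):
        if p["min_wait"] == 0 or p["dst"] is None:
            continue
        dist = 0
        for c in instrs[i:]:
            if c is not p:
                if p["dst"] in c["src"]:
                    out.append((p["addr"], p["barrier"], dist))
                    break
                if c["dst"] == p["dst"]:
                    break  # register overwritten before any use
            dist += c["stall"]
    return out


# Invented records; a real run would take them from the disassembler output.
sample = [
    {"addr": 0x30, "stall": 1, "dst": "R9", "src": [], "min_wait": 1, "barrier": True},
    {"addr": 0x40, "stall": 5, "dst": "R0", "src": ["R9"], "min_wait": 0, "barrier": False},
    {"addr": 0x50, "stall": 2, "dst": "R2", "src": [], "min_wait": 1, "barrier": False},
    {"addr": 0x60, "stall": 4, "dst": "R3", "src": ["R0"], "min_wait": 0, "barrier": False},
    {"addr": 0x70, "stall": 1, "dst": "R4", "src": ["R2"], "min_wait": 0, "barrier": False},
]
for addr, has_barrier, dist in producer_distances(sample):
    print(f"{addr:X}: barrier={has_barrier} distance={dist}")
```

Collected over many kernels, the smallest distance seen without a barrier would give a first estimate of the threshold for a given GPU.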
