As an illustration of how to use the modules presented in my previous post, I made yet another SASS disassembler, written entirely in Perl. It is an almost exact copy of my nvd, implemented in just 460 LoC. The only unsupported feature is register tracking, because I still haven't made a Perl binding for it. What it can do better than the original nvdisasm:
- shows LUT operations
- shows instructions properties/predicates
- shows relocs for each code section
- shows const bank params
And most importantly, because it's based on Ced, you can patch any instruction from your script. You can also customize the output, save it somewhere such as a DB via Perl DBI, or add your own passes to reveal some dirty nvidia secrets, like barriers.
Barriers
What nvidia usually tells you about its GPUs:
- memory size
- SM count
- L1 & L2 cache sizes
- CUDA version
and that's all. "Our GPUs are the greatest GPUs in the world!"
If you are a curious CUDA programmer, then with cudaGetDeviceProperties you can also extract more details, such as the number of registers per SM (regsPerMultiprocessor). But here is what they never tell you: the count of available registers is not the only limiting factor. There are at least a couple more hardware limits: the number of predicate registers and barriers (and the most guarded top secret: the size of the cache for so-called "reused" registers).
Why can those numbers be important? Let's assume that you wrote your cute CUDA kernel and, using various dirty tricks, made sure it uses only part of the available registers, let's say only 64 per thread. Now you expect that you can run regsPerMultiprocessor / 64 threads on each SM, right? Well, in general this can be false because of a shortage of predicate registers or barriers.
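To make the arithmetic concrete, here is a minimal Python sketch of that estimate. Registers are allocated per thread, so dividing regsPerMultiprocessor by 64 gives resident threads, and dividing again by the block size gives blocks. The function name `blocks_per_sm`, the `other_limits` parameter and all numbers are assumptions made for illustration. In particular, the "3 blocks" cap is a made-up placeholder for some other resource limit, not a measured value.

```python
def blocks_per_sm(regs_per_sm, regs_per_thread, threads_per_block, other_limits=()):
    """Resident blocks per SM: the tightest of all per-resource limits."""
    # Registers are allocated per thread, so one block needs this many.
    regs_per_block = regs_per_thread * threads_per_block
    by_registers = regs_per_sm // regs_per_block
    # other_limits holds the block counts allowed by any other resource
    # (hypothetical here: predicate registers, barriers, ...).
    return min([by_registers, *other_limits])


# 65536 is only an example value for regsPerMultiprocessor.
naive = blocks_per_sm(65536, 64, 256)
# Hypothetical: some other resource allows only 3 blocks.
actual = blocks_per_sm(65536, 64, 256, other_limits=[3])
print(naive, actual)  # prints: 4 3
```

The sketch ignores register allocation granularity and shared memory, so treat it as an upper bound at best.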
Anyway, returning to barriers: I was always confused about how only 6 barriers can guarantee synchronization for 256 registers. Even worse, the SASS disassembly shows that barriers are placed in strange, seemingly random places.
So I made separate logic (option -b for dg.pl) to track which instructions issue read/write barriers and what they have in common. After numerous trials I know the answer:
ptxas places read/write barriers (src_rel_sb/dst_wr_sb fields) only on instructions that have a non-zero MIN_WAIT_NEEDED property
Let's check this code sample:
; s2r_ line 91391 min_wait: 1
; stall 1 total 12 cword 731 B--:R-:W1:-:S1
/*30*/ S2R R9,SR_TID.X &wr=0x1 ?trans1 ;
...
; stall 5 total 13 cword FE5 B01:R-:W-:Y:S5
; wait 0 (W) at 10 stall diff 4
/*40*/ IMAD.LO.U32 R0,R9, 0x200,RZ &req={0} ?WAIT5_END_GROUP ;
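To read the annotations in this listing, it helps to decode the cword values by hand. The Python sketch below assumes the control-word layout commonly described in public reverse-engineering work for Volta and later: stall in bits 0-3, yield in bit 4 (0 means yield), write barrier in bits 5-7, read barrier in bits 8-10 (7 means none), wait mask in bits 11-16. This layout is an assumption, not something taken from nvidia documentation or from the dg.pl source, and the function names are hypothetical. It does reproduce both annotation strings from the sample.

```python
def decode_cword(cword):
    """Split a control word into its scheduling fields (assumed layout)."""
    def barrier(value):
        return None if value == 7 else value  # 7 means "no barrier"

    return {
        "stall": cword & 0xF,
        "yield": ((cword >> 4) & 1) == 0,      # bit cleared means yield
        "write_barrier": barrier((cword >> 5) & 0x7),
        "read_barrier": barrier((cword >> 8) & 0x7),
        "wait_mask": (cword >> 11) & 0x3F,     # one bit per barrier 0..5
    }


def format_ctrl(c):
    """Render fields in the B..:R.:W.:Y:S. style used in the listing."""
    wait = f"{c['wait_mask']:02X}" if c["wait_mask"] else "--"
    rd = "-" if c["read_barrier"] is None else str(c["read_barrier"])
    wr = "-" if c["write_barrier"] is None else str(c["write_barrier"])
    yld = "Y" if c["yield"] else "-"
    return f"B{wait}:R{rd}:W{wr}:{yld}:S{c['stall']}"


def waited_barriers(c):
    """Barrier indexes named by the wait mask, like the &req={...} set."""
    return [i for i in range(6) if (c["wait_mask"] >> i) & 1]


print(format_ctrl(decode_cword(0x731)))      # prints: B--:R-:W1:-:S1
print(format_ctrl(decode_cword(0xFE5)))      # prints: B01:R-:W-:Y:S5
print(waited_barriers(decode_cword(0xFE5)))  # prints: [0]
```

So the S2R sets write barrier 1, and the IMAD waits on barrier 0 for 5 stall cycles, which matches the &wr=0x1 and &req={0} attributes in the disassembly.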
Obviously ptxas tracks register liveness in a graph, and when some value is produced by an instruction with a non-zero MIN_WAIT_NEEDED, it sets the appropriate rd/wr barrier on that instruction and a req_bit_set with the barrier mask on the instruction consuming that value.
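The rule can be illustrated with a toy model in Python. This is not the ptxas algorithm: it is a simple linear scan instead of a liveness graph, it models write barriers only, and every name in it (`Insn`, `assign_write_barriers`, the `min_wait` field) is hypothetical. It only shows why 6 barriers can be enough: instructions with zero MIN_WAIT_NEEDED never take one, and a barrier becomes free again as soon as a consumer waits on it.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

NUM_BARRIERS = 6


@dataclass
class Insn:
    name: str
    dst: Optional[str]
    srcs: Tuple[str, ...]
    min_wait: int = 0                 # stands in for MIN_WAIT_NEEDED
    write_barrier: Optional[int] = None
    wait_mask: int = 0


def assign_write_barriers(insns):
    free = set(range(NUM_BARRIERS))
    pending = {}  # register -> barrier guarding its in-flight write
    for insn in insns:
        # Consumer side: wait on barriers guarding any register we
        # read (or overwrite), then release them for reuse.
        for reg in (*insn.srcs, insn.dst):
            if reg in pending:
                barrier = pending.pop(reg)
                insn.wait_mask |= 1 << barrier
                free.add(barrier)
        # Producer side: only variable-latency instructions take a barrier.
        if insn.dst is not None and insn.min_wait > 0:
            if not free:
                raise RuntimeError("out of barriers in this toy model")
            barrier = min(free)
            free.remove(barrier)
            insn.write_barrier = barrier
            pending[insn.dst] = barrier
    return insns


prog = assign_write_barriers([
    Insn("S2R", "R9", (), min_wait=1),   # special register source omitted
    Insn("MOV", "R2", ("R1",)),          # fixed latency: no barrier
    Insn("IMAD", "R0", ("R9",)),         # consumes R9: must wait
])
for insn in prog:
    print(insn.name, insn.write_barrier, insn.wait_mask)
```

In this toy run only S2R gets a barrier and only IMAD waits on it. The MOV in between is untouched, which is the "random looking" placement seen in real listings.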
