Let's continue composing useful things on top of the Perl-driven Ced. This time I added a couple of new options to the test script dg.pl for register reuse.
What is it, anyway? Nvidia, as usual, doesn't want you to know. It is implemented in SASS as a set of operand attributes "reuse_src_XX", usually located in scheduler tables like TABLES_opex_X (newer ones like reuse_src_e & reuse_src_h are enums of type REUSE).
We can consider register reuse as a hint for the GPU scheduler that some register in an instruction can reuse the physical register already allocated to one of its source operands, avoiding a full register allocation and reducing register pressure - or, in other words, as a kind of register cache.
So the first question is: how can we detect the size of this cache? I made a new pass (option -u) that collects all "reuse" attributes and finds the maximum number active simultaneously; see function add_ruc.
The results are not very exciting: I was unable to find any functions in cuBLAS with a cache size of more than 2. I remember coming across a statement, somewhere in the numerous papers on dissecting GPUs, that it is equal to 4; unfortunately I can't remember the name of that paper :-(
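To make the "maximum active simultaneously" idea concrete, here is a minimal Python sketch. It is not the Perl code of add_ruc, and the liveness model is my assumption: a source register marked with reuse is treated as cached until a later instruction either reads it without the reuse flag or overwrites it. The `Instr` class and the function name `max_reuse_live` are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Instr:
    """Hypothetical, heavily simplified instruction (not a real SASS decoder)."""
    dst: Optional[str]                            # destination register, if any
    srcs: List[str]                               # source registers in operand order
    reuse: Set[str] = field(default_factory=set)  # sources carrying a reuse flag


def max_reuse_live(instrs):
    """Return the peak number of registers held by reuse flags at the same time.

    Assumed model: a flagged register stays cached until it is read
    without the flag or overwritten by a destination.
    """
    live = set()
    peak = 0
    for ins in instrs:
        # A read without the reuse flag is assumed to release the entry.
        for reg in ins.srcs:
            if reg in live and reg not in ins.reuse:
                live.discard(reg)
        # Sources flagged in this instruction become (or stay) cached.
        live |= ins.reuse
        peak = max(peak, len(live))
        # Writing a register invalidates whatever was cached for it.
        if ins.dst is not None:
            live.discard(ins.dst)
    return peak


sample = [
    Instr("R4", ["R0", "R1", "R4"], {"R0", "R1"}),
    Instr("R5", ["R0", "R2", "R5"], {"R0", "R2"}),
    Instr("R6", ["R0", "R1", "R6"]),
]
print(max_reuse_live(sample))
```

Under this model R1 is still counted as cached during the second instruction, because nothing has read or overwritten it yet. A stricter model (for example, one where an entry survives only until the very next instruction) would report smaller numbers.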
And the next thing is: can we automatically detect where registers can be reused and patch SASS?
For this I added yet another pass (option -U); see functions collect_reuse & resolve_rusage.
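As a rough illustration of what such a detection pass could look for, here is a small Python sketch. It is not the logic of collect_reuse or resolve_rusage. It rests on my assumption that a reuse candidate is a register read in the same source operand slot by two consecutive instructions and not overwritten by the first of them. The tuple layout and the name `find_reuse_candidates` are hypothetical.

```python
def find_reuse_candidates(instrs):
    """Find (index, slot, register) triples that look like reuse candidates.

    instrs is a list of (dst, [src0, src1, ...]) tuples, a simplified
    stand-in for decoded SASS. Assumed rule: the same register sits in the
    same source slot of the next instruction and is not clobbered by the
    current instruction's destination.
    """
    found = []
    for i in range(len(instrs) - 1):
        dst, srcs = instrs[i]
        _, next_srcs = instrs[i + 1]
        for slot, reg in enumerate(srcs):
            if slot >= len(next_srcs):
                break  # the next instruction has fewer source operands
            if reg == next_srcs[slot] and reg != dst:
                found.append((i, slot, reg))
    return found


kernel = [
    ("R4", ["R0", "R1", "R4"]),
    ("R5", ["R0", "R2", "R5"]),
    ("R2", ["R3", "R2", "R6"]),
    ("R7", ["R3", "R2", "R7"]),
]
for index, slot, reg in find_reuse_candidates(kernel):
    print(f"instruction {index}: slot {slot} could mark {reg} for reuse")
```

In the third instruction R2 is both a source and the destination, so the sketch skips it even though the next instruction reads R2 in the same slot. A real pass also has to deal with branch targets, predicates and instruction-specific restrictions, none of which are modeled here.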
Results: on my kernel (which I am ashamed to show) the script found 29 reuse cases and resolved 22. I manually selected 12 from the innermost loop (to keep the max cache size at 4), patched them with Ced and got a +3% speedup.
Also +240 LoC (including 70 for detailed comments in POD format).
