windows deep internals: sass registers reusing

Thursday, November 13, 2025

sass registers reusing

Let's continue building useful things on top of the Perl-driven Ced. This time I added a couple of new options for register reuse to the test script dg.pl

What is it, anyway? Nvidia, as usual, doesn't want you to know. In SASS it is implemented as a set of operand attributes "reuse_src_XX", usually located in scheduler tables like TABLES_opex_X (newer ones such as reuse_src_e & reuse_src_h are enums of type REUSE)

We can treat register reuse as a hint to the GPU scheduler that the value of a source register should be kept in a small operand cache, so that a following instruction reading the same register in the same operand slot does not have to fetch it from the register file again. This saves register file reads and bank conflicts - in other words, it works as a kind of register cache

So the first question is: how can we detect the size of this cache? I made a new pass (option -u) that collects all "reuse" attributes and finds the maximum number of them active simultaneously - see function add_ruc
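To make the idea of "maximum active simultaneously" concrete, here is a minimal Python sketch. It is not the add_ruc function from dg.pl. The instruction shape (a `(dst, srcs)` tuple) and the eviction rules are my own assumptions for illustration: a flagged operand in slot i installs an entry for that slot, and the entry dies when the slot is read without the flag or when the cached register is overwritten.

```python
def max_live_reuse(block):
    """Peak number of reuse entries alive at once in a straight-line block.

    block: list of (dst, srcs), where dst is a register name or None and
    srcs is a list of (register, has_reuse_flag), one per source slot.
    This is a toy model, not the real hardware behaviour.
    """
    live = {}  # slot index -> register currently held for that slot
    peak = 0
    for dst, srcs in block:
        for slot, (reg, reuse) in enumerate(srcs):
            if reuse:
                live[slot] = reg       # flagged operand (re)installs the entry
            else:
                live.pop(slot, None)   # unflagged read of the slot drops it
        peak = max(peak, len(live))
        if dst is not None:
            # overwriting a register invalidates entries that cached it
            for slot in [s for s, r in live.items() if r == dst]:
                del live[slot]
    return peak


example = [
    ("R4", [("R2", True), ("R3", True), ("R4", False)]),
    ("R5", [("R2", True), ("R6", False), ("R5", False)]),
    ("R2", [("R7", False), ("R8", False)]),
]
print(max_live_reuse(example))  # prints 2
```

A real pass would also have to reset the state at branch targets and follow whatever rules the hardware actually applies, which this sketch does not attempt.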

The results are not very exciting - I was unable to find functions in cuBLAS with a cache size greater than 2. I remember coming across a statement in one of the numerous papers on dissecting GPUs that it is equal to 4 - unfortunately I can't remember the name of that paper :-(

And the next thing is: can we automatically detect where registers can be reused and patch SASS?

For this I added yet another pass (option -U) - see functions collect_reuse & resolve_rusage
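As a rough illustration of what such detection could look like, here is a Python sketch under a deliberately simplified rule. It is not the logic of collect_reuse or resolve_rusage. The `(dst, srcs)` tuple shape and the rule itself are assumptions: an operand is a candidate when the next instruction reads the same register in the same source slot and the current instruction does not overwrite that register.

```python
def find_reuse_candidates(block):
    """Return (instruction_index, slot, register) triples that could get a
    reuse flag under the toy rule described above.

    block: list of (dst, srcs), where srcs is a list of
    (register, has_reuse_flag), one per source slot. Straight-line code
    only: branch targets and predication are ignored.
    """
    found = []
    for i in range(len(block) - 1):
        dst, srcs = block[i]
        _, next_srcs = block[i + 1]
        for slot, (reg, reuse) in enumerate(srcs):
            if reuse or reg == dst:
                continue  # already flagged, or overwritten by this instruction
            if slot < len(next_srcs) and next_srcs[slot][0] == reg:
                found.append((i, slot, reg))
    return found


example = [
    ("R4", [("R2", False), ("R3", False), ("R4", False)]),
    ("R5", [("R2", False), ("R6", False), ("R5", False)]),
    ("R9", [("R7", False), ("R6", False)]),
]
print(find_reuse_candidates(example))  # prints [(0, 0, 'R2'), (1, 1, 'R6')]
```

Finding candidates is only half the job: each one still has to be checked against the cache size limit before patching, which is why not every found case ends up resolved.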

Results: on my kernel (which I am ashamed to show) the script found 29 reuse cases and resolved 22 of them. I manually selected 12 from the innermost loop (to keep the max cache size at 4), patched them with Ced and got a 3% speedup

All this added 240 LoC to the script (including 70 lines of detailed comments in POD format).
