Patching cubin files works well, but loading and running them takes a lot of code and requires the Driver API. It would be much more convenient to patch SASS directly in the binaries produced by nvcc.
Unfortunately, evil NVIDIA shows its paranoia as usual:
- cuobjdump can list and extract content but cannot replace it. It is also extremely buggy on old libraries like libcublas.so v7
- the official fatbinary tool is too complex and rebuilds the whole file from scratch
- the fatbinary format is undocumented
So I wrote my own tool, which can:
- list files with the -v option
- extract the file at some index: -i idx -o output.filename
- replace the file at some index: -i idx -r replace.filename
Perl binding
Being lazy, I prefer to automate as much as possible with Perl scripts, so I also made a Perl XS module, ELF::FatBinary. Combined with the ELF::Reader module, it allows finer filtering of ELF files, for example by whether a file contains a section or symbol with a specific name. See a simple example of how it might look.
Limitations
The tool can replace files inside a fatbinary only in place, so:
- compressed fatbinaries are not supported
- the replacement file must be the same size as the original
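Both limitations follow from what an in-place replacement is. Here is a minimal Python sketch, assuming the offset and size of the embedded file are already known. The function name and arguments are hypothetical and are not the tool's real interface:

```python
def replace_in_place(container: bytearray, offset: int, size: int,
                     new_payload: bytes) -> None:
    """Overwrite `size` bytes at `offset` without moving anything else."""
    if len(new_payload) != size:
        # A different size would shift everything after the payload and
        # invalidate offsets stored in the surrounding headers.
        raise ValueError(f"size mismatch: expected {size}, got {len(new_payload)}")
    if offset < 0 or offset + size > len(container):
        raise ValueError("payload does not fit inside the container")
    container[offset:offset + size] = new_payload


# Toy container: 4 header bytes, 4 payload bytes, 4 trailer bytes.
blob = bytearray(b"HEADAAAATAIL")
replace_in_place(blob, 4, 4, b"BBBB")
```

A compressed entry would have to be recompressed to exactly the same length to pass the size check, which is presumably why compressed fatbinaries are out of scope.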
Some results
Without false modesty, I believe the main goal of the whole project has been achieved: now I can patch SASS instructions with simple text scripts and run the result. No hex editors are needed, and you can integrate those steps into your favorite CI/CD pipeline and even into Makefiles.
So I made a trivial sample and played a bit with ced to patch some SASS.
Registers order on wide load/store
See details here
The original cubin file stores the 64-bit integer 0x61 to the output buffer:
/*0010*/ S2R R2, SR_TID.X ;
/*0020*/ MOV R3, 0x8 ; size of 64bit int
/*0030*/ MOV R4, 0x61 ; lo part in R4
/*0040*/ MOV R5, 0x0 ; hi part in R5
/*0050*/ IMAD.WIDE R2, R2, R3, c[0x0][0x160] ;
/*0060*/ STG.E.64.SYS [R2], R4 ; store pair R4 & R5
As you can see, the low part of the 64-bit constant is placed in R4 and the high part in R5, and STG.64 then stores the pair R4 and R5, so the register order is now known: little-endian.
By the way, this is a useful trick for loading or storing 2 (or even 4) adjacent 32-bit values from or to an array with a single instruction.
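The register order can be modeled on the host. This is a small Python sketch, not part of the original experiment, and the helper name is made up. It splits a 64-bit value into the two 32-bit halves that R4 and R5 hold and packs them low word first, as the listing above suggests STG.64 does:

```python
import struct


def split_u64(value: int) -> tuple[int, int]:
    """Return the (lo, hi) 32-bit halves of a 64-bit unsigned value."""
    return value & 0xFFFFFFFF, (value >> 32) & 0xFFFFFFFF


lo, hi = split_u64(0x61)  # lo models R4, hi models R5
# Two little-endian 32-bit words, low word first, like the register pair.
stored = struct.pack("<II", lo, hi)
# The same 8 bytes read back as one little-endian 64-bit integer.
(roundtrip,) = struct.unpack("<Q", stored)
```

If the pair were stored high word first, reading the 8 bytes back as a little-endian 64-bit integer would not return the original value.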
Addresses of function parameters
Then I replaced the instruction at 0x60 with
STG.E.64.SYS [R2], R2
using the corresponding ced script, so now the cubin just returns the address of the output buffer. Results:
d_i 0x7f3fa5a00000
from device 7f3fa5a00000
00000000 00 00 A0 A5-3F 7F 00 00|00 00 00 00-00 00 00 00 ....?..........
The value of the host variable d_i is exactly the same as the result returned by the function. I don't know whether this holds for all SM architectures, though.
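The match can be checked by decoding the first 8 bytes of the hexdump as a little-endian 64-bit integer. A quick Python check, with the bytes copied from the dump above:

```python
dump = "00 00 A0 A5 3F 7F 00 00"   # first 8 bytes of the hexdump above
raw = bytes.fromhex(dump)          # fromhex skips the spaces
address = int.from_bytes(raw, "little")
print(hex(address))                # prints 0x7f3fa5a00000
```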
This can be used, for example, to reduce the number of function parameters: instead of N addresses you can pass a single array of N addresses filled in on the host.
Getting PC address
There is a curious form of the RPCMOV instruction:
CLASS "rpcmov_srcPc_"
FORMAT PREDICATE @[!]Predicate(PT):Pg Opcode /ONLY32:sz
Register:Rd
','PC_REG:RpcN
In theory it lets you retrieve PC into a pair of registers, so I made a ced script like
s 9
# remember - order is little-endian, so low part in R4
30 r RPCMOV R4, Rpc.LO
40 r RPCMOV R5, Rpc.HI
Unfortunately, the patched cubin always returns zero instead of the PC value. Besides, if loading PC is split across a couple of instructions, which PC value ends up in the result: the one from the first instruction or the one from the second?
OK, no problem - there is another French instruction, LEPC:
CLASS "lepc_"
FORMAT PREDICATE @[!]Predicate(PT):Pg Opcode
Register:Rd
PREDICATES
IDEST_SIZE = 64;
This time the black magic works:
from device 7fe177e32d40
00000000 40 2D E3 77-E1 7F 00 00|00 00 00 00-00 00 00 00 @-.w...........
When LEPC is placed at offset 0x30, the result ends with 0x30, so I conclude that it loads the value of PC before PC is advanced.
Section/segment attributes
Finally, I played a bit with the attributes of the text section and the executable segment, patching them to make both writable. It works fine, although I am not sure whether the Driver API really cares about those attributes.
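For reference, making a section or segment writable comes down to setting one bit in its header. Here is a minimal Python sketch, assuming a little-endian ELF64 file whose section and program headers have already been located. The flag values and field layouts come from the ELF specification, while the helper names and the sample header values are made up for illustration:

```python
import struct

SHF_WRITE = 0x1   # section flag: writable
PF_W = 0x2        # segment flag: writable

SHDR_FMT = "<IIQQQQIIQQ"   # Elf64_Shdr, 64 bytes
PHDR_FMT = "<IIQQQQQQ"     # Elf64_Phdr, 56 bytes


def make_section_writable(shdr: bytes) -> bytes:
    """Set SHF_WRITE in a little-endian Elf64_Shdr."""
    fields = list(struct.unpack(SHDR_FMT, shdr))
    fields[2] |= SHF_WRITE   # sh_flags is the third field
    return struct.pack(SHDR_FMT, *fields)


def make_segment_writable(phdr: bytes) -> bytes:
    """Set PF_W in a little-endian Elf64_Phdr."""
    fields = list(struct.unpack(PHDR_FMT, phdr))
    fields[1] |= PF_W        # p_flags is the second field
    return struct.pack(PHDR_FMT, *fields)


# Sample text section header: SHF_ALLOC | SHF_EXECINSTR = 0x6.
text_shdr = struct.pack(SHDR_FMT, 1, 1, 0x6, 0, 0x100, 0x80, 0, 0, 128, 0)
# Sample executable segment header: PF_R | PF_X = 0x5.
exec_phdr = struct.pack(PHDR_FMT, 1, 0x5, 0x100, 0, 0, 0x80, 0x80, 8)

patched_shdr = make_section_writable(text_shdr)
patched_phdr = make_segment_writable(exec_phdr)
```

Since both headers keep their size, this kind of patch also fits the in-place restriction described earlier.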
Happy hacking!