I spent a couple of days debugging a rare bug in my SASS disassembler. I tested it on thousands of .cubin files and got bad instruction decoding for just one of them. By the way, I have never seen papers about testing disassemblers. Compilers like gcc and clang have huge test suites to detect regressions, so I should probably do the same. The problem is that I periodically add new features, and the smxx.so files are regenerated every time.
My nvd has the -N option to dump unrecognized opcodes, so for sm55 I got:
Not found at E8 0000100000010111111100010101110000011000100000100000000000000011101001110000000000000000
nvdisasm v11 swears that this pile of 0s and 1s must somehow be an ISCADD instruction. OK, let's run ead.pl and check whether it can find it:
perl ead.pl -BFvamrzN 0000100000010111111100010101110000011000100000100000000000000011101001110000000000000000 ../data/sm55_1.txt
found 4
........................0.0111000..11...................................................
0000-0-------111111-----0101110001011-------000--00000000000----------------------------
0000000-----------------0101110000111000-00000---00000000000----------------------------
00000--------111111-----01011100000110---000-----00000000000----------------------------
000000-------111111-----0001110---------------------------------------------------------
matched: 0
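To see how close each candidate came, one can count the fixed mask bits that disagree with the instruction. This is only a sketch and not how ead.pl works internally: it assumes that `0`/`1` in a mask are fixed bits and `-` is a don't-care, and the helper name `mismatches` is made up for the illustration.

```python
# Illustration only: assumes '0'/'1' are fixed bits and '-' is a don't-care.
INSN = "0000100000010111111100010101110000011000100000100000000000000011101001110000000000000000"

# The four candidate masks printed by ead.pl above.
CANDIDATES = [
    "0000-0-------111111-----0101110001011-------000--00000000000----------------------------",
    "0000000-----------------0101110000111000-00000---00000000000----------------------------",
    "00000--------111111-----01011100000110---000-----00000000000----------------------------",
    "000000-------111111-----0001110---------------------------------------------------------",
]

def mismatches(mask, insn):
    """Positions (0 = leftmost character) where a fixed mask bit differs from insn."""
    return [i for i, (m, b) in enumerate(zip(mask, insn)) if m != "-" and m != b]

# Print the number of disagreeing bits next to every candidate.
for mask in CANDIDATES:
    print(len(mismatches(mask, INSN)), mask)
```

A candidate with a single disagreeing bit is a good hint that the mask, not the opcode, is what is wrong.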
My first thought was that the MD were just too old because they were extracted from CUDA 10, so I made a decryptor for CUDA 11 (paranoid NVIDIA removed instruction properties starting with version 12, so 11 is the last source of MD), extracted the data, rebuilt sm55.cc and sm55.so, and ran the test again.
The bug did not disappear.
Time to check in the generated sm55.cc what mask the ISCADD instruction has:
00000--------111111-----01011100000110---000-----00000000000----------------------------
The buggy instruction is:
0000100000010111111100010101110000011000100000100000000000000011101001110000000000000000
As you can see, this is one of the masks already listed in the ead.pl output, and it doesn't match because of the 1 at position 4. Let's try to find which field this bad bit belongs to:
OEReuseC <- this 1
....X...................................................................................
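The same check can be done with integers, which is closer to what a table-driven decoder would do. The sketch below is an assumption about representation, not code from sm55.cc: the textual mask is split into a `care` integer (which bits are fixed) and a `value` integer (what they must be), and an XOR exposes the offending bit. The names `compile_mask` and `bad_bits` are hypothetical.

```python
# Illustration only: '-' is treated as a don't-care; helper names are hypothetical.
INSN = "0000100000010111111100010101110000011000100000100000000000000011101001110000000000000000"
MASK = "00000--------111111-----01011100000110---000-----00000000000----------------------------"

def compile_mask(mask):
    """Split a textual mask into (care, value) integers; leftmost char is the top bit."""
    care = int("".join("0" if c == "-" else "1" for c in mask), 2)
    value = int(mask.replace("-", "0"), 2)
    return care, value

def bad_bits(mask, insn):
    """Put an 'X' under every fixed bit that the instruction violates."""
    care, value = compile_mask(mask)
    diff = (int(insn, 2) & care) ^ value  # non-zero means the mask does not match
    width = len(mask)
    return "".join("X" if (diff >> (width - 1 - i)) & 1 else "." for i in range(width))

print(bad_bits(MASK, INSN))
```

For this instruction and mask the only `X` lands at position 4, the same spot marked above.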
And here is how it looks in the MD file:
ENCODING
Opcode13 = Opcode;
Pred = Pg;
PredNot = Pg@not;
Dest = Rd;
RegA = Ra;
RegB = Rb;
WriteCC = writeCC;
Imm5I = shift;
PSign = PSign(PO,Ra@negate,Rb@negate);
!BFiller;
!NencISCADD;
OEUSchedInfo = usched_info;
OEWaitOnSb = req_sb_bitset;
OEVarLatDest = 7;
OEVarLatSrc = 7;
OEReuseA = reuse_src_a;
OEReuseB = reuse_src_b;
!OEReuseC;
OECoupled = 0;
!OEReserved1;
As you can see, this form of ISCADD does not have a RegC field and so has the mask !OEReuseC, but the buggy ptxas nevertheless emitted an instruction where this field is filled with 1.
So here is what exactly happens: ead.pl sees !OEReuseC and generates a zero in the mask for that field's bit, and when my disassembler encounters a non-zero bit at that position it can't find a matching mask.
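A rough sketch of that step, under loud assumptions: this is not ead.pl's code, the helper names are invented, and `LAYOUT` is a hypothetical stand-in for the real field table. The only position taken from this post is OEReuseC at position 4 (counting from the leftmost character), and the width of 88 is simply the length of the masks shown above.

```python
# Illustration only, not ead.pl. LAYOUT is a hypothetical field table; the only
# entry taken from the post is OEReuseC at position 4 (leftmost char = 0).
WIDTH = 88
LAYOUT = {"OEReuseC": [4]}

# A few lines copied from the ENCODING section of the MD file.
ENCODING = """
!BFiller;
!NencISCADD;
OEReuseA = reuse_src_a;
OEReuseB = reuse_src_b;
!OEReuseC;
!OEReserved1;
"""

def zero_fields(text):
    """Names of fields written as '!Name;', i.e. fields that must be all zeros."""
    names = []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("!") and line.endswith(";"):
            names.append(line[1:-1])
    return names

def stamp_zeros(width, layout, names):
    """Start from an all don't-care mask and force '0' at each known position."""
    mask = ["-"] * width
    for name in names:
        for pos in layout.get(name, []):  # fields with unknown positions are skipped
            mask[pos] = "0"
    return "".join(mask)

mask = stamp_zeros(WIDTH, LAYOUT, zero_fields(ENCODING))
print(mask)
```

Every `!Field` line ends up as hard zeros in the mask, so a single stray 1 from ptxas is enough to reject the whole instruction.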
I don't know how the original nvdisasm solves this problem. Maybe they added a crutch to ignore the must-be-zero OEReuseC constraint, but I am too lazy to reverse engineer it. So I just patched the MD and it works fine now.
Moral of the story: when you have an independent disassembler and ptxas produces buggy code, bugs like this are inevitable.
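At the mask level, the effect of such a relaxation can be sketched as turning the offending bit into a don't-care. This is an assumption for illustration only: it is neither what nvdisasm does (unknown) nor the actual MD patch, and the helpers `matches` and `relax` are hypothetical.

```python
# Illustration only: not nvdisasm's behaviour and not the actual MD patch.
INSN = "0000100000010111111100010101110000011000100000100000000000000011101001110000000000000000"
MASK = "00000--------111111-----01011100000110---000-----00000000000----------------------------"

def matches(mask, insn):
    """True when every fixed mask bit equals the corresponding instruction bit."""
    return all(m == "-" or m == b for m, b in zip(mask, insn))

def relax(mask, positions):
    """Return a copy of mask with the given positions turned into don't-care."""
    chars = list(mask)
    for pos in positions:
        chars[pos] = "-"
    return "".join(chars)

relaxed = relax(MASK, [4])  # position 4 is the OEReuseC bit discussed above
print(matches(MASK, INSN), matches(relaxed, INSN))
```

With the strict mask the instruction is rejected, and with position 4 relaxed it is accepted as ISCADD, so this prints `False True`.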