Sunday, August 27, 2023

DWARF5 from clang 14

It seems that clang 14 uses more advanced DWARF5 features, so I added support for them to my dwarfdump. IMHO the most exciting features are:

Section .debug_line_str

In older DWARF versions file names were duplicated in each compilation unit. Since DWARF version 5 they are stored in a separate section, so they are shared and save some space. Obviously this saving is negligible compared to the overhead from type duplication.
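To illustrate how such a shared string is looked up, here is a minimal sketch. It assumes the raw bytes of .debug_line_str are already in memory and that the offset came from a DW_FORM_line_strp form; the helper name line_str_at is hypothetical and is not taken from my dwarfdump.

```cpp
#include <cstddef>
#include <optional>
#include <string_view>

// Hypothetical helper: 'section' holds the raw bytes of .debug_line_str,
// 'offset' is the value read from a DW_FORM_line_strp form.
std::optional<std::string_view> line_str_at(std::string_view section,
                                            std::size_t offset)
{
    if (offset >= section.size())
        return std::nullopt;                        // offset is outside the section
    std::size_t end = section.find('\0', offset);   // strings are NUL-terminated
    if (end == std::string_view::npos)
        return std::nullopt;                        // unterminated: treat as corrupt
    return section.substr(offset, end - offset);
}
```

Every compilation unit that refers to the same file name can then carry only the offset, which is where the space saving comes from.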

Section .debug_str_offsets

Also to save space, each compilation unit has a so-called base for its strings, passed via DW_AT_str_offsets_base. But there is a problem: some attributes can already refer to indexed strings before DW_AT_str_offsets_base occurs:


  <0><c>: Abbrev Number: 1 (DW_TAG_compile_unit)
    <d>   DW_AT_producer    : (indexed string: 0): clang version 14.0.6 (git@github.com:github/semmle-code 5c87e7737f331823ed8ed280883888566f08cdea)
    <e>   DW_AT_language    : 33        (C++14)
    <10>   DW_AT_name        : (indexed string: 0x1): c/extractor/src/extractor.cpp
    <11>   DW_AT_str_offsets_base: 0x8


As you can see, here two attributes (DW_AT_producer and DW_AT_name) use indexed strings before we know the value of the string base. This makes parsing in one pass much harder.
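One possible way around this is to park indexed strings until the whole DIE has been read and resolve them afterwards. The sketch below only illustrates that idea: the Attr structure and the function names are hypothetical, 32-bit DWARF with 4-byte little-endian entries in .debug_str_offsets is assumed, and this is not necessarily how my dwarfdump does it.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// One already decoded attribute of the compile unit DIE (hypothetical layout).
struct Attr {
    enum Kind { IndexedString, StrOffsetsBase, Other } kind;
    std::string name;      // e.g. "DW_AT_producer"
    std::uint64_t value;   // string index, base offset or raw value
};

// Read a little-endian 32-bit entry (32-bit DWARF is assumed).
static std::uint32_t read_u32(const std::vector<std::uint8_t>& s, std::size_t off)
{
    return std::uint32_t(s.at(off)) | (std::uint32_t(s.at(off + 1)) << 8) |
           (std::uint32_t(s.at(off + 2)) << 16) | (std::uint32_t(s.at(off + 3)) << 24);
}

// Indexed strings are collected first and resolved only after the whole DIE
// has been scanned, so the position of DW_AT_str_offsets_base does not matter.
std::vector<std::pair<std::string, std::string>>
resolve_cu_strings(const std::vector<Attr>& attrs,
                   const std::vector<std::uint8_t>& str_offsets,
                   const std::string& debug_str)
{
    std::vector<std::pair<std::string, std::uint64_t>> pending;
    std::uint64_t base = 0;
    bool have_base = false;
    for (const Attr& a : attrs) {
        if (a.kind == Attr::StrOffsetsBase) { base = a.value; have_base = true; }
        else if (a.kind == Attr::IndexedString) pending.emplace_back(a.name, a.value);
    }
    std::vector<std::pair<std::string, std::string>> out;
    if (!have_base)
        return out;                                  // nothing can be resolved
    for (const auto& [name, index] : pending) {
        std::uint32_t off = read_u32(str_offsets, base + index * 4);
        if (off >= debug_str.size())
            continue;                                // skip offsets outside .debug_str
        out.emplace_back(name, std::string(debug_str.c_str() + off));
    }
    return out;
}
```

The price is a small buffer of pending attributes per DIE instead of a strictly one-pass print.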

New locations format

I think this is the coolest and most useful feature: now each variable and parameter has a set of locations linked to address ranges (that's often the case for highly optimized code). Sample:

   Offset Entry 2077
    0024ef56 00000000000006b4 (index into .debug_addr) 004fb3c500000000 (base address)
    0024ef59 0000000000000000 000000000000001c DW_OP_reg5 (rdi)

This cryptic message means that starting from address 0x4fb3c5 some local variable is located in register rdi until the next address range (note: most tools like objdump or llvm-dwarfdump cannot show these new locations correctly; in this case objdump showed the base address in a bad format). It seems that neither IDA Pro nor Binary Ninja can use this debug information:
.text:00004FB3C5     mov     rdi, cs:compilation_tf
.text:00004FB3CC     cmp     dword ptr [rdi+0Ch], 0

The global variable compilation_tf has type a_trap_file_ptr, a pointer to a_trap_file. IDA Pro has this type information from the debug info but still cannot show the access to the field of a_trap_file at offset 0xC in the next instruction.
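For the curious, the location list sample above can be decoded with a few lines of code. This is a minimal sketch and not my actual dwarfdump code: it handles only the DW_LLE_base_addressx, DW_LLE_offset_pair and DW_LLE_end_of_list kinds (values from the DWARF 5 specification), and addr_table is a hypothetical stand-in for the compilation unit's part of .debug_addr. The byte sequence in the usage note is my reconstruction from the dump, so treat it as an assumption.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Entry kinds handled by this sketch.
enum : std::uint8_t {
    DW_LLE_end_of_list   = 0x00,
    DW_LLE_base_addressx = 0x01,
    DW_LLE_offset_pair   = 0x04,
};

struct LocRange {
    std::uint64_t start, end;         // absolute addresses, end is exclusive
    std::vector<std::uint8_t> expr;   // raw DWARF expression, 0x55 is DW_OP_reg5
};

// Decode an unsigned LEB128 value and advance 'pos'.
static std::uint64_t read_uleb(const std::vector<std::uint8_t>& d, std::size_t& pos)
{
    std::uint64_t result = 0;
    for (unsigned shift = 0; shift < 64; shift += 7) {
        std::uint8_t byte = d.at(pos++);
        result |= std::uint64_t(byte & 0x7f) << shift;
        if (!(byte & 0x80))
            break;
    }
    return result;
}

// Decode one location list into absolute address ranges.
std::vector<LocRange> decode_loclist(const std::vector<std::uint8_t>& d,
                                     const std::vector<std::uint64_t>& addr_table)
{
    std::vector<LocRange> out;
    std::uint64_t base = 0;
    std::size_t pos = 0;
    while (pos < d.size()) {
        std::uint8_t kind = d[pos++];
        if (kind == DW_LLE_base_addressx) {
            base = addr_table.at(read_uleb(d, pos));   // index into .debug_addr
        } else if (kind == DW_LLE_offset_pair) {
            std::uint64_t lo = read_uleb(d, pos);
            std::uint64_t hi = read_uleb(d, pos);
            std::uint64_t len = read_uleb(d, pos);     // expression length in bytes
            LocRange r{base + lo, base + hi, {}};
            for (std::uint64_t i = 0; i < len; ++i)
                r.expr.push_back(d.at(pos++));
            out.push_back(r);
        } else {
            break;   // DW_LLE_end_of_list or a kind this sketch does not handle
        }
    }
    return out;
}
```

Fed with the bytes 01 b4 0d 04 00 1c 01 55 00 and an address table where index 0x6b4 holds 0x4fb3c5, it should yield a single range starting at 0x4fb3c5 with a one-byte DW_OP_reg5 expression.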

 
As a result of all my patches I can now, for example, inspect IL structures from the Microsoft CodeQL C++ extractor:

Saturday, August 19, 2023

gcc plugin to collect cross-references, part 6

Part 1, 2, 3, 4 & 5
Finally I was able to compile and collect cross-references for reasonably big open-source projects like the Linux kernel and Botan:
wc -l botan.db
2108274 botan.db
grep Err: botan.db | wc -l
540

So let's check how we can extract accesses to record fields. If you take a quick look at tree.def you will notice the very prominent code COMPONENT_REF:
Value is structure or union component.
 Operand 0 is the structure or union (an expression).
 Operand 1 is the field (a node of type FIELD_DECL).
 Operand 2, if present, is the value of DECL_FIELD_OFFSET
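
In the simplest case an access like a.b.c is just a chain of nested COMPONENT_REFs, where operand 0 of the outer one is the inner one. The sketch below models this with a toy tree type, so ToyTree, ToyCode and field_chain are hypothetical names; a real plugin would use TREE_CODE and TREE_OPERAND on GCC trees instead.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Toy stand-ins for a few GCC tree codes.
enum class ToyCode { VarDecl, FieldDecl, ComponentRef };

struct ToyTree {
    ToyCode code;
    std::string name;      // name of the variable or field, empty for references
    const ToyTree* op0;    // ComponentRef: the structure expression
    const ToyTree* op1;    // ComponentRef: the field (a FieldDecl)
};

// Collect field names of a chain like a.b.c, from the outermost structure inwards.
std::vector<std::string> field_chain(const ToyTree* t)
{
    std::vector<std::string> fields;
    while (t && t->code == ToyCode::ComponentRef) {
        if (t->op1)
            fields.push_back(t->op1->name);   // operand 1 is the field
        t = t->op0;                           // operand 0 may be another reference
    }
    std::reverse(fields.begin(), fields.end());
    return fields;
}
```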

 

Sounds easy? "In theory there is no difference between theory and practice". In practice you can encounter many other types in any combination, like in this relatively simple RTL:
(call_insn:TI 1482 1481 2856 35 (call (mem:QI (mem/f:DI (plus:DI (reg/f:DI 0 ax [orig:340 MEM[(struct Server_Hello_13 *)_325].D.264452.D.264115._vptr.Handshake_Message ] [340])
                    (const_int 24 [0x18])) [744 MEM[(int (*) () *)_199 + 24B]+0 S8 A64]) [0 *OBJ_TYPE_REF(_200;&MEM[(struct _Uninitialized *)&D.349029].D.305525._M_storage->3B) S1 A8])
        (const_int 0 [0])) "/usr/local/include/c++/12.2.1/bits/stl_construct.h":88:18 898 {*call}
     (expr_list:REG_CALL_ARG_LOCATION (expr_list:REG_DEP_TRUE (concat:DI (reg:DI 5 di)
                (reg/f:DI 41 r13 [386]))
            (nil))
        (expr_list:REG_DEAD (reg:DI 5 di)
            (expr_list:REG_DEAD (reg/f:DI 0 ax [orig:340 MEM[(struct Server_Hello_13 *)_325].D.264452.D.264115._vptr.Handshake_Message ] [340])
                (expr_list:REG_EH_REGION (const_int 0 [0])
                    (expr_list:REG_CALL_DECL (nil)
                        (nil))))))
    (expr_list:DI (use (reg:DI 5 di))
        (nil)))

So I'll briefly describe some TREE types and how to deal with them to extract something useful.

Wednesday, August 16, 2023

gcc plugin to collect cross-references, part 5

Part 1, 2, 3 & 4

Let's check how RTL describes jump tables. I made a simple test, and the output of gcc -fdump-final-insns looks like:

(jump_insn # 0 0 8 (parallel [
            (set (pc)
                (reg:DI 0 ax [93]))
            (use (label_ref #))
        ]) "swtest.c":14:3# {*tablejump_1}
     (nil)
 -> 8)
(barrier # 0 0)
(code_label # 0 0 8 (nil) [2 uses])
(jump_table_data # 0 0 (addr_vec:DI [
            (label_ref:DI #)
            (label_ref:DI #)
...
        ]))

As you can see, jump_insn uses the pattern tablejump_1 referring to label 8. The RTL with code jump_table_data is located right after this label. It is probably a bad idea to assume that this will always be true, so it's better to use the function jump_table_for_label. Also, for some unknown reason the -fdump-final-insns option does not show the contents of jump tables. So let's at least try to find jump_table_data instructions from the plugin.

Surprisingly, you cannot find them when iterating over instructions within each block (using the FOR_ALL_BB_FN/FOR_BB_INSNS macros). I suspect this is due to the fact that both the label and the jump table belong to the block with index 0. So I used another loop:
for ( insn = get_insns(); insn; insn = NEXT_INSN(insn) )
Then we can check whether the current RTL instruction is a jump table with JUMP_TABLE_DATA_P. Jump tables have an addr_vec in the element with index 3, and each element is a label_ref. The length of the vector can be obtained from the field num_elem. Pretty easy, so what can we do with this knowledge?
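
The shape of that loop can be shown on a toy model that compiles without GCC headers. Everything here (ToyInsn, Code, JumpTable, collect_jump_tables) is a hypothetical stand-in: the next pointer plays the role of NEXT_INSN, the code check plays the role of JUMP_TABLE_DATA_P, and the targets vector plays the role of the addr_vec.

```cpp
#include <vector>

// Toy stand-ins for a few RTL codes.
enum class Code { Insn, JumpInsn, Barrier, CodeLabel, JumpTableData };

struct ToyInsn {
    Code code;
    int uid;
    std::vector<int> targets;   // JumpTableData only: uids of the target labels
    const ToyInsn* next;        // plays the role of NEXT_INSN
};

struct JumpTable {
    int uid;                    // uid of the jump table instruction
    std::vector<int> targets;   // labels it points to, in table order
};

// Walk the whole instruction chain instead of per-block lists
// and keep every jump table together with its label list.
std::vector<JumpTable> collect_jump_tables(const ToyInsn* first)
{
    std::vector<JumpTable> out;
    for (const ToyInsn* insn = first; insn; insn = insn->next)
        if (insn->code == Code::JumpTableData)
            out.push_back({insn->uid, insn->targets});
    return out;
}
```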

Monday, August 14, 2023

gcc plugin to collect cross-references, part 4

Let's apply the priceless knowledge from the previous part, for example to extract string literals and insert polymorphic decryption.
A typical call to printf/printk in RTL usually looks like

(insn 57 56 58 9 (set (reg:DI 5 di)
        (symbol_ref/f:DI ("*.LC0") [flags 0x2] <var_decl 0x7f480e9edea0 *.LC0>)) "swtest.c":17:7 80 {*movdi_internal} 

(call_insn 59 58 191 9 (set (reg:SI 0 ax)
        (call (mem:QI (symbol_ref:DI ("printf") [flags 0x41] <function_decl 0x7f480e8c7000 printf>) [0 __builtin_printf S1 A8])
            (const_int 0 [0]))) "swtest.c":17:7 909 {*call_value}
     (nil)
    (expr_list (use (reg:QI 0 ax))
        (expr_list:DI (use (reg:DI 5 di))
            (nil))))

Translation for mere mortals

Saturday, August 12, 2023

gcc plugin to collect cross-references, part 3

Part 1 & 2
Let's start climbing TREEs. The main sources for reference are tree.h, tree-core.h & print-tree.cc
 
Caution: because we are traveling during an RTL pass, some tree types have already been removed, so you are unlikely to meet TYPE_BINFO/BINFO_VIRTUALS etc.

The main structure is tree_base; it is included as the first field in all other types. For example, tree_type_non_common has a first field of type tree_type_with_lang_specific,
which has a field common of type tree_type_common,
which again has a field common of type tree_common,
which has a field typed of type tree_typed,
which has a field base of type tree_base.
A kind of ancient inheritance in pure C
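
The trick itself is easy to reproduce. Below is a minimal sketch with simplified stand-in structures (the toy_ names are hypothetical, and the real structures in tree-core.h have many more fields): since every structure starts with its "parent", a pointer to the outermost object can also be read as a pointer to the innermost base.

```cpp
#include <cstdint>
#include <type_traits>

// Simplified stand-ins for the nesting described above.
struct toy_base   { std::uint16_t code; };
struct toy_typed  { toy_base base;   const void* type; };
struct toy_common { toy_typed typed; const void* chain; };
struct toy_type   { toy_common common; unsigned precision; };

// The cast below relies on standard layout: the first member
// of each structure lives at offset 0.
static_assert(std::is_standard_layout<toy_type>::value,
              "first-member nesting requires standard layout");

// Read the code of any node through its innermost base.
inline std::uint16_t toy_code(const toy_type* t)
{
    return reinterpret_cast<const toy_base*>(t)->code;
}
```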

Caution 2: many fields have totally different meanings for concrete types, so GCC strongly encourages using the macros from tree.h to access all fields.

The type of a TREE can be obtained with TREE_CODE, and the name of the code with the function get_tree_code_name.
TREE_CODE returns enum tree_code, which has a lot of values: MAX_TREE_CODES is 0x175. So let's check only an important subset.

Wednesday, August 9, 2023

gcc plugin to collect cross-references, part 2

Because I am still fighting with endless variants of unnamed types while processing the Linux kernel, let's talk about persistence.
 
The final goal of this plugin is to build from sources a database of functions with the methods they call and the fields they use. After that you can find the set of functions referring to some field/method and investigate them later, with a disassembler for example.
 
So surely the plugin must store its results somewhere.
Graph databases are probably better suited for such data: you can put symbols as vertices and references as edges, and then all references to some symbol are just all of its incoming edges. But I am too lazy to install a JVM and Neo4j, so I used SQLite (and simple YAML-like files for debugging). You can connect your own storage by implementing the interface FPersistence.
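
To make the "incoming edges" idea concrete, here is a tiny in-memory sketch. ToyXrefStore and its methods are hypothetical and have nothing to do with the real FPersistence interface or with the SQLite schema; it only shows that answering "who refers to this symbol" is a lookup of edges by their target.

```cpp
#include <map>
#include <set>
#include <string>

// Hypothetical in-memory store: symbols are vertices, references are edges.
class ToyXrefStore {
public:
    // Record that function 'from' refers to symbol 'to' (a field or a method).
    void add_ref(const std::string& from, const std::string& to)
    {
        incoming_[to].insert(from);
    }

    // All functions referring to 'symbol', i.e. its incoming edges.
    std::set<std::string> refs_to(const std::string& symbol) const
    {
        auto it = incoming_.find(symbol);
        return it == incoming_.end() ? std::set<std::string>{} : it->second;
    }

private:
    std::map<std::string, std::set<std::string>> incoming_;
};
```

In a relational storage the same query would presumably become a select over a references table filtered by the target symbol.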