Friday, March 31, 2023

DWARF size overhead

Today I made a simple script to estimate the size overhead caused by type duplication. This is a hard task for C++, because some types can have specialized (or partially specialized) template parameters, and such types must of course be considered different. But for plain C we can safely take all top-level types and assume that types with the same name declared at the same line and column are equal.
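The idea can be sketched roughly like this. It is only an illustration, not the actual script: the TypeDie record, the duplicate_overhead function and the sample numbers are all made up, and I assume that the size in bytes of every top-level type entry has already been extracted from the dump.

```python
from collections import namedtuple

# Hypothetical record for one top-level type entry from the dump.
# `size` is the number of bytes the entry (with its children) occupies.
TypeDie = namedtuple("TypeDie", "name decl_line decl_column size")

def duplicate_overhead(dies):
    """Sum the sizes of all entries whose (name, line, column) was already seen."""
    seen = set()
    wasted = 0
    for die in dies:
        key = (die.name, die.decl_line, die.decl_column)
        if key in seen:
            wasted += die.size  # a repeated copy counts as overhead
        else:
            seen.add(key)       # the first copy is the necessary one
    return wasted

# Invented sample: struct_a appears in three compilation units.
sample = [
    TypeDie("struct_a", 178, 8, 16),
    TypeDie("struct_a", 178, 8, 16),
    TypeDie("struct_a", 178, 8, 16),
    TypeDie("struct_b", 168, 3, 8),
]
print(duplicate_overhead(sample))
```

Only the first copy of each type is treated as needed, every later copy is counted as waste.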

Next I ran this script on the objdump -g dump of the linux kernel. The script gave me the number 252741370

Let's find the size of the .debug_info section

objdump -h vmlinux | grep debug_info
 35 .debug_info   118205ec  0000000000000000  0000000000000000  03037230  2**0

Size is 0x118205ec = 293733868

And finally let's calculate the share of unnecessary info: 252741370 / 293733868 = 0.8604
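The same arithmetic can be double-checked in a few lines. The variable names are mine, the numbers are the ones from above.

```python
# Size of .debug_info as printed by objdump -h (hex, without the 0x prefix).
debug_info_size = int("118205ec", 16)

# Number reported by the script for duplicated types.
duplicated = 252741370

share = duplicated / debug_info_size
print(debug_info_size, round(share, 4))
```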

I am shocked - 86%!!! Looks like a conspiracy of hard drive manufacturers

Update: for C++ I made another version of this script that supports namespaces and got the following results:

  • cc1 from gcc 111858917 / 130272407 = 0.8587
  • gdb 105034993 / 139899598 = 0.7508
  • llvm-dwarfdump 194031570 / 263053224 = 0.7376

Thursday, March 30, 2023

dwarfdump

I made a pale analog of the world-famous pdbdump to dump types and functions from DWARF. Before introducing my tool I have a few words to say about DWARF - it is redundant, compiler-specific, inconsistent and dangerous

Redundancy

gcc and llvm put the full set of used types into each compilation unit. This is really terrible if you use lots of templates like STL/boost - you will get duplicated declarations of std::map, std::string etc. Yep, this is the main reason why stripped binaries become so much smaller:

ls -l llvm-dwarfdump llvm-dwarfdump.stripped

-rwxrwxr-x 1 redp redp 471241104 mar 29 00:52 llvm-dwarfdump
-rwxrwxr-x 1 redp redp 22170696  mar 29 17:49 llvm-dwarfdump.stripped

Another example - let's check how many times the function console_printk is declared in the debug info of the linux kernel:
grep console_printk vm.g | wc -l
2883

It is the same function declared in file include/linux/printk.h, line 65, column 0xc - why can't the linker merge its type when producing debug output?
 
Golang tries to fix this problem by declaring types once and then referring to them from other units (and at the same time compressing debug sections with zlib). This is rather ironic, because Go binaries are typically several MB in size anyway (btw llvm-dwarfdump cannot process compressed sections)

 

Compiler-specific

This is pretty obvious - each programming language has some unique features and DWARF must deal with all of them
But just look at this:
 <0><b>: Abbrev Number: 1 (DW_TAG_compile_unit)
    <c>   DW_AT_name        : internal/cpu
    <19>   DW_AT_language    : 22       (Go)
    <1a>   DW_AT_stmt_list   : 0x0
    <1e>   DW_AT_low_pc      : 0x401000
    <26>   DW_AT_ranges      : 0x0
    <2a>   DW_AT_comp_dir    : .
    <2c>   DW_AT_producer    : Go cmd/compile go1.13.8
    <44>   Unknown AT value: 2905: cpu

I was unable to find the meaning of this custom attribute in the golang sources

 

Inconsistency

The DWARF specification doesn't define lots of important things. Just to name a few:
  • order of tags, so you can have a mix of formal parameters and types at the same nesting level
  • which attributes are mandatory for tags - I saw lots of missing DW_AT_sibling for example
  • when location info should be placed in the separate section .debug_loc - it seems that this happens for inlined subroutines only
  • encoding of addresses. You have DW_AT_low_pc for a function's address. But there are also DW_AT_abstract_origin (and DW_AT_specification). The same function can have different addresses even in plain C via these attributes:
     <1><191cde>: Abbrev Number: 194 (DW_TAG_subprogram)
        <191ce0>   DW_AT_external    : 1
        <191ce0>   DW_AT_name        : (indirect string, offset: 0x24d2f): perf_events_lapic_init
        <191ce4>   DW_AT_decl_file   : 1
        <191ce5>   DW_AT_decl_line   : 1719
        <191ce7>   DW_AT_decl_column : 6
        <191ce8>   DW_AT_prototyped  : 1
        <191ce8>   DW_AT_inline      : 1    (inlined)
     <1><19a945>: Abbrev Number: 96 (DW_TAG_subprogram)
        <19a946>   DW_AT_abstract_origin: <0x191cde>
        <19a94a>   DW_AT_low_pc      : 0xffffffff81004dc0
     <1><19b3c7>: Abbrev Number: 96 (DW_TAG_subprogram)
        <19b3c8>   DW_AT_abstract_origin: <0x191cde>
        <19b3cc>   DW_AT_low_pc      : 0xffffffff81007930
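
A tool that wants a function name for every address has to follow these references itself. Below is a minimal sketch, assuming the DIEs are already parsed into a dict of offset -> attributes. This layout and the name addresses_by_function are my own illustration, not part of any real library.

```python
def addresses_by_function(dies):
    """Map function name -> sorted list of DW_AT_low_pc values.

    `dies` is a hypothetical dict: DIE offset -> dict of attributes.
    """
    result = {}
    for offset, attrs in dies.items():
        if "low_pc" not in attrs:
            continue  # abstract (inlined) declaration, no code address
        # Follow DW_AT_abstract_origin / DW_AT_specification until a name shows up.
        current, visited = attrs, {offset}
        while "name" not in current:
            ref = current.get("abstract_origin", current.get("specification"))
            if ref is None or ref in visited or ref not in dies:
                break  # dangling or cyclic reference, give up on this DIE
            visited.add(ref)
            current = dies[ref]
        name = current.get("name", "<unnamed>")
        result.setdefault(name, []).append(attrs["low_pc"])
    return {name: sorted(addrs) for name, addrs in result.items()}

# The three DIEs from the dump above.
dies = {
    0x191cde: {"name": "perf_events_lapic_init", "inline": 1},
    0x19a945: {"abstract_origin": 0x191cde, "low_pc": 0xffffffff81004dc0},
    0x19b3c7: {"abstract_origin": 0x191cde, "low_pc": 0xffffffff81007930},
}
for name, addrs in addresses_by_function(dies).items():
    print(name, [hex(a) for a in addrs])
```

For the dump above this yields one name with two different addresses.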


 All of this leads us to the conclusion that DWARF is just

Dangerous

A true anti-debugging trick - what if the DW_AT_type attribute of a DW_TAG_pointer_type points to the same tag? How about a negative offset in DW_AT_sibling? I believe that this is a very rich area for fuzzing
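
A parser can defend itself against both tricks with very little code. This is a minimal sketch under my own assumptions: DIEs are kept in a dict of offset -> attributes, and the names type_chain and sibling_is_sane are hypothetical.

```python
def type_chain(dies, start):
    """Follow DW_AT_type links from `start`; raise on cycles or dangling refs."""
    chain, visited = [], set()
    offset = start
    while offset is not None:
        if offset in visited:
            raise ValueError(f"DW_AT_type cycle at {offset:#x}")
        if offset not in dies:
            raise ValueError(f"DW_AT_type points outside known DIEs: {offset:#x}")
        visited.add(offset)
        chain.append(offset)
        offset = dies[offset].get("type")
    return chain

def sibling_is_sane(die_offset, sibling_offset, unit_end):
    """A sibling must lie strictly after the current DIE and inside the unit."""
    return die_offset < sibling_offset < unit_end

# A pointer type whose DW_AT_type refers to itself.
evil = {0x10: {"tag": "pointer_type", "type": 0x10}}
try:
    type_chain(evil, 0x10)
except ValueError as err:
    print("rejected:", err)
```

The visited set costs memory proportional to the chain length. A simple depth limit would also work, if you accept rejecting very deep but legitimate chains.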

 

Features of dwarfdump