segfault in library / addr2line / objdump

Written by
Walter Doekes
Published on 2023-09-14

Yesterday, we spotted some SEGFAULTs on an Ubuntu/Focal server. We did not have core dumps, but the kernel message in dmesg was sufficient to find a culprit.

The observed messages were these:

nginx[854]: segfault at 6d702e746379 ip 00007ff40dc2f5a3 sp 00007fffd51c8420 error 4 in libperl.so.5.30.0[7ff40dbc7000+166000]
Code: 48 89 43 10 48 83 c4 18 5b 5d 41 5c 41 5d 41 5e 41 5f c3 0f 1f 40 00 0f b6 7f 30 48 c1 e8 03 48 29 f8 48 89 c3 74 89 48 8b 02 <4c> 8b 68 10 4d 85 ed 0f 84 28 01 00 00 0f b6 40 30 49 c1 ed 03 49

nginx[951947]: segfault at 10 ip 00007fba4a1645a3 sp 00007ffe57b0f8a0 error 4 in libperl.so.5.30.0 (deleted)[7fba4a0fc000+166000]
Code: 48 89 43 10 48 83 c4 18 5b 5d 41 5c 41 5d 41 5e 41 5f c3 0f 1f 40 00 0f b6 7f 30 48 c1 e8 03 48 29 f8 48 89 c3 74 89 48 8b 02 <4c> 8b 68 10 4d 85 ed 0f 84 28 01 00 00 0f b6 40 30 49 c1 ed 03 49

And after upgrading libperl5.30 from 5.30.0-9ubuntu0.3 to 5.30.0-9ubuntu0.4, we got these similar ones:

traps: nginx[955774] general protection fault ip:7f6af33345a3 sp:7ffe74310100 error:0 in libperl.so.5.30.0[7f6af32cc000+166000]

nginx[1049280]: segfault at 205bd ip 00007f5e60d265d9 sp 00007ffe7b5f08c0 error 4 in libperl.so.5.30.0[7f5e60cbe000+166000]
Code: 00 0f b6 40 30 49 c1 ed 03 49 29 c5 0f 84 17 01 00 00 48 8b 76 10 48 8b 52 10 4c 8d 3c fe 4c 8d 0c c2 84 c9 0f 84 c7 02 00 00 <49> 83 39 00 0f 85 ad 03 00 00 49 83 c1 08 49 83 ed 01 49 8d 74 1d

Apparently they were triggered by an nginx reload.

If we had a proper core dump, we could extract lots of useful info from it: where the crash occurred, which registers and variables were set, and the call chain (backtrace). With the info from above, we can at most get where the crash happened, and maybe which register had a bad value. But it is definitely better than nothing.

Feeding calculated offset to addr2line

For the most basic attempt, I found a box which still had libperl version 5.30.0-9ubuntu0.3. There I installed the matching perl-debug apt package (perl-debug_5.30.0-9ubuntu0.3_amd64.deb from https://launchpadlibrarian.net/). From the kernel message “nginx[854]: segfault at 6d702e746379 ip 00007ff40dc2f5a3 sp 00007fffd51c8420 error 4 in libperl.so.5.30.0[7ff40dbc7000+166000]” we take the instruction pointer 00007ff40dc2f5a3 and subtract the library starting position 7ff40dbc7000:

0x7ff40dc2f5a3 - 0x7ff40dbc7000 = 0x685a3
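The subtraction can be double-checked with a few lines of Python. This is only a sketch: the helper name relative_ip is made up for illustration, and the range check against the [start+size] part of the kernel message is an extra sanity test, not something the kernel message requires.

```python
def relative_ip(ip_hex, map_start_hex, map_size_hex):
    """Return ip relative to the start of the mapping named in the kernel message."""
    ip = int(ip_hex, 16)
    start = int(map_start_hex, 16)
    size = int(map_size_hex, 16)
    # The kernel prints the mapping as [start+size]; ip must lie within it.
    if not start <= ip < start + size:
        raise ValueError('ip is outside the reported mapping')
    return ip - start


# Values copied from the first kernel message above.
offset = relative_ip('00007ff40dc2f5a3', '7ff40dbc7000', '166000')
print(hex(offset))  # 0x685a3
```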

Feed that to addr2line and get the location of the crash... right?

$ addr2line -Cfe /usr/lib/x86_64-linux-gnu/libperl.so.5.30.0 685a3
Perl_vload_module
op.c:7750

At first glance that appears okay. But when we check what happens in the machine instructions there, it is not:

$ objdump -d /usr/lib/x86_64-linux-gnu/libperl.so.5.30.0 --disassemble=Perl_vload_module
...
00000000000685a3 <Perl_vload_module@@Base+0xd3>:
...
   6859b:       83 c0 08                add    $0x8,%eax
   6859e:       49 03 57 10             add    0x10(%r15),%rdx
   685a2:       41 89 07                mov    %eax,(%r15)
   685a5:       48 8b 0a                mov    (%rdx),%rcx
   685a8:       45 31 e4                xor    %r12d,%r12d
   685ab:       48 85 c9                test   %rcx,%rcx
...

There is no instruction start at 0x685a3!
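Checking this by eye gets tedious for longer listings. A rough sketch of an automated check follows; covering_instruction is a hypothetical helper (not part of any tool used here), and the sample lines are copied from the objdump listing above.

```python
import re

# Matches e.g. "   685a2:       41 89 07                mov    %eax,(%r15)"
insn_re = re.compile(r'^\s*([0-9a-f]+):\s+((?:[0-9a-f]{2}(?:\s|$))+)')


def covering_instruction(listing_lines, addr):
    """Return (start, length) of the instruction whose bytes include addr."""
    for line in listing_lines:
        m = insn_re.match(line)
        if not m:
            continue  # function headers, "..." and blank lines
        start = int(m.group(1), 16)
        length = len(m.group(2).split())
        if start <= addr < start + length:
            return start, length
    return None


listing = [
    '   6859e:       49 03 57 10             add    0x10(%r15),%rdx',
    '   685a2:       41 89 07                mov    %eax,(%r15)',
    '   685a5:       48 8b 0a                mov    (%rdx),%rcx',
]
start, length = covering_instruction(listing, 0x685a3)
# start differs from 0x685a3, so the address points into the middle
# of the 3-byte instruction that begins at 0x685a2.
print(hex(start), length)
```

If the returned start equals the address that was asked for, the address is a proper instruction boundary; otherwise it falls inside an instruction, as it does here.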

Searching for machine code inside a binary

What if we simply look for the instructions as shown in the Code: message?

To this end, I hacked together a script that does the following:

Spawn a copy of objdump to disassemble the binary;
look for the instructions as passed on the command line;
display where the instructions are found.

The objdump-find-instructions.py script:

#!/usr/bin/env python3
import re
import subprocess
import sys

# Look for these:
# >   19640c: 48 89 44 24 28        mov    %rax,0x28(%rsp)
# >   196411: 31 c0                 xor    %eax,%eax
# >   196413: 48 85 db              test   %rbx,%rbx
code_re = re.compile(
    br'^\s+(?P<addr>[0-9a-f]+):(?P<code>(\s[0-9a-f]{2})+)\s+'
    br'(?P<decoded>.*)')
code_without_decoded_re = re.compile(
    br'^\s+(?P<addr>[0-9a-f]+):(?P<code>(\s[0-9a-f]{2})+)\s*$')

# Look for these:
# > 000000000004ea40 <Perl_ck_concat@@Base>:
func_re = re.compile(br'^(?P<addr>[0-9a-f]+) <(?P<name>[^<>]*)>:')

# Look for blanks:
blank_re = re.compile(br'^\s*$')

# Lines to ignore:
ignore_re = re.compile(
    br'^/.*:\s+file format |^Disassembly of section ')

def to_bin(binstr_array):
    return bytes([int(i, 16) for i in binstr_array])

def to_hex(binarray):
    return ' '.join('{:02x}'.format(i) for i in binarray)

# Get executable/binary from argv
executable = sys.argv[1]  # /usr/lib/x86_64-linux-gnu/libperl.so.5.30

# Get needle from argv
needle = [i.encode() for i in sys.argv[2:]]  # ['48', '89', '44', '24', '28']
needle_len = len(needle)
assert needle_len >= 2, 'must specify XX XX XX bytes to search for'
needle_bin = to_bin(needle)
MAX_BUF = needle_len + 30

class Matcher:
    def search(self, haystack, regex):
        self.match = regex.search(haystack)
        if self.match:
            self.dict = self.match.groupdict()
        return self.match

    def get(self, key):
        return self.dict[key]

# Execute
proc = subprocess.Popen(
    ['/usr/bin/objdump', '-d', executable], stdout=subprocess.PIPE)

# Parse
code_bin = bytearray()
last_func = None
last_addr = None
matcher = Matcher()
for line in proc.stdout:
    line = line.rstrip()
    if matcher.search(line, blank_re):
        last_func = None
        last_addr = None
    elif matcher.search(line, func_re):
        last_func = matcher.get('name')
        last_addr = matcher.get('addr')
    elif (matcher.search(line, code_re) or
            matcher.search(line, code_without_decoded_re)):
        new_code_bin = to_bin(matcher.get('code').lstrip().split())
        code_bin.extend(new_code_bin)
        code_bin = code_bin[-MAX_BUF:]  # truncate early

        # This contains search on binary is pretty fast, compared to doing
        # sub-array comparisons.
        # real  0m9.873s  --> 0m4.000s
        # user  0m12.637s --> 0m6.624s
        if needle_bin in code_bin:
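            print(
                last_addr.decode(), last_func.decode(),
                matcher.get('addr').decode(),
                # 'decoded' is missing when the line matched
                # code_without_decoded_re (bytes only, no mnemonic).
                matcher.dict.get('decoded', b'').decode(),
                to_hex(new_code_bin))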
            # print('//', to_hex(code_bin))
            assert needle_len > 1
            code_bin = code_bin[(-needle_len + 1):]  # skip same results
    elif matcher.search(line, ignore_re):
        pass
    else:
        print('discarding', line)
        sys.exit(2)

The script is invoked like this:

$ python3 objdump-find-instructions.py PATH_TO_BINARY INSTRUCTIONS...

We include all instructions up to and including the <4c> and invoke it like this:

$ python3 objdump-find-instructions.py \
    /usr/lib/x86_64-linux-gnu/libperl.so.5.30.0 \
    48 89 43 10 48 83 c4 18 5b 5d 41 5c 41 5d 41 \
    5e 41 5f c3 0f 1f 40 00 0f b6 7f 30 48 c1 e8 \
    03 48 29 f8 48 89 c3 74 89 48 8b 02 4c

It spews out this one line:

00000000000b0500 Perl__invlist_intersection_maybe_complement_2nd@@Base
  b05a3 mov    0x10(%rax),%r13 4c 8b 68 10

That contains the following info:

The function Perl__invlist_intersection_maybe_complement_2nd@@Base starts at 00000000000b0500.
At 0xb05a3 there is a mov 0x10(%rax),%r13 instruction.
That instruction is 4c 8b 68 10 in machine code.

That instruction corresponds with the position in the Code: log line.

Code: [...  8b 02] <4c> 8b 68 10 [4d 85 ...]
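The byte list passed on the command line can be cut out of the Code: line mechanically. The sketch below assumes the kernel format shown in this post, where exactly one byte is wrapped in angle brackets; needle_from_code_line is a made-up helper name.

```python
def needle_from_code_line(code_line):
    """Return the bytes up to and including the <xx> marked byte."""
    tokens = code_line.split()
    if tokens and tokens[0] == 'Code:':
        tokens = tokens[1:]
    needle = []
    for token in tokens:
        if token.startswith('<') and token.endswith('>'):
            needle.append(token[1:-1])  # the byte at the instruction pointer
            return needle
        needle.append(token)
    raise ValueError('no <xx> marker found in Code: line')


code_line = (
    'Code: 48 89 43 10 48 83 c4 18 5b 5d 41 5c 41 5d 41 5e 41 5f c3 '
    '0f 1f 40 00 0f b6 7f 30 48 c1 e8 03 48 29 f8 48 89 c3 74 89 '
    '48 8b 02 <4c> 8b 68 10 4d 85 ed 0f 84 28 01 00 00 0f b6 40 30 '
    '49 c1 ed 03 49')
print(' '.join(needle_from_code_line(code_line)))
```

The printed bytes can be pasted after the binary path when invoking objdump-find-instructions.py.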

This looks like a much better candidate than the Perl_vload_module we got from addr2line. The reading of 0x10(%rax) matches the second crash perfectly: if the %rax register is 0 — a common value — then this would produce a segfault at 10.
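The same reasoning can be run backwards to guess what %rax held in each crash. This sketch assumes the reported fault address is exactly the operand address %rax + 0x10; the helper name is made up. It also prints the register value as little-endian bytes, purely as a hint: a value that reads as printable text might mean string data was used as a pointer, but that is speculation and not something established here.

```python
DISPLACEMENT = 0x10  # from "mov 0x10(%rax),%r13"


def rax_from_fault_address(fault_addr):
    """Guess %rax, assuming the fault address is %rax + DISPLACEMENT."""
    return fault_addr - DISPLACEMENT


# Fault addresses from the first two kernel messages.
for fault_addr in (0x6d702e746379, 0x10):
    rax = rax_from_fault_address(fault_addr)
    # Show how the register value would look as bytes in memory.
    print(hex(rax), rax.to_bytes(8, 'little'))
```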

Getting the surrounding code from objdump:

$ objdump -d /usr/lib/x86_64-linux-gnu/libperl.so.5.30.0 --start-address=0xb0500
...
00000000000b0500 <Perl__invlist_intersection_maybe_complement_2nd@@Base>:
...
   b059e:       74 89                   je     b0529 <Perl__invlist_intersection_maybe_complement_2nd@@Base+0x29>
   b05a0:       48 8b 02                mov    (%rdx),%rax
   b05a3:       4c 8b 68 10             mov    0x10(%rax),%r13
   b05a7:       4d 85 ed                test   %r13,%r13
   b05aa:       0f 84 28 01 00 00       je     b06d8 <Perl__invlist_intersection_maybe_complement_2nd@@Base+0x1d8>
...

Offset 0x48000

I was confident that this is the right crash location. And because Perl did have a problem with the code in this vicinity, it was easy to file a bug report (lp2035339).

But I could not explain yet why the calculated offset of 0x685a3 is off. The difference between 0x685a3 and 0xb05a3 is 0x48000.

A bit of poking around the binary did turn up this:

$ objdump -p  /usr/lib/x86_64-linux-gnu/libperl.so.5.30.0
...
Dynamic Section:
  NEEDED               libdl.so.2
  NEEDED               libm.so.6
  NEEDED               libpthread.so.0
  NEEDED               libc.so.6
  NEEDED               libcrypt.so.1
  SONAME               libperl.so.5.30
  INIT                 0x0000000000048000
  FINI                 0x00000000001ad6b4
...

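The machine instructions reside between 0x48000 and 0x1ad6b4. That's where we got the extra 0x48000 we need: the range in the kernel message ([7ff40dbc7000+166000]) describes the mapping that holds the executable code, not the start of the library file. The numbers agree, as 0x48000 + 0x166000 = 0x1ae000, which is the first page boundary after FINI.

So, next time we do an addr2line lookup of a library, we should check the INIT offset, and add that to the calculated instruction pointer position. (INIT is really the address of the init function, which here sits at the very start of the code. The offset of the executable LOAD segment in the readelf -l output should be the more direct source for this number.)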
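The 0x48000 can probably also be read straight from the ELF program headers instead of from INIT. The sketch below rests on assumptions that this post does not verify: that the mapping named in the kernel message is the executable PT_LOAD segment, that the segment is page-aligned, and that the file is a little-endian ELF64 object. The function names are invented for illustration.

```python
import struct

PT_LOAD = 1  # loadable segment
PF_X = 1     # segment is executable


def executable_segment(elf_bytes):
    """Return (p_offset, p_vaddr) of the first executable PT_LOAD segment."""
    if elf_bytes[:4] != b'\x7fELF' or elf_bytes[4] != 2 or elf_bytes[5] != 1:
        raise ValueError('expected a little-endian ELF64 file')
    # ELF64 header: e_phoff at byte 32, e_phentsize and e_phnum at byte 54.
    (e_phoff,) = struct.unpack_from('<Q', elf_bytes, 32)
    e_phentsize, e_phnum = struct.unpack_from('<HH', elf_bytes, 54)
    for i in range(e_phnum):
        # Each program header starts with p_type, p_flags, p_offset, p_vaddr.
        p_type, p_flags, p_offset, p_vaddr = struct.unpack_from(
            '<IIQQ', elf_bytes, e_phoff + i * e_phentsize)
        if p_type == PT_LOAD and p_flags & PF_X:
            return p_offset, p_vaddr
    raise ValueError('no executable PT_LOAD segment found')


def file_address(ip, map_start, elf_bytes):
    """Translate a runtime ip into an address to try with addr2line."""
    _p_offset, p_vaddr = executable_segment(elf_bytes)
    return ip - map_start + p_vaddr


# Usage (path as used elsewhere in this post):
# with open('/usr/lib/x86_64-linux-gnu/libperl.so.5.30.0', 'rb') as f:
#     print(hex(file_address(0x7ff40dc2f5a3, 0x7ff40dbc7000, f.read())))
```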

Check with newer version

After upgrading both libperl and perl-debug on the test box, we could confirm that the latest crashes were caused by the same problem.

From “traps: nginx[955774] general protection fault ip:7f6af33345a3 sp:7ffe74310100 error:0 in libperl.so.5.30.0[7f6af32cc000+166000]” and the INIT offset of 0x48000 we get 0xb05a3 and from “nginx[1049280]: segfault at 205bd ip 00007f5e60d265d9 sp 00007ffe7b5f08c0 error 4 in libperl.so.5.30.0[7f5e60cbe000+166000]” we get 0xb05d9.
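Both kernel message formats carry the instruction pointer and the [start+size] mapping, so one small parser can handle either. This is a sketch under the assumptions of this post: the regular expressions are fitted to the messages shown here only, and the default of 0x48000 is specific to this libperl build. The name addr2line_address is made up.

```python
import re

# "ip 00007f..." in segfault lines, "ip:7f..." in traps lines.
ip_re = re.compile(r'\bip[: ]([0-9a-f]+)')
# "[<start>+<size>]"; the "[<pid>]" part has no "+" and is skipped.
map_re = re.compile(r'\[([0-9a-f]+)\+([0-9a-f]+)\]')

TEXT_OFFSET = 0x48000  # the INIT value found above


def addr2line_address(kernel_line, text_offset=TEXT_OFFSET):
    """Return the address to feed to addr2line for one kernel message."""
    ip = int(ip_re.search(kernel_line).group(1), 16)
    map_start = int(map_re.search(kernel_line).group(1), 16)
    return ip - map_start + text_offset


lines = [
    'traps: nginx[955774] general protection fault ip:7f6af33345a3 '
    'sp:7ffe74310100 error:0 in libperl.so.5.30.0[7f6af32cc000+166000]',
    'nginx[1049280]: segfault at 205bd ip 00007f5e60d265d9 '
    'sp 00007ffe7b5f08c0 error 4 in libperl.so.5.30.0[7f5e60cbe000+166000]',
]
for line in lines:
    print(hex(addr2line_address(line)))  # 0xb05a3, then 0xb05d9
```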

addr2line gives us:

$ addr2line -Cfe /usr/lib/x86_64-linux-gnu/libperl.so.5.30.0 b05a3 b05d9
Perl__invlist_intersection_maybe_complement_2nd
invlist_inline.h:51
Perl__invlist_intersection_maybe_complement_2nd
regcomp.c:9841

Both in Perl__invlist_intersection_maybe_complement_2nd. Same problem.

general protection fault vs. segfault

Lastly, why did we get a “traps: ... general protection fault ... error:0” for one crash and “segfault at ... ip ... error 4” for the others?

I'm not entirely sure. As far as I can gather, this could be the difference between the segmentation violation happening while running in kernel mode versus running in user mode. Read as page fault error codes, 0 vs. 4 would indicate as much. (The snippet below is from arch/x86/include/asm/trap_pf.h.)

/*
 * Page fault error code bits:
 *
 *   bit 0 ==    0: no page found       1: protection fault
 *   bit 1 ==    0: read access         1: write access
 *   bit 2 ==    0: kernel-mode access  1: user-mode access
 *   bit 3 ==                           1: use of reserved bit detected
 *   bit 4 ==                           1: fault was an instruction fetch
 *   bit 5 ==                           1: protection keys block access
 *   bit 15 ==                          1: SGX MMU page-fault
 */
enum x86_pf_error_code {
        X86_PF_PROT     =               1 << 0,
        X86_PF_WRITE    =               1 << 1,
        X86_PF_USER     =               1 << 2,
        X86_PF_RSVD     =               1 << 3,
        X86_PF_INSTR    =               1 << 4,
        X86_PF_PK       =               1 << 5,
        X86_PF_SGX      =               1 << 15,
};

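But maybe it has a different reason, like the specific memory location that was tried (we don't see it in this message). One candidate I am aware of: on x86-64, accessing a non-canonical address raises a general protection fault instead of a page fault, in which case the error code would not be a page fault error code at all. Let me know if you know!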
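For reading the error values in the segfault lines, the bit layout from trap_pf.h can be turned into a small decoder. This is a sketch: decode_pf_error is a made-up name, and the table only covers the bits listed in the snippet above. It applies to page fault error codes; whether the error:0 in the traps line follows the same layout is exactly the open question.

```python
# (bit, meaning when set, meaning when clear), from the trap_pf.h comment.
PF_BITS = [
    (1 << 0, 'protection fault', 'no page found'),
    (1 << 1, 'write access', 'read access'),
    (1 << 2, 'user-mode access', 'kernel-mode access'),
]
# Bits that only carry a meaning when set.
PF_FLAGS = [
    (1 << 3, 'use of reserved bit detected'),
    (1 << 4, 'fault was an instruction fetch'),
    (1 << 5, 'protection keys block access'),
    (1 << 15, 'SGX MMU page-fault'),
]


def decode_pf_error(code):
    """Describe a page fault error code as a list of strings."""
    parts = [if_set if code & bit else if_clear
             for bit, if_set, if_clear in PF_BITS]
    parts += [name for bit, name in PF_FLAGS if code & bit]
    return parts


print(decode_pf_error(4))
# ['no page found', 'read access', 'user-mode access']
```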
