ART Virtual Machine | How SIGSEGV Is Handled in Android Apps

This analysis is based on Android R (11).

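SIGSEGV is signal 11, raised on an invalid memory access. Once raised, the signal is delivered to user space for handling: in a purely native process it is handled by debuggerd_signal_handler, while in an app process (zygote and its children) it is handled by SignalChain::Handler.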

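Compared with a purely native process, an app process adds a layer of wrapping and dispatch, mainly to detect NPE (NullPointerException) and SOE (StackOverflowError) in the Java world. Java code runs in one of two modes: interpreted, or compiled to machine code. Interpreted execution does not raise SIGSEGV, because the operands of each instruction can be checked before the instruction is interpreted, so NPE and SOE are thrown as soon as a check fails. Compiled code executes machine instructions directly, with no check before each ldr/str, so it can raise SIGSEGV.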
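The contrast between the two modes can be sketched in a few lines of C++. This is only an illustration: SketchObject, SketchStatus, InterpretFieldGet and CompiledFieldGet are hypothetical names, not ART code.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical miniature object and status codes, for illustration only.
struct SketchObject {
  int32_t field;
};

enum class SketchStatus { kOk, kNullPointerException };

// Interpreter style: the operand is validated before it is dereferenced,
// so a null reference turns into a status code and never into a fault.
SketchStatus InterpretFieldGet(const SketchObject* obj, int32_t* out) {
  assert(out != nullptr);
  if (obj == nullptr) {
    return SketchStatus::kNullPointerException;
  }
  *out = obj->field;
  return SketchStatus::kOk;
}

// Compiled-code style: no check is emitted before the load. With a null
// `obj` this load would fault, and a fault handler would have to
// recognize the fault as an NPE after the fact.
int32_t CompiledFieldGet(const SketchObject* obj) {
  return obj->field;
}
```

The second function is cheaper on the common path, which is why the fault handler described in the rest of the article is needed.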

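The sections below walk through the source code for handler registration and for dispatch.

Registering the signal handlers

Every Android app process is forked from zygote, so the disposition of each signal is inherited from zygote as well.

zygote itself comes from a child process forked by init that executes the app_process binary. We usually think of main() as the entry point of an executable, but main is only the logical entry point. When the exec system call happens, control actually goes first to the _start entry of /system/bin/linker64; only after the linker has started up is main called.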

ENTRY(_start)
  // Force unwinds to end in this function.
  .cfi_undefined x30

  mov x0, sp
  bl __linker_init

  /* linker init returns the _entry address in the main image */
  br x0
END(_start)

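__linker_init eventually calls linker_debuggerd_init(), which registers debuggerd_signal_handler as the handler for SIGSEGV. So the process's first SIGSEGV registration happens while the linker bootstraps itself, earlier than app_process's main function runs.

Once zygote is running it has to start the ART virtual machine. Runtime::Init initializes the global variable fault_manager and registers the NPE and SOE handlers.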

// Dex2Oat's Runtime does not need the signal chain or the fault handler.
if (implicit_null_checks_ || implicit_so_checks_ || implicit_suspend_checks_) {
  fault_manager.Init();

  // These need to be in a specific order.  The null point check handler must be
  // after the suspend check and stack overflow check handlers.
  //
  // Note: the instances attach themselves to the fault manager and are handled by it. The
  //       manager will delete the instance on Shutdown().
  if (implicit_suspend_checks_) {
    new SuspensionHandler(&fault_manager);
  }

  if (implicit_so_checks_) {
    new StackOverflowHandler(&fault_manager);
  }

  if (implicit_null_checks_) {
    new NullPointerHandler(&fault_manager);
  }

  if (kEnableJavaStackTraceHandler) {
    new JavaStackTraceHandler(&fault_manager);
  }
}

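During Init, the global fault_manager registers SIGSEGV again: a pointer to the previously registered handler, debuggerd_signal_handler, is saved in the action_ field, and SignalChain::Handler is installed as the new handler.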

void Register(int signo) {
    struct sigaction64 handler_action = {};
    sigfillset64(&handler_action.sa_mask);
    ...
    handler_action.sa_sigaction = SignalChain::Handler;
    handler_action.sa_flags = SA_RESTART | SA_SIGINFO | SA_ONSTACK |
                              SA_UNSUPPORTED | SA_EXPOSE_TAGBITS;
    linked_sigaction64(signo, &handler_action, &action_);

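One detail in the code above deserves attention: registration goes through linked_sigaction64 rather than sigaction64. By default sigaction64 is implemented by libc, but libsigchain implements its own sigaction64, keeps libc's sigaction64 as linked_sigaction64, and thereby shadows the libc version. Any later call to sigaction64 from an app's shared libraries therefore ends up in libsigchain.

The point of this is to keep registrations made by an app's shared libraries from interfering with NPE and SOE detection.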
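The idea can be tried out with a small sketch on any POSIX system. It assumes a single claimed signal and uses SIGUSR1 instead of SIGSEGV so that it is harmless to run; ClaimSignal, SketchSigaction, ChainHandler and the counters are hypothetical names, not the libsigchain API.

```cpp
#include <cassert>
#include <signal.h>

// Hypothetical names throughout; this is not the libsigchain API.
using SketchSpecialHandler = bool (*)(int, siginfo_t*, void*);

SketchSpecialHandler g_special_handler = nullptr;  // the runtime's handler, tried first
struct sigaction g_user_action = {};               // what the app asked for
volatile sig_atomic_t g_special_calls = 0;
volatile sig_atomic_t g_user_calls = 0;

// The only handler the kernel ever sees for the claimed signal.
void ChainHandler(int signo, siginfo_t* info, void* context) {
  if (g_special_handler != nullptr && g_special_handler(signo, info, context)) {
    return;  // the runtime consumed the signal
  }
  // Forward to the recorded user handler (SA_SIGINFO style only, for brevity).
  if ((g_user_action.sa_flags & SA_SIGINFO) != 0 && g_user_action.sa_sigaction != nullptr) {
    g_user_action.sa_sigaction(signo, info, context);
  }
}

// Claims `signo`: installs ChainHandler and remembers the previous action.
bool ClaimSignal(int signo, SketchSpecialHandler special) {
  g_special_handler = special;
  struct sigaction chain = {};
  sigfillset(&chain.sa_mask);
  chain.sa_sigaction = ChainHandler;
  chain.sa_flags = SA_SIGINFO | SA_RESTART;
  return sigaction(signo, &chain, &g_user_action) == 0;
}

// Stand-in for the interposed sigaction(): it records the app's request and
// reports the previously recorded one, but never replaces ChainHandler.
int SketchSigaction(int /*signo*/, const struct sigaction* new_action,
                    struct sigaction* old_action) {
  if (old_action != nullptr) *old_action = g_user_action;
  if (new_action != nullptr) g_user_action = *new_action;
  return 0;
}

// Example handlers: the special handler declines, so the app handler runs.
bool DecliningSpecialHandler(int, siginfo_t*, void*) {
  g_special_calls = g_special_calls + 1;
  return false;
}
void AppHandler(int, siginfo_t*, void*) { g_user_calls = g_user_calls + 1; }
```

After ClaimSignal(SIGUSR1, DecliningSpecialHandler) and a SketchSigaction call that registers AppHandler with SA_SIGINFO, raise(SIGUSR1) runs the special handler first and the app handler second, even though the app "registered" last.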

Below is an excerpt from libsigchain's Android.bp; the -z,global linker option is what causes libc's sigaction symbol to be shadowed.

//  Make libsigchain symbols global, so that an app library which
//  is loaded in a classloader linker namespace looks for
//  libsigchain symbols before libc.
//  -z,global marks the binary with the DF_1_GLOBAL flag which puts the symbols
//  in the global group. It does not affect their visibilities like the version
//  script does.
ldflags: ["-Wl,-z,global"],

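In current Android versions, implicit_null_checks_ and implicit_so_checks_ are on by default, while implicit_suspend_checks_ and kEnableJavaStackTraceHandler are off by default.

StackOverflowHandler and NullPointerHandler both inherit from FaultHandler, and on construction each registers its Action method in the generated_code_handlers_ array. StackOverflowHandler, for example, registers StackOverflowHandler::Action. The array is named "generated code" because an APK initially contains only dex files; machine code is produced on the device by dex2oat, and that produced machine code is what is called "generated code" here.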
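The "attach on construction" pattern can be sketched as follows. The class names are hypothetical, and ownership through std::unique_ptr is an assumption chosen to mirror the source comment that the manager deletes the instances.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

class SketchFaultManager;

// Base class: every handler attaches itself to the manager when constructed.
class SketchFaultHandler {
 public:
  SketchFaultHandler(SketchFaultManager* manager, std::string name);
  virtual ~SketchFaultHandler() = default;
  // Returns true when the fault belongs to this handler.
  virtual bool Action(unsigned long fault_addr) = 0;
  const std::string& name() const { return name_; }

 private:
  std::string name_;
};

class SketchFaultManager {
 public:
  // Takes ownership of the handler and keeps registration order.
  void AddHandler(SketchFaultHandler* handler) {
    generated_code_handlers_.emplace_back(handler);
  }
  const std::vector<std::unique_ptr<SketchFaultHandler>>& handlers() const {
    return generated_code_handlers_;
  }

 private:
  std::vector<std::unique_ptr<SketchFaultHandler>> generated_code_handlers_;
};

inline SketchFaultHandler::SketchFaultHandler(SketchFaultManager* manager, std::string name)
    : name_(std::move(name)) {
  assert(manager != nullptr);
  manager->AddHandler(this);
}

class SketchSoHandler : public SketchFaultHandler {
 public:
  explicit SketchSoHandler(SketchFaultManager* m) : SketchFaultHandler(m, "stack overflow") {}
  bool Action(unsigned long) override { return false; }  // placeholder rule
};

class SketchNpeHandler : public SketchFaultHandler {
 public:
  explicit SketchNpeHandler(SketchFaultManager* m) : SketchFaultHandler(m, "null pointer") {}
  bool Action(unsigned long fault_addr) override { return fault_addr < 4096; }
};
```

Writing `new SketchSoHandler(&manager);` is then enough to register a handler, which matches the bare `new ...Handler(&fault_manager)` statements in Runtime::Init.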

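Dispatching the signal

SIGSEGV in an app process is dispatched as follows:

  1. SIGSEGV is received by SignalChain::Handler, together with the fault pc and fault address.
  2. First, every handler in generated_code_handlers_ is tried. These handlers are registered when the VM starts: one throws NPE, the other throws SOE. Each handler applies its own rules to decide whether the fault is of its kind. If so, it throws the Java exception and dispatch ends; if not, the next handler is tried.
  3. other_handlers_ is traversed next; by default this handler array is empty.
  4. debuggerd_signal_handler is called. The result is a tombstone file containing the call stacks of all threads, the memory map, and related information.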
Figure: dispatch flow
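The dispatch order can be modelled with plain C++ containers. This is a simplified sketch: SketchDispatcher and SketchDispatchFault are hypothetical, the check on whether the pc lies in generated code is left out, and the fallback only stands in for debuggerd_signal_handler.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Hypothetical fault description: just the two values passed along.
struct SketchDispatchFault {
  unsigned long pc;
  unsigned long address;
};

using SketchDispatchHandler = std::function<bool(const SketchDispatchFault&)>;

struct SketchDispatcher {
  std::vector<SketchDispatchHandler> generated_code_handlers;  // NPE and SOE detectors
  std::vector<SketchDispatchHandler> other_handlers;           // empty by default
  std::function<void(const SketchDispatchFault&)> fallback;    // stand-in for debuggerd

  // Returns a label naming the stage that consumed the fault.
  std::string Dispatch(const SketchDispatchFault& fault) const {
    for (const auto& handler : generated_code_handlers) {
      if (handler(fault)) return "generated_code_handler";
    }
    for (const auto& handler : other_handlers) {
      if (handler(fault)) return "other_handler";
    }
    if (fallback) fallback(fault);
    return "fallback";
  }
};
```

A fault that no handler recognizes falls through both arrays and reaches the fallback, which is the path that produces a tombstone on a real device.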

The role of SignalChain was outlined at the start of the article; the comment in the source code gives a more precise description.

// libsigchain provides an interception layer for signal handlers, to allow ART and others to give
// their signal handlers the first stab at handling signals before passing them on to user code.
//
// It implements wrapper functions for signal, sigaction, and sigprocmask, and a handler that
// forwards signals appropriately.

SignalChain::Handler first walks the handlers in special_handlers_, then calls the function stored in the action_ field.

void SignalChain::Handler(int signo, siginfo_t* siginfo, void* ucontext_raw) {
  // Try the special handlers first.
  // If one of them crashes, we'll reenter this handler and pass that crash onto the user handler.
  if (!GetHandlingSignal()) {
    for (const auto& handler : chains[signo].special_handlers_) {
      if (handler.sc_sigaction == nullptr) {
        break;
      }
      sigset_t previous_mask;
      linked_sigprocmask(SIG_SETMASK, &handler.sc_mask, &previous_mask);
      ScopedHandlingSignal restorer;
      SetHandlingSignal(true);
      if (handler.sc_sigaction(signo, siginfo, ucontext_raw)) {
        return;
      }
      linked_sigprocmask(SIG_SETMASK, &previous_mask, nullptr);
    }
  }

  // Forward to the user's signal handler.
  chains[signo].action_.sa_sigaction(signo, siginfo, ucontext_raw);
}

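Two points in the special_handlers_ loop deserve attention:

  1. Before a handler is called, the signal mask is switched to handler.sc_mask, and it is restored afterwards. For art_fault_handler, handler.sc_mask is set up as shown below. The listed signals are left unblocked so that a fault raised inside the handler itself can still be delivered.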
sigfillset(&mask);
sigdelset(&mask, SIGABRT);
sigdelset(&mask, SIGBUS);
sigdelset(&mask, SIGFPE);
sigdelset(&mask, SIGILL);
sigdelset(&mask, SIGSEGV);
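  2. SetHandlingSignal(true) is called before the handler runs. Combined with point 1, this makes a second entry into SignalChain::Handler skip ART's handlers, since a second entry usually means something went wrong inside ART's own handler.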
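The effect of the flag can be simulated without real signals. In the sketch below a nested call to ChainEntry plays the role of a fault raised inside the special handler; only the flag is modelled, not the signal mask, and all names are hypothetical.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-ins for the per-thread flag and a call trace.
thread_local bool g_handling_signal_sketch = false;
std::vector<std::string> g_chain_trace;  // records which handler ran, in order

// RAII helper: restores the flag's previous value on scope exit.
class ScopedHandlingFlagSketch {
 public:
  ScopedHandlingFlagSketch() : original_(g_handling_signal_sketch) {}
  ~ScopedHandlingFlagSketch() { g_handling_signal_sketch = original_; }

 private:
  bool original_;
};

void UserHandlerSketch() { g_chain_trace.push_back("user"); }

// Models the shape of SignalChain::Handler for one special handler.
void ChainEntry(const std::function<bool()>& special_handler) {
  if (!g_handling_signal_sketch) {
    ScopedHandlingFlagSketch restorer;
    g_handling_signal_sketch = true;
    g_chain_trace.push_back("special");
    if (special_handler()) {
      return;  // consumed by the special handler
    }
  }
  // Reached on a second entry, or when the special handler declined.
  UserHandlerSketch();
}
```

If the special handler re-enters ChainEntry, the nested entry sees the flag set and goes straight to the user handler, so the trace reads "special" followed by "user".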

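art_fault_handler calls FaultManager::HandleFault. It first checks whether the fault pc belongs to machine code compiled from Java; if so, it goes on to check for NPE and SOE, otherwise it skips generated_code_handlers_ and goes straight to other_handlers_.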

bool FaultManager::HandleFault(int sig, siginfo_t* info, void* context) {
  if (IsInGeneratedCode(info, context, true)) {
    VLOG(signals) << "in generated code, looking for handler";
    for (const auto& handler : generated_code_handlers_) {
      VLOG(signals) << "invoking Action on handler " << handler;
      if (handler->Action(sig, info, context)) {
        // We have handled a signal so it's time to return from the
        // signal handler to the appropriate place.
        return true;
      }
    }
  }
  // We hit a signal we didn't handle.  This might be something for which
  // we can give more information about so call all registered handlers to
  // see if it is.
  if (HandleFaultByOtherHandlers(sig, info, context)) {
    return true;
  }
  return false;
}

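IsInGeneratedCode works as follows. If the current thread is in the Runnable state and holds the mutator lock, a reader-writer lock (meaning it may operate on the Java heap), the thread is very likely running compiled Java code. The ArtMethod is then located using the layout of the Java stack (the ArtMethod pointer is stored at the top of the stack), and the fault pc is checked against the ArtMethod's code range; if it falls inside, that further confirms the code is generated code.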

// This function is called within the signal handler.  It checks that
// the mutator_lock is held (shared).  No annotalysis is done.
bool FaultManager::IsInGeneratedCode(siginfo_t* siginfo, void* context, bool check_dex_pc) {
  // We can only be running Java code in the current thread if it
  // is in Runnable state.
  Thread* thread = Thread::Current();
  ThreadState state = thread->GetState();
  if (state != kRunnable) {
    VLOG(signals) << "not runnable";
    return false;
  }
  // Current thread is runnable.
  // Make sure it has the mutator lock.
  if (!Locks::mutator_lock_->IsSharedHeld(thread)) {
    VLOG(signals) << "no lock";
    return false;
  }

  ArtMethod* method_obj = nullptr;
  uintptr_t return_pc = 0;
  uintptr_t sp = 0;
  bool is_stack_overflow = false;

  // Get the architecture specific method address and return address.  These
  // are in architecture specific files in arch/<arch>/fault_handler_<arch>.
  GetMethodAndReturnPcAndSp(siginfo, context, &method_obj, &return_pc, &sp, &is_stack_overflow);

  const OatQuickMethodHeader* method_header = method_obj->GetOatQuickMethodHeader(return_pc);  // returns nullptr if the pc is outside the ArtMethod's code

  if (method_header == nullptr) {
    VLOG(signals) << "no compiled code";
    return false;
  }

  uint32_t dexpc = method_header->ToDexPc(reinterpret_cast<ArtMethod**>(sp), return_pc, false);
  return !check_dex_pc || dexpc != dex::kDexNoIndex;
}

The detection rules for NPE and SOE are described next.

How NullPointerException is detected

NullPointerException detection goes through NullPointerHandler::Action.

bool NullPointerHandler::Action(int sig ATTRIBUTE_UNUSED, siginfo_t* info, void* context) {
  if (!IsValidImplicitCheck(info)) {
    return false;
  }
  // The code that looks for the catch location needs to know the value of the
  // PC at the point of call.  For Null checks we insert a GC map that is immediately after
  // the load/store instruction that might cause the fault.

  struct ucontext *uc = reinterpret_cast<struct ucontext*>(context);
  struct sigcontext *sc = reinterpret_cast<struct sigcontext*>(&uc->uc_mcontext);

  // Push the gc map location to the stack and pass the fault address in LR.
  sc->sp -= sizeof(uintptr_t);
  *reinterpret_cast<uintptr_t*>(sc->sp) = sc->pc + 4;
  sc->regs[30] = reinterpret_cast<uintptr_t>(info->si_addr);

  sc->pc = reinterpret_cast<uintptr_t>(art_quick_throw_null_pointer_exception_from_signal);
  VLOG(signals) << "Generating null pointer exception";
  return true;
}

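The check is made by IsValidImplicitCheck, whose logic is simple: is the fault address less than one page? Why less than one page rather than equal to 0? Because in many cases what is accessed is a field or the vtable of an object, not the object itself. Both fields and the vtable sit at an offset from the object's start address, so if the start address is 0, the address actually accessed is a small offset value.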

static bool IsValidImplicitCheck(siginfo_t* siginfo) {
  // Our implicit NPE checks always limit the range to a page.
  // Note that the runtime will do more exhaustive checks (that we cannot
  // reasonably do in signal processing code) based on the dex instruction
  // faulting.
  return CanDoImplicitNullCheckOn(reinterpret_cast<uintptr_t>(siginfo->si_addr));
}
// Returns whether the given memory offset can be used for generating
// an implicit null check.
static inline bool CanDoImplicitNullCheckOn(uintptr_t offset) {
  return offset < kPageSize;
}
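
The offset argument can be checked with a little arithmetic. The sketch below assumes a 4 KiB page and a hypothetical object layout (SketchLayout); neither is taken from ART.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr uintptr_t kSketchPageSize = 4096;  // assumed page size

// Hypothetical object layout: a small header followed by two fields.
struct SketchLayout {
  uint32_t klass;
  uint32_t monitor;
  int32_t first_field;
  int32_t second_field;
};

// Address that a field access touches for a given object base address.
constexpr uintptr_t FieldAddress(uintptr_t object_base, size_t field_offset) {
  return object_base + field_offset;
}

// Same comparison as the page-size test in the excerpt above.
constexpr bool LooksLikeImplicitNullCheck(uintptr_t fault_addr) {
  return fault_addr < kSketchPageSize;
}
```

Reading second_field through a null reference touches address 12, well inside the first page. An access at offset 8192 from a null base would land outside it, so such an access could not be recognized by this test.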

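Once the fault is classified as an NPE, NullPointerHandler::Action modifies the pc in the saved context. We are inside a signal handler at this point; by default, returning from it makes the program re-execute the "faulting" instruction. If the pc in the saved context has been changed, however, execution resumes at the address the pc points to.

art_quick_throw_null_pointer_exception_from_signal does two things, covered in detail in the "How the exception is thrown" section. In brief:

  1. Create the Java-level NullPointerException object.
  2. Jump to the catch block that can catch the exception.

How StackOverflowError is detected

Before looking at the SOE detection rules, we need to understand how the stack is laid out in ART.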

Figure: stack layout

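The two pages at the very top of the stack cannot be read or written; any access there causes a memory fault. Stack growth happens inside functions, so the check has to be tied to function calls. On AArch64, every function call executes the instructions below, which load from sp-0x2000 into the zero register (the loaded value is discarded). If more than 2 pages of stack remain, sp-0x2000 still falls inside the accessible range; if less than 2 pages remain, sp-0x2000 falls in the inaccessible (red) region, and touching an inaccessible region raises SIGSEGV.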

sub x16, sp, #0x2000 (8192)
ldr wzr, [x16]

The actual check is therefore whether the fault address equals sp-0x2000. If it does, the SIGSEGV was produced by the code above, which means an SOE really occurred.

bool StackOverflowHandler::Action(int sig ATTRIBUTE_UNUSED, siginfo_t* info ATTRIBUTE_UNUSED,
                                  void* context) {
  struct ucontext *uc = reinterpret_cast<struct ucontext *>(context);
  struct sigcontext *sc = reinterpret_cast<struct sigcontext*>(&uc->uc_mcontext);
  VLOG(signals) << "stack overflow handler with sp at " << std::hex << &uc;
  VLOG(signals) << "sigcontext: " << std::hex << sc;

  uintptr_t sp = sc->sp;
  VLOG(signals) << "sp: " << std::hex << sp;

  uintptr_t fault_addr = sc->fault_address;
  VLOG(signals) << "fault_addr: " << std::hex << fault_addr;
  VLOG(signals) << "checking for stack overflow, sp: " << std::hex << sp <<
      ", fault_addr: " << fault_addr;

  uintptr_t overflow_addr = sp - GetStackOverflowReservedBytes(InstructionSet::kArm64);  // sp - 0x2000

  // Check that the fault address is the value expected for a stack overflow.
  if (fault_addr != overflow_addr) {
    VLOG(signals) << "Not a stack overflow";
    return false;
  }

  VLOG(signals) << "Stack overflow found";

  // Now arrange for the signal handler to return to art_quick_throw_stack_overflow.
  // The value of LR must be the same as it was when we entered the code that
  // caused this fault.  This will be inserted into a callee save frame by
  // the function to which this handler returns (art_quick_throw_stack_overflow).
  sc->pc = reinterpret_cast<uintptr_t>(art_quick_throw_stack_overflow);

  // The kernel will now return to the address in sc->pc.
  return true;
}
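
The arithmetic behind the probe and the handler's comparison can be sketched as below. The addresses used with it are made-up illustrations, and kSketchReservedBytes simply repeats the 0x2000 from the text.

```cpp
#include <cassert>
#include <cstdint>

constexpr uintptr_t kSketchReservedBytes = 0x2000;  // two 4 KiB pages, as in the text

// Address touched by the probe executed at function entry.
constexpr uintptr_t ProbeAddress(uintptr_t sp) {
  return sp - kSketchReservedBytes;
}

// True when the probe lands below the lowest accessible stack address,
// i.e. inside the protected region (the stack grows downwards).
constexpr bool ProbeFaults(uintptr_t sp, uintptr_t lowest_accessible) {
  return ProbeAddress(sp) < lowest_accessible;
}

// Handler-side test: the fault counts as a stack overflow only if the
// faulting address is exactly where the probe for this sp would land.
constexpr bool IsStackOverflowFault(uintptr_t sp, uintptr_t fault_addr) {
  return fault_addr == ProbeAddress(sp);
}
```

With a hypothetical lowest accessible address of 0x70002000, an sp of 0x70005000 probes 0x70003000 and is fine, while an sp of 0x70003ff0 probes 0x70001ff0 and faults; the handler then sees fault_addr equal to sp-0x2000 and classifies the fault as an SOE.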

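Once the SOE check passes, execution continues in art_quick_throw_stack_overflow after the handler returns.

How the exception is thrown

After the NPE check passes, the following code runs.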

ENTRY art_quick_throw_null_pointer_exception_from_signal
    // The fault handler pushes the gc map address, i.e. "return address", to stack
    // and passes the fault address in LR. So we need to set up the CFI info accordingly.
    .cfi_def_cfa_offset __SIZEOF_POINTER__
    .cfi_rel_offset lr, 0
    // Save all registers as basis for long jump context.
    INCREASE_FRAME (FRAME_SIZE_SAVE_EVERYTHING - __SIZEOF_POINTER__)
    SAVE_REG x29, (FRAME_SIZE_SAVE_EVERYTHING - 2 * __SIZEOF_POINTER__)  // LR already saved.
    SETUP_SAVE_EVERYTHING_FRAME_DECREMENTED_SP_SKIP_X29_LR
    mov x0, lr                        // pass the fault address stored in LR by the fault handler.
    mov x1, xSELF                     // pass Thread::Current.
    bl  artThrowNullPointerExceptionFromSignal  // (arg, Thread*).
    brk 0
END art_quick_throw_null_pointer_exception_from_signal
extern "C" NO_RETURN void artThrowNullPointerExceptionFromSignal(uintptr_t addr, Thread* self)
    REQUIRES_SHARED(Locks::mutator_lock_) {
  ScopedQuickEntrypointChecks sqec(self);
  ThrowNullPointerExceptionFromDexPC(/* check_address= */ true, addr);
  self->QuickDeliverException();
}

After the SOE check passes, the following code eventually runs.

extern "C" NO_RETURN void artThrowStackOverflowFromCode(Thread* self)
    REQUIRES_SHARED(Locks::mutator_lock_) {
  ScopedQuickEntrypointChecks sqec(self);
  ThrowStackOverflowError(self);
  self->QuickDeliverException();
}

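ThrowNullPointerExceptionFromDexPC and ThrowStackOverflowError both construct a Throwable object in the Java world; one constructs a NullPointerException, the other a StackOverflowError. The constructed Throwable carries two key pieces of information: the message string and the call stack. The object is stored in the thread->tlsPtr_.exception field so that the rest of the thread can retrieve it.

Next we look closely at QuickDeliverException, whose job is to jump to the matching catch block.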

void Thread::QuickDeliverException() {
  // Get exception from thread.
  ObjPtr<mirror::Throwable> exception = GetException();
  // Don't leave exception visible while we try to find the handler, which may cause class
  // resolution.
  ClearException();
  QuickExceptionHandler exception_handler(this, false);
  exception_handler.FindCatch(exception);
  exception_handler.DoLongJump();
}

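FindCatch first locates two pieces of information.

  1. The frame containing the catch block that can catch the exception; its sp is recorded.
  2. The start address of that catch block; either the machine-code address pc or the bytecode address dex_pc is recorded.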
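
The search for these two values can be pictured as a walk over frames from the innermost one outwards. The sketch below is a strong simplification under stated assumptions: frames are a plain vector, a frame catches at most one exception type, and types match by exact name. SketchFrame, SketchCatchTarget and FindCatchSketch are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical frame description; real ART walks the managed stack instead.
struct SketchFrame {
  uintptr_t sp;          // frame address
  std::string catches;   // exception type this frame can catch, "" if none
  uintptr_t handler_pc;  // start of the catch block when `catches` is set
};

// The two values recorded for the later jump.
struct SketchCatchTarget {
  uintptr_t sp;
  uintptr_t pc;
};

// Walks from the innermost frame outwards and reports the first match.
// Returns false when no frame in the list can catch the exception.
bool FindCatchSketch(const std::vector<SketchFrame>& frames_innermost_first,
                     const std::string& exception_type,
                     SketchCatchTarget* out) {
  assert(out != nullptr);
  for (const SketchFrame& frame : frames_innermost_first) {
    if (frame.catches == exception_type) {
      out->sp = frame.sp;
      out->pc = frame.handler_pc;
      return true;
    }
  }
  return false;
}
```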

DoLongJump then jumps there.

void QuickExceptionHandler::DoLongJump(bool smash_caller_saves) {
  // Place context back on thread so it will be available when we continue.
  self_->ReleaseLongJumpContext(context_);
  context_->SetSP(reinterpret_cast<uintptr_t>(handler_quick_frame_));
  CHECK_NE(handler_quick_frame_pc_, 0u);
  context_->SetPC(handler_quick_frame_pc_);
  context_->SetArg0(handler_quick_arg0_);
  if (smash_caller_saves) {
    context_->SmashCallerSaves();
  }
  if (!is_deoptimization_ &&
      handler_method_header_ != nullptr &&
      handler_method_header_->IsNterpMethodHeader()) {
    context_->SetNterpDexPC(reinterpret_cast<uintptr_t>(
        GetHandlerMethod()->DexInstructions().Insns() + handler_dex_pc_));
  }
  context_->DoLongJump();
  UNREACHABLE();
}

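The frame address is stored in the SP field first, then the machine-code address in the PC field. If the frame is executed by the interpreter, the machine-code address points to a trampoline function, and the real bytecode address dex_pc is stored in the x22 field, to be picked up when the interpreter runs.

All fields are then written into the real registers, and a br instruction jumps to the catch block without ever looking back.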
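
A rough user-space analogy for this one-way jump is the standard setjmp/longjmp pair: a context is saved at the "catch" site, and a later call restores it without returning through the intermediate frames. This is only an analogy, not how ART implements the jump, and every name in the sketch is hypothetical.

```cpp
#include <cassert>
#include <csetjmp>

// Hypothetical globals standing in for the saved context and the exception.
std::jmp_buf g_catch_context_sketch;
volatile int g_pending_exception_sketch = 0;
volatile int g_deepest_frame_sketch = 0;

// Plays the role of the delivery step: control leaves here and never
// comes back to the caller.
[[noreturn]] void DeliverSketch(int exception_code) {
  g_pending_exception_sketch = exception_code;
  std::longjmp(g_catch_context_sketch, 1);
}

// A chain of nested calls; the third one "throws".
void NestedCallSketch(int depth) {
  g_deepest_frame_sketch = depth;
  if (depth == 3) {
    DeliverSketch(7);
  }
  NestedCallSketch(depth + 1);
}

// The frame holding the "catch block": setjmp records where to resume.
int RunWithCatchSketch() {
  if (setjmp(g_catch_context_sketch) == 0) {
    NestedCallSketch(1);
    return -1;  // not reached: the nested calls always end in DeliverSketch
  }
  return g_pending_exception_sketch;  // resumed here by longjmp
}
```

RunWithCatchSketch() returns the code passed to DeliverSketch, and none of the three nested calls ever returns normally, which mirrors the br at the end of art_quick_do_long_jump.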

ENTRY art_quick_do_long_jump
    // Load FPRs
    ldp d0, d1, [x1, #0]
    ldp d2, d3, [x1, #16]
    ldp d4, d5, [x1, #32]
    ldp d6, d7, [x1, #48]
    ldp d8, d9, [x1, #64]
    ldp d10, d11, [x1, #80]
    ldp d12, d13, [x1, #96]
    ldp d14, d15, [x1, #112]
    ldp d16, d17, [x1, #128]
    ldp d18, d19, [x1, #144]
    ldp d20, d21, [x1, #160]
    ldp d22, d23, [x1, #176]
    ldp d24, d25, [x1, #192]
    ldp d26, d27, [x1, #208]
    ldp d28, d29, [x1, #224]
    ldp d30, d31, [x1, #240]

    // Load GPRs. Delay loading x0, x1 because x0 is used as gprs_.
    ldp x2, x3, [x0, #16]
    ldp x4, x5, [x0, #32]
    ldp x6, x7, [x0, #48]
    ldp x8, x9, [x0, #64]
    ldp x10, x11, [x0, #80]
    ldp x12, x13, [x0, #96]
    ldp x14, x15, [x0, #112]
    // Do not load IP0 (x16) and IP1 (x17), these shall be clobbered below.
    // Don't load the platform register (x18) either.
    ldr      x19, [x0, #152]      // xSELF.
    ldp x20, x21, [x0, #160]      // For Baker RB, wMR (w20) is reloaded below.
    ldp x22, x23, [x0, #176]
    ldp x24, x25, [x0, #192]
    ldp x26, x27, [x0, #208]
    ldp x28, x29, [x0, #224]
    ldp x30, xIP0, [x0, #240]     // LR and SP, load SP to IP0.

    // Load PC to IP1, it's at the end (after the space for the unused XZR).
    ldr xIP1, [x0, #33*8]

    // Load x0, x1.
    ldp x0, x1, [x0, #0]

    // Set SP. Do not access fprs_ and gprs_ from now, they are below SP.
    mov sp, xIP0

    REFRESH_MARKING_REGISTER

    br  xIP1
END art_quick_do_long_jump
