转自:https://juejin.cn/post/6989592850170118158

前言

作为一个Android开发者,不管是App或者是System开发者,经常会遇到一些分析日志的场景,本文就通过一个场景介绍一下Android 系统触发Watdog重启时的分析思路。Watchdog俗称看门狗,Android如果一些系统服务发生异常,会触发Watchdog,导致系统重启。

场景介绍

QA报了一个系统重启的Bug,就是电视通过Hdmi连接盒子,在进入Hdmi页面时,系统奔溃了,既然是系统奔溃一般情况都是系统服务挂掉了,系统服务挂掉一遍会触发Watchdog.下面就是触发WatDog时的日志,一般查看Android Log中是否触发WatDog可以搜索Watchdog,或者GOODBYE关键字.

07-26 06:34:10.100 2112 2135 W Watchdog: *** Watchdog KILLING SYSTEM PROCESS: Blocked in handler on main thread (main)

07-26 06:34:10.105 2112 2135 W Watchdog: main annotated stack trace:

07-26 06:34:10.106 2112 2135 W Watchdog: at com.android.server.tv.TvInputHardwareManager.getHardwareList(TvInputHardwareManager.java:250)

07-26 06:34:10.106 2112 2135 W Watchdog: - waiting to lock <0x0c25fdf3> (a java.lang.Object)

07-26 06:34:10.107 2112 2135 W Watchdog: at com.android.server.tv.TvInputManagerService$InputServiceConnection.onServiceConnected(TvInputManagerService.java:2419)

07-26 06:34:10.107 2112 2135 W Watchdog: - locked <0x0c7f5ab0> (a java.lang.Object)

07-26 06:34:10.108 2112 2135 W Watchdog: at android.app.LoadedApk$ServiceDispatcher.doConnected(LoadedApk.java:1954)

07-26 06:34:10.108 2112 2135 W Watchdog: at android.app.LoadedApk$ServiceDispatcher$RunConnection.run(LoadedApk.java:1986)

07-26 06:34:10.109 2112 2135 W Watchdog: at android.os.Handler.handleCallback(Handler.java:938)

07-26 06:34:10.109 2112 2135 W Watchdog: at android.os.Handler.dispatchMessage(Handler.java:99)

07-26 06:34:10.109 2112 2135 W Watchdog: at android.os.Looper.loop(Looper.java:223)

07-26 06:34:10.110 2112 2135 W Watchdog: at com.android.server.SystemServer.run(SystemServer.java:669)

07-26 06:34:10.110 2112 2135 W Watchdog: at com.android.server.SystemServer.main(SystemServer.java:413)

07-26 06:34:10.111 2112 2135 W Watchdog: at java.lang.reflect.Method.invoke(Native Method)

07-26 06:34:10.111 2112 2135 W Watchdog: at com.android.internal.os.RuntimeInit$MethodAndArgsCaller.run(RuntimeInit.java:592)

07-26 06:34:10.111 2112 2135 W Watchdog: at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:1060)

07-26 06:34:10.112 2112 2135 W Watchdog: *** GOODBYE!

07-26 06:34:10.112 2112 2135 I Process : Sending signal. PID: 2112 SIG: 9

复制代码

上面TvInputHardwareManager.getHardwareList方法被调用的时候block住了,从下一行日志得知他block的原因是等待一个锁。

07-26 06:34:10.106 2112 2135 W Watchdog: at com.android.server.tv.TvInputHardwareManager.getHardwareList(TvInputHardwareManager.java:250)

07-26 06:34:10.106 2112 2135 W Watchdog: - waiting to lock <0x0c25fdf3> (a java.lang.Object)

复制代码

public List getHardwareList() {

synchronized (mLock) {

return Collections.unmodifiableList(mHardwareList);

}

}

复制代码

那这个锁被谁持有呢?现在只看Android log已经无能为力了,需要分析trace文件,做过app开发的应该都经常遇到anr,遇到anr一般都需要分析trace文件, 这个trace文件在/data/anr/目录下。我们知道触发ANR,dump trace文件的场景如下:

Service Timeout:比如前台服务在20s内未执行完成

BroadcastQueue Timeout:比如前台广播在10s内未执行完成

ContentProvider Timeout:内容提供者,在publish过超时10s

InputDispatching Timeout: 输入事件分发超时5s,包括按键和触摸事件

其实还有一种情况也会生成trace文件,那就是触发Framework的Watchdog的时候。看一下Watchdog.java的run方法,符合条件时,会触发dump trace的操作。

//frameworks/base/services/core/java/com/android/server/Watchdog.java

@Override

public void run() {

···

if (!fdLimitTriggered) {

final int waitState = evaluateCheckerCompletionLocked();

if (waitState == COMPLETED) {

// The monitors have returned; reset

waitedHalf = false;

continue;

} else if (waitState == WAITING) {

// still waiting but within their configured intervals; back off and recheck

continue;

} else if (waitState == WAITED_HALF) {

if (!waitedHalf) {

Slog.i(TAG, "WAITED_HALF");

// We've waited half the deadlock-detection interval. Pull a stack

// trace and wait another half.

ArrayList pids = new ArrayList<>(mInterestingJavaPids);

//dump trace文件

ActivityManagerService.dumpStackTraces(pids, null, null,

getInterestingNativePids(), null);

waitedHalf = true;

}

continue;

}

···

}

复制代码

触发dump trace时,日志里会有如下的日志:

07-26 06:33:59.485 2112 2135 I ActivityManager: Dumping to /data/anr/anr_2021-07-26-06-33-59-483

复制代码

分析trace文件

"main" prio=5 tid=1 Blocked

| group="main" sCount=1 dsCount=0 flags=1 obj=0x71684c28 self=0xe11c5410

| sysTid=2112 nice=-2 cgrp=default sched=0/0 handle=0xe9762470

| state=S schedstat=( 0 0 0 ) utm=377 stm=96 core=1 HZ=100

| stack=0xff11b000-0xff11d000 stackSize=8192KB

| held mutexes=

at com.android.server.tv.TvInputHardwareManager.getHardwareList(TvInputHardwareManager.java:250)

- waiting to lock <0x0c25fdf3> (a java.lang.Object) held by thread 100

at com.android.server.tv.TvInputManagerService$InputServiceConnection.onServiceConnected(TvInputManagerService.java:2419)

- locked <0x0c7f5ab0> (a java.lang.Object)

at android.app.LoadedApk$ServiceDispatcher.doConnected(LoadedApk.java:1954)

at android.app.LoadedApk$ServiceDispatcher$RunConnection.run(LoadedApk.java:1986)

at android.os.Handler.handleCallback(Handler.java:938)

at android.os.Handler.dispatchMessage(Handler.java:99)

at android.os.Looper.loop(Looper.java:223)

at com.android.server.SystemServer.run(SystemServer.java:669)

at com.android.server.SystemServer.main(SystemServer.java:413)

at java.lang.reflect.Method.invoke(Native method)

at com.android.internal.os.RuntimeInit$MethodAndArgsCaller.run(RuntimeInit.java:592)

at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:1060)

复制代码

这个堆栈我们在Android log里已经看到了,block的原因是等待0x0c25fdf3这个锁,那我们就搜一下0x0c25fdf3这个锁现在被谁持有.

"Binder:2112_6" prio=5 tid=100 Blocked

| group="main" sCount=1 dsCount=0 flags=1 obj=0x135c0808 self=0xb9698810

| sysTid=2491 nice=0 cgrp=default sched=0/0 handle=0xb478b1c0

| state=S schedstat=( 0 0 0 ) utm=32 stm=10 core=1 HZ=100

| stack=0xb4690000-0xb4692000 stackSize=1008KB

| held mutexes=

at com.android.server.tv.TvInputHardwareManager$TvInputHardwareImpl.release(TvInputHardwareManager.java:847)

- waiting to lock <0x0234b2e5> (a java.lang.Object) held by thread 11

at com.android.server.tv.TvInputHardwareManager$Connection.resetLocked(TvInputHardwareManager.java:652)

at com.android.server.tv.TvInputHardwareManager$Connection.binderDied(TvInputHardwareManager.java:722)

- locked <0x0c25fdf3> (a java.lang.Object)

at android.os.IBinder$DeathRecipient.binderDied(IBinder.java:305)

at android.os.BinderProxy.sendDeathNotice(BinderProxy.java:654)

复制代码

从上面的trace发现0x0c25fdf3被Connection.binderDied这个方法持有的,而该方法被block原因是它调用了TvInputHardwareImpl.release方法,这个TvInputHardwareImpl.release方法被另外一个锁0x0234b2e5block住了,那么0x0234b2e5锁被谁持有呢?

"Binder:2112_2" prio=5 tid=11 Native

| group="main" sCount=1 dsCount=0 flags=1 obj=0x131c03c8 self=0xe11c6210

| sysTid=2127 nice=0 cgrp=default sched=0/0 handle=0xbd2401c0

| state=S schedstat=( 0 0 0 ) utm=73 stm=20 core=3 HZ=100

| stack=0xbd145000-0xbd147000 stackSize=1008KB

| held mutexes=

native: #00 pc 000715e4 /apex/com.android.runtime/lib/bionic/libc.so (__ioctl+8)

native: #01 pc 0003f2a3 /apex/com.android.runtime/lib/bionic/libc.so (ioctl+30)

native: #02 pc 0006d78f /system/lib/libhidlbase.so (android::hardware::IPCThreadState::transact(int, unsigned int, android::hardware::Parcel const&, android::hardware::Parcel*, unsigned int)+438)

native: #03 pc 0006a265 /system/lib/libhidlbase.so (android::hardware::BpHwBinder::transact(unsigned int, android::hardware::Parcel const&, android::hardware::Parcel*, unsigned int, std::__1::function)+36)

native: #04 pc 0000c3f9 /system/lib/android.hardware.tv.input@1.0.so (android::hardware::tv::input::V1_0::BpHwTvInput::_hidl_closeStream(android::hardware::IInterface*, android::hardware::details::HidlInstrumentor*, int, int)+228)

native: #05 pc 0000c7c5 /system/lib/android.hardware.tv.input@1.0.so (android::hardware::tv::input::V1_0::BpHwTvInput::closeStream(int, int)+36)

native: #06 pc 00069801 /system/lib/libandroid_servers.so (android::JTvInputHal::removeStream(int, int)+440)

at com.android.server.tv.TvInputHal.nativeRemoveStream(Native method)

at com.android.server.tv.TvInputHal.removeStream(TvInputHal.java:111)

- locked <0x08f771dc> (a java.lang.Object)

at com.android.server.tv.TvInputHardwareManager$TvInputHardwareImpl.setSurface(TvInputHardwareManager.java:872)

- locked <0x0234b2e5> (a java.lang.Object)

at android.media.tv.ITvInputHardware$Stub.onTransact(ITvInputHardware.java:136)

at android.os.Binder.execTransactInternal(Binder.java:1154)

at android.os.Binder.execTransact(Binder.java:1123)

复制代码

这个0x0234b2e5锁被TvInputHardwareImpl.setSurface方法持有了,那么他被block的原因是:android.hardware.tv.input@1.0.so中有个_hidl_closeStream方法被block住了。

native: #04 pc 0000c3f9 /system/lib/android.hardware.tv.input@1.0.so (android::hardware::tv::input::V1_0::BpHwTvInput::_hidl_closeStream(android::hardware::IInterface*, android::hardware::details::HidlInstrumentor*, int, int)+228)

复制代码

由于再往下追会涉及到公司的私有代码了,这里就不在继续往下分析了,但是上面的思路已经很清楚了,结合trace中的堆栈和源码定位问题应该不成问题。

总结

不管是普通的anr问题或者是这种触发Watchdog的重启,分析方式都类似的,就是先查看Android log,然后在分析trace文件,一步一步的往下挖掘。在上面的介绍中提到的WatDog,其实是Android Framework中一个很重要的机制,后面将出文章专门介绍一下其实现,^v^。

作者:CodingDev链接:https://juejin.cn/post/6989592850170118158来源:稀土掘金著作权归作者所有。商业转载请联系作者获得授权,非商业转载请注明出处。

查看原文