A Brief Look at How Online ANR Monitoring Works on Android

2019-10-22 18:17:10

Source: reposted

Contributed by: a site user

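Watchdog in Android

In Android, Watchdog monitors key system services for deadlocks. If it finds one, it kills the process, which restarts SystemServer.
Watchdog is initialized in SystemServer, so it runs inside the SystemServer process.
Watchdog runs on its own thread and starts a check after every 30-second wait. If the system goes to sleep, the wait is suspended too, and monitoring resumes only after the system wakes up.
An object that wants to be monitored implements the monitor() method of the Watchdog.Monitor interface and then calls addMonitor().
Besides thread deadlocks, the framework's Watchdog can also detect stalled threads: addMonitor() covers deadlocks, while addThread() covers stalls.

Watchdog's deadlock monitoring

To have Watchdog monitor it for deadlocks, an object implements the monitor() method of Watchdog.Monitor and then calls addMonitor(). ActivityManagerService is one example: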

    public final class ActivityManagerService extends ActivityManagerNative
            implements Watchdog.Monitor, BatteryStatsImpl.BatteryCallback {

        public ActivityManagerService(Context systemContext) {
            Watchdog.getInstance().addMonitor(this);
        }

        public void monitor() {
            synchronized (this) { }
        }
        // ...
    }

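The above is the Watchdog-related code extracted from ActivityManagerService, which has Watchdog monitor the object's own lock. The monitoring itself is implemented as follows. Watchdog is a Thread; once started, it loops forever, waiting 30 seconds and then running a check: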

    public void addMonitor(Monitor monitor) {
        synchronized (this) {
            if (isAlive()) {
                throw new RuntimeException("Monitors can't be added once the Watchdog is running");
            }
            mMonitorChecker.addMonitor(monitor);
        }
    }

    @Override
    public void run() {
        boolean waitedHalf = false;
        while (true) {
            final ArrayList<HandlerChecker> blockedCheckers;
            final String subject;
            final boolean allowRestart;
            int debuggerWasConnected = 0;
            synchronized (this) {
                long timeout = CHECK_INTERVAL;
                // Make sure we (re)spin the checkers that have become idle within
                // this wait-and-check interval
                for (int i=0; i<mHandlerCheckers.size(); i++) {
                    HandlerChecker hc = mHandlerCheckers.get(i);
                    hc.scheduleCheckLocked();
                }

                if (debuggerWasConnected > 0) {
                    debuggerWasConnected--;
                }

                // NOTE: We use uptimeMillis() here because we do not want to increment the time we
                // wait while asleep. If the device is asleep then the thing that we are waiting
                // to timeout on is asleep as well and won't have a chance to run, causing a false
                // positive on when to kill things.
                long start = SystemClock.uptimeMillis();
                while (timeout > 0) {
                    if (Debug.isDebuggerConnected()) {
                        debuggerWasConnected = 2;
                    }
                    try {
                        wait(timeout);
                    } catch (InterruptedException e) {
                        Log.wtf(TAG, e);
                    }
                    if (Debug.isDebuggerConnected()) {
                        debuggerWasConnected = 2;
                    }
                    timeout = CHECK_INTERVAL - (SystemClock.uptimeMillis() - start);
                }

                final int waitState = evaluateCheckerCompletionLocked();
                if (waitState == COMPLETED) {
                    // The monitors have returned; reset
                    waitedHalf = false;
                    continue;
                } else if (waitState == WAITING) {
                    // still waiting but within their configured intervals; back off and recheck
                    continue;
                } else if (waitState == WAITED_HALF) {
                    if (!waitedHalf) {
                        // We've waited half the deadlock-detection interval. Pull a stack
                        // trace and wait another half.
                        ArrayList<Integer> pids = new ArrayList<Integer>();
                        pids.add(Process.myPid());
                        ActivityManagerService.dumpStackTraces(true, pids, null, null,
                                NATIVE_STACKS_OF_INTEREST);
                        waitedHalf = true;
                    }
                    continue;
                }

                // something is overdue!
                blockedCheckers = getBlockedCheckersLocked();
                subject = describeCheckersLocked(blockedCheckers);
                allowRestart = mAllowRestart;
            }

            // If we got here, that means that the system is most likely hung.
            // First collect stack traces from all threads of the system process.
            // Then kill this process so that the system will restart.
            EventLog.writeEvent(EventLogTags.WATCHDOG, subject);

            ArrayList<Integer> pids = new ArrayList<Integer>();
            pids.add(Process.myPid());
            if (mPhonePid > 0) pids.add(mPhonePid);
            // Pass !waitedHalf so that just in case we somehow wind up here without having
            // dumped the halfway stacks, we properly re-initialize the trace file.
            final File stack = ActivityManagerService.dumpStackTraces(
                    !waitedHalf, pids, null, null, NATIVE_STACKS_OF_INTEREST);

            // Give some extra time to make sure the stack traces get written.
            // The system's been hanging for a minute, another second or two won't hurt much.
            SystemClock.sleep(2000);

            // Pull our own kernel thread stacks as well if we're configured for that
            if (RECORD_KERNEL_THREADS) {
                dumpKernelStackTraces();
            }

            String tracesPath = SystemProperties.get("dalvik.vm.stack-trace-file", null);
            String traceFileNameAmendment = "_SystemServer_WDT" + mTraceDateFormat.format(new Date());

            if (tracesPath != null && tracesPath.length() != 0) {
                File traceRenameFile = new File(tracesPath);
                String newTracesPath;
                int lpos = tracesPath.lastIndexOf (".");
                if (-1 != lpos)
                    newTracesPath = tracesPath.substring (0, lpos) + traceFileNameAmendment + tracesPath.substring (lpos);
                else
                    newTracesPath = tracesPath + traceFileNameAmendment;
                traceRenameFile.renameTo(new File(newTracesPath));
                tracesPath = newTracesPath;
            }

            final File newFd = new File(tracesPath);

            // Try to add the error to the dropbox, but assuming that the ActivityManager
            // itself may be deadlocked. (which has happened, causing this statement to
            // deadlock and the watchdog as a whole to be ineffective)
            Thread dropboxThread = new Thread("watchdogWriteToDropbox") {
                    public void run() {
                        mActivity.addErrorToDropBox(
                                "watchdog", null, "system_server", null, null,
                                subject, null, newFd, null);
                    }
                };
            dropboxThread.start();
            try {
                dropboxThread.join(2000); // wait up to 2 seconds for it to return.
            } catch (InterruptedException ignored) {}

            // At times, when user space watchdog traces don't give an indication on
            // which component held a lock, because of which other threads are blocked,
            // (thereby causing Watchdog), crash the device to analyze RAM dumps
            boolean crashOnWatchdog = SystemProperties
                    .getBoolean("persist.sys.crashOnWatchdog", false);
            if (crashOnWatchdog) {
                // Trigger the kernel to dump all blocked threads, and backtraces
                // on all CPUs to the kernel log
                Slog.e(TAG, "Triggering SysRq for system_server watchdog");
                doSysRq('w');
                doSysRq('l');

                // wait until the above blocked threads be dumped into kernel log
                SystemClock.sleep(3000);

                // now try to crash the target
                doSysRq('c');
            }

            IActivityController controller;
            synchronized (this) {
                controller = mController;
            }
            if (controller != null) {
                Slog.i(TAG, "Reporting stuck state to activity controller");
                try {
                    Binder.setDumpDisabled("Service dumps disabled due to hung system process.");
                    // 1 = keep waiting, -1 = kill system
                    int res = controller.systemNotResponding(subject);
                    if (res >= 0) {
                        Slog.i(TAG, "Activity controller requested to coninue to wait");
                        waitedHalf = false;
                        continue;
                    }
                } catch (RemoteException e) {
                }
            }

            // Only kill the process if the debugger is not attached.
            if (Debug.isDebuggerConnected()) {
                debuggerWasConnected = 2;
            }
            if (debuggerWasConnected >= 2) {
                Slog.w(TAG, "Debugger connected: Watchdog is *not* killing the system process");
            } else if (debuggerWasConnected > 0) {
                Slog.w(TAG, "Debugger was connected: Watchdog is *not* killing the system process");
            } else if (!allowRestart) {
                Slog.w(TAG, "Restart not allowed: Watchdog is *not* killing the system process");
            } else {
                Slog.w(TAG, "*** WATCHDOG KILLING SYSTEM PROCESS: " + subject);
                for (int i=0; i<blockedCheckers.size(); i++) {
                    Slog.w(TAG, blockedCheckers.get(i).getName() + " stack trace:");
                    StackTraceElement[] stackTrace
                            = blockedCheckers.get(i).getThread().getStackTrace();
                    for (StackTraceElement element: stackTrace) {
                        Slog.w(TAG, "  at " + element);
                    }
                }
                Slog.w(TAG, "*** GOODBYE!");
                Process.killProcess(Process.myPid());
                System.exit(10);
            }

            waitedHalf = false;
        }
    }

First, ActivityManagerService calls addMonitor() to add itself to Watchdog's mMonitorChecker. This is a member field of Watchdog that is created in the Watchdog constructor and added to mHandlerCheckers, an ArrayList<HandlerChecker> holding everything being monitored. mMonitorChecker is an instance of HandlerChecker:

    public final class HandlerChecker implements Runnable {
        private final Handler mHandler;
        private final String mName;
        private final long mWaitMax;
        private final ArrayList<Monitor> mMonitors = new ArrayList<Monitor>();
        private boolean mCompleted;
        private Monitor mCurrentMonitor;
        private long mStartTime;

        HandlerChecker(Handler handler, String name, long waitMaxMillis) {
            mHandler = handler;
            mName = name;
            mWaitMax = waitMaxMillis;
            mCompleted = true;
        }

        public void addMonitor(Monitor monitor) {
            mMonitors.add(monitor);
        }

        public void scheduleCheckLocked() {
            if (mMonitors.size() == 0 && mHandler.getLooper().getQueue().isPolling()) {
                // If the target looper has recently been polling, then
                // there is no reason to enqueue our checker on it since that
                // is as good as it not being deadlocked. This avoid having
                // to do a context switch to check the thread. Note that we
                // only do this if mCheckReboot is false and we have no
                // monitors, since those would need to be executed at this point.
                mCompleted = true;
                return;
            }

            if (!mCompleted) {
                // we already have a check in flight, so no need
                return;
            }

            mCompleted = false;
            mCurrentMonitor = null;
            mStartTime = SystemClock.uptimeMillis();
            mHandler.postAtFrontOfQueue(this);
        }

        public boolean isOverdueLocked() {
            return (!mCompleted) && (SystemClock.uptimeMillis() > mStartTime + mWaitMax);
        }

        public int getCompletionStateLocked() {
            if (mCompleted) {
                return COMPLETED;
            } else {
                long latency = SystemClock.uptimeMillis() - mStartTime;
                if (latency < mWaitMax/2) {
                    return WAITING;
                } else if (latency < mWaitMax) {
                    return WAITED_HALF;
                }
            }
            return OVERDUE;
        }

        public Thread getThread() {
            return mHandler.getLooper().getThread();
        }

        public String getName() {
            return mName;
        }

        public String describeBlockedStateLocked() {
            if (mCurrentMonitor == null) {
                return "Blocked in handler on " + mName + " (" + getThread().getName() + ")";
            } else {
                return "Blocked in monitor " + mCurrentMonitor.getClass().getName()
                        + " on " + mName + " (" + getThread().getName() + ")";
            }
        }

        @Override
        public void run() {
            final int size = mMonitors.size();
            for (int i = 0 ; i < size ; i++) {
                synchronized (Watchdog.this) {
                    mCurrentMonitor = mMonitors.get(i);
                }
                mCurrentMonitor.monitor();
            }

            synchronized (Watchdog.this) {
                mCompleted = true;
                mCurrentMonitor = null;
            }
        }
    }

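mMonitors in HandlerChecker is also a list of monitored objects: it holds every object that implements Watchdog.Monitor. Targets that do not implement Watchdog.Monitor each get a HandlerChecker of their own, which is added to Watchdog's mHandlerCheckers list. When the Watchdog thread starts a round of monitoring, it iterates over mHandlerCheckers and calls scheduleCheckLocked() on each HandlerChecker: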

    public void scheduleCheckLocked() {
        if (mMonitors.size() == 0 && mHandler.getLooper().getQueue().isPolling()) {
            // If the target looper has recently been polling, then
            // there is no reason to enqueue our checker on it since that
            // is as good as it not being deadlocked. This avoid having
            // to do a context switch to check the thread. Note that we
            // only do this if mCheckReboot is false and we have no
            // monitors, since those would need to be executed at this point.
            mCompleted = true;
            return;
        }

        if (!mCompleted) {
            // we already have a check in flight, so no need
            return;
        }

        mCompleted = false;
        mCurrentMonitor = null;
        mStartTime = SystemClock.uptimeMillis();
        mHandler.postAtFrontOfQueue(this);
    }

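HandlerChecker has a few important fields. mCompleted records whether the current check finished within the allotted time, mStartTime records when the current check started, and mHandler is the Handler of the monitored thread. scheduleCheckLocked() starts a check of that thread, so naturally it sets mCompleted to false and records the start time. The principle is visible here: Watchdog posts a task, the HandlerChecker itself, to the message queue of the monitored thread's Handler, and the task runs on that thread when the queue reaches it. If the queue is stuck behind some other task, the HandlerChecker cannot run in time, and once the allotted time has passed the monitored thread is considered hung (whether because of a deadlock or because of a long-running task). The HandlerChecker task itself looks like this: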

    @Override
    public void run() {
        final int size = mMonitors.size();
        for (int i = 0 ; i < size ; i++) {
            synchronized (Watchdog.this) {
                mCurrentMonitor = mMonitors.get(i);
            }
            mCurrentMonitor.monitor();
        }

        synchronized (Watchdog.this) {
            mCompleted = true;
            mCurrentMonitor = null;
        }
    }

It first iterates over the monitored objects in mMonitors and calls monitor() on each. A monitored object usually implements monitor() like this:

    public void monitor() {
        synchronized (this) { }
    }

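That is, each monitor probes one particular lock: monitor() returns only once that lock can be acquired. When every monitor has returned, the check is complete and mCompleted is set to true. Once scheduleCheckLocked() has been called on every checker, Watchdog starts to wait, and it must wait the full 30 seconds. There is an implementation detail here: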

    long start = SystemClock.uptimeMillis();
    while (timeout > 0) {
        if (Debug.isDebuggerConnected()) {
            debuggerWasConnected = 2;
        }
        try {
            wait(timeout);
        } catch (InterruptedException e) {
            Log.wtf(TAG, e);
        }
        if (Debug.isDebuggerConnected()) {
            debuggerWasConnected = 2;
        }
        timeout = CHECK_INTERVAL - (SystemClock.uptimeMillis() - start);
    }

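When I first read this code, I noticed that SystemClock.uptimeMillis() does not advance while the device is asleep, so I guessed the loop was there to handle device sleep. Suppose Watchdog is 15 seconds into its wait when the device goes to sleep and stays asleep for 30 minutes: would wait() return as soon as the device woke up? The answer is that normally the wait simply continues and returns only after the remaining 15 seconds have also elapsed. That left me puzzled, so I looked at the documentation for Object.wait() and finally found this explanation: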

A thread can also wake up without being notified, interrupted, or timing out, a so-called spurious wakeup. While this will rarely occur in practice, applications must guard against it by testing for the condition that should have caused the thread to be awakened, and continuing to wait if the condition is not satisfied. In other words, waits should always occur in loops, like this one:

    synchronized (obj) {
        while (<condition does not hold>)
            obj.wait(timeout);
        ... // Perform action appropriate to condition
    }
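To make the guarded-wait pattern concrete, here is a minimal plain-Java sketch of a wait loop that always waits out its full interval, in the spirit of the Watchdog loop quoted earlier. The class and method names are hypothetical, and System.nanoTime() is only a stand-in for Android's SystemClock.uptimeMillis(), which is not available outside Android and handles device sleep differently.

```java
final class DeadlineWait {
    /**
     * Waits on lock until at least intervalMillis has elapsed, going back to
     * wait for the remainder after any early (notified or spurious) wakeup.
     * Returns how many times wait() returned; stops early only if interrupted.
     */
    static int waitFullInterval(Object lock, long intervalMillis) {
        int wakeups = 0;
        synchronized (lock) {
            long start = System.nanoTime();   // stand-in for SystemClock.uptimeMillis()
            long remaining = intervalMillis;
            while (remaining > 0) {
                try {
                    lock.wait(remaining);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    break;
                }
                wakeups++;
                // Recompute what is left instead of trusting that wait() slept the full time.
                remaining = intervalMillis - (System.nanoTime() - start) / 1_000_000;
            }
        }
        return wakeups;
    }

    public static void main(String[] args) {
        Object lock = new Object();
        // A helper thread wakes the waiter early to simulate an unexpected wakeup.
        Thread poker = new Thread(() -> {
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                return;
            }
            synchronized (lock) {
                lock.notifyAll();
            }
        });
        poker.setDaemon(true);
        long t0 = System.nanoTime();
        poker.start();
        int wakeups = waitFullInterval(lock, 200);
        long waitedMillis = (System.nanoTime() - t0) / 1_000_000;
        System.out.println("wakeups=" + wakeups + ", waited the full interval: " + (waitedMillis >= 200));
    }
}
```

The point of the sketch is the recomputation of the remaining time after every wakeup: an early return from wait() shortens nothing, because the loop goes straight back to waiting for whatever is left.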

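Roughly, the documentation says that a waiting thread can be woken by notify() or notifyAll(), by an interrupt, or by its timeout expiring, and that it can also wake up spuriously. Spurious wakeups are rare in practice, but a program has to guard against them by re-checking the wake-up condition and going back to waiting if the condition does not hold. Details like this deserve attention in real development. The problem may not occur 99% of the time, but in the remaining 1% it is very obscure and hard to track down: a thread that was waiting normally suddenly wakes up for no apparent reason, and you may even start to doubt what you know about how wait() behaves while the device is asleep.

Enough of the digression; back to the Watchdog mechanism. After waiting 30 seconds, Watchdog calls evaluateCheckerCompletionLocked() to check on the monitored objects: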

    private int evaluateCheckerCompletionLocked() {
        int state = COMPLETED;
        for (int i=0; i<mHandlerCheckers.size(); i++) {
            HandlerChecker hc = mHandlerCheckers.get(i);
            state = Math.max(state, hc.getCompletionStateLocked());
        }
        return state;
    }

It calls getCompletionStateLocked() on each HandlerChecker to get that checker's state:

    public int getCompletionStateLocked() {
        if (mCompleted) {
            return COMPLETED;
        } else {
            long latency = SystemClock.uptimeMillis() - mStartTime;
            if (latency < mWaitMax/2) {
                return WAITING;
            } else if (latency < mWaitMax) {
                return WAITED_HALF;
            }
        }
        return OVERDUE;
    }

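Here we can see that mCompleted is what distinguishes the state before the 30-second wait from the state after it. Before the wait, a HandlerChecker task is posted to the message queue of the monitored thread's Handler and mCompleted is set to false. After the wait, mCompleted == true means the task ran in time, and mCompleted == false means it still has not finished. In that case Watchdog looks at how long the task has been pending, measured in awake time. Under 30 seconds, it keeps waiting. Between 30 and 60 seconds, it dumps some stack traces and keeps waiting. In both cases the pending task is not posted again, because scheduleCheckLocked() returns early while mCompleted is false. Past 60 seconds, the situation is treated as abnormal (either a deadlock or a task that runs too long): Watchdog collects stack traces, kernel information, CPU information and so on, generates a trace file, saves the information to the dropbox directory, and kills the process. That is the end of the monitoring cycle.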
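The classification described above can be sketched in plain Java as follows. The class name and the worstOf() helper are hypothetical, and the numeric values of the four constants are an assumption; what matters for the Math.max() aggregation in the quoted code is only that a larger value means a worse state.

```java
final class CheckerStates {
    // Assumed values, ordered so that a larger number is a worse state.
    static final int COMPLETED = 0;
    static final int WAITING = 1;
    static final int WAITED_HALF = 2;
    static final int OVERDUE = 3;

    /** Classifies one probe from its completion flag and how long it has been pending. */
    static int stateOf(boolean completed, long latencyMillis, long waitMaxMillis) {
        if (completed) {
            return COMPLETED;
        }
        if (latencyMillis < waitMaxMillis / 2) {
            return WAITING;        // first half of the budget: just keep waiting
        }
        if (latencyMillis < waitMaxMillis) {
            return WAITED_HALF;    // second half: dump stacks once, keep waiting
        }
        return OVERDUE;            // budget used up: treat the thread as hung
    }

    /** The overall state is the worst state reported by any probe. */
    static int worstOf(int... states) {
        int worst = COMPLETED;
        for (int s : states) {
            worst = Math.max(worst, s);
        }
        return worst;
    }

    public static void main(String[] args) {
        long waitMax = 60_000;  // the 60-second budget discussed in the text
        int a = stateOf(true, 70_000, waitMax);
        int b = stateOf(false, 10_000, waitMax);
        int c = stateOf(false, 45_000, waitMax);
        System.out.println("states: " + a + ", " + b + ", " + c + " -> overall " + worstOf(a, b, c));
    }
}
```

Taking the maximum means a single slow thread is enough to move the whole round into WAITED_HALF or OVERDUE, regardless of how many other probes completed.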

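Watchdog's thread stall monitoring

As mentioned earlier, Watchdog monitors by posting a HandlerChecker to the message queue of a thread's Handler, and the objects monitored for deadlock are kept in the HandlerChecker's mMonitors list. Every external call to addMonitor() therefore ends up adding to the list inside Watchdog's mMonitorChecker field, so mMonitorChecker is responsible for all deadlock monitoring. To monitor threads for long-running tasks, Watchdog provides addThread():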

    public void addThread(Handler thread) {
        addThread(thread, DEFAULT_TIMEOUT);
    }

    public void addThread(Handler thread, long timeoutMillis) {
        synchronized (this) {
            if (isAlive()) {
                throw new RuntimeException("Threads can't be added once the Watchdog is running");
            }
            final String name = thread.getLooper().getThread().getName();
            mHandlerCheckers.add(new HandlerChecker(thread, name, timeoutMillis));
        }
    }

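addThread() creates a new HandlerChecker that monitors the thread for long-running tasks. Its mMonitors list is empty, so when the task runs it calls no monitor() methods and simply sets the mCompleted flag. In other words, the HandlerChecker is what does the monitoring, and it covers both deadlocks and long-running tasks: with Monitor objects it watches for both, and without them it only watches for stalls caused by long-running tasks on the thread.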
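The difference between a checker with monitors and one without can be tried out in plain Java. This is only a sketch under stated assumptions: a single-thread executor stands in for the monitored thread and its Handler, and ProbeSketch, Checker, probe() and holdLock() are hypothetical names, not framework APIs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class ProbeSketch {
    /** Stand-in for Watchdog.Monitor. */
    interface Monitor { void monitor(); }

    /** Stand-in for HandlerChecker: a probe task posted to the watched thread. */
    static final class Checker implements Runnable {
        private final List<Monitor> monitors = new ArrayList<>();
        private volatile boolean completed = true;

        void addMonitor(Monitor m) { monitors.add(m); }
        boolean isCompleted() { return completed; }

        void schedule(ExecutorService watched) {
            completed = false;
            watched.execute(this);
        }

        @Override
        public void run() {
            for (Monitor m : monitors) {
                m.monitor();   // blocks here while a monitored lock is held elsewhere
            }
            completed = true;  // with no monitors, getting to run at all is the whole check
        }
    }

    /** A single daemon worker thread; its task queue plays the role of the MessageQueue. */
    static ExecutorService newWatchedThread() {
        return Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "watched");
            t.setDaemon(true);
            return t;
        });
    }

    /** Posts the probe, waits, then reports whether the probe finished in time. */
    static boolean probe(ExecutorService watched, Checker checker, long waitMillis) {
        checker.schedule(watched);
        sleepQuietly(waitMillis);
        return checker.isCompleted();
    }

    /** Starts a daemon thread that holds lock for holdMillis; returns once the lock is held. */
    static void holdLock(Object lock, long holdMillis) {
        CountDownLatch held = new CountDownLatch(1);
        Thread t = new Thread(() -> {
            synchronized (lock) {
                held.countDown();
                sleepQuietly(holdMillis);
            }
        }, "lock-holder");
        t.setDaemon(true);
        t.start();
        try {
            held.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    static void sleepQuietly(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        ExecutorService watched = newWatchedThread();
        System.out.println("no monitors, idle thread: " + probe(watched, new Checker(), 300));

        Object serviceLock = new Object();
        holdLock(serviceLock, 1500);  // simulate a lock that some other thread will not release
        Checker withMonitor = new Checker();
        withMonitor.addMonitor(() -> { synchronized (serviceLock) { } });
        System.out.println("monitor on a held lock: " + probe(watched, withMonitor, 300));
    }
}
```

On an otherwise idle machine the first probe should complete and the second should not: the watched thread itself is free, but the probe gets stuck inside monitor() because the lock is held elsewhere, which is exactly the case an empty-monitor checker cannot see.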

Watchdog monitoring flow

[Figure: Watchdog monitoring flow (image not preserved)]

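Having understood how Watchdog works, can we apply the same mechanism in our own projects to monitor important threads for deadlock in multithreaded code and to detect ANRs on the main thread as they happen? Certainly. In the framework, Watchdog's job is exactly to detect deadlocks and stalls in the main system services, ActivityManagerService for example. When something goes wrong, Watchdog kills the process so that it restarts, which lets important system services recover from this kind of problem. Watchdog is effectively a last line of defense: it dumps diagnostic information promptly and restores the process to a working state.

In an application, monitoring important threads for deadlock can work on the same principle as Watchdog.

Monitoring an application for ANRs can borrow from Watchdog, with a slightly different implementation. An ANR is raised after 5 seconds for an Activity, 10 seconds for a Broadcast, and 20 seconds for a Service, but all four component types actually run on the main thread. So the monitor can work like Watchdog, which waits 30 seconds and then starts a check: it uses the mCompleted flag to detect whether the task posted to the MessageQueue is stuck and has not run in time, and uses mStartTime to compute how long the task has been pending, which shows whether other tasks in the MessageQueue are taking too long. If that time exceeds 5 seconds, the queue contains a long-running task and there is a risk of an ANR, so the monitor should dump the thread stacks right away, save them, and upload them for analysis on the backend. Remember to measure time only while the device is awake: while the device is asleep the MessageQueue is paused anyway, and that is neither a deadlock nor a stall.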
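A rough plain-Java sketch of such an application-level check follows. It is not the linked demo's code. FakeMainThread, checkOnce() and the millisecond thresholds are hypothetical stand-ins chosen so the example runs on a desktop JVM; in an Android app the probe would presumably be posted with a Handler bound to Looper.getMainLooper(), the target would be that Looper's thread, and time would be measured with SystemClock.uptimeMillis().

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Executor;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicBoolean;

final class StallMonitorSketch {
    /** One thread draining a task queue: a stand-in for the main thread and its MessageQueue. */
    static final class FakeMainThread implements Executor {
        private final BlockingQueue<Runnable> queue = new LinkedBlockingQueue<>();
        final Thread thread = new Thread(this::loop, "fake-main");

        FakeMainThread() {
            thread.setDaemon(true);
            thread.start();
        }

        private void loop() {
            try {
                while (true) {
                    queue.take().run();
                }
            } catch (InterruptedException e) {
                // interrupted: stop looping
            }
        }

        @Override
        public void execute(Runnable task) {
            queue.add(task);
        }
    }

    /**
     * Posts one probe and re-checks it every checkIntervalMillis.
     * Returns null if the probe ran before thresholdMillis of pending time,
     * otherwise the target thread's current stack trace.
     */
    static StackTraceElement[] checkOnce(Executor post, Thread target,
                                         long thresholdMillis, long checkIntervalMillis) {
        AtomicBoolean completed = new AtomicBoolean(false);  // plays the role of mCompleted
        long start = System.nanoTime();                      // plays the role of mStartTime
        post.execute(() -> completed.set(true));             // the probe task
        while (true) {
            sleepQuietly(checkIntervalMillis);
            if (completed.get()) {
                return null;
            }
            long pendingMillis = (System.nanoTime() - start) / 1_000_000;
            if (pendingMillis >= thresholdMillis) {
                return target.getStackTrace();  // capture what the stuck thread is doing
            }
        }
    }

    static void sleepQuietly(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        FakeMainThread fake = new FakeMainThread();
        System.out.println("idle thread stalled: " + (checkOnce(fake, fake.thread, 200, 20) != null));

        fake.execute(() -> sleepQuietly(600));  // a long task ahead of the probe in the queue
        StackTraceElement[] stack = checkOnce(fake, fake.thread, 200, 20);
        System.out.println("busy thread stalled: " + (stack != null));
        if (stack != null) {
            for (StackTraceElement frame : stack) {
                System.out.println("  at " + frame);
            }
        }
    }
}
```

The thresholds here are a few hundred milliseconds only to keep the demonstration short; the text's 5-second threshold would be passed as thresholdMillis in a real monitor.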

ANR online monitoring based on the Watchdog mechanism: implementation and demo

https://github.com/liuhongda/anrmonitor/tree/master/anrmonitor

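Summary of the Watchdog mechanism

Every thread can have a Looper, and every Looper has a MessageQueue, so posting a probe task to the MessageQueue and checking whether it runs in time detects stalled tasks on that thread. The precondition is that the thread has created a Looper.

Watchdog must run on its own dedicated thread so that it can monitor other threads without the two interfering with each other.

ANR monitoring built on the Watchdog mechanism may not be 100% accurate. Take a 5-second ANR threshold. A long-running task may finish just short of 5 seconds, and the probe task then starts to run; while the probe is still running, the Watchdog thread's wait may expire, find the probe unfinished, and report an ANR. That report is inaccurate. In the opposite case, a 5-second ANR has already occurred but the Watchdog thread is still waiting, so the ANR and the wait are out of step: by the time the Watchdog thread starts its next wait, the ANR is over and the main thread may be back to normal, and the ANR goes unrecorded. An ANR is therefore reliably caught and recorded only when the stall lasts at least twice as long as the Watchdog thread's wait, which for a 5-second ANR means a wait of 2.5 seconds. In practice that is a bit too frequent: if the device does not sleep, Watchdog runs every 2.5 seconds, which may cost battery.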
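The factor-of-two argument can be checked with a little arithmetic. The sketch below uses a simplified model that is an assumption of this example, not something taken from the framework: the monitor posts a probe at every multiple of the wait interval and inspects it one interval later, and a stall counts as detected only if some probe is both posted and inspected while the stall is still in progress. The class and method names are hypothetical.

```java
final class SamplingWindow {
    /**
     * True if a stall covering [stallStart, stallStart + stallMillis) is caught:
     * the first probe posted at or after the stall begins must be inspected
     * (one interval later) before the stall ends.
     */
    static boolean detected(long stallStart, long stallMillis, long intervalMillis) {
        long firstPost = ((stallStart + intervalMillis - 1) / intervalMillis) * intervalMillis;
        long inspectedAt = firstPost + intervalMillis;
        return inspectedAt < stallStart + stallMillis;
    }

    /** True if the stall is caught no matter where in the cycle it begins. */
    static boolean alwaysDetected(long stallMillis, long intervalMillis) {
        for (long offset = 0; offset < intervalMillis; offset++) {
            if (!detected(offset, stallMillis, intervalMillis)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        long stall = 5_000;  // the 5-second ANR from the text
        System.out.println("interval 2500 ms, every offset caught: " + alwaysDetected(stall, 2_500));
        System.out.println("interval 3000 ms, every offset caught: " + alwaysDetected(stall, 3_000));
        System.out.println("interval 3000 ms, stall starting at 1 ms caught: " + detected(1, stall, 3_000));
    }
}
```

Under this model a 2.5-second interval catches a 5-second stall at every starting offset, while a 3-second interval misses stalls that begin just after a probe was posted, which matches the reasoning in the paragraph above.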
