Commit 00fa8d35 authored May 30, 2019 by David Gross
Partially recover from a driver crash

If a driver crashes, every object associated with that driver becomes
"dead", and any method invocation on such an object fails with a
transport error.  In the NNAPI, this is a problem for IDevice and
IPreparedModel objects.  Without some mechanism to recover from a
driver crash, all further uses of an IDevice or IPreparedModel will
fail -- e.g., it's impossible to execute an already-compiled model,
and it's impossible to create a new compiled model.  The only way to
recover from this is to restart the application.

This fix addresses the first part of this problem.  All references to
IDevice in the runtime go through VersionedIDevice, so it suffices to
replace the IDevice reference in a VersionedIDevice when the IDevice
dies.  Therefore, it is now possible to create a new compiled model
after a driver crash (the crash will appear to be a transient error).
A previously-compiled model is still dead, and this fix does not
address that problem.

When we attempt to replace the IDevice, we use tryGetService() rather
than getService().  Instead of waiting for the driver to become
available, we recover it if it is available, and otherwise retain the
behavior prior to this change -- i.e., the attempt to use the IDevice
fails, and the runtime employs a fallback path if possible.  This way we
avoid a potentially long wait for the driver to come back up (up to 5
seconds, by default, per init start_period behavior).
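
For illustration only, the recovery scheme described above can be sketched
in standalone C++.  This is not the code from this change: FakeDevice,
DeviceWrapper, and the tryGet callback are hypothetical stand-ins for
IDevice, VersionedIDevice, and tryGetService(), and a dead driver is
modeled with a simple flag rather than a real transport error.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <mutex>
#include <utility>

// Hypothetical stand-in for an IDevice proxy; not the real HAL interface.
struct FakeDevice {
    bool alive = true;
    // Returns false to model a transport error from a dead driver.
    bool prepareModel() const { return alive; }
};

// Hypothetical stand-in for VersionedIDevice.
class DeviceWrapper {
public:
    // Models a non-blocking lookup in the spirit of tryGetService():
    // returns nullptr immediately if the driver is not registered.
    using TryGet = std::function<std::shared_ptr<FakeDevice>()>;

    DeviceWrapper(std::shared_ptr<FakeDevice> device, TryGet tryGet)
        : device_(std::move(device)), tryGet_(std::move(tryGet)) {
        assert(device_ != nullptr);
    }

    // Runs call(device).  If the call fails because the device is dead,
    // attempts one non-blocking recovery and retries once.
    bool recoverable(const std::function<bool(const FakeDevice&)>& call) {
        std::shared_ptr<FakeDevice> current = snapshot();
        if (call(*current)) return true;
        if (current->alive) return false;  // ordinary failure, not a crash

        {
            std::lock_guard<std::mutex> lock(mutex_);
            if (device_ == current) {
                // No other thread has recovered yet; try without waiting.
                std::shared_ptr<FakeDevice> fresh = tryGet_();
                if (fresh == nullptr) return false;  // caller may fall back
                device_ = std::move(fresh);
            }
            // Otherwise another thread already replaced the device.
            current = device_;
        }
        return call(*current);
    }

private:
    std::shared_ptr<FakeDevice> snapshot() {
        std::lock_guard<std::mutex> lock(mutex_);
        return device_;
    }

    std::mutex mutex_;
    std::shared_ptr<FakeDevice> device_;
    TryGet tryGet_;
};
```

In this sketch, a call on a dead device fails immediately while the driver
is still down (leaving the caller free to take a fallback path) and
succeeds again once the lookup returns a fresh device.  The check
"device_ == current" is one way to tolerate another thread having already
performed the recovery.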

As an alternative approach, it might be possible to handle recovery by
means of a death recipient, rather than during a VersionedIDevice
method call.  However, that alternative approach would probably result
in more transient failures after a crash, because the recovery
would then be asynchronous with respect to calls that are vulnerable
to a dead driver.

Bug: 118623798

Test: NeuralNetworksTest_static
Test: NeuralNetworksTest_mt_static
Test: Ran NeuralNetworksTest_static --gtest_filter=TrivialTest.AddTwo --gtest_repeat=-1
      and killed driver during the run; verified that there are
      no failures (we use the CPU fallback path) and that we eventually
      recover from the driver death (saw in the logcat that we run on
      device, then attempt recovery and fail several times and so run on
      CPU, then succeed in recovery and go back to running on device).
Test: Modified VersionedIDevice::recoverable<> so that the first time we
      find a dead object, we sleep 20 seconds, allowing time for another
      thread to recover from the driver crash, so that the sleeper needs
      to tolerate the recovery already having happened.  Ran
      NeuralNetworksTest_mt_static --gtest_filter=GeneratedTests.add --gtest_repeat=-1
      and killed driver during the run; verified that there are no
      failures (we use the CPU fallback path) and that we took the
      recovery path (by observing that the sleep happened and by
      inspecting the logcat).
Test: Modified NeuralNetworksTest_static TrivialTest.AddTwo to use
      introspection/control interface to force a particular driver;
      set debug.nn.partition to 2 to turn off CPU fallback;
      ran NeuralNetworksTest_static --gtest_filter=TrivialTest.AddTwo --gtest_repeat=-1
      and killed driver during the run; verified that there are
      several failures (as we attempt recovery and fail several times)
      but that we eventually recover from the driver death (saw in the logcat
      that we went through the recovery path and that we go back to
      using the driver).
Test: Modified each sample-* driver to sleep(10) when it begins its
      asynchronous execution; ran NeuralNetworksTest_static
      --gtest_filter=GeneratedTests.add with
        useCpuOnly = 0, computeMode = ComputeMode::ASYNC, allowSyncExecHal = 0
      and killed driver and confirmed (1) that the runtime was not blocked and
      (2) that an appropriate log message was recorded.  See http://ag/6575732.
Test: Modified each sample-* driver to do asynchronous prepareModel and to sleep(10)
      when it begins its asynchronous preparation; ran NeuralNetworksTest_static
      --gtest_filter=GeneratedTests.add with
        useCpuOnly = 0, computeMode = ComputeMode::ASYNC, allowSyncExecHal = 0
      and killed driver and confirmed (1) that the runtime was not blocked and
      (2) that an appropriate log message was recorded.  See http://ag/6575732.
Test: Modified each sample-* driver to return an error for launching an
      asynchronous call (tested execution and prepareModel separately), but not
      make the corresponding call to callback->notify; ran NeuralNetworksTest_static
      --gtest_filter=GeneratedTests.add with
        useCpuOnly = 0, computeMode = ComputeMode::ASYNC, allowSyncExecHal = 0
      and confirmed that the execution succeeded and that appropriate
      messages were logged (preparation or execution failure followed by CPU fallback).
      See http://ag/7669359.

Change-Id: I55b779bc2a38243d5df122433672a9f2e073c8b4
parent 7d1ea8b7