Partially recover from a driver crash

If a driver crashes, every object associated with that driver becomes
"dead", and any method invocation on such an object fails with a
transport error. In the NNAPI, this is a problem for IDevice and
IPreparedModel objects. Without some mechanism to recover from a
driver crash, all further uses of an IDevice or IPreparedModel will
fail -- e.g., it's impossible to execute an already-compiled model,
and it's impossible to create a new compiled model. The only way to
recover from this is to restart the application.

This fix addresses the first part of this problem. All references to
IDevice in the runtime go through VersionedIDevice, so it suffices to
replace the IDevice reference in a VersionedIDevice when the IDevice
dies. Therefore, it is now possible to create a new compiled model
after a driver crash (the crash will appear to be a transient error).
A previously-compiled model is still dead, and this fix does not
address that problem.
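
The replace-on-death idea can be sketched as follows. This is a minimal illustration, not the actual VersionedIDevice code: FakeDevice, DeviceHolder, the lookup callback, and the use of an empty std::optional to model a transport error are all hypothetical stand-ins. The sketch also tolerates another thread having already replaced the dead reference.

```cpp
#include <functional>
#include <memory>
#include <mutex>
#include <optional>
#include <utility>

// Hypothetical stand-in for a driver proxy object such as IDevice.
struct FakeDevice {
    bool dead = false;
    // An empty optional models a transport error on a dead object.
    std::optional<int> prepareModel(int model) const {
        if (dead) return std::nullopt;
        return model + 1;
    }
};

// Hypothetical wrapper playing the role of VersionedIDevice: callers only
// ever see the wrapper, so swapping the wrapped reference is enough.
class DeviceHolder {
public:
    // Non-blocking lookup; returns nullptr if the driver is not available.
    using Lookup = std::function<std::shared_ptr<FakeDevice>()>;

    DeviceHolder(std::shared_ptr<FakeDevice> device, Lookup lookup)
        : mDevice(std::move(device)), mLookup(std::move(lookup)) {}

    std::optional<int> prepareModel(int model) {
        return recoverable([model](const FakeDevice& d) { return d.prepareModel(model); });
    }

private:
    template <typename Fn>
    std::optional<int> recoverable(Fn fn) {
        std::shared_ptr<FakeDevice> device;
        {
            std::lock_guard<std::mutex> lock(mMutex);
            device = mDevice;
        }
        std::optional<int> result = fn(*device);
        if (result.has_value()) return result;

        {
            std::lock_guard<std::mutex> lock(mMutex);
            // Replace the reference only if no other thread has done so.
            if (mDevice == device) {
                std::shared_ptr<FakeDevice> fresh = mLookup();
                if (fresh == nullptr) return std::nullopt;  // driver not back yet
                mDevice = std::move(fresh);
            }
            device = mDevice;
        }
        return fn(*device);  // retry once on the replacement
    }

    std::mutex mMutex;
    std::shared_ptr<FakeDevice> mDevice;
    Lookup mLookup;
};

// Usage: succeed, fail while the driver is down, succeed after it returns.
bool demoRecoverAfterCrash() {
    auto first = std::make_shared<FakeDevice>();
    bool driverIsBack = false;
    DeviceHolder holder(first, [&driverIsBack]() -> std::shared_ptr<FakeDevice> {
        if (!driverIsBack) return nullptr;
        return std::make_shared<FakeDevice>();
    });
    bool ok = holder.prepareModel(1) == std::optional<int>(2);
    first->dead = true;  // simulate the driver crash
    ok = ok && !holder.prepareModel(1).has_value();
    driverIsBack = true;  // simulate the driver restart
    ok = ok && holder.prepareModel(1) == std::optional<int>(2);
    return ok;
}
```

In this sketch, objects obtained from the old device before the crash are not touched, which mirrors the limitation noted above for previously-compiled models.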

When we attempt to replace the IDevice, we use tryGetService() rather
than getService(): rather than waiting for the driver to become
available, we recover it if it is available, and otherwise retain the
behavior prior to this change -- i.e., the attempt to use the IDevice
fails, and the runtime employs a fallback path if possible. This way we
avoid a potentially long wait for the driver to come back up (up to 5
seconds, by default, per init restart_period behavior).
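
The non-blocking choice can be illustrated with a small model. Registry, tryGet(), and executeOnce() are hypothetical names invented for this sketch; tryGet() only imitates the "return immediately, possibly with nothing" behavior described above and is not the real service manager API.

```cpp
#include <map>
#include <memory>
#include <string>

struct Driver {
    std::string name;
};

// Hypothetical registry standing in for the service manager.
class Registry {
public:
    void publish(const std::string& name) {
        mServices[name] = std::make_shared<Driver>(Driver{name});
    }
    void remove(const std::string& name) { mServices.erase(name); }

    // Non-blocking lookup: returns nullptr at once if the driver is absent,
    // instead of waiting for it to be registered.
    std::shared_ptr<Driver> tryGet(const std::string& name) const {
        auto it = mServices.find(name);
        if (it == mServices.end()) return nullptr;
        return it->second;
    }

private:
    std::map<std::string, std::shared_ptr<Driver>> mServices;
};

enum class RanOn { DRIVER, CPU_FALLBACK };

// If the driver cannot be obtained right now, fall back rather than wait.
RanOn executeOnce(const Registry& registry, const std::string& name) {
    if (registry.tryGet(name) != nullptr) return RanOn::DRIVER;
    return RanOn::CPU_FALLBACK;
}

// Usage: driver, then fallback while the driver is down, then driver again.
bool demoNonBlockingRecovery() {
    Registry registry;
    registry.publish("sample-driver");
    bool ok = executeOnce(registry, "sample-driver") == RanOn::DRIVER;
    registry.remove("sample-driver");  // simulate the driver crash
    ok = ok && executeOnce(registry, "sample-driver") == RanOn::CPU_FALLBACK;
    registry.publish("sample-driver");  // simulate the driver restart
    ok = ok && executeOnce(registry, "sample-driver") == RanOn::DRIVER;
    return ok;
}
```

The trade-off shown here is that a call made while the driver is down takes the fallback immediately; a blocking lookup would instead stall that call until the driver is registered again.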

As an alternative approach, it might be possible to handle recovery by
means of a death recipient, rather than during a VersionedIDevice
method call. However, that alternative approach would probably result
in more transient failures after a crash, because the recovery
would then be asynchronous with respect to calls that are vulnerable
to a dead driver.

Bug: 118623798
Test: NeuralNetworksTest_static
Test: NeuralNetworksTest_mt_static
Test: Ran NeuralNetworksTest_static --gtest_filter=TrivialTest.AddTwo --gtest_repeat=-1
      and killed the driver during the run; verified that there are
      no failures (we use the CPU fallback path) and that we eventually
      recover from the driver death (saw in the logcat that we run on
      device, then attempt recovery and fail several times and so run on
      CPU, then succeed in recovery and go back to running on device).
Test: Modified VersionedIDevice::recoverable<> so that the first time we
      find a dead object, we sleep 20 seconds, allowing time for another
      thread to recover from the driver crash, so that the sleeper needs
      to tolerate the recovery already having happened. Ran
      NeuralNetworksTest_mt_static --gtest_filter=GeneratedTests.add --gtest_repeat=-1
      and killed the driver during the run; verified that there are no
      failures (we use the CPU fallback path) and that we took the
      recovery path (by observing that the sleep happened and by
      inspecting the logcat).
Test: Modified NeuralNetworksTest_static TrivialTest.AddTwo to use the
      introspection/control interface to force a particular driver;
      set debug.nn.partition to 2 to turn off CPU fallback;
      ran NeuralNetworksTest_static --gtest_filter=TrivialTest.AddTwo --gtest_repeat=-1
      and killed the driver during the run; verified that there are
      several failures (as we attempt recovery and fail several times)
      but that we eventually recover from the driver death (saw in the logcat
      that we went through the recovery path and that we go back to
      using the driver).
Test: Modified each sample-* driver to sleep(10) when it begins its
      asynchronous execution; ran NeuralNetworksTest_static
      --gtest_filter=GeneratedTests.add with
        useCpuOnly = 0, computeMode = ComputeMode::ASYNC, allowSyncExecHal = 0
      and killed driver and confirmed (1) that the runtime was not blocked and
      (2) that an appropriate log message was recorded. See http://ag/6575732.
Test: Modified each sample-* driver to do asynchronous prepareModel and to sleep(10)
      when it begins its asynchronous preparation; ran NeuralNetworksTest_static
      --gtest_filter=GeneratedTests.add with
        useCpuOnly = 0, computeMode = ComputeMode::ASYNC, allowSyncExecHal = 0
      and killed driver and confirmed (1) that the runtime was not blocked and
      (2) that an appropriate log message was recorded. See http://ag/6575732.
Test: Modified each sample-* driver to return an error for launching an
      asynchronous call (tested execution and prepareModel separately), but not
      make the corresponding call to callback->notify; ran NeuralNetworksTest_static
      --gtest_filter=GeneratedTests.add with
        useCpuOnly = 0, computeMode = ComputeMode::ASYNC, allowSyncExecHal = 0
      and confirmed that the execution succeeded and that appropriate
      messages were logged (preparation or execution failure followed by CPU fallback).
      See http://ag/7669359.
Change-Id: I55b779bc2a38243d5df122433672a9f2e073c8b4