From e44565f62a72064e686f7a852137595ec94d78f2 Mon Sep 17 00:00:00 2001 From: "Luis R. Rodriguez" Date: Thu, 20 Jul 2017 13:13:09 -0700 Subject: [PATCH 1/3] firmware: fix batched requests - wake all waiters The firmware cache mechanism serves two purposes, the secondary purpose is not well documented nor understood. This fixes a regression with the secondary purpose of the firmware cache mechanism: batched requests on successful lookups. Without this fix *any* time a batched request is triggered, secondary requests for which the batched request mechanism was designed for will seem to last forver and seem to never return. This issue is present for all kernel builds possible, and a hard reset is required. The firmware cache is used for: 1) Addressing races with file lookups during the suspend/resume cycle by keeping firmware in memory during the suspend/resume cycle 2) Batched requests for the same file rely only on work from the first file lookup, which keeps the firmware in memory until the last release_firmware() is called Batched requests *only* take effect if secondary requests come in prior to the first user calling release_firmware(). The devres name used for the internal firmware cache is used as a hint other pending requests are ongoing, the firmware buffer data is kept in memory until the last user of the buffer calls release_firmware(), therefore serializing requests and delaying the release until all requests are done. Batched requests wait for a wakup or signal so we can rely on the first file fetch to write to the pending secondary requests. Commit 5b029624948d ("firmware: do not use fw_lock for fw_state protection") ported the firmware API to use swait, and in doing so failed to convert complete_all() to swake_up_all() -- it used swake_up(), loosing the ability for *some* batched requests to take effect. We *could* fix this by just using swake_up_all() *but* swait is now known to be very special use case, so its best to just move away from it. So we just go back to using completions as before commit 5b029624948d ("firmware: do not use fw_lock for fw_state protection") given this was using complete_all(). Without this fix it has been reported plugging in two Intel 6260 Wifi cards on a system will end up enumerating the two devices only 50% of the time [0]. The ported swake_up() should have actually handled the case with two devices, however, *if more than two cards are used* the swake_up() would not have sufficed. This change is only part of the required fixes for batched requests. Another fix is provided in the next patch. This particular change should fix the cases where more than three requests with the same firmware name is used, otherwise batched requests will wait for MAX_SCHEDULE_TIMEOUT and just timeout eventually. Below is a summary of tests triggering batched requests on different kernel builds. Before this patch: ============================================================================ CONFIG_FW_LOADER_USER_HELPER_FALLBACK=n CONFIG_FW_LOADER_USER_HELPER=y Most common Linux distribution setup. API-type no-firmware-found firmware-found ---------------------------------------------------------------------- request_firmware() FAIL FAIL request_firmware_direct() FAIL FAIL request_firmware_nowait(uevent=true) FAIL FAIL request_firmware_nowait(uevent=false) FAIL FAIL ============================================================================ CONFIG_FW_LOADER_USER_HELPER_FALLBACK=n CONFIG_FW_LOADER_USER_HELPER=n Only possible if CONFIG_DELL_RBU=n and CONFIG_LEDS_LP55XX_COMMON=n, rare. API-type no-firmware-found firmware-found ---------------------------------------------------------------------- request_firmware() FAIL FAIL request_firmware_direct() FAIL FAIL request_firmware_nowait(uevent=true) FAIL FAIL request_firmware_nowait(uevent=false) FAIL FAIL ============================================================================ CONFIG_FW_LOADER_USER_HELPER_FALLBACK=y CONFIG_FW_LOADER_USER_HELPER=y Google Android setup. API-type no-firmware-found firmware-found ---------------------------------------------------------------------- request_firmware() FAIL FAIL request_firmware_direct() FAIL FAIL request_firmware_nowait(uevent=true) FAIL FAIL request_firmware_nowait(uevent=false) FAIL FAIL ============================================================================ After this patch: ============================================================================ CONFIG_FW_LOADER_USER_HELPER_FALLBACK=n CONFIG_FW_LOADER_USER_HELPER=y Most common Linux distribution setup. API-type no-firmware-found firmware-found ---------------------------------------------------------------------- request_firmware() FAIL OK request_firmware_direct() FAIL OK request_firmware_nowait(uevent=true) FAIL OK request_firmware_nowait(uevent=false) FAIL OK ============================================================================ CONFIG_FW_LOADER_USER_HELPER_FALLBACK=n CONFIG_FW_LOADER_USER_HELPER=n Only possible if CONFIG_DELL_RBU=n and CONFIG_LEDS_LP55XX_COMMON=n, rare. API-type no-firmware-found firmware-found ---------------------------------------------------------------------- request_firmware() FAIL OK request_firmware_direct() FAIL OK request_firmware_nowait(uevent=true) FAIL OK request_firmware_nowait(uevent=false) FAIL OK ============================================================================ CONFIG_FW_LOADER_USER_HELPER_FALLBACK=y CONFIG_FW_LOADER_USER_HELPER=y Google Android setup. API-type no-firmware-found firmware-found ---------------------------------------------------------------------- request_firmware() OK OK request_firmware_direct() FAIL OK request_firmware_nowait(uevent=true) OK OK request_firmware_nowait(uevent=false) OK OK ============================================================================ [0] https://bugzilla.kernel.org/show_bug.cgi?id=195477 CC: [4.10+] Cc: Ming Lei Fixes: 5b029624948d ("firmware: do not use fw_lock for fw_state protection") Reported-by: Jakub Kicinski Signed-off-by: Luis R. Rodriguez Signed-off-by: Greg Kroah-Hartman --- drivers/base/firmware_class.c | 12 +++++------- 1 file changed, 5 insertions(+), 7 deletions(-) diff --git a/drivers/base/firmware_class.c b/drivers/base/firmware_class.c index b9f907eedbf7..f50ec6e367bd 100644 --- a/drivers/base/firmware_class.c +++ b/drivers/base/firmware_class.c @@ -30,7 +30,6 @@ #include #include #include -#include #include @@ -112,13 +111,13 @@ static inline long firmware_loading_timeout(void) * state of the firmware loading. */ struct fw_state { - struct swait_queue_head wq; + struct completion completion; enum fw_status status; }; static void fw_state_init(struct fw_state *fw_st) { - init_swait_queue_head(&fw_st->wq); + init_completion(&fw_st->completion); fw_st->status = FW_STATUS_UNKNOWN; } @@ -131,9 +130,8 @@ static int __fw_state_wait_common(struct fw_state *fw_st, long timeout) { long ret; - ret = swait_event_interruptible_timeout(fw_st->wq, - __fw_state_is_done(READ_ONCE(fw_st->status)), - timeout); + ret = wait_for_completion_interruptible_timeout(&fw_st->completion, + timeout); if (ret != 0 && fw_st->status == FW_STATUS_ABORTED) return -ENOENT; if (!ret) @@ -148,7 +146,7 @@ static void __fw_state_set(struct fw_state *fw_st, WRITE_ONCE(fw_st->status, status); if (status == FW_STATUS_DONE || status == FW_STATUS_ABORTED) - swake_up(&fw_st->wq); + complete_all(&fw_st->completion); } #define fw_state_start(fw_st) \ From 90d41e74a9c36a84e2efbd2a5b8d79299feee6fa Mon Sep 17 00:00:00 2001 From: "Luis R. Rodriguez" Date: Thu, 20 Jul 2017 13:13:10 -0700 Subject: [PATCH 2/3] firmware: fix batched requests - send wake up on failure on direct lookups Fix batched requests from waiting forever on failure. The firmware API batched requests feature has been broken since the API call request_firmware_direct() was introduced on commit bba3a87e982ad ("firmware: Introduce request_firmware_direct()"), added on v3.14 *iff* the firmware being requested was not present in *certain kernel builds* [0]. When no firmware is found the worker which goes on to finish never informs waiters queued up of this, so any batched request will stall in what seems to be forever (MAX_SCHEDULE_TIMEOUT). Sadly, a reboot will also stall, as the reboot notifier was only designed to kill custom fallback workers. The issue seems to the user as a type of soft lockup, what *actually* happens underneath the hood is a wait call which never completes as we failed to issue a completion on error. For device drivers with optional firmware schemes (ie, Intel iwlwifi, or Netronome -- even though it uses request_firmware() and not request_firmware_direct()), this could mean that when you boot a system with multiple cards the firmware will seem to never load on the system, or that the card is just not responsive even the driver initialization. Due to differences in scheduling possible this should not always trigger -- one would need to to ensure that multiple requests are in place at the right time for this to work, also release_firmware() must not be called prior to any other incoming request. The complexity may not be worth supporting batched requests in the future given the wait mechanism is only used also for the fallback mechanism. We'll keep it for now and just fix it. Its reported that at least with the Intel WiFi cards on one system this issue was creeping up 50% of the boots [0]. Before this commit batched requests testing revealed: ============================================================================ CONFIG_FW_LOADER_USER_HELPER_FALLBACK=n CONFIG_FW_LOADER_USER_HELPER=y Most common Linux distribution setup. API-type no-firmware-found firmware-found ---------------------------------------------------------------------- request_firmware() FAIL OK request_firmware_direct() FAIL OK request_firmware_nowait(uevent=true) FAIL OK request_firmware_nowait(uevent=false) FAIL OK ============================================================================ CONFIG_FW_LOADER_USER_HELPER_FALLBACK=n CONFIG_FW_LOADER_USER_HELPER=n Only possible if CONFIG_DELL_RBU=n and CONFIG_LEDS_LP55XX_COMMON=n, rare. API-type no-firmware-found firmware-found ---------------------------------------------------------------------- request_firmware() FAIL OK request_firmware_direct() FAIL OK request_firmware_nowait(uevent=true) FAIL OK request_firmware_nowait(uevent=false) FAIL OK ============================================================================ CONFIG_FW_LOADER_USER_HELPER_FALLBACK=y CONFIG_FW_LOADER_USER_HELPER=y Google Android setup. API-type no-firmware-found firmware-found ---------------------------------------------------------------------- request_firmware() OK OK request_firmware_direct() FAIL OK request_firmware_nowait(uevent=true) OK OK request_firmware_nowait(uevent=false) OK OK ============================================================================ Ater this commit batched testing results: ============================================================================ CONFIG_FW_LOADER_USER_HELPER_FALLBACK=n CONFIG_FW_LOADER_USER_HELPER=y Most common Linux distribution setup. API-type no-firmware-found firmware-found ---------------------------------------------------------------------- request_firmware() OK OK request_firmware_direct() OK OK request_firmware_nowait(uevent=true) OK OK request_firmware_nowait(uevent=false) OK OK ============================================================================ CONFIG_FW_LOADER_USER_HELPER_FALLBACK=n CONFIG_FW_LOADER_USER_HELPER=n Only possible if CONFIG_DELL_RBU=n and CONFIG_LEDS_LP55XX_COMMON=n, rare. API-type no-firmware-found firmware-found ---------------------------------------------------------------------- request_firmware() OK OK request_firmware_direct() OK OK request_firmware_nowait(uevent=true) OK OK request_firmware_nowait(uevent=false) OK OK ============================================================================ CONFIG_FW_LOADER_USER_HELPER_FALLBACK=y CONFIG_FW_LOADER_USER_HELPER=y Google Android setup. API-type no-firmware-found firmware-found ---------------------------------------------------------------------- request_firmware() OK OK request_firmware_direct() OK OK request_firmware_nowait(uevent=true) OK OK request_firmware_nowait(uevent=false) OK OK ============================================================================ [0] https://bugzilla.kernel.org/show_bug.cgi?id=195477 Cc: stable # v3.14 Fixes: bba3a87e982ad ("firmware: Introduce request_firmware_direct()" Reported-by: Nicolas Reported-by: John Ewalt Reported-by: Jakub Kicinski Signed-off-by: Luis R. Rodriguez Signed-off-by: Greg Kroah-Hartman --- drivers/base/firmware_class.c | 38 +++++++++++++++++++++++++++-------- 1 file changed, 30 insertions(+), 8 deletions(-) diff --git a/drivers/base/firmware_class.c b/drivers/base/firmware_class.c index f50ec6e367bd..76f1b702bdd6 100644 --- a/drivers/base/firmware_class.c +++ b/drivers/base/firmware_class.c @@ -153,28 +153,27 @@ static void __fw_state_set(struct fw_state *fw_st, __fw_state_set(fw_st, FW_STATUS_LOADING) #define fw_state_done(fw_st) \ __fw_state_set(fw_st, FW_STATUS_DONE) +#define fw_state_aborted(fw_st) \ + __fw_state_set(fw_st, FW_STATUS_ABORTED) #define fw_state_wait(fw_st) \ __fw_state_wait_common(fw_st, MAX_SCHEDULE_TIMEOUT) -#ifndef CONFIG_FW_LOADER_USER_HELPER - -#define fw_state_is_aborted(fw_st) false - -#else /* CONFIG_FW_LOADER_USER_HELPER */ - static int __fw_state_check(struct fw_state *fw_st, enum fw_status status) { return fw_st->status == status; } +#define fw_state_is_aborted(fw_st) \ + __fw_state_check(fw_st, FW_STATUS_ABORTED) + +#ifdef CONFIG_FW_LOADER_USER_HELPER + #define fw_state_aborted(fw_st) \ __fw_state_set(fw_st, FW_STATUS_ABORTED) #define fw_state_is_done(fw_st) \ __fw_state_check(fw_st, FW_STATUS_DONE) #define fw_state_is_loading(fw_st) \ __fw_state_check(fw_st, FW_STATUS_LOADING) -#define fw_state_is_aborted(fw_st) \ - __fw_state_check(fw_st, FW_STATUS_ABORTED) #define fw_state_wait_timeout(fw_st, timeout) \ __fw_state_wait_common(fw_st, timeout) @@ -1198,6 +1197,28 @@ _request_firmware_prepare(struct firmware **firmware_p, const char *name, return 1; /* need to load */ } +/* + * Batched requests need only one wake, we need to do this step last due to the + * fallback mechanism. The buf is protected with kref_get(), and it won't be + * released until the last user calls release_firmware(). + * + * Failed batched requests are possible as well, in such cases we just share + * the struct firmware_buf and won't release it until all requests are woken + * and have gone through this same path. + */ +static void fw_abort_batch_reqs(struct firmware *fw) +{ + struct firmware_buf *buf; + + /* Loaded directly? */ + if (!fw || !fw->priv) + return; + + buf = fw->priv; + if (!fw_state_is_aborted(&buf->fw_st)) + fw_state_aborted(&buf->fw_st); +} + /* called from request_firmware() and request_firmware_work_func() */ static int _request_firmware(const struct firmware **firmware_p, const char *name, @@ -1241,6 +1262,7 @@ _request_firmware(const struct firmware **firmware_p, const char *name, out: if (ret < 0) { + fw_abort_batch_reqs(fw); release_firmware(fw); fw = NULL; } From 260d9f2fc5655a2552701cdd0b909c4466732f19 Mon Sep 17 00:00:00 2001 From: "Luis R. Rodriguez" Date: Thu, 20 Jul 2017 13:13:11 -0700 Subject: [PATCH 3/3] firmware: avoid invalid fallback aborts by using killable wait Commit 0cb64249ca500 ("firmware_loader: abort request if wait_for_completion is interrupted") added via 4.0 added support to abort the fallback mechanism when a signal was detected and wait_for_completion_interruptible() returned -ERESTARTSYS -- for instance when a user hits CTRL-C. The abort was overly *too* effective. When a child process terminates (successful or not) the signal SIGCHLD can be sent to the parent process which ran the child in the background and later triggered a sync request for firmware through a sysfs interface which relies on the fallback mechanism. This signal in turn can be recieved by the interruptible wait we constructed on firmware_class and detects it as an abort *before* userspace could get a chance to write the firmware. Upon failure -EAGAIN is returned, so userspace is also kept in the dark about exactly what happened. We can reproduce the issue with the fw_fallback.sh selftest: Before this patch: $ sudo tools/testing/selftests/firmware/fw_fallback.sh ... tools/testing/selftests/firmware/fw_fallback.sh: error - sync firmware request cancelled due to SIGCHLD After this patch: $ sudo tools/testing/selftests/firmware/fw_fallback.sh ... tools/testing/selftests/firmware/fw_fallback.sh: SIGCHLD on sync ignored as expected Fix this by making the wait killable -- only killable by SIGKILL (kill -9). We loose the ability to allow userspace to cancel a write with CTRL-C (SIGINT), however its been decided the compromise to require SIGKILL is worth the gains. Chances of this issue occuring are low due to the number of drivers upstream exclusively relying on the fallback mechanism for firmware (2 drivers), however this is observed in the field with custom drivers with sysfs triggers to load firmware. Only distributions relying on the fallback mechanism are impacted as well. An example reported issue was on Android, as follows: 1) Android init (pid=1) fork()s (say pid=42) [this child process is totally unrelated to firmware loading, it could be sleep 2; for all we care ] 2) Android init (pid=1) does a write() on a (driver custom) sysfs file which ends up calling request_firmware() kernel side 3) The firmware loading fallback mechanism is used, the request is sent to userspace and pid 1 waits in the kernel on wait_* 4) before firmware loading completes pid 42 dies (for any reason, even normal termination) 5) Kernel delivers SIGCHLD to pid=1 to tell it a child has died, which causes -ERESTARTSYS to be returned from wait_* 6) The kernel's wait aborts and return -EAGAIN for the request_firmware() caller. Cc: stable # 4.0 Fixes: 0cb64249ca500 ("firmware_loader: abort request if wait_for_completion is interrupted") Suggested-by: "Eric W. Biederman" Suggested-by: Dmitry Torokhov Tested-by: Martin Fuzzey Reported-by: Martin Fuzzey Signed-off-by: Luis R. Rodriguez Signed-off-by: Greg Kroah-Hartman --- drivers/base/firmware_class.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/drivers/base/firmware_class.c b/drivers/base/firmware_class.c index 76f1b702bdd6..bfbe1e154128 100644 --- a/drivers/base/firmware_class.c +++ b/drivers/base/firmware_class.c @@ -130,8 +130,7 @@ static int __fw_state_wait_common(struct fw_state *fw_st, long timeout) { long ret; - ret = wait_for_completion_interruptible_timeout(&fw_st->completion, - timeout); + ret = wait_for_completion_killable_timeout(&fw_st->completion, timeout); if (ret != 0 && fw_st->status == FW_STATUS_ABORTED) return -ENOENT; if (!ret)