Jiping Ma 2a5f9f3490 kernel: Fix drivers panic during shutdown
This commit prevents the ice and iavf drivers from causing kernel
panics when a forced reboot is initiated with "reboot -f".

Issue #1: iavf driver

If the netdev pointer is NULL, then iavf_remove() returns early to
ensure that it does not proceed with an already-freed netdev instance.
However, the drvdata field of the iavf driver's pci_dev structure continues
to keep the former value of the netdev pointer, and this value can be
acquired from the pci_dev structure via pci_get_drvdata(). This causes
a kernel panic when a forced reboot/shutdown is in progress due to the
following sequence of events:

- The iavf_shutdown() callback is called by the kernel. This function
  detaches the device, brings it down if it was running and frees
  resources.
- Later, the associated PF driver's shutdown callback is called:
  ice_shutdown(). That callback calls, among others, sriov_disable(),
  which then indirectly calls iavf_remove() again.
- A kernel WARNING is reported because adminq_task->func is NULL in
  cancel_work_sync(&adapter->adminq_task) during iavf_remove(). The
  reason is that the resources had already been freed during the first
  teardown pass.
  "WARNING: CPU: 63 PID: 93678 at kernel/workqueue.c:3047
    __flush_work.isra.0+0x6b/0x80"

The patch for iavf resolves this issue by checking the pci_dev
structure's is_busmaster field at the beginning of iavf_remove(). If the
PCI device had already been disabled by an earlier call to
iavf_shutdown() or iavf_remove(), via a call to pci_disable_device(),
then the is_busmaster field will be zero. In that case, iavf_remove()
returns early, which avoids the aforementioned kernel panic caused by
multiple calls to iavf_remove().
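
The guard described above can be illustrated with a minimal userspace
sketch. It is not the actual driver patch: struct mock_pci_dev and all
mock_* functions are hypothetical stand-ins for the kernel's pci_dev,
pci_disable_device(), iavf_shutdown() and iavf_remove(), and the
teardown_count field exists only to make the effect observable.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the kernel's struct pci_dev. */
struct mock_pci_dev {
    bool is_busmaster;   /* mirrors pci_dev->is_busmaster */
    int teardown_count;  /* illustration only: number of real teardowns */
};

/* Stand-in for pci_disable_device(), which clears is_busmaster. */
static void mock_pci_disable_device(struct mock_pci_dev *pdev)
{
    pdev->is_busmaster = false;
}

/* Stand-in for iavf_shutdown(): releases resources, disables the device. */
static void mock_iavf_shutdown(struct mock_pci_dev *pdev)
{
    pdev->teardown_count++;
    mock_pci_disable_device(pdev);
}

/* Stand-in for iavf_remove() with the early-return guard added. */
static void mock_iavf_remove(struct mock_pci_dev *pdev)
{
    /* Already disabled by an earlier shutdown/remove: do nothing. */
    if (!pdev->is_busmaster)
        return;

    pdev->teardown_count++;
    mock_pci_disable_device(pdev);
}
```

In this model, calling mock_iavf_shutdown() followed by
mock_iavf_remove() tears the device down only once, which is the
behaviour the guard is meant to ensure.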

Note that the description above is applicable to iavf-4.6.1 (in NIC
driver bundle cvl-4.10); however, a similar issue occurs in earlier
versions of the iavf driver as well, which necessitates the same fix.

Issue #2: ice driver

When the system is rebooted, the ice driver's ice_remove() function
releases the PTP-related resources before the irq_msix_misc interrupt is
disabled. However, the interrupt handler still uses these resources, so
if the interrupt fires after they have been released, a kernel panic
occurs.

This issue is fixed by disabling the irq_msix_misc interrupt before the
call to ice_ptp_release() in ice_remove().
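
The ordering can be sketched in userspace as follows. This is only an
illustration under stated assumptions, not the driver code: struct
mock_pf, struct mock_ptp and all mock_* functions are hypothetical
stand-ins for the ice PF structure, its PTP state, the irq_msix_misc
handler, ice_ptp_release() and ice_remove().

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the PTP state used by the interrupt handler. */
struct mock_ptp {
    int events_handled;
};

/* Hypothetical stand-in for the ice PF structure. */
struct mock_pf {
    bool misc_irq_enabled;  /* models whether irq_msix_misc can fire */
    struct mock_ptp *ptp;   /* NULL once the PTP resources are released */
};

/* Models the misc interrupt firing; returns true if the handler ran. */
static bool mock_misc_irq_fire(struct mock_pf *pf)
{
    if (!pf->misc_irq_enabled)
        return false;
    /* Would dereference NULL if PTP were released while still enabled. */
    pf->ptp->events_handled++;
    return true;
}

static void mock_disable_misc_irq(struct mock_pf *pf)
{
    pf->misc_irq_enabled = false;
}

/* Stand-in for ice_ptp_release(). */
static void mock_ptp_release(struct mock_pf *pf)
{
    pf->ptp = NULL;
}

/* Stand-in for ice_remove() with the fixed ordering. */
static void mock_ice_remove(struct mock_pf *pf)
{
    mock_disable_misc_irq(pf);  /* first: the handler can no longer run */
    mock_ptp_release(pf);       /* then: releasing PTP state is safe */
}
```

With this ordering, an interrupt arriving after mock_ice_remove() is
ignored instead of touching released PTP state.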

Please note that colleagues at Intel have reviewed the fixes included in
this commit, and they have confirmed that these changes could be used as
a temporary workaround for now. The changes introduced by this commit
can be reverted once Intel resolves the aforementioned issues in the
official ice and iavf driver releases.

This issue can be reproduced with the following steps:
1. Install the sts-silicom app.
2. Make sure sts-silicom is running.
3. Run "reboot -f".

Verification:
- build-pkgs; build-iso; install and boot up on aio-sx lab.
- With the fix applied, the issue can no longer be reproduced using the
  steps above.

Closes-Bug: 2030725

Change-Id: Ib296dc3180023230c46aa028a7d7c4283b17cff0
Signed-off-by: Jiping Ma <jiping.ma2@windriver.com>
2023-09-05 23:24:19 -04:00