summaryrefslogtreecommitdiff
path: root/drivers/edac
AgeCommit message (Collapse)Author
2024-07-15Merge tag 'x86_misc_for_v6.11_rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull misc x86 updates from Borislav Petkov: - Make error checking of AMD SMN accesses more robust in the callers as they're the only ones who can interpret the results properly - The usual cleanups and fixes, left and right * tag 'x86_misc_for_v6.11_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/kmsan: Fix hook for unaligned accesses x86/platform/iosf_mbi: Convert PCIBIOS_* return codes to errnos x86/pci/xen: Fix PCIBIOS_* return code handling x86/pci/intel_mid_pci: Fix PCIBIOS_* return code handling x86/of: Return consistent error type from x86_of_pci_irq_enable() hwmon: (k10temp) Rename _data variable hwmon: (k10temp) Remove unused HAVE_TDIE() macro hwmon: (k10temp) Reduce k10temp_get_ccd_support() parameters hwmon: (k10temp) Define a helper function to read CCD temperature x86/amd_nb: Enhance SMN access error checking hwmon: (k10temp) Check return value of amd_smn_read() EDAC/amd64: Check return value of amd_smn_read() EDAC/amd64: Remove unused register accesses tools/x86/kcpuid: Add missing dir via Makefile x86, arm: Add missing license tag to syscall tables files
2024-07-15Merge tag 'edac_updates_for_v6.11' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras Pull EDAC updates from Borislav Petkov: - The AMD memory controllers data fabric version 4.5 supports non-power-of-2 denormalization in the sense that certain bits of the system physical address cannot be reconstructed from the normalized address reported by the RAS hardware. Add support for handling such addresses - Switch the EDAC drivers to the new Intel CPU model defines - The usual fixes and cleanups all over the place * tag 'edac_updates_for_v6.11' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras: EDAC: Add missing MODULE_DESCRIPTION() macros EDAC/dmc520: Use devm_platform_ioremap_resource() EDAC/igen6: Add Intel Arrow Lake-U/H SoCs support RAS/AMD/FMPM: Use atl internal.h for INVALID_SPA RAS/AMD/ATL: Implement DF 4.5 NP2 denormalization RAS/AMD/ATL: Validate address map when information is gathered RAS/AMD/ATL: Expand helpers for adding and removing base and hole RAS/AMD/ATL: Read DRAM hole base early RAS/AMD/ATL: Add amd_atl pr_fmt() prefix RAS/AMD/ATL: Add a missing module description EDAC, i10nm: make skx_common.o a separate module EDAC/skx: Switch to new Intel CPU model defines EDAC/sb_edac: Switch to new Intel CPU model defines EDAC, pnd2: Switch to new Intel CPU model defines EDAC/i10nm: Switch to new Intel CPU model defines EDAC/ghes: Add missing newline to pr_info() statement RAS/AMD/ATL: Add missing newline to pr_info() statement EDAC/thunderx: Remove unused struct error_syndrome
2024-06-29EDAC: Add missing MODULE_DESCRIPTION() macrosJeff Johnson
With ARCH=arm64 make allmodconfig && make W=1 C=1 reports: WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/edac/layerscape_edac_mod.o Add the missing invocation of the MODULE_DESCRIPTION() macro to all files which have a MODULE_LICENSE(). This includes mpc85xx_edac.c and four octeon_edac-*.c files which, although they did not produce a warning with the arm64 allmodconfig configuration, may cause this warning with other configurations. [ bp: s/module/driver/ for layerscape_edac ] Signed-off-by: Jeff Johnson <quic_jjohnson@quicinc.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20240617-md-arm64-drivers-edac-v2-1-6d6c5dd1e5da@quicinc.com
2024-06-23EDAC/dmc520: Use devm_platform_ioremap_resource()Jai Arora
platform_get_resource() and devm_ioremap_resource() are wrapped up in the devm_platform_ioremap_resource() helper. Use the helper and get rid of the local variable for struct resource *. Signed-off-by: Jai Arora <jai.arora@samsung.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20240618110226.97395-1-jai.arora@samsung.com
2024-06-14EDAC/igen6: Add Intel Arrow Lake-U/H SoCs supportQiuxu Zhuo
Arrow Lake-U/H SoCs share same IBECC registers with Meteor Lake-P SoCs. Add Arrow Lake-U/H SoC compute die IDs for EDAC support. Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/r/20240614030354.69180-1-qiuxu.zhuo@intel.com
2024-06-12EDAC/amd64: Check return value of amd_smn_read()Yazen Ghannam
Check the return value of amd_smn_read() before saving a value. This ensures invalid values aren't saved. The struct umc instance is initialized to 0 during memory allocation. Therefore, a bad read will keep the value as 0 providing the expected Read-as-Zero behavior. Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Mario Limonciello <mario.limonciello@amd.com> Link: https://lore.kernel.org/r/20240606-fix-smn-bad-read-v4-2-ffde21931c3f@amd.com
2024-06-12EDAC/amd64: Remove unused register accessesYazen Ghannam
A number of UMC registers are read only for the purpose of debug printing. They are not used in any calculations. Nor do they have any specific debug value. Remove them. Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Mario Limonciello <mario.limonciello@amd.com> Link: https://lore.kernel.org/r/20240606-fix-smn-bad-read-v4-1-ffde21931c3f@amd.com
2024-06-04EDAC/igen6: Convert PCIBIOS_* return codes to errnosIlpo Järvinen
errcmd_enable_error_reporting() uses pci_{read,write}_config_word() that return PCIBIOS_* codes. The return code is then returned all the way into the probe function igen6_probe() that returns it as is. The probe functions, however, should return normal errnos. Convert PCIBIOS_* returns code using pcibios_err_to_errno() into normal errno before returning it from errcmd_enable_error_reporting(). Fixes: 10590a9d4f23 ("EDAC/igen6: Add EDAC driver for Intel client SoCs using IBECC") Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20240527132236.13875-2-ilpo.jarvinen@linux.intel.com
2024-06-04EDAC/amd64: Convert PCIBIOS_* return codes to errnosIlpo Järvinen
gpu_get_node_map() uses pci_read_config_dword() that returns PCIBIOS_* codes. The return code is then returned all the way into the module init function amd64_edac_init() that returns it as is. The module init functions, however, should return normal errnos. Convert PCIBIOS_* returns code using pcibios_err_to_errno() into normal errno before returning it from gpu_get_node_map(). For consistency, convert also the other similar cases which return PCIBIOS_* codes even if they do not have any bugs at the moment. Fixes: 4251566ebc1c ("EDAC/amd64: Cache and use GPU node map") Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20240527132236.13875-1-ilpo.jarvinen@linux.intel.com
2024-05-29EDAC, i10nm: make skx_common.o a separate moduleArnd Bergmann
Commit 598afa050403 ("kbuild: warn objects shared among multiple modules") was added to track down cases where the same object is linked into multiple modules. This can cause serious problems if some modules are builtin while others are not. That test triggers this warning: scripts/Makefile.build:236: drivers/edac/Makefile: skx_common.o is added to multiple modules: i10nm_edac skx_edac Make this a separate module instead. [Tony: Added more background details to commit message] Fixes: d4dc89d069aa ("EDAC, i10nm: Add a driver for Intel 10nm server processors") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/all/20240529095132.1929397-1-arnd@kernel.org/
2024-05-28EDAC/skx: Switch to new Intel CPU model definesTony Luck
New CPU #defines encode vendor and family as well as model. Signed-off-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/r/20240520224620.9480-39-tony.luck@intel.com
2024-05-28EDAC/sb_edac: Switch to new Intel CPU model definesTony Luck
New CPU #defines encode vendor and family as well as model. Signed-off-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/r/20240520224620.9480-38-tony.luck@intel.com
2024-05-28EDAC, pnd2: Switch to new Intel CPU model definesTony Luck
New CPU #defines encode vendor and family as well as model. Signed-off-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/r/20240520224620.9480-37-tony.luck@intel.com
2024-05-28EDAC/i10nm: Switch to new Intel CPU model definesTony Luck
New CPU #defines encode vendor and family as well as model. Signed-off-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/r/20240520224620.9480-36-tony.luck@intel.com
2024-05-28EDAC/ghes: Add missing newline to pr_info() statementVasyl Gomonovych
Add a missing newline character even if printk() adds newlines to non-\n-terminated strings because in the unlikely case a KERN_CONT print statement is added after the unterminated statement, the two will get glued together which is not the expected behavior. [ bp: Rewrite commit message. ] Signed-off-by: Vasyl Gomonovych <gomonovych@gmail.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20240517204951.2019031-1-gomonovych@gmail.com
2024-05-27EDAC/thunderx: Remove unused struct error_syndromeDr. David Alan Gilbert
struct error_syndrome appears never to have been used. Remove it, together with the MAX_SYNDROME_REGS it used. Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20240516133404.251397-1-linux@treblig.org
2024-05-14Merge tag 'edac_updates_for_v6.10' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras Pull EDAC updates from Borislav Petkov: - Have skx_edac decode error addresses belonging to SGX properly - Remove a bunch of unused struct members - Other cleanups * tag 'edac_updates_for_v6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras: EDAC/skx_common: Allow decoding of SGX addresses EDAC/mc_sysfs: Convert sprintf()/snprintf() to sysfs_emit() EDAC: Remove unused struct members EDAC: Remove dynamic attributes from edac_device_alloc_ctl_info() EDAC/device: Remove edac_dev_sysfs_block_attribute::store() EDAC/device: Remove edac_dev_sysfs_block_attribute::{block,value} EDAC/amd64: Remove unused struct member amd64_pvt::ext_nbcfg
2024-05-06EDAC/synopsys: Fix ECC status and IRQ control race conditionSerge Semin
The race condition around the ECCCLR register access happens in the IRQ disable method called in the device remove() procedure and in the ECC IRQ handler: 1. Enable IRQ: a. ECCCLR = EN_CE | EN_UE 2. Disable IRQ: a. ECCCLR = 0 3. IRQ handler: a. ECCCLR = CLR_CE | CLR_CE_CNT | CLR_CE | CLR_CE_CNT b. ECCCLR = 0 c. ECCCLR = EN_CE | EN_UE So if the IRQ disabling procedure is called concurrently with the IRQ handler method the IRQ might be actually left enabled due to the statement 3c. The root cause of the problem is that ECCCLR register (which since v3.10a has been called as ECCCTL) has intermixed ECC status data clear flags and the IRQ enable/disable flags. Thus the IRQ disabling (clear EN flags) and handling (write 1 to clear ECC status data) procedures must be serialised around the ECCCTL register modification to prevent the race. So fix the problem described above by adding the spin-lock around the ECCCLR modifications and preventing the IRQ-handler from modifying the IRQs enable flags (there is no point in disabling the IRQ and then re-enabling it again within a single IRQ handler call, see the statements 3a/3b and 3c above). Fixes: f7824ded4149 ("EDAC/synopsys: Add support for version 3 of the Synopsys EDAC DDR") Signed-off-by: Serge Semin <fancer.lancer@gmail.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20240222181324.28242-2-fancer.lancer@gmail.com
2024-04-25EDAC/versal: Do not log total error countsShubhrajyoti Datta
When logging errors, the driver currently logs the total error count. However, it should log the current error only. Fix it. [ bp: Rewrite text. ] Fixes: 6f15b178cd63 ("EDAC/versal: Add a Xilinx Versal memory controller driver") Signed-off-by: Shubhrajyoti Datta <shubhrajyoti.datta@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20240425121942.26378-4-shubhrajyoti.datta@amd.com
2024-04-25EDAC/versal: Check user-supplied data before injecting an errorShubhrajyoti Datta
The function inject_data_ue_store() lacks a NULL check for the user passed values. To prevent below kernel crash include a NULL check. Call trace: kstrtoull kstrtou8 inject_data_ue_store full_proxy_write vfs_write ksys_write __arm64_sys_write invoke_syscall el0_svc_common.constprop.0 do_el0_svc el0_svc el0t_64_sync_handler el0t_64_sync Fixes: 83bf24051a60 ("EDAC/versal: Make the bit position of injected errors configurable") Signed-off-by: Shubhrajyoti Datta <shubhrajyoti.datta@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20240425121942.26378-3-shubhrajyoti.datta@amd.com
2024-04-25EDAC/versal: Do not register for NOC errorsShubhrajyoti Datta
The NOC errors are not handled in the driver. Remove the request for registration. Signed-off-by: Shubhrajyoti Datta <shubhrajyoti.datta@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20240425121942.26378-2-shubhrajyoti.datta@amd.com
2024-04-08EDAC/skx_common: Allow decoding of SGX addressesQiuxu Zhuo
There are no "struct page" associations with SGX pages, causing the check pfn_to_online_page() to fail. This results in the inability to decode the SGX addresses and warning messages like: Invalid address 0x34cc9a98840 in IA32_MC17_ADDR Add an additional check to allow the decoding of the error address and to skip the warning message, if the error address is an SGX address. Fixes: 1e92af09fab1 ("EDAC/skx_common: Filter out the invalid address") Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/r/20240408120419.50234-1-qiuxu.zhuo@intel.com
2024-04-04EDAC/mc_sysfs: Convert sprintf()/snprintf() to sysfs_emit()Li Zhijian
Per Documentation/filesystems/sysfs.rst, show() should only use sysfs_emit() or sysfs_emit_at() when formatting the value to be returned to user space. Generated by: make coccicheck M=<path/to/file> MODE=patch \ COCCI=scripts/coccinelle/api/device_attr_show.cocci No functional change intended. [ bp: Massage. ] Signed-off-by: Li Zhijian <lizhijian@fujitsu.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20240314084628.1322006-1-lizhijian@fujitsu.com
2024-03-27EDAC: Remove unused struct membersJiri Slaby (SUSE)
Remove unused - edac_pci_ctl_info::edac_subsys - edac_pci_ctl_info::complete - edac_device_ctl_info::removal_complete members. Found by https://github.com/jirislaby/clang-struct. [ bp: Squash three almost identical trivial patches into one. ] Signed-off-by: Jiri Slaby (SUSE) <jirislaby@kernel.org> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20240213112051.27715-6-jirislaby@kernel.org Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
2024-03-27EDAC: Remove dynamic attributes from edac_device_alloc_ctl_info()Jiri Slaby (SUSE)
Dynamic attributes are not passed from any caller of edac_device_alloc_ctl_info(). Drop this unused/untested functionality completely. Signed-off-by: Jiri Slaby (SUSE) <jirislaby@kernel.org> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20240213112051.27715-5-jirislaby@kernel.org Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
2024-03-27EDAC/device: Remove edac_dev_sysfs_block_attribute::store()Jiri Slaby (SUSE)
No one uses this store hook (both BLOCK_ATTR() pass NULL). It actually never was since its addition in fd309a9d8e63 ("drivers/edac: fix leaf sysfs attribute") so drop it. Found by https://github.com/jirislaby/clang-struct. Signed-off-by: Jiri Slaby (SUSE) <jirislaby@kernel.org> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20240213112051.27715-4-jirislaby@kernel.org Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
2024-03-27EDAC/device: Remove edac_dev_sysfs_block_attribute::{block,value}Jiri Slaby (SUSE)
They're unused. And they were never used since their addition in fd309a9d8e63 ("drivers/edac: fix leaf sysfs attribute") Drop it. Found by https://github.com/jirislaby/clang-struct. Signed-off-by: Jiri Slaby (SUSE) <jirislaby@kernel.org> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20240213112051.27715-3-jirislaby@kernel.org Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
2024-03-27EDAC/amd64: Remove unused struct member amd64_pvt::ext_nbcfgJiri Slaby (SUSE)
Commit cfe40fdb4a46 ("amd64_edac: add driver header") added amd64_pvt struct with ext_nbcfg in it. But no one used that member since then. Therefore, remove it. Found by https://github.com/jirislaby/clang-struct. Signed-off-by: Jiri Slaby (SUSE) <jirislaby@kernel.org> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com> Link: https://lore.kernel.org/r/20240213112051.27715-2-jirislaby@kernel.org Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
2024-03-11Merge tag 'edac_updates_for_v6.9' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras Pull EDAC updates from Borislav Petkov: - Add a FRU (Field Replaceable Unit) memory poison manager which collects and manages previously encountered hw errors in order to save them to persistent storage across reboots. Previously recorded errors are "replayed" upon reboot in order to poison memory which has caused said errors in the past. The main use case is stacked, on-chip memory which cannot simply be replaced so poisoning faulty areas of it and thus making them inaccessible is the only strategy to prolong its lifetime. - Add an AMD address translation library glue which converts the reported addresses of hw errors into system physical addresses in order to be used by other subsystems like memory failure, for example. Add support for MI300 accelerators to that library. - igen6: Add support for Alder Lake-N SoC - i10nm: Add Grand Ridge support - The usual fixlets and cleanups * tag 'edac_updates_for_v6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras: EDAC/versal: Convert to platform remove callback returning void RAS/AMD/FMPM: Fix off by one when unwinding on error RAS/AMD/FMPM: Add debugfs interface to print record entries RAS/AMD/FMPM: Save SPA values RAS: Export helper to get ras_debugfs_dir RAS/AMD/ATL: Fix bit overflow in denorm_addr_df4_np2() RAS: Introduce a FRU memory poison manager RAS/AMD/ATL: Add MI300 row retirement support Documentation: Move RAS section to admin-guide EDAC/versal: Make the bit position of injected errors configurable EDAC/i10nm: Add Intel Grand Ridge micro-server support EDAC/igen6: Add one more Intel Alder Lake-N SoC support RAS/AMD/ATL: Add MI300 DRAM to normalized address translation support RAS/AMD/ATL: Fix array overflow in get_logical_coh_st_fabric_id_mi300() RAS/AMD/ATL: Add MI300 support Documentation: RAS: Add index and address translation section EDAC/amd64: Use new AMD Address Translation Library RAS: Introduce AMD Address Translation Library EDAC/synopsys: Convert to devm_platform_ioremap_resource()
2024-03-11Merge remote-tracking branches 'ras/edac-drivers', 'ras/edac-misc' and ↵Borislav Petkov (AMD)
'ras/edac-amd-atl' into edac-updates-for-v6.9 * ras/edac-drivers: EDAC/i10nm: Add Intel Grand Ridge micro-server support EDAC/igen6: Add one more Intel Alder Lake-N SoC support * ras/edac-misc: EDAC/versal: Convert to platform remove callback returning void EDAC/versal: Make the bit position of injected errors configurable EDAC/synopsys: Convert to devm_platform_ioremap_resource() * ras/edac-amd-atl: RAS/AMD/FMPM: Fix off by one when unwinding on error RAS/AMD/FMPM: Add debugfs interface to print record entries RAS/AMD/FMPM: Save SPA values RAS: Export helper to get ras_debugfs_dir RAS/AMD/ATL: Fix bit overflow in denorm_addr_df4_np2() RAS: Introduce a FRU memory poison manager RAS/AMD/ATL: Add MI300 row retirement support Documentation: Move RAS section to admin-guide RAS/AMD/ATL: Add MI300 DRAM to normalized address translation support RAS/AMD/ATL: Fix array overflow in get_logical_coh_st_fabric_id_mi300() RAS/AMD/ATL: Add MI300 support Documentation: RAS: Add index and address translation section EDAC/amd64: Use new AMD Address Translation Library RAS: Introduce AMD Address Translation Library Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
2024-03-08EDAC/versal: Convert to platform remove callback returning voidUwe Kleine-König
The .remove() callback for a platform driver returns an int which makes many driver authors wrongly assume it's possible to do error handling by returning an error code. However the value returned is ignored (apart from emitting a warning) and this typically results in resource leaks. To improve this, there is a quest to make the remove callback return void. In the first step of this quest all drivers are converted to .remove_new(), which already returns void. Eventually after all drivers are converted, .remove_new() will be renamed to .remove(). Trivially convert this driver from always returning zero in the remove callback to the void returning variant. Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Shubhrajyoti Datta <shubhrajyoti.datta@amd.com> Link: https://lore.kernel.org/r/83deca1ce260f7e17ff3cb106c9a6946d4ca4505.1709886922.git.u.kleine-koenig@pengutronix.de
2024-02-15x86/cpu/amd: Provide a separate accessor for Node IDThomas Gleixner
AMD (ab)uses topology_die_id() to store the Node ID information and topology_max_dies_per_pkg to store the number of nodes per package. This collides with the proper processor die level enumeration which is coming on AMD with CPUID 8000_0026, unless there is a correlation between the two. There is zero documentation about that. So provide new storage and new accessors which for now still access die_id and topology_max_die_per_pkg(). Will be mopped up after AMD and HYGON are converted over. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Juergen Gross <jgross@suse.com> Tested-by: Sohil Mehta <sohil.mehta@intel.com> Tested-by: Michael Kelley <mhklinux@outlook.com> Tested-by: Zhang Rui <rui.zhang@intel.com> Tested-by: Wang Wendy <wendy.wang@intel.com> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Link: https://lore.kernel.org/r/20240212153624.956116738@linutronix.de
2024-02-14EDAC/versal: Make the bit position of injected errors configurableShubhrajyoti Datta
Currently, the bit positions to inject correctable and uncorrectable errors are hardcoded. To make that configurable add separate sysfs entries to set the bit positions for injecting CE and UE errors. Allow for single bit error for CE and two bits errors for UE injection. [ bp: Massage. ] Signed-off-by: Shubhrajyoti Datta <shubhrajyoti.datta@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20240208094653.11704-1-shubhrajyoti.datta@amd.com
2024-02-01EDAC/i10nm: Add Intel Grand Ridge micro-server supportQiuxu Zhuo
The Grand Ridge CPU model uses similar memory controller registers with Granite Rapids server. Add Grand Ridge CPU model ID for EDAC support. Tested-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com> Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/r/20240129062040.60809-3-qiuxu.zhuo@intel.com
2024-02-01EDAC/igen6: Add one more Intel Alder Lake-N SoC supportLili Li
Add a new Intel Alder Lake-N SoC compute die ID for EDAC support. Signed-off-by: Lili Li <lili.li@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Reviewed-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Link: https://lore.kernel.org/r/20240129062040.60809-2-qiuxu.zhuo@intel.com
2024-01-24EDAC/amd64: Use new AMD Address Translation LibraryYazen Ghannam
Remove old address translation code and use the new AMD Address Translation Library. Use "imply" in Kconfig so that the "AMD_ATL" config option takes the value of "EDAC_AMD64" as its default. Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20240123041401.79812-3-yazen.ghannam@amd.com
2024-01-23EDAC/synopsys: Convert to devm_platform_ioremap_resource()Yangtao Li
Use devm_platform_ioremap_resource() to simplify code. Signed-off-by: Yangtao Li <frank.li@vivo.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Michal Simek <michal.simek@amd.com> Link: https://lore.kernel.org/r/20230704101811.49637-3-frank.li@vivo.com
2024-01-18Merge tag 'driver-core-6.8-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core updates from Greg KH: "Here are the set of driver core and kernfs changes for 6.8-rc1. Nothing major in here this release cycle, just lots of small cleanups and some tweaks on kernfs that in the very end, got reverted and will come back in a safer way next release cycle. Included in here are: - more driver core 'const' cleanups and fixes - fw_devlink=rpm is now the default behavior - kernfs tiny changes to remove some string functions - cpu handling in the driver core is updated to work better on many systems that add topologies and cpus after booting - other minor changes and cleanups All of the cpu handling patches have been acked by the respective maintainers and are coming in here in one series. Everything has been in linux-next for a while with no reported issues" * tag 'driver-core-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (51 commits) Revert "kernfs: convert kernfs_idr_lock to an irq safe raw spinlock" kernfs: convert kernfs_idr_lock to an irq safe raw spinlock class: fix use-after-free in class_register() PM: clk: make pm_clk_add_notifier() take a const pointer EDAC: constantify the struct bus_type usage kernfs: fix reference to renamed function driver core: device.h: fix Excess kernel-doc description warning driver core: class: fix Excess kernel-doc description warning driver core: mark remaining local bus_type variables as const driver core: container: make container_subsys const driver core: bus: constantify subsys_register() calls driver core: bus: make bus_sort_breadthfirst() take a const pointer kernfs: d_obtain_alias(NULL) will do the right thing... driver core: Better advertise dev_err_probe() kernfs: Convert kernfs_path_from_node_locked() from strlcpy() to strscpy() kernfs: Convert kernfs_name_locked() from strlcpy() to strscpy() kernfs: Convert kernfs_walk_ns() from strlcpy() to strscpy() initramfs: Expose retained initrd as sysfs file fs/kernfs/dir: obey S_ISGID kernel/cgroup: use kernfs_create_dir_ns() ...
2024-01-17Merge tag 'char-misc-6.8-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc Pull char/misc and other driver updates from Greg KH: "Here is the big set of char/misc and other driver subsystem changes for 6.8-rc1. Other than lots of binder driver changes (as you can see by the merge conflicts) included in here are: - lots of iio driver updates and additions - spmi driver updates - eeprom driver updates - firmware driver updates - ocxl driver updates - mhi driver updates - w1 driver updates - nvmem driver updates - coresight driver updates - platform driver remove callback api changes - tags.sh script updates - bus_type constant marking cleanups - lots of other small driver updates All of these have been in linux-next for a while with no reported issues" * tag 'char-misc-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (341 commits) android: removed duplicate linux/errno uio: Fix use-after-free in uio_open drivers: soc: xilinx: add check for platform firmware: xilinx: Export function to use in other module scripts/tags.sh: remove find_sources scripts/tags.sh: use -n to test archinclude scripts/tags.sh: add local annotation scripts/tags.sh: use more portable -path instead of -wholename scripts/tags.sh: Update comment (addition of gtags) firmware: zynqmp: Convert to platform remove callback returning void firmware: turris-mox-rwtm: Convert to platform remove callback returning void firmware: stratix10-svc: Convert to platform remove callback returning void firmware: stratix10-rsu: Convert to platform remove callback returning void firmware: raspberrypi: Convert to platform remove callback returning void firmware: qemu_fw_cfg: Convert to platform remove callback returning void firmware: mtk-adsp-ipc: Convert to platform remove callback returning void firmware: imx-dsp: Convert to platform remove callback returning void firmware: coreboot_table: Convert to platform remove callback returning void firmware: arm_scpi: Convert to platform remove callback returning void firmware: arm_scmi: Convert to platform remove callback returning void ...
2024-01-08Merge tag 'ras_core_for_v6.8' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 RAS updates from Borislav Petkov: - Convert the hw error storm handling into a finer-grained, per-bank solution which allows for more timely detection and reporting of errors - Start a documentation section which will hold down relevant RAS features description and how they should be used - Add new AMD error bank types - Slim down and remove error type descriptions from the kernel side of error decoding to rasdaemon which can be used from now on to decode hw errors on AMD - Mark pages containing uncorrectable errors as poison so that kdump can avoid them and thus not cause another panic - The usual cleanups and fixlets * tag 'ras_core_for_v6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/mce: Handle Intel threshold interrupt storms x86/mce: Add per-bank CMCI storm mitigation x86/mce: Remove old CMCI storm mitigation code Documentation: Begin a RAS section x86/MCE/AMD: Add new MA_LLC, USR_DP, and USR_CP bank types EDAC/mce_amd: Remove SMCA Extended Error code descriptions x86/mce/amd, EDAC/mce_amd: Move long names to decoder module x86/mce/inject: Clear test status value x86/mce: Remove redundant check from mce_device_create() x86/mce: Mark fatal MCE's page as poison to avoid panic in the kdump kernel
2024-01-08Merge tag 'edac_updates_for_v6.8' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras Pull EDAC updates from Borislav Petkov: - The EDAC drivers part of the effort to make the ->remove() platform driver callback return void - Add support for AMD AI accelerators - Add support for a number of Intel SoCs: Alder Lake-N, Raptor Lake-P, Meteor Lake-{P,PS} - Random fixes and cleanups all over the place * tag 'edac_updates_for_v6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras: (39 commits) EDAC/skx_common: Filter out the invalid address EDAC, pnd2: Sort headers alphabetically EDAC, pnd2: Correct misleading error message in mk_region_mask() EDAC, pnd2: Apply bit macros and helpers where it makes sense EDAC, pnd2: Replace custom definition by one from sizes.h EDAC/igen6: Add Intel Meteor Lake-P SoCs support EDAC/igen6: Add Intel Meteor Lake-PS SoCs support EDAC/igen6: Add Intel Raptor Lake-P SoCs support EDAC/igen6: Add Intel Alder Lake-N SoCs support EDAC/igen6: Make get_mchbar() helper function EDAC/amd64: Add support for family 0x19, models 0x90-9f devices EDAC/mc: Add support for HBM3 memory type EDAC/{sb,i7core}_edac: Do not use a plain integer for a NULL pointer EDAC/armada_xp: Explicitly include correct DT includes EDAC/pci_sysfs: Use PCI_HEADER_TYPE_MASK instead of literals EDAC/thunderx: Fix possible out-of-bounds string access EDAC/fsl_ddr: Convert to platform remove callback returning void EDAC/zynqmp: Convert to platform remove callback returning void EDAC/xgene: Convert to platform remove callback returning void EDAC/ti: Convert to platform remove callback returning void ...
2024-01-04drivers: soc: xilinx: add check for platformJay Buddhabhatti
Some error event IDs for Versal and Versal NET are different. Both the platforms should access their respective error event IDs so use sub_family_code to check for platform and check error IDs for respective platforms. The family code is passed via platform data to avoid platform detection again. Platform data is setup when even driver is registered. Signed-off-by: Jay Buddhabhatti <jay.buddhabhatti@amd.com> Link: https://lore.kernel.org/r/20231219055025.27570-3-jay.buddhabhatti@amd.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-01-04EDAC: constantify the struct bus_type usageGreg Kroah-Hartman
In many places in the edac code, struct bus_type pointers are passed around and then eventually sent to the driver core, which can handle a constant pointer. So constantify all of the edac usage of these as well because the data in them is never modified by the edac code either. Cc: Borislav Petkov <bp@alien8.de> Cc: Tony Luck <tony.luck@intel.com> Cc: James Morse <james.morse@arm.com> Cc: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: Robert Richter <rric@kernel.org> Cc: <linux-edac@vger.kernel.org> Link: https://lore.kernel.org/r/2023121909-tribute-punctuate-4b22@gregkh Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-01-02EDAC/skx_common: Filter out the invalid addressQiuxu Zhuo
Decoding an invalid address with certain firmware decoders could cause a #PF (Page Fault) in the EFI runtime context, which could subsequently hang the system. To make {i10nm,skx}_edac more robust against such bogus firmware decoders, filter out invalid addresses before allowing the firmware decoder to process them. Suggested-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/r/20231207014512.78564-1-qiuxu.zhuo@intel.com
2023-12-15EDAC/versal: Read num_csrows and num_chans using the correct bitfield macroShubhrajyoti Datta
Fix the extraction of num_csrows and num_chans. The extraction of the num_rows is wrong. Instead of extracting using the FIELD_GET it is calling FIELD_PREP. The issue was masked as the default design has the rows as 0. Fixes: 6f15b178cd63 ("EDAC/versal: Add a Xilinx Versal memory controller driver") Closes: https://lore.kernel.org/all/60ca157e-6eff-d12c-9dc0-8aeab125edda@linux-m68k.org/ Reported-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Shubhrajyoti Datta <shubhrajyoti.datta@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20231215053352.8740-1-shubhrajyoti.datta@amd.com
2023-12-05EDAC, pnd2: Sort headers alphabeticallyAndy Shevchenko
Sort the headers in alphabetic order in order to ease the maintenance for this part. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Reviewed-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com>
2023-12-05EDAC, pnd2: Correct misleading error message in mk_region_mask()Andy Shevchenko
The mask parameter is expected to be of a sequence of the set bits. It does not mean it must be power of two (only single bit set). Correct misleading error message. Suggested-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com>
2023-12-05EDAC, pnd2: Apply bit macros and helpers where it makes senseAndy Shevchenko
Apply bit macros (BIT()/BIT_ULL()/GENMASK()/etc) and helpers (for_each_set_bit()/etc) where it makes sense. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Reviewed-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com>
2023-12-05EDAC, pnd2: Replace custom definition by one from sizes.hAndy Shevchenko
The sizes.h provides a set of common size definitions, use it. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Reviewed-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com>
2023-12-05EDAC/igen6: Add Intel Meteor Lake-P SoCs supportQiuxu Zhuo
Add Intel Meteor Lake-P SoC compute die IDs for EDAC support. These Meteor Lake-P SoCs share similar IBECC registers with Alder Lake-P SoCs. Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com>