summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2024-03-21Merge tag 'driver-core-6.9-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core updates from Greg KH: "Here is the "big" set of driver core and kernfs changes for 6.9-rc1. Nothing all that crazy here, just some good updates that include: - automatic attribute group hiding from Dan Williams (he fixed up my horrible attempt at doing this.) - kobject lock contention fixes from Eric Dumazet - driver core cleanups from Andy - kernfs rcu work from Tejun - fw_devlink changes to resolve some reported issues - other minor changes, all details in the shortlog All of these have been in linux-next for a long time with no reported issues" * tag 'driver-core-6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (28 commits) device: core: Log warning for devices pending deferred probe on timeout driver: core: Use dev_* instead of pr_* so device metadata is added driver: core: Log probe failure as error and with device metadata of: property: fw_devlink: Add support for "post-init-providers" property driver core: Add FWLINK_FLAG_IGNORE to completely ignore a fwnode link driver core: Adds flags param to fwnode_link_add() debugfs: fix wait/cancellation handling during remove device property: Don't use "proxy" headers device property: Move enum dev_dma_attr to fwnode.h driver core: Move fw_devlink stuff to where it belongs driver core: Drop unneeded 'extern' keyword in fwnode.h firmware_loader: Suppress warning on FW_OPT_NO_WARN flag sysfs:Addresses documentation in sysfs_merge_group and sysfs_unmerge_group. firmware_loader: introduce __free() cleanup hanler platform-msi: Remove usage of the deprecated ida_simple_xx() API sysfs: Introduce DEFINE_SIMPLE_SYSFS_GROUP_VISIBLE() sysfs: Document new "group visible" helpers sysfs: Fix crash on empty group attributes array sysfs: Introduce a mechanism to hide static attribute_groups sysfs: Introduce a mechanism to hide static attribute_groups ...
2024-03-21Merge tag 'tty-6.9-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty Pull tty / serial driver updates from Greg KH: "Here is the big set of TTY/Serial driver updates and cleanups for 6.9-rc1. Included in here are: - more tty cleanups from Jiri - loads of 8250 driver cleanups from Andy - max310x driver updates - samsung serial driver updates - uart_prepare_sysrq_char() updates for many drivers - platform driver remove callback void cleanups - stm32 driver updates - other small tty/serial driver updates All of these have been in linux-next for a long time with no reported issues" * tag 'tty-6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: (199 commits) dt-bindings: serial: stm32: add power-domains property serial: 8250_dw: Replace ACPI device check by a quirk serial: Lock console when calling into driver before registration serial: 8250_uniphier: Switch to use uart_read_port_properties() serial: 8250_tegra: Switch to use uart_read_port_properties() serial: 8250_pxa: Switch to use uart_read_port_properties() serial: 8250_omap: Switch to use uart_read_port_properties() serial: 8250_of: Switch to use uart_read_port_properties() serial: 8250_lpc18xx: Switch to use uart_read_port_properties() serial: 8250_ingenic: Switch to use uart_read_port_properties() serial: 8250_dw: Switch to use uart_read_port_properties() serial: 8250_bcm7271: Switch to use uart_read_port_properties() serial: 8250_bcm2835aux: Switch to use uart_read_port_properties() serial: 8250_aspeed_vuart: Switch to use uart_read_port_properties() serial: port: Introduce a common helper to read properties serial: core: Add UPIO_UNKNOWN constant for unknown port type serial: core: Move struct uart_port::quirks closer to possible values serial: sh-sci: Call sci_serial_{in,out}() directly serial: core: only stop transmit when HW fifo is empty serial: pch: Use uart_prepare_sysrq_char(). ...
2024-03-19Merge tag 'pm-6.9-rc1-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull more power management updates from Rafael Wysocki: "These update the Energy Model to make it prevent errors due to power unit mismatches, fix a typo in power management documentation, convert one driver to using a platform remove callback returning void, address two cpufreq issues (one in the core and one in the DT driver), and enable boost support in the SCMI cpufreq driver. Specifics: - Modify the Energy Model code to bail out and complain if the unit of power is not uW to prevent errors due to unit mismatches (Lukasz Luba) - Make the intel_rapl platform driver use a remove callback returning void (Uwe Kleine-König) - Fix typo in the suspend and interrupts document (Saravana Kannan) - Make per-policy boost flags actually take effect on platforms using cpufreq_boost_set_sw() (Sibi Sankar) - Enable boost support in the SCMI cpufreq driver (Sibi Sankar) - Make the DT cpufreq driver use zalloc_cpumask_var() for allocating cpumasks to avoid using unitinialized memory (Marek Szyprowski)" * tag 'pm-6.9-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: cpufreq: scmi: Enable boost support firmware: arm_scmi: Add support for marking certain frequencies as turbo cpufreq: dt: always allocate zeroed cpumask cpufreq: Fix per-policy boost behavior on SoCs using cpufreq_boost_set_sw() Documentation: power: Fix typo in suspend and interrupts doc PM: EM: Force device drivers to provide power in uW powercap: intel_rapl: Convert to platform remove callback returning void
2024-03-18Merge tag 'trace-v6.9-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing updates from Steven Rostedt: "Main user visible change: - User events can now have "multi formats" The current user events have a single format. If another event is created with a different format, it will fail to be created. That is, once an event name is used, it cannot be used again with a different format. This can cause issues if a library is using an event and updates its format. An application using the older format will prevent an application using the new library from registering its event. A task could also DOS another application if it knows the event names, and it creates events with different formats. The multi-format event is in a different name space from the single format. Both the event name and its format are the unique identifier. This will allow two different applications to use the same user event name but with different payloads. - Added support to have ftrace_dump_on_oops dump out instances and not just the main top level tracing buffer. Other changes: - Add eventfs_root_inode Only the root inode has a dentry that is static (never goes away) and stores it upon creation. There's no reason that the thousands of other eventfs inodes should have a pointer that never gets set in its descriptor. Create a eventfs_root_inode desciptor that has a eventfs_inode descriptor and a dentry pointer, and only the root inode will use this. - Added WARN_ON()s in eventfs There's some conditionals remaining in eventfs that should never be hit, but instead of removing them, add WARN_ON() around them to make sure that they are never hit. - Have saved_cmdlines allocation also include the map_cmdline_to_pid array The saved_cmdlines structure allocates a large amount of data to hold its mappings. Within it, it has three arrays. Two are already apart of it: map_pid_to_cmdline[] and saved_cmdlines[]. More memory can be saved by also including the map_cmdline_to_pid[] array as well. - Restructure __string() and __assign_str() macros used in TRACE_EVENT() Dynamic strings in TRACE_EVENT() are declared with: __string(name, source) And assigned with: __assign_str(name, source) In the tracepoint callback of the event, the __string() is used to get the size needed to allocate on the ring buffer and __assign_str() is used to copy the string into the ring buffer. There's a helper structure that is created in the TRACE_EVENT() macro logic that will hold the string length and its position in the ring buffer which is created by __string(). There are several trace events that have a function to create the string to save. This function is executed twice. Once for __string() and again for __assign_str(). There's no reason for this. The helper structure could also save the string it used in __string() and simply copy that into __assign_str() (it also already has its length). By using the structure to store the source string for the assignment, it means that the second argument to __assign_str() is no longer needed. It will be removed in the next merge window, but for now add a warning if the source string given to __string() is different than the source string given to __assign_str(), as the source to __assign_str() isn't even used and will be going away. - Added checks to make sure that the source of __string() is also the source of __assign_str() so that it can be safely removed in the next merge window. Included fixes that the above check found. - Other minor clean ups and fixes" * tag 'trace-v6.9-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: (34 commits) tracing: Add __string_src() helper to help compilers not to get confused tracing: Use strcmp() in __assign_str() WARN_ON() check tracepoints: Use WARN() and not WARN_ON() for warnings tracing: Use div64_u64() instead of do_div() tracing: Support to dump instance traces by ftrace_dump_on_oops tracing: Remove second parameter to __assign_rel_str() tracing: Add warning if string in __assign_str() does not match __string() tracing: Add __string_len() example tracing: Remove __assign_str_len() ftrace: Fix most kernel-doc warnings tracing: Decrement the snapshot if the snapshot trigger fails to register tracing: Fix snapshot counter going between two tracers that use it tracing: Use EVENT_NULL_STR macro instead of open coding "(null)" tracing: Use ? : shortcut in trace macros tracing: Do not calculate strlen() twice for __string() fields tracing: Rework __assign_str() and __string() to not duplicate getting the string cxl/trace: Properly initialize cxl_poison region name net: hns3: tracing: fix hclgevf trace event strings drm/i915: Add missing ; to __assign_str() macros in tracepoint code NFSD: Fix nfsd_clid_class use of __string_len() macro ...
2024-03-18tracing: Use div64_u64() instead of do_div()Thorsten Blum
Fixes Coccinelle/coccicheck warnings reported by do_div.cocci. Compared to do_div(), div64_u64() does not implicitly cast the divisor and does not unnecessarily calculate the remainder. Link: https://lore.kernel.org/linux-trace-kernel/20240225164507.232942-2-thorsten.blum@toblux.com Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Thorsten Blum <thorsten.blum@toblux.com> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-03-18tracing: Support to dump instance traces by ftrace_dump_on_oopsHuang Yiwei
Currently ftrace only dumps the global trace buffer on an OOPs. For debugging a production usecase, instance trace will be helpful to check specific problems since global trace buffer may be used for other purposes. This patch extend the ftrace_dump_on_oops parameter to dump a specific or multiple trace instances: - ftrace_dump_on_oops=0: as before -- don't dump - ftrace_dump_on_oops[=1]: as before -- dump the global trace buffer on all CPUs - ftrace_dump_on_oops=2 or =orig_cpu: as before -- dump the global trace buffer on CPU that triggered the oops - ftrace_dump_on_oops=<instance_name>: new behavior -- dump the tracing instance matching <instance_name> - ftrace_dump_on_oops[=2/orig_cpu],<instance1_name>[=2/orig_cpu], <instrance2_name>[=2/orig_cpu]: new behavior -- dump the global trace buffer and multiple instance buffer on all CPUs, or only dump on CPU that triggered the oops if =2 or =orig_cpu is given Also, the sysctl node can handle the input accordingly. Link: https://lore.kernel.org/linux-trace-kernel/20240223083126.1817731-1-quic_hyiwei@quicinc.com Cc: Ross Zwisler <zwisler@google.com> Cc: <mhiramat@kernel.org> Cc: <mark.rutland@arm.com> Cc: <mcgrof@kernel.org> Cc: <keescook@chromium.org> Cc: <j.granados@samsung.com> Cc: <mathieu.desnoyers@efficios.com> Cc: <corbet@lwn.net> Signed-off-by: Huang Yiwei <quic_hyiwei@quicinc.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-03-18ftrace: Fix most kernel-doc warningsRandy Dunlap
Reduce the number of kernel-doc warnings from 52 down to 10, i.e., fix 42 kernel-doc warnings by (a) using the Returns: format for function return values or (b) using "@var:" instead of "@var -" for function parameter descriptions. Fix one return values list so that it is formatted correctly when rendered for output. Spell "non-zero" with a hyphen in several places. Link: https://lore.kernel.org/linux-trace-kernel/20240223054833.15471-1-rdunlap@infradead.org Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Mark Rutland <mark.rutland@arm.com> Link: https://lore.kernel.org/oe-kbuild-all/202312180518.X6fRyDSN-lkp@intel.com/ Reported-by: kernel test robot <lkp@intel.com> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-03-18tracing: Decrement the snapshot if the snapshot trigger fails to registerSteven Rostedt (Google)
Running the ftrace selftests caused the ring buffer mapping test to fail. Investigating, I found that the snapshot counter would be incremented every time a snapshot trigger was added, even if that snapshot trigger failed. # cd /sys/kernel/tracing # echo "snapshot" > events/sched/sched_process_fork/trigger # echo "snapshot" > events/sched/sched_process_fork/trigger -bash: echo: write error: File exists That second one that fails increments the snapshot counter but doesn't decrement it. It needs to be decremented when the snapshot fails. Link: https://lore.kernel.org/linux-trace-kernel/20240223013344.729055907@goodmis.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Vincent Donnefort <vdonnefort@google.com> Fixes: 16f7e48ffc53a ("tracing: Add snapshot refcount") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-03-18tracing: Fix snapshot counter going between two tracers that use itSteven Rostedt (Google)
Running the ftrace selftests caused the ring buffer mapping test to fail. Investigating, I found that the snapshot counter would be incremented every time a tracer that uses the snapshot is enabled even if the snapshot was used by the previous tracer. That is: # cd /sys/kernel/tracing # echo wakeup_rt > current_tracer # echo wakeup_dl > current_tracer # echo nop > current_tracer would leave the snapshot counter at 1 and not zero. That's because the enabling of wakeup_dl would increment the counter again but the setting the tracer to nop would only decrement it once. Do not arm the snapshot for a tracer if the previous tracer already had it armed. Link: https://lore.kernel.org/linux-trace-kernel/20240223013344.570525723@goodmis.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Vincent Donnefort <vdonnefort@google.com> Fixes: 16f7e48ffc53a ("tracing: Add snapshot refcount") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-03-18tracing: Use init_utsname()->releaseJohn Garry
Instead of using UTS_RELEASE, use init_utsname()->release, which means that we don't need to rebuild the code just for the git head commit changing. Link: https://lore.kernel.org/linux-trace-kernel/20240222124639.65629-1-john.g.garry@oracle.com Signed-off-by: John Garry <john.g.garry@oracle.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-03-18tracing/user_events: Introduce multi-format eventsBeau Belgrave
Currently user_events supports 1 event with the same name and must have the exact same format when referenced by multiple programs. This opens an opportunity for malicious or poorly thought through programs to create events that others use with different formats. Another scenario is user programs wishing to use the same event name but add more fields later when the software updates. Various versions of a program may be running side-by-side, which is prevented by the current single format requirement. Add a new register flag (USER_EVENT_REG_MULTI_FORMAT) which indicates the user program wishes to use the same user_event name, but may have several different formats of the event. When this flag is used, create the underlying tracepoint backing the user_event with a unique name per-version of the format. It's important that existing ABI users do not get this logic automatically, even if one of the multi format events matches the format. This ensures existing programs that create events and assume the tracepoint name will match exactly continue to work as expected. Add logic to only check multi-format events with other multi-format events and single-format events to only check single-format events during find. Change system name of the multi-format event tracepoint to ensure that multi-format events are isolated completely from single-format events. This prevents single-format names from conflicting with multi-format events if they end with the same suffix as the multi-format events. Add a register_name (reg_name) to the user_event struct which allows for split naming of events. We now have the name that was used to register within user_events as well as the unique name for the tracepoint. Upon registering events ensure matches based on first the reg_name, followed by the fields and format of the event. This allows for multiple events with the same registered name to have different formats. The underlying tracepoint will have a unique name in the format of {reg_name}.{unique_id}. For example, if both "test u32 value" and "test u64 value" are used with the USER_EVENT_REG_MULTI_FORMAT the system would have 2 unique tracepoints. The dynamic_events file would then show the following: u:test u64 count u:test u32 count The actual tracepoint names look like this: test.0 test.1 Both would be under the new user_events_multi system name to prevent the older ABI from being used to squat on multi-formatted events and block their use. Deleting events via "!u:test u64 count" would only delete the first tracepoint that matched that format. When the delete ABI is used all events with the same name will be attempted to be deleted. If per-version deletion is required, user programs should either not use persistent events or delete them via dynamic_events. Link: https://lore.kernel.org/linux-trace-kernel/20240222001807.1463-3-beaub@linux.microsoft.com Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-03-18tracing/user_events: Prepare find/delete for same name eventsBeau Belgrave
The current code for finding and deleting events assumes that there will never be cases when user_events are registered with the same name, but different formats. Scenarios exist where programs want to use the same name but have different formats. An example is multiple versions of a program running side-by-side using the same event name, but with updated formats in each version. This change does not yet allow for multi-format events. If user_events are registered with the same name but different arguments the programs see the same return values as before. This change simply makes it possible to easily accommodate for this. Update find_user_event() to take in argument parameters and register flags to accommodate future multi-format event scenarios. Have find validate argument matching and return error pointers to cover when an existing event has the same name but different format. Update callers to handle error pointer logic. Move delete_user_event() to use hash walking directly now that find_user_event() has changed. Delete all events found that match the register name, stop if an error occurs and report back to the user. Update user_fields_match() to cover list_empty() scenarios now that find_user_event() uses it directly. This makes the logic consistent across several callsites. Link: https://lore.kernel.org/linux-trace-kernel/20240222001807.1463-2-beaub@linux.microsoft.com Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-03-18tracing: Add snapshot refcountVincent Donnefort
When a ring-buffer is memory mapped by user-space, no trace or ring-buffer swap is possible. This means the snapshot feature is mutually exclusive with the memory mapping. Having a refcount on snapshot users will help to know if a mapping is possible or not. Instead of relying on the global trace_types_lock, a new spinlock is introduced to serialize accesses to trace_array->snapshot. This intends to allow access to that variable in a context where the mmap lock is already held. Link: https://lore.kernel.org/linux-trace-kernel/20240220202310.2489614-4-vdonnefort@google.com Signed-off-by: Vincent Donnefort <vdonnefort@google.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-03-18ring-buffer: Make wake once of ring_buffer_wait() more robustSteven Rostedt (Google)
The default behavior of ring_buffer_wait() when passed a NULL "cond" parameter is to exit the function the first time it is woken up. The current implementation uses a counter that starts at zero and when it is greater than one it exits the wait_event_interruptible(). But this relies on the internal working of wait_event_interruptible() as that code basically has: if (cond) return; prepare_to_wait(); if (!cond) schedule(); finish_wait(); That is, cond is called twice before it sleeps. The default cond of ring_buffer_wait() needs to account for that and wait for its counter to increment twice before exiting. Instead, use the seq/atomic_inc logic that is used by the tracing code that calls this function. Add an atomic_t seq to rb_irq_work and when cond is NULL, have the default callback take a descriptor as its data that holds the rbwork and the value of the seq when it started. The wakeups will now increment the rbwork->seq and the cond callback will simply check if that number is different, and no longer have to rely on the implementation of wait_event_interruptible(). Link: https://lore.kernel.org/linux-trace-kernel/20240315063115.6cb5d205@gandalf.local.home Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Fixes: 7af9ded0c2ca ("ring-buffer: Use wait_event_interruptible() in ring_buffer_wait()") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-03-17Merge tag 'timers-urgent-2024-03-17' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer fix from Ingo Molnar: "Fix timer migration bug that can result in long bootup delays and other oddities" * tag 'timers-urgent-2024-03-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: timer/migration: Remove buggy early return on deactivation
2024-03-17ring-buffer: use READ_ONCE() to read cpu_buffer->commit_page in concurrent ↵linke li
environment In function ring_buffer_iter_empty(), cpu_buffer->commit_page is read while other threads may change it. It may cause the time_stamp that read in the next line come from a different page. Use READ_ONCE() to avoid having to reason about compiler optimizations now and in future. Link: https://lore.kernel.org/linux-trace-kernel/tencent_DFF7D3561A0686B5E8FC079150A02505180A@qq.com Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: linke li <lilinke99@qq.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-03-17ring-buffer: Zero ring-buffer sub-buffersVincent Donnefort
In preparation for the ring-buffer memory mapping where each subbuf will be accessible to user-space, zero all the page allocations. Link: https://lore.kernel.org/linux-trace-kernel/20240220202310.2489614-2-vdonnefort@google.com Signed-off-by: Vincent Donnefort <vdonnefort@google.com> Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-03-17tracing: Move saved_cmdline code into trace_sched_switch.cSteven Rostedt (Google)
The code that handles saved_cmdlines is split between the trace.c file and the trace_sched_switch.c. There's some history to this. The trace_sched_switch.c was originally created to handle the sched_switch tracer that was deprecated due to sched_switch trace event making it obsolete. But that file did not get deleted as it had some code to help with saved_cmdlines. But trace.c has grown tremendously since then. Just move all the saved_cmdlines code into trace_sched_switch.c as that's the only reason that file still exists, and trace.c has gotten too big. No functional changes. Link: https://lore.kernel.org/linux-trace-kernel/20240220140703.497966629@goodmis.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Vincent Donnefort <vdonnefort@google.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Mete Durlu <meted@linux.ibm.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-03-17tracing: Move open coded processing of tgid_map into helper functionSteven Rostedt (Google)
In preparation of moving the saved_cmdlines logic out of trace.c and into trace_sched_switch.c, replace the open coded manipulation of tgid_map in set_tracer_flag() into a helper function trace_alloc_tgid_map() so that it can be easily moved into trace_sched_switch.c without changing existing functions in trace.c. No functional changes. Link: https://lore.kernel.org/linux-trace-kernel/20240220140703.338116216@goodmis.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Vincent Donnefort <vdonnefort@google.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Mete Durlu <meted@linux.ibm.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-03-17tracing: Have saved_cmdlines arrays all in one allocationSteven Rostedt (Google)
The saved_cmdlines have three arrays for mapping PIDs to COMMs: - map_pid_to_cmdline[] - map_cmdline_to_pid[] - saved_cmdlines The map_pid_to_cmdline[] is PID_MAX_DEFAULT in size and holds the index into the other arrays. The map_cmdline_to_pid[] is a mapping back to the full pid as it can be larger than PID_MAX_DEFAULT. And the saved_cmdlines[] just holds the COMMs associated to the pids. Currently the map_pid_to_cmdline[] and saved_cmdlines[] are allocated together (in reality the saved_cmdlines is just in the memory of the rounding of the allocation of the structure as it is always allocated in powers of two). The map_cmdline_to_pid[] array is allocated separately. Since the rounding to a power of two is rather large (it allows for 8000 elements in saved_cmdlines), also include the map_cmdline_to_pid[] array. (This drops it to 6000 by default, which is still plenty for most use cases). This saves even more memory as the map_cmdline_to_pid[] array doesn't need to be allocated. Link: https://lore.kernel.org/linux-trace-kernel/20240212174011.068211d9@gandalf.local.home/ Link: https://lore.kernel.org/linux-trace-kernel/20240220140703.182330529@goodmis.org Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Vincent Donnefort <vdonnefort@google.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Mete Durlu <meted@linux.ibm.com> Fixes: 44dc5c41b5b1 ("tracing: Fix wasted memory in saved_cmdlines logic") Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-03-16timer/migration: Remove buggy early return on deactivationFrederic Weisbecker
When a CPU enters into idle and deactivates itself from the timer migration hierarchy without any global timer of its own to propagate, the group event of that CPU is set to "ignore" and tmigr_update_events() accordingly performs an early return without considering timers queued by other CPUs. If the hierarchy has a single level, and the CPU is the last one to enter idle, it will ignore others' global timers, as in the following layout: [GRP0:0] migrator = 0 active = 0 nextevt = T0i / \ 0 1 active (T0i) idle (T1) 0) CPU 0 is active thus its event is ignored (the letter 'i') and so are upper levels' events. CPU 1 is idle and has the timer T1 enqueued. [GRP0:0] migrator = NONE active = NONE nextevt = T0i / \ 0 1 idle (T0i) idle (T1) 1) CPU 0 goes idle without global event queued. Therefore KTIME_MAX is pushed as its next expiry and its own event kept as "ignore". As a result tmigr_update_events() ignores T1 and CPU 0 goes to idle with T1 unhandled. This isn't proper to single level hierarchy though. A similar issue, although slightly different, may arise on multi-level: [GRP1:0] migrator = GRP0:0 active = GRP0:0 nextevt = T0:0i, T0:1 / \ [GRP0:0] [GRP0:1] migrator = 0 migrator = NONE active = 0 active = NONE nextevt = T0i nextevt = T2 / \ / \ 0 (T0i) 1 (T1) 2 (T2) 3 active idle idle idle 0) CPU 0 is active thus its event is ignored (the letter 'i') and so are upper levels' events. CPU 1 is idle and has the timer T1 enqueued. CPU 2 also has a timer. The expiry order is T0 (ignored) < T1 < T2 [GRP1:0] migrator = GRP0:0 active = GRP0:0 nextevt = T0:0i, T0:1 / \ [GRP0:0] [GRP0:1] migrator = NONE migrator = NONE active = NONE active = NONE nextevt = T0i nextevt = T2 / \ / \ 0 (T0i) 1 (T1) 2 (T2) 3 idle idle idle idle 1) CPU 0 goes idle without global event queued. Therefore KTIME_MAX is pushed as its next expiry and its own event kept as "ignore". As a result tmigr_update_events() ignores T1. The change only propagated up to 1st level so far. [GRP1:0] migrator = NONE active = NONE nextevt = T0:1 / \ [GRP0:0] [GRP0:1] migrator = NONE migrator = NONE active = NONE active = NONE nextevt = T0i nextevt = T2 / \ / \ 0 (T0i) 1 (T1) 2 (T2) 3 idle idle idle idle 2) The change now propagates up to the top. tmigr_update_events() finds that the child event is ignored and thus removes it. The top level next event is now T2 which is returned to CPU 0 as its next effective expiry to take account for as the global idle migrator. However T1 has been ignored along the way, leaving it unhandled. Fix those issues with removing the buggy related early return. Ignored child events must not prevent from evaluating the other events within the same group. Reported-by: Boqun Feng <boqun.feng@gmail.com> Reported-by: Florian Fainelli <f.fainelli@gmail.com> Reported-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Florian Fainelli <florian.fainelli@broadcom.com> Link: https://lore.kernel.org/r/ZfOhB9ZByTZcBy4u@lothringen
2024-03-15Merge tag 'bcachefs-2024-03-13' of https://evilpiepirate.org/git/bcachefsLinus Torvalds
Pull bcachefs updates from Kent Overstreet: - Subvolume children btree; this is needed for providing a userspace interface for walking subvolumes, which will come later - Lots of improvements to directory structure checking - Improved journal pipelining, significantly improving performance on high iodepth write workloads - Discard path improvements: the discard path is more efficient, and no longer flushes the journal unnecessarily - Buffered write path can now avoid taking the inode lock - new mm helper: memalloc_flags_{save|restore} - mempool now does kvmalloc mempools * tag 'bcachefs-2024-03-13' of https://evilpiepirate.org/git/bcachefs: (128 commits) bcachefs: time_stats: shrink time_stat_buffer for better alignment bcachefs: time_stats: split stats-with-quantiles into a separate structure bcachefs: mean_and_variance: put struct mean_and_variance_weighted on a diet bcachefs: time_stats: add larger units bcachefs: pull out time_stats.[ch] bcachefs: reconstruct_alloc cleanup bcachefs: fix bch_folio_sector padding bcachefs: Fix btree key cache coherency during replay bcachefs: Always flush write buffer in delete_dead_inodes() bcachefs: Fix order of gc_done passes bcachefs: fix deletion of indirect extents in btree_gc bcachefs: Prefer struct_size over open coded arithmetic bcachefs: Kill unused flags argument to btree_split() bcachefs: Check for writing superblocks with nonsense member seq fields bcachefs: fix bch2_journal_buf_to_text() lib/generic-radix-tree.c: Make nodes more reasonably sized bcachefs: copy_(to|from)_user_errcode() bcachefs: Split out bkey_types.h bcachefs: fix lost journal buf wakeup due to improved pipelining bcachefs: intercept mountoption value for bool type ...
2024-03-14Merge tag 'mm-nonmm-stable-2024-03-14-09-36' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull non-MM updates from Andrew Morton: - Kuan-Wei Chiu has developed the well-named series "lib min_heap: Min heap optimizations". - Kuan-Wei Chiu has also sped up the library sorting code in the series "lib/sort: Optimize the number of swaps and comparisons". - Alexey Gladkov has added the ability for code running within an IPC namespace to alter its IPC and MQ limits. The series is "Allow to change ipc/mq sysctls inside ipc namespace". - Geert Uytterhoeven has contributed some dhrystone maintenance work in the series "lib: dhry: miscellaneous cleanups". - Ryusuke Konishi continues nilfs2 maintenance work in the series "nilfs2: eliminate kmap and kmap_atomic calls" "nilfs2: fix kernel bug at submit_bh_wbc()" - Nathan Chancellor has updated our build tools requirements in the series "Bump the minimum supported version of LLVM to 13.0.1". - Muhammad Usama Anjum continues with the selftests maintenance work in the series "selftests/mm: Improve run_vmtests.sh". - Oleg Nesterov has done some maintenance work against the signal code in the series "get_signal: minor cleanups and fix". Plus the usual shower of singleton patches in various parts of the tree. Please see the individual changelogs for details. * tag 'mm-nonmm-stable-2024-03-14-09-36' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (77 commits) nilfs2: prevent kernel bug at submit_bh_wbc() nilfs2: fix failure to detect DAT corruption in btree and direct mappings ocfs2: enable ocfs2_listxattr for special files ocfs2: remove SLAB_MEM_SPREAD flag usage assoc_array: fix the return value in assoc_array_insert_mid_shortcut() buildid: use kmap_local_page() watchdog/core: remove sysctl handlers from public header nilfs2: use div64_ul() instead of do_div() mul_u64_u64_div_u64: increase precision by conditionally swapping a and b kexec: copy only happens before uchunk goes to zero get_signal: don't initialize ksig->info if SIGNAL_GROUP_EXIT/group_exec_task get_signal: hide_si_addr_tag_bits: fix the usage of uninitialized ksig get_signal: don't abuse ksig->info.si_signo and ksig->sig const_structs.checkpatch: add device_type Normalise "name (ad@dr)" MODULE_AUTHORs to "name <ad@dr>" dyndbg: replace kstrdup() + strchr() with kstrdup_and_replace() list: leverage list_is_head() for list_entry_is_head() nilfs2: MAINTAINERS: drop unreachable project mirror site smp: make __smp_processor_id() 0-argument macro fat: fix uninitialized field in nostale filehandles ...
2024-03-14Merge tag 'mm-stable-2024-03-13-20-04' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - Sumanth Korikkar has taught s390 to allocate hotplug-time page frames from hotplugged memory rather than only from main memory. Series "implement "memmap on memory" feature on s390". - More folio conversions from Matthew Wilcox in the series "Convert memcontrol charge moving to use folios" "mm: convert mm counter to take a folio" - Chengming Zhou has optimized zswap's rbtree locking, providing significant reductions in system time and modest but measurable reductions in overall runtimes. The series is "mm/zswap: optimize the scalability of zswap rb-tree". - Chengming Zhou has also provided the series "mm/zswap: optimize zswap lru list" which provides measurable runtime benefits in some swap-intensive situations. - And Chengming Zhou further optimizes zswap in the series "mm/zswap: optimize for dynamic zswap_pools". Measured improvements are modest. - zswap cleanups and simplifications from Yosry Ahmed in the series "mm: zswap: simplify zswap_swapoff()". - In the series "Add DAX ABI for memmap_on_memory", Vishal Verma has contributed several DAX cleanups as well as adding a sysfs tunable to control the memmap_on_memory setting when the dax device is hotplugged as system memory. - Johannes Weiner has added the large series "mm: zswap: cleanups", which does that. - More DAMON work from SeongJae Park in the series "mm/damon: make DAMON debugfs interface deprecation unignorable" "selftests/damon: add more tests for core functionalities and corner cases" "Docs/mm/damon: misc readability improvements" "mm/damon: let DAMOS feeds and tame/auto-tune itself" - In the series "mm/mempolicy: weighted interleave mempolicy and sysfs extension" Rakie Kim has developed a new mempolicy interleaving policy wherein we allocate memory across nodes in a weighted fashion rather than uniformly. This is beneficial in heterogeneous memory environments appearing with CXL. - Christophe Leroy has contributed some cleanup and consolidation work against the ARM pagetable dumping code in the series "mm: ptdump: Refactor CONFIG_DEBUG_WX and check_wx_pages debugfs attribute". - Luis Chamberlain has added some additional xarray selftesting in the series "test_xarray: advanced API multi-index tests". - Muhammad Usama Anjum has reworked the selftest code to make its human-readable output conform to the TAP ("Test Anything Protocol") format. Amongst other things, this opens up the use of third-party tools to parse and process out selftesting results. - Ryan Roberts has added fork()-time PTE batching of THP ptes in the series "mm/memory: optimize fork() with PTE-mapped THP". Mainly targeted at arm64, this significantly speeds up fork() when the process has a large number of pte-mapped folios. - David Hildenbrand also gets in on the THP pte batching game in his series "mm/memory: optimize unmap/zap with PTE-mapped THP". It implements batching during munmap() and other pte teardown situations. The microbenchmark improvements are nice. - And in the series "Transparent Contiguous PTEs for User Mappings" Ryan Roberts further utilizes arm's pte's contiguous bit ("contpte mappings"). Kernel build times on arm64 improved nicely. Ryan's series "Address some contpte nits" provides some followup work. - In the series "mm/hugetlb: Restore the reservation" Breno Leitao has fixed an obscure hugetlb race which was causing unnecessary page faults. He has also added a reproducer under the selftest code. - In the series "selftests/mm: Output cleanups for the compaction test", Mark Brown did what the title claims. - Kinsey Ho has added the series "mm/mglru: code cleanup and refactoring". - Even more zswap material from Nhat Pham. The series "fix and extend zswap kselftests" does as claimed. - In the series "Introduce cpu_dcache_is_aliasing() to fix DAX regression" Mathieu Desnoyers has cleaned up and fixed rather a mess in our handling of DAX on archiecctures which have virtually aliasing data caches. The arm architecture is the main beneficiary. - Lokesh Gidra's series "per-vma locks in userfaultfd" provides dramatic improvements in worst-case mmap_lock hold times during certain userfaultfd operations. - Some page_owner enhancements and maintenance work from Oscar Salvador in his series "page_owner: print stacks and their outstanding allocations" "page_owner: Fixup and cleanup" - Uladzislau Rezki has contributed some vmalloc scalability improvements in his series "Mitigate a vmap lock contention". It realizes a 12x improvement for a certain microbenchmark. - Some kexec/crash cleanup work from Baoquan He in the series "Split crash out from kexec and clean up related config items". - Some zsmalloc maintenance work from Chengming Zhou in the series "mm/zsmalloc: fix and optimize objects/page migration" "mm/zsmalloc: some cleanup for get/set_zspage_mapping()" - Zi Yan has taught the MM to perform compaction on folios larger than order=0. This a step along the path to implementaton of the merging of large anonymous folios. The series is named "Enable >0 order folio memory compaction". - Christoph Hellwig has done quite a lot of cleanup work in the pagecache writeback code in his series "convert write_cache_pages() to an iterator". - Some modest hugetlb cleanups and speedups in Vishal Moola's series "Handle hugetlb faults under the VMA lock". - Zi Yan has changed the page splitting code so we can split huge pages into sizes other than order-0 to better utilize large folios. The series is named "Split a folio to any lower order folios". - David Hildenbrand has contributed the series "mm: remove total_mapcount()", a cleanup. - Matthew Wilcox has sought to improve the performance of bulk memory freeing in his series "Rearrange batched folio freeing". - Gang Li's series "hugetlb: parallelize hugetlb page init on boot" provides large improvements in bootup times on large machines which are configured to use large numbers of hugetlb pages. - Matthew Wilcox's series "PageFlags cleanups" does that. - Qi Zheng's series "minor fixes and supplement for ptdesc" does that also. S390 is affected. - Cleanups to our pagemap utility functions from Peter Xu in his series "mm/treewide: Replace pXd_large() with pXd_leaf()". - Nico Pache has fixed a few things with our hugepage selftests in his series "selftests/mm: Improve Hugepage Test Handling in MM Selftests". - Also, of course, many singleton patches to many things. Please see the individual changelogs for details. * tag 'mm-stable-2024-03-13-20-04' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (435 commits) mm/zswap: remove the memcpy if acomp is not sleepable crypto: introduce: acomp_is_async to expose if comp drivers might sleep memtest: use {READ,WRITE}_ONCE in memory scanning mm: prohibit the last subpage from reusing the entire large folio mm: recover pud_leaf() definitions in nopmd case selftests/mm: skip the hugetlb-madvise tests on unmet hugepage requirements selftests/mm: skip uffd hugetlb tests with insufficient hugepages selftests/mm: dont fail testsuite due to a lack of hugepages mm/huge_memory: skip invalid debugfs new_order input for folio split mm/huge_memory: check new folio order when split a folio mm, vmscan: retry kswapd's priority loop with cache_trim_mode off on failure mm: add an explicit smp_wmb() to UFFDIO_CONTINUE mm: fix list corruption in put_pages_list mm: remove folio from deferred split list before uncharging it filemap: avoid unnecessary major faults in filemap_fault() mm,page_owner: drop unnecessary check mm,page_owner: check for null stack_record before bumping its refcount mm: swap: fix race between free_swap_and_cache() and swapoff() mm/treewide: align up pXd_leaf() retval across archs mm/treewide: drop pXd_large() ...
2024-03-14Merge tag 'trace-ring-buffer-v6.8-rc7-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing updates from Steven Rostedt: - Do not update shortest_full in rb_watermark_hit() if the watermark is hit. The shortest_full field was being updated regardless if the task was going to wait or not. If the watermark is hit, then the task is not going to wait, so do not update the shortest_full field (used by the waker). - Update shortest_full field before setting the full_waiters_pending flag In the poll logic, the full_waiters_pending flag was being set before the shortest_full field was set. If the full_waiters_pending flag is set, writers will check the shortest_full field which has the least percentage of data that the ring buffer needs to be filled before waking up. The writer will check shortest_full if full_waiters_pending is set, and if the ring buffer percentage filled is greater than shortest full, then it will call the irq_work to wake up the waiters. The problem was that the poll logic set the full_waiters_pending flag before updating shortest_full, which when zero will always trigger the writer to call the irq_work to wake up the waiters. The irq_work will reset the shortest_full field back to zero as the woken waiters is suppose to reset it. - There's some optimized logic in the rb_watermark_hit() that is used in ring_buffer_wait(). Use that helper function in the poll logic as well. - Restructure ring_buffer_wait() to use wait_event_interruptible() The logic to wake up pending readers when the file descriptor is closed is racy. Restructure ring_buffer_wait() to allow callers to pass in conditions besides the ring buffer having enough data in it by using wait_event_interruptible(). - Update the tracing_wait_on_pipe() to call ring_buffer_wait() with its own conditions to exit the wait loop. * tag 'trace-ring-buffer-v6.8-rc7-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tracing/ring-buffer: Fix wait_on_pipe() race ring-buffer: Use wait_event_interruptible() in ring_buffer_wait() ring-buffer: Reuse rb_watermark_hit() for the poll logic ring-buffer: Fix full_waiters_pending in poll ring-buffer: Do not set shortest_full when full target is hit
2024-03-14Merge tag 'probes-v6.9' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull probes updates from Masami Hiramatsu: "x86 kprobes: - Use boolean for some function return instead of 0 and 1 - Prohibit probing on INT/UD. This prevents user to put kprobe on INTn/INT1/INT3/INTO and UD0/UD1/UD2 because these are used for a special purpose in the kernel - Boost Grp instructions. Because a few percent of kernel instructions are Grp 2/3/4/5 and those are safe to be executed without ip register fixup, allow those to be boosted (direct execution on the trampoline buffer with a JMP) tracing: - Add function argument access from return events (kretprobe and fprobe). This allows user to compare how a data structure field is changed after executing a function. With BTF, return event also accepts function argument access by name. - Fix a wrong comment (using "Kretprobe" in fprobe) - Cleanup a big probe argument parser function into three parts, type parser, post-processing function, and main parser - Cleanup to set nr_args field when initializing trace_probe instead of counting up it while parsing - Cleanup a redundant #else block from tracefs/README source code - Update selftests to check entry argument access from return probes - Documentation update about entry argument access from return probes" * tag 'probes-v6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: Documentation: tracing: Add entry argument access at function exit selftests/ftrace: Add test cases for entry args at function exit tracing/probes: Support $argN in return probe (kprobe and fprobe) tracing: Remove redundant #else block for BTF args from README tracing/probes: cleanup: Set trace_probe::nr_args at trace_probe_init tracing/probes: Cleanup probe argument parser tracing/fprobe-event: cleanup: Fix a wrong comment in fprobe event x86/kprobes: Boost more instructions from grp2/3/4/5 x86/kprobes: Prohibit kprobing on INT and UD x86/kprobes: Refactor can_{probe,boost} return type to bool
2024-03-13kernel/hung_task.c: export sysctl_hung_task_timeout_secsKent Overstreet
needed for thread_with_file; also rare but not unheard of to need this in module code, when blocking on user input. one workaround used by some code is wait_event_interruptible() - but that can be buggy if the outer context isn't expecting unwinding. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: fuyuanli <fuyuanli@didiglobal.com>
2024-03-13pidfs: remove config optionChristian Brauner
As Linus suggested this enables pidfs unconditionally. A key property to retain is the ability to compare pidfds by inode number (cf. [1]). That's extremely helpful just as comparing namespace file descriptors by inode number is. They are used in a variety of scenarios where they need to be compared, e.g., when receiving a pidfd via SO_PEERPIDFD from a socket to trivially authenticate a the sender and various other use-cases. For 64bit systems this is pretty trivial to do. For 32bit it's slightly more annoying as we discussed but we simply add a dumb ida based allocator that gets used on 32bit. This gives the same guarantees about inode numbers on 64bit without any overflow risk. Practically, we'll never run into overflow issues because we're constrained by the number of processes that can exist on 32bit and by the number of open files that can exist on a 32bit system. On 64bit none of this matters and things are very simple. If 32bit also needs the uniqueness guarantee they can simply parse the contents of /proc/<pid>/fd/<nr>. The uniqueness guarantees have a variety of use-cases. One of the most obvious ones is that they will make pidfiles (or "pidfdfiles", I guess) reliable as the unique identifier can be placed into there that won't be reycled. Also a frequent request. Note, I took the chance and simplified path_from_stashed() even further. Instead of passing the inode number explicitly to path_from_stashed() we let the filesystem handle that internally. So path_from_stashed() ends up even simpler than it is now. This is also a good solution allowing the cleanup code to be clean and consistent between 32bit and 64bit. The cleanup path in prepare_anon_dentry() is also switched around so we put the inode before the dentry allocation. This means we only have to call the cleanup handler for the filesystem's inode data once and can rely ->evict_inode() otherwise. Aside from having to have a bit of extra code for 32bit it actually ends up a nice cleanup for path_from_stashed() imho. Tested on both 32 and 64bit including error injection. Link: https://github.com/systemd/systemd/pull/31713 [1] Link: https://lore.kernel.org/r/20240312-dingo-sehnlich-b3ecc35c6de7@brauner Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-03-13PM: EM: Force device drivers to provide power in uWLukasz Luba
The EM only supports power in uW. Make sure that it is not possible to register some downstream driver which doesn't provide power in uW. The only exception is artificial EM, but that EM is ignored by the rest of kernel frameworks (thermal, powercap, etc). Reported-by: PoShao Chen <poshao.chen@mediatek.com> Signed-off-by: Lukasz Luba <lukasz.luba@arm.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2024-03-13Merge tag 'modules-6.9-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux Pull modules updates from Luis Chamberlain: "Christophe Leroy did most of the work on this release, first with a few cleanups on CONFIG_STRICT_KERNEL_RWX and ending with error handling for when set_memory_XX() can fail. This is part of a larger effort to clean up all these callers which can fail, modules is just part of it" * tag 'modules-6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux: module: Don't ignore errors from set_memory_XX() lib/test_kmod: fix kernel-doc warnings powerpc: Simplify strict_kernel_rwx_enabled() modules: Remove #ifdef CONFIG_STRICT_MODULE_RWX around rodata_enabled init: Declare rodata_enabled and mark_rodata_ro() at all time module: Change module_enable_{nx/x/ro}() to more explicit names module: Use set_memory_rox()
2024-03-13Merge tag 'pm-6.9-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull power management updates from Rafael Wysocki: "From the functional perspective, the most significant change here is the addition of support for Energy Models that can be updated dynamically at run time. There is also the addition of LZ4 compression support for hibernation, the new preferred core support in amd-pstate, new platforms support in the Intel RAPL driver, new model-specific EPP handling in intel_pstate and more. Apart from that, the cpufreq default transition delay is reduced from 10 ms to 2 ms (along with some related adjustments), the system suspend statistics code undergoes a significant rework and there is a usual bunch of fixes and code cleanups all over. Specifics: - Allow the Energy Model to be updated dynamically (Lukasz Luba) - Add support for LZ4 compression algorithm to the hibernation image creation and loading code (Nikhil V) - Fix and clean up system suspend statistics collection (Rafael Wysocki) - Simplify device suspend and resume handling in the power management core code (Rafael Wysocki) - Fix PCI hibernation support description (Yiwei Lin) - Make hibernation take set_memory_ro() return values into account as appropriate (Christophe Leroy) - Set mem_sleep_current during kernel command line setup to avoid an ordering issue with handling it (Maulik Shah) - Fix wake IRQs handling when pm_runtime_force_suspend() is used as a driver's system suspend callback (Qingliang Li) - Simplify pm_runtime_get_if_active() usage and add a replacement for pm_runtime_put_autosuspend() (Sakari Ailus) - Add a tracepoint for runtime_status changes tracking (Vilas Bhat) - Fix section title markdown in the runtime PM documentation (Yiwei Lin) - Enable preferred core support in the amd-pstate cpufreq driver (Meng Li) - Fix min_perf assignment in amd_pstate_adjust_perf() and make the min/max limit perf values in amd-pstate always stay within the (highest perf, lowest perf) range (Tor Vic, Meng Li) - Allow intel_pstate to assign model-specific values to strings used in the EPP sysfs interface and make it do so on Meteor Lake (Srinivas Pandruvada) - Drop long-unused cpudata::prev_cummulative_iowait from the intel_pstate cpufreq driver (Jiri Slaby) - Prevent scaling_cur_freq from exceeding scaling_max_freq when the latter is an inefficient frequency (Shivnandan Kumar) - Change default transition delay in cpufreq to 2ms (Qais Yousef) - Remove references to 10ms minimum sampling rate from comments in the cpufreq code (Pierre Gondois) - Honour transition_latency over transition_delay_us in cpufreq (Qais Yousef) - Stop unregistering cpufreq cooling on CPU hot-remove (Viresh Kumar) - General enhancements / cleanups to ARM cpufreq drivers (tianyu2, Nícolas F. R. A. Prado, Erick Archer, Arnd Bergmann, Anastasia Belova) - Update cpufreq-dt-platdev to block/approve devices (Richard Acayan) - Make the SCMI cpufreq driver get a transition delay value from firmware (Pierre Gondois) - Prevent the haltpoll cpuidle governor from shrinking guest poll_limit_ns below grow_start (Parshuram Sangle) - Avoid potential overflow in integer multiplication when computing cpuidle state parameters (C Cheng) - Adjust MWAIT hint target C-state computation in the ACPI cpuidle driver and in intel_idle to return a correct value for C0 (He Rongguang) - Address multiple issues in the TPMI RAPL driver and add support for new platforms (Lunar Lake-M, Arrow Lake) to Intel RAPL (Zhang Rui) - Fix freq_qos_add_request() return value check in dtpm_cpu (Daniel Lezcano) - Fix kernel-doc for dtpm_create_hierarchy() (Yang Li) - Fix file leak in get_pkg_num() in x86_energy_perf_policy (Samasth Norway Ananda) - Fix cpupower-frequency-info.1 man page typo (Jan Kratochvil) - Fix a couple of warnings in the OPP core code related to W=1 builds (Viresh Kumar) - Move dev_pm_opp_{init|free}_cpufreq_table() to pm_opp.h (Viresh Kumar) - Extend dev_pm_opp_data with turbo support (Sibi Sankar) - dt-bindings: drop maxItems from inner items (David Heidelberg)" * tag 'pm-6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (95 commits) dt-bindings: opp: drop maxItems from inner items OPP: debugfs: Fix warning around icc_get_name() OPP: debugfs: Fix warning with W=1 builds cpufreq: Move dev_pm_opp_{init|free}_cpufreq_table() to pm_opp.h OPP: Extend dev_pm_opp_data with turbo support Fix cpupower-frequency-info.1 man page typo cpufreq: scmi: Set transition_delay_us firmware: arm_scmi: Populate fast channel rate_limit firmware: arm_scmi: Populate perf commands rate_limit cpuidle: ACPI/intel: fix MWAIT hint target C-state computation PM: sleep: wakeirq: fix wake irq warning in system suspend powercap: dtpm: Fix kernel-doc for dtpm_create_hierarchy() function cpufreq: Don't unregister cpufreq cooling on CPU hotplug PM: suspend: Set mem_sleep_current during kernel command line setup cpufreq: Honour transition_latency over transition_delay_us cpufreq: Limit resolving a frequency to policy min/max Documentation: PM: Fix runtime_pm.rst markdown syntax cpufreq: amd-pstate: adjust min/max limit perf cpufreq: Remove references to 10ms min sampling rate cpufreq: intel_pstate: Update default EPPs for Meteor Lake ...
2024-03-13Merge tag 'dma-mapping-6.9-2024-03-11' of ↵Linus Torvalds
git://git.infradead.org/users/hch/dma-mapping Pull dma-mapping updates from Christoph Hellwig: - fix leaked pages on dma_set_decrypted() failure (Rick Edgecombe) - add a new swiotlb debugfs file (ZhangPeng) * tag 'dma-mapping-6.9-2024-03-11' of git://git.infradead.org/users/hch/dma-mapping: dma-direct: Leak pages on dma_set_decrypted() failure swiotlb: add debugfs to track swiotlb transient pool usage
2024-03-12Merge tag 'printk-for-6.9' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux Pull printk updates from Petr Mladek: "Improve the behavior during panic. The issues were found when testing the ongoing changes introducing atomic consoles and printk kthreads: - pr_flush() has to wait for the last reserved record instead of the last finalized one. Note that records are finalized in random order when generated by more CPUs in parallel. - Ignore non-finalized records during panic(). Messages printed on panic-CPU are always finalized. Messages printed by other CPUs might never be finalized when the CPUs get stopped. - Block new printk() calls on non-panic CPUs completely. Backtraces are printed before entering the panic mode. Later messages would just mess information printed by the panic CPU. - Do not take console_lock in console_flush_on_panic() at all. The original code did try_lock()/console_unlock(). The unlock part might cause a deadlock when panic() happened in a scheduler code. - Fix conversion of 64-bit sequence number for 32-bit atomic operations" * tag 'printk-for-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux: dump_stack: Do not get cpu_sync for panic CPU panic: Flush kernel log buffer at the end printk: Avoid non-panic CPUs writing to ringbuffer printk: Disable passing console lock owner completely during panic() printk: ringbuffer: Skip non-finalized records in panic printk: Wait for all reserved records with pr_flush() printk: ringbuffer: Cleanup reader terminology printk: Add this_cpu_in_panic() printk: For @suppress_panic_printk check for other CPU in panic printk: ringbuffer: Clarify special lpos values printk: ringbuffer: Do not skip non-finalized records with prb_next_seq() printk: Use prb_first_seq() as base for 32bit seq macros printk: Adjust mapping for 32bit seq macros printk: nbcon: Relocate 32bit seq macros
2024-03-12Merge tag 'lsm-pr-20240312' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm Pull lsm updates from Paul Moore: - Promote IMA/EVM to a proper LSM This is the bulk of the diffstat, and the source of all the changes in the VFS code. Prior to the start of the LSM stacking work it was important that IMA/EVM were separate from the rest of the LSMs, complete with their own hooks, infrastructure, etc. as it was the only way to enable IMA/EVM at the same time as a LSM. However, now that the bulk of the LSM infrastructure supports multiple simultaneous LSMs, we can simplify things greatly by bringing IMA/EVM into the LSM infrastructure as proper LSMs. This is something I've wanted to see happen for quite some time and Roberto was kind enough to put in the work to make it happen. - Use the LSM hook default values to simplify the call_int_hook() macro Previously the call_int_hook() macro required callers to supply a default return value, despite a default value being specified when the LSM hook was defined. This simplifies the macro by using the defined default return value which makes life easier for callers and should also reduce the number of return value bugs in the future (we've had a few pop up recently, hence this work). - Use the KMEM_CACHE() macro instead of kmem_cache_create() The guidance appears to be to use the KMEM_CACHE() macro when possible and there is no reason why we can't use the macro, so let's use it. - Fix a number of comment typos in the LSM hook comment blocks Not much to say here, we fixed some questionable grammar decisions in the LSM hook comment blocks. * tag 'lsm-pr-20240312' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm: (28 commits) cred: Use KMEM_CACHE() instead of kmem_cache_create() lsm: use default hook return value in call_int_hook() lsm: fix typos in security/security.c comment headers integrity: Remove LSM ima: Make it independent from 'integrity' LSM evm: Make it independent from 'integrity' LSM evm: Move to LSM infrastructure ima: Move IMA-Appraisal to LSM infrastructure ima: Move to LSM infrastructure integrity: Move integrity_kernel_module_request() to IMA security: Introduce key_post_create_or_update hook security: Introduce inode_post_remove_acl hook security: Introduce inode_post_set_acl hook security: Introduce inode_post_create_tmpfile hook security: Introduce path_post_mknod hook security: Introduce file_release hook security: Introduce file_post_open hook security: Introduce inode_post_removexattr hook security: Introduce inode_post_setattr hook security: Align inode_setattr hook definition with EVM ...
2024-03-12Merge tag 'net-next-6.9' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking updates from Jakub Kicinski: "Core & protocols: - Large effort by Eric to lower rtnl_lock pressure and remove locks: - Make commonly used parts of rtnetlink (address, route dumps etc) lockless, protected by RCU instead of rtnl_lock. - Add a netns exit callback which already holds rtnl_lock, allowing netns exit to take rtnl_lock once in the core instead of once for each driver / callback. - Remove locks / serialization in the socket diag interface. - Remove 6 calls to synchronize_rcu() while holding rtnl_lock. - Remove the dev_base_lock, depend on RCU where necessary. - Support busy polling on a per-epoll context basis. Poll length and budget parameters can be set independently of system defaults. - Introduce struct net_hotdata, to make sure read-mostly global config variables fit in as few cache lines as possible. - Add optional per-nexthop statistics to ease monitoring / debug of ECMP imbalance problems. - Support TCP_NOTSENT_LOWAT in MPTCP. - Ensure that IPv6 temporary addresses' preferred lifetimes are long enough, compared to other configured lifetimes, and at least 2 sec. - Support forwarding of ICMP Error messages in IPSec, per RFC 4301. - Add support for the independent control state machine for bonding per IEEE 802.1AX-2008 5.4.15 in addition to the existing coupled control state machine. - Add "network ID" to MCTP socket APIs to support hosts with multiple disjoint MCTP networks. - Re-use the mono_delivery_time skbuff bit for packets which user space wants to be sent at a specified time. Maintain the timing information while traversing veth links, bridge etc. - Take advantage of MSG_SPLICE_PAGES for RxRPC DATA and ACK packets. - Simplify many places iterating over netdevs by using an xarray instead of a hash table walk (hash table remains in place, for use on fastpaths). - Speed up scanning for expired routes by keeping a dedicated list. - Speed up "generic" XDP by trying harder to avoid large allocations. - Support attaching arbitrary metadata to netconsole messages. Things we sprinkled into general kernel code: - Enforce VM_IOREMAP flag and range in ioremap_page_range and introduce VM_SPARSE kind and vm_area_[un]map_pages (used by bpf_arena). - Rework selftest harness to enable the use of the full range of ksft exit code (pass, fail, skip, xfail, xpass). Netfilter: - Allow userspace to define a table that is exclusively owned by a daemon (via netlink socket aliveness) without auto-removing this table when the userspace program exits. Such table gets marked as orphaned and a restarting management daemon can re-attach/regain ownership. - Speed up element insertions to nftables' concatenated-ranges set type. Compact a few related data structures. BPF: - Add BPF token support for delegating a subset of BPF subsystem functionality from privileged system-wide daemons such as systemd through special mount options for userns-bound BPF fs to a trusted & unprivileged application. - Introduce bpf_arena which is sparse shared memory region between BPF program and user space where structures inside the arena can have pointers to other areas of the arena, and pointers work seamlessly for both user-space programs and BPF programs. - Introduce may_goto instruction that is a contract between the verifier and the program. The verifier allows the program to loop assuming it's behaving well, but reserves the right to terminate it. - Extend the BPF verifier to enable static subprog calls in spin lock critical sections. - Support registration of struct_ops types from modules which helps projects like fuse-bpf that seeks to implement a new struct_ops type. - Add support for retrieval of cookies for perf/kprobe multi links. - Support arbitrary TCP SYN cookie generation / validation in the TC layer with BPF to allow creating SYN flood handling in BPF firewalls. - Add code generation to inline the bpf_kptr_xchg() helper which improves performance when stashing/popping the allocated BPF objects. Wireless: - Add SPP (signaling and payload protected) AMSDU support. - Support wider bandwidth OFDMA, as required for EHT operation. Driver API: - Major overhaul of the Energy Efficient Ethernet internals to support new link modes (2.5GE, 5GE), share more code between drivers (especially those using phylib), and encourage more uniform behavior. Convert and clean up drivers. - Define an API for querying per netdev queue statistics from drivers. - IPSec: account in global stats for fully offloaded sessions. - Create a concept of Ethernet PHY Packages at the Device Tree level, to allow parameterizing the existing PHY package code. - Enable Rx hashing (RSS) on GTP protocol fields. Misc: - Improvements and refactoring all over networking selftests. - Create uniform module aliases for TC classifiers, actions, and packet schedulers to simplify creating modprobe policies. - Address all missing MODULE_DESCRIPTION() warnings in networking. - Extend the Netlink descriptions in YAML to cover message encapsulation or "Netlink polymorphism", where interpretation of nested attributes depends on link type, classifier type or some other "class type". Drivers: - Ethernet high-speed NICs: - Add a new driver for Marvell's Octeon PCI Endpoint NIC VF. - Intel (100G, ice, idpf): - support E825-C devices - nVidia/Mellanox: - support devices with one port and multiple PCIe links - Broadcom (bnxt): - support n-tuple filters - support configuring the RSS key - Wangxun (ngbe/txgbe): - implement irq_domain for TXGBE's sub-interrupts - Pensando/AMD: - support XDP - optimize queue submission and wakeup handling (+17% bps) - optimize struct layout, saving 28% of memory on queues - Ethernet NICs embedded and virtual: - Google cloud vNIC: - refactor driver to perform memory allocations for new queue config before stopping and freeing the old queue memory - Synopsys (stmmac): - obey queueMaxSDU and implement counters required by 802.1Qbv - Renesas (ravb): - support packet checksum offload - suspend to RAM and runtime PM support - Ethernet switches: - nVidia/Mellanox: - support for nexthop group statistics - Microchip: - ksz8: implement PHY loopback - add support for KSZ8567, a 7-port 10/100Mbps switch - PTP: - New driver for RENESAS FemtoClock3 Wireless clock generator. - Support OCP PTP cards designed and built by Adva. - CAN: - Support recvmsg() flags for own, local and remote traffic on CAN BCM sockets. - Support for esd GmbH PCIe/402 CAN device family. - m_can: - Rx/Tx submission coalescing - wake on frame Rx - WiFi: - Intel (iwlwifi): - enable signaling and payload protected A-MSDUs - support wider-bandwidth OFDMA - support for new devices - bump FW API to 89 for AX devices; 90 for BZ/SC devices - MediaTek (mt76): - mt7915: newer ADIE version support - mt7925: radio temperature sensor support - Qualcomm (ath11k): - support 6 GHz station power modes: Low Power Indoor (LPI), Standard Power) SP and Very Low Power (VLP) - QCA6390 & WCN6855: support 2 concurrent station interfaces - QCA2066 support - Qualcomm (ath12k): - refactoring in preparation for Multi-Link Operation (MLO) support - 1024 Block Ack window size support - firmware-2.bin support - support having multiple identical PCI devices (firmware needs to have ATH12K_FW_FEATURE_MULTI_QRTR_ID) - QCN9274: support split-PHY devices - WCN7850: enable Power Save Mode in station mode - WCN7850: P2P support - RealTek: - rtw88: support for more rtw8811cu and rtw8821cu devices - rtw89: support SCAN_RANDOM_SN and SET_SCAN_DWELL - rtlwifi: speed up USB firmware initialization - rtwl8xxxu: - RTL8188F: concurrent interface support - Channel Switch Announcement (CSA) support in AP mode - Broadcom (brcmfmac): - per-vendor feature support - per-vendor SAE password setup - DMI nvram filename quirk for ACEPC W5 Pro" * tag 'net-next-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2255 commits) nexthop: Fix splat with CONFIG_DEBUG_PREEMPT=y nexthop: Fix out-of-bounds access during attribute validation nexthop: Only parse NHA_OP_FLAGS for dump messages that require it nexthop: Only parse NHA_OP_FLAGS for get messages that require it bpf: move sleepable flag from bpf_prog_aux to bpf_prog bpf: hardcode BPF_PROG_PACK_SIZE to 2MB * num_possible_nodes() selftests/bpf: Add kprobe multi triggering benchmarks ptp: Move from simple ida to xarray vxlan: Remove generic .ndo_get_stats64 vxlan: Do not alloc tstats manually devlink: Add comments to use netlink gen tool nfp: flower: handle acti_netdevs allocation failure net/packet: Add getsockopt support for PACKET_COPY_THRESH net/netlink: Add getsockopt support for NETLINK_LISTEN_ALL_NSID selftests/bpf: Add bpf_arena_htab test. selftests/bpf: Add bpf_arena_list test. selftests/bpf: Add unit tests for bpf_arena_alloc/free_pages bpf: Add helper macro bpf_addr_space_cast() libbpf: Recognize __arena global variables. bpftool: Recognize arena map type ...
2024-03-12Merge tag 'audit-pr-20240312' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit Pull audit updates from Paul Moore: "Two small audit patches: - Use the KMEM_CACHE() macro instead of kmem_cache_create() The guidance appears to be to use the KMEM_CACHE() macro when possible and there is no reason why we can't use the macro, so let's use it. - Remove an unnecessary assignment in audit_dupe_lsm_field() A return value variable was assigned a value in its declaration, but the declaration value is overwritten before the return value variable is ever referenced; drop the assignment at declaration time" * tag 'audit-pr-20240312' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit: audit: use KMEM_CACHE() instead of kmem_cache_create() audit: remove unnecessary assignment in audit_dupe_lsm_field()
2024-03-12Merge tag 'hardening-v6.9-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull hardening updates from Kees Cook: "As is pretty normal for this tree, there are changes all over the place, especially for small fixes, selftest improvements, and improved macro usability. Some header changes ended up landing via this tree as they depended on the string header cleanups. Also, a notable set of changes is the work for the reintroduction of the UBSAN signed integer overflow sanitizer so that we can continue to make improvements on the compiler side to make this sanitizer a more viable future security hardening option. Summary: - string.h and related header cleanups (Tanzir Hasan, Andy Shevchenko) - VMCI memcpy() usage and struct_size() cleanups (Vasiliy Kovalev, Harshit Mogalapalli) - selftests/powerpc: Fix load_unaligned_zeropad build failure (Michael Ellerman) - hardened Kconfig fragment updates (Marco Elver, Lukas Bulwahn) - Handle tail call optimization better in LKDTM (Douglas Anderson) - Use long form types in overflow.h (Andy Shevchenko) - Add flags param to string_get_size() (Andy Shevchenko) - Add Coccinelle script for potential struct_size() use (Jacob Keller) - Fix objtool corner case under KCFI (Josh Poimboeuf) - Drop 13 year old backward compat CAP_SYS_ADMIN check (Jingzi Meng) - Add str_plural() helper (Michal Wajdeczko, Kees Cook) - Ignore relocations in .notes section - Add comments to explain how __is_constexpr() works - Fix m68k stack alignment expectations in stackinit Kunit test - Convert string selftests to KUnit - Add KUnit tests for fortified string functions - Improve reporting during fortified string warnings - Allow non-type arg to type_max() and type_min() - Allow strscpy() to be called with only 2 arguments - Add binary mode to leaking_addresses scanner - Various small cleanups to leaking_addresses scanner - Adding wrapping_*() arithmetic helper - Annotate initial signed integer wrap-around in refcount_t - Add explicit UBSAN section to MAINTAINERS - Fix UBSAN self-test warnings - Simplify UBSAN build via removal of CONFIG_UBSAN_SANITIZE_ALL - Reintroduce UBSAN's signed overflow sanitizer" * tag 'hardening-v6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: (51 commits) selftests/powerpc: Fix load_unaligned_zeropad build failure string: Convert helpers selftest to KUnit string: Convert selftest to KUnit sh: Fix build with CONFIG_UBSAN=y compiler.h: Explain how __is_constexpr() works overflow: Allow non-type arg to type_max() and type_min() VMCI: Fix possible memcpy() run-time warning in vmci_datagram_invoke_guest_handler() lib/string_helpers: Add flags param to string_get_size() x86, relocs: Ignore relocations in .notes section objtool: Fix UNWIND_HINT_{SAVE,RESTORE} across basic blocks overflow: Use POD in check_shl_overflow() lib: stackinit: Adjust target string to 8 bytes for m68k sparc: vdso: Disable UBSAN instrumentation kernel.h: Move lib/cmdline.c prototypes to string.h leaking_addresses: Provide mechanism to scan binary files leaking_addresses: Ignore input device status lines leaking_addresses: Use File::Temp for /tmp files MAINTAINERS: Update LEAKING_ADDRESSES details fortify: Improve buffer overflow reporting fortify: Add KUnit tests for runtime overflows ...
2024-03-12watchdog/core: remove sysctl handlers from public headerThomas Weißschuh
The functions are only used in the file where they are defined. Remove them from the header and make them static. Also guard proc_soft_watchdog with a #define-guard as it is not used otherwise. Link: https://lkml.kernel.org/r/20240306-const-sysctl-prep-watchdog-v1-1-bd45da3a41cf@weissschuh.net Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-12tracing/ring-buffer: Fix wait_on_pipe() raceSteven Rostedt (Google)
When the trace_pipe_raw file is closed, there should be no new readers on the file descriptor. This is mostly handled with the waking and wait_index fields of the iterator. But there's still a slight race. CPU 0 CPU 1 ----- ----- wait_index++; index = wait_index; ring_buffer_wake_waiters(); wait_on_pipe() ring_buffer_wait(); The ring_buffer_wait() will miss the wakeup from CPU 1. The problem is that the ring_buffer_wait() needs the logic of: prepare_to_wait(); if (!condition) schedule(); Where the missing condition check is the iter->wait_index update. Have the ring_buffer_wait() take a conditional callback function and a data parameter that can be used within the wait_event_interruptible() of the ring_buffer_wait() function. In wait_on_pipe(), pass a condition function that will check if the wait_index has been updated, if it has, it will return true to break out of the wait_event_interruptible() loop. Create a new field "closed" in the trace_iterator and set it in the .flush() callback before calling ring_buffer_wake_waiters(). This will keep any new readers from waiting on a closed file descriptor. Have the wait_on_pipe() condition callback also check the closed field. Change the wait_index field of the trace_iterator to atomic_t. There's no reason it needs to be 'long' and making it atomic and using atomic_read_acquire() and atomic_fetch_inc_release() will provide the necessary memory barriers. Add a "woken" flag to tracing_buffers_splice_read() to exit the loop after one more try to fetch data. That is, if it waited for data and something woke it up, it should try to collect any new data and then exit back to user space. Link: https://lore.kernel.org/linux-trace-kernel/CAHk-=wgsNgewHFxZAJiAQznwPMqEtQmi1waeS2O1v6L4c_Um5A@mail.gmail.com/ Link: https://lore.kernel.org/linux-trace-kernel/20240312121703.557950713@goodmis.org Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: linke li <lilinke99@qq.com> Cc: Rabin Vincent <rabin@rab.in> Fixes: f3ddb74ad0790 ("tracing: Wake up ring buffer waiters on closing of the file") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-03-12ring-buffer: Use wait_event_interruptible() in ring_buffer_wait()Steven Rostedt (Google)
Convert ring_buffer_wait() over to wait_event_interruptible(). The default condition is to execute the wait loop inside __wait_event() just once. This does not change the ring_buffer_wait() prototype yet, but restructures the code so that it can take a "cond" and "data" parameter and will call wait_event_interruptible() with a helper function as the condition. The helper function (rb_wait_cond) takes the cond function and data parameters. It will first check if the buffer hit the watermark defined by the "full" parameter and then call the passed in condition parameter. If either are true, it returns true. If rb_wait_cond() does not return true, it will set the appropriate "waiters_pending" flag and returns false. Link: https://lore.kernel.org/linux-trace-kernel/CAHk-=wgsNgewHFxZAJiAQznwPMqEtQmi1waeS2O1v6L4c_Um5A@mail.gmail.com/ Link: https://lore.kernel.org/linux-trace-kernel/20240312121703.399598519@goodmis.org Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: linke li <lilinke99@qq.com> Cc: Rabin Vincent <rabin@rab.in> Fixes: f3ddb74ad0790 ("tracing: Wake up ring buffer waiters on closing of the file") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-03-12ring-buffer: Reuse rb_watermark_hit() for the poll logicSteven Rostedt (Google)
The check for knowing if the poll should wait or not is basically the exact same logic as rb_watermark_hit(). The only difference is that rb_watermark_hit() also handles the !full case. But for the full case, the logic is the same. Just call that instead of duplicating the code in ring_buffer_poll_wait(). Link: https://lore.kernel.org/linux-trace-kernel/20240312131952.802267543@goodmis.org Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-03-12ring-buffer: Fix full_waiters_pending in pollSteven Rostedt (Google)
If a reader of the ring buffer is doing a poll, and waiting for the ring buffer to hit a specific watermark, there could be a case where it gets into an infinite ping-pong loop. The poll code has: rbwork->full_waiters_pending = true; if (!cpu_buffer->shortest_full || cpu_buffer->shortest_full > full) cpu_buffer->shortest_full = full; The writer will see full_waiters_pending and check if the ring buffer is filled over the percentage of the shortest_full value. If it is, it calls an irq_work to wake up all the waiters. But the code could get into a circular loop: CPU 0 CPU 1 ----- ----- [ Poll ] [ shortest_full = 0 ] rbwork->full_waiters_pending = true; if (rbwork->full_waiters_pending && [ buffer percent ] > shortest_full) { rbwork->wakeup_full = true; [ queue_irqwork ] cpu_buffer->shortest_full = full; [ IRQ work ] if (rbwork->wakeup_full) { cpu_buffer->shortest_full = 0; wakeup poll waiters; [woken] if ([ buffer percent ] > full) break; rbwork->full_waiters_pending = true; if (rbwork->full_waiters_pending && [ buffer percent ] > shortest_full) { rbwork->wakeup_full = true; [ queue_irqwork ] cpu_buffer->shortest_full = full; [ IRQ work ] if (rbwork->wakeup_full) { cpu_buffer->shortest_full = 0; wakeup poll waiters; [woken] [ Wash, rinse, repeat! ] In the poll, the shortest_full needs to be set before the full_pending_waiters, as once that is set, the writer will compare the current shortest_full (which is incorrect) to decide to call the irq_work, which will reset the shortest_full (expecting the readers to update it). Also move the setting of full_waiters_pending after the check if the ring buffer has the required percentage filled. There's no reason to tell the writer to wake up waiters if there are no waiters. Link: https://lore.kernel.org/linux-trace-kernel/20240312131952.630922155@goodmis.org Cc: stable@vger.kernel.org Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Fixes: 42fb0a1e84ff5 ("tracing/ring-buffer: Have polling block on watermark") Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-03-12ring-buffer: Do not set shortest_full when full target is hitSteven Rostedt (Google)
The rb_watermark_hit() checks if the amount of data in the ring buffer is above the percentage level passed in by the "full" variable. If it is, it returns true. But it also sets the "shortest_full" field of the cpu_buffer that informs writers that it needs to call the irq_work if the amount of data on the ring buffer is above the requested amount. The rb_watermark_hit() always sets the shortest_full even if the amount in the ring buffer is what it wants. As it is not going to wait, because it has what it wants, there's no reason to set shortest_full. Link: https://lore.kernel.org/linux-trace-kernel/20240312115641.6aa8ba08@gandalf.local.home Cc: stable@vger.kernel.org Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Fixes: 42fb0a1e84ff5 ("tracing/ring-buffer: Have polling block on watermark") Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-03-11Merge tag 'x86-core-2024-03-11' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull core x86 updates from Ingo Molnar: - The biggest change is the rework of the percpu code, to support the 'Named Address Spaces' GCC feature, by Uros Bizjak: - This allows C code to access GS and FS segment relative memory via variables declared with such attributes, which allows the compiler to better optimize those accesses than the previous inline assembly code. - The series also includes a number of micro-optimizations for various percpu access methods, plus a number of cleanups of %gs accesses in assembly code. - These changes have been exposed to linux-next testing for the last ~5 months, with no known regressions in this area. - Fix/clean up __switch_to()'s broken but accidentally working handling of FPU switching - which also generates better code - Propagate more RIP-relative addressing in assembly code, to generate slightly better code - Rework the CPU mitigations Kconfig space to be less idiosyncratic, to make it easier for distros to follow & maintain these options - Rework the x86 idle code to cure RCU violations and to clean up the logic - Clean up the vDSO Makefile logic - Misc cleanups and fixes * tag 'x86-core-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (52 commits) x86/idle: Select idle routine only once x86/idle: Let prefer_mwait_c1_over_halt() return bool x86/idle: Cleanup idle_setup() x86/idle: Clean up idle selection x86/idle: Sanitize X86_BUG_AMD_E400 handling sched/idle: Conditionally handle tick broadcast in default_idle_call() x86: Increase brk randomness entropy for 64-bit systems x86/vdso: Move vDSO to mmap region x86/vdso/kbuild: Group non-standard build attributes and primary object file rules together x86/vdso: Fix rethunk patching for vdso-image-{32,64}.o x86/retpoline: Ensure default return thunk isn't used at runtime x86/vdso: Use CONFIG_COMPAT_32 to specify vdso32 x86/vdso: Use $(addprefix ) instead of $(foreach ) x86/vdso: Simplify obj-y addition x86/vdso: Consolidate targets and clean-files x86/bugs: Rename CONFIG_RETHUNK => CONFIG_MITIGATION_RETHUNK x86/bugs: Rename CONFIG_CPU_SRSO => CONFIG_MITIGATION_SRSO x86/bugs: Rename CONFIG_CPU_IBRS_ENTRY => CONFIG_MITIGATION_IBRS_ENTRY x86/bugs: Rename CONFIG_CPU_UNRET_ENTRY => CONFIG_MITIGATION_UNRET_ENTRY x86/bugs: Rename CONFIG_SLS => CONFIG_MITIGATION_SLS ...
2024-03-11Merge tag 'sched-core-2024-03-11' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: - Fix inconsistency in misfit task load-balancing - Fix CPU isolation bugs in the task-wakeup logic - Rework and unify the sched_use_asym_prio() and sched_asym_prefer() logic - Clean up and simplify ->avg_* accesses - Misc cleanups and fixes * tag 'sched-core-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/topology: Rename SD_SHARE_PKG_RESOURCES to SD_SHARE_LLC sched/fair: Check the SD_ASYM_PACKING flag in sched_use_asym_prio() sched/fair: Rework sched_use_asym_prio() and sched_asym_prefer() sched/fair: Remove unused parameter from sched_asym() sched/topology: Remove duplicate descriptions from TOPOLOGY_SD_FLAGS sched/fair: Simplify the update_sd_pick_busiest() logic sched/fair: Do strict inequality check for busiest misfit task group sched/fair: Remove unnecessary goto in update_sd_lb_stats() sched/fair: Take the scheduling domain into account in select_idle_core() sched/fair: Take the scheduling domain into account in select_idle_smt() sched/fair: Add READ_ONCE() and use existing helper function to access ->avg_irq sched/fair: Use existing helper functions to access ->avg_rt and ->avg_dl sched/core: Simplify code by removing duplicate #ifdefs
2024-03-11Merge tag 'locking-core-2024-03-11' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking updates from Ingo Molnar: - Micro-optimize local_xchg() and the rtmutex code on x86 - Fix percpu-rwsem contention tracepoints - Simplify debugging Kconfig dependencies - Update/clarify the documentation of atomic primitives - Misc cleanups * tag 'locking-core-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: locking/rtmutex: Use try_cmpxchg_relaxed() in mark_rt_mutex_waiters() locking/x86: Implement local_xchg() using CMPXCHG without the LOCK prefix locking/percpu-rwsem: Trigger contention tracepoints only if contended locking/rwsem: Make DEBUG_RWSEMS and PREEMPT_RT mutually exclusive locking/rwsem: Clarify that RWSEM_READER_OWNED is just a hint locking/mutex: Simplify <linux/mutex.h> locking/qspinlock: Fix 'wait_early' set but not used warning locking/atomic: scripts: Clarify ordering of conditional atomics
2024-03-11Merge tag 'for-netdev' of ↵Jakub Kicinski
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Alexei Starovoitov says: ==================== pull-request: bpf-next 2024-03-11 We've added 59 non-merge commits during the last 9 day(s) which contain a total of 88 files changed, 4181 insertions(+), 590 deletions(-). The main changes are: 1) Enforce VM_IOREMAP flag and range in ioremap_page_range and introduce VM_SPARSE kind and vm_area_[un]map_pages to be used in bpf_arena, from Alexei. 2) Introduce bpf_arena which is sparse shared memory region between bpf program and user space where structures inside the arena can have pointers to other areas of the arena, and pointers work seamlessly for both user-space programs and bpf programs, from Alexei and Andrii. 3) Introduce may_goto instruction that is a contract between the verifier and the program. The verifier allows the program to loop assuming it's behaving well, but reserves the right to terminate it, from Alexei. 4) Use IETF format for field definitions in the BPF standard document, from Dave. 5) Extend struct_ops libbpf APIs to allow specify version suffixes for stuct_ops map types, share the same BPF program between several map definitions, and other improvements, from Eduard. 6) Enable struct_ops support for more than one page in trampolines, from Kui-Feng. 7) Support kCFI + BPF on riscv64, from Puranjay. 8) Use bpf_prog_pack for arm64 bpf trampoline, from Puranjay. 9) Fix roundup_pow_of_two undefined behavior on 32-bit archs, from Toke. ==================== Link: https://lore.kernel.org/r/20240312003646.8692-1-alexei.starovoitov@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-03-11bpf: move sleepable flag from bpf_prog_aux to bpf_progAndrii Nakryiko
prog->aux->sleepable is checked very frequently as part of (some) BPF program run hot paths. So this extra aux indirection seems wasteful and on busy systems might cause unnecessary memory cache misses. Let's move sleepable flag into prog itself to eliminate unnecessary pointer dereference. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Jiri Olsa <jolsa@kernel.org> Message-ID: <20240309004739.2961431-1-andrii@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-03-11bpf: hardcode BPF_PROG_PACK_SIZE to 2MB * num_possible_nodes()Puranjay Mohan
On some architectures like ARM64, PMD_SIZE can be really large in some configurations. Like with CONFIG_ARM64_64K_PAGES=y the PMD_SIZE is 512MB. Use 2MB * num_possible_nodes() as the size for allocations done through the prog pack allocator. On most architectures, PMD_SIZE will be equal to 2MB in case of 4KB pages and will be greater than 2MB for bigger page sizes. Fixes: ea2babac63d4 ("bpf: Simplify bpf_prog_pack_[size|mask]") Reported-by: "kernelci.org bot" <bot@kernelci.org> Closes: https://lore.kernel.org/all/7e216c88-77ee-47b8-becc-a0f780868d3c@sirena.org.uk/ Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202403092219.dhgcuz2G-lkp@intel.com/ Suggested-by: Song Liu <song@kernel.org> Signed-off-by: Puranjay Mohan <puranjay12@gmail.com> Message-ID: <20240311122722.86232-1-puranjay12@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-03-11Merge tag 'x86-apic-2024-03-10' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 APIC updates from Thomas Gleixner: "Rework of APIC enumeration and topology evaluation. The current implementation has a couple of shortcomings: - It fails to handle hybrid systems correctly. - The APIC registration code which handles CPU number assignents is in the middle of the APIC code and detached from the topology evaluation. - The various mechanisms which enumerate APICs, ACPI, MPPARSE and guest specific ones, tweak global variables as they see fit or in case of XENPV just hack around the generic mechanisms completely. - The CPUID topology evaluation code is sprinkled all over the vendor code and reevaluates global variables on every hotplug operation. - There is no way to analyze topology on the boot CPU before bringing up the APs. This causes problems for infrastructure like PERF which needs to size certain aspects upfront or could be simplified if that would be possible. - The APIC admission and CPU number association logic is incomprehensible and overly complex and needs to be kept around after boot instead of completing this right after the APIC enumeration. This update addresses these shortcomings with the following changes: - Rework the CPUID evaluation code so it is common for all vendors and provides information about the APIC ID segments in a uniform way independent of the number of segments (Thread, Core, Module, ..., Die, Package) so that this information can be computed instead of rewriting global variables of dubious value over and over. - A few cleanups and simplifcations of the APIC, IO/APIC and related interfaces to prepare for the topology evaluation changes. - Seperation of the parser stages so the early evaluation which tries to find the APIC address can be seperately overridden from the late evaluation which enumerates and registers the local APIC as further preparation for sanitizing the topology evaluation. - A new registration and admission logic which - encapsulates the inner workings so that parsers and guest logic cannot longer fiddle in it - uses the APIC ID segments to build topology bitmaps at registration time - provides a sane admission logic - allows to detect the crash kernel case, where CPU0 does not run on the real BSP, automatically. This is required to prevent sending INIT/SIPI sequences to the real BSP which would reset the whole machine. This was so far handled by a tedious command line parameter, which does not even work in nested crash scenarios. - Associates CPU number after the enumeration completed and prevents the late registration of APICs, which was somehow tolerated before. - Converting all parsers and guest enumeration mechanisms over to the new interfaces. This allows to get rid of all global variable tweaking from the parsers and enumeration mechanisms and sanitizes the XEN[PV] handling so it can use CPUID evaluation for the first time. - Mopping up existing sins by taking the information from the APIC ID segment bitmaps. This evaluates hybrid systems correctly on the boot CPU and allows for cleanups and fixes in the related drivers, e.g. PERF. The series has been extensively tested and the minimal late fallout due to a broken ACPI/MADT table has been addressed by tightening the admission logic further" * tag 'x86-apic-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (76 commits) x86/topology: Ignore non-present APIC IDs in a present package x86/apic: Build the x86 topology enumeration functions on UP APIC builds too smp: Provide 'setup_max_cpus' definition on UP too smp: Avoid 'setup_max_cpus' namespace collision/shadowing x86/bugs: Use fixed addressing for VERW operand x86/cpu/topology: Get rid of cpuinfo::x86_max_cores x86/cpu/topology: Provide __num_[cores|threads]_per_package x86/cpu/topology: Rename topology_max_die_per_package() x86/cpu/topology: Rename smp_num_siblings x86/cpu/topology: Retrieve cores per package from topology bitmaps x86/cpu/topology: Use topology logical mapping mechanism x86/cpu/topology: Provide logical pkg/die mapping x86/cpu/topology: Simplify cpu_mark_primary_thread() x86/cpu/topology: Mop up primary thread mask handling x86/cpu/topology: Use topology bitmaps for sizing x86/cpu/topology: Let XEN/PV use topology from CPUID/MADT x86/xen/smp_pv: Count number of vCPUs early x86/cpu/topology: Assign hotpluggable CPUIDs during init x86/cpu/topology: Reject unknown APIC IDs on ACPI hotplug x86/topology: Add a mechanism to track topology via APIC IDs ...