Add warning checks if mutex_trylock() or mutex_unlock() are used in
IRQ contexts, under CONFIG_DEBUG_MUTEXES=y.
While the mutex rules and semantics are explicitly documented, this allows
to expose any abusers and robustifies the whole thing.
While trylock and unlock are non-blocking, calling from IRQ context
is still forbidden (lock must be within the same context as unlock).
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dave@stgolabs.net
Link: https://lkml.kernel.org/r/20191025033634.3330-1-dave@stgolabs.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Since the original comment, we have moved to do the task
reference counting explicitly along with wake_q_add_safe().
Drop the now incorrect comment.
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dave@stgolabs.net
Link: https://lkml.kernel.org/r/20191023033450.6445-1-dave@stgolabs.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The check_preemption_disabled() function uses cpumask_equal() to see
if the task is bounded to the current CPU only. cpumask_equal() calls
memcmp() to do the comparison. As x86 doesn't have __HAVE_ARCH_MEMCMP,
the slow memcmp() function in lib/string.c is used.
On a RT kernel that call check_preemption_disabled() very frequently,
below is the perf-record output of a certain microbenchmark:
42.75% 2.45% testpmd [kernel.kallsyms] [k] check_preemption_disabled
40.01% 39.97% testpmd [kernel.kallsyms] [k] memcmp
We should avoid calling memcmp() in performance critical path. So the
cpumask_equal() call is now replaced with an equivalent simpler check.
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20191003203608.21881-1-longman@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQQUwxxKyE5l/npt8ARiEGxRG/Sl2wUCXZzy3AAKCRBiEGxRG/Sl
2wFjAP9BFsZMqiz8wTxlvn/vuVXg8V1TAbJzwn0rJJKPJsggnQD9HFBKJ3Vq995R
C02zqXE+8wJlqCGRK4pmJey5KamjLQo=
=JF8E
-----END PGP SIGNATURE-----
Merge tag 'led-fixes-for-5.4-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/j.anaszewski/linux-leds
Pull LED fixes from Jacek Anaszewski:
- fix a leftover from earlier stage of development in the documentation
of recently added led_compose_name() and fix old mistake in the
documentation of led_set_brightness_sync() parameter name.
- MAINTAINERS: add pointer to Pavel Machek's linux-leds.git tree.
Pavel is going to take over LED tree maintainership from myself.
* tag 'led-fixes-for-5.4-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/j.anaszewski/linux-leds:
Add my linux-leds branch to MAINTAINERS
leds: core: Fix leds.h structure documentation
Add pointer to my git tree to MAINTAINERS. I'd like to maintain
linux-leds for-next branch for 5.5.
Signed-off-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Jacek Anaszewski <jacek.anaszewski@gmail.com>
Update the leds.h structure documentation to define the
correct arguments.
Signed-off-by: Dan Murphy <dmurphy@ti.com>
Signed-off-by: Jacek Anaszewski <jacek.anaszewski@gmail.com>
- Don't clear FLAG_IS_OUT when emulating open drain/source in
gpiolib.
- Fix up the usage of nonexclusive GPIO descriptors from device
trees.
- Fix the incorrect IEC offset when toggling trigger edge in
the Spreadtrum driver.
- Use the correct unit for debounce settings in the MAX77620
driver.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEElDRnuGcz/wPCXQWMQRCzN7AZXXMFAl2cT2cACgkQQRCzN7AZ
XXPCIhAAieKjIfW+adcV9+EZUybw35R9QCn27TTowK3YZfNi4cXyOi9f39gw8Kby
ZptY9lE5w6P142z8qLmtYWZ1F639sjeo4dZ2tXIzo0PhqG36rUP2hPOCffrhvMHb
iDby2IIgKTbR4i1YxZk75h8s0oWG542JGs39Uu+UyYTNAwJdA/4jKjIHK4XosqdD
YbhGNP4J4rmeNLwpK7RObg2NwWsfB0P15TEywnVNEDyOSsjmm65gKYSQn+frDlOz
sSdLy/J2+suRCu2KwHlgclF2dKcm2wyBAHItyTFwLTlYOENUC/cAGDlc6/2Nqhh+
CR9blbf5Jns0aTWoPDRHaAXy5qkKTfcI4osDJOP8yRVziz0vW0+Xq1PxmifrFl4d
zFIN5MPwJ3kTjNlfOaR4txuqvZX8JCDAjMfHMY0e/fqG5Ofm4OvGfsk1Xdwq7wbs
qV5Fewv6XybdHmIuDOHvJ9iPMuMEZbgs3jNRCxnLj0NYXDMXyHR9tJoTq71v8GHg
KvyCNwKXACrRQvE/zkEYgSgY3IXS/3Dhv3RhmpzIpP/Ey55o57CmaqG3oh2C3C0V
OyCXXYQ8ZvFrIAwLYZwq/l1f2knP0D+uJrT/puxvFUXJQpLqD72mfr20/G1tcyrf
8lxwf2MHmqjLV79E+lTxykfTFi0AEbSBHy2FPDYewfdPQG7h6xo=
=J0x0
-----END PGP SIGNATURE-----
Merge tag 'gpio-v5.4-2' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio
Pull GPIO fixes from Linus Walleij:
- don't clear FLAG_IS_OUT when emulating open drain/source in gpiolib
- fix up the usage of nonexclusive GPIO descriptors from device trees
- fix the incorrect IEC offset when toggling trigger edge in the
Spreadtrum driver
- use the correct unit for debounce settings in the MAX77620 driver
* tag 'gpio-v5.4-2' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio:
gpio: max77620: Use correct unit for debounce times
gpio: eic: sprd: Fix the incorrect EIC offset when toggling
gpio: fix getting nonexclusive gpiods from DT
gpiolib: don't clear FLAG_IS_OUT when emulating open-drain/open-source
-----BEGIN PGP SIGNATURE-----
iQJIBAABCAAyFiEES0KozwfymdVUl37v6iDy2pc3iXMFAl2bu6kUHHBhdWxAcGF1
bC1tb29yZS5jb20ACgkQ6iDy2pc3iXMsxhAAtoljww3Xur0JpD7y+g2yzKGZqn9F
ovqH103NOdpXY3vRN5TL0ZfKEWZz/a2Rjyjz/9+Ix5kKFQuaguk9TVenp4LuAWjy
yyo8aSArqwJEpPbrgQDRkjvq08zCcsHSQHwyR44L5MEB8w03Hr+GKFbroR7DkB8R
qthF5nRoarblEpdc88s3WbPN/Yz32zRwl3EppSRriIBSBUNr6OP5yO6YDvBdwJso
CvmQybMK/iGiZrDzm5jAXzUyI79MHkrrB55roNXIdam9Rnyb9Wqjt9SQgzDLTvO1
Z7c4pXqDn1iMSECAqR7EeKLmsEvnp8omDMqbZOsGiWwka93nuNM4NRhswMF6X3pf
EbmBAuj0CokWlRoJAxyxrw/Tn+KXWjyOpOMoNQR7dyyewenzPTWw4zLhiSsl4Epo
e1+3PDkJeZhlrtqMcQhep/OgfnPp/8FlgZXNkq1wsMK6SawIiwvxH3mpELE4I8Zk
3yzYZvnxIDNLcx6TmDgDcJyp+P/iuFGK707G6ogCoCK9VqyTs+nwdZn3s2o1KRDW
00LdiuXiqOyfdDthfY/q5suKJoWExh+K1dhQ7Llx169yx3uOjlnzTaSTt8dcvhkh
Y+Nf5pEk0MVgnldaIRy/Zzr4y81Q7QW6ZwD62NHCIhcSevYczFOP7K6V/mYFmDT1
xlCDPXeHyuR5DrM=
=btWt
-----END PGP SIGNATURE-----
Merge tag 'selinux-pr-20191007' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux
Pull selinuxfix from Paul Moore:
"One patch to ensure we don't copy bad memory up into userspace"
* tag 'selinux-pr-20191007' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux:
selinux: fix context string corruption in convert_context()
This Kselftest update for Linux 5.4-rc3 consists fixes for existing
tests and the framework. Cristian Marussi's patches add ability to
skip targets (tests) and exclude tests that didn't build from run-list.
These patches improve the Kselftest results. Ability to skip targets
helps avoid running tests that aren't supported in certain environments.
As an example, bpf tests from mainline aren't supported on stable kernels
and have dependency on bleeding edge llvm. Being able to skip bpf on
systems that can't meet this llvm dependency will be helpful.
Kselftest can be built and installed from the main Makefile. This change
help simplify Kselftest use-cases which addresses request from users.
Kees Cook added per test timeout support to limit individual test run-time.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEPZKym/RZuOCGeA/kCwJExA0NQxwFAl2bntwACgkQCwJExA0N
QxznkA//cY2Y3UGMoUx08qnLc97cQb95OXodE3m3fcfyH/NoY6R/RAMx+2NSYhLO
kbmpmo+6S94bgekGBzdnki/OzCoVR0d1dkxPImcxXl1zf/fs7eMgZ77Br1nQWfPP
2WfFv7xNYnuws1Ybnz83eN+6ZQ+/AjEbHcqcufWDj/D2AQDTxF/2PXHeD42azJgG
11EAxhCEbSb8x0ZDAeHTELvZ0gIfdWYNmOXFUHgJSW4nVYYhFNcvbq2nukmugkub
MMWBcM6B354bAx8EoMSnBQ/1WWYszs0SqkbVce3iDh8z9R/sLFmUthljK9LR0EpW
okfJVHF0jGSWdwnruyES8Mp7/65RBu6bkVnbdFcYW1nIw4erfzYacUBXK8WZe88g
p5lkY1OlDbPrUcjIN1VpVw4FZt1fktXAwbTIn+xOUI9R5njv94tFNUDaQm3epKwC
fKB1jXv8jAZ8Ho2uw4ikLW8mie9Kd9c/8PK8JoEtgXCtAxOv9/wUb6whHPvUOYeu
B2G5ITyTJF3yYrTaPliHqb2C5cCVN0XcF5VLKQRR+RpQn4///9duQQcEEOJsKHOC
q3SMjjhXRJfgYDLcpIRDn6uqaDwC+giWOaMq6f/QHpmsWL0eT7DJ+8lLCgpV3Bm2
JytbiXpeUigRZCdH0xs+wp23xPRAtKlf7DlGQhOb/v9v4rp/8MY=
=vdrT
-----END PGP SIGNATURE-----
Merge tag 'linux-kselftest-5.4-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest
Pull Kselftest fixes from Shuah Khan:
"Fixes for existing tests and the framework.
Cristian Marussi's patches add the ability to skip targets (tests) and
exclude tests that didn't build from run-list. These patches improve
the Kselftest results. Ability to skip targets helps avoid running
tests that aren't supported in certain environments. As an example,
bpf tests from mainline aren't supported on stable kernels and have
dependency on bleeding edge llvm. Being able to skip bpf on systems
that can't meet this llvm dependency will be helpful.
Kselftest can be built and installed from the main Makefile. This
change help simplify Kselftest use-cases which addresses request from
users.
Kees Cook added per test timeout support to limit individual test
run-time"
* tag 'linux-kselftest-5.4-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
selftests: watchdog: Add command line option to show watchdog_info
selftests: watchdog: Validate optional file argument
selftests/kselftest/runner.sh: Add 45 second timeout per test
kselftest: exclude failed TARGETS from runlist
kselftest: add capability to skip chosen TARGETS
selftests: Add kselftest-all and kselftest-install targets
Merge misc fixes from Andrew Morton:
"The usual shower of hotfixes.
Chris's memcg patches aren't actually fixes - they're mature but a few
niggling review issues were late to arrive.
The ocfs2 fixes are quite old - those took some time to get reviewer
attention.
Subsystems affected by this patch series: ocfs2, hotfixes, mm/memcg,
mm/slab-generic"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
mm, sl[aou]b: guarantee natural alignment for kmalloc(power-of-two)
mm, sl[ou]b: improve memory accounting
mm, memcg: make scan aggression always exclude protection
mm, memcg: make memory.emin the baseline for utilisation determination
mm, memcg: proportional memory.{low,min} reclaim
mm/vmpressure.c: fix a signedness bug in vmpressure_register_event()
mm/page_alloc.c: fix a crash in free_pages_prepare()
mm/z3fold.c: claim page in the beginning of free
kernel/sysctl.c: do not override max_threads provided by userspace
memcg: only record foreign writebacks with dirty pages when memcg is not disabled
mm: fix -Wmissing-prototypes warnings
writeback: fix use-after-free in finish_writeback_work()
mm/memremap: drop unused SECTION_SIZE and SECTION_MASK
panic: ensure preemption is disabled during panic()
fs: ocfs2: fix a possible null-pointer dereference in ocfs2_info_scan_inode_alloc()
fs: ocfs2: fix a possible null-pointer dereference in ocfs2_write_end_nolock()
fs: ocfs2: fix possible null-pointer dereferences in ocfs2_xa_prepare_entry()
ocfs2: clear zero in unaligned direct IO
In most configurations, kmalloc() happens to return naturally aligned
(i.e. aligned to the block size itself) blocks for power of two sizes.
That means some kmalloc() users might unknowingly rely on that
alignment, until stuff breaks when the kernel is built with e.g.
CONFIG_SLUB_DEBUG or CONFIG_SLOB, and blocks stop being aligned. Then
developers have to devise workaround such as own kmem caches with
specified alignment [1], which is not always practical, as recently
evidenced in [2].
The topic has been discussed at LSF/MM 2019 [3]. Adding a
'kmalloc_aligned()' variant would not help with code unknowingly relying
on the implicit alignment. For slab implementations it would either
require creating more kmalloc caches, or allocate a larger size and only
give back part of it. That would be wasteful, especially with a generic
alignment parameter (in contrast with a fixed alignment to size).
Ideally we should provide to mm users what they need without difficult
workarounds or own reimplementations, so let's make the kmalloc()
alignment to size explicitly guaranteed for power-of-two sizes under all
configurations. What this means for the three available allocators?
* SLAB object layout happens to be mostly unchanged by the patch. The
implicitly provided alignment could be compromised with
CONFIG_DEBUG_SLAB due to redzoning, however SLAB disables redzoning for
caches with alignment larger than unsigned long long. Practically on at
least x86 this includes kmalloc caches as they use cache line alignment,
which is larger than that. Still, this patch ensures alignment on all
arches and cache sizes.
* SLUB layout is also unchanged unless redzoning is enabled through
CONFIG_SLUB_DEBUG and boot parameter for the particular kmalloc cache.
With this patch, explicit alignment is guaranteed with redzoning as
well. This will result in more memory being wasted, but that should be
acceptable in a debugging scenario.
* SLOB has no implicit alignment so this patch adds it explicitly for
kmalloc(). The potential downside is increased fragmentation. While
pathological allocation scenarios are certainly possible, in my testing,
after booting a x86_64 kernel+userspace with virtme, around 16MB memory
was consumed by slab pages both before and after the patch, with
difference in the noise.
[1] https://lore.kernel.org/linux-btrfs/c3157c8e8e0e7588312b40c853f65c02fe6c957a.1566399731.git.christophe.leroy@c-s.fr/
[2] https://lore.kernel.org/linux-fsdevel/20190225040904.5557-1-ming.lei@redhat.com/
[3] https://lwn.net/Articles/787740/
[akpm@linux-foundation.org: documentation fixlet, per Matthew]
Link: http://lkml.kernel.org/r/20190826111627.7505-3-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Cc: David Sterba <dsterba@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: "Darrick J . Wong" <darrick.wong@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "guarantee natural alignment for kmalloc()", v2.
This patch (of 2):
SLOB currently doesn't account its pages at all, so in /proc/meminfo the
Slab field shows zero. Modifying a counter on page allocation and
freeing should be acceptable even for the small system scenarios SLOB is
intended for. Since reclaimable caches are not separated in SLOB,
account everything as unreclaimable.
SLUB currently doesn't account kmalloc() and kmalloc_node() allocations
larger than order-1 page, that are passed directly to the page
allocator. As they also don't appear in /proc/slabinfo, it might look
like a memory leak. For consistency, account them as well. (SLAB
doesn't actually use page allocator directly, so no change there).
Ideally SLOB and SLUB would be handled in separate patches, but due to
the shared kmalloc_order() function and different kfree()
implementations, it's easier to patch both at once to prevent
inconsistencies.
Link: http://lkml.kernel.org/r/20190826111627.7505-2-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: "Darrick J . Wong" <darrick.wong@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch is an incremental improvement on the existing
memory.{low,min} relative reclaim work to base its scan pressure
calculations on how much protection is available compared to the current
usage, rather than how much the current usage is over some protection
threshold.
This change doesn't change the experience for the user in the normal
case too much. One benefit is that it replaces the (somewhat arbitrary)
100% cutoff with an indefinite slope, which makes it easier to ballpark
a memory.low value.
As well as this, the old methodology doesn't quite apply generically to
machines with varying amounts of physical memory. Let's say we have a
top level cgroup, workload.slice, and another top level cgroup,
system-management.slice. We want to roughly give 12G to
system-management.slice, so on a 32GB machine we set memory.low to 20GB
in workload.slice, and on a 64GB machine we set memory.low to 52GB.
However, because these are relative amounts to the total machine size,
while the amount of memory we want to generally be willing to yield to
system.slice is absolute (12G), we end up putting more pressure on
system.slice just because we have a larger machine and a larger workload
to fill it, which seems fairly unintuitive. With this new behaviour, we
don't end up with this unintended side effect.
Previously the way that memory.low protection works is that if you are
50% over a certain baseline, you get 50% of your normal scan pressure.
This is certainly better than the previous cliff-edge behaviour, but it
can be improved even further by always considering memory under the
currently enforced protection threshold to be out of bounds. This means
that we can set relatively low memory.low thresholds for variable or
bursty workloads while still getting a reasonable level of protection,
whereas with the previous version we may still trivially hit the 100%
clamp. The previous 100% clamp is also somewhat arbitrary, whereas this
one is more concretely based on the currently enforced protection
threshold, which is likely easier to reason about.
There is also a subtle issue with the way that proportional reclaim
worked previously -- it promotes having no memory.low, since it makes
pressure higher during low reclaim. This happens because we base our
scan pressure modulation on how far memory.current is between memory.min
and memory.low, but if memory.low is unset, we only use the overage
method. In most cromulent configurations, this then means that we end
up with *more* pressure than with no memory.low at all when we're in low
reclaim, which is not really very usable or expected.
With this patch, memory.low and memory.min affect reclaim pressure in a
more understandable and composable way. For example, from a user
standpoint, "protected" memory now remains untouchable from a reclaim
aggression standpoint, and users can also have more confidence that
bursty workloads will still receive some amount of guaranteed
protection.
Link: http://lkml.kernel.org/r/20190322160307.GA3316@chrisdown.name
Signed-off-by: Chris Down <chris@chrisdown.name>
Reviewed-by: Roman Gushchin <guro@fb.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Roman points out that when when we do the low reclaim pass, we scale the
reclaim pressure relative to position between 0 and the maximum
protection threshold.
However, if the maximum protection is based on memory.elow, and
memory.emin is above zero, this means we still may get binary behaviour
on second-pass low reclaim. This is because we scale starting at 0, not
starting at memory.emin, and since we don't scan at all below emin, we
end up with cliff behaviour.
This should be a fairly uncommon case since usually we don't go into the
second pass, but it makes sense to scale our low reclaim pressure
starting at emin.
You can test this by catting two large sparse files, one in a cgroup
with emin set to some moderate size compared to physical RAM, and
another cgroup without any emin. In both cgroups, set an elow larger
than 50% of physical RAM. The one with emin will have less page
scanning, as reclaim pressure is lower.
Rebase on top of and apply the same idea as what was applied to handle
cgroup_memory=disable properly for the original proportional patch
http://lkml.kernel.org/r/20190201045711.GA18302@chrisdown.name ("mm,
memcg: Handle cgroup_disable=memory when getting memcg protection").
Link: http://lkml.kernel.org/r/20190201051810.GA18895@chrisdown.name
Signed-off-by: Chris Down <chris@chrisdown.name>
Suggested-by: Roman Gushchin <guro@fb.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Dennis Zhou <dennis@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
cgroup v2 introduces two memory protection thresholds: memory.low
(best-effort) and memory.min (hard protection). While they generally do
what they say on the tin, there is a limitation in their implementation
that makes them difficult to use effectively: that cliff behaviour often
manifests when they become eligible for reclaim. This patch implements
more intuitive and usable behaviour, where we gradually mount more
reclaim pressure as cgroups further and further exceed their protection
thresholds.
This cliff edge behaviour happens because we only choose whether or not
to reclaim based on whether the memcg is within its protection limits
(see the use of mem_cgroup_protected in shrink_node), but we don't vary
our reclaim behaviour based on this information. Imagine the following
timeline, with the numbers the lruvec size in this zone:
1. memory.low=1000000, memory.current=999999. 0 pages may be scanned.
2. memory.low=1000000, memory.current=1000000. 0 pages may be scanned.
3. memory.low=1000000, memory.current=1000001. 1000001* pages may be
scanned. (?!)
* Of course, we won't usually scan all available pages in the zone even
without this patch because of scan control priority, over-reclaim
protection, etc. However, as shown by the tests at the end, these
techniques don't sufficiently throttle such an extreme change in input,
so cliff-like behaviour isn't really averted by their existence alone.
Here's an example of how this plays out in practice. At Facebook, we are
trying to protect various workloads from "system" software, like
configuration management tools, metric collectors, etc (see this[0] case
study). In order to find a suitable memory.low value, we start by
determining the expected memory range within which the workload will be
comfortable operating. This isn't an exact science -- memory usage deemed
"comfortable" will vary over time due to user behaviour, differences in
composition of work, etc, etc. As such we need to ballpark memory.low,
but doing this is currently problematic:
1. If we end up setting it too low for the workload, it won't have
*any* effect (see discussion above). The group will receive the full
weight of reclaim and won't have any priority while competing with the
less important system software, as if we had no memory.low configured
at all.
2. Because of this behaviour, we end up erring on the side of setting
it too high, such that the comfort range is reliably covered. However,
protected memory is completely unavailable to the rest of the system,
so we might cause undue memory and IO pressure there when we *know* we
have some elasticity in the workload.
3. Even if we get the value totally right, smack in the middle of the
comfort zone, we get extreme jumps between no pressure and full
pressure that cause unpredictable pressure spikes in the workload due
to the current binary reclaim behaviour.
With this patch, we can set it to our ballpark estimation without too much
worry. Any undesirable behaviour, such as too much or too little reclaim
pressure on the workload or system will be proportional to how far our
estimation is off. This means we can set memory.low much more
conservatively and thus waste less resources *without* the risk of the
workload falling off a cliff if we overshoot.
As a more abstract technical description, this unintuitive behaviour
results in having to give high-priority workloads a large protection
buffer on top of their expected usage to function reliably, as otherwise
we have abrupt periods of dramatically increased memory pressure which
hamper performance. Having to set these thresholds so high wastes
resources and generally works against the principle of work conservation.
In addition, having proportional memory reclaim behaviour has other
benefits. Most notably, before this patch it's basically mandatory to set
memory.low to a higher than desirable value because otherwise as soon as
you exceed memory.low, all protection is lost, and all pages are eligible
to scan again. By contrast, having a gradual ramp in reclaim pressure
means that you now still get some protection when thresholds are exceeded,
which means that one can now be more comfortable setting memory.low to
lower values without worrying that all protection will be lost. This is
important because workingset size is really hard to know exactly,
especially with variable workloads, so at least getting *some* protection
if your workingset size grows larger than you expect increases user
confidence in setting memory.low without a huge buffer on top being
needed.
Thanks a lot to Johannes Weiner and Tejun Heo for their advice and
assistance in thinking about how to make this work better.
In testing these changes, I intended to verify that:
1. Changes in page scanning become gradual and proportional instead of
binary.
To test this, I experimented stepping further and further down
memory.low protection on a workload that floats around 19G workingset
when under memory.low protection, watching page scan rates for the
workload cgroup:
+------------+-----------------+--------------------+--------------+
| memory.low | test (pgscan/s) | control (pgscan/s) | % of control |
+------------+-----------------+--------------------+--------------+
| 21G | 0 | 0 | N/A |
| 17G | 867 | 3799 | 23% |
| 12G | 1203 | 3543 | 34% |
| 8G | 2534 | 3979 | 64% |
| 4G | 3980 | 4147 | 96% |
| 0 | 3799 | 3980 | 95% |
+------------+-----------------+--------------------+--------------+
As you can see, the test kernel (with a kernel containing this
patch) ramps up page scanning significantly more gradually than the
control kernel (without this patch).
2. More gradual ramp up in reclaim aggression doesn't result in
premature OOMs.
To test this, I wrote a script that slowly increments the number of
pages held by stress(1)'s --vm-keep mode until a production system
entered severe overall memory contention. This script runs in a highly
protected slice taking up the majority of available system memory.
Watching vmstat revealed that page scanning continued essentially
nominally between test and control, without causing forward reclaim
progress to become arrested.
[0]: https://facebookmicrosites.github.io/cgroup2/docs/overview.html#case-study-the-fbtax2-project
[akpm@linux-foundation.org: reflow block comments to fit in 80 cols]
[chris@chrisdown.name: handle cgroup_disable=memory when getting memcg protection]
Link: http://lkml.kernel.org/r/20190201045711.GA18302@chrisdown.name
Link: http://lkml.kernel.org/r/20190124014455.GA6396@chrisdown.name
Signed-off-by: Chris Down <chris@chrisdown.name>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Roman Gushchin <guro@fb.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The "mode" and "level" variables are enums and in this context GCC will
treat them as unsigned ints so the error handling is never triggered.
I also removed the bogus initializer because it isn't required any more
and it's sort of confusing.
[akpm@linux-foundation.org: reduce implicit and explicit typecasting]
[akpm@linux-foundation.org: fix return value, add comment, per Matthew]
Link: http://lkml.kernel.org/r/20190925110449.GO3264@mwanda
Fixes: 3cadfa2b94 ("mm/vmpressure.c: convert to use match_string() helper")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Acked-by: David Rientjes <rientjes@google.com>
Reviewed-by: Matthew Wilcox <willy@infradead.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Enrico Weigelt <info@metux.net>
Cc: Kate Stewart <kstewart@linuxfoundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There's a really hard to reproduce race in z3fold between z3fold_free()
and z3fold_reclaim_page(). z3fold_reclaim_page() can claim the page
after z3fold_free() has checked if the page was claimed and
z3fold_free() will then schedule this page for compaction which may in
turn lead to random page faults (since that page would have been
reclaimed by then).
Fix that by claiming page in the beginning of z3fold_free() and not
forgetting to clear the claim in the end.
[vitalywool@gmail.com: v2]
Link: http://lkml.kernel.org/r/20190928113456.152742cf@bigdell
Link: http://lkml.kernel.org/r/20190926104844.4f0c6efa1366b8f5741eaba9@gmail.com
Signed-off-by: Vitaly Wool <vitalywool@gmail.com>
Reported-by: Markus Linnala <markus.linnala@gmail.com>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Henry Burns <henrywolfeburns@gmail.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Markus Linnala <markus.linnala@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Partially revert 16db3d3f11 ("kernel/sysctl.c: threads-max observe
limits") because the patch is causing a regression to any workload which
needs to override the auto-tuning of the limit provided by kernel.
set_max_threads is implementing a boot time guesstimate to provide a
sensible limit of the concurrently running threads so that runaways will
not deplete all the memory. This is a good thing in general but there
are workloads which might need to increase this limit for an application
to run (reportedly WebSpher MQ is affected) and that is simply not
possible after the mentioned change. It is also very dubious to
override an admin decision by an estimation that doesn't have any direct
relation to correctness of the kernel operation.
Fix this by dropping set_max_threads from sysctl_max_threads so any
value is accepted as long as it fits into MAX_THREADS which is important
to check because allowing more threads could break internal robust futex
restriction. While at it, do not use MIN_THREADS as the lower boundary
because it is also only a heuristic for automatic estimation and admin
might have a good reason to stop new threads to be created even when
below this limit.
This became more severe when we switched x86 from 4k to 8k kernel
stacks. Starting since 6538b8ea88 ("x86_64: expand kernel stack to
16K") (3.16) we use THREAD_SIZE_ORDER = 2 and that halved the auto-tuned
value.
In the particular case
3.12
kernel.threads-max = 515561
4.4
kernel.threads-max = 200000
Neither of the two values is really insane on 32GB machine.
I am not sure we want/need to tune the max_thread value further. If
anything the tuning should be removed altogether if proven not useful in
general. But we definitely need a way to override this auto-tuning.
Link: http://lkml.kernel.org/r/20190922065801.GB18814@dhcp22.suse.cz
Fixes: 16db3d3f11 ("kernel/sysctl.c: threads-max observe limits")
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Heinrich Schuchardt <xypron.glpk@gmx.de>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In kdump kernel, memcg usually is disabled with 'cgroup_disable=memory'
for saving memory. Now kdump kernel will always panic when dump vmcore
to local disk:
BUG: kernel NULL pointer dereference, address: 0000000000000ab8
Oops: 0000 [#1] SMP NOPTI
CPU: 0 PID: 598 Comm: makedumpfile Not tainted 5.3.0+ #26
Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 10/02/2018
RIP: 0010:mem_cgroup_track_foreign_dirty_slowpath+0x38/0x140
Call Trace:
__set_page_dirty+0x52/0xc0
iomap_set_page_dirty+0x50/0x90
iomap_write_end+0x6e/0x270
iomap_write_actor+0xce/0x170
iomap_apply+0xba/0x11e
iomap_file_buffered_write+0x62/0x90
xfs_file_buffered_aio_write+0xca/0x320 [xfs]
new_sync_write+0x12d/0x1d0
vfs_write+0xa5/0x1a0
ksys_write+0x59/0xd0
do_syscall_64+0x59/0x1e0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
And this will corrupt the 1st kernel too with 'cgroup_disable=memory'.
Via the trace and with debugging, it is pointing to commit 97b27821b4
("writeback, memcg: Implement foreign dirty flushing") which introduced
this regression. Disabling memcg causes the null pointer dereference at
uninitialized data in function mem_cgroup_track_foreign_dirty_slowpath().
Fix it by returning directly if memcg is disabled, but not trying to
record the foreign writebacks with dirty pages.
Link: http://lkml.kernel.org/r/20190924141928.GD31919@MiWiFi-R3L-srv
Fixes: 97b27821b4 ("writeback, memcg: Implement foreign dirty flushing")
Signed-off-by: Baoquan He <bhe@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We get two warnings when build kernel W=1:
mm/shuffle.c:36:12: warning: no previous prototype for `shuffle_show' [-Wmissing-prototypes]
mm/sparse.c:220:6: warning: no previous prototype for `subsection_mask_set' [-Wmissing-prototypes]
Make the functions static to fix this.
Link: http://lkml.kernel.org/r/1566978161-7293-1-git-send-email-wang.yi59@zte.com.cn
Signed-off-by: Yi Wang <wang.yi59@zte.com.cn>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
SECTION_SIZE and SECTION_MASK macros are not getting used anymore. But
they do conflict with existing definitions on arm64 platform causing
following warning during build. Lets drop these unused macros.
mm/memremap.c:16: warning: "SECTION_MASK" redefined
#define SECTION_MASK ~((1UL << PA_SECTION_SHIFT) - 1)
arch/arm64/include/asm/pgtable-hwdef.h:79: note: this is the location of the previous definition
#define SECTION_MASK (~(SECTION_SIZE-1))
mm/memremap.c:17: warning: "SECTION_SIZE" redefined
#define SECTION_SIZE (1UL << PA_SECTION_SHIFT)
arch/arm64/include/asm/pgtable-hwdef.h:78: note: this is the location of the previous definition
#define SECTION_SIZE (_AC(1, UL) << SECTION_SHIFT)
Link: http://lkml.kernel.org/r/1569312010-31313-1-git-send-email-anshuman.khandual@arm.com
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reported-by: kbuild test robot <lkp@intel.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Calling 'panic()' on a kernel with CONFIG_PREEMPT=y can leave the
calling CPU in an infinite loop, but with interrupts and preemption
enabled. From this state, userspace can continue to be scheduled,
despite the system being "dead" as far as the kernel is concerned.
This is easily reproducible on arm64 when booting with "nosmp" on the
command line; a couple of shell scripts print out a periodic "Ping"
message whilst another triggers a crash by writing to
/proc/sysrq-trigger:
| sysrq: Trigger a crash
| Kernel panic - not syncing: sysrq triggered crash
| CPU: 0 PID: 1 Comm: init Not tainted 5.2.15 #1
| Hardware name: linux,dummy-virt (DT)
| Call trace:
| dump_backtrace+0x0/0x148
| show_stack+0x14/0x20
| dump_stack+0xa0/0xc4
| panic+0x140/0x32c
| sysrq_handle_reboot+0x0/0x20
| __handle_sysrq+0x124/0x190
| write_sysrq_trigger+0x64/0x88
| proc_reg_write+0x60/0xa8
| __vfs_write+0x18/0x40
| vfs_write+0xa4/0x1b8
| ksys_write+0x64/0xf0
| __arm64_sys_write+0x14/0x20
| el0_svc_common.constprop.0+0xb0/0x168
| el0_svc_handler+0x28/0x78
| el0_svc+0x8/0xc
| Kernel Offset: disabled
| CPU features: 0x0002,24002004
| Memory Limit: none
| ---[ end Kernel panic - not syncing: sysrq triggered crash ]---
| Ping 2!
| Ping 1!
| Ping 1!
| Ping 2!
The issue can also be triggered on x86 kernels if CONFIG_SMP=n,
otherwise local interrupts are disabled in 'smp_send_stop()'.
Disable preemption in 'panic()' before re-enabling interrupts.
Link: http://lkml.kernel.org/r/20191002123538.22609-1-will@kernel.org
Link: https://lore.kernel.org/r/BX1W47JXPMR8.58IYW53H6M5N@dragonstone
Signed-off-by: Will Deacon <will@kernel.org>
Reported-by: Xogium <contact@xogium.me>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Feng Tang <feng.tang@intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In ocfs2_info_scan_inode_alloc(), there is an if statement on line 283
to check whether inode_alloc is NULL:
if (inode_alloc)
When inode_alloc is NULL, it is used on line 287:
ocfs2_inode_lock(inode_alloc, &bh, 0);
ocfs2_inode_lock_full_nested(inode, ...)
struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
Thus, a possible null-pointer dereference may occur.
To fix this bug, inode_alloc is checked on line 286.
This bug is found by a static analysis tool STCheck written by us.
Link: http://lkml.kernel.org/r/20190726033717.32359-1-baijiaju1990@gmail.com
Signed-off-by: Jia-Ju Bai <baijiaju1990@gmail.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In ocfs2_write_end_nolock(), there are an if statement on lines 1976,
2047 and 2058, to check whether handle is NULL:
if (handle)
When handle is NULL, it is used on line 2045:
ocfs2_update_inode_fsync_trans(handle, inode, 1);
oi->i_sync_tid = handle->h_transaction->t_tid;
Thus, a possible null-pointer dereference may occur.
To fix this bug, handle is checked before calling
ocfs2_update_inode_fsync_trans().
This bug is found by a static analysis tool STCheck written by us.
Link: http://lkml.kernel.org/r/20190726033705.32307-1-baijiaju1990@gmail.com
Signed-off-by: Jia-Ju Bai <baijiaju1990@gmail.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In ocfs2_xa_prepare_entry(), there is an if statement on line 2136 to
check whether loc->xl_entry is NULL:
if (loc->xl_entry)
When loc->xl_entry is NULL, it is used on line 2158:
ocfs2_xa_add_entry(loc, name_hash);
loc->xl_entry->xe_name_hash = cpu_to_le32(name_hash);
loc->xl_entry->xe_name_offset = cpu_to_le16(loc->xl_size);
and line 2164:
ocfs2_xa_add_namevalue(loc, xi);
loc->xl_entry->xe_value_size = cpu_to_le64(xi->xi_value_len);
loc->xl_entry->xe_name_len = xi->xi_name_len;
Thus, possible null-pointer dereferences may occur.
To fix these bugs, if loc-xl_entry is NULL, ocfs2_xa_prepare_entry()
abnormally returns with -EINVAL.
These bugs are found by a static analysis tool STCheck written by us.
[akpm@linux-foundation.org: remove now-unused ocfs2_xa_add_entry()]
Link: http://lkml.kernel.org/r/20190726101447.9153-1-baijiaju1990@gmail.com
Signed-off-by: Jia-Ju Bai <baijiaju1990@gmail.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Unused portion of a part-written fs-block-sized block is not set to zero
in unaligned append direct write.This can lead to serious data
inconsistencies.
Ocfs2 manage disk with cluster size(for example, 1M), part-written in
one cluster will change the cluster state from UN-WRITTEN to WRITTEN,
VFS(function dio_zero_block) doesn't do the cleaning because bh's state
is not set to NEW in function ocfs2_dio_wr_get_block when we write a
WRITTEN cluster. For example, the cluster size is 1M, file size is 8k
and we direct write from 14k to 15k, then 12k~14k and 15k~16k will
contain dirty data.
We have to deal with two cases:
1.The starting position of direct write is outside the file.
2.The starting position of direct write is located in the file.
We need set bh's state to NEW in the first case. In the second case, we
need mapped twice because bh's state of area out file should be set to
NEW while area in file not.
[akpm@linux-foundation.org: coding style fixes]
Link: http://lkml.kernel.org/r/5292e287-8f1a-fd4a-1a14-661e555e0bed@huawei.com
Signed-off-by: Jia Guo <guojia12@huawei.com>
Reviewed-by: Yiwen Jiang <jiangyiwen@huawei.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <joseph.qi@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In commit 9f79b78ef7 ("Convert filldir[64]() from __put_user() to
unsafe_put_user()") I made filldir() use unsafe_put_user(), which
improves code generation on x86 enormously.
But because we didn't have a "unsafe_copy_to_user()", the dirent name
copy was also done by hand with unsafe_put_user() in a loop, and it
turns out that a lot of other architectures didn't like that, because
unlike x86, they have various alignment issues.
Most non-x86 architectures trap and fix it up, and some (like xtensa)
will just fail unaligned put_user() accesses unconditionally. Which
makes that "copy using put_user() in a loop" not work for them at all.
I could make that code do explicit alignment etc, but the architectures
that don't like unaligned accesses also don't really use the fancy
"user_access_begin/end()" model, so they might just use the regular old
__copy_to_user() interface.
So this commit takes that looping implementation, turns it into the x86
version of "unsafe_copy_to_user()", and makes other architectures
implement the unsafe copy version as __copy_to_user() (the same way they
do for the other unsafe_xyz() accessor functions).
Note that it only does this for the copying _to_ user space, and we
still don't have a unsafe version of copy_from_user().
That's partly because we have no current users of it, but also partly
because the copy_from_user() case is slightly different and cannot
efficiently be implemented in terms of a unsafe_get_user() loop (because
gcc can't do asm goto with outputs).
It would be trivial to do this using "rep movsb", which would work
really nicely on newer x86 cores, but really badly on some older ones.
Al Viro is looking at cleaning up all our user copy routines to make
this all a non-issue, but for now we have this simple-but-stupid version
for x86 that works fine for the dirent name copy case because those
names are short strings and we simply don't need anything fancier.
Fixes: 9f79b78ef7 ("Convert filldir[64]() from __put_user() to unsafe_put_user()")
Reported-by: Guenter Roeck <linux@roeck-us.net>
Reported-and-tested-by: Tony Luck <tony.luck@intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In commit 4ed2863951 ("fs, elf: drop MAP_FIXED usage from elf_map") we
changed elf to use MAP_FIXED_NOREPLACE instead of MAP_FIXED for the
executable mappings.
Then, people reported that it broke some binaries that had overlapping
segments from the same file, and commit ad55eac74f ("elf: enforce
MAP_FIXED on overlaying elf segments") re-instated MAP_FIXED for some
overlaying elf segment cases. But only some - despite the summary line
of that commit, it only did it when it also does a temporary brk vma for
one obvious overlapping case.
Now Russell King reports another overlapping case with old 32-bit x86
binaries, which doesn't trigger that limited case. End result: we had
better just drop MAP_FIXED_NOREPLACE entirely, and go back to MAP_FIXED.
Yes, it's a sign of old binaries generated with old tool-chains, but we
do pride ourselves on not breaking existing setups.
This still leaves MAP_FIXED_NOREPLACE in place for the load_elf_interp()
and the old load_elf_library() use-cases, because nobody has reported
breakage for those. Yet.
Note that in all the cases seen so far, the overlapping elf sections
seem to be just re-mapping of the same executable with different section
attributes. We could possibly introduce a new MAP_FIXED_NOFILECHANGE
flag or similar, which acts like NOREPLACE, but allows just remapping
the same executable file using different protection flags.
It's not clear that would make a huge difference to anything, but if
people really hate that "elf remaps over previous maps" behavior, maybe
at least a more limited form of remapping would alleviate some concerns.
Alternatively, we should take a look at our elf_map() logic to see if we
end up not mapping things properly the first time.
In the meantime, this is the minimal "don't do that then" patch while
people hopefully think about it more.
Reported-by: Russell King <linux@armlinux.org.uk>
Fixes: 4ed2863951 ("fs, elf: drop MAP_FIXED usage from elf_map")
Fixes: ad55eac74f ("elf: enforce MAP_FIXED on overlaying elf segments")
Cc: Michal Hocko <mhocko@suse.com>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- revert an incorret hunk from a patch that caused problems
on various arm boards (Andrey Smirnov)
-----BEGIN PGP SIGNATURE-----
iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAl2aFRoLHGhjaEBsc3Qu
ZGUACgkQD55TZVIEUYMBeBAAuTsOh1amMUdsAJN67PJcHP8JkOlR21cjLVaKkvWh
l5XnXITtlNvyzXH67jZuQL15+rQ/kTOkmSc5bIDZW7+sTW2Rwnq6bIQOZpBYKlol
U/UTBtk26SliKoinlJekKWAA6o32PJU2oLOsTmqoCqH5k0aeKdNHAFPSw4fU3jbW
U4Sv0uc6MI+PM9OM3H/T60qQPvziOkeDp4wAZZ5AO/kUbNgzUrbGatNk26QqgNbs
NsAVQ3X/sgUAwXMtivo9nFUd2fuEIf9GueGVohGiW+2znWQ8AxY76/FgxzXICmMi
S0YLqPrdlzzZ0K7k8enPvJo2hd0qh3yFtWyGx9fUt+EBXepp/hMTIRAEVUHpiiSg
PDTU74TVtXwSYvIQA6jR1bwh9+aMyeDWDFzUwFQh34mahAsZsBKhNLAcpN2uNGv7
XLL3Lqi58eIhaSaqxM4ASIsBS+FIiQiOdqq4eLVx+x6wxjNDTyZ+ynbWdNs8+SYh
MIyjY3wibMwaXtFUBV6LgYtwBF/1pgFcu9jWz02HT7Od0c+Et04ihcXISH+w9fpB
O5WFjo0Oag2HoNm1ODOlLu5DY9CSQftrv4zl0yTQgb1vFB3fPdtv43wIQ8SkVhVu
kwuF4kgIAyRRoe7HCPK/FJjKiYCo6Y+3WZ/4X7ktddCpxjaVYfclv8pMotirCQPU
SSo=
=kS6W
-----END PGP SIGNATURE-----
Merge tag 'dma-mapping-5.4-1' of git://git.infradead.org/users/hch/dma-mapping
Pull dma-mapping regression fix from Christoph Hellwig:
"Revert an incorret hunk from a patch that caused problems on various
arm boards (Andrey Smirnov)"
* tag 'dma-mapping-5.4-1' of git://git.infradead.org/users/hch/dma-mapping:
dma-mapping: fix false positive warnings in dma_common_free_remap()
A few fixes this time around:
- Fixup of some clock specifications for DRA7 (device-tree fix)
- Removal of some dead/legacy CPU OPP/PM code for OMAP that throws
warnings at boot
- A few more minor fixups for OMAPs, most around display
- Enable STM32 QSPI as =y since their rootfs sometimes comes from
there
- Switch CONFIG_REMOTEPROC to =y since it went from tristate to bool
- Fix of thermal zone definition for ux500 (5.4 regression)
-----BEGIN PGP SIGNATURE-----
iQJDBAABCAAtFiEElf+HevZ4QCAJmMQ+jBrnPN6EHHcFAl2ZG30PHG9sb2ZAbGl4
b20ubmV0AAoJEIwa5zzehBx3BuIP/25F21BXvyQ4KdgFqHWdZFU31u9sjxNk8CpX
65xTHew5SdSV4B6Ljx0CD3BbeaiFzy7MpmY229k4bLCbMYJl27Knk9BVjVHcO58+
r8MjX1T5S4x/T8Isrw7uaTHWJKrqqWHwNwiEN7gVidtggx/I523Bkd/4JQyOeAK/
NHju39fdg2E91gkVa/LB5G4Aud2bGDdCFGqtUQmpxHNIT6tQdafcMyJuH7R0tO7m
80zsbLGtDXdIg+hroCQqq4BcWKrVpiYU1DgN/EicQeijj+ZUzvuLyq+6NF3J6clf
XXTshGFvuYo+Yn4bz4j+Pt+VRitMUMEBRxpxRAN8vSJde09rqJyE1fyqZWlaRyZm
8q6OkGLQSYD51qpdliIQzG2zWvG8BVNs8YRbamF8UF8bZymGzAfABSZSqEoRImpl
+tDOUMwMOjgZbBnKe8abt+8KmN4rKbHBF34OnA+LNrCNnvcXehm0G+8ZH9ypiuYP
/TespH7BkolCF7PR0VBhSUmrW4TvaaJmVt1b7oZCwu8j1R2lH8OGh8XeiAVuyhat
loQrCRrur0S7jv/6loDUEUixsSRABQaEQUHmBjGax4njyThtUgZtHlErlg++0qIP
0IAidQcdHF3MrhlwlBU5kKC8IZSg2rUFzKerRYbtFAFWN3T9/id1XkYKRTEfcWjv
NvBLvYyJ
=+lB0
-----END PGP SIGNATURE-----
Merge tag 'armsoc-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
Pull ARM SoC fixes from Olof Johansson:
"A few fixes this time around:
- Fixup of some clock specifications for DRA7 (device-tree fix)
- Removal of some dead/legacy CPU OPP/PM code for OMAP that throws
warnings at boot
- A few more minor fixups for OMAPs, most around display
- Enable STM32 QSPI as =y since their rootfs sometimes comes from
there
- Switch CONFIG_REMOTEPROC to =y since it went from tristate to bool
- Fix of thermal zone definition for ux500 (5.4 regression)"
* tag 'armsoc-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc:
ARM: multi_v7_defconfig: Fix SPI_STM32_QSPI support
ARM: dts: ux500: Fix up the CPU thermal zone
arm64/ARM: configs: Change CONFIG_REMOTEPROC from m to y
ARM: dts: am4372: Set memory bandwidth limit for DISPC
ARM: OMAP2+: Fix warnings with broken omap2_set_init_voltage()
ARM: OMAP2+: Add missing LCDC midlemode for am335x
ARM: OMAP2+: Fix missing reset done flag for am3 and am43
ARM: dts: Fix gpio0 flags for am335x-icev2
ARM: omap2plus_defconfig: Enable more droid4 devices as loadable modules
ARM: omap2plus_defconfig: Enable DRM_TI_TFP410
DTS: ARM: gta04: introduce legacy spi-cs-high to make display work again
ARM: dts: Fix wrong clocks for dra7 mcasp
clk: ti: dra7: Fix mcasp8 clock bits
- remove unneeded ar-option and KBUILD_ARFLAGS
- remove long-deprecated SUBDIRS
- fix modpost to suppress false-positive warnings for UML builds
- fix namespace.pl to handle relative paths to ${objtree}, ${srctree}
- make setlocalversion work for /bin/sh
- make header archive reproducible
- fix some Makefiles and documents
-----BEGIN PGP SIGNATURE-----
iQJSBAABCgA8FiEEbmPs18K1szRHjPqEPYsBB53g2wYFAl2YPUEeHHlhbWFkYS5t
YXNhaGlyb0Bzb2Npb25leHQuY29tAAoJED2LAQed4NsGVu4P/3Qv7Ov3/R4BlgYb
+LaKupCY/ADE5bRAv/N5AAy37+TJmTLQswN2/JwHflYvTeWd4kZjquFpJkFNwMsk
Qlb79NQvyM9+NlFfSFjap8HBNb0J8A+92aKmrHmh1sQqJJs6JPZ0MOGoAXmgsJaN
SPLvhqophKpmYu7Oa0x2aC2kq+1DnCQyMLTOuVCdrtF0tF8w0hiowDz5GOmOi1U6
VK2ECfzjenFkfbqZOUVBPVfPR9hMpmVBdKdFLwD/iTKVkShZcWmdbxk/ADbemyet
2njehRF2HGp7opbwM4AxIeIubCqYSkThUpLJarKWk/8W87gksH6pCR8yIq1nOwkO
l+/GY2YdvkBdDCkSKpLiQxtJaqnZb8Yv1ZPvCfGF09Ba8tFtwX+HSecSkLFHGyJv
K9FD0XSOFBkQesZWdpIr/xeLwwiuSH80QACrub1Z5Q4OCURaBkKwrO/eDG1/2Xle
YKGZO2va2VVkeo5bisOZ2vfISwZrtiaGakQ8vTdq/5RO59/JvQjsGB8KbccaKXAu
Ozk8vVqkwTmLP6gzIEd2Wr/ICNGuAVc0EELT7lj07hcd6rzsCxPWVXqTFsCkGBJe
587i1jeH1z9oyrHUcP6dhR3joIuOUuUJk1uR7YZq4L4POSvrJnvzMFkSv6tBKL2p
Uq9qD7mpt/9zl3PART7HK9KYfTGJ
=fSXc
-----END PGP SIGNATURE-----
Merge tag 'kbuild-fixes-v5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
Pull Kbuild fixes from Masahiro Yamada:
- remove unneeded ar-option and KBUILD_ARFLAGS
- remove long-deprecated SUBDIRS
- fix modpost to suppress false-positive warnings for UML builds
- fix namespace.pl to handle relative paths to ${objtree}, ${srctree}
- make setlocalversion work for /bin/sh
- make header archive reproducible
- fix some Makefiles and documents
* tag 'kbuild-fixes-v5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
kheaders: make headers archive reproducible
kbuild: update compile-test header list for v5.4-rc2
kbuild: two minor updates for Documentation/kbuild/modules.rst
scripts/setlocalversion: clear local variable to make it work for sh
namespace: fix namespace.pl script to support relative paths
video/logo: do not generate unneeded logo C files
video/logo: remove unneeded *.o pattern from clean-files
integrity: remove pointless subdir-$(CONFIG_...)
integrity: remove unneeded, broken attempt to add -fshort-wchar
modpost: fix static EXPORT_SYMBOL warnings for UML build
kbuild: correct formatting of header in kbuild module docs
kbuild: remove SUBDIRS support
kbuild: remove ar-option and KBUILD_ARFLAGS
Twelve patches mostly small but obvious fixes or cosmetic but small
updates.
Signed-off-by: James E.J. Bottomley <jejb@linux.ibm.com>
-----BEGIN PGP SIGNATURE-----
iJwEABMIAEQWIQTnYEDbdso9F2cI+arnQslM7pishQUCXZgfWiYcamFtZXMuYm90
dG9tbGV5QGhhbnNlbnBhcnRuZXJzaGlwLmNvbQAKCRDnQslM7pishaVOAQDnuANx
QGEuQ1dZPALeZPOlEOsJzzpHPd3O+mQauIE96wD9FMypt/UKF9+fvlp4mCP+ya66
0fz1kmTQIcAADdYaNYM=
=aQi7
-----END PGP SIGNATURE-----
Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI fixes from James Bottomley:
"Twelve patches mostly small but obvious fixes or cosmetic but small
updates"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: qla2xxx: Fix Nport ID display value
scsi: qla2xxx: Fix N2N link up fail
scsi: qla2xxx: Fix N2N link reset
scsi: qla2xxx: Optimize NPIV tear down process
scsi: qla2xxx: Fix stale mem access on driver unload
scsi: qla2xxx: Fix unbound sleep in fcport delete path.
scsi: qla2xxx: Silence fwdump template message
scsi: hisi_sas: Make three functions static
scsi: megaraid: disable device when probe failed after enabled device
scsi: storvsc: setup 1:1 mapping between hardware queue and CPU queue
scsi: qedf: Remove always false 'tmp_prio < 0' statement
scsi: ufs: skip shutdown if hba is not powered
scsi: bnx2fc: Handle scope bits when array returns BUSY or TSF
This makes getdents() and getdents64() do sanity checking on the
pathname that it gives to user space. And to mitigate the performance
impact of that, it first cleans up the way it does the user copying, so
that the code avoids doing the SMAP/PAN updates between each part of the
dirent structure write.
I really wanted to do this during the merge window, but didn't have
time. The conversion of filldir to unsafe_put_user() is something I've
had around for years now in a private branch, but the extra pathname
checking finally made me clean it up to the point where it is mergable.
It's worth noting that the filename validity checking really should be a
bit smarter: it would be much better to delay the error reporting until
the end of the readdir, so that non-corrupted filenames are still
returned. But that involves bigger changes, so let's see if anybody
actually hits the corrupt directory entry case before worrying about it
further.
* branch 'readdir':
Make filldir[64]() verify the directory entry filename is valid
Convert filldir[64]() from __put_user() to unsafe_put_user()
This has been discussed several times, and now filesystem people are
talking about doing it individually at the filesystem layer, so head
that off at the pass and just do it in getdents{64}().
This is partially based on a patch by Jann Horn, but checks for NUL
bytes as well, and somewhat simplified.
There's also commentary about how it might be better if invalid names
due to filesystem corruption don't cause an immediate failure, but only
an error at the end of the readdir(), so that people can still see the
filenames that are ok.
There's also been discussion about just how much POSIX strictly speaking
requires this since it's about filesystem corruption. It's really more
"protect user space from bad behavior" as pointed out by Jann. But
since Eric Biederman looked up the POSIX wording, here it is for context:
"From readdir:
The readdir() function shall return a pointer to a structure
representing the directory entry at the current position in the
directory stream specified by the argument dirp, and position the
directory stream at the next entry. It shall return a null pointer
upon reaching the end of the directory stream. The structure dirent
defined in the <dirent.h> header describes a directory entry.
From definitions:
3.129 Directory Entry (or Link)
An object that associates a filename with a file. Several directory
entries can associate names with the same file.
...
3.169 Filename
A name consisting of 1 to {NAME_MAX} bytes used to name a file. The
characters composing the name may be selected from the set of all
character values excluding the slash character and the null byte. The
filenames dot and dot-dot have special meaning. A filename is
sometimes referred to as a 'pathname component'."
Note that I didn't bother adding the checks to any legacy interfaces
that nobody uses.
Also note that if this ends up being noticeable as a performance
regression, we can fix that to do a much more optimized model that
checks for both NUL and '/' at the same time one word at a time.
We haven't really tended to optimize 'memchr()', and it only checks for
one pattern at a time anyway, and we really _should_ check for NUL too
(but see the comment about "soft errors" in the code about why it
currently only checks for '/')
See the CONFIG_DCACHE_WORD_ACCESS case of hash_name() for how the name
lookup code looks for pathname terminating characters in parallel.
Link: https://lore.kernel.org/lkml/20190118161440.220134-2-jannh@google.com/
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Jann Horn <jannh@google.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We really should avoid the "__{get,put}_user()" functions entirely,
because they can easily be mis-used and the original intent of being
used for simple direct user accesses no longer holds in a post-SMAP/PAN
world.
Manually optimizing away the user access range check makes no sense any
more, when the range check is generally much cheaper than the "enable
user accesses" code that the __{get,put}_user() functions still need.
So instead of __put_user(), use the unsafe_put_user() interface with
user_access_{begin,end}() that really does generate better code these
days, and which is generally a nicer interface. Under some loads, the
multiple user writes that filldir() does are actually quite noticeable.
This also makes the dirent name copy use unsafe_put_user() with a couple
of macros. We do not want to make function calls with SMAP/PAN
disabled, and the code this generates is quite good when the
architecture uses "asm goto" for unsafe_put_user() like x86 does.
Note that this doesn't bother with the legacy cases. Nobody should use
them anyway, so performance doesn't really matter there.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull networking fixes from David Miller:
1) Fix ieeeu02154 atusb driver use-after-free, from Johan Hovold.
2) Need to validate TCA_CBQ_WRROPT netlink attributes, from Eric
Dumazet.
3) txq null deref in mac80211, from Miaoqing Pan.
4) ionic driver needs to select NET_DEVLINK, from Arnd Bergmann.
5) Need to disable bh during nft_connlimit GC, from Pablo Neira Ayuso.
6) Avoid division by zero in taprio scheduler, from Vladimir Oltean.
7) Various xgmac fixes in stmmac driver from Jose Abreu.
8) Avoid 64-bit division in mlx5 leading to link errors on 32-bit from
Michal Kubecek.
9) Fix bad VLAN check in rtl8366 DSA driver, from Linus Walleij.
10) Fix sleep while atomic in sja1105, from Vladimir Oltean.
11) Suspend/resume deadlock in stmmac, from Thierry Reding.
12) Various UDP GSO fixes from Josh Hunt.
13) Fix slab out of bounds access in tcp_zerocopy_receive(), from Eric
Dumazet.
14) Fix OOPS in __ipv6_ifa_notify(), from David Ahern.
15) Memory leak in NFC's llcp_sock_bind, from Eric Dumazet.
* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (72 commits)
selftests/net: add nettest to .gitignore
net: qlogic: Fix memory leak in ql_alloc_large_buffers
nfc: fix memory leak in llcp_sock_bind()
sch_dsmark: fix potential NULL deref in dsmark_init()
net: phy: at803x: use operating parameters from PHY-specific status
net: phy: extract pause mode
net: phy: extract link partner advertisement reading
net: phy: fix write to mii-ctrl1000 register
ipv6: Handle missing host route in __ipv6_ifa_notify
net: phy: allow for reset line to be tied to a sleepy GPIO controller
net: ipv4: avoid mixed n_redirects and rate_tokens usage
r8152: Set macpassthru in reset_resume callback
cxgb4:Fix out-of-bounds MSI-X info array access
Revert "ipv6: Handle race in addrconf_dad_work"
net: make sock_prot_memory_pressure() return "const char *"
rxrpc: Fix rxrpc_recvmsg tracepoint
qmi_wwan: add support for Cinterion CLS8 devices
tcp: fix slab-out-of-bounds in tcp_zerocopy_receive()
lib: textsearch: fix escapes in example code
udp: only do GSO if # of segs > 1
...
- Default configs updates.
- Fix build errors with CC_OPTIMIZE_FOR_SIZE due to usage of "i" constraint
for function arguments. Two kvm changes acked-by Christian Borntraeger.
- Fix -Wunused-but-set-variable warnings in mm code.
- Avoid a constant misuse in qdio.
- Handle a case when cpumf is temporarily unavailable.
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEE3QHqV+H2a8xAv27vjYWKoQLXFBgFAl2YhB4ACgkQjYWKoQLX
FBhPigf9Fz/7YLA/9c23TP2OJvdW4pHYn+5DokhKxBKnOV3akKNeZ0wflrQRmcas
AI28fSlI08w6Nqrc5rX7V5cAy9hGVn5QHF8+gKVnw4QgqBINiiqVt8hjQoxWIL0r
HPyib3CxUgaNTOLIXd+4CHW+SdhJp38OItFMp/ctCAnv67oc7dWMhKNfhdU/pdJq
qDWZC7hhejOcvAogyL9mLuMCTT8uJBlqrooFbEBWR10vCfcgLUF2VsGOTXPYDdSJ
CV/c46FGR+2ENflHr2n/IKzJBB105AeyBSuFvRGhyTD7SDB47qFZf34tTuRMAOjE
LToBupk9q4hxPcTxnDRPwZq4b1GuYg==
=23JK
-----END PGP SIGNATURE-----
Merge tag 's390-5.4-3' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull s390 fixes from Vasily Gorbik:
- defconfig updates
- Fix build errors with CC_OPTIMIZE_FOR_SIZE due to usage of "i"
constraint for function arguments. Two kvm changes acked-by Christian
Borntraeger.
- Fix -Wunused-but-set-variable warnings in mm code.
- Avoid a constant misuse in qdio.
- Handle a case when cpumf is temporarily unavailable.
* tag 's390-5.4-3' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
KVM: s390: mark __insn32_query() as __always_inline
KVM: s390: fix __insn32_query() inline assembly
s390: update defconfigs
s390/pci: mark function(s) __always_inline
s390/mm: mark function(s) __always_inline
s390/jump_label: mark function(s) __always_inline
s390/cpu_mf: mark function(s) __always_inline
s390/atomic,bitops: mark function(s) __always_inline
s390/mm: fix -Wunused-but-set-variable warnings
s390: mark __cpacf_query() as __always_inline
s390/qdio: clarify size of the QIB parm area
s390/cpumf: Fix indentation in sampling device driver
s390/cpumsf: Check for CPU Measurement sampling
s390/cpumf: Use consistant debug print format
__insn32_query() will not compile if the compiler decides to not
inline it, since it contains an inline assembly with an "i" constraint
with variable contents.
Acked-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
The inline assembly constraints of __insn32_query() tell the compiler
that only the first byte of "query" is being written to. Intended was
probably that 32 bytes are written to.
Fix and simplify the code and just use a "memory" clobber.
Fixes: d668139718 ("KVM: s390: provide query function for instructions returning 32 byte")
Cc: stable@vger.kernel.org # v5.2+
Acked-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
In commit 43d8ce9d65 ("Provide in-kernel headers to make
extending kernel easier") a new mechanism was introduced, for kernels
>=5.2, which embeds the kernel headers in the kernel image or a module
and exposes them in procfs for use by userland tools.
The archive containing the header files has nondeterminism caused by
header files metadata. This patch normalizes the metadata and utilizes
KBUILD_BUILD_TIMESTAMP if provided and otherwise falls back to the
default behaviour.
In commit f7b101d330 ("kheaders: Move from proc to sysfs") it was
modified to use sysfs and the script for generation of the archive was
renamed to what is being patched.
Signed-off-by: Dmitry Goldin <dgoldin+lkml@protonmail.ch>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Commit 6dc280ebee ("coda: remove uapi/linux/coda_psdev.h") removed
a header in question. Some more build errors were fixed. Add more
headers into the test coverage.
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Capitalize the first word in the sentence.
Use obj-m instead of obj-y. obj-y still works, but we have no built-in
objects in external module builds. So, obj-m is better IMHO.
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Geert Uytterhoeven reports a strange side-effect of commit 858805b336
("kbuild: add $(BASH) to run scripts with bash-extension"), which
inserts the contents of a localversion file in the build directory twice.
[Steps to Reproduce]
$ echo bar > localversion
$ mkdir build
$ cd build/
$ echo foo > localversion
$ make -s -f ../Makefile defconfig include/config/kernel.release
$ cat include/config/kernel.release
5.4.0-rc1foofoobar
This comes down to the behavior change of local variables.
The 'man sh' on my Ubuntu machine, where sh is an alias to dash,
explains as follows:
When a variable is made local, it inherits the initial value and
exported and readonly flags from the variable with the same name
in the surrounding scope, if there is one. Otherwise, the variable
is initially unset.
[Test Code]
foo ()
{
local res
echo "res: $res"
}
res=1
foo
[Result]
$ sh test.sh
res: 1
$ bash test.sh
res:
So, scripts/setlocalversion correctly works only for bash in spite of
its hashbang being #!/bin/sh. Nobody had noticed it before because
CONFIG_SHELL was previously set to bash almost all the time.
Now that CONFIG_SHELL is set to sh, we must write portable and correct
code. I gave the Fixes tag to the commit that uncovered the issue.
Clear the variable 'res' in collect_files() to make it work for sh
(and it also works on distributions where sh is an alias to bash).
Fixes: 858805b336 ("kbuild: add $(BASH) to run scripts with bash-extension")
Reported-by: Geert Uytterhoeven <geert+renesas@glider.be>
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Tested-by: Geert Uytterhoeven <geert+renesas@glider.be>
The namespace.pl script does not work properly if objtree is not set to
an absolute path. The do_nm function is run from within the find
function, which changes directories.
Because of this, appending objtree, $File::Find::dir, and $source, will
return a path which is not valid from the current directory.
This used to work when objtree was set to an absolute path when using
"make namespacecheck". It appears to have not worked when calling
./scripts/namespace.pl directly.
This behavior was changed in 7e1c04779e ("kbuild: Use relative path
for $(objtree)", 2014-05-14)
Rather than fixing the Makefile to set objtree to an absolute path, just
fix namespace.pl to work when srctree and objtree are relative. Also fix
the script to use an absolute path for these by default.
Use the File::Spec module for this purpose. It's been part of perl
5 since 5.005.
The curdir() function is used to get the current directory when the
objtree and srctree aren't set in the environment.
rel2abs() is used to convert possibly relative objtree and srctree
environment variables to absolute paths.
Finally, the catfile() function is used instead of string appending
paths together, since this is more robust when joining paths together.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Tested-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Currently, all the logo C files are generated irrespective of the
CONFIG options. Adding them to extra-y is wrong. What we need to do
here is to add them to 'targets' so that if_changed works properly.
Files listed in 'targets' are cleaned, so clean-files is unneeded.
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>