Without program headers for PTI entry trampoline pages, the trampoline
virtual addresses do not map to anything.
Example before:
sudo gdb --quiet vmlinux /proc/kcore
Reading symbols from vmlinux...done.
[New process 1]
Core was generated by `BOOT_IMAGE=/boot/vmlinuz-4.16.0 root=UUID=a6096b83-b763-4101-807e-f33daff63233'.
#0 0x0000000000000000 in irq_stack_union ()
(gdb) x /21ib 0xfffffe0000006000
0xfffffe0000006000: Cannot access memory at address 0xfffffe0000006000
(gdb) quit
After:
sudo gdb --quiet vmlinux /proc/kcore
[sudo] password for ahunter:
Reading symbols from vmlinux...done.
[New process 1]
Core was generated by `BOOT_IMAGE=/boot/vmlinuz-4.16.0-fix-4-00005-gd6e65a8b4072 root=UUID=a6096b83-b7'.
#0 0x0000000000000000 in irq_stack_union ()
(gdb) x /21ib 0xfffffe0000006000
0xfffffe0000006000: swapgs
0xfffffe0000006003: mov %rsp,-0x3e12(%rip) # 0xfffffe00000021f8
0xfffffe000000600a: xchg %ax,%ax
0xfffffe000000600c: mov %cr3,%rsp
0xfffffe000000600f: bts $0x3f,%rsp
0xfffffe0000006014: and $0xffffffffffffe7ff,%rsp
0xfffffe000000601b: mov %rsp,%cr3
0xfffffe000000601e: mov -0x3019(%rip),%rsp # 0xfffffe000000300c
0xfffffe0000006025: pushq $0x2b
0xfffffe0000006027: pushq -0x3e35(%rip) # 0xfffffe00000021f8
0xfffffe000000602d: push %r11
0xfffffe000000602f: pushq $0x33
0xfffffe0000006031: push %rcx
0xfffffe0000006032: push %rdi
0xfffffe0000006033: mov $0xffffffff91a00010,%rdi
0xfffffe000000603a: callq 0xfffffe0000006046
0xfffffe000000603f: pause
0xfffffe0000006041: lfence
0xfffffe0000006044: jmp 0xfffffe000000603f
0xfffffe0000006046: mov %rdi,(%rsp)
0xfffffe000000604a: retq
(gdb) quit
In addition, entry trampolines all map to the same page. Represent that
by giving the corresponding program headers in kcore the same offset.
This has the benefit that, when perf tools uses /proc/kcore as a source
for kernel object code, samples from different CPU trampolines are
aggregated together. Note, such aggregation is normal for profiling
i.e. people want to profile the object code, not every different virtual
address the object code might be mapped to (across different processes
for example).
Notes by PeterZ:
This also adds the KCORE_REMAP functionality.
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Acked-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: x86@kernel.org
Link: http://lkml.kernel.org/r/1528289651-4113-4-git-send-email-adrian.hunter@intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Pull timer updates from Thomas Gleixner:
"The timers departement more or less proudly presents:
- More Y2038 timekeeping work mostly in the core code. The work is
slowly, but steadily targeting the actuall syscalls.
- Enhanced timekeeping suspend/resume support by utilizing
clocksources which do not stop during suspend, but are otherwise
not the main timekeeping clocksources.
- Make NTP adjustmets more accurate and immediate when the frequency
is set directly and not incrementally.
- Sanitize the overrung handing of posix timers
- A new timer driver for Mediatek SoCs
- The usual pile of fixes and updates all over the place"
* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (32 commits)
clockevents: Warn if cpu_all_mask is used as cpumask
tick/broadcast-hrtimer: Use cpu_possible_mask for ce_broadcast_hrtimer
clocksource/drivers/arm_arch_timer: Fix bogus cpu_all_mask usage
clocksource: ti-32k: Remove CLOCK_SOURCE_SUSPEND_NONSTOP flag
timers: Clear timer_base::must_forward_clk with timer_base::lock held
clocksource/drivers/sprd: Register one always-on timer to compensate suspend time
clocksource/drivers/timer-mediatek: Add support for system timer
clocksource/drivers/timer-mediatek: Convert the driver to timer-of
clocksource/drivers/timer-mediatek: Use specific prefix for GPT
clocksource/drivers/timer-mediatek: Rename mtk_timer to timer-mediatek
clocksource/drivers/timer-mediatek: Add system timer bindings
clocksource/drivers: Set clockevent device cpumask to cpu_possible_mask
time: Introduce one suspend clocksource to compensate the suspend time
time: Fix extra sleeptime injection when suspend fails
timekeeping/ntp: Constify some function arguments
ntp: Use kstrtos64 for s64 variable
ntp: Remove redundant arguments
timer: Fix coding style
ktime: Provide typesafe ktime_to_ns()
hrtimer: Improve kernel message printing
...
Pull locking/atomics update from Thomas Gleixner:
"The locking, atomics and memory model brains delivered:
- A larger update to the atomics code which reworks the ordering
barriers, consolidates the atomic primitives, provides the new
atomic64_fetch_add_unless() primitive and cleans up the include
hell.
- Simplify cmpxchg() instrumentation and add instrumentation for
xchg() and cmpxchg_double().
- Updates to the memory model and documentation"
* 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (48 commits)
locking/atomics: Rework ordering barriers
locking/atomics: Instrument cmpxchg_double*()
locking/atomics: Instrument xchg()
locking/atomics: Simplify cmpxchg() instrumentation
locking/atomics/x86: Reduce arch_cmpxchg64*() instrumentation
tools/memory-model: Rename litmus tests to comply to norm7
tools/memory-model/Documentation: Fix typo, smb->smp
sched/Documentation: Update wake_up() & co. memory-barrier guarantees
locking/spinlock, sched/core: Clarify requirements for smp_mb__after_spinlock()
sched/core: Use smp_mb() in wake_woken_function()
tools/memory-model: Add informal LKMM documentation to MAINTAINERS
locking/atomics/Documentation: Describe atomic_set() as a write operation
tools/memory-model: Make scripts executable
tools/memory-model: Remove ACCESS_ONCE() from model
tools/memory-model: Remove ACCESS_ONCE() from recipes
locking/memory-barriers.txt/kokr: Update Korean translation to fix broken DMA vs. MMIO ordering example
MAINTAINERS: Add Daniel Lustig as an LKMM reviewer
tools/memory-model: Fix ISA2+pooncelock+pooncelock+pombonce name
tools/memory-model: Add litmus test for full multicopy atomicity
locking/refcount: Always allow checked forms
...
Pull EFI updates from Thomas Gleixner:
"The EFI pile:
- Make mixed mode UEFI runtime service invocations mutually
exclusive, as mandated by the UEFI spec
- Perform UEFI runtime services calls from a work queue so the calls
into the firmware occur from a kernel thread
- Honor the UEFI memory map attributes for live memory regions
configured by UEFI as a framebuffer. This works around a coherency
problem with KVM guests running on ARM.
- Cleanups, improvements and fixes all over the place"
* 'efi-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
efivars: Call guid_parse() against guid_t type of variable
efi/cper: Use consistent types for UUIDs
efi/x86: Replace references to efi_early->is64 with efi_is_64bit()
efi: Deduplicate efi_open_volume()
efi/x86: Add missing NULL initialization in UGA draw protocol discovery
efi/x86: Merge 32-bit and 64-bit UGA draw protocol setup routines
efi/x86: Align efi_uga_draw_protocol typedef names to convention
efi/x86: Merge the setup_efi_pci32() and setup_efi_pci64() routines
efi/x86: Prevent reentrant firmware calls in mixed mode
efi/esrt: Only call efi_mem_reserve() for boot services memory
fbdev/efifb: Honour UEFI memory map attributes when mapping the FB
efi: Drop type and attribute checks in efi_mem_desc_lookup()
efi/libstub/arm: Add opt-in Kconfig option for the DTB loader
efi: Remove the declaration of efi_late_init() as the function is unused
efi/cper: Avoid using get_seconds()
efi: Use a work queue to invoke EFI Runtime Services
efi/x86: Use non-blocking SetVariable() for efi_delete_dummy_variable()
efi/x86: Clean up the eboot code
Pull vfs fixes from Al Viro:
"A bunch of race fixes, mostly around lazy pathwalk.
All of it is -stable fodder, a large part going back to 2013"
* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
make sure that __dentry_kill() always invalidates d_seq, unhashed or not
fix __legitimize_mnt()/mntput() race
fix mntput/mntput race
root dentries need RCU-delayed freeing
RCU pathwalk relies upon the assumption that anything that changes
->d_inode of a dentry will invalidate its ->d_seq. That's almost
true - the one exception is that the final dput() of already unhashed
dentry does *not* touch ->d_seq at all. Unhashing does, though,
so for anything we'd found by RCU dcache lookup we are fine.
Unfortunately, we can *start* with an unhashed dentry or jump into
it.
We could try and be careful in the (few) places where that could
happen. Or we could just make the final dput() invalidate the damn
thing, unhashed or not. The latter is much simpler and easier to
backport, so let's do it that way.
Reported-by: "Dae R. Jeong" <threeearcat@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
__legitimize_mnt() has two problems - one is that in case of success
the check of mount_lock is not ordered wrt preceding increment of
refcount, making it possible to have successful __legitimize_mnt()
on one CPU just before the otherwise final mntpu() on another,
with __legitimize_mnt() not seeing mntput() taking the lock and
mntput() not seeing the increment done by __legitimize_mnt().
Solved by a pair of barriers.
Another is that failure of __legitimize_mnt() on the second
read_seqretry() leaves us with reference that'll need to be
dropped by caller; however, if that races with final mntput()
we can end up with caller dropping rcu_read_lock() and doing
mntput() to release that reference - with the first mntput()
having freed the damn thing just as rcu_read_lock() had been
dropped. Solution: in "do mntput() yourself" failure case
grab mount_lock, check if MNT_DOOMED has been set by racing
final mntput() that has missed our increment and if it has -
undo the increment and treat that as "failure, caller doesn't
need to drop anything" case.
It's not easy to hit - the final mntput() has to come right
after the first read_seqretry() in __legitimize_mnt() *and*
manage to miss the increment done by __legitimize_mnt() before
the second read_seqretry() in there. The things that are almost
impossible to hit on bare hardware are not impossible on SMP
KVM, though...
Reported-by: Oleg Nesterov <oleg@redhat.com>
Fixes: 48a066e72d ("RCU'd vsfmounts")
Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
mntput_no_expire() does the calculation of total refcount under mount_lock;
unfortunately, the decrement (as well as all increments) are done outside
of it, leading to false positives in the "are we dropping the last reference"
test. Consider the following situation:
* mnt is a lazy-umounted mount, kept alive by two opened files. One
of those files gets closed. Total refcount of mnt is 2. On CPU 42
mntput(mnt) (called from __fput()) drops one reference, decrementing component
* After it has looked at component #0, the process on CPU 0 does
mntget(), incrementing component #0, gets preempted and gets to run again -
on CPU 69. There it does mntput(), which drops the reference (component #69)
and proceeds to spin on mount_lock.
* On CPU 42 our first mntput() finishes counting. It observes the
decrement of component #69, but not the increment of component #0. As the
result, the total it gets is not 1 as it should've been - it's 0. At which
point we decide that vfsmount needs to be killed and proceed to free it and
shut the filesystem down. However, there's still another opened file
on that filesystem, with reference to (now freed) vfsmount, etc. and we are
screwed.
It's not a wide race, but it can be reproduced with artificial slowdown of
the mnt_get_count() loop, and it should be easier to hit on SMP KVM setups.
Fix consists of moving the refcount decrement under mount_lock; the tricky
part is that we want (and can) keep the fast case (i.e. mount that still
has non-NULL ->mnt_ns) entirely out of mount_lock. All places that zero
mnt->mnt_ns are dropping some reference to mnt and they call synchronize_rcu()
before that mntput(). IOW, if mntput() observes (under rcu_read_lock())
a non-NULL ->mnt_ns, it is guaranteed that there is another reference yet to
be dropped.
Reported-by: Jann Horn <jannh@google.com>
Tested-by: Jann Horn <jannh@google.com>
Fixes: 48a066e72d ("RCU'd vsfmounts")
Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Since mountpoint crossing can happen without leaving lazy mode,
root dentries do need the same protection against having their
memory freed without RCU delay as everything else in the tree.
It's partially hidden by RCU delay between detaching from the
mount tree and dropping the vfsmount reference, but the starting
point of pathwalk can be on an already detached mount, in which
case umount-caused RCU delay has already passed by the time the
lazy pathwalk grabs rcu_read_lock(). If the starting point
happens to be at the root of that vfsmount *and* that vfsmount
covers the entire filesystem, we get trouble.
Fixes: 48a066e72d ("RCU'd vsfmounts")
Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
"overflow" inline inode data.
-----BEGIN PGP SIGNATURE-----
Comment: Kees Cook <kees@outflux.net>
iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAltlv70WHGtlZXNjb29r
QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJj3eD/9+goWbs9U62tlO2cIqT5lgTFX9
3vvVpqNJ3atw/fU6SZoa0Z2nRa9TIpZEfEljgARGhCyd2p2MplLJalWo6bq/gUUA
5aWQbVxhyVXUp8kh8m3OnZsAZz658Y5geLMk8vakXdyQ//PF43wO0cZyOFYdG3ec
sYuWA318UKaIsxqB9tT/K8YBRjjBgJ8wjgtpoSAr+FUhgg9Qqp3NL4fjr6SOEQrv
XWXLVLLhyUo8kQJ29E/VyzIfLysgiA67O4ClW+DEyD6rr/ZK9XOeG6yv3vwLGSHI
06/4BXMJce23iGIYf57Jz7b5tfAaZntdzNqUiTW6up0TuvG09auwjv/hKkwfjQv2
fNf0TVnYOCN4ZWbCm4FTEU5q31u+pZDHiRxOTJG9EbuBVEsFiV5X9B9VTLoA+dWK
SqWmE2X9YdONt9q6K8TGpxxrdQnQXGKOOtEY23KoF04XGYQtutIfj7cdFCDDX/eC
AjcOIfV3u8dHWjc/wKIYS5XRUzbkdpEeNOaiCQ3RN79JN2UrA0/w7lAxZTICmYCs
HtMtaFTER5weKQGzTzg0SP3M95qKPdWhlIq5lspdhGedonZAKBNfGahMltTH2UoY
vIb1qGT8uz8tQSzGsu0uvrR92ZIJZ0qUVjB/nhE9HxTz26xqtDDlLXpdIXTGrnU4
hM+Omud//MNgNbsmrg==
=pwq8
-----END PGP SIGNATURE-----
Merge tag 'usercopy-fix-v4.18-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull usercopy whitelisting fix from Kees Cook:
"Bart Massey discovered that the usercopy whitelist for JFS was
incomplete: the inline inode data may intentionally "overflow" into
the neighboring "extended area", so the size of the whitelist needed
to be raised to include the neighboring field"
* tag 'usercopy-fix-v4.18-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
jfs: Fix usercopy whitelist for inline inode data
- Fix incorrect shifting in the iomap bmap functions.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAltj7lcACgkQ+H93GTRK
tOutzw//U9yOrUhuOTFrkmA2x+CB1ArncFZysUTXDFUsLRrMp0dwNaRV0WafyypO
tCr4nvqgwg2VTFf21zWblOSAajstgJzH+x3FdxTsU5Gktd/BH9gCHSDV0lORnFhp
jbqNOsBRPJjNIBzaIIAnyDGMuOF/MeWpGsbGF4eDcsgj58lr098rir7grEVgCkKM
+eCMtqJwUXUZ2bo/grBWvZXnLms+8LMPYeRQMJSnw59DVunwxRdSbWumJwkPt9DD
mz3Upa/qFJCZw2n9kluU1b/tnrpxkNWYuSjzp9iu0cMdo52HF+yDNHriUPCHa4PB
3KfyGrhOvg+iC2pUAJ5mI3Dpv+NNAk7j/+4mOVtlUYIf0mi5gszIjNHbLiONKVH5
5x8G59wM/ae6a1WcHe7tJRma7r984G+JLTIxAQnWaWhFq5AsYMOplk4oq6X7Lmir
wAR7laDN0Xgl3WCFj6SXaKcnuzZFsE4A3SnZILtlgO/WxAujUyFEnICx1cO4Q9OY
64txWii6ora9kdBtalolxjfGHwyScbhx6FiBQdKIznxGgBQR89X8hzxghdp7KTIx
kxkC1hAM3KXXtokArjad2hgd8QG23jUshyBwKpfHnEwL75GiZQ3qtPdDm/oJlVWV
ItnRSt+tGIT/fsT5szNPmLKgtIW4kQHflVuih0KR4IsGPMmZ8dU=
=3omy
-----END PGP SIGNATURE-----
Merge tag 'xfs-4.18-fixes-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs bugfix from Darrick Wong:
"One more patch for 4.18 to fix a coding error in the iomap_bmap()
function introduced in -rc1: fix incorrect shifting"
* tag 'xfs-4.18-fixes-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
fs: fix iomap_bmap position calculation
Bart Massey reported what turned out to be a usercopy whitelist false
positive in JFS when symlink contents exceeded 128 bytes. The inline
inode data (i_inline) is actually designed to overflow into the "extended
area" following it (i_inline_ea) when needed. So the whitelist needed to
be expanded to include both i_inline and i_inline_ea (the whole size
of which is calculated internally using IDATASIZE, 256, instead of
sizeof(i_inline), 128).
$ cd /mnt/jfs
$ touch $(perl -e 'print "B" x 250')
$ ln -s B* b
$ ls -l >/dev/null
[ 249.436410] Bad or missing usercopy whitelist? Kernel memory exposure attempt detected from SLUB object 'jfs_ip' (offset 616, size 250)!
Reported-by: Bart Massey <bart.massey@gmail.com>
Fixes: 8d2704d382 ("jfs: Define usercopy region in jfs_ip slab cache")
Cc: Dave Kleikamp <shaggy@kernel.org>
Cc: jfs-discussion@lists.sourceforge.net
Cc: stable@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
The fix in commit 0cbb4b4f4c ("userfaultfd: clear the
vma->vm_userfaultfd_ctx if UFFD_EVENT_FORK fails") cleared the
vma->vm_userfaultfd_ctx but kept userfaultfd flags in vma->vm_flags
that were copied from the parent process VMA.
As the result, there is an inconsistency between the values of
vma->vm_userfaultfd_ctx.ctx and vma->vm_flags which triggers BUG_ON
in userfaultfd_release().
Clearing the uffd flags from vma->vm_flags in case of UFFD_EVENT_FORK
failure resolves the issue.
Link: http://lkml.kernel.org/r/1532931975-25473-1-git-send-email-rppt@linux.vnet.ibm.com
Fixes: 0cbb4b4f4c ("userfaultfd: clear the vma->vm_userfaultfd_ctx if UFFD_EVENT_FORK fails")
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Reported-by: syzbot+121be635a7a35ddb7dcb@syzkaller.appspotmail.com
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Eric Biggers <ebiggers3@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The position calculation in iomap_bmap() shifts bno the wrong way,
so we don't progress properly and end up re-mapping block zero
over and over, yielding an unchanging physical block range as the
logical block advances:
# filefrag -Be file
ext: logical_offset: physical_offset: length: expected: flags:
0: 0.. 0: 21.. 21: 1: merged
1: 1.. 1: 21.. 21: 1: 22: merged
Discontinuity: Block 1 is at 21 (was 22)
2: 2.. 2: 21.. 21: 1: 22: merged
Discontinuity: Block 2 is at 21 (was 22)
3: 3.. 3: 21.. 21: 1: 22: merged
This breaks the FIBMAP interface for anyone using it (XFS), which
in turn breaks LILO, zipl, etc.
Bug-actually-spotted-by: Darrick J. Wong <darrick.wong@oracle.com>
Fixes: 89eb1906a9 ("iomap: add an iomap-based bmap implementation")
Cc: stable@vger.kernel.org
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Previously in squashfs_readpage() when copying data into the page
cache, it used the length of the datablock read from the filesystem
(after decompression). However, if the filesystem has been corrupted
this data block may be short, which will leave pages unfilled.
The fix for this is to compute the expected number of bytes to copy
from the inode size, and use this to detect if the block is short.
Signed-off-by: Phillip Lougher <phillip@squashfs.org.uk>
Tested-by: Willy Tarreau <w@1wt.eu>
Cc: Анатолий Тросиненко <anatoly.trosinenko@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The squashfs fragment reading code doesn't actually verify that the
fragment is inside the fragment table. The end result _is_ verified to
be inside the image when actually reading the fragment data, but before
that is done, we may end up taking a page fault because the fragment
table itself might not even exist.
Another report from Anatoly and his endless squashfs image fuzzing.
Reported-by: Анатолий Тросиненко <anatoly.trosinenko@gmail.com>
Acked-by:: Phillip Lougher <phillip.lougher@gmail.com>,
Cc: Willy Tarreau <w@1wt.eu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The patch to fix the case where a lock request was interrupted ended up
changing default handling of errors such as NFS4ERR_DENIED and caused the
client to immediately resend the lock request. Let's do a partial revert
of that request so that the default is now to exit, but change the way
we handle resends to take into account the fact that the user may have
interrupted the request.
Reported-by: Kenneth Johansson <ken@kenjo.org>
Fixes: a3cf9bca2a ("NFSv4: Don't add a new lock on an interrupted wait..")
Cc: Benjamin Coddington <bcodding@redhat.com>
Cc: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Anatoly continues to find issues with fuzzed squashfs images.
This time, corrupt, missing, or undersized data for the page filling
wasn't checked for, because the squashfs_{copy,read}_cache() functions
did the squashfs_copy_data() call without checking the resulting data
size.
Which could result in the page cache pages being incompletely filled in,
and no error indication to the user space reading garbage data.
So make a helper function for the "fill in pages" case, because the
exact same incomplete sequence existed in two places.
[ I should have made a squashfs branch for these things, but I didn't
intend to start doing them in the first place.
My historical connection through cramfs is why I got into looking at
these issues at all, and every time I (continue to) think it's a
one-off.
Because _this_ time is always the last time. Right? - Linus ]
Reported-by: Anatoly Trosinenko <anatoly.trosinenko@gmail.com>
Tested-by: Willy Tarreau <w@1wt.eu>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Phillip Lougher <phillip@squashfs.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Anatoly reports another squashfs fuzzing issue, where the decompression
parameters themselves are in a compressed block.
This causes squashfs_read_data() to be called in order to read the
decompression options before the decompression stream having been set
up, making squashfs go sideways.
Reported-by: Anatoly Trosinenko <anatoly.trosinenko@gmail.com>
Acked-by: Phillip Lougher <phillip.lougher@gmail.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
introduced in 4.18-rc4.
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAlteF34ACgkQ8vlZVpUN
gaMQugf+LjlbbncSEuPxZ+C3CnSGkEzjrg8IRylZA2uf04Z5Bax8K5gqvXLx7ZtF
Qz3vzmrYpaUV8UiaMy0SGLCRWebwoxPEN7ZX3/W1PfeymP3wQ4DLw37059AzLfsq
Vzh9w3N1At1plUee7iJ2MDBU830Q0a917jjnpZ+M0AtQx/BzP8QEISuzp4JWICqe
NbJDVybMWoW2YOSpMPiihxSFqCDx5rMyAJ1vllboopZK+XAjpQ/visnLh3aT3o71
7cTPl9gI2rbwYbJk8kM5fmXhWqSARHARV1bpZNOUnCAUU1E2Se7aETjggQ0QzJE/
mIc7wCzFLrrY8+iakwdhb5Aw3qOPyg==
=ZdXo
-----END PGP SIGNATURE-----
Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 fixes from Ted Ts'o:
"Some miscellaneous ext4 fixes for 4.18; one fix is for a regression
introduced in 4.18-rc4.
Sorry for the late-breaking pull. I was originally going to wait for
the next merge window, but Eric Whitney found a regression introduced
in 4.18-rc4, so I decided to push out the regression plus the other
fixes now. (The other commits have been baking in linux-next since
early July)"
* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: fix check to prevent initializing reserved inodes
ext4: check for allocation block validity with block group locked
ext4: fix inline data updates with checksums enabled
ext4: clear mmp sequence number when remounting read-only
ext4: fix false negatives *and* false positives in ext4_check_descriptors()
Anatoly Trosinenko reports that a corrupted squashfs image can cause a
kernel oops. It turns out that squashfs can end up being confused about
negative fragment lengths.
The regular squashfs_read_data() does check for negative lengths, but
squashfs_read_metadata() did not, and the fragment size code just
blindly trusted the on-disk value. Fix both the fragment parsing and
the metadata reading code.
Reported-by: Anatoly Trosinenko <anatoly.trosinenko@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Phillip Lougher <phillip@squashfs.org.uk>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 8844618d8aa7: "ext4: only look at the bg_flags field if it is
valid" will complain if block group zero does not have the
EXT4_BG_INODE_ZEROED flag set. Unfortunately, this is not correct,
since a freshly created file system has this flag cleared. It gets
almost immediately after the file system is mounted read-write --- but
the following somewhat unlikely sequence will end up triggering a
false positive report of a corrupted file system:
mkfs.ext4 /dev/vdc
mount -o ro /dev/vdc /vdc
mount -o remount,rw /dev/vdc
Instead, when initializing the inode table for block group zero, test
to make sure that itable_unused count is not too large, since that is
the case that will result in some or all of the reserved inodes
getting cleared.
This fixes the failures reported by Eric Whiteney when running
generic/230 and generic/231 in the the nojournal test case.
Fixes: 8844618d8a ("ext4: only look at the bg_flags field if it is valid")
Reported-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAltbc20QHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpgh4D/9GYQcjk9qLVFxkv5ucAUvCuxEL6gjsMf4W
M/QdxVIrwh3zpvsH++2IXXn+xH+UjujMA5NkzhsSr4+hsSO2iAGOYMJbroNfhsTD
onvQQ6NTaHPu/+PZs0otVK4KMWHwZGWOV6YU00TWTfRgzRmGEsSMe91oeBIXVv9w
v6d09twaLSY0lUkAAbcdu5fuFBtXu4Bxy60qyHEKkAdWWHEUYaZLrODhVjoGg2V4
KdAWS5X4A6kJMcPcoOvG6RFtpf71boaip9o/DRLUWhGdIQnI38UgSCUmz1XMYnik
Sq8r74vqCm8IhIOLTlxnPrMHHbKv7JZhY3Ow9fxnS6HZRNI0aPX31Yml6NULqnWh
MsQh+6gZXd3xC1O7txEQn4a15Lk0OLXa8HJcIn5ADNxqz5/r/g0mPUG9HmPSIalO
ISFF/9UKQFcAd0RjHR+bEEH2VMznz59UWKfdOsmwFZtZSCmR1ucj0xAKDj+oP1JS
ZsgZ09K2GezrL4GEueocISo9ACIWgDWH8T7/bTxlBok0IYbybAfmOe+MZInL1Tf4
pklmoXm3ntgV3Pq8Ptk05LYyIgAaUIltuSiR3AFaXIADX0wNtV0ZgysIWgHf3BSA
18j+I1yPG1IwBdM8xNwxi56xMQR84uY5tsIyafbfj+laRI2nH5OIYjNZnrKpm957
4xZUgIECBA==
=2ogY
-----END PGP SIGNATURE-----
Merge tag 'for-linus-20180727' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
"Bigger than usual at this time, mostly due to the O_DIRECT corruption
issue and the fact that I was on vacation last week. This contains:
- NVMe pull request with two fixes for the FC code, and two target
fixes (Christoph)
- a DIF bio reset iteration fix (Greg Edwards)
- two nbd reply and requeue fixes (Josef)
- SCSI timeout fixup (Keith)
- a small series that fixes an issue with bio_iov_iter_get_pages(),
which ended up causing corruption for larger sized O_DIRECT writes
that ended up racing with buffered writes (Martin Wilck)"
* tag 'for-linus-20180727' of git://git.kernel.dk/linux-block:
block: reset bi_iter.bi_done after splitting bio
block: bio_iov_iter_get_pages: pin more pages for multi-segment IOs
blkdev: __blkdev_direct_IO_simple: fix leak in error case
block: bio_iov_iter_get_pages: fix size of last iovec
nvmet: only check for filebacking on -ENOTBLK
nvmet: fixup crash on NULL device path
scsi: set timed out out mq requests to complete
blk-mq: export setting request completion state
nvme: if_ready checks to fail io to deleting controller
nvmet-fc: fix target sgl list on large transfers
nbd: handle unexpected replies better
nbd: don't requeue the same request twice.
Merge misc fixes from Andrew Morton:
"11 fixes"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
kvm, mm: account shadow page tables to kmemcg
zswap: re-check zswap_is_full() after do zswap_shrink()
include/linux/eventfd.h: include linux/errno.h
mm: fix vma_is_anonymous() false-positives
mm: use vma_init() to initialize VMAs on stack and data segments
mm: introduce vma_init()
mm: fix exports that inadvertently make put_page() EXPORT_SYMBOL_GPL
ipc/sem.c: prevent queue.status tearing in semop
mm: disallow mappings that conflict for devm_memremap_pages()
kasan: only select SLUB_DEBUG with SYSFS=y
delayacct: fix crash in delayacct_blkio_end() after delayacct init failure
- Fix some uninitialized variable errors
- Fix an incorrect check in metadata verifiers
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAltYpkoACgkQ+H93GTRK
tOsMAA//Tyt2rjGGrvtPUiI9xhDDbYM+Eds19IWhye9LyNQCHXdmrCicsBvoEyCC
5XSAT5lofeLNIbiTS88aC0b4sr2LLban6YsTBHGTlRxUTrnCSpCCDIgXJswxLjmT
jivIumvKL3sxgmXubwe6gnjoLCNGIy3JrdCu4vFf6JGWAj6U5HyZ5hjtj74nuPtg
w6BMEptJIOmQwGzSjQY76dQ5ekliVuOtYISY6gRAfVPVvwURgIzZdQPi4qV5Kw/d
n2nA6rvMBUcMUSVvXWS1ryOWsy4HrB9LXzbr5Kb0NgaVKnAqSCYGIGMJSEsiO/7Y
P83Doo6N8fYh8QEUOLqJ76XTkkrzoo3fvo7IZXUGMERXx90UliEAI/k6hWy6awtT
cCQatAcOp+8r5PvMJ9ZIivAwDId06PwpuDntOATIamGkNEo4vo0LO189fQP+i8RD
LIbEcLcGOHVjjTZgGqJCfDWVPiFtG8ZdZp9bvmpW9aREzMGl/tXnvI2QsSwZu+lU
87efBqztYGm4U4D5grdV/ynbT1E4E9ggtI2pVHG2ipJnZ+UeTiOCw68lDcUDT0JA
lU2fPUKzUR3v+U6s26AJFKcX2HCG4G75cJozBuH82xcPnUT0m3PMde0ZhFzVnvg4
w8T+bIS0Q/f310SSAitu1qfG5cx2f6I5j107jhldvcibRmqEZLE=
=Ovtv
-----END PGP SIGNATURE-----
Merge tag 'xfs-4.18-fixes-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs fixes from Darrick Wong:
- Fix some uninitialized variable errors
- Fix an incorrect check in metadata verifiers
* tag 'xfs-4.18-fixes-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: properly handle free inodes in extent hint validators
xfs: Initialize variables in xfs_alloc_get_rec before using them
Make sure to initialize all VMAs properly, not only those which come
from vm_area_cachep.
Link: http://lkml.kernel.org/r/20180724121139.62570-3-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fixes: 72ecad22d9 ("block: support a full bio worth of IO for simplified bdev direct-io")
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Martin Wilck <mwilck@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQIVAwUAW1h/7Pu3V2unywtrAQJFOg/+NQswGJcTsJGeTL8tW+8nGtJVeTP6XaVh
xPEjdqTBmimt6ciaNP1LxLLt9jQ50S1f83rWGZeFBQoNgWinoe3VtzSVdlQKhZcT
jC5LtFVlTxw5rrL4uCsJywLLjD0NeH41ISbvCStcyYJExOZr+f4/VJXKNcwKjAvf
kD1xDGnVZsZiGLWFjwBVaPJwFigquoLEU564InMnZbvMW95uZOPGfnwxAGmKQX2n
BV3WxVizCc0MwlHMJYjs0cVMZNviuC+qg7YBJIoio3+Dq8FIn7ISn98LbhCpG7mi
FoiRi+7xs7VCGm9yqtkXL+euHcSzjnJPnlYxpU8xGqAay0qKxoHecZj2iMEX327K
E4mujQ40oqkMLhwy3GhT9cIpvbbQPu7+kS+k9x7UqVnzlhsKEeMp7TEFqSEebO9H
kIuvfRBD3uZY0B/loLCB3Cc/B9OoWAUi6IGBRclwS9+RUuBnZY/jb7iQsEvcOv9u
0EC0biSs1jizG1tLR0LmvjIyvS567t/DG/peLad1lOqPe6Up2mO4XIeyS9phwXAD
ryupnKPr3tGRgvfJ4jLUZPC8/nrv5Fg9R0YhICEEBqhwKn1uTZgyc085d2EHOdQp
fmfbXR/oz4TwjjwlgzrLMQbLB7GUpkgCFxsIPpEVeFH3RDbZNO1UbLovDMUuRaos
lFWHzd4K4XA=
=yuI9
-----END PGP SIGNATURE-----
Merge tag 'fscache-fixes-20180725' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
Pull fscache/cachefiles fixes from David Howells:
- Allow cancelled operations to be queued so they can be cleaned up.
- Fix a refcounting bug in the monitoring of reads on backend files
whereby a race can occur between monitor objects being listed for
work, the work processing being queued and the work processor running
and destroying the monitor objects.
- Fix a ref overput in object attachment, whereby a tentatively
considered object is put in error handling without first being 'got'.
- Fix a missing clear of the CACHEFILES_OBJECT_ACTIVE flag whereby an
assertion occurs when we retry because it seems the object is now
active.
- Wait rather BUG'ing on an object collision in the depths of
cachefiles as the active object should be being cleaned up - also
depends on the one above.
* tag 'fscache-fixes-20180725' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
cachefiles: Wait rather than BUG'ing on "Unexpected object collision"
cachefiles: Fix missing clear of the CACHEFILES_OBJECT_ACTIVE flag
fscache: Fix reference overput in fscache_attach_object() error handling
cachefiles: Fix refcounting bug in backing-file read monitoring
fscache: Allow cancelled operations to be enqueued
If we meet a conflicting object that is marked FSCACHE_OBJECT_IS_LIVE in
the active object tree, we have been emitting a BUG after logging
information about it and the new object.
Instead, we should wait for the CACHEFILES_OBJECT_ACTIVE flag to be cleared
on the old object (or return an error). The ACTIVE flag should be cleared
after it has been removed from the active object tree. A timeout of 60s is
used in the wait, so we shouldn't be able to get stuck there.
Fixes: 9ae326a690 ("CacheFiles: A cache that backs onto a mounted filesystem")
Signed-off-by: Kiran Kumar Modukuri <kiran.modukuri@gmail.com>
Signed-off-by: David Howells <dhowells@redhat.com>
In cachefiles_mark_object_active(), the new object is marked active and
then we try to add it to the active object tree. If a conflicting object
is already present, we want to wait for that to go away. After the wait,
we go round again and try to re-mark the object as being active - but it's
already marked active from the first time we went through and a BUG is
issued.
Fix this by clearing the CACHEFILES_OBJECT_ACTIVE flag before we try again.
Analysis from Kiran Kumar Modukuri:
[Impact]
Oops during heavy NFS + FSCache + Cachefiles
CacheFiles: Error: Overlong wait for old active object to go away.
BUG: unable to handle kernel NULL pointer dereference at 0000000000000002
CacheFiles: Error: Object already active kernel BUG at
fs/cachefiles/namei.c:163!
[Cause]
In a heavily loaded system with big files being read and truncated, an
fscache object for a cookie is being dropped and a new object being
looked. The new object being looked for has to wait for the old object
to go away before the new object is moved to active state.
[Fix]
Clear the flag 'CACHEFILES_OBJECT_ACTIVE' for the new object when
retrying the object lookup.
[Testcase]
Have run ~100 hours of NFS stress tests and have not seen this bug recur.
[Regression Potential]
- Limited to fscache/cachefiles.
Fixes: 9ae326a690 ("CacheFiles: A cache that backs onto a mounted filesystem")
Signed-off-by: Kiran Kumar Modukuri <kiran.modukuri@gmail.com>
Signed-off-by: David Howells <dhowells@redhat.com>
When a cookie is allocated that causes fscache_object structs to be
allocated, those objects are initialised with the cookie pointer, but
aren't blessed with a ref on that cookie unless the attachment is
successfully completed in fscache_attach_object().
If attachment fails because the parent object was dying or there was a
collision, fscache_attach_object() returns without incrementing the cookie
counter - but upon failure of this function, the object is released which
then puts the cookie, whether or not a ref was taken on the cookie.
Fix this by taking a ref on the cookie when it is assigned in
fscache_object_init(), even when we're creating a root object.
Analysis from Kiran Kumar:
This bug has been seen in 4.4.0-124-generic #148-Ubuntu kernel
BugLink: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1776277
fscache cookie ref count updated incorrectly during fscache object
allocation resulting in following Oops.
kernel BUG at /build/linux-Y09MKI/linux-4.4.0/fs/fscache/internal.h:321!
kernel BUG at /build/linux-Y09MKI/linux-4.4.0/fs/fscache/cookie.c:639!
[Cause]
Two threads are trying to do operate on a cookie and two objects.
(1) One thread tries to unmount the filesystem and in process goes over a
huge list of objects marking them dead and deleting the objects.
cookie->usage is also decremented in following path:
nfs_fscache_release_super_cookie
-> __fscache_relinquish_cookie
->__fscache_cookie_put
->BUG_ON(atomic_read(&cookie->usage) <= 0);
(2) A second thread tries to lookup an object for reading data in following
path:
fscache_alloc_object
1) cachefiles_alloc_object
-> fscache_object_init
-> assign cookie, but usage not bumped.
2) fscache_attach_object -> fails in cant_attach_object because the
cookie's backing object or cookie's->parent object are going away
3) fscache_put_object
-> cachefiles_put_object
->fscache_object_destroy
->fscache_cookie_put
->BUG_ON(atomic_read(&cookie->usage) <= 0);
[NOTE from dhowells] It's unclear as to the circumstances in which (2) can
take place, given that thread (1) is in nfs_kill_super(), however a
conflicting NFS mount with slightly different parameters that creates a
different superblock would do it. A backtrace from Kiran seems to show
that this is a possibility:
kernel BUG at/build/linux-Y09MKI/linux-4.4.0/fs/fscache/cookie.c:639!
...
RIP: __fscache_cookie_put+0x3a/0x40 [fscache]
Call Trace:
__fscache_relinquish_cookie+0x87/0x120 [fscache]
nfs_fscache_release_super_cookie+0x2d/0xb0 [nfs]
nfs_kill_super+0x29/0x40 [nfs]
deactivate_locked_super+0x48/0x80
deactivate_super+0x5c/0x60
cleanup_mnt+0x3f/0x90
__cleanup_mnt+0x12/0x20
task_work_run+0x86/0xb0
exit_to_usermode_loop+0xc2/0xd0
syscall_return_slowpath+0x4e/0x60
int_ret_from_sys_call+0x25/0x9f
[Fix] Bump up the cookie usage in fscache_object_init, when it is first
being assigned a cookie atomically such that the cookie is added and bumped
up if its refcount is not zero. Remove the assignment in
fscache_attach_object().
[Testcase]
I have run ~100 hours of NFS stress tests and not seen this bug recur.
[Regression Potential]
- Limited to fscache/cachefiles.
Fixes: ccc4fc3d11 ("FS-Cache: Implement the cookie management part of the netfs API")
Signed-off-by: Kiran Kumar Modukuri <kiran.modukuri@gmail.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cachefiles_read_waiter() has the right to access a 'monitor' object by
virtue of being called under the waitqueue lock for one of the pages in its
purview. However, it has no ref on that monitor object or on the
associated operation.
What it is allowed to do is to move the monitor object to the operation's
to_do list, but once it drops the work_lock, it's actually no longer
permitted to access that object. However, it is trying to enqueue the
retrieval operation for processing - but it can only do this via a pointer
in the monitor object, something it shouldn't be doing.
If it doesn't enqueue the operation, the operation may not get processed.
If the order is flipped so that the enqueue is first, then it's possible
for the work processor to look at the to_do list before the monitor is
enqueued upon it.
Fix this by getting a ref on the operation so that we can trust that it
will still be there once we've added the monitor to the to_do list and
dropped the work_lock. The op can then be enqueued after the lock is
dropped.
The bug can manifest in one of a couple of ways. The first manifestation
looks like:
FS-Cache:
FS-Cache: Assertion failed
FS-Cache: 6 == 5 is false
------------[ cut here ]------------
kernel BUG at fs/fscache/operation.c:494!
RIP: 0010:fscache_put_operation+0x1e3/0x1f0
...
fscache_op_work_func+0x26/0x50
process_one_work+0x131/0x290
worker_thread+0x45/0x360
kthread+0xf8/0x130
? create_worker+0x190/0x190
? kthread_cancel_work_sync+0x10/0x10
ret_from_fork+0x1f/0x30
This is due to the operation being in the DEAD state (6) rather than
INITIALISED, COMPLETE or CANCELLED (5) because it's already passed through
fscache_put_operation().
The bug can also manifest like the following:
kernel BUG at fs/fscache/operation.c:69!
...
[exception RIP: fscache_enqueue_operation+246]
...
#7 [ffff883fff083c10] fscache_enqueue_operation at ffffffffa0b793c6
#8 [ffff883fff083c28] cachefiles_read_waiter at ffffffffa0b15a48
#9 [ffff883fff083c48] __wake_up_common at ffffffff810af028
I'm not entirely certain as to which is line 69 in Lei's kernel, so I'm not
entirely clear which assertion failed.
Fixes: 9ae326a690 ("CacheFiles: A cache that backs onto a mounted filesystem")
Reported-by: Lei Xue <carmark.dlut@gmail.com>
Reported-by: Vegard Nossum <vegard.nossum@gmail.com>
Reported-by: Anthony DeRobertis <aderobertis@metrics.net>
Reported-by: NeilBrown <neilb@suse.com>
Reported-by: Daniel Axtens <dja@axtens.net>
Reported-by: Kiran Kumar Modukuri <kiran.modukuri@gmail.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Daniel Axtens <dja@axtens.net>
Alter the state-check assertion in fscache_enqueue_operation() to allow
cancelled operations to be given processing time so they can be cleaned up.
Also fix a debugging statement that was requiring such operations to have
an object assigned.
Fixes: 9ae326a690 ("CacheFiles: A cache that backs onto a mounted filesystem")
Reported-by: Kiran Kumar Modukuri <kiran.modukuri@gmail.com>
Signed-off-by: David Howells <dhowells@redhat.com>
When inodes are freed in xfs_ifree(), di_flags is cleared (so extent size
hints are removed) but the actual extent size fields are left intact.
This causes the extent hint validators to fail on freed inodes which once
had extent size hints.
This can be observed (for example) by running xfs/229 twice on a
non-crc xfs filesystem, or presumably on V5 with ikeep.
Fixes: 7d71a67 ("xfs: verify extent size hint is valid in inode verifier")
Fixes: 02a0fda ("xfs: verify COW extent size hint is valid in inode verifier")
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Pull vfs fixes from Al Viro:
"Fix several places that screw up cleanups after failures halfway
through opening a file (one open-coding filp_clone_open() and getting
it wrong, two misusing alloc_file()). That part is -stable fodder from
the 'work.open' branch.
And Christoph's regression fix for uapi breakage in aio series;
include/uapi/linux/aio_abi.h shouldn't be pulling in the kernel
definition of sigset_t, the reason for doing so in the first place had
been bogus - there's no need to expose struct __aio_sigset in
aio_abi.h at all"
* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
aio: don't expose __aio_sigset in uapi
ocxlflash_getfile(): fix double-iput() on alloc_file() failures
cxl_getfile(): fix double-iput() on alloc_file() failures
drm_mode_create_lease_ioctl(): fix open-coded filp_clone_open()
uuid_le_to_bin() is deprecated API and take into consideration that variable,
to where we store parsed data, is type of guid_t we switch to guid_parse()
for sake of consistency.
While here, add error checking to it.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Hans de Goede <hdegoede@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Lukas Wunner <lukas@wunner.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-efi@vger.kernel.org
Link: http://lkml.kernel.org/r/20180720014726.24031-10-ard.biesheuvel@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAltSwkIACgkQxWXV+ddt
WDtMBQ//UXMjHaXvFmC0SM6NczuYQR51hLYtJFIKig93XK5goVpUBTNbxO7LX/Tn
4zKoKyVhkW1884V6mRiC+G23QLbo0BQZA7DExfyJ3jylQdjZMBm+K+r19OtGQf5v
CII7oUwni03KIiXIqiFAL5dLWebVQpG5EKJbh8GLZsmg6xNcyVaUqZ/fHXajbZiv
ldEBtHBKIv7WWTJmylMBKMWnRz+jqU91fXPahoU6R5qivODrLt1o/PMuSjVNhaxe
iDldHfdOaiQmLHB/1kOGyv492oW5mSSVNDE8LjEDZ61tDNlAcUyuKUWIRBxDEDtD
6D7rlVQXJ/N7sJ6+UYmJKsRpHL+NOkyzSZ0QEU/sm1Xpm8gkhHuuofRPrVCtd3l1
ZSbwvlrdyjigVEBfM3IbToQ/K6Rc1ZGId20OAs9PCQbb+mj9IxPIncZ7pI1c4hlh
pPEjcYsp14JbCTjctFalcqTiFY5tHRQsx+GUFnDyOcdL7Mi+CoH+0Jy61Vgz9GQE
7s934cfEC0ot/f66kAL/PZzxUfC7TePqaa+sDfS5BIkJ4M6lPMxS5De5R4Z0+Nzr
DXgQAlgXmxfRjpOYMTH9D0EDdSeJaNmVHgk7hFbiYk/KX3oyd4NmgI9Cfao8rQJv
2yd8wF2httfSJKD4b/Hv9r6Ho/Bw9PK59BvWOKYhSj6IGl32utw=
=f7eB
-----END PGP SIGNATURE-----
Merge tag 'for-4.18-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fix from David Sterba:
"A fix of a corruption regarding fsync and clone, under some very
specific conditions explained in the patch.
The fix is marked for stable 3.16+ so I'd like to get it merged now
given the impact"
* tag 'for-4.18-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
Btrfs: fix file data corruption after cloning a range and fsync
Like vm_area_dup(), it initializes the anon_vma_chain head, and the
basic mm pointer.
The rest of the fields end up being different for different users,
although the plan is to also initialize the 'vm_ops' field to a dummy
entry.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The vm_area_struct is one of the most fundamental memory management
objects, but the management of it is entirely open-coded evertwhere,
ranging from allocation and freeing (using kmem_cache_[z]alloc and
kmem_cache_free) to initializing all the fields.
We want to unify this in order to end up having some unified
initialization of the vmas, and the first step to this is to at least
have basic allocation functions.
Right now those functions are literally just wrappers around the
kmem_cache_*() calls. This is a purely mechanical conversion:
# new vma:
kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL) -> vm_area_alloc()
# copy old vma
kmem_cache_alloc(vm_area_cachep, GFP_KERNEL) -> vm_area_dup(old)
# free vma
kmem_cache_free(vm_area_cachep, vma) -> vm_area_free(vma)
to the point where the old vma passed in to the vm_area_dup() function
isn't even used yet (because I've left all the old manual initialization
alone).
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In parse_options(), if match_strdup() failed, parse_options() leaves
opts->iocharset in unexpected state (i.e. still pointing the freed
string). And this can be the cause of double free.
To fix, this initialize opts->iocharset always when freeing.
Link: http://lkml.kernel.org/r/8736wp9dzc.fsf@mail.parknet.co.jp
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Reported-by: syzbot+90b8e10515ae88228a92@syzkaller.appspotmail.com
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When we clone a range into a file we can end up dropping existing
extent maps (or trimming them) and replacing them with new ones if the
range to be cloned overlaps with a range in the destination inode.
When that happens we add the new extent maps to the list of modified
extents in the inode's extent map tree, so that a "fast" fsync (the flag
BTRFS_INODE_NEEDS_FULL_SYNC not set in the inode) will see the extent maps
and log corresponding extent items. However, at the end of range cloning
operation we do truncate all the pages in the affected range (in order to
ensure future reads will not get stale data). Sometimes this truncation
will release the corresponding extent maps besides the pages from the page
cache. If this happens, then a "fast" fsync operation will miss logging
some extent items, because it relies exclusively on the extent maps being
present in the inode's extent tree, leading to data loss/corruption if
the fsync ends up using the same transaction used by the clone operation
(that transaction was not committed in the meanwhile). An extent map is
released through the callback btrfs_invalidatepage(), which gets called by
truncate_inode_pages_range(), and it calls __btrfs_releasepage(). The
later ends up calling try_release_extent_mapping() which will release the
extent map if some conditions are met, like the file size being greater
than 16Mb, gfp flags allow blocking and the range not being locked (which
is the case during the clone operation) nor being the extent map flagged
as pinned (also the case for cloning).
The following example, turned into a test for fstests, reproduces the
issue:
$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt
$ xfs_io -f -c "pwrite -S 0x18 9000K 6908K" /mnt/foo
$ xfs_io -f -c "pwrite -S 0x20 2572K 156K" /mnt/bar
$ xfs_io -c "fsync" /mnt/bar
# reflink destination offset corresponds to the size of file bar,
# 2728Kb minus 4Kb.
$ xfs_io -c ""reflink ${SCRATCH_MNT}/foo 0 2724K 15908K" /mnt/bar
$ xfs_io -c "fsync" /mnt/bar
$ md5sum /mnt/bar
95a95813a8c2abc9aa75a6c2914a077e /mnt/bar
<power fail>
$ mount /dev/sdb /mnt
$ md5sum /mnt/bar
207fd8d0b161be8a84b945f0df8d5f8d /mnt/bar
# digest should be 95a95813a8c2abc9aa75a6c2914a077e like before the
# power failure
In the above example, the destination offset of the clone operation
corresponds to the size of the "bar" file minus 4Kb. So during the clone
operation, the extent map covering the range from 2572Kb to 2728Kb gets
trimmed so that it ends at offset 2724Kb, and a new extent map covering
the range from 2724Kb to 11724Kb is created. So at the end of the clone
operation when we ask to truncate the pages in the range from 2724Kb to
2724Kb + 15908Kb, the page invalidation callback ends up removing the new
extent map (through try_release_extent_mapping()) when the page at offset
2724Kb is passed to that callback.
Fix this by setting the bit BTRFS_INODE_NEEDS_FULL_SYNC whenever an extent
map is removed at try_release_extent_mapping(), forcing the next fsync to
search for modified extents in the fs/subvolume tree instead of relying on
the presence of extent maps in memory. This way we can continue doing a
"fast" fsync if the destination range of a clone operation does not
overlap with an existing range or if any of the criteria necessary to
remove an extent map at try_release_extent_mapping() is not met (file
size not bigger then 16Mb or gfp flags do not allow blocking).
CC: stable@vger.kernel.org # 3.16+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
glibc uses a different defintion of sigset_t than the kernel does,
and the current version would pull in both. To fix this just do not
expose the type at all - this somewhat mirrors pselect() where we
do not even have a type for the magic sigmask argument, but just
use pointer arithmetics.
Fixes: 7a074e96 ("aio: implement io_pgetevents")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reported-by: Adrian Reber <adrian@lisas.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
In commit ac0b4145d6 ("btrfs: scrub: Don't use inode pages for device
replace") we removed the branch of copy_nocow_pages() to avoid
corruption for compressed nodatasum extents.
However above commit only solves the problem in scrub_extent(), if
during scrub_pages() we failed to read some pages,
sctx->no_io_error_seen will be non-zero and we go to fixup function
scrub_handle_errored_block().
In scrub_handle_errored_block(), for sctx without csum (no matter if
we're doing replace or scrub) we go to scrub_fixup_nodatasum() routine,
which does the similar thing with copy_nocow_pages(), but does it
without the extra check in copy_nocow_pages() routine.
So for test cases like btrfs/100, where we emulate read errors during
replace/scrub, we could corrupt compressed extent data again.
This patch will fix it just by avoiding any "optimization" for
nodatasum, just falls back to the normal fixup routine by try read from
any good copy.
This also solves WARN_ON() or dead lock caused by lame backref iteration
in scrub_fixup_nodatasum() routine.
The deadlock or WARN_ON() won't be triggered before commit ac0b4145d6
("btrfs: scrub: Don't use inode pages for device replace") since
copy_nocow_pages() have better locking and extra check for data extent,
and it's already doing the fixup work by try to read data from any good
copy, so it won't go scrub_fixup_nodatasum() anyway.
This patch disables the faulty code and will be removed completely in a
followup patch.
Fixes: ac0b4145d6 ("btrfs: scrub: Don't use inode pages for device replace")
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
ReiserFS prepares log messages into a 1024-byte buffer with no bounds
checks. Long messages, such as the "unknown mount option" warning when
userspace passes a crafted mount options string, overflow this buffer.
This causes KASAN to report a global-out-of-bounds write.
Fix it by truncating messages to the buffer size.
Link: http://lkml.kernel.org/r/20180707203621.30922-1-ebiggers3@gmail.com
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Reported-by: syzbot+b890b3335a4d8c608963@syzkaller.appspotmail.com
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The current code does not make sure to page align bss before calling
vm_brk(), and this can lead to a VM_BUG_ON() in __mm_populate() due to
the requested lenght not being correctly aligned.
Let us make sure to align it properly.
Kees: only applicable to CONFIG_USELIB kernels: 32-bit and configured
for libc5.
Link: http://lkml.kernel.org/r/20180705145539.9627-1-osalvador@techadventures.net
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Reported-by: syzbot+5dcb560fe12aa5091c06@syzkaller.appspotmail.com
Tested-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Nicolas Pitre <nicolas.pitre@linaro.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>