Pull SCSI target fixes from Nicholas Bellinger:
"Here are the target-pending fixes queued for v3.18-rc6.
The highlights include:
- target-core OOPs fix with tcm_qla2xxx + vxworks FC initiators +
zero length SCSI commands having a transfer direction set. (Roland
+ Craig Watson)
- vhost-scsi OOPs fix to explicitly prevent WWPN endpoint configfs
group removal while qemu still has an active reference. (Paolo +
nab)
- ib_srpt fix for RDMA hardware with lower srp_sq_size limits.
(Bart)
- two ib_isert work-arounds for running on ocrdma hardware (Or + Sagi
+ Chris)
- iscsi-target discovery portal typo + SPC-3 PR Preempt SA key
matching fix (Steve)"
* git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending:
IB/isert: Adjust CQ size to HW limits
target: return CONFLICT only when SA key unmatched
iser-target: Handle DEVICE_REMOVAL event on network portal listener correctly
ib_isert: Add max_send_sge=2 minimum for control PDU responses
srp-target: Retry when QP creation fails with ENOMEM
iscsi-target: return the correct port in SendTargets
vhost-scsi: Take configfs group dependency during VHOST_SCSI_SET_ENDPOINT
target: Don't call TFO->write_pending if data_length == 0
Pull dmaengine fixes from Vinod Koul:
"We have couple of fixes for dmaengine queued up:
- dma mempcy fix for dma configuration of sun6i by Maxime
- pl330 fixes: First the fixing allocation for data buffers by Liviu
and then Jon's fixe for fifo width and usage"
* 'fixes' of git://git.infradead.org/users/vkoul/slave-dma:
dmaengine: Fix allocation size for PL330 data buffer depth.
dmaengine: pl330: Limit MFIFO usage for memcpy to avoid exhausting entries
dmaengine: pl330: Align DMA memcpy operations to MFIFO width
dmaengine: sun6i: Fix memcpy operation
Pull MIPS fixes from Ralf Baechle:
"More 3.18 fixes for MIPS:
- backtraces were not quite working on on 64-bit kernels
- loongson needs a different cache coherency setting
- Loongson 3 is a MIPS64 R2 version but due to erratum we treat is an
older architecture revision.
- fix build errors due to undefined references to __node_distances
for certain configurations.
- fix instruction decodig in the jump label code.
- for certain configurations copy_{from,to}_user destroy the content
of $3 so that register needs to be marked as clobbed by the calling
code.
- Hardware Table Walker fixes.
- fill the delay slot of the last instruction of memcpy otherwise
whatever ends up there randomly might have undesirable effects.
- ensure get_user/__get_user always zero the variable to be read even
in case of an error"
* 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus:
MIPS: jump_label.c: Handle the microMIPS J instruction encoding
MIPS: jump_label.c: Correct the span of the J instruction
MIPS: Zero variable read by get_user / __get_user in case of an error.
MIPS: lib: memcpy: Restore NOP on delay slot before returning to caller
MIPS: tlb-r4k: Add missing HTW stop/start sequences
MIPS: asm: uaccess: Add v1 register to clobber list on EVA
MIPS: oprofile: Fix backtrace on 64-bit kernel
MIPS: Loongson: Set Loongson-3's ISA level to MIPS64R1
MIPS: Loongson: Fix the write-combine CCA value setting
MIPS: IP27: Fix __node_distances undefined error
MIPS: Loongson3: Fix __node_distances undefined error
Pull powerpc fix from Michael Ellerman:
"One fix from Scott, he says:
This patch fixes a crash (introduced in v3.18-rc1) in the FSL MSI driver
when threaded IRQs are enabled"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mpe/linux:
powerpc/fsl_msi: mark the msi cascade handler IRQF_NO_THREAD
Pull x86 fixes from Thomas Gleixner:
"Misc fixes:
- gold linker build fix
- noxsave command line parsing fix
- bugfix for NX setup
- microcode resume path bug fix
- _TIF_NOHZ versus TIF_NOHZ bugfix as discussed in the mysterious
lockup thread"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86, syscall: Fix _TIF_NOHZ handling in syscall_trace_enter_phase1
x86, kaslr: Handle Gold linker for finding bss/brk
x86, mm: Set NX across entire PMD at boot
x86, microcode: Update BSPs microcode on resume
x86: Require exact match for 'noxsave' command line option
Pull scheduler fixes from Ingo Molnar:
"Misc fixes: two NUMA fixes, two cputime fixes and an RCU/lockdep fix"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/cputime: Fix clock_nanosleep()/clock_gettime() inconsistency
sched/cputime: Fix cpu_timer_sample_group() double accounting
sched/numa: Avoid selecting oneself as swap target
sched/numa: Fix out of bounds read in sched_init_numa()
sched: Remove lockdep check in sched_move_task()
Pull perf fixes from Ingo Molnar:
"Misc fixes: two Intel uncore driver fixes, a CPU-hotplug fix and a
build dependencies fix"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86/intel/uncore: Fix boot crash on SBOX PMU on Haswell-EP
perf/x86/intel/uncore: Fix IRP uncore register offsets on Haswell EP
perf: Fix corruption of sibling list with hotplug
perf/x86: Fix embarrasing typo
Pull core fix from Ingo Molnar:
"Fix GENMASK macro shift overflow"
Nobody seems to currently use GENMASK() to fill every single last bit
(which is what overflows) in-tree, and gcc would warn about it, so we
have that going for us. But apparently there are pending changes that
want this.
* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
bitops: Fix shift overflow in GENMASK macros
Commit c3ae62af8e ("tcp: should drop incoming frames without ACK
flag set") was created to mitigate a security vulnerability in which a
local attacker is able to inject data into locally-opened sockets by
using TCP protocol statistics in procfs to quickly find the correct
sequence number.
This broke the RFC5961 requirement to send a challenge ACK in response
to spurious RST packets, which was subsequently fixed by commit
7b514a886b ("tcp: accept RST without ACK flag").
Unfortunately, the RFC5961 requirement that spurious SYN packets be
handled in a similar manner remains broken.
RFC5961 section 4 states that:
... the handling of the SYN in the synchronized state SHOULD be
performed as follows:
1) If the SYN bit is set, irrespective of the sequence number, TCP
MUST send an ACK (also referred to as challenge ACK) to the remote
peer:
<SEQ=SND.NXT><ACK=RCV.NXT><CTL=ACK>
After sending the acknowledgment, TCP MUST drop the unacceptable
segment and stop processing further.
By sending an ACK, the remote peer is challenged to confirm the loss
of the previous connection and the request to start a new connection.
A legitimate peer, after restart, would not have a TCB in the
synchronized state. Thus, when the ACK arrives, the peer should send
a RST segment back with the sequence number derived from the ACK
field that caused the RST.
This RST will confirm that the remote peer has indeed closed the
previous connection. Upon receipt of a valid RST, the local TCP
endpoint MUST terminate its connection. The local TCP endpoint
should then rely on SYN retransmission from the remote end to
re-establish the connection.
This patch lets SYN packets through the discard added in c3ae62af8e,
so that spurious SYN packets are properly dealt with as per the RFC.
The challenge ACK is sent unconditionally and is rate-limited, so the
original vulnerability is not reintroduced by this patch.
Signed-off-by: Calvin Owens <calvinowens@fb.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Not sure what I was thinking, but doing anything after
releasing a refcount is suicidal or/and embarrassing.
By the time we set skb->fclone to SKB_FCLONE_FREE, another cpu
could have released last reference and freed whole skb.
We potentially corrupt memory or trap if CONFIG_DEBUG_PAGEALLOC is set.
Reported-by: Chris Mason <clm@fb.com>
Fixes: ce1a4ea3f1 ("net: avoid one atomic operation in skb_clone()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
When doing a fsync with a fast path we have a time window where we can miss
the fact that writeback of some file data failed, and therefore we endup
returning success (0) from fsync when we should return an error.
The steps that lead to this are the following:
1) We start all ordered extents by calling filemap_fdatawrite_range();
2) We do some other work like locking the inode's i_mutex, start a transaction,
start a log transaction, etc;
3) We enter btrfs_log_inode(), acquire the inode's log_mutex and collect all the
ordered extents from inode's ordered tree into a list;
4) But by the time we do ordered extent collection, some ordered extents we started
at step 1) might have already completed with an error, and therefore we didn't
found them in the ordered tree and had no idea they finished with an error. This
makes our fsync return success (0) to userspace, but has no bad effects on the log
like for example insertion of file extent items into the log that point to unwritten
extents, because the invalid extent maps were removed before the ordered extent
completed (in inode.c:btrfs_finish_ordered_io).
So after collecting the ordered extents just check if the inode's i_mapping has any
error flags set (AS_EIO or AS_ENOSPC) and leave with an error if it does. Whenever
writeback fails for a page of an ordered extent, we call mapping_set_error (done in
extent_io.c:end_extent_writepage, called by extent_io.c:end_bio_extent_writepage)
that sets one of those error flags in the inode's i_mapping flags.
This change also has the side effect of fixing the issue where for fast fsyncs we
never checked/cleared the error flags from the inode's i_mapping flags, which means
that a full fsync performed after a fast fsync could get such errors that belonged
to the fast fsync - because the full fsync calls btrfs_wait_ordered_range() which
calls filemap_fdatawait_range(), and the later checks for and clears those flags,
while for fast fsyncs we never call filemap_fdatawait_range() or anything else
that checks for and clears the error flags from the inode's i_mapping.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Instead of collecting all ordered extents from the inode's ordered tree
and then wait for all of them to complete, just collect the ones that
overlap the fsync range.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
If an error happens during writeback of log btree extents, make sure the
error is returned to the caller (fsync), so that it takes proper action
(commit current transaction) instead of writing a superblock that points
to log btrees with all or some nodes that weren't durably persisted.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
We use the modified list to keep track of which extents have been modified so we
know which ones are candidates for logging at fsync() time. Newly modified
extents are added to the list at modification time, around the same time the
ordered extent is created. We do this so that we don't have to wait for ordered
extents to complete before we know what we need to log. The problem is when
something like this happens
log extent 0-4k on inode 1
copy csum for 0-4k from ordered extent into log
sync log
commit transaction
log some other extent on inode 1
ordered extent for 0-4k completes and adds itself onto modified list again
log changed extents
see ordered extent for 0-4k has already been logged
at this point we assume the csum has been copied
sync log
crash
On replay we will see the extent 0-4k in the log, drop the original 0-4k extent
which is the same one that we are replaying which also drops the csum, and then
we won't find the csum in the log for that bytenr. This of course causes us to
have errors about not having csums for certain ranges of our inode. So remove
the modified list manipulation in unpin_extent_cache, any modified extents
should have been added well before now, and we don't want them re-logged. This
fixes my test that I could reliably reproduce this problem with. Thanks,
cc: stable@vger.kernel.org
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Liu Bo pointed out that my previous fix would lose the generation update in the
scenario I described. It is actually much worse than that, we could lose the
entire extent if we lose power right after the transaction commits. Consider
the following
write extent 0-4k
log extent in log tree
commit transaction
< power fail happens here
ordered extent completes
We would lose the 0-4k extent because it hasn't updated the actual fs tree, and
the transaction commit will reset the log so it isn't replayed. If we lose
power before the transaction commit we are save, otherwise we are not.
Fix this by keeping track of all extents we logged in this transaction. Then
when we go to commit the transaction make sure we wait for all of those ordered
extents to complete before proceeding. This will make sure that if we lose
power after the transaction commit we still have our data. This also fixes the
problem of the improperly updated extent generation. Thanks,
cc: stable@vger.kernel.org
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
"The biggest change is to rename the filesystem from "overlayfs" to "overlay".
This will allow legacy overlayfs to be easily carried by distros alongside the
new mainline one. Also fix a couple of copy-up races and allow escaping comma
character in filenames."
The last bit is about commas in pathname mount options...
We currently trigger BUG when VIRTIO_NET_F_CTRL_VQ
is not set but one of features depending on it is.
That's not a friendly way to report errors to
hypervisors.
Let's check, and fail probe instead.
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Cornelia Huck <cornelia.huck@de.ibm.com>
Cc: Wanlong Gao <gaowanlong@cn.fujitsu.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains two bugfixes for your net tree, they are:
1) Validate netlink group from nfnetlink to avoid an out of bound array
access. This should only happen with superuser priviledges though.
Discovered by Andrey Ryabinin using trinity.
2) Don't push ethernet header before calling the netfilter output hook
for multicast traffic, this breaks ebtables since it expects to see
skb->data pointing to the network header, patch from Linus Luessing.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
John W. Linville says:
====================
pull request: wireless 2014-11-20
Please full this little batch of fixes intended for the 3.18 stream!
For the mac80211 patch, Johannes says:
"Here's another last minute fix, for minstrel HT crashing
depending on the value of some uninitialised stack."
On top of that...
Ben Greear fixes an ath9k regression in which a BSSID mask is
miscalculated.
Dmitry Torokhov corrects an error handling routing in brcmfmac which
was checking an unsigned variable for a negative value.
Johannes Berg avoids a build problem in brcmfmac for arches where
linux/unaligned/access_ok.h and asm/unaligned.h conflict.
Mathy Vanhoef addresses another brcmfmac issue so as to eliminate a
use-after-free of the URB transfer buffer if a timeout occurs.
Please let me know if there are problems!
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Peer priority groups were being reversed, but this was missed in the previous
fix sent out for this issue.
v2 : Previous patch was doing extra unnecessary work, result is the same.
Please ignore previous patch
Fixes : ee7bc3cdc2 ('cxgb4 : dcb open-lldp interop fixes')
Signed-off-by: Anish Bhatt <anish@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This fixes an old regression introduced by commit
b0d0d915 (ipx: remove the BKL).
When a recvmsg syscall blocks waiting for new data, no data can be sent on the
same socket with sendmsg because ipx_recvmsg() sleeps with the socket locked.
This breaks mars-nwe (NetWare emulator):
- the ncpserv process reads the request using recvmsg
- ncpserv forks and spawns nwconn
- ncpserv calls a (blocking) recvmsg and waits for new requests
- nwconn deadlocks in sendmsg on the same socket
Commit b0d0d915 has simply replaced BKL locking with
lock_sock/release_sock. Unlike now, BKL got unlocked while
sleeping, so a blocking recvmsg did not block a concurrent
sendmsg.
Only keep the socket locked while actually working with the socket data and
release it prior to calling skb_recv_datagram().
Signed-off-by: Jiri Bohac <jbohac@suse.cz>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
When userspace doesn't provide a mask, OVS datapath generates a fully
unwildcarded mask for the flow by copying the flow and setting all bits
in all fields. For IPv6 label, this creates a mask that matches on the
upper 12 bits, causing the following error:
openvswitch: netlink: Invalid IPv6 flow label value (value=ffffffff, max=fffff)
This patch ignores the label validation check for masks, avoiding this
error.
Signed-off-by: Joe Stringer <joestringer@nicira.com>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
pptp_getname() only partially initializes the stack variable sa,
particularly only fills the pptp part of the sa_addr union. The code
thereby discloses 16 bytes of kernel stack memory via getsockname().
Fix this by memset(0)'ing the union before.
Cc: Dmitry Kozlov <xeb@mail.ru>
Signed-off-by: Mathias Krause <minipli@googlemail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
fix one regression and one endian issue.
* 'drm-fixes-3.18' of git://people.freedesktop.org/~agd5f/linux:
drm/radeon: fix endian swapping in vbios fetch for tdp table
drm/radeon: disable native backlight control on pre-r6xx asics (v2)
If we have two fsync()'s race on different subvols one will do all of its work
to get into the log_tree, wait on it's outstanding IO, and then allow the
log_tree to finish it's commit. The problem is we were just free'ing that
subvols logged extents instead of waiting on them, so whoever lost the race
wouldn't really have their data on disk. Fix this by waiting properly instead
of freeing the logged extents. Thanks,
cc: stable@vger.kernel.org
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
The sizes that are obtained from space infos are in raw units and have
to be adjusted according to the raid factor. This was missing for
f_bavail and df reported doubled size for raid1.
Reported-by: Martin Steigerwald <Martin@lichtvoll.de>
Fixes: ba7b6e62f4 ("btrfs: adjust statfs calculations according to raid profiles")
CC: stable@vger.kernel.org
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
This can be reproduced by fstests: btrfs/070
The scenario is like the following:
replace worker thread defrag thread
--------------------- -------------
copy_nocow_pages_worker btrfs_defrag_file
copy_nocow_pages_for_inode ...
btrfs_writepages
|A| lock_extent_bits extent_write_cache_pages
|B| lock_page
__extent_writepage
... writepage_delalloc
find_lock_delalloc_range
|B| lock_extent_bits
find_or_create_page
pagecache_get_page
|A| lock_page
This leads to an ABBA pattern deadlock. To fix it,
o we just change it to an AABB pattern which means to @unlock_extent_bits()
before we @lock_page(), and in this way the @extent_read_full_page_nolock()
is no longer in an locked context, so change it back to @extent_read_full_page()
to regain protection.
o Since we @unlock_extent_bits() earlier, then before @write_page_nocow(),
the extent may not really point at the physical block we want, so we
have to check it before write.
Signed-off-by: Gui Hecheng <guihc.fnst@cn.fujitsu.com>
Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
Replacing a xattr consists of doing a lookup for its existing value, delete
the current value from the respective leaf, release the search path and then
finally insert the new value. This leaves a time window where readers (getxattr,
listxattrs) won't see any value for the xattr. Xattrs are used to store ACLs,
so this has security implications.
This change also fixes 2 other existing issues which were:
*) Deleting the old xattr value without verifying first if the new xattr will
fit in the existing leaf item (in case multiple xattrs are packed in the
same item due to name hash collision);
*) Returning -EEXIST when the flag XATTR_CREATE is given and the xattr doesn't
exist but we have have an existing item that packs muliple xattrs with
the same name hash as the input xattr. In this case we should return ENOSPC.
A test case for xfstests follows soon.
Thanks to Alexandre Oliva for reporting the non-atomicity of the xattr replace
implementation.
Reported-by: Alexandre Oliva <oliva@gnu.org>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
We try to allocate an extent state structure before acquiring the extent
state tree's spinlock as we might need a new one later and therefore avoid
doing later an atomic allocation while holding the tree's spinlock. However
we returned -ENOMEM if that initial non-atomic allocation failed, which is
a bit excessive since we might end up not needing the pre-allocated extent
state at all - for the case where the tree doesn't have any extent states
that cover the input range and cover too any other range. Therefore don't
return -ENOMEM if that pre-allocation fails.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Our gluster boxes get several thousand statfs() calls per second, which begins
to suck hardcore with all of the lock contention on the chunk mutex and dev list
mutex. We don't really need to hold these things, if we have transient
weirdness with statfs() because of the chunk allocator we don't care, so remove
this locking.
We still need the dev_list lock if you mount with -o alloc_start however, which
is a good argument for nuking that thing from orbit, but that's a patch for
another day. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Our gluster boxes were spending lots of time in statfs because our fs'es are
huge. The problem is statfs loops through all of the block groups looking for
read only block groups, and when you have several terabytes worth of data that
ends up being a lot of block groups. Move the read only block groups onto a
read only list and only proces that list in
btrfs_account_ro_block_groups_free_space to reduce the amount of churn. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
Copy&paste errors in some messages and add few more missing macro
accessors.
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
The xfstest btrfs/014 which tests the balance operation caused that the
check_int module complained that known blocks changed their physical
location. Since this is not an error in this case, only print such
message if the verbose mode was enabled.
Reported-by: Wang Shilong <wangshilong1991@gmail.com>
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Tested-by: Wang Shilong <wangshilong1991@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
The xfstest btrfs/014 which tests the balance operation caused issues with
the check_int module. The attempt was made to use btrfs_map_block() to
find the physical location for a written block. However, this was not
at all needed since the location of the written block was known since
a hook to submit_bio() was the reason for entering the check_int module.
Additionally, after a block relocation it happened that btrfs_map_block()
failed causing misleading error messages afterwards.
This patch changes the check_int module to use the known information of
the physical location from the bio.
Reported-by: Wang Shilong <wangshilong1991@gmail.com>
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Tested-by: Wang Shilong <wangshilong1991@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
We try to allocate an extent state before acquiring the tree's spinlock
just in case we end up needing to split an existing extent state into two.
If that allocation failed, we would return -ENOMEM.
However, our only single caller (transaction/log commit code), passes in
an extent state that was cached from a call to find_first_extent_bit() and
that has a very high chance to match exactly the input range (always true
for a transaction commit and very often, but not always, true for a log
commit) - in this case we end up not needing at all that initial extent
state used for an eventual split. Therefore just don't return -ENOMEM if
we can't allocate the temporary extent state, since we might not need it
at all, and if we end up needing one, we'll do it later anyway.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Right now the only caller of find_first_extent_bit() that is interested
in caching extent states (transaction or log commit), never gets an extent
state cached. This is because find_first_extent_bit() only caches states
that have at least one of the flags EXTENT_IOBITS or EXTENT_BOUNDARY, and
the transaction/log commit caller always passes a tree that doesn't have
ever extent states with any of those flags (they can only have one of the
following flags: EXTENT_DIRTY, EXTENT_NEW or EXTENT_NEED_WAIT).
This change together with the following one in the patch series (titled
"Btrfs: avoid returning -ENOMEM in convert_extent_bit() too early") will
help reduce significantly the chances of calls to convert_extent_bit()
fail with -ENOMEM when called from the transaction/log commit code.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
When committing a transaction or a log, we look for btree extents that
need to be durably persisted by searching for ranges in a io tree that
have some bits set (EXTENT_DIRTY or EXTENT_NEW). We then attempt to clear
those bits and set the EXTENT_NEED_WAIT bit, with calls to the function
convert_extent_bit, and then start writeback for the extents.
That function however can return an error (at the moment only -ENOMEM
is possible, specially when it does GFP_ATOMIC allocation requests
through alloc_extent_state_atomic) - that means the ranges didn't got
the EXTENT_NEED_WAIT bit set (or at least not for the whole range),
which in turn means a call to btrfs_wait_marked_extents() won't find
those ranges for which we started writeback, causing a transaction
commit or a log commit to persist a new superblock without waiting
for the writeback of extents in that range to finish first.
Therefore if a crash happens after persisting the new superblock and
before writeback finishes, we have a superblock pointing to roots that
weren't fully persisted or roots that point to nodes or leafs that weren't
fully persisted, causing all sorts of unexpected/bad behaviour as we endup
reading garbage from disk or the content of some node/leaf from a past
generation that got cowed or deleted and is no longer valid (for this later
case we end up getting error messages like "parent transid verify failed on
X wanted Y found Z" when reading btree nodes/leafs from disk).
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
device replace could fail due to another running scrub process or any
other errors btrfs_scrub_dev() may hit, but this failure doesn't get
returned to userspace.
The following steps could reproduce this issue
mkfs -t btrfs -f /dev/sdb1 /dev/sdb2
mount /dev/sdb1 /mnt/btrfs
while true; do btrfs scrub start -B /mnt/btrfs >/dev/null 2>&1; done &
btrfs replace start -Bf /dev/sdb2 /dev/sdb3 /mnt/btrfs
# if this replace succeeded, do the following and repeat until
# you see this log in dmesg
# BTRFS: btrfs_scrub_dev(/dev/sdb2, 2, /dev/sdb3) failed -115
#btrfs replace start -Bf /dev/sdb3 /dev/sdb2 /mnt/btrfs
# once you see the error log in dmesg, check return value of
# replace
echo $?
Introduce a new dev replace result
BTRFS_IOCTL_DEV_REPLACE_RESULT_SCRUB_INPROGRESS
to catch -EINPROGRESS explicitly and return other errors directly to
userspace.
Signed-off-by: Eryu Guan <guaneryu@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
size of @btrfsic_state needs more than 2M, it is very likely to
fail allocating memory using kzalloc(). see following mesage:
[91428.902148] Call Trace:
[<ffffffff816f6e0f>] dump_stack+0x4d/0x66
[<ffffffff811b1c7f>] warn_alloc_failed+0xff/0x170
[<ffffffff811b66e1>] __alloc_pages_nodemask+0x951/0xc30
[<ffffffff811fd9da>] alloc_pages_current+0x11a/0x1f0
[<ffffffff811b1e0b>] ? alloc_kmem_pages+0x3b/0xf0
[<ffffffff811b1e0b>] alloc_kmem_pages+0x3b/0xf0
[<ffffffff811d1018>] kmalloc_order+0x18/0x50
[<ffffffff811d1074>] kmalloc_order_trace+0x24/0x140
[<ffffffffa06c097b>] btrfsic_mount+0x8b/0xae0 [btrfs]
[<ffffffff810af555>] ? check_preempt_curr+0x85/0xa0
[<ffffffff810b2de3>] ? try_to_wake_up+0x103/0x430
[<ffffffffa063d200>] open_ctree+0x1bd0/0x2130 [btrfs]
[<ffffffffa060fdde>] btrfs_mount+0x62e/0x8b0 [btrfs]
[<ffffffff811fd9da>] ? alloc_pages_current+0x11a/0x1f0
[<ffffffff811b0a5e>] ? __get_free_pages+0xe/0x50
[<ffffffff81230429>] mount_fs+0x39/0x1b0
[<ffffffff812509fb>] vfs_kern_mount+0x6b/0x150
[<ffffffff812537fb>] do_mount+0x27b/0xc30
[<ffffffff811b0a5e>] ? __get_free_pages+0xe/0x50
[<ffffffff812544f6>] SyS_mount+0x96/0xf0
[<ffffffff81701970>] system_call_fastpath+0x16/0x1b
Since we are allocating memory for hash table array, so
it will be good if we could allocate continuous pages here.
Fix this problem by firstly trying kzalloc(), if we fail,
use vzalloc() instead.
Signed-off-by: Wang Shilong <wangshilong1991@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
If cow_file_range_inline() failed, when called from compress_file_range(),
we were tagging the locked page for writeback, end its writeback and unlock it,
but not marking it with an error nor setting AS_EIO in inode's mapping flags.
This made it impossible for a caller of filemap_fdatawrite_range (writepages)
or filemap_fdatawait_range() to know that an error happened. And the return
value of compress_file_range() is useless because it's returned to a workqueue
task and not to the task calling filemap_fdatawrite_range (writepages).
This change applies on top of the previous patchset starting at the patch
titled:
"[1/5] Btrfs: set page and mapping error on compressed write failure"
Which changed extent_clear_unlock_delalloc() to use SetPageError and
mapping_set_error().
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
To avoid duplicating this double filemap_fdatawrite_range() call for
inodes with async extents (compressed writes) so often.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
For compressed writes, after doing the first filemap_fdatawrite_range() we
don't get the pages tagged for writeback immediately. Instead we create
a workqueue task, which is run by other kthread, and keep the pages locked.
That other kthread compresses data, creates the respective ordered extent/s,
tags the pages for writeback and unlocks them. Therefore we need a second
call to filemap_fdatawrite_range() if we have compressed writes, as this
second call will wait for the pages to become unlocked, then see they became
tagged for writeback and finally wait for the writeback to finish.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Its return value is useless, its single caller ignores it and can't do
anything with it anyway, since it's a workqueue task and not the task
calling filemap_fdatawrite_range (writepages) nor filemap_fdatawait_range().
Failure is communicated to such functions via start and end of writeback
with the respective pages tagged with an error and AS_EIO flag set in the
inode's imapping.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Steps to reproduce:
# mkfs.btrfs -f /dev/sdb
# mount -t btrfs /dev/sdb /mnt -o compress=lzo
# dd if=/dev/zero of=/mnt/data bs=$((33*4096)) count=1
after previous steps, inode will be detected as bad compression ratio,
and NOCOMPRESS flag will be set for that inode.
Reason is that compress have a max limit pages every time(128K), if a
132k write in, it will be splitted into two write(128k+4k), this bug
is a leftover for commit 68bb462d42a(Btrfs: don't compress for a small write)
Fix this problem by checking every time before compression, if it is a
small write(<=blocksize), we bail out and fall into nocompression directly.
Signed-off-by: Wang Shilong <wangshilong1991@gmail.com>
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
Our compressed bio write end callback was essentially ignoring the error
parameter. When a write error happens, it must pass a value of 0 to the
inode's write_page_end_io_hook callback, SetPageError on the respective
pages and set AS_EIO in the inode's mapping flags, so that a call to
filemap_fdatawait_range() / filemap_fdatawait() can find out that errors
happened (we surely don't want silent failures on fsync for example).
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Its return value is completely ignored by its single caller and it's
useless anyway, since errors are indicated through SetPageError and
the bit AS_EIO set in the flags of the inode's mapping. The caller
can't do anything with the value, as it's invoked from a workqueue
task and not by the task calling filemap_fdatawrite_range (which calls
the writepages address space callback, which in turn calls the inode's
fill_delalloc callback).
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
If we had an error when processing one of the async extents from our list,
we were not processing the remaining async extents, meaning we would leak
those async_extent structs, never release the pages with the compressed
data and never unlock and clear the dirty flag from the inode's pages (those
that correspond to the uncompressed content).
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
In inode.c:submit_compressed_extents(), if we fail before calling
btrfs_submit_compressed_write(), or when that function fails, we
were freeing the async_extent structure without releasing its pages
and freeing the pages array.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
In inode.c:submit_compressed_extents(), before calling btrfs_submit_compressed_write()
we start writeback for all pages, clear their dirty flag, unlock them, etc, but if
btrfs_submit_compressed_write() fails (at the moment it can only fail with -ENOMEM),
we never end the writeback on the pages, so any filemap_fdatawait_range() call will
hang forever. We were also not calling the writepage end io hook, which means the
corresponding ordered extent will never complete and all its waiters will block
forever, such as a full fsync (via btrfs_wait_ordered_range()).
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
If we fail in submit_compressed_extents() before calling btrfs_submit_compressed_write(),
we start and end the writeback for the pages (clear their dirty flag, unlock them, etc)
but we don't tag the pages, nor the inode's mapping, with an error. This makes it
impossible for a caller of filemap_fdatawait_range() (fsync, or transaction commit
for e.g.) know that there was an error.
Note that the return value of submit_compressed_extents() is useless, as that function
is executed by a workqueue task and not directly by the fill_delalloc callback. This
means the writepage/s callbacks of the inode's address space operations don't get that
return value.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>