Pull security subsystem updates from James Morris:
"Highlights:
- Integrity: add local fs integrity verification to detect offline
attacks
- Integrity: add digital signature verification
- Simple stacking of Yama with other LSMs (per LSS discussions)
- IBM vTPM support on ppc64
- Add new driver for Infineon I2C TIS TPM
- Smack: add rule revocation for subject labels"
Fixed conflicts with the user namespace support in kernel/auditsc.c and
security/integrity/ima/ima_policy.c.
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (39 commits)
Documentation: Update git repository URL for Smack userland tools
ima: change flags container data type
Smack: setprocattr memory leak fix
Smack: implement revoking all rules for a subject label
Smack: remove task_wait() hook.
ima: audit log hashes
ima: generic IMA action flag handling
ima: rename ima_must_appraise_or_measure
audit: export audit_log_task_info
tpm: fix tpm_acpi sparse warning on different address spaces
samples/seccomp: fix 31 bit build on s390
ima: digital signature verification support
ima: add support for different security.ima data types
ima: add ima_inode_setxattr/removexattr function and calls
ima: add inode_post_setattr call
ima: replace iint spinblock with rwlock/read_lock
ima: allocating iint improvements
ima: add appraise action keywords and default rules
ima: integrity appraisal extension
vfs: move ima_file_free before releasing the file
...
* Error reporting and debug printing improvements
* Power cut emulation fixes
* Minor cleanups
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
iQIcBAABAgAGBQJQaumRAAoJECmIfjd9wqK016UQAImRrzhPoegy2Pr7j/CpzUST
wJru3QlgdHgjukWer/s4CGTTjmTLMgLAllUnJLO3yn+Po4o67RJxI4SjhMf5lwDG
TSmB/ttQpRmYxaaS2EOC6Ow+jdMM2MhAsPrD8nAfLvOm/XduuSyquPY/xFa6U4rJ
oE4w23b5t2BZI/UJmsCaYs6Rj1h8w1aZUukii+J9BN8DGygn/vJuI9EDNKdtQmxe
vmoT2uWKPyL859kY/+lONH048NWMkEB3BhNmuAsFqGY/tdQRdmeyhgWEKNjbmc/M
DXZe7ovy5umUyw6iDpfhEhmOBdV9xorDySsmKNP1/q60rUGD6R/tynC4y1q1IueB
c6fbry6xpwKIccMS/w7x9cbZxMnG++iQoj71pr2FV9Bg5Xd7XduM/XjzI5rMHuqv
EnmcwT/LIu5Lw1vMU8pfW2yIdTkotNR/2aca1bs0f4WtS3AhngdqmNbFmddcpY37
qvTYDrSMOW/IaegiDQzucVXNhIs8ufIULZzmj9CtmgeyvNVJboTJRik+HJDWFqeC
04TaM8k+Yjp/isbzTmWEyBG3G06cNpMwLtT5OKcpRVQ1ZYu62AbnR+Dbep6n0qt+
gBORW8TTY32GudeSiHLKCrOsrkx3zNSli6T4Sw9YD5e29Dee62KRTgjKVplezQfB
Djh70vUyX2tHdYlW88gf
=gM4M
-----END PGP SIGNATURE-----
Merge tag 'upstream-3.7-rc1' of git://git.infradead.org/linux-ubifs
Pull ubifs changes from Artem Bityutskiy:
"No big changes for 3.7 in UBIFS:
- Error reporting and debug printing improvements
- Power cut emulation fixes
- Minor cleanups"
Fix trivial conflict in fs/ubifs/debug.c due to the user namespace
changes.
* tag 'upstream-3.7-rc1' of git://git.infradead.org/linux-ubifs:
UBIFS: print less
UBIFS: use pr_ helper instead of printk
UBIFS: comply with coding style
UBIFS: use __aligned() attribute
UBIFS: remove __DATE__ and __TIME__
UBIFS: fix power cut emulation for mtdram
UBIFS: improve scanning debug output
UBIFS: always print full error reports
UBIFS: print PID in debug messages
Several enhancements and cleanups:
- make inode32 and inode64 remountable options
- SEEK_HOLE/SEEK_DATA enhancements
- cleanup struct declarations in xfs_mount.h
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.10 (GNU/Linux)
iQIcBAABAgAGBQJQayQwAAoJENaLyazVq6ZO2H0P/RiWpYlVLP0ZNbxUOvL33WFy
xOHwNkhRG23Z0sgDfrUJjDibzOuSmf/XothsChGPE4Mm8JLCjZKzxR77ySP+pbC5
sylrciXl5E2tY3wtW8oh2kQORLRneqA6qXNCrOtYklqwSmB0qx7bf9OdjL/ALiuC
SvTAn/PhyVZqGJgVZyEAc63CsbSzK2XmLgyfeQeA7JbSptbg5fDkFYrbNE+zr+yW
S1umMsuZW+zwDjzRq0lGkLcYOH3fD0JpMrfqvtQsk3+SIDlq1HSaObD63dhOJVUV
LUpOqgXCPEC3nOL3Dh+i8gv5OTRLJW0P2zYuAY++kqYnuRsYU18tcfdd0loTGeHV
LulurSIMBJV19Qx1K3C3E8KPwbwNwNlvmpEbgXy00Qet7LTPbofcUXDi3mcV6ozQ
SMGQh42VGV4S8wEjAp93wLva7xLf6enV/X6Stuzy/ec8HN8K2tYMyYyGvSxZbjmq
9JTxCulDvbr7EnSDqCkH1inlQ4cXvBSzOi2trDpQbKq0cuYbqlzgL2sloBv+y+CB
4fTWEeeisQ05VhHU8Yp3bsVtiyUjGGcbOgvqM+NHmmh1HuHUmiBnXh7k3RfaLfBi
E4RprFT0Yrpo16mt3Fn0GjkhY87DYHBGIehFOlu13zKPc5q+IBTwIjSugxoa23EL
uXTtFFHMxR4pUV/An5Wf
=G6gd
-----END PGP SIGNATURE-----
Merge tag 'for-linus-v3.7-rc1' of git://oss.sgi.com/xfs/xfs
Pull xfs update from Ben Myers:
"Several enhancements and cleanups:
- make inode32 and inode64 remountable options
- SEEK_HOLE/SEEK_DATA enhancements
- cleanup struct declarations in xfs_mount.h"
* tag 'for-linus-v3.7-rc1' of git://oss.sgi.com/xfs/xfs:
xfs: Make inode32 a remountable option
xfs: add inode64->inode32 transition into xfs_set_inode32()
xfs: Fix mp->m_maxagi update during inode64 remount
xfs: reduce code duplication handling inode32/64 options
xfs: make inode64 as the default allocation mode
xfs: Fix m_agirotor reset during AG selection
Make inode64 a remountable option
xfs: stop the sync worker before xfs_unmountfs
xfs: xfs_seek_hole() refinement with hole searching from page cache for unwritten extents
xfs: xfs_seek_data() refinement with unwritten extents check up from page cache
xfs: Introduce a helper routine to probe data or hole offset from page cache
xfs: Remove type argument from xfs_seek_data()/xfs_seek_hole()
xfs: fix race while discarding buffers [V4]
xfs: check for possible overflow in xfs_ioc_trim
xfs: unlock the AGI buffer when looping in xfs_dialloc
xfs: kill struct declarations in xfs_mount.h
xfs: fix uninitialised variable in xfs_rtbuf_get()
Pull vfs update from Al Viro:
- big one - consolidation of descriptor-related logics; almost all of
that is moved to fs/file.c
(BTW, I'm seriously tempted to rename the result to fd.c. As it is,
we have a situation when file_table.c is about handling of struct
file and file.c is about handling of descriptor tables; the reasons
are historical - file_table.c used to be about a static array of
struct file we used to have way back).
A lot of stray ends got cleaned up and converted to saner primitives,
disgusting mess in android/binder.c is still disgusting, but at least
doesn't poke so much in descriptor table guts anymore. A bunch of
relatively minor races got fixed in process, plus an ext4 struct file
leak.
- related thing - fget_light() partially unuglified; see fdget() in
there (and yes, it generates the code as good as we used to have).
- also related - bits of Cyrill's procfs stuff that got entangled into
that work; _not_ all of it, just the initial move to fs/proc/fd.c and
switch of fdinfo to seq_file.
- Alex's fs/coredump.c spiltoff - the same story, had been easier to
take that commit than mess with conflicts. The rest is a separate
pile, this was just a mechanical code movement.
- a few misc patches all over the place. Not all for this cycle,
there'll be more (and quite a few currently sit in akpm's tree)."
Fix up trivial conflicts in the android binder driver, and some fairly
simple conflicts due to two different changes to the sock_alloc_file()
interface ("take descriptor handling from sock_alloc_file() to callers"
vs "net: Providing protocol type via system.sockprotoname xattr of
/proc/PID/fd entries" adding a dentry name to the socket)
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (72 commits)
MAX_LFS_FILESIZE should be a loff_t
compat: fs: Generic compat_sys_sendfile implementation
fs: push rcu_barrier() from deactivate_locked_super() to filesystems
btrfs: reada_extent doesn't need kref for refcount
coredump: move core dump functionality into its own file
coredump: prevent double-free on an error path in core dumper
usb/gadget: fix misannotations
fcntl: fix misannotations
ceph: don't abuse d_delete() on failure exits
hypfs: ->d_parent is never NULL or negative
vfs: delete surplus inode NULL check
switch simple cases of fget_light to fdget
new helpers: fdget()/fdput()
switch o2hb_region_dev_write() to fget_light()
proc_map_files_readdir(): don't bother with grabbing files
make get_file() return its argument
vhost_set_vring(): turn pollstart/pollstop into bool
switch prctl_set_mm_exe_file() to fget_light()
switch xfs_find_handle() to fget_light()
switch xfs_swapext() to fget_light()
...
This function is used by sparc, powerpc and arm64 for compat support.
The patch adds a generic implementation which calls do_sendfile()
directly and avoids set_fs().
The sparc architecture has wrappers for the sign extensions while
powerpc relies on the compiler to do the this. The patch adds wrappers
for powerpc to handle the u32->int type conversion.
compat_sys_sendfile64() can be replaced by a sys_sendfile() call since
compat_loff_t has the same size as off_t on a 64-bit system.
On powerpc, the patch also changes the 64-bit sendfile call from
sys_sendile64 to sys_sendfile.
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: David S. Miller <davem@davemloft.net>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
There's no reason to call rcu_barrier() on every
deactivate_locked_super(). We only need to make sure that all delayed rcu
free inodes are flushed before we destroy related cache.
Removing rcu_barrier() from deactivate_locked_super() affects some fast
paths. E.g. on my machine exit_group() of a last process in IPC
namespace takes 0.07538s. rcu_barrier() takes 0.05188s of that time.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
All increments and decrements are under the same spinlock - have to be,
since they need to protect the radix_tree it's found in. Just use
int, no need to wank with kref...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This prepares for making core dump functionality optional.
The variable "suid_dumpable" and associated functions are left in fs/exec.c
because they're used elsewhere, such as in ptrace.
Signed-off-by: Alex Kelly <alex.page.kelly@gmail.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
We currently make no distinction in attribute requests between normal OPENs
and OPEN with CLAIM_PREVIOUS. This offers more possibility of failures in
the GETATTR response which foils OPEN reclaim attempts.
Reduce the requested attributes to the bare minimum needed to update the
reclaim open stateid and split nfs4_opendata_to_nfs4_state processing
accordingly.
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Pull input updates from Dmitry Torokhov:
"A few drivers were updated with device tree bindings and others got a
few small cleanups and fixes."
Fix trivial conflict in drivers/input/keyboard/omap-keypad.c due to
changes clashing with a whitespace cleanup.
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input: (28 commits)
Input: wacom - mark Intuos5 pad as in-prox when touching buttons
Input: synaptics - adjust threshold for treating position values as negative
Input: hgpk - use %*ph to dump small buffer
Input: gpio_keys_polled - fix dt pdata->nbuttons
Input: Add KD[GS]KBDIACRUC ioctls to the compatible list
Input: omap-keypad - fixed formatting
Input: tegra - move platform data header
Input: wacom - add support for EMR on Cintiq 24HD touch
Input: s3c2410_ts - make s3c_ts_pmops const
Input: samsung-keypad - use of_get_child_count() helper
Input: samsung-keypad - use of_match_ptr()
Input: uinput - fix formatting
Input: uinput - specify exact bit sizes on userspace APIs
Input: uinput - mark failed submission requests as free
Input: uinput - fix race that can block nonblocking read
Input: uinput - return -EINVAL when read buffer size is too small
Input: uinput - take event lock when fetching events from buffer
Input: get rid of MATCH_BIT() macro
Input: rotary-encoder - add DT bindings
Input: rotary-encoder - constify platform data pointers
...
Don't put an ACCESS op in OPEN compound if O_EXCL, because ACCESS
will return permission denied for all bits until close.
Fixes a regression due to commit 6168f62c (NFSv4: Add ACCESS operation to
OPEN compound)
Signed-off-by: Weston Andros Adamson <dros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Don't check MAY_WRITE as a newly created file may not have write mode bits,
but POSIX allows the creating process to write regardless.
This is ok because NFSv4 OPEN ops handle write permissions correctly -
the ACCESS in the OPEN compound is to differentiate READ v EXEC permissions.
Fixes a regression due to commit 6168f62c (NFSv4: Add ACCESS operation to
OPEN compound)
Signed-off-by: Weston Andros Adamson <dros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Pull networking changes from David Miller:
1) GRE now works over ipv6, from Dmitry Kozlov.
2) Make SCTP more network namespace aware, from Eric Biederman.
3) TEAM driver now works with non-ethernet devices, from Jiri Pirko.
4) Make openvswitch network namespace aware, from Pravin B Shelar.
5) IPV6 NAT implementation, from Patrick McHardy.
6) Server side support for TCP Fast Open, from Jerry Chu and others.
7) Packet BPF filter supports MOD and XOR, from Eric Dumazet and Daniel
Borkmann.
8) Increate the loopback default MTU to 64K, from Eric Dumazet.
9) Use a per-task rather than per-socket page fragment allocator for
outgoing networking traffic. This benefits processes that have very
many mostly idle sockets, which is quite common.
From Eric Dumazet.
10) Use up to 32K for page fragment allocations, with fallbacks to
smaller sizes when higher order page allocations fail. Benefits are
a) less segments for driver to process b) less calls to page
allocator c) less waste of space.
From Eric Dumazet.
11) Allow GRO to be used on GRE tunnels, from Eric Dumazet.
12) VXLAN device driver, one way to handle VLAN issues such as the
limitation of 4096 VLAN IDs yet still have some level of isolation.
From Stephen Hemminger.
13) As usual there is a large boatload of driver changes, with the scale
perhaps tilted towards the wireless side this time around.
Fix up various fairly trivial conflicts, mostly caused by the user
namespace changes.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1012 commits)
hyperv: Add buffer for extended info after the RNDIS response message.
hyperv: Report actual status in receive completion packet
hyperv: Remove extra allocated space for recv_pkt_list elements
hyperv: Fix page buffer handling in rndis_filter_send_request()
hyperv: Fix the missing return value in rndis_filter_set_packet_filter()
hyperv: Fix the max_xfer_size in RNDIS initialization
vxlan: put UDP socket in correct namespace
vxlan: Depend on CONFIG_INET
sfc: Fix the reported priorities of different filter types
sfc: Remove EFX_FILTER_FLAG_RX_OVERRIDE_IP
sfc: Fix loopback self-test with separate_tx_channels=1
sfc: Fix MCDI structure field lookup
sfc: Add parentheses around use of bitfield macro arguments
sfc: Fix null function pointer in efx_sriov_channel_type
vxlan: virtual extensible lan
igmp: export symbol ip_mc_leave_group
netlink: add attributes to fdb interface
tg3: unconditionally select HWMON support when tg3 is enabled.
Revert "net: ti cpsw ethernet: allow reading phy interface mode from DT"
gre: fix sparse warning
...
This prevents a null pointer dereference when
nfs_idmap_complete_pipe_upcall_locked() calls complete_request_key().
Fixes a regression caused by commit 0cac12023 (NFSv4: Ensure that
idmap_pipe_downcall sanity-checks the downcall data).
Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Since the addition of NFSv4 server trunking detection the mount context
calls nfs4_proc_exchange_id then schedules the state manager, which also
calls nfs4_proc_exchange_id. Setting the NFS4CLNT_LEASE_CONFIRM bit
makes the state manager skip the unneeded EXCHANGE_ID and continue on
with session creation.
Reported-by: Jorge Mora <mora@netapp.com>
Signed-off-by: Weston Andros Adamson <dros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Pull user namespace changes from Eric Biederman:
"This is a mostly modest set of changes to enable basic user namespace
support. This allows the code to code to compile with user namespaces
enabled and removes the assumption there is only the initial user
namespace. Everything is converted except for the most complex of the
filesystems: autofs4, 9p, afs, ceph, cifs, coda, fuse, gfs2, ncpfs,
nfs, ocfs2 and xfs as those patches need a bit more review.
The strategy is to push kuid_t and kgid_t values are far down into
subsystems and filesystems as reasonable. Leaving the make_kuid and
from_kuid operations to happen at the edge of userspace, as the values
come off the disk, and as the values come in from the network.
Letting compile type incompatible compile errors (present when user
namespaces are enabled) guide me to find the issues.
The most tricky areas have been the places where we had an implicit
union of uid and gid values and were storing them in an unsigned int.
Those places were converted into explicit unions. I made certain to
handle those places with simple trivial patches.
Out of that work I discovered we have generic interfaces for storing
quota by projid. I had never heard of the project identifiers before.
Adding full user namespace support for project identifiers accounts
for most of the code size growth in my git tree.
Ultimately there will be work to relax privlige checks from
"capable(FOO)" to "ns_capable(user_ns, FOO)" where it is safe allowing
root in a user names to do those things that today we only forbid to
non-root users because it will confuse suid root applications.
While I was pushing kuid_t and kgid_t changes deep into the audit code
I made a few other cleanups. I capitalized on the fact we process
netlink messages in the context of the message sender. I removed
usage of NETLINK_CRED, and started directly using current->tty.
Some of these patches have also made it into maintainer trees, with no
problems from identical code from different trees showing up in
linux-next.
After reading through all of this code I feel like I might be able to
win a game of kernel trivial pursuit."
Fix up some fairly trivial conflicts in netfilter uid/git logging code.
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (107 commits)
userns: Convert the ufs filesystem to use kuid/kgid where appropriate
userns: Convert the udf filesystem to use kuid/kgid where appropriate
userns: Convert ubifs to use kuid/kgid
userns: Convert squashfs to use kuid/kgid where appropriate
userns: Convert reiserfs to use kuid and kgid where appropriate
userns: Convert jfs to use kuid/kgid where appropriate
userns: Convert jffs2 to use kuid and kgid where appropriate
userns: Convert hpfs to use kuid and kgid where appropriate
userns: Convert btrfs to use kuid/kgid where appropriate
userns: Convert bfs to use kuid/kgid where appropriate
userns: Convert affs to use kuid/kgid wherwe appropriate
userns: On alpha modify linux_to_osf_stat to use convert from kuids and kgids
userns: On ia64 deal with current_uid and current_gid being kuid and kgid
userns: On ppc convert current_uid from a kuid before printing.
userns: Convert s390 getting uid and gid system calls to use kuid and kgid
userns: Convert s390 hypfs to use kuid and kgid where appropriate
userns: Convert binder ipc to use kuids
userns: Teach security_path_chown to take kuids and kgids
userns: Add user namespace support to IMA
userns: Convert EVM to deal with kuids and kgids in it's hmac computation
...
Pull cgroup updates from Tejun Heo:
- xattr support added. The implementation is shared with tmpfs. The
usage is restricted and intended to be used to manage per-cgroup
metadata by system software. tmpfs changes are routed through this
branch with Hugh's permission.
- cgroup subsystem ID handling simplified.
* 'for-3.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup: Define CGROUP_SUBSYS_COUNT according the configuration
cgroup: Assign subsystem IDs during compile time
cgroup: Do not depend on a given order when populating the subsys array
cgroup: Wrap subsystem selection macro
cgroup: Remove CGROUP_BUILTIN_SUBSYS_COUNT
cgroup: net_prio: Do not define task_netpioidx() when not selected
cgroup: net_cls: Do not define task_cls_classid() when not selected
cgroup: net_cls: Move sock_update_classid() declaration to cls_cgroup.h
cgroup: trivial fixes for Documentation/cgroups/cgroups.txt
xattr: mark variable as uninitialized to make both gcc and smatch happy
fs: add missing documentation to simple_xattr functions
cgroup: add documentation on extended attributes usage
cgroup: rename subsys_bits to subsys_mask
cgroup: add xattr support
cgroup: revise how we re-populate root directory
xattr: extract simple_xattr code from tmpfs
Pull workqueue changes from Tejun Heo:
"This is workqueue updates for v3.7-rc1. A lot of activities this
round including considerable API and behavior cleanups.
* delayed_work combines a timer and a work item. The handling of the
timer part has always been a bit clunky leading to confusing
cancelation API with weird corner-case behaviors. delayed_work is
updated to use new IRQ safe timer and cancelation now works as
expected.
* Another deficiency of delayed_work was lack of the counterpart of
mod_timer() which led to cancel+queue combinations or open-coded
timer+work usages. mod_delayed_work[_on]() are added.
These two delayed_work changes make delayed_work provide interface
and behave like timer which is executed with process context.
* A work item could be executed concurrently on multiple CPUs, which
is rather unintuitive and made flush_work() behavior confusing and
half-broken under certain circumstances. This problem doesn't
exist for non-reentrant workqueues. While non-reentrancy check
isn't free, the overhead is incurred only when a work item bounces
across different CPUs and even in simulated pathological scenario
the overhead isn't too high.
All workqueues are made non-reentrant. This removes the
distinction between flush_[delayed_]work() and
flush_[delayed_]_work_sync(). The former is now as strong as the
latter and the specified work item is guaranteed to have finished
execution of any previous queueing on return.
* In addition to the various bug fixes, Lai redid and simplified CPU
hotplug handling significantly.
* Joonsoo introduced system_highpri_wq and used it during CPU
hotplug.
There are two merge commits - one to pull in IRQ safe timer from
tip/timers/core and the other to pull in CPU hotplug fixes from
wq/for-3.6-fixes as Lai's hotplug restructuring depended on them."
Fixed a number of trivial conflicts, but the more interesting conflicts
were silent ones where the deprecated interfaces had been used by new
code in the merge window, and thus didn't cause any real data conflicts.
Tejun pointed out a few of them, I fixed a couple more.
* 'for-3.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: (46 commits)
workqueue: remove spurious WARN_ON_ONCE(in_irq()) from try_to_grab_pending()
workqueue: use cwq_set_max_active() helper for workqueue_set_max_active()
workqueue: introduce cwq_set_max_active() helper for thaw_workqueues()
workqueue: remove @delayed from cwq_dec_nr_in_flight()
workqueue: fix possible stall on try_to_grab_pending() of a delayed work item
workqueue: use hotcpu_notifier() for workqueue_cpu_down_callback()
workqueue: use __cpuinit instead of __devinit for cpu callbacks
workqueue: rename manager_mutex to assoc_mutex
workqueue: WORKER_REBIND is no longer necessary for idle rebinding
workqueue: WORKER_REBIND is no longer necessary for busy rebinding
workqueue: reimplement idle worker rebinding
workqueue: deprecate __cancel_delayed_work()
workqueue: reimplement cancel_delayed_work() using try_to_grab_pending()
workqueue: use mod_delayed_work() instead of __cancel + queue
workqueue: use irqsafe timer for delayed_work
workqueue: clean up delayed_work initializers and add missing one
workqueue: make deferrable delayed_work initializer names consistent
workqueue: cosmetic whitespace updates for macro definitions
workqueue: deprecate system_nrt[_freezable]_wq
workqueue: deprecate flush[_delayed]_work_sync()
...
Sparse identified an execution path in nfs41_walk_client_list()
where the nfs_client_lock is not re-acquired before taking the next
loop iteration.
fs/nfs/nfs4client.c:437:9: sparse: context imbalance in
'nfs41_walk_client_list' - different lock contexts for basic block
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
If the layoutget call returns a stateid error, we want to invalidate the
layout stateid, and/or recover the open stateid.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Sparse warning:
fs/nfs/nfs4getroot.c:11:5: warning: symbol 'nfs4_get_rootfh' was not declared.
Should it be static?
Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Sparse warnings:
fs/nfs/nfs4sysctl.c:56:5: warning: symbol 'nfs4_register_sysctl' was not
declared. Should it be static?
fs/nfs/nfs4sysctl.c:64:6: warning: symbol 'nfs4_unregister_sysctl' was not
declared. Should it be static?
Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Sparse warning:
fs/nfs/super.c:2517:15: warning: symbol 'nfs_xdev_mount' was not declared.
Should it be static?
Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Sparse warning:
fs/nfs/super.c:2638:16: sparse: symbol 'nfs_callback_tcpport' was not
declared. Should it be static?
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
We should reclaim reboot state when the clientid is stale.
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The current spaghetti code confuses some versions of gcc (and just
looks ugly as hell)! Clean up...
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
There are some unnecessary semicolons in function find_nfs_version. Just remove them.
Signed-off-by: Yanchuan Nian <ycnian@gmail.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Using list_move_tail() instead of list_del() + list_add_tail().
spatch with a semantic match is used to found this problem.
(http://coccinelle.lip6.fr/)
Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
For DIO writes, if it is not blocksize aligned, we need to do
internal serialization. It may slow down writers anyway. So we
just bail them out and resend to MDS.
Cc: stable <stable@vger.kernel.org> [since v3.4]
Signed-off-by: Peng Tao <tao.peng@emc.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
For DIO read, if it is not sector aligned, we should reject it
and resend via MDS. Otherwise there might be data corruption.
Also teach bl_read_pagelist to handle partial page reads for DIO.
Cc: stable <stable@vger.kernel.org> [since v3.4]
Signed-off-by: Peng Tao <tao.peng@emc.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
If applications use flock to protect its write range, generic NFS
will not do read-modify-write cycle at page cache level. Therefore
LD should know how to handle non-sector aligned writes. Otherwise
there will be data corruption.
Cc: stable <stable@vger.kernel.org>
Signed-off-by: Peng Tao <tao.peng@emc.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
This reverts commit 159e0561e3, in favor
of a more complete fix to the alignment issue.
Signed-off-by: Peng Tao <tao.peng@emc.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
After commit e38eb650 (NFS: set_pnfs_layoutdriver() from
nfs4_proc_fsinfo()), set_pnfs_layoutdriver() is called inside
nfs4_proc_fsinfo(), but pnfs_blksize is not set. It causes setting
blocklayoutdriver failure and pnfsblock mount failure.
Cc: stable <stable@vger.kernel.org> [since v3.5]
Signed-off-by: Peng Tao <tao.peng@emc.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
pnfs_within_mdsthreshold() is called inside pg_init. We need to set
read_io/write_io before that. Otherwise we fail pnfs_within_mdsthreshold()
and IO goes to MDS.
A simple test case:
dd if=foo of=/mnt/pnfs/bar bs=10M count=1 oflag=direct
Signed-off-by: Peng Tao <tao.peng@emc.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
An optional boot parameter is introduced to allow client
administrators to specify a string that the Linux NFS client can
insert into its nfs_client_id4 id string, to make it both more
globally unique, and to ensure that it doesn't change even if the
client's nodename changes.
If this boot parameter is not specified, the client's nodename is
used, as before.
Client installation procedures can create a unique string (typically,
a UUID) which remains unchanged during the lifetime of that client
instance. This works just like creating a UUID for the label of the
system's root and boot volumes.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
"Server trunking" is a fancy named for a multi-homed NFS server.
Trunking might occur if a client sends NFS requests for a single
workload to multiple network interfaces on the same server. There
are some implications for NFSv4 state management that make it useful
for a client to know if a single NFSv4 server instance is
multi-homed. (Note this is only a consideration for NFSv4, not for
legacy versions of NFS, which are stateless).
If a client cares about server trunking, no NFSv4 operations can
proceed until that client determines who it is talking to. Thus
server IP trunking discovery must be done when the client first
encounters an unfamiliar server IP address.
The nfs_get_client() function walks the nfs_client_list and matches
on server IP address. The outcome of that walk tells us immediately
if we have an unfamiliar server IP address. It invokes
nfs_init_client() in this case. Thus, nfs4_init_client() is a good
spot to perform trunking discovery.
Discovery requires a client to establish a fresh client ID, so our
client will now send SETCLIENTID or EXCHANGE_ID as the first NFS
operation after a successful ping, rather than waiting for an
application to perform an operation that requires NFSv4 state.
The exact process for detecting trunking is different for NFSv4.0 and
NFSv4.1, so a minorversion-specific init_client callout method is
introduced.
CLID_INUSE recovery is important for the trunking discovery process.
CLID_INUSE is a sign the server recognizes the client's nfs_client_id4
id string, but the client is using the wrong principal this time for
the SETCLIENTID operation. The SETCLIENTID must be retried with a
series of different principals until one works, and then the rest of
trunking discovery can proceed.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Currently, when identifying itself to NFS servers, the Linux NFS
client uses a unique nfs_client_id4.id string for each server IP
address it talks with. For example, when client A talks to server X,
the client identifies itself using a string like "AX". The
requirements for these strings are specified in detail by RFC 3530
(and bis).
This form of client identification presents a problem for Transparent
State Migration. When client A's state on server X is migrated to
server Y, it continues to be associated with string "AX." But,
according to the rules of client string construction above, client
A will present string "AY" when communicating with server Y.
Server Y thus has no way to know that client A should be associated
with the state migrated from server X. "AX" is all but abandoned,
interfering with establishing fresh state for client A on server Y.
To support transparent state migration, then, NFSv4.0 clients must
instead use the same nfs_client_id4.id string to identify themselves
to every NFS server; something like "A".
Now a client identifies itself as "A" to server X. When a file
system on server X transitions to server Y, and client A identifies
itself as "A" to server Y, Y will know immediately that the state
associated with "A," whether it is native or migrated, is owned by
the client, and can merge both into a single lease.
As a pre-requisite to adding support for NFSv4 migration to the Linux
NFS client, this patch changes the way Linux identifies itself to NFS
servers via the SETCLIENTID (NFSv4 minor version 0) and EXCHANGE_ID
(NFSv4 minor version 1) operations.
In addition to removing the server's IP address from nfs_client_id4,
the Linux NFS client will also no longer use its own source IP address
as part of the nfs_client_id4 string. On multi-homed clients, the
value of this address depends on the address family and network
routing used to contact the server, thus it can be different for each
server.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Currently, the Linux client uses a unique nfs_client_id4.id string
when identifying itself to distinct NFS servers.
To support transparent state migration, the Linux client will have to
use the same nfs_client_id4 string for all servers it communicates
with (also known as the "uniform client string" approach). Otherwise
NFS servers can not recognize that open and lock state need to be
merged after a file system transition.
Unfortunately, there are some NFSv4.0 servers currently in the field
that do not tolerate the uniform client string approach.
Thus, by default, our NFSv4.0 mounts will continue to use the current
approach, and we introduce a mount option that switches them to use
the uniform model. Client administrators must identify which servers
can be mounted with this option. Eventually most NFSv4.0 servers will
be able to handle the uniform approach, and we can change the default.
The first mount of a server controls the behavior for all subsequent
mounts for the lifetime of that set of mounts of that server. After
the last mount of that server is gone, the client erases the data
structure that tracks the lease. A subsequent lease may then honor
a different "migration" setting.
This patch adds only the infrastructure for parsing the new mount
option. Support for uniform client strings is added in a subsequent
patch.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
An ULP is supposed to be able to replace a GSS rpc_auth object with
another GSS rpc_auth object using rpcauth_create(). However,
rpcauth_create() in 3.5 reliably fails with -EEXIST in this case.
This is because when gss_create() attempts to create the upcall pipes,
sometimes they are already there. For example if a pipe FS mount
event occurs, or a previous GSS flavor was in use for this rpc_clnt.
It turns out that's not the only problem here. While working on a
fix for the above problem, we noticed that replacing an rpc_clnt's
rpc_auth is not safe, since dereferencing the cl_auth field is not
protected in any way.
So we're deprecating the ability of rpcauth_create() to switch an
rpc_clnt's security flavor during normal operation. Instead, let's
add a fresh API that clones an rpc_clnt and gives the clone a new
flavor before it's used.
This makes immediate use of the new __rpc_clone_client() helper.
This can be used in a similar fashion to rpcauth_create() when a
client is hunting for the correct security flavor. Instead of
replacing an rpc_clnt's security flavor in a loop, the ULP replaces
the whole rpc_clnt.
To fix the -EEXIST problem, any ULP logic that relies on replacing
an rpc_clnt's rpc_auth with rpcauth_create() must be changed to use
this API instead.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
If the state manager thread is not actually able to fully recover from
some situation, it wakes up waiters, who kick off a new state manager
thread. Quite often the fresh invocation of the state manager is just
as successful.
This results in a livelock as the client dumps thousands of NFS
requests a second on the network in a vain attempt to recover. Not
very friendly.
To mitigate this situation, add a delay in the state manager after
an unhandled error, so that the client sends just a few requests
every second in this case.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
fs/nfs/super.c: In function ‘nfs_compare_remount_data’:
fs/nfs/super.c:2042:18: warning: comparison between signed and
unsigned integer expressions [-Wsign-compare]
fs/nfs/super.c:2043:18: warning: comparison between signed and
unsigned integer expressions [-Wsign-compare]
fs/nfs/super.c:2044:20: warning: comparison between signed and
unsigned integer expressions [-Wsign-compare]
fs/nfs/super.c:2046:21: warning: comparison between signed and
unsigned integer expressions [-Wsign-compare]
fs/nfs/super.c:2047:21: warning: comparison between signed and
unsigned integer expressions [-Wsign-compare]
fs/nfs/super.c:2048:21: warning: comparison between signed and
unsigned integer expressions [-Wsign-compare]
fs/nfs/super.c:2049:21: warning: comparison between signed and
unsigned integer expressions [-Wsign-compare]
fs/nfs/super.c:2050:18: warning: comparison between signed and
unsigned integer expressions [-Wsign-compare]
Seen with gcc (GCC) 4.6.3 20120306 (Red Hat 4.6.3-2).
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
NSM RPC client can be required on NFSv3 umount, when child reaper is dying
(and destroying it's mount namespace). It means, that current nsproxy is set
to NULL already, but creation of RPC client requires UTS namespace for gaining
hostname string.
This patch creates reference-counted per-net NSM client on first monitor
request and destroys it after last unmonitor request.
Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Taking hostname from uts namespace if not safe, because this cuold be
performind during umount operation on child reaper death. And in this case
current->nsproxy is NULL already.
Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Pull CIFS updates from Steve French:
"This patchset is the final section of the SMB2.1 support merge for
cifs.ko. It also includes improvements to the cifs socket handling
from Jeff, and also fixes a few cifs bug fixes. It adds SMB2 support
for file and inode operations as well as moves some existing cifs code
to use ops server struct of protocol specific callbacks.
Most of this code is SMB2 specific. When enabled SMB2.1 does pass
various functional tests including most of the connectathon test
suite, For SMB2.1, Connectathon test 4 and some related tests fail due
to not updating mode bits remotely (cifsacl support where mode bits
are approximated with the cifs acl is not enable for smb2), and test8
(symlink) support is not completed for SMB2 yet (note that we will
likely have a "Unix Extensions" eventually, at least for Samba, so in
the long run posix locks won't have to be emulated when mounting Linux
to Linux, but for most NAS and for Windows mounts posix lock emulation
will still used for SMB2 in a similar fashion as we do for cifs).
SMB2.1 dialect is supported. Although additional fixes to enable smb2
(the original smb2.02) dialect and to add various optional features of
the smb3 dialect are expected to be added in the future as testing
progresses, currently mounting with the "vers=2.1" is supported (in
order to mount using SMB2.1 to servers like Samba 4, and Windows 7,
Windows 2008R2)."
* 'for-linus' of git://git.samba.org/sfrench/cifs-2.6: (82 commits)
[CIFS] Fix indentation of fs/cifs/Kconfig entries
[CIFS] Fix SMB2 negotiation support to select only one dialect (based on vers=)
cifs: obtain file access during backup intent lookup (resend)
CIFS: Fix possible freed pointer dereference in CIFS_SessSetup
CIFS: Fix possible freed pointer dereference in SMB2_sess_setup
CIFS: Make ops->close return void
cifs: change DOS/NT/POSIX mapping of ERRnoresource
cifs: remove support for deprecated "forcedirectio" and "strictcache" mount options
cifs: remove support for CIFS_IOC_CHECKUMOUNT ioctl
CIFS: Fix possible memory leaks in SMB2 code
CIFS: Fix endian conversion of IndexNumber
Trivial endian fixes
MARK SMB2 support EXPERIMENTAL
Update cifs version number
cifs: add FL_CLOSE to fl_flags mask in cifs_read_flock
cifs: Mangle string used for unc in /proc/mounts
cifs: cleanups for cifs_mkdir_qinfo
CIFS: Fix fast lease break after open problem
CIFS: Add SMB2.1 lease break support
CIFS: Fix cache coherency for read oplock case
...
NSM RPC client can be required on NFSv3 umount, when child reaper is dying (and
destroying it's mount namespace). It means, that current nsproxy is set to
NULL already, but creation of RPC client requires UTS namespace for gaining
hostname string.
This patch introduces reference counted NFS RPC clients creation and
destruction helpers (similar to RPCBIND RPC clients).
Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
This patch also introduces refcount-aware nfs_callback_down_net() wrapper for
svc_shutdown_net().
Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Usage coutner now increased only is the service was started sccessfully.
Even if service is running already, then goto is not required anymore, because
service creation and start will be skipped.
With this patch code looks clearer.
Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
This is just a code move, which from my POW makes code looks better.
I.e. now on start we have 3 different stages:
1) Service creation.
2) Service per-net data allocation.
3) Service start.
Patch also renames goto label "out_err:" into "err_start:" to reflect new
changes.
Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
No need to assign transports backchannel server explicitly in
nfs41_callback_up() - there is nfs_callback_bc_serv() function for this.
By using it, nfs4_callback_up() and nfs41_callback_up() can be called without
transport argument.
Note: service have to be passed to nfs_callback_bc_serv() instead of callback,
since callback link can be uninitialized.
Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
v4:
1) Callback transport creation routine selection by version simlified.
This new function in now called before nfs_minorversion_callback_svc_setup()).
Also few small changes:
1) current network namespace in nfs_callback_up() was replaced by transport net.
2) svc_shutdown_net() was moved prior to callback usage counter decrement
(because in case of per-net data allocation faulure svc_shutdown_net() have to
be skipped).
Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
This function creates service if it's not exist, or increase usage counter of
the existent, and returns pointer to it.
Usage counter will be droppepd by svc_destroy() later in nfs_callback_up().
Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The OPEN operation has no way to differentiate an open for read and an
open for execution - both look like read to the server. This allowed
users to read files that didn't have READ access but did have EXEC access,
which is obviously wrong.
This patch adds an ACCESS call to the OPEN compound to handle the
difference between OPENs for reading and execution. Since we're going
through the trouble of calling ACCESS, we check all possible access bits
and cache the results hopefully avoiding an ACCESS call in the future.
Signed-off-by: Weston Andros Adamson <dros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
If we are creating an osd request and get an invalid layout, return
an EINVAL to the caller. We switch up the return to have an error
code instead of NULL implying -ENOMEM.
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
This will allocate memory that has already been zeroed, allowing us to
remove the memset later on.
Signed-off-by: Bryan Schumaker <bjchuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
I put the client into an open recovery loop by:
Client: Open file
read half
Server: Expire client (echo 0 > /sys/kernel/debug/nfsd/forget_clients)
Client: Drop vm cache (echo 3 > /proc/sys/vm/drop_caches)
finish reading file
This causes a loop because the client never updates the nfs4_state after
discovering that the delegation is invalid. This means it will keep
trying to read using the bad delegation rather than attempting to re-open
the file.
Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
CC: stable@vger.kernel.org [3.4+]
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
If we are reading through a delegation, and the delegation is OK then
state->stateid will still point to a delegation stateid and not an open
stateid.
Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
There are two main patches in this set, both related to the userland
dlm_controld daemon. The first fixes a deadlock between dlm_controld and
the dlm_send workqueue when both access configfs data simultaneously. The
second reworks some code to get around a long standing, but intentional,
unlock balance warning. The userland daemon no longer takes a lock that
is later released from the kernel. The other commits are minor fixes and
changes.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
iQIcBAABAgAGBQJQaeyUAAoJEDgbc8f8gGmqokAQALIt28LyaIAh9bfc0AWyLldM
t81sT2gDR/uF0r2sLHdevunCecUbxR8CN5i+pxbTf1OsDkrhYQDx7GtQvTUefc2M
/+JA/8UosZkrsTC0LvayWbsbC15c8epwxmERel/feinZ7dmgqNGBLaMPGJzZMv5/
KvHPRMg/JSJOpnwC64xZ097RsNkX5P8lwnK7U5ivs4vLCYR3ac3+V2+t/hxAMONE
dwt7EZ/ADCBMUkQDTgisJA/Xr4yub+tTxUMshwkX+ve8nzSdj8dPzMFfky0bSady
+x/r3GSVScQ//qqp8a01Jb2ZeFhKMfpTcPWve4chsngszsSzElJeMbw7numVuU98
7hCUbH5MeSuQ2Nx3qYfi1LhrV0687lbkNLlqET+3PX/r6m6T4SqucOmPv2SeSet9
rLGYNM1hJxx1rPASVtEsE/Xhr1agI58EO9tNb9p79wS4X5fcGqUsQXv4gRnu2/LN
GRUTADC7nFDXy6BItYDwsuQfuJjMaDIsZ49dEUXYTlzav/zYK4IpU3fGuG3z/UDZ
pQl+g0aK7rsq5hk7jxGfdHDBvhM+kcRMeQuTJqxJuJvLs0pmLWvFJXbyn+xh7y9Y
MZYSSI4L3j1s4u6SkGUsMRezfcfJEYQdOzlIs7HEaLG++yBgcScF70ZgJByxzmfr
eb9tpzwQvGLY4CvUVbEi
=Ghb3
-----END PGP SIGNATURE-----
Merge tag 'dlm-3.7' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm
Pull dlm updates from David Teigland:
"There are two main patches in this set, both related to the userland
dlm_controld daemon.
The first fixes a deadlock between dlm_controld and the dlm_send
workqueue when both access configfs data simultaneously.
The second reworks some code to get around a long standing, but
intentional, unlock balance warning. The userland daemon no longer
takes a lock that is later released from the kernel.
The other commits are minor fixes and changes."
* tag 'dlm-3.7' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm:
dlm: check the maximum size of a request from user
dlm: cleanup send_to_sock routine
dlm: convert add_sock routine return value type to void
dlm: remove redundant variable assignments
dlm: fix unlock balance warnings
dlm: fix uninitialized spinlock
dlm: fix deadlock between dlm_send and dlm_controld
Convert cpu_to_le32(le32_to_cpu(E1) + E2) to use le32_add_cpu().
dpatch engine is used to auto generate this patch.
(https://github.com/weiyj/dpatch)
Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: Sage Weil <sage@inktank.com>
A recent change to /sbin/mountall causes any trailing '/' character
in the "device" (or fs_spec) field in /etc/fstab to be stripped. As
a result, an entry for a ceph mount that intends to mount the root
of the name space ends up with now path portion, and the ceph mount
option processing code rejects this.
That is, an entry in /etc/fstab like:
cephserver:port:/ /mnt ceph defaults 0 0
provides to the ceph code just "cephserver:port:" as the "device,"
and that gets rejected.
Although this is a bug in /sbin/mountall, we can have the ceph mount
code support an empty/nonexistent path, interpreting it to mean the
root of the name space.
RFC 5952 offers recommendations for how to express IPv6 addresses,
and recommends the usage found in RFC 3986 (which specifies the
format for URI's) for representing both IPv4 and IPv6 addresses that
include port numbers. (See in particular the definition of
"authority" found in the Appendix of RFC 3986.)
According to those standards, no host specification will ever
contain a '/' character. As a result, it is sufficient to scan a
provided "device" from an /etc/fstab entry for the first '/'
character, and if it's found, treat that as the beginning of the
path. If no '/' character is present, we can treat the entire
string as the monitor host specification(s), and assume the path
to be the root of the name space. We'll still require a ':' to
separate the host portion from the (possibly empty) path portion.
This means that we can more formally define how ceph will interpret
the "device" it's provided when processing a mount request:
"device" will look like:
<server_spec>[,<server_spec>...]:[<path>]
where
<server_spec> is <ip>[:<port>]
<path> is optional, but if present must begin with '/'
This addresses http://tracker.newdream.net/issues/2919
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Dan Mick <dan.mick@inktank.com>
As we skipped the merge window for 3.6-rc1 for the tty tree, everything
is now settled down and working properly, so we are ready for 3.7-rc1.
Here's the patchset, it's big, but the large changes are removing a
firmware file and adding a staging tty driver (it depended on the tty
core changes, so it's going through this tree instead of the staging
tree.)
All of these patches have been in the linux-next tree for a while.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.19 (GNU/Linux)
iEYEABECAAYFAlBp36oACgkQMUfUDdst+yk4WgCdEy13hot8fI2Lqnc7W0LKu7GX
4p8AoLTjzrXhLosxdijskDQ9X1OtjrxU
=S5Ng
-----END PGP SIGNATURE-----
Merge tag 'tty-3.6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty
Pull TTY changes from Greg Kroah-Hartman:
"As we skipped the merge window for 3.6-rc1 for the tty tree,
everything is now settled down and working properly, so we are ready
for 3.7-rc1. Here's the patchset, it's big, but the large changes are
removing a firmware file and adding a staging tty driver (it depended
on the tty core changes, so it's going through this tree instead of
the staging tree.)
All of these patches have been in the linux-next tree for a while.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>"
Fix up more-or-less trivial conflicts in
- drivers/char/pcmcia/synclink_cs.c:
tty NULL dereference fix vs tty_port_cts_enabled() helper function
- drivers/staging/{Kconfig,Makefile}:
add-add conflict (dgrp driver added close to other staging drivers)
- drivers/staging/ipack/devices/ipoctal.c:
"split ipoctal_channel from iopctal" vs "TTY: use tty_port_register_device"
* tag 'tty-3.6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: (235 commits)
tty/serial: Add kgdb_nmi driver
tty/serial/amba-pl011: Quiesce interrupts in poll_get_char
tty/serial/amba-pl011: Implement poll_init callback
tty/serial/core: Introduce poll_init callback
kdb: Turn KGDB_KDB=n stubs into static inlines
kdb: Implement disable_nmi command
kernel/debug: Mask KGDB NMI upon entry
serial: pl011: handle corruption at high clock speeds
serial: sccnxp: Make 'default' choice in switch last
serial: sccnxp: Remove mask termios caps for SW flow control
serial: sccnxp: Report actual baudrate back to core
serial: samsung: Add poll_get_char & poll_put_char
Powerpc 8xx CPM_UART setting MAXIDL register proportionaly to baud rate
Powerpc 8xx CPM_UART maxidl should not depend on fifo size
Powerpc 8xx CPM_UART too many interrupts
Powerpc 8xx CPM_UART desynchronisation
serial: set correct baud_base for EXSYS EX-41092 Dual 16950
serial: omap: fix the reciever line error case
8250: blacklist Winbond CIR port
8250_pnp: do pnp probe before legacy probe
...
This reverts commit 0885ef5b56
After applying the above patch, the performance slowed down because the dirty
page flush can only be done by one task, so revert it.
The following is the test result of sysbench:
Before After
24MB/s 39MB/s
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Everybody is just making stuff up, and it's just used to see if we really do
need to alloc a chunk, and since we do this when we already know we really
do it's just a waste of space. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
So we have lots of places where we try to preallocate chunks in order to
make sure we have enough space as we make our allocations. This has
historically meant that we're constantly tweaking when we should allocate a
new chunk, and historically we have gotten this horribly wrong so we way
over allocate either metadata or data. To try and keep this from happening
we are going to make it so that the block group item insertion is done out
of band at the end of a transaction. This will allow us to create chunks
even if we are trying to make an allocation for the extent tree. With this
patch my enospc tests run faster (didn't expect this) and more efficiently
use the disk space (this is what I wanted). Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
For immutable bio vecs, I've been auditing and removing bi_idx
references. These were harmless, but removing them will make auditing
easier.
scrub_bio_end_io_worker() was open coding a bio_reset() - but this
doesn't appear to have been needed for anything as right after it does a
bio_put(), and perusing the code it doesn't appear anything else was
holding a reference to the bio.
The other use end_bio_extent_readpage() was just for a pr_debug() -
changed it to something that might be a bit more useful.
Signed-off-by: Kent Overstreet <koverstreet@google.com>
CC: Chris Mason <chris.mason@oracle.com>
CC: Stefan Behrens <sbehrens@giantdisaster.de>
When we wrote some data by compress mode into a btrfs filesystem which was full
of the fragments, the kernel will report:
BTRFS warning (device xxx): Aborting unused transaction.
The reason is:
We can not find a long enough free space to store the compressed data because
of the fragmentary free space, and the compressed data can not be splited,
so the kernel outputed the above message.
In fact, btrfs can deal with this problem very well: it fall back to
uncompressed IO, split the uncompressed data into small ones, and then
store them into to the fragmentary free space. So we shouldn't output the
above warning message.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Wade Cline reported a problem where he was getting garbage and warnings when
writing to a preallocated range via O_DIRECT. This is because we weren't
creating our normal pinned extent_map for the range we were writing to,
which was causing all sorts of issues. This patch fixes the problem and
makes his testcase much happier. Thanks,
Reported-by: Wade Cline <clinew@linux.vnet.ibm.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Sage reported the following lockdep backtrace
=====================================
[ BUG: bad unlock balance detected! ]
3.6.0-rc2-ceph-00171-gc7ed62d #1 Not tainted
-------------------------------------
btrfs-cleaner/7607 is trying to release lock (sb_internal) at:
[<ffffffffa00422ae>] btrfs_commit_transaction+0xa6e/0xb20 [btrfs]
but there are no more locks to release!
other info that might help us debug this:
1 lock held by btrfs-cleaner/7607:
#0: (&fs_info->cleaner_mutex){+.+...}, at: [<ffffffffa003b405>] cleaner_kthread+0x95/0x120 [btrfs]
stack backtrace:
Pid: 7607, comm: btrfs-cleaner Not tainted 3.6.0-rc2-ceph-00171-gc7ed62d #1
Call Trace:
[<ffffffffa00422ae>] ? btrfs_commit_transaction+0xa6e/0xb20 [btrfs]
[<ffffffff810afa9e>] print_unlock_inbalance_bug+0xfe/0x110
[<ffffffff810b289e>] lock_release_non_nested+0x1ee/0x310
[<ffffffff81172f9b>] ? kmem_cache_free+0x7b/0x160
[<ffffffffa004106c>] ? put_transaction+0x8c/0x130 [btrfs]
[<ffffffffa00422ae>] ? btrfs_commit_transaction+0xa6e/0xb20 [btrfs]
[<ffffffff810b2a95>] lock_release+0xd5/0x220
[<ffffffff81173071>] ? kmem_cache_free+0x151/0x160
[<ffffffff8117d9ed>] __sb_end_write+0x7d/0x90
[<ffffffffa00422ae>] btrfs_commit_transaction+0xa6e/0xb20 [btrfs]
[<ffffffff81079850>] ? __init_waitqueue_head+0x60/0x60
[<ffffffff81634c6b>] ? _raw_spin_unlock+0x2b/0x40
[<ffffffffa0042758>] __btrfs_end_transaction+0x368/0x3c0 [btrfs]
[<ffffffffa0042808>] btrfs_end_transaction_throttle+0x18/0x20 [btrfs]
[<ffffffffa00318f0>] btrfs_drop_snapshot+0x410/0x600 [btrfs]
[<ffffffff8132babd>] ? do_raw_spin_unlock+0x5d/0xb0
[<ffffffffa00430ef>] btrfs_clean_old_snapshots+0xaf/0x150 [btrfs]
[<ffffffffa003b405>] ? cleaner_kthread+0x95/0x120 [btrfs]
[<ffffffffa003b419>] cleaner_kthread+0xa9/0x120 [btrfs]
[<ffffffffa003b370>] ? btrfs_destroy_delayed_refs.isra.102+0x220/0x220 [btrfs]
[<ffffffff810791ee>] kthread+0xae/0xc0
[<ffffffff810b379d>] ? trace_hardirqs_on+0xd/0x10
[<ffffffff8163e744>] kernel_thread_helper+0x4/0x10
[<ffffffff81635430>] ? retint_restore_args+0x13/0x13
[<ffffffff81079140>] ? flush_kthread_work+0x1a0/0x1a0
[<ffffffff8163e740>] ? gs_change+0x13/0x13
This is because the throttle stuff can commit the transaction, which expects to
be the one stopping the intwrite stuff, but we've already done it in the
__btrfs_end_transaction. Moving the sb_end_intewrite after this logic makes the
lockdep go away. Thanks,
Tested-by: Sage Weil <sage@inktank.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
This is the change of the kernel side.
Translation of logical to inode used to have an upper limit 4k on
inode container's size, but the limit is not large enough for a data
with a great many of refs, so when resolving logical address,
we can end up with
"ioctl ret=0, bytes_left=0, bytes_missing=19944, cnt=510, missed=2493"
This changes to regard 64k as the upper limit and use vmalloc instead of
kmalloc to get memory more easily.
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
In logical resolve, we parse extent_from_logical()'s 'ret' as a kind of flag.
It is possible to lose our errors because
(-EXXXX & BTRFS_EXTENT_FLAG_TREE_BLOCK) is true.
I'm not sure if it is on purpose, it just looks too hacky if it is.
I'd rather use a real flag and a 'ret' to catch errors.
Acked-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Liu Bo <liub.liubo@gmail.com>
As ref cache has been removed from btrfs, there is no user on
its lock and its check.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
When we delete a inode, we will remove all the delayed items including delayed
inode update, and then truncate all the relative metadata. If there is lots of
metadata, we will end the current transaction, and start a new transaction to
truncate the left metadata. In this way, we will leave a inode item that its
link counter is > 0, and also may leave some directory index items in fs/file tree
after the current transaction ends. In other words, the metadata in this fs/file tree
is inconsistent. If we create a snapshot for this tree now, we will find a inode with
corrupted metadata in the new snapshot, and we won't continue to drop the left metadata,
because its link counter is not 0.
We fix this problem by updating the inode item before the current transaction ends.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
I noticed I was seeing large lags when running my torrent test in a vm on my
laptop. While trying to make it lag less I noticed that our overcommit math
was taking into account the number of bytes we wanted to reclaim, not the
number of bytes we actually wanted to allocate, which means we wouldn't
overcommit as often. This patch fixes the overcommit math and makes
shrink_delalloc() use that logic so that it will stop looping faster. We
still have pretty high spikes of latency, but the test now takes 3 minutes
less time (about 5% faster). Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Mitch reported a problem where you could get an ENOSPC error when untarring
a kernel git tree onto a 16gb file system with compress-force=zlib. This is
because compression is a huge pain, it will return from ->writepages()
without having actually created any ordered extents. To get around this we
check to see if the async submit counter is up, and if it is wait until it
drops to 0 before doing our normal ordered wait dance. With this patch I
can now untar a kernel git tree onto a 16gb file system without getting
ENOSPC errors. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
We're going to use this flag EXTENT_DEFRAG to indicate which range
belongs to defragment so that we can implement snapshow-aware defrag:
We set the EXTENT_DEFRAG flag when dirtying the extents that need
defragmented, so later on writeback thread can differentiate between
normal writeback and writeback started by defragmentation.
Original-Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
ulist_alloc() has the possibility of returning NULL.
So, it is necessary to check the return value.
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
When we ran fsstress(a program in xfstests), the filesystem hung up when it
is full. It was because the space reserved in btrfs_fallocate() was wrong,
btrfs_fallocate() just used the size of the pre-allocation to reserve the
space, didn't took the block size aligning into account, so the size of
the reserved space was less than the allocated space, it caused the over
reserve problem and made the filesystem hung up when invoking cow_file_range().
Fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Though we dump the stack information when aborting a unused transaction
handle, we don't know the correct place where we decide to abort the
transaction handle if one function has several place where the transaction
abort function is invoked and jumps to the same place after this call.
And beside that we also don't know the reason why we jump to abort
the current handle. So I modify the transaction abort function and make
it output the function name, line and error information.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
We forget to protect ->log_batch when syncing a file, this patch fix
this problem by atomic operation. And ->log_batch is used to check
if there are parallel sync operations or not, so it is unnecessary to
reset it to 0 after the sync operation of the current log tree complete.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
We should insert/update 6 items(root ref, root backref, dir item, dir index,
root item and parent inode) when creating a snapshot, not 5 items, fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
The snapshot should be the image of the fs tree before it was created,
so the metadata of the snapshot should not exist in the its tree. But now, we
found the directory item and directory name index is in both the snapshot tree
and the fs tree. It introduces some problems and makes the users feel strange:
# mkfs.btrfs /dev/sda1
# mount /dev/sda1 /mnt
# mkdir /mnt/1
# cd /mnt/1
# btrfs subvolume snapshot /mnt snap0
# ls -a /mnt/1/snap0/1
. .. [no other file/dir]
# ll /mnt/1/snap0/
total 0
drwxr-xr-x 1 root root 10 Ju1 24 12:11 1
^^^
There is no file/dir in it, but it's size is 10
# cd /mnt/1/snap0/1/snap0
[Enter a unexisted directory successfully...]
There is nothing in the directory 1 in snap0, but btrfs told the length of
this directory is 10. Beside that, we can enter an unexisted directory, it is
very strange to the users.
# btrfs subvolume snapshot /mnt/1/snap0 /mnt/snap1
# ll /mnt/1/snap0/1/
total 0
[None]
# ll /mnt/snap1/1/
total 0
drwxr-xr-x 1 root root 0 Ju1 24 12:14 snap0
And the source of snap1 did have any directory in Directory 1, but snap1 have
a snap0, it is different between the source and the snapshot.
So I think we should insert directory item and directory name index and update
the parent inode as the last step of snapshot creation, and do not leave the
useless metadata in the file tree.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Sometimes we need choose the method of the reservation according to the type
of the block reservation, such as the reservation for the delayed inode update.
Now we identify the type just by comparing the address of the reservation
variants, it is very ugly if it is a temporary one because we need compare it
with all the common reservation variants. So we add a new "type" field to keep
the type the reservation variants.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
The ordered extent allocation is in the fast path of the IO, so use a slab
to improve the speed of the allocation.
"Size of the struct is 280, so this will fall into the size-512 bucket,
giving 8 objects per page, while own slab will pack 14 objects into a page.
Another benefit I see is to check for leaked objects when the module is
removed (and the cache destroy takes place)."
-- David Sterba
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
If a snapshot is created while we are writing some data into the file,
the i_size of the corresponding file in the snapshot will be wrong, it will
be beyond the end of the last file extent. And btrfsck will report:
root 256 inode 257 errors 100
Steps to reproduce:
# mkfs.btrfs <partition>
# mount <partition> <mnt>
# cd <mnt>
# dd if=/dev/zero of=tmpfile bs=4M count=1024 &
# for ((i=0; i<4; i++))
> do
> btrfs sub snap . $i
> done
This because the algorithm of disk_i_size update is wrong. Though there are
some ordered extents behind the current one which we use to update disk_i_size,
it doesn't mean those extents will be dealt with in the same transaction. So
We shouldn't use the offset of those extents to update disk_i_size. Or we will
get the wrong i_size in the snapshot.
We fix this problem by recording the max real i_size. If we find there is a
ordered extent which is in front of the current one and doesn't complete, we
will record the end of the current one into that ordered extent. Surely, if
the current extent holds the end of other extent(it must be greater than
the current one because it is behind the current one), we will record the
number that the current extent holds. In this way, we can exclude the ordered
extents that may not be dealth with in the same transaction, and be easy to
know the real disk_i_size.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
If we create several snapshots at the same time, the following BUG_ON() will be
triggered.
kernel BUG at fs/btrfs/extent-tree.c:6047!
Steps to reproduce:
# mkfs.btrfs <partition>
# mount <partition> <mnt>
# cd <mnt>
# for ((i=0;i<2400;i++)); do touch long_name_to_make_tree_more_deep$i; done
# for ((i=0; i<4; i++))
> do
> mkdir $i
> for ((j=0; j<200; j++))
> do
> btrfs sub snap . $i/$j
> done &
> done
The reason is:
Before transaction commit, some operations changed the fs tree and new tree
blocks were allocated because of COW. We used the implicit non-shared back
reference for those newly allocated tree blocks because they were not shared by
two or more trees.
And then we created the first snapshot for the fs tree, according to the back
reference rules, we also used implicit back refs for the child tree blocks of
the root node of the fs tree, now those child nodes/leaves were shared by two
trees.
Then We didn't deal with the delayed references, and continued to change the fs
tree(created the second snapshot and inserted the dir item of the new snapshot
into the fs tree). According to the rules of the back reference, we added full
back refs for those tree blocks whose parents have be shared by two trees.
Now some newly allocated tree blocks had two types of the references.
As we know, the delayed reference system handles these delayed references from
back to front, and the full delayed reference is inserted after the implicit
ones. So when we dealt with the back references of those newly allocated tree
blocks, the full references was dealt with at first. And if the first reference
is a shared back reference and the tree block that the reference points to is
newly allocated, It would be considered as a tree block which is shared by two
or more trees when it is allocated and should be a full back reference not a
implicit one, the flag of its reference also should be set to FULL_BACKREF.
But in fact, it was a non-shared tree block with a implicit reference at
beginning, so it was not compulsory to set the flags to FULL_BACKREF. So BUG_ON
was triggered.
We have several methods to fix this bug:
1. deal with delayed references after the snapshot is created and before we
change the source tree of the snapshot. This is the easiest and safest way.
2. modify the sort method of the delayed reference tree, make the full delayed
references be inserted before the implicit ones. It is also very easy, but
I don't know if it will introduce some problems or not.
3. modify select_delayed_ref() and make it select the implicit delayed reference
at first. This way is not so good because it may wastes CPU time if we have
lots of delayed references.
4. set the flags to FULL_BACKREF, this method is a little complex comparing with
the 1st way.
I chose the 1st way to fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
This patch fixes the following problem:
- If we failed to deal with the delayed dir items, we should abort transaction,
just as its comment said. Fix it.
- If root reference or root back reference insertion failed, we should
abort transaction. Fix it.
- Fix the double free problem of pending->inherit.
- Do not restore the trans->rsv if we doesn't change it.
- make the error path more clearly.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
bbio has been malloced in btrfs_map_block() and should be
freed before leaving from the error handling cases.
spatch with a semantic match is used to found this problem.
(http://coccinelle.lip6.fr/)
Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
I noticed this when I was doing the fsync stuff, we allocate split extents if we
drop an extent range that is in the middle of an existing extent. This BUG()'s
if we fail to allocate memory, but the fact is this is just a cache, we will
just regenerate the cache if we need it, the important part is that we free the
range we are given. This can be done without allocations, so if we fail to
allocate splits just skip the splitting stage and free our em and look for more
extents to drop. This also makes btrfs_drop_extent_cache a void since nobody
was checking the return value anyway. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
The freeze rwsem is taken by sb_start_intwrite() and dropped during the
commit_ or end_transaction(). In the async case, that happens in a worker
thread. Tell lockdep the calling thread is releasing ownership of the
rwsem and the async thread is picking it up.
XFS plays the same trick in fs/xfs/xfs_aops.c.
Signed-off-by: Sage Weil <sage@inktank.com>
I audited all users of btrfs_drop_extents and found that nobody actually uses
the hint_byte argument. I'm sure it was used for something at some point but
it's not used now, and the way the pinning works the disk bytenr would never be
immediately useful anyway so lets just remove it. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
This is based on Josef's "Btrfs: turbo charge fsync".
If an inode is a BTRFS_INODE_NODATASUM one, we don't need to look for csum
items any more.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
This is based on Josef's "Btrfs: turbo charge fsync".
The current btrfs checks if an inode is in log by comparing
root's last_log_commit to inode's last_sub_trans[2].
But the problem is that this root->last_log_commit is shared among
inodes.
Say we have N inodes to be logged, after the first inode,
root's last_log_commit is updated and the N-1 remained files will
be skipped.
This fixes the bug by keeping a local copy of root's last_log_commit
inside each inode and this local copy will be maintained itself.
[1]: we regard each log transaction as a subset of btrfs's transaction,
i.e. sub_trans
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
This is based on Josef's "Btrfs: turbo charge fsync".
The above Josef's patch performs very good in random sync write test,
because we won't have too much extents to merge.
However, it does not performs good on the test:
dd if=/dev/zero of=foobar bs=4k count=12500 oflag=sync
The reason is when we do sequencial sync write, we need to merge the
current extent just with the previous one, so that we can get accumulated
extents to log:
A(4k) --> AA(8k) --> AAA(12k) --> AAAA(16k) ...
So we'll have to flush more and more checksum into log tree, which is the
bottleneck according to my tests.
But we can avoid this by telling fsync the real extents that are needed
to be logged.
With this, I did the above dd sync write test (size=50m),
w/o (orig) w/ (josef's) w/ (this)
SATA 104KB/s 109KB/s 121KB/s
ramdisk 1.5MB/s 1.5MB/s 10.7MB/s (613%)
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
We will stop and restart a transaction every time we move to a different leaf
when truncating a file. This is for enospc reasons, but really we could
probably get away with doing this a little better by actually working until we
hit an ENOSPC. So add a ->failfast flag to the block_rsv and set it when we do
truncates which will fail as soon as the block rsv runs out of space, and then
at that point we can stop and restart the transaction and refill the block rsv
and carry on. This will make rm'ing of a file with lots of extents a bit
faster. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
This is based on Josef's "Btrfs: turbo charge fsync".
We should cleanup those extents after we've finished logging inode,
otherwise we may do redundant work on them.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
I hit this a couple times while working on my fsync patch (all my bugs, not
normal operation), but with my new stuff we could have new errors from cases
I have not encountered, so instead of BUG()'ing we should be WARN()'ing so
that we are notified there is a problem but the user doesn't lose their
data. We can easily commit the transaction in the case that the tree
logging fails and still be fine, so let's try and be as nice to the user as
possible. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
At least for the vm workload. Currently on fsync we will
1) Truncate all items in the log tree for the given inode if they exist
and
2) Copy all items for a given inode into the log
The problem with this is that for things like VMs you can have lots of
extents from the fragmented writing behavior, and worst yet you may have
only modified a few extents, not the entire thing. This patch fixes this
problem by tracking which transid modified our extent, and then when we do
the tree logging we find all of the extents we've modified in our current
transaction, sort them and commit them. We also only truncate up to the
xattrs of the inode and copy that stuff in normally, and then just drop any
extents in the range we have that exist in the log already. Here are some
numbers of a 50 meg fio job that does random writes and fsync()s after every
write
Original Patched
SATA drive 82KB/s 140KB/s
Fusion drive 431KB/s 2532KB/s
So around 2-6 times faster depending on your hardware. There are a few
corner cases, for example if you truncate at all we have to do it the old
way since there is no way to be sure what is in the log is ok. This
probably could be done smarter, but if you write-fsync-truncate-write-fsync
you deserve what you get. All this work is in RAM of course so if your
inode gets evicted from cache and you read it in and fsync it we'll do it
the slow way if we are still in the same transaction that we last modified
the inode in.
The biggest cool part of this is that it requires no changes to the recovery
code, so if you fsync with this patch and crash and load an old kernel, it
will run the recovery and be a-ok. I have tested this pretty thoroughly
with an fsync tester and everything comes back fine, as well as xfstests.
Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
While working on my fsync patch my fsync tester kept hitting mismatching
md5sums when I would randomly write to a prealloc'ed region, syncfs() and
then write to the prealloced region some more and then fsync() and then
immediately reboot. This is because the tree logging code will skip writing
csums for file extents who's generation is less than the current running
transaction. When we mark extents as written we haven't been updating their
generation so they were always being skipped. This wouldn't happen if you
were to preallocate and then write in the same transaction, but if you for
example prealloced a VM you could definitely run into this problem. This
patch makes my fsync tester happy again. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Swinging this pendulum back the other way. We've been allocating chunks up
to 2% of the disk no matter how much we actually have allocated. So instead
fix this calculation to only allocate chunks if we have more than 80% of the
space available allocated. Please test this as it will likely cause all
sorts of ENOSPC problems to pop up suddenly. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
There is a completely impossible situation to hit where you can preallocate
a file, fsync it, write into the preallocated region, have the transaction
commit twice and then fsync and then immediately lose power and lose all of
the contents of the write. This patch fixes this just so I feel better
about the situation and because it is lightweight, we just update the
last_trans when we finish an ordered IO and we don't update the inode
itself. This way we are completely safe and I feel better. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
The btrfs send code was assuming the offset of the file item into the
extent translated to bytes on disk. If we're compressed, this isn't
true, and so it was off into extents owned by other files.
It was also improperly handling inline extents. This solves a crash
where we may have gone past the end of the file extent item by not
testing early enough for an inline extent. It also solves problems
where we have a whole between the end of the inline item and the start
of the full extent.
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
We can't do the deleted/reused logic for top/root inodes as it would
create a stream that tries to delete and recreate the root dir.
Reported-by: Alex Lyakas <alex.bolshoy.btrfs@gmail.com>
Signed-off-by: Alexander Block <ablock84@googlemail.com>
We have to ignore inode/space cache objects in send/receive.
Reported-by: Alex Lyakas <alex.bolshoy.btrfs@gmail.com>
Signed-off-by: Alexander Block <ablock84@googlemail.com>
We need to pass the root that we determined earlier to iterate_inode_ref.
Reported-by: Alex Lyakas <alex.bolshoy.btrfs@gmail.com>
Signed-off-by: Alexander Block <ablock84@googlemail.com>
The previous check was working fine, but this check should be
easier to read. Also, we could theoritically have some exotic
bugs with the previous checks.
Signed-off-by: Alexander Block <ablock84@googlemail.com>
A leftover from older code and unused now.
Reported-by: Alex Lyakas <alex.bolshoy.btrfs@gmail.com>
Signed-off-by: Alexander Block <ablock84@googlemail.com>
Updating send_progress in process_recorded_refs was not correct.
It got updated too early in the cur_inode_new_gen case.
Reported-by: Alex Lyakas <alex.bolshoy.btrfs@gmail.com>
Reported-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Alexander Block <ablock84@googlemail.com>
Btrfs send/receive uses the aux field to store inode numbers. On
32 bit machines this may become a problem.
Also fix all users of ulist_add and ulist_add_merged.
Reported-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Alexander Block <ablock84@googlemail.com>
We can't easily use the index of the radix tree for inums as the
radix tree uses 32bit indexes on 32bit kernels. For 32bit kernels,
we now use the lower 32bit of the inum as index and an additional
list to store multiple entries per radix tree entry.
Reported-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Alexander Block <ablock84@googlemail.com>
When everything is done, name_cache_free is called which however
forgot to call kfree on the cache entries.
Signed-off-by: Alexander Block <ablock84@googlemail.com>
If we break, we may miss the clone from send_root which we prefer
over all other clones.
Commit is a result of Arne's review.
Reported-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Alexander Block <ablock84@googlemail.com>
Don't have a seperate return path for the mentioned case. Now
we do the same "take lowest inode/offset" logic for all found clones.
Commit is a result of Arne's review.
Signed-off-by: Alexander Block <ablock84@googlemail.com>
Make sure to never get in trouble due to the backref_ctx
which was on the stack before.
Commit is a result of Arne's review.
Signed-off-by: Alexander Block <ablock84@googlemail.com>
We only added the parent for the new position of a moved dir.
We also need to add the old parent of the moved dir.
Reported-by: Alex Lyakas <alex.bolshoy.btrfs@gmail.com>
Signed-off-by: Alexander Block <ablock84@googlemail.com>
fs_path_remove is not used at the moment due to a previous patch.
Remove it for now (with #if 0) to avoid compile warnings.
Signed-off-by: Alexander Block <ablock84@googlemail.com>
We missed that check which resultet in all refs with the same name
being reported as first_ref.
Reported-by: Alex Lyakas <alex.bolshoy.btrfs@gmail.com>
Signed-off-by: Alexander Block <ablock84@googlemail.com>
When the current inodes inum is smaller then the inum of the
parent directory strange things were happending due to wrong
path resolution and other bugs. Fix this with a new approach
for the problem.
Reported-by: Alex Lyakas <alex.bolshoy.btrfs@gmail.com>
Signed-off-by: Alexander Block <ablock84@googlemail.com>
Here is the big driver core update for 3.7-rc1.
A number of firmware_class.c updates (as you saw a month or so ago), and
some hyper-v updates and some printk fixes as well. All patches that
are outside of the drivers/base area have been acked by the respective
maintainers, and have all been in the linux-next tree for a while.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.19 (GNU/Linux)
iEYEABECAAYFAlBp3vkACgkQMUfUDdst+ylQoACgldktGFgkCLzH+rGYthrXOC5P
9hUAnjmOhdoHlMTL81vWTlH+BrGernym
=khrr
-----END PGP SIGNATURE-----
Merge tag 'driver-core-3.6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core merge from Greg Kroah-Hartman:
"Here is the big driver core update for 3.7-rc1.
A number of firmware_class.c updates (as you saw a month or so ago),
and some hyper-v updates and some printk fixes as well. All patches
that are outside of the drivers/base area have been acked by the
respective maintainers, and have all been in the linux-next tree for a
while.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>"
* tag 'driver-core-3.6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (95 commits)
memory: tegra{20,30}-mc: Fix reading incorrect register in mc_readl()
device.h: Add missing inline to #ifndef CONFIG_PRINTK dev_vprintk_emit
memory: emif: Add ifdef CONFIG_DEBUG_FS guard for emif_debugfs_[init|exit]
Documentation: Fixes some translation error in Documentation/zh_CN/gpio.txt
Documentation: Remove 3 byte redundant code at the head of the Documentation/zh_CN/arm/booting
Documentation: Chinese translation of Documentation/video4linux/omap3isp.txt
device and dynamic_debug: Use dev_vprintk_emit and dev_printk_emit
dev: Add dev_vprintk_emit and dev_printk_emit
netdev_printk/netif_printk: Remove a superfluous logging colon
netdev_printk/dynamic_netdev_dbg: Directly call printk_emit
dev_dbg/dynamic_debug: Update to use printk_emit, optimize stack
driver-core: Shut up dev_dbg_reatelimited() without DEBUG
tools/hv: Parse /etc/os-release
tools/hv: Check for read/write errors
tools/hv: Fix exit() error code
tools/hv: Fix file handle leak
Tools: hv: Implement the KVP verb - KVP_OP_GET_IP_INFO
Tools: hv: Rename the function kvp_get_ip_address()
Tools: hv: Implement the KVP verb - KVP_OP_SET_IP_INFO
Tools: hv: Add an example script to configure an interface
...
Features currently supported:
- 39-bit address space for user and kernel (each)
- 4KB and 64KB page configurations
- Compat (32-bit) user applications (ARMv7, EABI only)
- Flattened Device Tree (mandated for all AArch64 platforms)
- ARM generic timers
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (GNU/Linux)
iQIcBAABAgAGBQJQabRiAAoJEGvWsS0AyF7xXgcQAK+FTXt0ikdQYMkV5AIZXb9i
xHRhuiZWx2vKyk0mCqpyGLY58GSmSb6uTBg/2P2Ej7vXdH/RB2goPzjlspfjkDL4
o8RJp7eQ07Uz3KRDYEJgMP8xKZid6KFG93RJ6TjjpKZLuDBdwiG1GP1vb0jVcWfo
ttZrj/aI8lMcqrh3Vq5qefP7GWP1OVATqeaGTiT7oo38pXwF3t237xfBr2iDGFBp
ZgIRddrxpa7JYUesfJDDDdGHvLq7Vh2jJV+io9qasBZDrtppGJIhZ0vUni2DgIi7
r4i1LcynDN4JaG0maZ4U/YQm74TCD4BqxV8GJ7zwLPTWeN+of+skjhPSLOkA+0fp
I+sWjXlv200gDfJZ9qnUld2kFpoDfJi2b7fNDouSDd2OhmVOVWG3jnVP4Z7meVSb
O8BYzWDdsAiabuwciUY3OsmW6424lT93b2v86Vncs4unKMvEjOPxYZbUxhqX8f2j
gsmWwwD/yS4THx2B6OyW9VT3I5J6miqs2Glt/GG6vPWT5AKQJn9jCxKaBGhPMPIs
xe5/GycBYjdk/Y8qRjegxFbEqzQuiRzmkeFn5jwjmBLqpGNbZDpvMaL6adhAKM5/
v6UIKa91ra4fC9N0h6G61pOc9N9DbT8wPbCbdYY0RMTMRuLDZDgAM3Bvz0r2APdD
96leNy6vx684hbkCSLJs
=buJB
-----END PGP SIGNATURE-----
Merge tag 'arm64-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/cmarinas/linux-aarch64
Pull arm64 support from Catalin Marinas:
"Linux support for the 64-bit ARM architecture (AArch64)
Features currently supported:
- 39-bit address space for user and kernel (each)
- 4KB and 64KB page configurations
- Compat (32-bit) user applications (ARMv7, EABI only)
- Flattened Device Tree (mandated for all AArch64 platforms)
- ARM generic timers"
* tag 'arm64-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/cmarinas/linux-aarch64: (35 commits)
arm64: ptrace: remove obsolete ptrace request numbers from user headers
arm64: Do not set the SMP/nAMP processor bit
arm64: MAINTAINERS update
arm64: Build infrastructure
arm64: Miscellaneous header files
arm64: Generic timers support
arm64: Loadable modules
arm64: Miscellaneous library functions
arm64: Performance counters support
arm64: Add support for /proc/sys/debug/exception-trace
arm64: Debugging support
arm64: Floating point and SIMD
arm64: 32-bit (compat) applications support
arm64: User access library functions
arm64: Signal handling support
arm64: VDSO support
arm64: System calls handling
arm64: ELF definitions
arm64: SMP support
arm64: DMA mapping API
...
make menuconfig for cifs shows multiple entries toward
the end of the list with the incorrect indentation
(probably a bug in Kconfig parsing of items
that are dependant on the module (cifs=m instead of
just CONFIG_CIFS). This patch fixes the indentation
of all but the last entry (CIFS_ACL) which I don't
know how to fix. It also clarifies wording in
two places
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Based on whether the user (on mount command) chooses:
vers=3.0 (for smb3.0 support)
vers=2.1 (for smb2.1 support)
or (with subsequent patch, which will allow SMB2 support)
vers=2.0 (for original smb2.02 dialect support)
send only one dialect at a time during negotiate (we
had been sending a list).
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Pull the trivial tree from Jiri Kosina:
"Tiny usual fixes all over the place"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (34 commits)
doc: fix old config name of kprobetrace
fs/fs-writeback.c: cleanup riteback_sb_inodes kerneldoc
btrfs: fix the commment for the action flags in delayed-ref.h
btrfs: fix trivial typo for the comment of BTRFS_FREE_INO_OBJECTID
vfs: fix kerneldoc for generic_fh_to_parent()
treewide: fix comment/printk/variable typos
ipr: fix small coding style issues
doc: fix broken utf8 encoding
nfs: comment fix
platform/x86: fix asus_laptop.wled_type module parameter
mfd: printk/comment fixes
doc: getdelays.c: remember to close() socket on error in create_nl_socket()
doc: aliasing-test: close fd on write error
mmc: fix comment typos
dma: fix comments
spi: fix comment/printk typos in spi
Coccinelle: fix typo in memdup_user.cocci
tmiofb: missing NULL pointer checks
tools: perf: Fix typo in tools/perf
tools/testing: fix comment / output typos
...
Pull GFS2 updates from Steven Whitehouse:
"The major feature this time is the "rbm" conversion in the resource
group code. The new struct gfs2_rbm specifies the location of an
allocatable block in (resource group, bitmap, offset) form. There are
a number of added helper functions, and later patches then rewrite
some of the resource group code in terms of this new structure. Not
only does this give us a nice code clean up, but it also removes some
of the previous restrictions where extents could not cross bitmap
boundaries, for example.
In addition to that, there are a few bug fixes and clean ups, but the
rbm work is by far the majority of this patch set in terms of number
of changed lines."
* git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-nmw: (27 commits)
GFS2: Write out dirty inode metadata in delayed deletes
GFS2: fix s_writers.counter imbalance in gfs2_ail_empty_gl
GFS2: Fix infinite loop in rbm_find
GFS2: Consolidate free block searching functions
GFS2: Get rid of I_MUTEX_QUOTA usage
GFS2: Stop block extents at the end of bitmaps
GFS2: Fix unclaimed_blocks() wrapping bug and clean up
GFS2: Improve block reservation tracing
GFS2: Fall back to ignoring reservations, if there are no other blocks left
GFS2: Fix ->show_options() for statfs slow
GFS2: Use rbm for gfs2_setbit()
GFS2: Use rbm for gfs2_testbit()
GFS2: Eliminate unnecessary check for state > 3 in bitfit
GFS2: Eliminate redundant calls to may_grant
GFS2: Combine functions gfs2_glock_dq_wait and wait_on_demote
GFS2: Combine functions gfs2_glock_wait and wait_on_holder
GFS2: inline __gfs2_glock_schedule_for_reclaim
GFS2: change function gfs2_direct_IO to use a normal gfs2_glock_dq
GFS2: rbm code cleanup
GFS2: Fix case where reservation finished at end of rgrp
...
Commits 5e8830dc85 and 41c4d25f78 introduced a regression into
v3.6-rc1 for ext4 in nodealloc mode, such that mtime updates would not
take place for files modified via mmap if the page was already in the
page cache. This would also affect ext3 file systems mounted using
the ext4 file system driver.
The problem was that ext4_page_mkwrite() had a shortcut which would
avoid calling __block_page_mkwrite() under some circumstances, and the
above two commit transferred the responsibility of calling
file_update_time() to __block_page_mkwrite --- which woudln't get
called in some circumstances.
Since __block_page_mkwrite() only has three callers,
block_page_mkwrite(), ext4_page_mkwrite, and nilfs_page_mkwrite(), the
best way to solve this is to move the responsibility for calling
file_update_time() to its caller.
This problem was found via xfstests #215 with a file system mounted
with -o nodelalloc.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp>
Cc: stable@vger.kernel.org
Inode is allowed to have empty leaf only if it this is blockless inode.
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
punch_hole is the place where we have to wait for all existing writers
(writeback, aio, dio), but currently we simply flush pended end_io request
which is not sufficient. Other issue is that punch_hole performed w/o i_mutex
held which obviously result in dangerous data corruption due to
write-after-free.
This patch performs following changes:
- Guard punch_hole with i_mutex
- Recheck inode flags under i_mutex
- Block all new dio readers in order to prevent information leak caused by
read-after-free pattern.
- punch_hole now wait for all writers in flight
NOTE: XXX write-after-free race is still possible because new dirty pages
may appear due to mmap(), and currently there is no easy way to stop
writeback while punch_hole is in progress.
[ Fixed error return from ext4_ext_punch_hole() to make sure that we
release i_mutex before returning EPERM or ETXTBUSY -- Ted ]
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Selected by __ARCH_WANT_SYS_EXECVE in unistd.h. Requires
* working current_pt_regs()
* *NOT* doing a syscall-in-kernel kind of kernel_execve()
implementation. Using generic kernel_execve() is fine.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
based mostly on arm and alpha versions. Architectures can define
__ARCH_WANT_KERNEL_EXECVE and use it, provided that
* they have working current_pt_regs(), even for kernel threads.
* kernel_thread-spawned threads do have space for pt_regs
in the normal location. Normally that's as simple as switching to
generic kernel_thread() and making sure that kernel threads do *not*
go through return from syscall path; call the payload from equivalent
of ret_from_fork if we are in a kernel thread (or just have separate
ret_from_kernel_thread and make copy_thread() use it instead of
ret_from_fork in kernel thread case).
* they have ret_from_kernel_execve(); it is called after
successful do_execve() done by kernel_execve() and gets normal
pt_regs location passed to it as argument. It's essentially
a longjmp() analog - it should set sp, etc. to the situation
expected at the return for syscall and go there. Eventually
the need for that sucker will disappear, but that'll take some
surgery on kernel_thread() payloads.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
IBM reported a deadlock in select_parent(). This was found to be caused
by taking rename_lock when already locked when restarting the tree
traversal.
There are two cases when the traversal needs to be restarted:
1) concurrent d_move(); this can only happen when not already locked,
since taking rename_lock protects against concurrent d_move().
2) racing with final d_put() on child just at the moment of ascending
to parent; rename_lock doesn't protect against this rare race, so it
can happen when already locked.
Because of case 2, we need to be able to handle restarting the traversal
when rename_lock is already held. This patch fixes all three callers of
try_to_ascend().
IBM reported that the deadlock is gone with this patch.
[ I rewrote the patch to be smaller and just do the "goto again" if the
lock was already held, but credit goes to Miklos for the real work.
- Linus ]
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
JFFS2 was designed without thought for OOB bitflips, it seems, but they
can occur and will be reported to JFFS2 via mtd_read_oob()[1]. We don't
want to fail on these transactions, since the data was corrected.
[1] Few drivers report bitflips for OOB-only transactions. With such
drivers, this patch should have no effect.
Signed-off-by: Brian Norris <computersforpeace@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
This patch fixes regression introduced by
"8bdc81c jffs2: get rid of jffs2_sync_super". We submit a delayed work in order
to make sure the write-buffer is synchronized at some point. But we do not
flush it when we unmount, which causes an oops when we unmount the file-system
and then the delayed work is executed.
This patch fixes the issue by adding a "cancel_delayed_work_sync()" infocation
in the '->sync_fs()' handler. This will make sure the delayed work is canceled
on sync, unmount and re-mount. And because VFS always callse 'sync_fs()' before
unmounting or remounting, this fixes the issue.
Reported-by: Ludovic Desroches <ludovic.desroches@atmel.com>
Cc: stable@vger.kernel.org [3.5+]
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Tested-by: Ludovic Desroches <ludovic.desroches@atmel.com>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Jan Kara have spotted interesting issue:
There are potential data corruption issue with direct IO overwrites
racing with truncate:
Like:
dio write truncate_task
->ext4_ext_direct_IO
->overwrite == 1
->down_read(&EXT4_I(inode)->i_data_sem);
->mutex_unlock(&inode->i_mutex);
->ext4_setattr()
->inode_dio_wait()
->truncate_setsize()
->ext4_truncate()
->down_write(&EXT4_I(inode)->i_data_sem);
->__blockdev_direct_IO
->ext4_get_block
->submit_io()
->up_read(&EXT4_I(inode)->i_data_sem);
# truncate data blocks, allocate them to
# other inode - bad stuff happens because
# dio is still in flight.
In order to serialize with truncate dio worker should grab extra i_dio_count
reference before drop i_mutex.
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
If we have enough aggressive DIO readers, truncate and other dio
waiters will wait forever inside inode_dio_wait(). It is reasonable
to disable nonlock DIO read optimization during truncate.
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Current serialization will works only for DIO which holds
i_mutex, but nonlocked DIO following race is possible:
dio_nolock_read_task truncate_task
->ext4_setattr()
->inode_dio_wait()
->ext4_ext_direct_IO
->ext4_ind_direct_IO
->__blockdev_direct_IO
->ext4_get_block
->truncate_setsize()
->ext4_truncate()
#alloc truncated blocks
#to other inode
->submit_io()
#INFORMATION LEAK
In order to serialize with unlocked DIO reads we have to
rearrange wait sequence
1) update i_size first
2) if i_size about to be reduced wait for outstanding DIO requests
3) and only after that truncate inode blocks
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Inode's block defrag and ext4_change_inode_journal_flag() may
affect nonlocked DIO reads result, so proper synchronization
required.
- Add missed inode_dio_wait() calls where appropriate
- Check inode state under extra i_dio_count reference.
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Current unwritten extent conversion state-machine is very fuzzy.
- For unknown reason it performs conversion under i_mutex. What for?
My diagnosis:
We already protect extent tree with i_data_sem, truncate and punch_hole
should wait for DIO, so the only data we have to protect is end_io->flags
modification, but only flush_completed_IO and end_io_work modified this
flags and we can serialize them via i_completed_io_lock.
Currently all these games with mutex_trylock result in the following deadlock
truncate: kworker:
ext4_setattr ext4_end_io_work
mutex_lock(i_mutex)
inode_dio_wait(inode) ->BLOCK
DEADLOCK<- mutex_trylock()
inode_dio_done()
#TEST_CASE1_BEGIN
MNT=/mnt_scrach
unlink $MNT/file
fallocate -l $((1024*1024*1024)) $MNT/file
aio-stress -I 100000 -O -s 100m -n -t 1 -c 10 -o 2 -o 3 $MNT/file
sleep 2
truncate -s 0 $MNT/file
#TEST_CASE1_END
Or use 286's xfstests https://github.com/dmonakhov/xfstests/blob/devel/286
This patch makes state machine simple and clean:
(1) xxx_end_io schedule final extent conversion simply by calling
ext4_add_complete_io(), which append it to ei->i_completed_io_list
NOTE1: because of (2A) work should be queued only if
->i_completed_io_list was empty, otherwise the work is scheduled already.
(2) ext4_flush_completed_IO is responsible for handling all pending
end_io from ei->i_completed_io_list
Flushing sequence consists of following stages:
A) LOCKED: Atomically drain completed_io_list to local_list
B) Perform extents conversion
C) LOCKED: move converted io's to to_free list for final deletion
This logic depends on context which we was called from.
D) Final end_io context destruction
NOTE1: i_mutex is no longer required because end_io->flags modification
is protected by ei->ext4_complete_io_lock
Full list of changes:
- Move all completion end_io related routines to page-io.c in order to improve
logic locality
- Move open coded logic from various xx_end_xx routines to ext4_add_complete_io()
- remove EXT4_IO_END_FSYNC
- Improve SMP scalability by removing useless i_mutex which does not
protect io->flags anymore.
- Reduce lock contention on i_completed_io_lock by optimizing list walk.
- Rename ext4_end_io_nolock to end4_end_io and make it static
- Check flush completion status to ext4_ext_punch_hole(). Because it is
not good idea to punch blocks from corrupted inode.
Changes since V3 (in request to Jan's comments):
Fall back to active flush_completed_IO() approach in order to prevent
performance issues with nolocked DIO reads.
Changes since V2:
Fix use-after-free caused by race truncate vs end_io_work
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
ext4_set_io_unwritten_flag() will increment i_unwritten counter, so
once we mark end_io with EXT4_END_IO_UNWRITTEN we have to revert it back
on error path.
- add missed error checks to prevent counter leakage
- ext4_end_io_nolock() will clear EXT4_END_IO_UNWRITTEN flag to signal
that conversion finished.
- add BUG_ON to ext4_free_end_io() to prevent similar leakage in future.
Visible effect of this bug is that unaligned aio_stress may deadlock
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
AIO/DIO prefix is wrong because it account unwritten extents which
also may be scheduled from buffered write endio
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Generic inode has unused i_private pointer which may be used as cur_aio_dio
storage.
TODO: If cur_aio_dio will be passed as an argument to get_block_t this allow
to have concurent AIO_DIO requests.
Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Rebased and resending the patch.
Path based queries can fail for lack of access, especially during lookup
during open.
open itself would actually succeed becasue of back up intent bit
but queries (either path or file handle based) do not have a means to
specifiy backup intent bit.
So query the file info during lookup using
trans2 / findfirst / file_id_full_dir_info
to obtain file info as well as file_id/inode value.
Signed-off-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Acked-by: Jeff Layton <jlayton@samba.org>
Signed-off-by: Steve French <smfrench@gmail.com>
Currently it does not do so if the RPC call failed to start. Fix is to
move the decrement of plh_block_lgets into nfs4_layoutreturn_release.
Also remove a redundant test of task->tk_status in nfs4_layoutreturn_done:
if lrp->res.lrs_present is set, then obviously the RPC call succeeded.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Failure of the layoutreturn allocation fails is not a good reason to
mark the pnfs_layout_hdr as having failed a layoutget or i/o. Just
exit cleanly.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
It serves no purpose that the test for whether or not we have valid
layout segments doesn't already serve.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Once all the affected layout segments have been freed up, clear the
NFS_LAYOUT_BULK_RECALL flag so that we can reuse the pnfs_layout_hdr
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
We already have a mechanism for blocking LAYOUTGET by means of the
plh_block_lgets counter. The only "service" that NFS_LAYOUT_DESTROYED
provides at this point is to block layoutget once the layout segment
list is empty, which basically means that you have to wait until
the pnfs_layout_hdr is destroyed before you can do pNFS on that file
again.
This patch enables the reuse of the pnfs_layout_hdr if the layout
segment list is empty.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Ensure that the reference count for pnfs_layout_hdr reverts to the
original value after a call to pnfs_layout_remove_lseg().
Note that the caller is expected to hold a reference to the struct
pnfs_layout_hdr.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
There is no longer a need to use pnfs_free_lseg_list(). Just call
pnfs_free_lseg() directly.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Move the code into pnfs_free_layout_hdr(), and add checks to
get_layout_by_fh_locked to ensure that they don't reference a layout
that is being freed.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
None of the existing pNFS layout drivers seem to require the inode
to be locked while they free the layout header.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Each layout segment already holds a reference to the pnfs_layout_hdr,
so there is no need to hold an extra reference that is released once
the last layout segment is freed.
Ensure that pnfs_find_alloc_layout() always returns a reference
to the pnfs_layout_hdr, which will be matched by the final call to
pnfs_put_layout_hdr() in pnfs_update_layout().
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The latter name is more descriptive of the actual function.
Also rename pnfs_insert_layout to pnfs_layout_insert_lseg.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Instead of resetting the inode MDS threshold counters when we mark
the layout for destruction, do it as part of freeing the layout.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
In all cases where we set NFS_LAYOUT_INVALID, we also set NFS_LAYOUT_DESTROYED.
Furthermore, in all cases where we test for NFS_LAYOUT_INVALID, we should
also be testing for NFS_LAYOUT_DESTROYED, since the latter means that
we hold no valid layout segments.
Ergo the two are redundant.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
If we sleep after dropping the inode->i_lock, then we are no longer
atomic with respect to the rpc_wake_up() call in pnfs_layout_remove_lseg().
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
If pnfs_layout_io_test_failed() authorises a retry of the failed layoutgets,
we should clear the existing layout segments so that we start afresh. Do
this in pnfs_layout_io_set_failed().
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
We want to cache the pnfs_layout_hdr after a layoutget or i/o
failure so that pnfs_update_layout() can find it and know when
it is time to retry.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
If we exit after the call to pnfs_find_alloc_layout(), we have to ensure
that we put the struct pnfs_layout_hdr.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
In cases where the pNFS data server is just temporarily out of service,
we want to mark it as such, and then try again later. Typically that will
be in cases of network connection errors etc.
This patch allows us to mark the devices as being "unavailable" for such
transient errors, and will make them available for retries after a
2 minute timeout period.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
If we had to fall back to read/write through MDS, then assume that we should
retry pNFS after a suitable timeout period.
The following patch sets a timeout of 2 minutes.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Dereferencing nfsi->layout in order to read plh_flags without holding
a spin lock is bug prone. Furthermore, the dprintk() tells you nothing
about whether or not the call succeeded.
Replace it with something that tells you about whether or not a valid
layout segment was returned for the inode in question.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Ensure that we do return errors from nfs4_proc_layoutget() and that we
don't mark the layout as having failed if the error was due to a
signal or resource problem on the client side.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>