Internally the mq policy maintains a promotion threshold variable. If
the hit count of a block not in the cache goes above this threshold it
gets promoted to the cache.
This patch introduces three new tunables that allow you to tweak the
promotion threshold by adding a small value. These adjustments depend
on the io type:
read_promote_adjustment: READ io, default 4
write_promote_adjustment: WRITE io, default 8
discard_promote_adjustment: READ/WRITE io to a discarded block, default 1
If you're trying to quickly warm a new cache device you may wish to
reduce these to encourage promotion. Remember to switch them back to
their defaults after the cache fills though.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
The pool mode must not be switched until after the corresponding pool
process_* methods have been established. Otherwise, because
set_pool_mode() isn't interlocked with the IO path for performance
reasons, the IO path can end up executing process_* operations that
don't match the mode. This patch eliminates problems like the following
(as seen on really fast PCIe SSD storage when transitioning the pool's
mode from PM_READ_ONLY to PM_WRITE):
kernel: device-mapper: thin: 253:2: reached low water mark for data device: sending event.
kernel: device-mapper: thin: 253:2: no free data space available.
kernel: device-mapper: thin: 253:2: switching pool to read-only mode
kernel: device-mapper: thin: 253:2: switching pool to write mode
kernel: ------------[ cut here ]------------
kernel: WARNING: CPU: 11 PID: 7564 at drivers/md/dm-thin.c:995 handle_unserviceable_bio+0x146/0x160 [dm_thin_pool]()
...
kernel: Workqueue: dm-thin do_worker [dm_thin_pool]
kernel: 00000000000003e3 ffff880308831cc8 ffffffff8152ebcb 00000000000003e3
kernel: 0000000000000000 ffff880308831d08 ffffffff8104c46c ffff88032502a800
kernel: ffff880036409000 ffff88030ec7ce00 0000000000000001 00000000ffffffc3
kernel: Call Trace:
kernel: [<ffffffff8152ebcb>] dump_stack+0x49/0x5e
kernel: [<ffffffff8104c46c>] warn_slowpath_common+0x8c/0xc0
kernel: [<ffffffff8104c4ba>] warn_slowpath_null+0x1a/0x20
kernel: [<ffffffffa001e2c6>] handle_unserviceable_bio+0x146/0x160 [dm_thin_pool]
kernel: [<ffffffffa001f276>] process_bio_read_only+0x136/0x180 [dm_thin_pool]
kernel: [<ffffffffa0020b75>] process_deferred_bios+0xc5/0x230 [dm_thin_pool]
kernel: [<ffffffffa0020d31>] do_worker+0x51/0x60 [dm_thin_pool]
kernel: [<ffffffff81067823>] process_one_work+0x183/0x490
kernel: [<ffffffff81068c70>] worker_thread+0x120/0x3a0
kernel: [<ffffffff81068b50>] ? manage_workers+0x160/0x160
kernel: [<ffffffff8106e86e>] kthread+0xce/0xf0
kernel: [<ffffffff8106e7a0>] ? kthread_freezable_should_stop+0x70/0x70
kernel: [<ffffffff8153b3ec>] ret_from_fork+0x7c/0xb0
kernel: [<ffffffff8106e7a0>] ? kthread_freezable_should_stop+0x70/0x70
kernel: ---[ end trace 3f00528e08ffa55c ]---
kernel: device-mapper: thin: pool mode is PM_WRITE not PM_READ_ONLY like expected!?
dm-thin.c:995 was the WARN_ON_ONCE(get_pool_mode(pool) != PM_READ_ONLY);
at the top of handle_unserviceable_bio(). And as the additional
debugging I had conveys: the pool mode was _not_ PM_READ_ONLY like
expected, it was already PM_WRITE, yet pool->process_bio was still set
to process_bio_read_only().
Also, while fixing this up, reduce logging of redundant pool mode
transitions by checking new_mode is different from old_mode.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
The pool's error_if_no_space flag can easily serve the same purpose that
no_free_space did, namely: control whether handle_unserviceable_bio()
will error a bio or requeue it.
This is cleaner since error_if_no_space is established when the pool's
features are processed during table load. So it avoids managing the
no_free_space flag by taking the pool's spinlock.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
If the pool runs out of data or metadata space, the pool can either
queue or error the IO destined to the data device. The default is to
queue the IO until more space is added.
An admin may now configure the pool to error IO when no space is
available by setting the 'error_if_no_space' feature when loading the
thin-pool table.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Now that we switch the pool to read-only mode when the data device runs
out of space it causes active writers to get IO errors once we resume
after resizing the data device.
If no_free_space is set, save bios to the 'retry_on_resume_list' and
requeue them on resume (once the data or metadata device may have been
resized).
With this patch the resize_io test passes again (on slower storage):
dmtest run --suite thin-provisioning -n /resize_io/
Later patches fix some subtle races associated with the pool mode
transitions done as part of the pool's -ENOSPC handling. These races
are exposed on fast storage (e.g. PCIe SSD).
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Factor out_of_data_space() out of alloc_data_block(). Eliminate the use
of 'no_free_space' as a latch in alloc_data_block() -- this is no longer
needed now that we switch to read-only mode when we run out of data or
metadata space. In a later patch, the 'no_free_space' flag will be
eliminated entirely (in favor of checking metadata rather than relying
on a transient flag).
Move no metdata space handling into metdata_operation_failed(). Set
no_free_space when metadata space is exhausted too. This is useful,
because it offers consistency, for the following patch that will requeue
data IOs if no_free_space.
Also, rename no_space() to retry_bios_on_resume().
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Introduce metadata_operation_failed() wrappers, around set_pool_mode(),
to assist with improving the consistency of how metadata failures are
handled. Logging is improved and metadata operation failures trigger
read-only mode immediately.
Also, eliminate redundant set_pool_mode() calls in the two
alloc_data_block() caller's error paths.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Factor check_low_water_mark() out of alloc_data_block().
Change a couple unsigned flags in the pool structure to bool.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Mappings could be processed in descending logical block order,
particularly if buffered IO is used. This could adversely affect the
latency of IO processing. Fix this by adding mappings to the end of the
'prepared_mappings' and 'prepared_discards' lists.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Also, move 'err' member in dm_thin_new_mapping structure to eliminate 4
byte hole (reduces size from 88 bytes to 80).
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
DM's persistent-data library is now used my multiple targets so
exclusive references to "pool" or "thin provisioning" need to be
cleaned up. Adjust Kconfig's DM_DEBUG_BLOCK_STACK_TRACING text
and remove "pool" from a block manager error message.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
The "unable to allocate new metadata block" error can be a particularly
verbose error if there is a systemic issue with the metadata device.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Starting with commit c0820cf5ad ("dm: introduce per_bio_data"),
device mapper has the capability to pre-allocate a target-specific
structure with the bio.
This patch changes dm-delay to use this facility instead of a slab cache
and mempool.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
A device mapper table is allocated in the following way:
* The function dm_table_create is called, it gets the number of targets
as an argument -- it allocates a targets array accordingly.
* For each target, we call dm_table_add_target.
If we add more targets than were specified in dm_table_create, the
function dm_table_add_target reallocates the targets array. However,
this reallocation code is wrong - it moves the targets array to a new
location, while some target constructors hold pointers to the array in
the old location.
The following DM target drivers save the pointer to the target
structure, so they corrupt memory if the target array is moved:
multipath, raid, mirror, snapshot, stripe, switch, thin, verity.
Under normal circumstances, the reallocation function is not called
(because dm_table_create is called with the correct number of targets),
so the buggy reallocation code is not used.
Prior to the fix "dm table: fail dm_table_create on dm_round_up
overflow", the reallocation code could only be used in case the user
specifies too large a value in param->target_count, such as 0xffffffff.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
If a snapshot is created and later deleted the origin dm_thin_device's
snapshotted_time will have been updated to reflect the snapshot's
creation time. The 'shared' flag in the dm_thin_lookup_result struct
returned from dm_thin_find_block() is an approximation based on
snapshotted_time -- this is done to avoid 0(n), or worse, time
complexity. In this case, the shared flag would be true.
But because the 'shared' flag reflects an approximation a block can be
incorrectly assumed to be shared (e.g. false positive for 'shared'
because the snapshot no longer exists). This could result in discards
issued to a thin device not being passed down to the pool's underlying
data device.
To fix this we double check that a thin block is really still in-use
after a mapping is removed using dm_pool_block_is_used(). If the
reference count for a block is now zero the discard is allowed to be
passed down.
Also add a 'definitely_not_shared' member to the dm_thin_new_mapping
structure -- reflects that the 'shared' flag in the response from
dm_thin_find_block() can only be held as definitive if false is
returned.
Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1043527
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
As additional members are added to the dm_thin_new_mapping structure
care should be taken to make sure they get initialized before use.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Cc: stable@vger.kernel.org
For NUMA systems, initializing the blk-mq layer and using per node hctx.
We initialize submit queues to 1, while blk-mq nr_hw_queues is
initialized to the number of NUMA nodes.
This makes the null_init_hctx function overwrite memory outside of what
it allocated. In my case it lead to writing garbage into struct
request_queue's mq_map.
Signed-off-by: Matias Bjorling <m@bjorling.me>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull networking fixes from David Miller:
1) Revert CHECKSUM_COMPLETE optimization in pskb_trim_rcsum(), I can't
figure out why it breaks things.
2) Fix comparison in netfilter ipset's hash_netnet4_data_equal(), it
was basically doing "x == x", from Dave Jones.
3) Freescale FEC driver was DMA mapping the wrong number of bytes, from
Sebastian Siewior.
4) Blackhole and prohibit routes in ipv6 were not doing the right thing
because their ->input and ->output methods were not being assigned
correctly. Now they behave properly like their ipv4 counterparts.
From Kamala R.
5) Several drivers advertise the NETIF_F_FRAGLIST capability, but
really do not support this feature and will send garbage packets if
fed fraglist SKBs. From Eric Dumazet.
6) Fix long standing user triggerable BUG_ON over loopback in RDS
protocol stack, from Venkat Venkatsubra.
7) Several not so common code paths can potentially try to invoke
packet scheduler actions that might be NULL without checking. Shore
things up by either 1) defining a method as mandatory and erroring
on registration if that method is NULL 2) defininig a method as
optional and the registration function hooks up a default
implementation when NULL is seen. From Jamal Hadi Salim.
8) Fix fragment detection in xen-natback driver, from Paul Durrant.
9) Kill dangling enter_memory_pressure method in cg_proto ops, from
Eric W Biederman.
10) SKBs that traverse namespaces should have their local_df cleared,
from Hannes Frederic Sowa.
11) IOCB file position is not being updated by macvtap_aio_read() and
tun_chr_aio_read(). From Zhi Yong Wu.
12) Don't free virtio_net netdev before releasing all of the NAPI
instances. From Andrey Vagin.
13) Procfs entry leak in xt_hashlimit, from Sergey Popovich.
14) IPv6 routes that are no cached routes should not count against the
garbage collection limits. We had this almost right, but were
missing handling addrconf generated routes properly. From Hannes
Frederic Sowa.
15) fib{4,6}_rule_suppress() have to consider potentially seeing NULL
route info when they are called, from Stefan Tomanek.
16) TUN and MACVTAP have had truncated packet signalling for some time,
fix from Jason Wang.
17) Fix use after frrr in __udp4_lib_rcv(), from Eric Dumazet.
18) xen-netback does not interpret the NAPI budget properly for TX work,
fix from Paul Durrant.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (132 commits)
igb: Fix for issue where values could be too high for udelay function.
i40e: fix null dereference
xen-netback: fix gso_prefix check
net: make neigh_priv_len in struct net_device 16bit instead of 8bit
drivers: net: cpsw: fix for cpsw crash when build as modules
xen-netback: napi: don't prematurely request a tx event
xen-netback: napi: fix abuse of budget
sch_tbf: use do_div() for 64-bit divide
udp: ipv4: must add synchronization in udp_sk_rx_dst_set()
net:fec: remove duplicate lines in comment about errata ERR006358
Revert "8390 : Replace ei_debug with msg_enable/NETIF_MSG_* feature"
8390 : Replace ei_debug with msg_enable/NETIF_MSG_* feature
xen-netback: make sure skb linear area covers checksum field
net: smc91x: Fix device tree based configuration so it's usable
udp: ipv4: fix potential use after free in udp_v4_early_demux()
macvtap: signal truncated packets
tun: unbreak truncated packet signalling
net: sched: htb: fix the calculation of quantum
net: sched: tbf: fix the calculation of max_size
micrel: add support for KSZ8041RNLI
...
Pull x86 fixes from Peter Anvin:
"This is a pretty small batch:
The biggest single change is to stop using EFI time services on 32-bit
platforms. This matches our current behavior on 64-bit platforms as
we already had ruled them out there as being too unreliable. Turns
out that affects 32-bit platforms, too.
One NULL pointer fix for SGI UV.
Two minor build fixes, one of which only affects icc and the other
which affects icc and future versions or nonstandard default settings
of gcc"
* 'x86/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86, efi: Don't use (U)EFI time services on 32 bit
x86, build, icc: Remove uninitialized_var() from compiler-intel.h
x86/UV: Fix NULL pointer dereference in uv_flush_tlb_others() if the 'nobau' boot option is used
x86, build: Pass in additional -mno-mmx, -mno-sse options
Pull SELinux fixes from James Morris.
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security:
selinux: process labeled IPsec TCP SYN-ACK packets properly in selinux_ip_postroute()
selinux: look for IPsec labels on both inbound and outbound packets
selinux: handle TCP SYN-ACK packets correctly in selinux_ip_postroute()
selinux: handle TCP SYN-ACK packets correctly in selinux_ip_output()
selinux: fix possible memory leak
This reverts commit 102aefdda4.
Tom London reports that it causes sync() to hang on Fedora rawhide:
https://bugzilla.redhat.com/show_bug.cgi?id=1033965
and Josh Boyer bisected it down to this commit. Reverting the commit in
the rawhide kernel fixes the problem.
Eric Paris root-caused it to incorrect subtype matching in that commit
breaking fuse, and has a tentative patch, but by now we're better off
retrying this in 3.14 rather than playing with it any more.
Reported-by: Tom London <selinux@gmail.com>
Bisected-by: Josh Boyer <jwboyer@fedoraproject.org>
Acked-by: Eric Paris <eparis@redhat.com>
Cc: James Morris <jmorris@namei.org>
Cc: Anand Avati <avati@redhat.com>
Cc: Paul Moore <paul@paul-moore.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch changes the igb_phy_has_link function to check the value of the
parameter before deciding to use udelay or mdelay in order to be sure that
the value is not too high for udelay function.
CC: stable kernel <stable@vger.kernel.org> # 3.9+
Signed-off-by: Sunil K Pandey <sunil.k.pandey@intel.com>
Signed-off-by: Kevin B Smith <kevin.b.smith@intel.com>
Signed-off-by: Carolyn Wyborny <carolyn.wyborny@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If the vsi->tx_rings structure is NULL we don't want to panic.
Change-Id: Ic694f043701738c434e8ebe0caf0673f4410dc10
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: Kavindya Deegala <kavindya.s.deegala@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull ARM fixes from Russell King:
"This resolves some further issues with the dma mask changes on ARM
which have been found by TI and others, and also some corner cases
with the updates to the virtual to physical address translations.
Konstantin also found some problems with the unwinder, which now
performs tighter verification that the stack is valid while unwinding"
* 'fixes' of git://ftp.arm.linux.org.uk/~rmk/linux-arm:
ARM: fix asm/memory.h build error
ARM: 7917/1: cacheflush: correctly limit range of memory region being flushed
ARM: 7913/1: fix framepointer check in unwind_frame
ARM: 7912/1: check stack pointer in get_wchan
ARM: 7909/1: mm: Call setup_dma_zone() post early_paging_init()
ARM: 7908/1: mm: Fix the arm_dma_limit calculation
ARM: another fix for the DMA mapping checks
- Couple of fixes for recently added perf code
- Build time extable sort
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
iQIcBAABAgAGBQJSqqyMAAoJEGnX8d3iisJeTG4QAMUxMnqDHJL919gukLAoivom
BdLdyPkHXECnwvu9G4kd4kOHvF37QUDSaIJYlgHNA7+vkZg2O9qPBWBAl5DQQ8BU
nOQeurxnmNKvhBNcLJzRt+MF6J3TATV28sHB3TF5XSC/JV6yCdwhztBNUjeynRHT
fDBjVyK5tdRCsdh1lRID/4cQW6SnNG4VPuyQHCRt+PZ84nE7AHKu5eYMkSnIpof5
x4/y/kEYLtzuOfbjgze+ZZk9QlR+ymEVq+YSQsbGH8dM5curazGMh4lh2isa0nkJ
G4ptA/E3XSvqNkwgNSYeDss8ugxvwHnjAufgSYOlBSZjfCWxwiA8UC1nA0eks4OW
MBIwCZe6Qo8HyRfZWQgvNOjP81Q9LWfRNa7UObB2HcNvXDghuTmcmOjZJSheZWip
KA7fuISnUz24mwdRlSMwfLjG5zh13GKphpb/PL79m+uzrVB8yHfJWg8nBU7y8Tfn
j8BmyxS9cQQPN6lC2w0ESx4Fp891yR63KNKZq+MLZCj/4iP0h9s2ifL8o/xx03a0
WgqNZJaXYnssqsZAd1BhnV7Oma/OJmrwm7LgWVxAr01FvjONAh/bd3LJR0G2Nksy
PMJI0NnVWrHrso9BeWQ4f0L//tamtmkBqrTjXxgrgQisuxntxdhe16xa0FhsrmIi
B/wfllRLceFyqT78GvNr
=JA1e
-----END PGP SIGNATURE-----
Merge tag 'arc-fixes-for-3.13' of git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc
Pull ARC fixes from Vineet Gupta:
"These are couple of weeks old already, but I just couldn't get them to
you earlier.
- couple of fixes for recently added perf code
- build time extable sort"
* tag 'arc-fixes-for-3.13' of git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc:
ARC: [perf] Fix a few thinkos
ARC: Add guard macro to uapi/asm/unistd.h
ARC: extable: Enable sorting at build time
A fix for possible memory corruption during DM table load, fix a
possible leak of snapshot space in case of a crash, fix a possible
deadlock due to a shared workqueue in the delay target, fix to
initialize read-only module parameters that are used to export metrics
for dm stats and dm bufio.
Quite a few stable fixes were identified for both the thin-provisioning
and caching targets as a result of increased regression testing using
the device-mapper-test-suite (dmts). The most notable of these are the
reference counting fixes for the space map btree that is used by the
dm-array interface -- without these the dm-cache metadata will leak,
resulting in dm-cache devices running out of metadata blocks. Also,
some important fixes related to the thin-provisioning target's
transition to read-only mode on error.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.15 (GNU/Linux)
iQEcBAABAgAGBQJSq2pVAAoJEMUj8QotnQNaAzEH/0qpz06SVd3+oTbsWeWxXwax
B0zIZ4HPnaD9FDfSWN+6IdQ6Rkn51cspMBHPxWX0I8pmqOTg7MUM7wbaZZaKT3UW
aDlibzG1O3zBGbkr+Qhbh89fJK5/3xnZXlK0hOyaNI2vz1rA6RThZ2hIeak9xQYr
7USz2bjSMXchwebb0Z01CrRnOkhUFg6yUJURIEu4XFCJmDM/PM9zKogLGYKuZedK
pZePnZhwgkLMn7f9l4lPHk3EcWeD4Nf9WR0lS4eKdbYsvMxm1QE7BpDlVJD0Asjg
/SVOVkgygW18TINMHuw1K0DPdSCo5UZF2HRj1QUD/X4N2BSkio+E1qwF6kT4ltk=
=0Ix+
-----END PGP SIGNATURE-----
Merge tag 'dm-3.13-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device mapper fixes from Mike Snitzer:
"A set of device-mapper fixes for 3.13.
A fix for possible memory corruption during DM table load, fix a
possible leak of snapshot space in case of a crash, fix a possible
deadlock due to a shared workqueue in the delay target, fix to
initialize read-only module parameters that are used to export metrics
for dm stats and dm bufio.
Quite a few stable fixes were identified for both the thin-
provisioning and caching targets as a result of increased regression
testing using the device-mapper-test-suite (dmts). The most notable
of these are the reference counting fixes for the space map btree that
is used by the dm-array interface -- without these the dm-cache
metadata will leak, resulting in dm-cache devices running out of
metadata blocks. Also, some important fixes related to the
thin-provisioning target's transition to read-only mode on error"
* tag 'dm-3.13-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
dm array: fix a reference counting bug in shadow_ablock
dm space map: disallow decrementing a reference count below zero
dm stats: initialize read-only module parameter
dm bufio: initialize read-only module parameters
dm cache: actually resize cache
dm cache: update Documentation for invalidate_cblocks's range syntax
dm cache policy mq: fix promotions to occur as expected
dm thin: allow pool in read-only mode to transition to read-write mode
dm thin: re-establish read-only state when switching to fail mode
dm thin: always fallback the pool mode if commit fails
dm thin: switch to read-only mode if metadata space is exhausted
dm thin: switch to read only mode if a mapping insert fails
dm space map metadata: return on failure in sm_metadata_new_block
dm table: fail dm_table_create on dm_round_up overflow
dm snapshot: avoid snapshot space leak on crash
dm delay: fix a possible deadlock due to shared workqueue
Jason Gunthorpe reports a build failure when ARM_PATCH_PHYS_VIRT is
not defined:
In file included from arch/arm/include/asm/page.h:163:0,
from include/linux/mm_types.h:16,
from include/linux/sched.h:24,
from arch/arm/kernel/asm-offsets.c:13:
arch/arm/include/asm/memory.h: In function '__virt_to_phys':
arch/arm/include/asm/memory.h:244:40: error: 'PHYS_OFFSET' undeclared (first use in this function)
arch/arm/include/asm/memory.h:244:40: note: each undeclared identifier is reported only once for each function it appears in
arch/arm/include/asm/memory.h: In function '__phys_to_virt':
arch/arm/include/asm/memory.h:249:13: error: 'PHYS_OFFSET' undeclared (first use in this function)
Fixes: ca5a45c06c ("ARM: mm: use phys_addr_t appropriately in p2v and v2p conversions")
Tested-By: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
A small set of driver fixes plus one larger core change which changes
the way we check to see if we're using DT so that there aren't any races
between deciding we're using DT and the regulator subsystem noticing.
This makes the new support for substituting a dummy regulator and
optional regulators work a lot better on DT systems since it ensures
that we don't trigger probe deferral when we shouldn't which was causing
bugs in clients.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.15 (GNU/Linux)
iQIcBAABAgAGBQJSqxI9AAoJELSic+t+oim9nuMQAJd2MqTivDav+1dAJgWT7chT
GUb61Ks3HjmsRz/76YV+CQBO7puFxR9100U/keb3Xmg8oHJSVP6xCroA5JILqhwZ
6rDsRXOKgbqlVruundkceWZJHQEbszUpdbnU8GXiNyNI8EiVoVZCgXSnPM2wyD6o
EdxjqaXi/GUodaGFBfMyMpj387QwWgCi+DocUf622fTUHLEOKjjjndsKssTW2jyf
NrRQiTnQ6Yecf8lI2rHN5C6p8MyJ8IF3i2d4pi1eBAfWF0OfeYRrm694IrbZ8Idl
vAH4BxMf111JC7apuOTHNUSpL1DV4mjYQEeXUvd3wfnWEMRkFaEgwmTRmZZAfl/i
KM+5Yob1IdStfNwayKAVsPbIqYeyV0zDkN4CteY5XtWYLUqKJon6wuSGzYRABID2
uRa82dlSWMaX89+nHPCf22F7op8qRPLgr11yg7Nvo5qB+0Snij341libjrJGY09y
wFx6fdxL4OMkyRpwyB6tkWyAjUPbMJDAvrOnA2x7nU+AS1ytGAJeJMUpzYhUEly/
31kVJBi+mPRRmBsG+Fe9ALp+4k/UpMajCYWXa4/q+Bs7r3FCzWU98NeRxMurUKfO
cco6diDSLTVaQKHcUqPW0g0BWGrggro4H5CHe5MBBi2mHK3IMuqnSYjTDiJpEh7I
Tlad7Or4kd41FCk3Wpfi
=WqDN
-----END PGP SIGNATURE-----
Merge tag 'regulator-v3.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator
Pull regulator fixes from Mark Brown:
"A small set of driver fixes plus one larger core change which changes
the way we check to see if we're using DT so that there aren't any
races between deciding we're using DT and the regulator subsystem
noticing.
This makes the new support for substituting a dummy regulator and
optional regulators work a lot better on DT systems since it ensures
that we don't trigger probe deferral when we shouldn't which was
causing bugs in clients"
* tag 'regulator-v3.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator:
regulator: pfuze100: allow misprogrammed ID
regulator: pfuze100: Fix address of FABID
regulator: as3722: set the correct current limit
regulator: core: Check for DT every time we check full constraints
regulator: core: Replace checks of have_full_constraints with a function
Two small changes to fix some error handling and checking (both of which
would be quite serious if the errors trigger) plus a trivial
documentation fix.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.15 (GNU/Linux)
iQIcBAABAgAGBQJSqw5tAAoJELSic+t+oim9s5sP/3djfdZfsi0qm4hi6paMc6pF
KyP92oHwiMu5RyCV+4UVwtaZnbtBip9uI2z3AB3sozJFNVmSNZ7OoTP2lS1vkLQ+
1XcfJH65m3SPWXliixEjnfUT0UcgXBCDVREcuGX/Iu7RfQzplDfhco2LtfaDbzAa
pTD/Q6vDFxGAm0lGBOPtZ0CB14jb9i0qD4HHBY1TG7YchgLQBe6VC6JB23fGN/Y9
1+rBvS03xrBw6h9RaFLVX7Bk4/i0jrj2UpUy8C3cqK8VdJYLKx06M13oP31mQcwo
jDusm8Pz0SSiT6b8pt334QpU2YvtTNqEAZ1Kp6hZRhGuRqJr0Qb/k+l85koeiT++
+bYht3yTL9/uqQ1felViH4hvvuRhjyHuaaEPinaWyQJhYsP87dG24h9iWG8fE3k7
ft2C+7ho+q8ecW8O48eC9hWFLEBLR9tMNfQM6soHT1h4YsQM3AMih5AhjHdJlavH
yyviE6Nog6279XzJHZb6pcRyIKUKR51fsjaF6ofTWJt5zG4npF3wG96JnHtuPMgh
iIqSSbjBdBRwZ39S1RW2Q+pCai3FpXyVQAlCl7XXWKLTjfLVnMDXQ6VMy4a51UL4
D4fFVlR0nIjeWUTY0vBnFEc+4lYn83aMkkHh/xW+wmzuXstEl7KB/rkD3+EW5Hpu
KZWJmo3xow1oUtziJYDv
=YKD0
-----END PGP SIGNATURE-----
Merge tag 'regmap-v3.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regmap
Pull regmap fixes from Mark Brown:
"Two small changes to fix some error handling and checking (both of
which would be quite serious if the errors trigger) plus a trivial
documentation fix"
* tag 'regmap-v3.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regmap:
regmap: use IS_ERR() to check clk_get() results
regmap: make sure we unlock on failure in regmap_bulk_write
regmap: trivial comment fix (copy'n'paste error)
Pull i2c fixes from Wolfram Sang:
"Here are two simple but wanted fixes for the i2c subsystem"
* 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
i2c: imx: Check the return value from clk_prepare_enable()
i2c: mux: Inherit retry count and timeout from parent for muxed bus
- This driver was not ready to fully Armada 370 NAND, with particularly
notable problems seen on flash with 2KB page sizes. This "compatible" entry
really should have been held back until 3.14 or later.
- Fix a bug seen in rare cases on the error path of a failed probe attempt,
where we free unallocated DMA resources
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.14 (GNU/Linux)
iQIcBAABAgAGBQJSqrVaAAoJEFySrpd9RFgtaUoP/jklkxrltohyejpHPhsHNWK9
d8OtfIGPNKTFUtuwOAMDBuLt+/lX4f16sWC8PoE1biGhEk/tA/YuisLS1IWNhYYW
Jms1ZTr4juvahJaLoahmJcEd21ZWV36tlvbUPQcQQCFwjuuQUPn0M755XKL7sC8G
09+1VYVjOzX1aAodhYu9JyntGtYqXrVj8hDhhRY4rP4KLbmE/XELmsgp+JPzHs6T
O1X3B3mY4SRWwHQAGtNdzzoGMSIgoeXvydHkgRDYh92iVoL2se8mVfpTGYW9a27T
alBbOm1rCuuS+3WmphB0u5oSRgcykMByP2ucTWlnP1C60DvYctZ42sOdFKPPpwzE
76DJ9HoZt9gO+ch2pn2hXtD60U4b8oZccZJP4WLkio2/nWP/Piae1nyuLtM2AlwO
4Kj/boIymU3VWkPbIj/Dq/9MF7h3eF8M1kr6JKc3MlXU9ZnQQXpBGIBOdiZhcTbY
fyxEBEQoolRE9FPPZWOOiGczTpafoot9o3kCM15G4cHCTVTzK3iuPDGIIDX97nKw
9vAdEP6/yyKnMSdh+OXZ+g134vAXYKezf9QzNastXz5QtYY+0pDOc/5shF9aPsN8
94x7Ub4WzSm+r+dT0m1CtesLMKYIoHtfBjEZA8CVaTfoK+X4oH7eXxNsne5LCRAS
LCbXjnk/WlbydNSbVKGA
=K24+
-----END PGP SIGNATURE-----
Merge tag 'for-linus-20131212' of git://git.infradead.org/linux-mtd
Pull MTD fixes from Brian Norris:
"Two MTD fixes, for the pxa3xx-nand driver:
- This driver was not ready to fully Armada 370 NAND, with
particularly notable problems seen on flash with 2KB page sizes.
This "compatible" entry really should have been held back until
3.14 or later.
- Fix a bug seen in rare cases on the error path of a failed probe
attempt, where we free unallocated DMA resources"
* tag 'for-linus-20131212' of git://git.infradead.org/linux-mtd:
mtd: nand: pxa3xx: Use info->use_dma to release DMA resources
Partially revert "mtd: nand: pxa3xx: Introduce 'marvell,armada370-nand' compatible string"
Pull slave-dmaengine fixes from Vinod Koul:
"Here is the common fixes PULL for dmaengine.
Dan has been working on fixing the build issues in bunch of drivers.
Here we have one fixing s3c24xx-dma, along with fix from Russell on
pl08x. Also we have Kuninori rcar dma fixes. The s3c24xx-dma which
was added in last merge window missed updates to usage of DMA_COMPLETE
so converting the last driver"
* 'fixes' of git://git.infradead.org/users/vkoul/slave-dma:
dma: fix build breakage in s3c24xx-dma
Fix pl08x warnings
rcar-hpbdma: initialise plane information when halted
rcar-hpbdma: fixup channel busy check for double plane
rcar-hpbdma: add max transfer size
dma: mmp_pdma: add missing platform_set_drvdata() in mmp_pdma_probe()
dmaengine: s3c24xx-dma: use DMA_COMPLETE for dma completion status
An old array block could have its reference count decremented below
zero when it is being replaced in the btree by a new array block.
The fix is to increment the old ablock's reference count just before
inserting a new ablock into the btree.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # 3.9+
The old behaviour, returning -EINVAL if a ref_count of 0 would be
decremented, was removed in commit f722063 ("dm space map: optimise
sm_ll_dec and sm_ll_inc"). To fix this regression we return an error
code from the mutator function pointer passed to sm_ll_mutate() and have
dec_ref_count() return -EINVAL if the old ref_count is 0.
Add a DMERR to reflect the potential seriousness of this error.
Also, add missing dm_tm_unlock() to sm_ll_mutate()'s error path.
With this fix the following dmts regression test now passes:
dmtest run --suite cache -n /metadata_use_kernel/
The next patch fixes the higher-level dm-array code that exposed this
regression.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # 3.12+
Merge patches from Andrew Morton:
"13 fixes"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
mm: memcg: do not allow task about to OOM kill to bypass the limit
mm: memcg: fix race condition between memcg teardown and swapin
thp: move preallocated PTE page table on move_huge_pmd()
mfd/rtc: s5m: fix register updating by adding regmap for RTC
rtc: s5m: enable IRQ wake during suspend
rtc: s5m: limit endless loop waiting for register update
rtc: s5m: fix unsuccesful IRQ request during probe
drivers/rtc/rtc-s5m.c: fix info->rtc assignment
include/linux/kernel.h: make might_fault() a nop for !MMU
drivers/rtc/rtc-at91rm9200.c: correct alarm over day/month wrap
procfs: also fix proc_reg_get_unmapped_area() for !MMU case
mm: memcg: do not declare OOM from __GFP_NOFAIL allocations
include/linux/hugetlb.h: make isolate_huge_page() an inline
Commit 4942642080 ("mm: memcg: handle non-error OOM situations more
gracefully") allowed tasks that already entered a memcg OOM condition to
bypass the memcg limit on subsequent allocation attempts hoping this
would expedite finishing the page fault and executing the kill.
David Rientjes is worried that this breaks memcg isolation guarantees
and since there is no evidence that the bypass actually speeds up fault
processing just change it so that these subsequent charge attempts fail
outright. The notable exception being __GFP_NOFAIL charges which are
required to bypass the limit regardless.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: David Rientjes <rientjes@google.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-bt: David Rientjes <rientjes@google.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There is a race condition between a memcg being torn down and a swapin
triggered from a different memcg of a page that was recorded to belong
to the exiting memcg on swapout (with CONFIG_MEMCG_SWAP extension). The
result is unreclaimable pages pointing to dead memcgs, which can lead to
anything from endless loops in later memcg teardown (the page is charged
to all hierarchical parents but is not on any LRU list) or crashes from
following the dangling memcg pointer.
Memcgs with tasks in them can not be torn down and usually charges don't
show up in memcgs without tasks. Swapin with the CONFIG_MEMCG_SWAP
extension is the notable exception because it charges the cgroup that
was recorded as owner during swapout, which may be empty and in the
process of being torn down when a task in another memcg triggers the
swapin:
teardown: swapin:
lookup_swap_cgroup_id()
rcu_read_lock()
mem_cgroup_lookup()
css_tryget()
rcu_read_unlock()
disable css_tryget()
call_rcu()
offline_css()
reparent_charges()
res_counter_charge() (hierarchical!)
css_put()
css_free()
pc->mem_cgroup = dead memcg
add page to dead lru
Add a final reparenting step into css_free() to make sure any such raced
charges are moved out of the memcg before it's finally freed.
In the longer term it would be cleaner to have the css_tryget() and the
res_counter charge under the same RCU lock section so that the charge
reparenting is deferred until the last charge whose tryget succeeded is
visible. But this will require more invasive changes that will be
harder to evaluate and backport into stable, so better defer them to a
separate change set.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andrey Wagin reported crash on VM_BUG_ON() in pgtable_pmd_page_dtor() with
fallowing backtrace:
free_pgd_range+0x2bf/0x410
free_pgtables+0xce/0x120
unmap_region+0xe0/0x120
do_munmap+0x249/0x360
move_vma+0x144/0x270
SyS_mremap+0x3b9/0x510
system_call_fastpath+0x16/0x1b
The crash can be reproduce with this test case:
#define _GNU_SOURCE
#include <sys/mman.h>
#include <stdio.h>
#include <unistd.h>
#define MB (1024 * 1024UL)
#define GB (1024 * MB)
int main(int argc, char **argv)
{
char *p;
int i;
p = mmap((void *) GB, 10 * MB, PROT_READ | PROT_WRITE,
MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
for (i = 0; i < 10 * MB; i += 4096)
p[i] = 1;
mremap(p, 10 * MB, 10 * MB, MREMAP_FIXED | MREMAP_MAYMOVE, 2 * GB);
return 0;
}
Due to split PMD lock, we now store preallocated PTE tables for THP
pages per-PMD table. It means we need to move them to other PMD table
if huge PMD moved there.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: Andrey Vagin <avagin@openvz.org>
Tested-by: Andrey Vagin <avagin@openvz.org>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Rename old regmap field of "struct sec_pmic_dev" to "regmap_pmic" and
add new regmap for RTC.
On S5M8767A registers were not properly updated and read due to usage of
the same regmap as the PMIC. This could be observed in various hangs,
e.g. in infinite loop during waiting for UDR field change.
On this chip family the RTC has different I2C address than PMIC so
additional regmap is needed.
Signed-off-by: Krzysztof Kozlowski <k.kozlowski@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
Reviewed-by: Mark Brown <broonie@linaro.org>
Acked-by: Sangbeom Kim <sbkim73@samsung.com>
Cc: Samuel Ortiz <sameo@linux.intel.com>
Cc: Lee Jones <lee.jones@linaro.org>
Cc: Liam Girdwood <lgirdwood@gmail.com>
Cc: Alessandro Zummo <a.zummo@towertech.it>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Kyungmin Park <kyungmin.park@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add PM suspend/resume ops to rtc-s5m driver and enable IRQ wake during
suspend so the RTC would act like a wake up source. This allows waking
up from suspend to RAM on RTC alarm interrupt.
Signed-off-by: Krzysztof Kozlowski <k.kozlowski@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
Cc: Mark Brown <broonie@linaro.org>
Acked-by: Sangbeom Kim <sbkim73@samsung.com>
Cc: Samuel Ortiz <sameo@linux.intel.com>
Cc: Lee Jones <lee.jones@linaro.org>
Cc: Liam Girdwood <lgirdwood@gmail.com>
Cc: Alessandro Zummo <a.zummo@towertech.it>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Kyungmin Park <kyungmin.park@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>