OpenCloudOS-Kernel

Commit Graph

Author	SHA1	Message	Date
Tetsuo Handa	1c500ad706	loop: reduce the loop_ctl_mutex scope syzbot is reporting circular locking problem at __loop_clr_fd() [1], for commit `a160c6159d` ("block: add an optional probe callback to major_names") is calling the module's probe function with major_names_lock held. Fortunately, since commit `990e78116d` ("block: loop: fix deadlock between open and remove") stopped holding loop_ctl_mutex in lo_open(), current role of loop_ctl_mutex is to serialize access to loop_index_idr and loop_add()/loop_remove(); in other words, management of id for IDR. To avoid holding loop_ctl_mutex during whole add/remove operation, use a bool flag to indicate whether the loop device is ready for use. loop_unregister_transfer() which is called from cleanup_cryptoloop() currently has possibility of use-after-free problem due to lack of serialization between kfree() from loop_remove() from loop_control_remove() and mutex_lock() from unregister_transfer_cb(). But since lo->lo_encryption should be already NULL when this function is called due to module unload, and commit `222013f9ac` ("cryptoloop: add a deprecation warning") indicates that we will remove this function shortly, this patch updates this function to emit warning instead of checking lo->lo_encryption. Holding loop_ctl_mutex in loop_exit() is pointless, for all users must close /dev/loop-control and /dev/loop$num (in order to drop module's refcount to 0) before loop_exit() starts, and nobody can open /dev/loop-control or /dev/loop$num afterwards. Link: https://syzkaller.appspot.com/bug?id=7bb10e8b62f83e4d445cdf4c13d69e407e629558 [1] Reported-by: syzbot <syzbot+f61766d5763f9e7a118f@syzkaller.appspotmail.com> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/adb1e792-fc0e-ee81-7ea0-0906fc36419d@i-love.sakura.ne.jp Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-09-03 22:14:40 -06:00
Dan Schatzberg	c74d40e8b5	loop: charge i/o to mem and blk cg The current code only associates with the existing blkcg when aio is used to access the backing file. This patch covers all types of i/o to the backing file and also associates the memcg so if the backing file is on tmpfs, memory is charged appropriately. This patch also exports cgroup_get_e_css and int_active_memcg so it can be used by the loop module. Link: https://lkml.kernel.org/r/20210610173944.1203706-4-schatzberg.dan@gmail.com Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Jens Axboe <axboe@kernel.dk> Cc: Chris Down <chris@chrisdown.name> Cc: Michal Hocko <mhocko@suse.com> Cc: Ming Lei <ming.lei@redhat.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2021-06-29 10:53:50 -07:00
Dan Schatzberg	87579e9b7d	loop: use worker per cgroup instead of kworker Patch series "Charge loop device i/o to issuing cgroup", v14. The loop device runs all i/o to the backing file on a separate kworker thread which results in all i/o being charged to the root cgroup. This allows a loop device to be used to trivially bypass resource limits and other policy. This patch series fixes this gap in accounting. A simple script to demonstrate this behavior on cgroupv2 machine: ''' #!/bin/bash set -e CGROUP=/sys/fs/cgroup/test.slice LOOP_DEV=/dev/loop0 if [[ ! -d $CGROUP ]] then sudo mkdir $CGROUP fi grep oom_kill $CGROUP/memory.events # Set a memory limit, write more than that limit to tmpfs -> OOM kill sudo unshare -m bash -c " echo \$\$ > $CGROUP/cgroup.procs; echo 0 > $CGROUP/memory.swap.max; echo 64M > $CGROUP/memory.max; mount -t tmpfs -o size=512m tmpfs /tmp; dd if=/dev/zero of=/tmp/file bs=1M count=256" \|\| true grep oom_kill $CGROUP/memory.events # Set a memory limit, write more than that limit through loopback # device -> no OOM kill sudo unshare -m bash -c " echo \$\$ > $CGROUP/cgroup.procs; echo 0 > $CGROUP/memory.swap.max; echo 64M > $CGROUP/memory.max; mount -t tmpfs -o size=512m tmpfs /tmp; truncate -s 512m /tmp/backing_file losetup $LOOP_DEV /tmp/backing_file dd if=/dev/zero of=$LOOP_DEV bs=1M count=256; losetup -D $LOOP_DEV" \|\| true grep oom_kill $CGROUP/memory.events ''' Naively charging cgroups could result in priority inversions through the single kworker thread in the case where multiple cgroups are reading/writing to the same loop device. This patch series does some minor modification to the loop driver so that each cgroup can make forward progress independently to avoid this inversion. With this patch series applied, the above script triggers OOM kills when writing through the loop device as expected. This patch (of 3): Existing uses of loop device may have multiple cgroups reading/writing to the same device. Simply charging resources for I/O to the backing file could result in priority inversion where one cgroup gets synchronously blocked, holding up all other I/O to the loop device. In order to avoid this priority inversion, we use a single workqueue where each work item is a "struct loop_worker" which contains a queue of struct loop_cmds to issue. The loop device maintains a tree mapping blk css_id -> loop_worker. This allows each cgroup to independently make forward progress issuing I/O to the backing file. There is also a single queue for I/O associated with the rootcg which can be used in cases of extreme memory shortage where we cannot allocate a loop_worker. The locking for the tree and queues is fairly heavy handed - we acquire a per-loop-device spinlock any time either is accessed. The existing implementation serializes all I/O through a single thread anyways, so I don't believe this is any worse. [colin.king@canonical.com: fixes] Link: https://lkml.kernel.org/r/20210610173944.1203706-1-schatzberg.dan@gmail.com Link: https://lkml.kernel.org/r/20210610173944.1203706-2-schatzberg.dan@gmail.com Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Acked-by: Jens Axboe <axboe@kernel.dk> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Chris Down <chris@chrisdown.name> Cc: Shakeel Butt <shakeelb@google.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2021-06-29 10:53:50 -07:00
Christoph Hellwig	990e78116d	block: loop: fix deadlock between open and remove Commit `c76f48eb5c` ("block: take bd_mutex around delete_partitions in del_gendisk") adds disk->part0->bd_mutex in del_gendisk(), this way causes the following AB/BA deadlock between removing loop and opening loop: 1) loop_control_ioctl(LOOP_CTL_REMOVE) -> mutex_lock(&loop_ctl_mutex) -> del_gendisk -> mutex_lock(&disk->part0->bd_mutex) 2) blkdev_get_by_dev -> mutex_lock(&disk->part0->bd_mutex) -> lo_open -> mutex_lock(&loop_ctl_mutex) Add a new Lo_deleting state to remove the need for clearing ->private_data and thus holding loop_ctl_mutex in the ioctl LOOP_CTL_REMOVE path. Based on an analysis and earlier patch from Ming Lei <ming.lei@redhat.com>. Reported-by: Colin Ian King <colin.king@canonical.com> Fixes: `c76f48eb5c` ("block: take bd_mutex around delete_partitions in del_gendisk") Signed-off-by: Christoph Hellwig <hch@lst.de> Tested-by: Colin Ian King <colin.king@canonical.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20210605140950.5800-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-06-11 11:50:54 -06:00
Pavel Tatashin	6cc8e74308	loop: scale loop device by introducing per device lock Currently, loop device has only one global lock: loop_ctl_mutex. This becomes hot in scenarios where many loop devices are used. Scale it by introducing per-device lock: lo_mutex that protects modifications of all fields in struct loop_device. Keep loop_ctl_mutex to protect global data: loop_index_idr, loop_lookup, loop_add. The new lock ordering requirement is that loop_ctl_mutex must be taken before lo_mutex. Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Tyler Hicks <tyhicks@linux.microsoft.com> Reviewed-by: Petr Vorel <pvorel@suse.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-01-26 13:08:54 -07:00
Tetsuo Handa	310ca162d7	block/loop: Use global lock for ioctl() operation. syzbot is reporting NULL pointer dereference [1] which is caused by race condition between ioctl(loop_fd, LOOP_CLR_FD, 0) versus ioctl(other_loop_fd, LOOP_SET_FD, loop_fd) due to traversing other loop devices at loop_validate_file() without holding corresponding lo->lo_ctl_mutex locks. Since ioctl() request on loop devices is not frequent operation, we don't need fine grained locking. Let's use global lock in order to allow safe traversal at loop_validate_file(). Note that syzbot is also reporting circular locking dependency between bdev->bd_mutex and lo->lo_ctl_mutex [2] which is caused by calling blkdev_reread_part() with lock held. This patch does not address it. [1] https://syzkaller.appspot.com/bug?id=f3cfe26e785d85f9ee259f385515291d21bd80a3 [2] https://syzkaller.appspot.com/bug?id=bf154052f0eea4bc7712499e4569505907d15889 Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reported-by: syzbot <syzbot+bf89c128e05dd6c62523@syzkaller.appspotmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2018-11-08 06:30:11 -07:00
Tetsuo Handa	d3349b6b3c	loop: remember whether sysfs_create_group() was done syzbot is hitting WARN() triggered by memory allocation fault injection [1] because loop module is calling sysfs_remove_group() when sysfs_create_group() failed. Fix this by remembering whether sysfs_create_group() succeeded. [1] https://syzkaller.appspot.com/bug?id=3f86c0edf75c86d2633aeb9dd69eccc70bc7e90b Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reported-by: syzbot <syzbot+9f03168400f56df89dbc6f1751f4458fe739ff29@syzkaller.appspotmail.com> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Renamed sysfs_ready -> sysfs_inited. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2018-05-07 15:26:36 -06:00
Jens Axboe	1894e91654	loop: remove cmd->rq member We can always get at the request from the payload, no need to store a pointer to it. Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2018-04-14 22:34:27 -06:00
Shaohua Li	d4478e92d6	block/loop: make loop cgroup aware loop block device handles IO in a separate thread. The actual IO dispatched isn't cloned from the IO loop device received, so the dispatched IO loses the cgroup context. I'm ignoring buffer IO case now, which is quite complicated. Making the loop thread aware cgroup context doesn't really help. The loop device only writes to a single file. In current writeback cgroup implementation, the file can only belong to one cgroup. For direct IO case, we could workaround the issue in theory. For example, say we assign cgroup1 5M/s BW for loop device and cgroup2 10M/s. We can create a special cgroup for loop thread and assign at least 15M/s for the underlayer disk. In this way, we correctly throttle the two cgroups. But this is tricky to setup. This patch tries to address the issue. We record bio's css in loop command. When loop thread is handling the command, we then use the API provided in patch 1 to set the css for current task. The bio layer will use the css for new IO (from patch 3). Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2017-09-26 07:41:22 -06:00
Omar Sandoval	e5313c141b	loop: remove union of use_aio and ref in struct loop_cmd When the request is completed, lo_complete_rq() checks cmd->use_aio. However, if this is in fact an aio request, cmd->use_aio will have already been reused as cmd->ref by lo_rw_aio*. Fix it by not using a union. On x86_64, there's a hole after the union anyways, so this doesn't make struct loop_cmd any bigger. Fixes: `92d773324b` ("block/loop: fix use after free") Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2017-09-25 08:56:05 -06:00
Shaohua Li	bc75705d00	block/loop: remove unused field nobody uses the list. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2017-09-01 13:57:35 -06:00
Shaohua Li	92d773324b	block/loop: fix use after free lo_rw_aio->call_read_iter-> 1 aops->direct_IO 2 iov_iter_revert lo_rw_aio_complete could happen between 1 and 2, the bio and bvec could be freed before 2, which accesses bvec. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2017-09-01 13:57:33 -06:00
Shaohua Li	40326d8a33	block/loop: allow request merge for directio mode Currently loop disables merge. While it makes sense for buffer IO mode, directio mode can benefit from request merge. Without merge, loop could send small size IO to underlayer disk and harm performance. Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2017-09-01 08:44:34 -06:00
Omar Sandoval	8a0740c410	loop: get rid of lo_blocksize This is only used for setting the soft block size on the struct block_device once and then never used again. Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2017-08-31 13:51:10 -06:00
Omar Sandoval	1e6ec9ea89	Revert "loop: support 4k physical blocksize" There's some stuff still up in the air, let's not get stuck with a subpar ABI. I'll follow up with something better for 4.14. Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2017-08-23 15:57:55 -06:00
Hannes Reinecke	f2c6df7dbf	loop: support 4k physical blocksize When generating bootable VM images certain systems (most notably s390x) require devices with 4k blocksize. This patch implements a new flag 'LO_FLAGS_BLOCKSIZE' which will set the physical blocksize to that of the underlying device, and allow to change the logical blocksize for up to the physical blocksize. Signed-off-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2017-06-08 08:40:00 -06:00
Christoph Hellwig	fe2cb2905c	loop: zero-fill bio on the submitting cpu In thruth I've just audited which blk-mq drivers don't currently have a complete callback, but I think this change is at least borderline useful. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2017-04-20 12:16:10 -06:00
Ming Lei	bc07c10a36	block: loop: support DIO & AIO There are at least 3 advantages to use direct I/O and AIO on read/write loop's backing file: 1) double cache can be avoided, then memory usage gets decreased a lot 2) not like user space direct I/O, there isn't cost of pinning pages 3) avoid context switch for obtaining good throughput - in buffered file read, random I/O top throughput is often obtained only if they are submitted concurrently from lots of tasks; but for sequential I/O, most of times they can be hit from page cache, so concurrent submissions often introduce unnecessary context switch and can't improve throughput much. There was such discussion[1] to use non-blocking I/O to improve the problem for application. - with direct I/O and AIO, concurrent submissions can be avoided and random read throughput can't be affected meantime xfstests(-g auto, ext4) is basically passed when running with direct I/O(aio), one exception is generic/232, but it failed in loop buffered I/O(4.2-rc6-next-20150814) too. Follows the fio test result for performance purpose: 4 jobs fio test inside ext4 file system over loop block 1) How to run - KVM: 4 VCPUs, 2G RAM - linux kernel: 4.2-rc6-next-20150814(base) with the patchset - the loop block is over one image on SSD. - linux psync, 4 jobs, size 1500M, ext4 over loop block - test result: IOPS from fio output 2) Throughput(IOPS) becomes a bit better with direct I/O(aio) ------------------------------------------------------------- test cases \|randread \|read \|randwrite \|write \| ------------------------------------------------------------- base \|8015 \|113811 \|67442 \|106978 ------------------------------------------------------------- base+loop aio \|8136 \|125040 \|67811 \|111376 ------------------------------------------------------------- - somehow, it should be caused by more page cache avaiable for application or one extra page copy is avoided in case of direct I/O 3) context switch - context switch decreased by ~50% with loop direct I/O(aio) compared with loop buffered I/O(4.2-rc6-next-20150814) 4) memory usage from /proc/meminfo ------------------------------------------------------------- \| Buffers \| Cached ------------------------------------------------------------- base \| > 760MB \| ~950MB ------------------------------------------------------------- base+loop direct I/O(aio) \| < 5MB \| ~1.6GB ------------------------------------------------------------- - so there are much more page caches available for application with direct I/O [1] https://lwn.net/Articles/612483/ Signed-off-by: Ming Lei <ming.lei@canonical.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-09-23 11:01:16 -06:00
Ming Lei	2e5ab5f379	block: loop: prepare for supporing direct IO This patches provides one interface for enabling direct IO from user space: - userspace(such as losetup) can pass 'file' which is opened/fcntl as O_DIRECT Also __loop_update_dio() is introduced to check if direct I/O can be used on current loop setting. The last big change is to introduce LO_FLAGS_DIRECT_IO flag for userspace to know if direct IO is used to access backing file. Cc: linux-api@vger.kernel.org Signed-off-by: Ming Lei <ming.lei@canonical.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-09-23 11:01:16 -06:00
Ming Lei	e03a3d7a94	block: loop: use kthread_work The following patch will use dio/aio to submit IO to backing file, then it needn't to schedule IO concurrently from work, so use kthread_work for decreasing context switch cost a lot. For non-AIO case, single thread has been used for long long time, and it was just converted to work in v4.0, which has caused performance regression for fedora live booting already. In discussion[1], even though submitting I/O via work concurrently can improve random read IO throughput, meantime it might hurt sequential read IO performance, so better to restore to single thread behaviour. For the following AIO support, it is better to use multi hw-queue with per-hwq kthread than current work approach suppose there is so high performance requirement for loop. [1] http://marc.info/?t=143082678400002&r=1&w=2 Signed-off-by: Ming Lei <ming.lei@canonical.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-09-23 11:01:16 -06:00
Ming Lei	f893366795	block: loop: don't hold lo_ctl_mutex in lo_open The lo_ctl_mutex is held for running all ioctl handlers, and in some ioctl handlers, ioctl_by_bdev(BLKRRPART) is called for rereading partitions, which requires bd_mutex. So it is easy to cause failure because trylock(bd_mutex) may fail inside blkdev_reread_part(), and follows the lock context: blkid or other application: ->open() ->mutex_lock(bd_mutex) ->lo_open() ->mutex_lock(lo_ctl_mutex) losetup(set fd ioctl): ->mutex_lock(lo_ctl_mutex) ->ioctl_by_bdev(BLKRRPART) ->trylock(bd_mutex) This patch trys to eliminate the ABBA lock dependency by removing lo_ctl_mutext in lo_open() with the following approach: 1) make lo_refcnt as atomic_t and avoid acquiring lo_ctl_mutex in lo_open(): - for open vs. add/del loop, no any problem because of loop_index_mutex - freeze request queue during clr_fd, so I/O can't come until clearing fd is completed, like the effect of holding lo_ctl_mutex in lo_open - both open() and release() have been serialized by bd_mutex already 2) don't hold lo_ctl_mutex for decreasing/checking lo_refcnt in lo_release(), then lo_ctl_mutex is only required for the last release. Reviewed-by: Christoph Hellwig <hch@lst.de> Tested-by: Jarod Wilson <jarod@redhat.com> Acked-by: Jarod Wilson <jarod@redhat.com> Signed-off-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-05-20 09:06:09 -06:00
Ming Lei	f4aa4c7bba	block: loop: convert to per-device workqueue Documentation/workqueue.txt: If there is dependency among multiple work items used during memory reclaim, they should be queued to separate wq each with WQ_MEM_RECLAIM. Loop devices can be stacked, so we have to convert to per-device workqueue. One example is Fedora live CD. Fixes: `b5dd2f6047` Cc: stable@vger.kernel.org (v4.0) Cc: Justin M. Forbes <jforbes@fedoraproject.org> Signed-off-by: Ming Lei <ming.lei@canonical.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-05-05 13:46:53 -06:00
Jens Axboe	78e367a360	loop: add blk-mq.h include Looks like we pull it in through other ways on x86, but we fail on sparc: In file included from drivers/block/cryptoloop.c:30:0: drivers/block/loop.h:63:24: error: field 'tag_set' has incomplete type struct blk_mq_tag_set tag_set; Add the include to loop.h, kill it from loop.c. Signed-off-by: Jens Axboe <axboe@fb.com>	2015-01-02 15:20:25 -07:00
Ming Lei	b5dd2f6047	block: loop: improve performance via blk-mq The conversion is a bit straightforward, and use work queue to dispatch requests of loop block, and one big change is that requests is submitted to backend file/device concurrently with work queue, so throughput may get improved much. Given write requests over same file are often run exclusively, so don't handle them concurrently for avoiding extra context switch cost, possible lock contention and work schedule cost. Also with blk-mq, there is opportunity to get loop I/O merged before submitting to backend file/device. In the following test: - base: v3.19-rc2-2041231 - loop over file in ext4 file system on SSD disk - bs: 4k, libaio, io depth: 64, O_DIRECT, num of jobs: 1 - throughput: IOPS ------------------------------------------------------ \| \| base \| base with loop-mq \| delta \| ------------------------------------------------------ \| randread \| 1740 \| 25318 \| +1355%\| ------------------------------------------------------ \| read \| 42196 \| 51771 \| +22.6%\| ----------------------------------------------------- \| randwrite \| 35709 \| 34624 \| -3% \| ----------------------------------------------------- \| write \| 39137 \| 40326 \| +3% \| ----------------------------------------------------- So loop-mq can improve throughput for both read and randread, meantime, performance of write and randwrite isn't hurted basically. Another benefit is that loop driver code gets simplified much after blk-mq conversion, and the patch can be thought as cleanup too. Signed-off-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-01-02 15:07:49 -07:00
Al Viro	83a8761142	move linux/loop.h to drivers/block Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2013-06-29 12:46:45 +04:00

25 Commits