OpenCloudOS-Kernel

Commit Graph

Author	SHA1	Message	Date
Gao Xiang	8241fdd3cd	erofs: clean up zmap.c Several trivial cleanups which aren't quite necessary to split: - Rename lcluster load functions as well as justify full indexes since they are typically used for global deduplication for compressed data; - Avoid unnecessary lines, comments for simplicity. No logic changes. Reviewed-by: Guo Xuenan <guoxuenan@huaweicloud.com> Reviewed-by: Yue Hu <huyue2@coolpad.com> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20230615064421.103178-1-hsiangkao@linux.alibaba.com	2023-06-22 21:16:34 +08:00
Yangtao Li	1990595547	erofs: remove unnecessary goto It's redundant, let's remove it. Signed-off-by: Yangtao Li <frank.li@vivo.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Jingbo Xu <jefflexu@linux.alibaba.com> Link: https://lore.kernel.org/r/20230615034539.14286-1-frank.li@vivo.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>	2023-06-22 21:16:34 +08:00
Sandeep Dhavale	12d0a24afd	erofs: Fix detection of atomic context Current check for atomic context is not sufficient as z_erofs_decompressqueue_endio can be called under rcu lock from blk_mq_flush_plug_list(). See the stacktrace [1] In such case we should hand off the decompression work for async processing rather than trying to do sync decompression in current context. Patch fixes the detection by checking for rcu_read_lock_any_held() and while at it use more appropriate !in_task() check than in_atomic(). Background: Historically erofs would always schedule a kworker for decompression which would incur the scheduling cost regardless of the context. But z_erofs_decompressqueue_endio() may not always be in atomic context and we could actually benefit from doing the decompression in z_erofs_decompressqueue_endio() if we are in thread context, for example when running with dm-verity. This optimization was later added in patch [2] which has shown improvement in performance benchmarks. ============================================== [1] Problem stacktrace [name:core&]BUG: sleeping function called from invalid context at kernel/locking/mutex.c:291 [name:core&]in_atomic(): 0, irqs_disabled(): 0, non_block: 0, pid: 1615, name: CpuMonitorServi [name:core&]preempt_count: 0, expected: 0 [name:core&]RCU nest depth: 1, expected: 0 CPU: 7 PID: 1615 Comm: CpuMonitorServi Tainted: G S W OE 6.1.25-android14-5-maybe-dirty-mainline #1 Hardware name: MT6897 (DT) Call trace: dump_backtrace+0x108/0x15c show_stack+0x20/0x30 dump_stack_lvl+0x6c/0x8c dump_stack+0x20/0x48 __might_resched+0x1fc/0x308 __might_sleep+0x50/0x88 mutex_lock+0x2c/0x110 z_erofs_decompress_queue+0x11c/0xc10 z_erofs_decompress_kickoff+0x110/0x1a4 z_erofs_decompressqueue_endio+0x154/0x180 bio_endio+0x1b0/0x1d8 __dm_io_complete+0x22c/0x280 clone_endio+0xe4/0x280 bio_endio+0x1b0/0x1d8 blk_update_request+0x138/0x3a4 blk_mq_plug_issue_direct+0xd4/0x19c blk_mq_flush_plug_list+0x2b0/0x354 __blk_flush_plug+0x110/0x160 blk_finish_plug+0x30/0x4c read_pages+0x2fc/0x370 page_cache_ra_unbounded+0xa4/0x23c page_cache_ra_order+0x290/0x320 do_sync_mmap_readahead+0x108/0x2c0 filemap_fault+0x19c/0x52c __do_fault+0xc4/0x114 handle_mm_fault+0x5b4/0x1168 do_page_fault+0x338/0x4b4 do_translation_fault+0x40/0x60 do_mem_abort+0x60/0xc8 el0_da+0x4c/0xe0 el0t_64_sync_handler+0xd4/0xfc el0t_64_sync+0x1a0/0x1a4 [2] Link: https://lore.kernel.org/all/20210317035448.13921-1-huangjianan@oppo.com/ Reported-by: Will Shiu <Will.Shiu@mediatek.com> Suggested-by: Gao Xiang <xiang@kernel.org> Signed-off-by: Sandeep Dhavale <dhavale@google.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Alexandre Mergnat <amergnat@baylibre.com> Link: https://lore.kernel.org/r/20230621220848.3379029-1-dhavale@google.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>	2023-06-22 21:16:02 +08:00
Jingbo Xu	f02615eb6f	erofs: use separate xattr parsers for listxattr/getxattr There's a callback styled xattr parser, i.e. xattr_foreach(), which is shared among listxattr and getxattr. Convert it to two separate xattr parsers to serve listxattr and getxattr for better readability. Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20230613074114.120115-6-jefflexu@linux.alibaba.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>	2023-06-18 12:10:54 +08:00
Jingbo Xu	4b077b5012	erofs: unify inline/shared xattr iterators for listxattr/getxattr Make inline_{list,get}xattr() as well as inline_xattr_iter_begin() unified as erofs_xattr_iter_inline(), and shared_{list,get}xattr() unified as erofs_xattr_iter_shared(). After these changes, both erofs_xattr_iter_{inline,shared}() return 0 on success, and negative error on failure. One thing worth noting is that, the logic of returning it->buffer_ofs when there's no shared xattrs in shared_listxattr() is moved to erofs_listxattr() to make the unification possible. The only difference is that, semantically the old behavior will return ENOATTR rather than it->buffer_ofs if ENOATTR encountered when listxattr is parsing upon a specific shared xattr, while now the new behavior will return it->buffer_ofs in this case. This is not an issue, as listxattr upon a specific xattr won't return ENOATTR. Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20230613074114.120115-5-jefflexu@linux.alibaba.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>	2023-06-18 12:10:54 +08:00
Jingbo Xu	5a8ffb1975	erofs: make the size of read data stored in buffer_ofs Since now xattr_iter structures have been unified, make the size of the read data stored in buffer_ofs. Don't bother reusing buffer_size for this use, which may be confusing. This is in preparation for the following further cleanup. Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20230613074114.120115-4-jefflexu@linux.alibaba.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>	2023-06-18 12:10:54 +08:00
Jingbo Xu	8e823961de	erofs: unify xattr_iter structures Unify xattr_iter/listxattr_iter/getxattr_iter structures into erofs_xattr_iter structure. This is in preparation for the following further cleanup. Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20230613074114.120115-3-jefflexu@linux.alibaba.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>	2023-06-18 12:10:53 +08:00
Jingbo Xu	eba67eb6de	erofs: use absolute position in xattr iterator Replace blkaddr/ofs with pos in 'struct erofs_xattr_iter'. After erofs_bread() is introduced to replace raw page cache APIs for metadata I/Os handling, xattr_iter_fixup() is no longer needed anymore. In addition, it is also unnecessary to check if the iterated position is span over the block boundary as absolute offset is used instead of blkaddr + offset pairs. Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20230613074114.120115-2-jefflexu@linux.alibaba.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>	2023-06-18 12:10:53 +08:00
Gao Xiang	001b8ccd06	erofs: fix compact 4B support for 16k block size In compact 4B, two adjacent lclusters are packed together as a unit to form on-disk indexes for effective random access, as below: (amortized = 4, vcnt = 2) _____________________________________________ \|___@_____ encoded bits __________\|_ blkaddr _\| 0 . amortized * vcnt = 8 . . . . amortized * vcnt - 4 = 4 . . .____________________________. \|_type (2 bits)_\|_clusterofs_\| Therefore, encoded bits for each pack are 32 bits (4 bytes). IOWs, since each lcluster can get 16 bits for its type and clusterofs, the maximum supported lclustersize for compact 4B format is 16k (14 bits). Fix this to enable compact 4B format for 16k lclusters (blocks), which is tested on an arm64 server with 16k page size. Fixes: `152a333a58` ("staging: erofs: add compacted compression indexes support") Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20230601112341.56960-1-hsiangkao@linux.alibaba.com	2023-06-18 12:10:53 +08:00
Jingbo Xu	9c39ec0cff	erofs: convert erofs_read_metabuf() to erofs_bread() for xattr buf->inode is constant once initialized during erofs_buf's lifetime. Thus call erofs_init_metabuf() and erofs_bread() separately to avoid the repetition of assigning buf->inode when iterating xattrs. Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20230601024347.108469-2-jefflexu@linux.alibaba.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>	2023-06-18 12:10:53 +08:00
Gao Xiang	43d86ec936	erofs: use poison pointer to replace the hard-coded address It's safer and cleaner to replace such hard-coded illegal pointer with poison pointers. Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Yue Hu <huyue2@coolpad.com> Link: https://lore.kernel.org/r/20230526201459.128169-7-hsiangkao@linux.alibaba.com	2023-06-18 12:10:53 +08:00
Gao Xiang	7674a42f35	erofs: use struct lockref to replace handcrafted approach Let's avoid the current handcrafted lockref although `struct lockref` inclusion usually increases extra 4 bytes with an explicit spinlock if CONFIG_DEBUG_SPINLOCK is off. Apart from the size difference, note that the meaning of refcount is also changed to active users. IOWs, it doesn't take an extra refcount for XArray tree insertion. I don't observe any significant performance difference at least on our cloud compute server but the new one indeed simplifies the overall codebase a bit. Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Yue Hu <huyue2@coolpad.com> Link: https://lore.kernel.org/r/20230529123727.79943-1-hsiangkao@linux.alibaba.com	2023-06-18 12:10:47 +08:00
Christoph Hellwig	05bdb99653	block: replace fmode_t with a block-specific type for block open flags The only overlap between the block open flags mapped into the fmode_t and other uses of fmode_t are FMODE_READ and FMODE_WRITE. Define a new blk_mode_t instead for use in blkdev_get_by_{dev,path}, ->open and ->ioctl and stop abusing fmode_t. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Jack Wang <jinpu.wang@ionos.com> [rnbd] Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Christian Brauner <brauner@kernel.org> Link: https://lore.kernel.org/r/20230608110258.189493-28-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-06-12 08:04:05 -06:00
Christoph Hellwig	2736e8eeb0	block: use the holder as indication for exclusive opens The current interface for exclusive opens is rather confusing as it requires both the FMODE_EXCL flag and a holder. Remove the need to pass FMODE_EXCL and just key off the exclusive open off a non-NULL holder. For blkdev_put this requires adding the holder argument, which provides better debug checking that only the holder actually releases the hold, but at the same time allows removing the now superfluous mode argument. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Acked-by: Christian Brauner <brauner@kernel.org> Acked-by: David Sterba <dsterba@suse.com> [btrfs] Acked-by: Jack Wang <jinpu.wang@ionos.com> [rnbd] Link: https://lore.kernel.org/r/20230608110258.189493-16-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-06-12 08:04:04 -06:00
Christoph Hellwig	0718afd47f	block: introduce holder ops Add a new blk_holder_ops structure, which is passed to blkdev_get_by_* and installed in the block_device for exclusive claims. It will be used to allow the block layer to call back into the user of the block device for thing like notification of a removed device or a device resize. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Link: https://lore.kernel.org/r/20230601094459.1350643-10-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-06-05 10:53:04 -06:00
Gao Xiang	7b4e372c36	erofs: adapt managed inode operations into folios This patch gets rid of erofs_try_to_free_cached_page() and fold it into .release_folio(). It also moves managed inode operations into zdata.c, which simplifies the code a bit. No logic changes. Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Yue Hu <huyue2@coolpad.com> Link: https://lore.kernel.org/r/20230526201459.128169-5-hsiangkao@linux.alibaba.com	2023-05-29 23:06:03 +08:00
Gao Xiang	967c28b23f	erofs: kill hooked chains to avoid loops on deduplicated compressed images After heavily stressing EROFS with several images which include a hand-crafted image of repeated patterns for more than 46 days, I found two chains could be linked with each other almost simultaneously and form a loop so that the entire loop won't be submitted. As a consequence, the corresponding file pages will remain locked forever. It can be _only_ observed on data-deduplicated compressed images. For example, consider two chains with five pclusters in total: Chain 1: 2->3->4->5 -- The tail pcluster is 5; Chain 2: 5->1->2 -- The tail pcluster is 2. Chain 2 could link to Chain 1 with pcluster 5; and Chain 1 could link to Chain 2 at the same time with pcluster 2. Since hooked chains are all linked locklessly now, I have no idea how to simply avoid the race. Instead, let's avoid hooked chains completely until I could work out a proper way to fix this and end users finally tell us that it's needed to add it back. Actually, this optimization can be found with multi-threaded workloads (especially even more often on deduplicated compressed images), yet I'm not sure about the overall system impacts of not having this compared with implementation complexity. Fixes: `267f2492c8` ("erofs: introduce multi-reference pclusters (fully-referenced)") Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Yue Hu <huyue2@coolpad.com> Link: https://lore.kernel.org/r/20230526201459.128169-4-hsiangkao@linux.alibaba.com	2023-05-29 23:05:43 +08:00
Gao Xiang	6ab5eed600	erofs: avoid on-stack pagepool directly passed by arguments On-stack pagepool is used so that short-lived temporary pages could be shared within a single I/O request (e.g. among multiple pclusters). Moving the remaining frontend-related uses into z_erofs_decompress_frontend to avoid too many arguments. Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Yue Hu <huyue2@coolpad.com> Link: https://lore.kernel.org/r/20230526201459.128169-3-hsiangkao@linux.alibaba.com	2023-05-29 23:05:25 +08:00
Gao Xiang	05b63d2beb	erofs: allocate extra bvec pages directly instead of retrying If non-bootstrap bvecs cannot be kept in place (very rarely), an extra short-lived page is allocated. Let's just allocate it immediately rather than do unnecessary -EAGAIN return first and retry as a cleanup. Also it's unnecessary to use __GFP_NOFAIL here since we could gracefully fail out this case instead. Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Yue Hu <huyue2@coolpad.com> Link: https://lore.kernel.org/r/20230526201459.128169-2-hsiangkao@linux.alibaba.com	2023-05-29 23:04:45 +08:00
Yue Hu	796e9149a2	erofs: clean up z_erofs_pcluster_readmore() `end` parameter is no needed since it's pointless for !backmost, we can handle it with backmost internally. And we only expand the trailing edge, so the newstart can be replaced with ->headoffset. Also, remove linux/prefetch.h inclusion since that is not used anymore after commit `386292919c` ("erofs: introduce readmore decompression strategy"). Signed-off-by: Yue Hu <huyue2@coolpad.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20230525072605.17857-1-zbestahu@gmail.com [ Gao Xiang: update commit description. ] Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>	2023-05-29 15:54:51 +08:00
Yue Hu	ef4b4b46c6	erofs: remove the member readahead from struct z_erofs_decompress_frontend The struct member is only used to add REQ_RAHEAD during I/O submission. So it is cleaner to pass it as a parameter than keep it in the struct. Also, rename function z_erofs_get_sync_decompress_policy() to z_erofs_is_sync_decompress() for better clarity and conciseness. Signed-off-by: Yue Hu <huyue2@coolpad.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20230524063944.1655-1-zbestahu@gmail.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>	2023-05-29 15:54:50 +08:00
Yue Hu	597e2953ae	erofs: fold in z_erofs_decompress() No need this helper since it's just a simple wrapper for decompress method and only one caller. So, let's fold in directly instead. Signed-off-by: Yue Hu <huyue2@coolpad.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20230426084449.12781-1-zbestahu@gmail.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>	2023-05-29 15:54:49 +08:00
David Howells	2cb1e08985	splice: Use filemap_splice_read() instead of generic_file_splice_read() Replace pointers to generic_file_splice_read() with calls to filemap_splice_read(). Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christian Brauner <brauner@kernel.org> cc: Jens Axboe <axboe@kernel.dk> cc: Al Viro <viro@zeniv.linux.org.uk> cc: David Hildenbrand <david@redhat.com> cc: John Hubbard <jhubbard@nvidia.com> cc: linux-mm@kvack.org cc: linux-block@vger.kernel.org cc: linux-fsdevel@vger.kernel.org Link: https://lore.kernel.org/r/20230522135018.2742245-29-dhowells@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-05-24 08:42:17 -06:00
Gao Xiang	cf7f2732b4	erofs: use HIPRI by default if per-cpu kthreads are enabled As Sandeep shown [1], high priority RT per-cpu kthreads are typically helpful for Android scenarios to minimize the scheduling latencies. Switch EROFS_FS_PCPU_KTHREAD_HIPRI on by default if EROFS_FS_PCPU_KTHREAD is on since it's the typical use cases for EROFS_FS_PCPU_KTHREAD. Also clean up unneeded sched_set_normal(). [1] https://lore.kernel.org/r/CAB=BE-SBtO6vcoyLNA9F-9VaN5R0t3o_Zn+FW8GbO6wyUqFneQ@mail.gmail.com Reviewed-by: Yue Hu <huyue2@coolpad.com> Reviewed-by: Sandeep Dhavale <dhavale@google.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20230522092141.124290-1-hsiangkao@linux.alibaba.com	2023-05-23 16:57:08 +08:00
Yue Hu	285d0f85da	erofs: avoid pcpubuf.c inclusion if CONFIG_EROFS_FS_ZIP is off The function of pcpubuf.c is just for low-latency decompression algorithms (e.g. lz4). Signed-off-by: Yue Hu <huyue2@coolpad.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/20230515095758.10391-1-zbestahu@gmail.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>	2023-05-23 16:56:40 +08:00
Jingbo Xu	0a17567b4a	erofs: fix null-ptr-deref caused by erofs_xattr_prefixes_init Fragments and dedupe share one feature bit, and thus packed inode may not exist when fragment feature bit (dedupe feature bit exactly) is set, e.g. when deduplication feature is in use while fragments feature is not. In this case, sbi->packed_inode could be NULL while fragments feature bit is set. Fix this by accessing packed inode only when it exists. Reported-by: syzbot+902d5a9373ae8f748a94@syzkaller.appspotmail.com Link: https://syzkaller.appspot.com/bug?extid=902d5a9373ae8f748a94 Reported-and-tested-by: syzbot+bbb353775d51424087f2@syzkaller.appspotmail.com Fixes: `9e38291461` ("erofs: add helpers to load long xattr name prefixes") Fixes: `6a318ccd7e` ("erofs: enable long extended attribute name prefixes") Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com> Reviewed-by: Yue Hu <huyue2@coolpad.com> Reviewed-by: Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/20230515103941.129784-1-jefflexu@linux.alibaba.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>	2023-05-23 16:56:21 +08:00
Linus Torvalds	61d325dcbc	Changes since last update: - Add sub-page block size support for uncompressed files; - Support flattened block device for multi-blob images to be attached into virtual machines (including cloud servers) and bare metals; - Support long xattr name prefixes to optimize images with common xattr namespaces (e.g. files with overlayfs xattrs) use cases; - Various minor cleanups & fixes. -----BEGIN PGP SIGNATURE----- iIcEABYIAC8WIQThPAmQN9sSA0DVxtI5NzHcH7XmBAUCZETCNREceGlhbmdAa2Vy bmVsLm9yZwAKCRA5NzHcH7XmBJCMAP9VkAPycbbqa6qWUASdyh/HGyuLJTHSfmsJ zO4y6hBgOwD9GXg55sY8ycvcOx9ayaUt5V5f9zhs4wdGcoPhj5fWzgA= =nUva -----END PGP SIGNATURE----- Merge tag 'erofs-for-6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs Pull erofs updates from Gao Xiang: "In this cycle, sub-page block support for uncompressed files is available. It's mainly used to enable original signing ('golden') 4k-block images on arm64 with 16/64k pages. In addition, end users could also use this feature to build a manifest to directly refer to golden tar data. Besides, long xattr name prefix support is also introduced in this cycle to avoid too many xattrs with the same prefix (e.g. overlayfs xattrs). It's useful for erofs + overlayfs combination (like Composefs model): the image size is reduced by ~14% and runtime performance is also slightly improved. Others are random fixes and cleanups as usual. Summary: - Add sub-page block size support for uncompressed files - Support flattened block device for multi-blob images to be attached into virtual machines (including cloud servers) and bare metals - Support long xattr name prefixes to optimize images with common xattr namespaces (e.g. files with overlayfs xattrs) use cases - Various minor cleanups & fixes" * tag 'erofs-for-6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs: erofs: cleanup i_format-related stuffs erofs: sunset erofs_dbg() erofs: fix potential overflow calculating xattr_isize erofs: get rid of z_erofs_fill_inode() erofs: enable long extended attribute name prefixes erofs: handle long xattr name prefixes properly erofs: add helpers to load long xattr name prefixes erofs: introduce on-disk format for long xattr name prefixes erofs: move packed inode out of the compression part erofs: keep meta inode into erofs_buf erofs: initialize packed inode after root inode is assigned erofs: stop parsing non-compact HEAD index if clusterofs is invalid erofs: don't warn ztailpacking feature anymore erofs: simplify erofs_xattr_generic_get() erofs: rename init_inode_xattrs with erofs_ prefix erofs: move several xattr helpers into xattr.c erofs: tidy up EROFS on-disk naming erofs: support flattened block device for multi-blob images erofs: set block size to the on-disk block size erofs: avoid hardcoded blocksize for subpage block support	2023-04-24 14:25:39 -07:00
Linus Torvalds	7bcff5a396	v6.4/vfs.acl -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZEEhwgAKCRCRxhvAZXjc otwgAQDXHnKiPm/d76lITXbxdUNCtvZz+ig26EbOrD+vEszzIQEA81dru0QbCNCt ctoZdcsmtKbt2VaYQF1CDOhlnNg5VQM= =pER1 -----END PGP SIGNATURE----- Merge tag 'v6.4/vfs.acl' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull acl updates from Christian Brauner: "After finishing the introduction of the new posix acl api last cycle the generic POSIX ACL xattr handlers are still around in the filesystems xattr handlers for two reasons: (1) Because a few filesystems rely on the ->list() method of the generic POSIX ACL xattr handlers in their ->listxattr() inode operation. (2) POSIX ACLs are only available if IOP_XATTR is raised. The IOP_XATTR flag is raised in inode_init_always() based on whether the sb->s_xattr pointer is non-NULL. IOW, the registered xattr handlers of the filesystem are used to raise IOP_XATTR. Removing the generic POSIX ACL xattr handlers from all filesystems would risk regressing filesystems that only implement POSIX ACL support and no other xattrs (nfs3 comes to mind). This contains the work to decouple POSIX ACLs from the IOP_XATTR flag as they don't depend on xattr handlers anymore. So it's now possible to remove the generic POSIX ACL xattr handlers from the sb->s_xattr list of all filesystems. This is a crucial step as the generic POSIX ACL xattr handlers aren't used for POSIX ACLs anymore and POSIX ACLs don't depend on the xattr infrastructure anymore. Adressing problem (1) will require more long-term work. It would be best to get rid of the ->list() method of xattr handlers completely at some point. For erofs, ext{2,4}, f2fs, jffs2, ocfs2, and reiserfs the nop POSIX ACL xattr handler is kept around so they can continue to use array-based xattr handler indexing. This update does simplify the ->listxattr() implementation of all these filesystems however" * tag 'v6.4/vfs.acl' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: acl: don't depend on IOP_XATTR ovl: check for ->listxattr() support reiserfs: rework priv inode handling fs: rename generic posix acl handlers reiserfs: rework ->listxattr() implementation fs: simplify ->listxattr() implementation fs: drop unused posix acl handlers xattr: remove unused argument xattr: add listxattr helper xattr: simplify listxattr helpers	2023-04-24 13:35:23 -07:00
Gao Xiang	745ed7d778	erofs: cleanup i_format-related stuffs Switch EROFS_I_{VERSION,DATALAYOUT}_BITS into EROFS_I_{VERSION,DATALAYOUT}_MASK. Also avoid erofs_bitrange() since its functionality is simple enough. Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/20230414083027.12307-2-hsiangkao@linux.alibaba.com	2023-04-17 01:15:55 +08:00
Gao Xiang	10656f9ca6	erofs: sunset erofs_dbg() Such debug messages are rarely used now. Let's get rid of these, and revert locally if they are needed for debugging. Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/20230414083027.12307-1-hsiangkao@linux.alibaba.com	2023-04-17 01:15:54 +08:00
Jingbo Xu	1b3567a196	erofs: fix potential overflow calculating xattr_isize Given on-disk i_xattr_icount is 16 bits and xattr_isize is calculated from i_xattr_icount multiplying 4, xattr_isize has a theoretical maximum of 256K (64K * 4). Thus declare xattr_isize as unsigned int to avoid the potential overflow. Fixes: `bfb8674dc0` ("staging: erofs: add erofs in-memory stuffs") Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/20230414061810.6479-1-jefflexu@linux.alibaba.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>	2023-04-17 01:15:54 +08:00
Gao Xiang	4fdadd5b0f	erofs: get rid of z_erofs_fill_inode() Prior to big pclusters, non-compact compression indexes could have empty headers. Let's just avoid the legacy path since it can be handled properly as a specific compression header with z_erofs_fill_inode_lazy() too. Tested with erofs-utils exist versions. Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Yue Hu <huyue2@coolpad.com> Reviewed-by: Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/20230413092241.73829-1-hsiangkao@linux.alibaba.com	2023-04-17 01:15:53 +08:00
Jingbo Xu	6a318ccd7e	erofs: enable long extended attribute name prefixes Let's enable long xattr name prefix feature. Old kernels will just ignore / skip such extended attributes. In addition, in case you don't want to mount such images, add another incompatible feature as an option for this. Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Acked-by: Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/20230407222808.19670-1-jefflexu@linux.alibaba.com [ Gao Xiang: minor commit message fix. ] Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>	2023-04-17 01:15:53 +08:00
Jingbo Xu	82bc1ef41d	erofs: handle long xattr name prefixes properly Make .{list,get}xattr routines adapted to long xattr name prefixes. When the bit 7 of erofs_xattr_entry.e_name_index is set, it indicates that it refers to a long xattr name prefix. Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Acked-by: Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/20230411093537.127286-1-jefflexu@linux.alibaba.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>	2023-04-17 01:15:52 +08:00
Jingbo Xu	9e38291461	erofs: add helpers to load long xattr name prefixes Long xattr name prefixes will be scanned upon mounting and the in-memory long xattr name prefix array will be initialized accordingly. Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Acked-by: Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/20230407141710.113882-6-jefflexu@linux.alibaba.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>	2023-04-17 01:15:52 +08:00
Jingbo Xu	b3bfcb9dbf	erofs: introduce on-disk format for long xattr name prefixes Besides the predefined xattr name prefixes, introduces long xattr name prefixes, which work similarly as the predefined name prefixes, except that they are user specified. It is especially useful for use cases together with overlayfs like Composefs model, which introduces diverse xattr values with only a few common xattr names (trusted.overlay.redirect, trusted.overlay.digest, and maybe more in the future). That makes the existing predefined prefixes ineffective in both image size and runtime performance. When a user specified long xattr name prefix is used, only the trailing part of the xattr name apart from the long xattr name prefix will be stored in erofs_xattr_entry.e_name. e_name is empty if the xattr name matches exactly as the long xattr name prefix. All long xattr prefixes are stored in the packed or meta inode, which depends if fragments feature is enabled or not. For each long xattr name prefix, the on-disk format is kept as the same as the unique metadata format: ALIGN({__le16 len, data}, 4), where len represents the total size of struct erofs_xattr_long_prefix, followed by data of struct erofs_xattr_long_prefix itself. Each erofs_xattr_long_prefix keeps predefined prefixes (base_index) and the remaining prefix string without the trailing '\0'. Two fields are introduced to the on-disk superblock, where xattr_prefix_count represents the total number of the long xattr name prefixes recorded, and xattr_prefix_start represents the start offset of recorded name prefixes in the packed/meta inode divided by 4. When referring to a long xattr name prefix, the highest bit (bit 7) of erofs_xattr_entry.e_name_index is set, while the lower bits (bit 0-6) as a whole represents the index of the referred long name prefix among all long xattr name prefixes. Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Acked-by: Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/20230407141710.113882-5-jefflexu@linux.alibaba.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>	2023-04-17 01:15:51 +08:00
Jingbo Xu	a97a218b08	erofs: move packed inode out of the compression part packed inode could be used in more scenarios which are independent of compression in the future. For example, packed inode could be used to keep extra long xattr prefixes with the help of following patches. Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com> Reviewed-by: Yue Hu <huyue2@coolpad.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Acked-by: Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/20230407141710.113882-4-jefflexu@linux.alibaba.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>	2023-04-17 01:15:51 +08:00
Gao Xiang	eb2c5e41be	erofs: keep meta inode into erofs_buf So that erofs_read_metadata() can read metadata from other inodes (e.g. packed inode) as well. Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com> Acked-by: Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/20230407141710.113882-2-jefflexu@linux.alibaba.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>	2023-04-17 01:15:50 +08:00
Jingbo Xu	cb9bce7951	erofs: initialize packed inode after root inode is assigned As commit `8f7acdae2c` ("staging: erofs: kill all failure handling in fill_super()"), move the initialization of packed inode after root inode is assigned, so that the iput() in .put_super() is adequate as the failure handling. Otherwise, iput() is also needed in .kill_sb(), in case of the mounting fails halfway. Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com> Reviewed-by: Yue Hu <huyue2@coolpad.com> Fixes: `b15b2e307c` ("erofs: support on-disk compressed fragments data") Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Acked-by: Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/20230407141710.113882-3-jefflexu@linux.alibaba.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>	2023-04-17 01:15:49 +08:00
Gao Xiang	cc4efd3dd2	erofs: stop parsing non-compact HEAD index if clusterofs is invalid Syzbot generated a crafted image [1] with a non-compact HEAD index of clusterofs 33024 while valid numbers should be 0 ~ lclustersize-1, which causes the following unexpected behavior as below: BUG: unable to handle page fault for address: fffff52101a3fff9 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 23ffed067 P4D 23ffed067 PUD 0 Oops: 0000 [#1] PREEMPT SMP KASAN CPU: 1 PID: 4398 Comm: kworker/u5:1 Not tainted 6.3.0-rc6-syzkaller-g09a9639e56c0 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 03/30/2023 Workqueue: erofs_worker z_erofs_decompressqueue_work RIP: 0010:z_erofs_decompress_queue+0xb7e/0x2b40 ... Call Trace: <TASK> z_erofs_decompressqueue_work+0x99/0xe0 process_one_work+0x8f6/0x1170 worker_thread+0xa63/0x1210 kthread+0x270/0x300 ret_from_fork+0x1f/0x30 Note that normal images or images using compact indexes are not impacted. Let's fix this now. [1] https://lore.kernel.org/r/000000000000ec75b005ee97fbaa@google.com Reported-and-tested-by: syzbot+aafb3f37cfeb6534c4ac@syzkaller.appspotmail.com Fixes: `02827e1796` ("staging: erofs: add erofs_map_blocks_iter") Fixes: `152a333a58` ("staging: erofs: add compacted compression indexes support") Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/20230410173714.104604-1-hsiangkao@linux.alibaba.com	2023-04-17 01:15:49 +08:00
Yue Hu	12c2987e89	erofs: don't warn ztailpacking feature anymore The ztailpacking feature has been merged for a year, it has been mostly stable now. Signed-off-by: Yue Hu <huyue2@coolpad.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/20230227084457.3510-1-zbestahu@gmail.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>	2023-04-17 01:15:48 +08:00
Jingbo Xu	b05f30bcf7	erofs: simplify erofs_xattr_generic_get() erofs_xattr_generic_get() won't be called from xattr handlers other than user/trusted/security xattr handler, and thus there's no need of extra checking. Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com> Link: https://lore.kernel.org/r/20230330082910.125374-4-jefflexu@linux.alibaba.com Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Yue Hu <huyue2@coolpad.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>	2023-04-17 01:15:48 +08:00
Jingbo Xu	6020ffe76d	erofs: rename init_inode_xattrs with erofs_ prefix Rename init_inode_xattrs() to erofs_init_inode_xattrs() without logic change. Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20230330082910.125374-3-jefflexu@linux.alibaba.com Reviewed-by: Yue Hu <huyue2@coolpad.com> Acked-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>	2023-04-17 01:15:47 +08:00
Jingbo Xu	399f36d8c6	erofs: move several xattr helpers into xattr.c Move xattrblock_addr() and xattrblock_offset() helpers into xattr.c, as they are not used outside of xattr.c. inlinexattr_header_size() has only one caller, and thus make it inlined into the caller directly. Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20230330082910.125374-2-jefflexu@linux.alibaba.com Reviewed-by: Yue Hu <huyue2@coolpad.com> Acked-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>	2023-04-17 01:15:47 +08:00
Gao Xiang	1c7f49a767	erofs: tidy up EROFS on-disk naming - Get rid of all "vle" (variable-length extents) expressions since they only expand overall name lengths unnecessarily; - Rename COMPRESSION_LEGACY to COMPRESSED_FULL; - Move on-disk directory definitions ahead of compression; - Drop unused extended attribute definitions; - Move inode ondisk union `i_u` out as `union erofs_inode_i_u`. No actual logical change. Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Yue Hu <huyue2@coolpad.com> Reviewed-by: Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/20230331063149.25611-1-hsiangkao@linux.alibaba.com	2023-04-17 01:15:46 +08:00
Jia Zhu	8b465fecc3	erofs: support flattened block device for multi-blob images In order to support mounting multi-blobs container image as a single block device, add flattened block device feature for EROFS. In this mode, all meta/data contents will be mapped into one block space. User could compose a block device(by nbd/ublk/virtio-blk/ vhost-user-blk) from multiple sources and mount the block device by EROFS directly. It can reduce the number of block devices used, and it's also benefits in both VM file passthrough and distributed storage scenarios. You can test this using the method mentioned by: https://github.com/dragonflyoss/image-service/pull/1139 1. Compose a (nbd)block device from multi-blobs. 2. Mount EROFS on mntdir/. 3. Compare the md5sum between source dir and mntdir/. Later, we could also use it to refer original tar blobs. Signed-off-by: Jia Zhu <zhujia.zj@bytedance.com> Signed-off-by: Xin Yin <yinxin.x@bytedance.com> Reviewed-by: Jingbo Xu <jefflexu@linux.alibaba.com> Acked-by: Chao Yu <chao@kernel.org> Tested-by: Jiang Liu <gerry@linux.alibaba.com> Link: https://lore.kernel.org/r/20230302071751.48425-1-zhujia.zj@bytedance.com [ Gao Xiang: refine commit message and use erofs_pos(). ] Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>	2023-04-17 01:15:46 +08:00
Jingbo Xu	d3c4bdcc75	erofs: set block size to the on-disk block size Set the block size to that specified in on-disk superblock. Also remove the hard constraint of PAGE_SIZE block size for the uncompressed device backend. This constraint is temporarily remained for compressed device and fscache backend, as there is more work needed to handle the condition where the block size is not equal to PAGE_SIZE. It is worth noting that the on-disk block size is read prior to erofs_superblock_csum_verify(), as the read block size is needed in the latter. Besides, later we are going to make erofs refer to tar data blobs (which is 512-byte aligned) for OCI containers, where the block size is 512 bytes. In this case, the 512-byte block size may not be adequate for a directory to contain enough dirents. To fix this, we are also going to introduce directory block size independent on the block size. Due to we have already supported block size smaller than PAGE_SIZE now, disable all these images with such separated directory block size until we supported this feature later. Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Yue Hu <huyue2@coolpad.com> Reviewed-by: Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/20230313135309.75269-3-jefflexu@linux.alibaba.com [ Gao Xiang: update documentation. ] Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>	2023-04-17 01:15:45 +08:00
Jingbo Xu	3acea5fc33	erofs: avoid hardcoded blocksize for subpage block support As the first step of converting hardcoded blocksize to that specified in on-disk superblock, convert all call sites of hardcoded blocksize to sb->s_blocksize except for: 1) use sbi->blkszbits instead of sb->s_blocksize in erofs_superblock_csum_verify() since sb->s_blocksize has not been updated with the on-disk blocksize yet when the function is called. 2) use inode->i_blkbits instead of sb->s_blocksize in erofs_bread(), since the inode operated on may be an anonymous inode in fscache mode. Currently the anonymous inode is allocated from an anonymous mount maintained in erofs, while in the near future we may allocate anonymous inodes from a generic API directly and thus have no access to the anonymous inode's i_sb. Thus we keep the block size in i_blkbits for anonymous inodes in fscache mode. Be noted that this patch only gets rid of the hardcoded blocksize, in preparation for actually setting the on-disk block size in the following patch. The hard limit of constraining the block size to PAGE_SIZE still exists until the next patch. Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Yue Hu <huyue2@coolpad.com> Reviewed-by: Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/20230313135309.75269-2-jefflexu@linux.alibaba.com [ Gao Xiang: fold a patch to fix incorrect truncated offsets. ] Link: https://lore.kernel.org/r/20230413035734.15457-1-zhujia.zj@bytedance.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>	2023-04-17 01:15:44 +08:00
Yue Hu	3993f4f456	erofs: use wrapper i_blocksize() in erofs_file_read_iter() linux/fs.h has a wrapper for this operation. Signed-off-by: Yue Hu <huyue2@coolpad.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/20230306075527.1338-1-zbestahu@gmail.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>	2023-03-09 23:36:04 +08:00
Gao Xiang	9ff471800b	erofs: get rid of a useless DBG_BUGON `err` could be -EINTR and it should not be the case. Actually such DBG_BUGON is useless. Reviewed-by: Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/20230309053148.9223-2-hsiangkao@linux.alibaba.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>	2023-03-09 23:36:01 +08:00
Gao Xiang	647dd2c3f0	erofs: Revert "erofs: fix kvcalloc() misuse with __GFP_NOFAIL" Let's revert commit `12724ba389` ("erofs: fix kvcalloc() misuse with __GFP_NOFAIL") since kvmalloc() already supports __GFP_NOFAIL in commit `a421ef3030` ("mm: allow !GFP_KERNEL allocations for kvmalloc"). So the original fix was wrong. Actually there was some issue as [1] discussed, so before that mm fix is landed, the warn could still happen but applying this commit first will cause less. [1] https://lore.kernel.org/r/20230305053035.1911-1-hsiangkao@linux.alibaba.com Fixes: `12724ba389` ("erofs: fix kvcalloc() misuse with __GFP_NOFAIL") Reviewed-by: Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/20230309053148.9223-1-hsiangkao@linux.alibaba.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>	2023-03-09 23:36:01 +08:00
Gao Xiang	8f121dfb15	erofs: fix wrong kunmap when using LZMA on HIGHMEM platforms As the call trace shown, the root cause is kunmap incorrect pages: BUG: kernel NULL pointer dereference, address: 00000000 CPU: 1 PID: 40 Comm: kworker/u5:0 Not tainted 6.2.0-rc5 #4 Workqueue: erofs_worker z_erofs_decompressqueue_work EIP: z_erofs_lzma_decompress+0x34b/0x8ac z_erofs_decompress+0x12/0x14 z_erofs_decompress_queue+0x7e7/0xb1c z_erofs_decompressqueue_work+0x32/0x60 process_one_work+0x24b/0x4d8 ? process_one_work+0x1a4/0x4d8 worker_thread+0x14c/0x3fc kthread+0xe6/0x10c ? rescuer_thread+0x358/0x358 ? kthread_complete_and_exit+0x18/0x18 ret_from_fork+0x1c/0x28 ---[ end trace 0000000000000000 ]--- The bug is trivial and should be fixed now. It has no impact on !HIGHMEM platforms. Fixes: `622ceaddb7` ("erofs: lzma compression support") Cc: <stable@vger.kernel.org> # 5.16+ Reviewed-by: Yue Hu <huyue2@coolpad.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20230305134455.88236-1-hsiangkao@linux.alibaba.com	2023-03-09 22:49:57 +08:00
Yangtao Li	a279adedbb	erofs: mark z_erofs_lzma_init/erofs_pcpubuf_init w/ __init They are used during the erofs module init phase. Let's mark it as __init like any other function. Signed-off-by: Yangtao Li <frank.li@vivo.com> Reviewed-by: Yue Hu <huyue2@coolpad.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/20230303063731.66760-1-frank.li@vivo.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>	2023-03-09 22:49:30 +08:00
Christian Brauner	d549b74174	fs: rename generic posix acl handlers Reflect in their naming and document that they are kept around for legacy reasons and shouldn't be used anymore by new code. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>	2023-03-06 09:57:13 +01:00
Christian Brauner	a5488f2983	fs: simplify ->listxattr() implementation The ext{2,4}, erofs, f2fs, and jffs2 filesystems use the same logic to check whether a given xattr can be listed. Simplify them and avoid open-coding the same check by calling the helper we introduced earlier. Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: linux-f2fs-devel@lists.sourceforge.net Cc: linux-erofs@lists.ozlabs.org Cc: linux-ext4@vger.kernel.org Cc: linux-mtd@lists.infradead.org Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>	2023-03-06 09:57:12 +01:00
Christian Brauner	0c95c025a0	fs: drop unused posix acl handlers Remove struct posix_acl_{access,default}_handler for all filesystems that don't depend on the xattr handler in their inode->i_op->listxattr() method in any way. There's nothing more to do than to simply remove the handler. It's been effectively unused ever since we introduced the new posix acl api. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>	2023-03-06 09:57:12 +01:00
Linus Torvalds	3822a7c409	- Daniel Verkamp has contributed a memfd series ("mm/memfd: add F_SEAL_EXEC") which permits the setting of the memfd execute bit at memfd creation time, with the option of sealing the state of the X bit. - Peter Xu adds a patch series ("mm/hugetlb: Make huge_pte_offset() thread-safe for pmd unshare") which addresses a rare race condition related to PMD unsharing. - Several folioification patch serieses from Matthew Wilcox, Vishal Moola, Sidhartha Kumar and Lorenzo Stoakes - Johannes Weiner has a series ("mm: push down lock_page_memcg()") which does perform some memcg maintenance and cleanup work. - SeongJae Park has added DAMOS filtering to DAMON, with the series "mm/damon/core: implement damos filter". These filters provide users with finer-grained control over DAMOS's actions. SeongJae has also done some DAMON cleanup work. - Kairui Song adds a series ("Clean up and fixes for swap"). - Vernon Yang contributed the series "Clean up and refinement for maple tree". - Yu Zhao has contributed the "mm: multi-gen LRU: memcg LRU" series. It adds to MGLRU an LRU of memcgs, to improve the scalability of global reclaim. - David Hildenbrand has added some userfaultfd cleanup work in the series "mm: uffd-wp + change_protection() cleanups". - Christoph Hellwig has removed the generic_writepages() library function in the series "remove generic_writepages". - Baolin Wang has performed some maintenance on the compaction code in his series "Some small improvements for compaction". - Sidhartha Kumar is doing some maintenance work on struct page in his series "Get rid of tail page fields". - David Hildenbrand contributed some cleanup, bugfixing and generalization of pte management and of pte debugging in his series "mm: support __HAVE_ARCH_PTE_SWP_EXCLUSIVE on all architectures with swap PTEs". - Mel Gorman and Neil Brown have removed the __GFP_ATOMIC allocation flag in the series "Discard __GFP_ATOMIC". - Sergey Senozhatsky has improved zsmalloc's memory utilization with his series "zsmalloc: make zspage chain size configurable". - Joey Gouly has added prctl() support for prohibiting the creation of writeable+executable mappings. The previous BPF-based approach had shortcomings. See "mm: In-kernel support for memory-deny-write-execute (MDWE)". - Waiman Long did some kmemleak cleanup and bugfixing in the series "mm/kmemleak: Simplify kmemleak_cond_resched() & fix UAF". - T.J. Alumbaugh has contributed some MGLRU cleanup work in his series "mm: multi-gen LRU: improve". - Jiaqi Yan has provided some enhancements to our memory error statistics reporting, mainly by presenting the statistics on a per-node basis. See the series "Introduce per NUMA node memory error statistics". - Mel Gorman has a second and hopefully final shot at fixing a CPU-hog regression in compaction via his series "Fix excessive CPU usage during compaction". - Christoph Hellwig does some vmalloc maintenance work in the series "cleanup vfree and vunmap". - Christoph Hellwig has removed block_device_operations.rw_page() in ths series "remove ->rw_page". - We get some maple_tree improvements and cleanups in Liam Howlett's series "VMA tree type safety and remove __vma_adjust()". - Suren Baghdasaryan has done some work on the maintainability of our vm_flags handling in the series "introduce vm_flags modifier functions". - Some pagemap cleanup and generalization work in Mike Rapoport's series "mm, arch: add generic implementation of pfn_valid() for FLATMEM" and "fixups for generic implementation of pfn_valid()" - Baoquan He has done some work to make /proc/vmallocinfo and /proc/kcore better represent the real state of things in his series "mm/vmalloc.c: allow vread() to read out vm_map_ram areas". - Jason Gunthorpe rationalized the GUP system's interface to the rest of the kernel in the series "Simplify the external interface for GUP". - SeongJae Park wishes to migrate people from DAMON's debugfs interface over to its sysfs interface. To support this, we'll temporarily be printing warnings when people use the debugfs interface. See the series "mm/damon: deprecate DAMON debugfs interface". - Andrey Konovalov provided the accurately named "lib/stackdepot: fixes and clean-ups" series. - Huang Ying has provided a dramatic reduction in migration's TLB flush IPI rates with the series "migrate_pages(): batch TLB flushing". - Arnd Bergmann has some objtool fixups in "objtool warning fixes". -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCY/PoPQAKCRDdBJ7gKXxA jlvpAPsFECUBBl20qSue2zCYWnHC7Yk4q9ytTkPB/MMDrFEN9wD/SNKEm2UoK6/K DmxHkn0LAitGgJRS/W9w81yrgig9tAQ= =MlGs -----END PGP SIGNATURE----- Merge tag 'mm-stable-2023-02-20-13-37' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - Daniel Verkamp has contributed a memfd series ("mm/memfd: add F_SEAL_EXEC") which permits the setting of the memfd execute bit at memfd creation time, with the option of sealing the state of the X bit. - Peter Xu adds a patch series ("mm/hugetlb: Make huge_pte_offset() thread-safe for pmd unshare") which addresses a rare race condition related to PMD unsharing. - Several folioification patch serieses from Matthew Wilcox, Vishal Moola, Sidhartha Kumar and Lorenzo Stoakes - Johannes Weiner has a series ("mm: push down lock_page_memcg()") which does perform some memcg maintenance and cleanup work. - SeongJae Park has added DAMOS filtering to DAMON, with the series "mm/damon/core: implement damos filter". These filters provide users with finer-grained control over DAMOS's actions. SeongJae has also done some DAMON cleanup work. - Kairui Song adds a series ("Clean up and fixes for swap"). - Vernon Yang contributed the series "Clean up and refinement for maple tree". - Yu Zhao has contributed the "mm: multi-gen LRU: memcg LRU" series. It adds to MGLRU an LRU of memcgs, to improve the scalability of global reclaim. - David Hildenbrand has added some userfaultfd cleanup work in the series "mm: uffd-wp + change_protection() cleanups". - Christoph Hellwig has removed the generic_writepages() library function in the series "remove generic_writepages". - Baolin Wang has performed some maintenance on the compaction code in his series "Some small improvements for compaction". - Sidhartha Kumar is doing some maintenance work on struct page in his series "Get rid of tail page fields". - David Hildenbrand contributed some cleanup, bugfixing and generalization of pte management and of pte debugging in his series "mm: support __HAVE_ARCH_PTE_SWP_EXCLUSIVE on all architectures with swap PTEs". - Mel Gorman and Neil Brown have removed the __GFP_ATOMIC allocation flag in the series "Discard __GFP_ATOMIC". - Sergey Senozhatsky has improved zsmalloc's memory utilization with his series "zsmalloc: make zspage chain size configurable". - Joey Gouly has added prctl() support for prohibiting the creation of writeable+executable mappings. The previous BPF-based approach had shortcomings. See "mm: In-kernel support for memory-deny-write-execute (MDWE)". - Waiman Long did some kmemleak cleanup and bugfixing in the series "mm/kmemleak: Simplify kmemleak_cond_resched() & fix UAF". - T.J. Alumbaugh has contributed some MGLRU cleanup work in his series "mm: multi-gen LRU: improve". - Jiaqi Yan has provided some enhancements to our memory error statistics reporting, mainly by presenting the statistics on a per-node basis. See the series "Introduce per NUMA node memory error statistics". - Mel Gorman has a second and hopefully final shot at fixing a CPU-hog regression in compaction via his series "Fix excessive CPU usage during compaction". - Christoph Hellwig does some vmalloc maintenance work in the series "cleanup vfree and vunmap". - Christoph Hellwig has removed block_device_operations.rw_page() in ths series "remove ->rw_page". - We get some maple_tree improvements and cleanups in Liam Howlett's series "VMA tree type safety and remove __vma_adjust()". - Suren Baghdasaryan has done some work on the maintainability of our vm_flags handling in the series "introduce vm_flags modifier functions". - Some pagemap cleanup and generalization work in Mike Rapoport's series "mm, arch: add generic implementation of pfn_valid() for FLATMEM" and "fixups for generic implementation of pfn_valid()" - Baoquan He has done some work to make /proc/vmallocinfo and /proc/kcore better represent the real state of things in his series "mm/vmalloc.c: allow vread() to read out vm_map_ram areas". - Jason Gunthorpe rationalized the GUP system's interface to the rest of the kernel in the series "Simplify the external interface for GUP". - SeongJae Park wishes to migrate people from DAMON's debugfs interface over to its sysfs interface. To support this, we'll temporarily be printing warnings when people use the debugfs interface. See the series "mm/damon: deprecate DAMON debugfs interface". - Andrey Konovalov provided the accurately named "lib/stackdepot: fixes and clean-ups" series. - Huang Ying has provided a dramatic reduction in migration's TLB flush IPI rates with the series "migrate_pages(): batch TLB flushing". - Arnd Bergmann has some objtool fixups in "objtool warning fixes". * tag 'mm-stable-2023-02-20-13-37' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (505 commits) include/linux/migrate.h: remove unneeded externs mm/memory_hotplug: cleanup return value handing in do_migrate_range() mm/uffd: fix comment in handling pte markers mm: change to return bool for isolate_movable_page() mm: hugetlb: change to return bool for isolate_hugetlb() mm: change to return bool for isolate_lru_page() mm: change to return bool for folio_isolate_lru() objtool: add UACCESS exceptions for __tsan_volatile_read/write kmsan: disable ftrace in kmsan core code kasan: mark addr_has_metadata __always_inline mm: memcontrol: rename memcg_kmem_enabled() sh: initialize max_mapnr m68k/nommu: add missing definition of ARCH_PFN_OFFSET mm: percpu: fix incorrect size in pcpu_obj_full_size() maple_tree: reduce stack usage with gcc-9 and earlier mm: page_alloc: call panic() when memoryless node allocation fails mm: multi-gen LRU: avoid futile retries migrate_pages: move THP/hugetlb migration support check to simplify code migrate_pages: batch flushing TLB migrate_pages: share more code between _unmap and _move ...	2023-02-23 17:09:35 -08:00
Linus Torvalds	dc483c851f	Changes since last update: - Add per-cpu kthreads for low-latency decompression for Android use cases; - Get rid of tagged pointer helpers since they are rarely used now; - Several code cleanups to reduce codebase; - Documentation and MAINTAINERS updates. -----BEGIN PGP SIGNATURE----- iIcEABYIAC8WIQThPAmQN9sSA0DVxtI5NzHcH7XmBAUCY/IDjhEceGlhbmdAa2Vy bmVsLm9yZwAKCRA5NzHcH7XmBNbTAQDT2njll/B2JSYbVC2I2HYTZSyFXEaHhH+M 6gHRbEhTWAD/VNiAcdE600IkUwut/78tDvwlz/XJSd2JQMMwkTSviwc= =oroQ -----END PGP SIGNATURE----- Merge tag 'erofs-for-6.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs Pull erofs updates from Gao Xiang: "The most noticeable feature for this cycle is per-CPU kthread decompression since Android use cases need low-latency I/O handling in order to ensure the app runtime performance, currently unbounded workqueue latencies are not quite good for production on many aarch64 hardwares and thus we need to introduce a deterministic expectation for these. Decompression is CPU-intensive and it is sleepable for EROFS, so other alternatives like decompression under softirq contexts are not considered. More details are in the corresponding commit message. Others are random cleanups around the whole codebase and we will continue to clean up further in the next few months. Due to Lunar New Year holidays, some other new features were not completely reviewed and solidified as expected and we may delay them into the next version. Summary: - Add per-cpu kthreads for low-latency decompression for Android use cases - Get rid of tagged pointer helpers since they are rarely used now - Several code cleanups to reduce codebase - Documentation and MAINTAINERS updates" * tag 'erofs-for-6.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs: (21 commits) erofs: fix an error code in z_erofs_init_zip_subsystem() erofs: unify anonymous inodes for blob erofs: relinquish volume with mutex held erofs: maintain cookies of share domain in self-contained list erofs: remove unused device mapping in meta routine MAINTAINERS: erofs: Add Documentation/ABI/testing/sysfs-fs-erofs Documentation/ABI: sysfs-fs-erofs: update supported features erofs: remove unused EROFS_GET_BLOCKS_RAW flag erofs: update print symbols for various flags in trace erofs: make kobj_type structures constant erofs: add per-cpu threads for decompression as an option erofs: tidy up internal.h erofs: get rid of z_erofs_do_map_blocks() forward declaration erofs: move zdata.h into zdata.c erofs: remove tagged pointer helpers erofs: avoid tagged pointers to mark sync decompression erofs: get rid of erofs_inode_datablocks() erofs: simplify iloc() erofs: get rid of debug_one_dentry() erofs: remove linux/buffer_head.h dependency ...	2023-02-20 12:23:40 -08:00
Linus Torvalds	05e6295f7b	fs.idmapped.v6.3 -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCY+5NlQAKCRCRxhvAZXjc orOaAP9i2h3OJy95nO2Fpde0Bt2UT+oulKCCcGlvXJ8/+TQpyQD/ZQq47gFQ0EAz Br5NxeyGeecAb0lHpFz+CpLGsxMrMwQ= =+BG5 -----END PGP SIGNATURE----- Merge tag 'fs.idmapped.v6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping Pull vfs idmapping updates from Christian Brauner: - Last cycle we introduced the dedicated struct mnt_idmap type for mount idmapping and the required infrastucture in `256c8aed2b` ("fs: introduce dedicated idmap type for mounts"). As promised in last cycle's pull request message this converts everything to rely on struct mnt_idmap. Currently we still pass around the plain namespace that was attached to a mount. This is in general pretty convenient but it makes it easy to conflate namespaces that are relevant on the filesystem with namespaces that are relevant on the mount level. Especially for non-vfs developers without detailed knowledge in this area this was a potential source for bugs. This finishes the conversion. Instead of passing the plain namespace around this updates all places that currently take a pointer to a mnt_userns with a pointer to struct mnt_idmap. Now that the conversion is done all helpers down to the really low-level helpers only accept a struct mnt_idmap argument instead of two namespace arguments. Conflating mount and other idmappings will now cause the compiler to complain loudly thus eliminating the possibility of any bugs. This makes it impossible for filesystem developers to mix up mount and filesystem idmappings as they are two distinct types and require distinct helpers that cannot be used interchangeably. Everything associated with struct mnt_idmap is moved into a single separate file. With that change no code can poke around in struct mnt_idmap. It can only be interacted with through dedicated helpers. That means all filesystems are and all of the vfs is completely oblivious to the actual implementation of idmappings. We are now also able to extend struct mnt_idmap as we see fit. For example, we can decouple it completely from namespaces for users that don't require or don't want to use them at all. We can also extend the concept of idmappings so we can cover filesystem specific requirements. In combination with the vfs{g,u}id_t work we finished in v6.2 this makes this feature substantially more robust and thus difficult to implement wrong by a given filesystem and also protects the vfs. - Enable idmapped mounts for tmpfs and fulfill a longstanding request. A long-standing request from users had been to make it possible to create idmapped mounts for tmpfs. For example, to share the host's tmpfs mount between multiple sandboxes. This is a prerequisite for some advanced Kubernetes cases. Systemd also has a range of use-cases to increase service isolation. And there are more users of this. However, with all of the other work going on this was way down on the priority list but luckily someone other than ourselves picked this up. As usual the patch is tiny as all the infrastructure work had been done multiple kernel releases ago. In addition to all the tests that we already have I requested that Rodrigo add a dedicated tmpfs testsuite for idmapped mounts to xfstests. It is to be included into xfstests during the v6.3 development cycle. This should add a slew of additional tests. * tag 'fs.idmapped.v6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping: (26 commits) shmem: support idmapped mounts for tmpfs fs: move mnt_idmap fs: port vfs{g,u}id helpers to mnt_idmap fs: port fs{g,u}id helpers to mnt_idmap fs: port i_{g,u}id_into_vfs{g,u}id() to mnt_idmap fs: port i_{g,u}id_{needs_}update() to mnt_idmap quota: port to mnt_idmap fs: port privilege checking helpers to mnt_idmap fs: port inode_owner_or_capable() to mnt_idmap fs: port inode_init_owner() to mnt_idmap fs: port acl to mnt_idmap fs: port xattr to mnt_idmap fs: port ->permission() to pass mnt_idmap fs: port ->fileattr_set() to pass mnt_idmap fs: port ->set_acl() to pass mnt_idmap fs: port ->get_acl() to pass mnt_idmap fs: port ->tmpfile() to pass mnt_idmap fs: port ->rename() to pass mnt_idmap fs: port ->mknod() to pass mnt_idmap fs: port ->mkdir() to pass mnt_idmap ...	2023-02-20 11:53:11 -08:00
Dan Carpenter	8d1b80a794	erofs: fix an error code in z_erofs_init_zip_subsystem() Return -ENOMEM if alloc_workqueue() fails. Don't return success. Fixes: d8a650adf429 ("erofs: add per-cpu threads for decompression as an option") Signed-off-by: Dan Carpenter <error27@gmail.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/Y+4d0FRsUq8jPoOu@kili Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>	2023-02-16 22:51:53 +08:00
Jingbo Xu	61fef98945	erofs: unify anonymous inodes for blob Currently there're two anonymous inodes (inode and anon_inode in struct erofs_fscache) for each blob. The former was introduced as the address_space of page cache for bootstrap. The latter was initially introduced as both the address_space of page cache and also a sentinel in the shared domain. Since now the management of cookies in share domain has been decoupled with the anonymous inode, there's no need to maintain an extra anonymous inode. Let's unify these two anonymous inodes. Besides, in non-share-domain mode only bootstrap will allocate anonymous inode. To simplify the implementation, always allocate anonymous inode for both bootstrap and data blobs. Similarly release anonymous inodes for data blobs when .put_super() is called, or we'll get "VFS: Busy inodes after unmount." warning. Also remove the redundant set_nlink() when initializing the anonymous inode, since i_nlink has already been initialized to 1 when the inode gets allocated. Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com> Reviewed-by: Jia Zhu <zhujia.zj@bytedance.com> Link: https://lore.kernel.org/r/20230209063913.46341-5-jefflexu@linux.alibaba.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>	2023-02-15 08:11:28 +08:00
Jingbo Xu	7032809a44	erofs: relinquish volume with mutex held Relinquish fscache volume with mutex held. Otherwise if a new domain is registered when the old domain with the same name gets removed from the list but not relinquished yet, fscache may complain the collision. Fixes: `8b7adf1dff` ("erofs: introduce fscache-based domain") Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com> Reviewed-by: Jia Zhu <zhujia.zj@bytedance.com> Link: https://lore.kernel.org/r/20230209063913.46341-4-jefflexu@linux.alibaba.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>	2023-02-15 08:11:27 +08:00
Jingbo Xu	2dfb8c3b12	erofs: maintain cookies of share domain in self-contained list We'd better not touch sb->s_inodes list and inode->i_count directly. Let's maintain cookies of share domain in a self-contained list in erofs. Besides, relinquish cookie with the mutex held. Otherwise if a cookie is registered when the old cookie with the same name in the same domain has been removed from the list but not relinquished yet, fscache may complain "Duplicate cookie detected". Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com> Reviewed-by: Jia Zhu <zhujia.zj@bytedance.com> Link: https://lore.kernel.org/r/20230209063913.46341-3-jefflexu@linux.alibaba.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>	2023-02-15 08:11:27 +08:00
Jingbo Xu	bdfa90142e	erofs: remove unused device mapping in meta routine Currently metadata is always on bootstrap, and thus device mapping is not needed so far. Remove the redundant device mapping in the meta routine. Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com> Reviewed-by: Jia Zhu <zhujia.zj@bytedance.com> Reviewed-by: Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/20230209063913.46341-2-jefflexu@linux.alibaba.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>	2023-02-15 08:11:27 +08:00
Jingbo Xu	8b58f9f021	erofs: remove unused EROFS_GET_BLOCKS_RAW flag For erofs_map_blocks() and erofs_map_blocks_flatmode(), the flags argument is always EROFS_GET_BLOCKS_RAW. Thus remove the unused flags parameter for these two functions. Besides EROFS_GET_BLOCKS_RAW is originally introduced for reading compressed (raw) data for compressed files. However it's never used actually and let's remove it now. Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com> Reviewed-by: Yue Hu <huyue2@coolpad.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/20230209024825.17335-2-jefflexu@linux.alibaba.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>	2023-02-15 08:11:26 +08:00
Thomas Weißschuh	339bc4d3cd	erofs: make kobj_type structures constant Since commit `ee6d3dd4ed` ("driver core: make kobj_type constant.") the driver core allows the usage of const struct kobj_type. Take advantage of this to constify the structure definitions to prevent modification at runtime. Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Acked-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Jingbo Xu <jefflexu@linux.alibaba.com> Reviewed-by: Yue Hu <huyue2@coolpad.com> Reviewed-by: Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/20230209-kobj_type-erofs-v1-1-078c945e2c4b@weissschuh.net Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>	2023-02-15 08:11:26 +08:00
Sandeep Dhavale	3fffb589b9	erofs: add per-cpu threads for decompression as an option Using per-cpu thread pool we can reduce the scheduling latency compared to workqueue implementation. With this patch scheduling latency and variation is reduced as per-cpu threads are high priority kthread_workers. The results were evaluated on arm64 Android devices running 5.10 kernel. The table below shows resulting improvements of total scheduling latency for the same app launch benchmark runs with 50 iterations. Scheduling latency is the latency between when the task (workqueue kworker vs kthread_worker) became eligible to run to when it actually started running. +-------------------------+-----------+----------------+---------+ \| \| workqueue \| kthread_worker \| diff \| +-------------------------+-----------+----------------+---------+ \| Average (us) \| 15253 \| 2914 \| -80.89% \| \| Median (us) \| 14001 \| 2912 \| -79.20% \| \| Minimum (us) \| 3117 \| 1027 \| -67.05% \| \| Maximum (us) \| 30170 \| 3805 \| -87.39% \| \| Standard deviation (us) \| 7166 \| 359 \| \| +-------------------------+-----------+----------------+---------+ Background: Boot times and cold app launch benchmarks are very important to the Android ecosystem as they directly translate to responsiveness from user point of view. While EROFS provides a lot of important features like space savings, we saw some performance penalty in cold app launch benchmarks in few scenarios. Analysis showed that the significant variance was coming from the scheduling cost while decompression cost was more or less the same. Having per-cpu thread pool we can see from the above table that this variation is reduced by ~80% on average. This problem was discussed at LPC 2022. Link to LPC 2022 slides and talk at [1] [1] https://lpc.events/event/16/contributions/1338/ [ Gao Xiang: At least, we have to add this until WQ_UNBOUND workqueue issue [2] on many arm64 devices is resolved. ] [2] https://lore.kernel.org/r/CAJkfWY490-m6wNubkxiTPsW59sfsQs37Wey279LmiRxKt7aQYg@mail.gmail.com Signed-off-by: Sandeep Dhavale <dhavale@google.com> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20230208093322.75816-1-hsiangkao@linux.alibaba.com	2023-02-15 08:11:26 +08:00
Gao Xiang	557afdd94c	erofs: tidy up internal.h Reorder internal.h code so that removing unneeded macros and more. No logic changes. Reviewed-by: Yue Hu <huyue2@coolpad.com> Reviewed-by: Jingbo Xu <jefflexu@linux.alibaba.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20230204093040.97967-6-hsiangkao@linux.alibaba.com	2023-02-15 08:11:25 +08:00
Gao Xiang	999f2f9a63	erofs: get rid of z_erofs_do_map_blocks() forward declaration The code can be neater without forward declarations. Let's get rid of z_erofs_do_map_blocks() forward declaration. Reviewed-by: Yue Hu <huyue2@coolpad.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20230204093040.97967-5-hsiangkao@linux.alibaba.com	2023-02-15 08:11:25 +08:00
Gao Xiang	a9a94d9373	erofs: move zdata.h into zdata.c Definitions in zdata.h are only used in zdata.c and for internal use only. No logic changes. Reviewed-by: Yue Hu <huyue2@coolpad.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20230204093040.97967-4-hsiangkao@linux.alibaba.com	2023-02-15 08:11:25 +08:00
Gao Xiang	b1ed220c62	erofs: remove tagged pointer helpers Just open-code the remaining one to simplify the code. Reviewed-by: Yue Hu <huyue2@coolpad.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20230204093040.97967-3-hsiangkao@linux.alibaba.com	2023-02-15 08:11:25 +08:00
Gao Xiang	cdba55067f	erofs: avoid tagged pointers to mark sync decompression We could just use a boolean in z_erofs_decompressqueue for sync decompression to simplify the code. Reviewed-by: Yue Hu <huyue2@coolpad.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20230204093040.97967-2-hsiangkao@linux.alibaba.com	2023-02-15 08:11:24 +08:00
Gao Xiang	4efdec36dc	erofs: get rid of erofs_inode_datablocks() erofs_inode_datablocks() has the only one caller, let's just get rid of it entirely. No logic changes. Reviewed-by: Yue Hu <huyue2@coolpad.com> Reviewed-by: Jingbo Xu <jefflexu@linux.alibaba.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20230204093040.97967-1-hsiangkao@linux.alibaba.com	2023-02-15 08:11:24 +08:00
Gao Xiang	b780d3fc61	erofs: simplify iloc() Actually we could pass in inodes directly to clean up all callers. Also rename iloc() as erofs_iloc(). Link: https://lore.kernel.org/r/20230114150823.432069-1-xiang@kernel.org Reviewed-by: Yue Hu <huyue2@coolpad.com> Reviewed-by: Jingbo Xu <jefflexu@linux.alibaba.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>	2023-02-15 08:11:24 +08:00
Gao Xiang	e324eaa979	erofs: get rid of debug_one_dentry() Since erofsdump is available, no need to keep this debugging functionality at all. Also drop a useless comment since it's the VFS behavior. Link: https://lore.kernel.org/r/20230114125746.399253-1-xiang@kernel.org Reviewed-by: Yue Hu <huyue2@coolpad.com> Reviewed-by: Jingbo Xu <jefflexu@linux.alibaba.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>	2023-02-15 08:11:23 +08:00
Gao Xiang	768bb10afb	erofs: remove linux/buffer_head.h dependency EROFS actually never uses buffer heads, therefore just get rid of BH_xxx definitions and linux/buffer_head.h inclusive. Link: https://lore.kernel.org/r/20230113065226.68801-2-hsiangkao@linux.alibaba.com Reviewed-by: Yue Hu <huyue2@coolpad.com> Reviewed-by: Jingbo Xu <jefflexu@linux.alibaba.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>	2023-02-15 08:11:10 +08:00
Gao Xiang	7c3511a2c8	erofs: clean up erofs_iget() Move inode hash function into inode.c and simplify erofs_iget(). Link: https://lore.kernel.org/r/20230113065226.68801-1-hsiangkao@linux.alibaba.com Reviewed-by: Yue Hu <huyue2@coolpad.com> Reviewed-by: Jingbo Xu <jefflexu@linux.alibaba.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>	2023-02-15 08:10:46 +08:00
Suren Baghdasaryan	1c71222e5f	mm: replace vma->vm_flags direct modifications with modifier calls Replace direct modifications to vma->vm_flags with calls to modifier functions to be able to track flag changes and to keep vma locking correctness. [akpm@linux-foundation.org: fix drivers/misc/open-dice.c, per Hyeonggon Yoo] Link: https://lkml.kernel.org/r/20230126193752.297968-5-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Mike Rapoport (IBM) <rppt@kernel.org> Acked-by: Sebastian Reichel <sebastian.reichel@collabora.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arjun Roy <arjunroy@google.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: David Rientjes <rientjes@google.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Joel Fernandes <joelaf@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Laurent Dufour <ldufour@linux.ibm.com> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Minchan Kim <minchan@google.com> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Peter Oskolkov <posk@google.com> Cc: Peter Xu <peterx@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Punit Agrawal <punit.agrawal@bytedance.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Shakeel Butt <shakeelb@google.com> Cc: Soheil Hassas Yeganeh <soheil@google.com> Cc: Song Liu <songliubraving@fb.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-02-09 16:51:39 -08:00
Christian Brauner	b74d24f7a7	fs: port ->getattr() to pass mnt_idmap Convert to struct mnt_idmap. Last cycle we merged the necessary infrastructure in `256c8aed2b` ("fs: introduce dedicated idmap type for mounts"). This is just the conversion to struct mnt_idmap. Currently we still pass around the plain namespace that was attached to a mount. This is in general pretty convenient but it makes it easy to conflate namespaces that are relevant on the filesystem with namespaces that are relevent on the mount level. Especially for non-vfs developers without detailed knowledge in this area this can be a potential source for bugs. Once the conversion to struct mnt_idmap is done all helpers down to the really low-level helpers will take a struct mnt_idmap argument instead of two namespace arguments. This way it becomes impossible to conflate the two eliminating the possibility of any bugs. All of the vfs and all filesystems only operate on struct mnt_idmap. Acked-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>	2023-01-19 09:24:25 +01:00
Jingbo Xu	e02ac3e732	erofs: clean up parsing of fscache related options ... to avoid the mess of conditional preprocessing as we are continually adding fscache related mount options. Reviewd-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Yue Hu <huyue2@coolpad.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com> Link: https://lore.kernel.org/r/20230112065431.124926-3-jefflexu@linux.alibaba.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>	2023-01-16 22:39:47 +08:00
Gao Xiang	12724ba389	erofs: fix kvcalloc() misuse with __GFP_NOFAIL As reported by syzbot [1], kvcalloc() cannot work with __GFP_NOFAIL. Let's use kcalloc() instead. [1] https://lore.kernel.org/r/0000000000007796bd05f1852ec2@google.com Reported-by: syzbot+c3729cda01706a04fb98@syzkaller.appspotmail.com Fixes: `fe3e5914e6` ("erofs: try to leave (de)compressed_pages on stack if possible") Fixes: `4f05687fd7` ("erofs: introduce struct z_erofs_decompress_backend") Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20230110074927.41651-1-hsiangkao@linux.alibaba.com	2023-01-10 16:41:45 +08:00
Siddh Raman Pant	6acd87d509	erofs/zmap.c: Fix incorrect offset calculation Effective offset to add to length was being incorrectly calculated, which resulted in iomap->length being set to 0, triggering a WARN_ON in iomap_iter_done(). Fix that, and describe it in comments. This was reported as a crash by syzbot under an issue about a warning encountered in iomap_iter_done(), but unrelated to erofs. C reproducer: https://syzkaller.appspot.com/text?tag=ReproC&x=1037a6b2880000 Kernel config: https://syzkaller.appspot.com/text?tag=KernelConfig&x=e2021a61197ebe02 Dashboard link: https://syzkaller.appspot.com/bug?extid=a8e049cd3abd342936b6 Reported-by: syzbot+a8e049cd3abd342936b6@syzkaller.appspotmail.com Suggested-by: Gao Xiang <hsiangkao@linux.alibaba.com> Signed-off-by: Siddh Raman Pant <code@siddh.me> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/20221209102151.311049-1-code@siddh.me Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>	2023-01-10 16:38:53 +08:00
Linus Torvalds	4a6bff1187	Changes since the last update: - Enable large folios for iomap/fscache mode; - Avoid sysfs warning due to mounting twice with the same fsid and domain_id in fscache mode; - Refine fscache interface among erofs, fscache, and cachefiles; - Use kmap_local_page() only for metabuf; - Fixes around crafted images found by syzbot; - Minor cleanups and documentation updates. -----BEGIN PGP SIGNATURE----- iIcEABYIAC8WIQThPAmQN9sSA0DVxtI5NzHcH7XmBAUCY5S3khEceGlhbmdAa2Vy bmVsLm9yZwAKCRA5NzHcH7XmBLr3AQDA5xpztSsxfe0Gp+bwf12ySuntimJxXmAj 83EHCfSC+AEAu4fcWkIF38MBBVJvFVjFaXCZKmFossbI5Rp8TuqPpgk= =HDsJ -----END PGP SIGNATURE----- Merge tag 'erofs-for-6.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs Pull erofs updates from Gao Xiang: "In this cycle, large folios are now enabled in the iomap/fscache mode for uncompressed files first. In order to do that, we've also cleaned up better interfaces between erofs and fscache, which are acked by fscache/netfs folks and included in this pull request. Other than that, there are random fixes around erofs over fscache and crafted images by syzbot, minor cleanups and documentation updates. Summary: - Enable large folios for iomap/fscache mode - Avoid sysfs warning due to mounting twice with the same fsid and domain_id in fscache mode - Refine fscache interface among erofs, fscache, and cachefiles - Use kmap_local_page() only for metabuf - Fixes around crafted images found by syzbot - Minor cleanups and documentation updates" * tag 'erofs-for-6.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs: erofs: validate the extent length for uncompressed pclusters erofs: fix missing unmap if z_erofs_get_extent_compressedlen() fails erofs: Fix pcluster memleak when its block address is zero erofs: use kmap_local_page() only for erofs_bread() erofs: enable large folios for fscache mode erofs: support large folios for fscache mode erofs: switch to prepare_ondemand_read() in fscache mode fscache,cachefiles: add prepare_ondemand_read() callback erofs: clean up cached I/O strategies erofs: update documentation erofs: check the uniqueness of fsid in shared domain in advance erofs: enable large folios for iomap mode	2022-12-12 20:14:04 -08:00
Linus Torvalds	6a518afcc2	fs.acl.rework.v6.2 -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCY5bwTgAKCRCRxhvAZXjc ovd2AQCK00NAtGjQCjQPQGyTa4GAPqvWgq1ef0lnhv+TL5US5gD9FncQ8UofeMXt pBfjtAD6ettTPCTxUQfnTwWEU4rc7Qg= =27Wm -----END PGP SIGNATURE----- Merge tag 'fs.acl.rework.v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping Pull VFS acl updates from Christian Brauner: "This contains the work that builds a dedicated vfs posix acl api. The origins of this work trace back to v5.19 but it took quite a while to understand the various filesystem specific implementations in sufficient detail and also come up with an acceptable solution. As we discussed and seen multiple times the current state of how posix acls are handled isn't nice and comes with a lot of problems: The current way of handling posix acls via the generic xattr api is error prone, hard to maintain, and type unsafe for the vfs until we call into the filesystem's dedicated get and set inode operations. It is already the case that posix acls are special-cased to death all the way through the vfs. There are an uncounted number of hacks that operate on the uapi posix acl struct instead of the dedicated vfs struct posix_acl. And the vfs must be involved in order to interpret and fixup posix acls before storing them to the backing store, caching them, reporting them to userspace, or for permission checking. Currently a range of hacks and duct tape exist to make this work. As with most things this is really no ones fault it's just something that happened over time. But the code is hard to understand and difficult to maintain and one is constantly at risk of introducing bugs and regressions when having to touch it. Instead of continuing to hack posix acls through the xattr handlers this series builds a dedicated posix acl api solely around the get and set inode operations. Going forward, the vfs_get_acl(), vfs_remove_acl(), and vfs_set_acl() helpers must be used in order to interact with posix acls. They operate directly on the vfs internal struct posix_acl instead of abusing the uapi posix acl struct as we currently do. In the end this removes all of the hackiness, makes the codepaths easier to maintain, and gets us type safety. This series passes the LTP and xfstests suites without any regressions. For xfstests the following combinations were tested: - xfs - ext4 - btrfs - overlayfs - overlayfs on top of idmapped mounts - orangefs - (limited) cifs There's more simplifications for posix acls that we can make in the future if the basic api has made it. A few implementation details: - The series makes sure to retain exactly the same security and integrity module permission checks. Especially for the integrity modules this api is a win because right now they convert the uapi posix acl struct passed to them via a void pointer into the vfs struct posix_acl format to perform permission checking on the mode. There's a new dedicated security hook for setting posix acls which passes the vfs struct posix_acl not a void pointer. Basing checking on the posix acl stored in the uapi format is really unreliable. The vfs currently hacks around directly in the uapi struct storing values that frankly the security and integrity modules can't correctly interpret as evidenced by bugs we reported and fixed in this area. It's not necessarily even their fault it's just that the format we provide to them is sub optimal. - Some filesystems like 9p and cifs need access to the dentry in order to get and set posix acls which is why they either only partially or not even at all implement get and set inode operations. For example, cifs allows setxattr() and getxattr() operations but doesn't allow permission checking based on posix acls because it can't implement a get acl inode operation. Thus, this patch series updates the set acl inode operation to take a dentry instead of an inode argument. However, for the get acl inode operation we can't do this as the old get acl method is called in e.g., generic_permission() and inode_permission(). These helpers in turn are called in various filesystem's permission inode operation. So passing a dentry argument to the old get acl inode operation would amount to passing a dentry to the permission inode operation which we shouldn't and probably can't do. So instead of extending the existing inode operation Christoph suggested to add a new one. He also requested to ensure that the get and set acl inode operation taking a dentry are consistently named. So for this version the old get acl operation is renamed to ->get_inode_acl() and a new ->get_acl() inode operation taking a dentry is added. With this we can give both 9p and cifs get and set acl inode operations and in turn remove their complex custom posix xattr handlers. In the future I hope to get rid of the inode method duplication but it isn't like we have never had this situation. Readdir is just one example. And frankly, the overall gain in type safety and the more pleasant api wise are simply too big of a benefit to not accept this duplication for a while. - We've done a full audit of every codepaths using variant of the current generic xattr api to get and set posix acls and surprisingly it isn't that many places. There's of course always a chance that we might have missed some and if so I'm sure we'll find them soon enough. The crucial codepaths to be converted are obviously stacking filesystems such as ecryptfs and overlayfs. For a list of all callers currently using generic xattr api helpers see [2] including comments whether they support posix acls or not. - The old vfs generic posix acl infrastructure doesn't obey the create and replace semantics promised on the setxattr(2) manpage. This patch series doesn't address this. It really is something we should revisit later though. The patches are roughly organized as follows: (1) Change existing set acl inode operation to take a dentry argument (Intended to be a non-functional change) (2) Rename existing get acl method (Intended to be a non-functional change) (3) Implement get and set acl inode operations for filesystems that couldn't implement one before because of the missing dentry. That's mostly 9p and cifs (Intended to be a non-functional change) (4) Build posix acl api, i.e., add vfs_get_acl(), vfs_remove_acl(), and vfs_set_acl() including security and integrity hooks (Intended to be a non-functional change) (5) Implement get and set acl inode operations for stacking filesystems (Intended to be a non-functional change) (6) Switch posix acl handling in stacking filesystems to new posix acl api now that all filesystems it can stack upon support it. (7) Switch vfs to new posix acl api (semantical change) (8) Remove all now unused helpers (9) Additional regression fixes reported after we merged this into linux-next Thanks to Seth for a lot of good discussion around this and encouragement and input from Christoph" * tag 'fs.acl.rework.v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping: (36 commits) posix_acl: Fix the type of sentinel in get_acl orangefs: fix mode handling ovl: call posix_acl_release() after error checking evm: remove dead code in evm_inode_set_acl() cifs: check whether acl is valid early acl: make vfs_posix_acl_to_xattr() static acl: remove a slew of now unused helpers 9p: use stub posix acl handlers cifs: use stub posix acl handlers ovl: use stub posix acl handlers ecryptfs: use stub posix acl handlers evm: remove evm_xattr_acl_change() xattr: use posix acl api ovl: use posix acl api ovl: implement set acl method ovl: implement get acl method ecryptfs: implement set acl method ecryptfs: implement get acl method ksmbd: use vfs_remove_acl() acl: add vfs_remove_acl() ...	2022-12-12 18:46:39 -08:00
Gao Xiang	c505feba4c	erofs: validate the extent length for uncompressed pclusters syzkaller reported a KASAN use-after-free: https://syzkaller.appspot.com/bug?extid=2ae90e873e97f1faf6f2 The referenced fuzzed image actually has two issues: - m_pa == 0 as a non-inlined pcluster; - The logical length is longer than its physical length. The first issue has already been addressed. This patch addresses the second issue by checking the extent length validity. Reported-by: syzbot+2ae90e873e97f1faf6f2@syzkaller.appspotmail.com Fixes: `02827e1796` ("staging: erofs: add erofs_map_blocks_iter") Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20221205150050.47784-2-hsiangkao@linux.alibaba.com	2022-12-07 10:56:31 +08:00
Gao Xiang	d5d188b8f8	erofs: fix missing unmap if z_erofs_get_extent_compressedlen() fails Otherwise, meta buffers could be leaked. Fixes: `cec6e93bea` ("erofs: support parsing big pcluster compress indexes") Reviewed-by: Yue Hu <huyue2@coolpad.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20221205150050.47784-1-hsiangkao@linux.alibaba.com	2022-12-07 10:56:31 +08:00
Chen Zhongjin	c42c0ffe81	erofs: Fix pcluster memleak when its block address is zero syzkaller reported a memleak: https://syzkaller.appspot.com/bug?id=62f37ff612f0021641eda5b17f056f1668aa9aed unreferenced object 0xffff88811009c7f8 (size 136): ... backtrace: [<ffffffff821db19b>] z_erofs_do_read_page+0x99b/0x1740 [<ffffffff821dee9e>] z_erofs_readahead+0x24e/0x580 [<ffffffff814bc0d6>] read_pages+0x86/0x3d0 ... syzkaller constructed a case: in z_erofs_register_pcluster(), ztailpacking = false and map->m_pa = zero. This makes pcl->obj.index be zero although pcl is not a inline pcluster. Then following path adds refcount for grp, but the refcount won't be put because pcl is inline. z_erofs_readahead() z_erofs_do_read_page() # for another page z_erofs_collector_begin() erofs_find_workgroup() erofs_workgroup_get() Since it's illegal for the block address of a non-inlined pcluster to be zero, add check here to avoid registering the pcluster which would be leaked. Fixes: `cecf864d3d` ("erofs: support inline data decompression") Reported-by: syzbot+6f8cd9a0155b366d227f@syzkaller.appspotmail.com Signed-off-by: Chen Zhongjin <chenzhongjin@huawei.com> Reviewed-by: Yue Hu <huyue2@coolpad.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/Y42Kz6sVkf+XqJRB@debian Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>	2022-12-07 10:56:31 +08:00
Gao Xiang	927e5010ff	erofs: use kmap_local_page() only for erofs_bread() Convert all mapped erofs_bread() users to use kmap_local_page() instead of kmap() or kmap_atomic(). Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-and-tested-by: Jingbo Xu <jefflexu@linux.alibaba.com> Reviewed-by: Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/20221018105313.4940-1-hsiangkao@linux.alibaba.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>	2022-12-07 10:56:31 +08:00
Jingbo Xu	e6687b8922	erofs: enable large folios for fscache mode Enable large folios for fscache mode. Enable this feature for non-compressed format for now, until the compression part supports large folios later. One thing worth noting is that, the feature is not enabled for the meta data routine since meta inodes don't need large folios for now, nor do they support readahead yet. Also document this new feature. Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com> Reviewed-by: Jia Zhu <zhujia.zj@bytedance.com> Link: https://lore.kernel.org/r/20221201074256.16639-3-jefflexu@linux.alibaba.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>	2022-12-07 10:56:31 +08:00
Jingbo Xu	be62c51988	erofs: support large folios for fscache mode When large folios supported, one folio can be split into several slices, each of which may be mapped to META/UNMAPPED/MAPPED, and the folio can be unlocked as a whole only when all slices have completed. Thus always allocate erofs_fscache_request for each .read_folio() or .readahead(), in which case the allocated request is responsible for unlocking folios when all slices have completed. As described above, each folio or folio range can be mapped into several slices, while these slices may be mapped to different cookies, and thus each slice needs its own netfs_cache_resources. Here we introduce chained requests to support this, where each .read_folio() or .readahead() calling can correspond to multiple requests. Each request has its own netfs_cache_resources and thus is used to access one cookie. Among these requests, there's a primary request, with the others pointing to the primary request. Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com> Reviewed-by: Jia Zhu <zhujia.zj@bytedance.com> Link: https://lore.kernel.org/r/20221201074256.16639-2-jefflexu@linux.alibaba.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>	2022-12-07 10:56:30 +08:00
Jingbo Xu	709fe09e28	erofs: switch to prepare_ondemand_read() in fscache mode Switch to prepare_ondemand_read() interface and a self-contained request completion to get rid of netfs_io_[request\|subrequest]. The whole request will still be split into slices (subrequest) according to the cache state of the backing file. As long as one of the subrequests fails, the whole request will be marked as failed. Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com> Reviewed-by: Jia Zhu <zhujia.zj@bytedance.com> Link: https://lore.kernel.org/r/20221124034212.81892-3-jefflexu@linux.alibaba.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>	2022-12-07 10:56:30 +08:00
Gao Xiang	1282dea37b	erofs: clean up cached I/O strategies After commit `4c7e42552b` ("erofs: remove useless cache strategy of DELAYEDALLOC"), only one cached I/O allocation strategy is supported: When cached I/O is preferred, page allocation is applied without direct reclaim. If allocation fails, fall back to inplace I/O. Let's get rid of z_erofs_cache_alloctype. No logical changes. Reviewed-by: Yue Hu <huyue2@coolpad.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Yue Hu <huyue2@coolpad.com> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20221206060352.152830-1-xiang@kernel.org	2022-12-07 10:56:20 +08:00
Hou Tao	27f2a2dcc6	erofs: check the uniqueness of fsid in shared domain in advance When shared domain is enabled, doing mount twice with the same fsid and domain_id will trigger sysfs warning as shown below: sysfs: cannot create duplicate filename '/fs/erofs/d0,meta.bin' CPU: 15 PID: 1051 Comm: mount Not tainted 6.1.0-rc6+ #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996) Call Trace: <TASK> dump_stack_lvl+0x38/0x49 dump_stack+0x10/0x12 sysfs_warn_dup.cold+0x17/0x27 sysfs_create_dir_ns+0xb8/0xd0 kobject_add_internal+0xb1/0x240 kobject_init_and_add+0x71/0xa0 erofs_register_sysfs+0x89/0x110 erofs_fc_fill_super+0x98c/0xaf0 vfs_get_super+0x7d/0x100 get_tree_nodev+0x16/0x20 erofs_fc_get_tree+0x20/0x30 vfs_get_tree+0x24/0xb0 path_mount+0x2fa/0xa90 do_mount+0x7c/0xa0 __x64_sys_mount+0x8b/0xe0 do_syscall_64+0x30/0x60 entry_SYSCALL_64_after_hwframe+0x46/0xb0 The reason is erofs_fscache_register_cookie() doesn't guarantee the primary data blob (aka fsid) is unique in the shared domain and erofs_register_sysfs() invoked by the second mount will fail due to the duplicated fsid in the shared domain and report warning. It would be better to check the uniqueness of fsid before doing erofs_register_sysfs(), so adding a new flags parameter for erofs_fscache_register_cookie() and doing the uniqueness check if EROFS_REG_COOKIE_NEED_NOEXIST is enabled. After the patch, the error in dmesg for the duplicated mount would be: erofs: ...: erofs_domain_register_cookie: XX already exists in domain YY Reviewed-by: Jia Zhu <zhujia.zj@bytedance.com> Reviewed-by: Jingbo Xu <jefflexu@linux.alibaba.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20221125110822.3812942-1-houtao@huaweicloud.com Fixes: `7d41963759` ("erofs: Support sharing cookies in the same domain") Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>	2022-12-07 10:53:40 +08:00
Jingbo Xu	ce529cc25b	erofs: enable large folios for iomap mode Enable large folios for iomap mode. Then the readahead routine will pass down large folios containing multiple pages. Let's enable this for non-compressed format for now, until the compression part supports large folios later. When large folios supported, the iomap routine will allocate iomap_page for each large folio and thus we need iomap_release_folio() and iomap_invalidate_folio() to free iomap_page when these folios get reclaimed or invalidated. Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/20221130060455.44532-1-jefflexu@linux.alibaba.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>	2022-12-07 10:52:06 +08:00
Al Viro	de4eda9de2	use less confusing names for iov_iter direction initializers READ/WRITE proved to be actively confusing - the meanings are "data destination, as used with read(2)" and "data source, as used with write(2)", but people keep interpreting those as "we read data from it" and "we write data to it", i.e. exactly the wrong way. Call them ITER_DEST and ITER_SOURCE - at least that is harder to misinterpret... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2022-11-25 13:01:55 -05:00
Linus Torvalds	81e7cfa3a9	Changes since last update: - Fix packed_inode invalid access when reading fragments on crafted images; - Add a missing erofs_put_metabuf() in an error path in fscache mode; - Fix incorrect `count' for unmapped extents in fscache mode; - Fix use-after-free of fsid and domain_id string when remounting; - Fix missing xas_retry() in fscache mode. -----BEGIN PGP SIGNATURE----- iIcEABYIAC8WIQThPAmQN9sSA0DVxtI5NzHcH7XmBAUCY3OcchEceGlhbmdAa2Vy bmVsLm9yZwAKCRA5NzHcH7XmBCzgAP92t7Lfu7gBuyhXfCJwJVFK0Iku8j9mhOiT +C/RVB+9zQEAk/2vy3ULcGN5k6k2q7OgEzNxQ/jM3hVQnoK+sQzwVwA= =XHmr -----END PGP SIGNATURE----- Merge tag 'erofs-for-6.1-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs Pull erofs fixes from Gao Xiang: "Most patches randomly fix error paths or corner cases in fscache mode reported recently. One fixes an invalid access relating to fragments on crafted images. Summary: - Fix packed_inode invalid access when reading fragments on crafted images - Add a missing erofs_put_metabuf() in an error path in fscache mode - Fix incorrect `count' for unmapped extents in fscache mode - Fix use-after-free of fsid and domain_id string when remounting - Fix missing xas_retry() in fscache mode" * tag 'erofs-for-6.1-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs: erofs: fix missing xas_retry() in fscache mode erofs: fix use-after-free of fsid and domain_id string erofs: get correct count for unmapped range in fscache mode erofs: put metabuf in error path in fscache mode erofs: fix general protection fault when reading fragment	2022-11-15 10:30:34 -08:00
Jingbo Xu	37020bbb71	erofs: fix missing xas_retry() in fscache mode The xarray iteration only holds the RCU read lock and thus may encounter XA_RETRY_ENTRY if there's process modifying the xarray concurrently. This will cause oops when referring to the invalid entry. Fix this by adding the missing xas_retry(), which will make the iteration wind back to the root node if XA_RETRY_ENTRY is encountered. Fixes: `d435d53228` ("erofs: change to use asynchronous io for fscache readpage/readahead") Suggested-by: David Howells <dhowells@redhat.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Jia Zhu <zhujia.zj@bytedance.com> Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com> Link: https://lore.kernel.org/r/20221114121943.29987-1-jefflexu@linux.alibaba.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>	2022-11-14 23:48:38 +08:00
Jingbo Xu	39bfcb8138	erofs: fix use-after-free of fsid and domain_id string When erofs instance is remounted with fsid or domain_id mount option specified, the original fsid and domain_id string pointer in sbi->opt is directly overridden with the fsid and domain_id string in the new fs_context, without freeing the original fsid and domain_id string. What's worse, when the new fsid and domain_id string is transferred to sbi, they are not reset to NULL in fs_context, and thus they are freed when remount finishes, while sbi is still referring to these strings. Reconfiguration for fsid and domain_id seems unusual. Thus clarify this restriction explicitly and dump a warning when users are attempting to do this. Besides, to fix the use-after-free issue, move fsid and domain_id from erofs_mount_opts to outside. Fixes: `c6be2bd0a5` ("erofs: register fscache volume") Fixes: `8b7adf1dff` ("erofs: introduce fscache-based domain") Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com> Reviewed-by: Jia Zhu <zhujia.zj@bytedance.com> Reviewed-by: Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/20221021023153.1330-1-jefflexu@linux.alibaba.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>	2022-11-10 09:53:20 +08:00
Johannes Weiner	82e60d00b7	fs: fix leaked psi pressure state When psi annotations were added to to btrfs compression reads, the psi state tracking over add_ra_bio_pages and btrfs_submit_compressed_read was faulty. A pressure state, once entered, is never left. This results in incorrectly elevated pressure, which triggers OOM kills. pflags record the previous memstall state when we enter a new one. The code tried to initialize pflags to 1, and then optimize the leave call when we either didn't enter a memstall, or were already inside a nested stall. However, there can be multiple PageWorkingset pages in the bio, at which point it's that path itself that enters repeatedly and overwrites pflags. This causes us to miss the exit. Enter the stall only once if needed, then unwind correctly. erofs has the same problem, fix that up too. And move the memstall exit past submit_bio() to restore submit accounting originally added by `b8e24a9300` ("block: annotate refault stalls from IO submission"). Link: https://lkml.kernel.org/r/Y2UHRqthNUwuIQGS@cmpxchg.org Fixes: `4088a47e78` ("btrfs: add manual PSI accounting for compressed reads") Fixes: `99486c511f` ("erofs: add manual PSI accounting for the compressed address space") Fixes: `118f3663fb` ("block: remove PSI accounting from the bio layer") Link: https://lore.kernel.org/r/d20a0a85-e415-cf78-27f9-77dd7a94bc8d@leemhuis.info/ Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: Thorsten Leemhuis <linux@leemhuis.info> Tested-by: Thorsten Leemhuis <linux@leemhuis.info> Cc: Chao Yu <chao@kernel.org> Cc: Chris Mason <clm@fb.com> Cc: Christoph Hellwig <hch@lst.de> Cc: David Sterba <dsterba@suse.com> Cc: Gao Xiang <xiang@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-11-08 15:57:25 -08:00
Jingbo Xu	e6d9f9ba11	erofs: get correct count for unmapped range in fscache mode For unmapped range, the returned map.m_llen is zero, and thus the calculated count is unexpected zero. Prior to the refactoring introduced by commit `1ae9470c3e` ("erofs: clean up .read_folio() and .readahead() in fscache mode"), only the readahead routine suffers from this. With the refactoring of making .read_folio() and .readahead() calling one common routine, both read_folio and readahead have this issue now. Fix this by calculating count separately in unmapped condition. Fixes: `c665b394b9` ("erofs: implement fscache-based data readahead") Fixes: `1ae9470c3e` ("erofs: clean up .read_folio() and .readahead() in fscache mode") Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/20221104054028.52208-3-jefflexu@linux.alibaba.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>	2022-11-08 14:46:30 +08:00

1 2 3 4 5 ...

427 Commits