OpenCloudOS-Kernel

Commit Graph

Author	SHA1	Message	Date
Linus Torvalds	3b629f8d6d	io_uring-bio-cache.5-2021-08-30 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmEs8QQQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpgAgD/wP9gGxrFE5oxtdozDPkEYTXn5e0QKseDyV cNxLmSb3wc4WIEPwjCavdQHpy0fnbjaYwGveHf9ygQwDZPj9WBgEL3ipPYXCCzFA ysoV86kBRxKDI476r2InxI8WaW7hV0IWxPlScUTA1QeeNAzRJDymQvRuwg5KvVRS Jt6R58khzWpEGYO2CqFTpGsA7x01R0kvZ54xmFgKZ+Pxo+Bk03fkO32YUFC49Wm8 Zy+JMsaiIlLgucDTJ4zAKjQUXiwP2GMEw5Vk/lLUFGBvyw0AN2rO9g18L7QW2ZUu vnkaJQwBbMUbgveXlI/y6GG/vuKUG2i4AmzNJH17qFCnimO3JY6vgzUOg5dqOiwx bx7ZzmnBWgQp95/cSAlZ4QwRYf3z0hvVFKPj9U3X9wKGmuxUKHiLResQwp7bzRdd 4L4Jo1WFDDHR/1MOOzzW0uxE3uTm0LKcncsi4hJL20dl+16RXCIbzHWUTAd8yyMV 9QeUAumc4GHOeswa1Ms8jLPAgXyEoAkec7ca7cRIY/NW+DXGLG9tYBgCw1eLe6BN M7LwMsPNlS2v2dMUbiuw8XxkA+uYso728e2vd/edca2jxXj8+SVnm020aYBnxIzh nmjbf69+QddBPEnk/EPvRj8tXOhr3k7FklI4R7qlei/+IGTujGPvM4kn3p6fnHrx d7bsu/jtaQ== =izfH -----END PGP SIGNATURE----- Merge tag 'io_uring-bio-cache.5-2021-08-30' of git://git.kernel.dk/linux-block Pull support for struct bio recycling from Jens Axboe: "This adds bio recycling support for polled IO, allowing quick reuse of a bio for high IOPS scenarios via a percpu bio_set list. It's good for almost a 10% improvement in performance, bumping our per-core IO limit from ~3.2M IOPS to ~3.5M IOPS" * tag 'io_uring-bio-cache.5-2021-08-30' of git://git.kernel.dk/linux-block: bio: improve kerneldoc documentation for bio_alloc_kiocb() block: provide bio_clear_hipri() helper block: use the percpu bio cache in __blkdev_direct_IO io_uring: enable use of bio alloc cache block: clear BIO_PERCPU_CACHE flag if polling isn't supported bio: add allocation cache abstraction fs: add kiocb alloc cache flag bio: optimize initialization of a bio	2021-08-30 19:30:30 -07:00
Linus Torvalds	c547d89a9a	for-5.15/io_uring-2021-08-30 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmEs7tIQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpoR2EACdOPj0tivXWufgFnQyQYPpX4/Qe3lNw608 MLJ9/zshPFe5kx+SnHzT3UucHd2LO4C68pNVEHdHoi1gqMnMe9+Du82LJlo+Cx9i 53yiWxNiY7er3t3lC2nvaPF0BPKiaUKaRzqugOTfdWmKLP3MyYyygEBrRDkaK1S1 BjVq2ewmKTYf63gHkluHeRTav9KxcLWvSqgFUC8Y0mNhazTSdzGB6MAPFodpsuj1 Vv8ytCiagp9Gi0AibvS2mZvV/WxQtFP7qBofbnG3KcgKgOU+XKTJCH+cU6E/3J9Q nPy1loh2TISOtYAz2scypAbwsK4FWeAHg1SaIj/RtUGKG7zpVU5u97CPZ8+UfHzu CuR3a1o36Cck8+ZdZIjtRZfvQGog0Dh5u4ZQ4dRwFQd6FiVxdO8LHkegqkV6PStc dVrHSo5kUE5hGT8ed1YFfuOSJDZ6w0/LleZGMdU4pRGGs8wqkerZlfLL4ustSfLk AcS+azmG2f3iI5iadnxOUNeOT2lmE84fvvLyH2krfsA3AtX0CtHXQcLYAguRAIwg gnvYf70JOya6Lb/hbXUi8d8h3uXxeFsoitz1QyasqAEoJoY05EW+l94NbiEzXwol mKrVfZCk+wuhw3npbVCqK7nkepvBWso1qTip//T0HDAaKaZnZcmhobUn5SwI+QPx fcgAo6iw8g== =WBzF -----END PGP SIGNATURE----- Merge tag 'for-5.15/io_uring-2021-08-30' of git://git.kernel.dk/linux-block Pull io_uring updates from Jens Axboe: - cancellation cleanups (Hao, Pavel) - io-wq accounting cleanup (Hao) - io_uring submit locking fix (Hao) - io_uring link handling fixes (Hao) - fixed file improvements (wangyangbo, Pavel) - allow updates of linked timeouts like regular timeouts (Pavel) - IOPOLL fix (Pavel) - remove batched file get optimization (Pavel) - improve reference handling (Pavel) - IRQ task_work batching (Pavel) - allow pure fixed file, and add support for open/accept (Pavel) - GFP_ATOMIC RT kernel fix - multiple CQ ring waiter improvement - funnel IRQ completions through task_work - add support for limiting async workers explicitly - add different clocksource support for timeouts - io-wq wakeup race fix - lots of cleanups and improvement (Pavel et al) * tag 'for-5.15/io_uring-2021-08-30' of git://git.kernel.dk/linux-block: (87 commits) io-wq: fix wakeup race when adding new work io-wq: wqe and worker locks no longer need to be IRQ safe io-wq: check max_worker limits if a worker transitions bound state io_uring: allow updating linked timeouts io_uring: keep ltimeouts in a list io_uring: support CLOCK_BOOTTIME/REALTIME for timeouts io-wq: provide a way to limit max number of workers io_uring: add build check for buf_index overflows io_uring: clarify io_req_task_cancel() locking io_uring: add task-refs-get helper io_uring: fix failed linkchain code logic io_uring: remove redundant req_set_fail() io_uring: don't free request to slab io_uring: accept directly into fixed file table io_uring: hand code io_accept() fd installing io_uring: openat directly into fixed fd table net: add accept helper not installing fd io_uring: fix io_try_cancel_userdata race for iowq io_uring: IRQ rw completion batching io_uring: batch task work locking ...	2021-08-30 19:22:52 -07:00
Linus Torvalds	679369114e	for-5.15/block-2021-08-30 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmEs6H0QHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpukbD/9Qk9fQte+WJVmpbdvhV40gcKBVnGOVH0ke k+36x6AB/gWKnFHwtprsSyVqPxmzqwTv9VIq5l/s3Vydt3L61znvTneBeN03Wlkn UTxD0lY8HzyVWnZb82LBBjjy7cs6EzrFG4kBH/ZiTAyTcBsCAvzo5J7mywb4gFjj L/HeBq58EJ3WCUlxlVW1ijctvi7wnGoaH5bZY1TE00GGT6TysN2bEPfzjkuYHrDz RqhoQdWPLDz6h3x9lAncPw2MWlcmlGvJ96ABseAKFPKvXxE2PzgolSoQfVUUJtko bqGyy2ns+pxN11SrcGYjogEKVKhONoms/5UN1RtwRBVsgvecxlHER/SgyZ8luBDo lFhVXulkSjpswbWutRy3USge98GwMu2Z4ppP2CDmO7hkQd0DF8sL0kPKyaREkcHi NmsD/0zF2uUhUVN+PRC/MuzngAmL4Mmxjk70L+MohlK7e+H3pnEo1ec3OMcXe+wB dG6t/BFD9bYmj0UjsHeXEoR/iRuvSba1L8zBz5dhRaHH6DvdycYhpynXWWlU3C8K 3nzEVVpcDINMsiRl1Vqb6g6HsMwHIH84FRl7Mc51UmhW9C4gLfWMCt1guQuzOj72 yEbmCLydE/FR2IUPY7eqX8hRG8GTUlMtSvGdgnvBOcWj+K3buT/c5yVTHgTrN8ox LCOXHSvV6w== =S8fs -----END PGP SIGNATURE----- Merge tag 'for-5.15/block-2021-08-30' of git://git.kernel.dk/linux-block Pull block updates from Jens Axboe: "Nothing major in here - lots of good cleanups and tech debt handling, which is also evident in the diffstats. In particular: - Add disk sequence numbers (Matteo) - Discard merge fix (Ming) - Relax disk zoned reporting restrictions (Niklas) - Bio error handling zoned leak fix (Pavel) - Start of proper add_disk() error handling (Luis, Christoph) - blk crypto fix (Eric) - Non-standard GPT location support (Dmitry) - IO priority improvements and cleanups (Damien)o - blk-throtl improvements (Chunguang) - diskstats_show() stack reduction (Abd-Alrhman) - Loop scheduler selection (Bart) - Switch block layer to use kmap_local_page() (Christoph) - Remove obsolete disk_name helper (Christoph) - block_device refcounting improvements (Christoph) - Ensure gendisk always has a request queue reference (Christoph) - Misc fixes/cleanups (Shaokun, Oliver, Guoqing)" * tag 'for-5.15/block-2021-08-30' of git://git.kernel.dk/linux-block: (129 commits) sg: pass the device name to blk_trace_setup block, bfq: cleanup the repeated declaration blk-crypto: fix check for too-large dun_bytes blk-zoned: allow BLKREPORTZONE without CAP_SYS_ADMIN blk-zoned: allow zone management send operations without CAP_SYS_ADMIN block: mark blkdev_fsync static block: refine the disk_live check in del_gendisk mmc: sdhci-tegra: Enable MMC_CAP2_ALT_GPT_TEGRA mmc: block: Support alternative_gpt_sector() operation partitions/efi: Support non-standard GPT location block: Add alternative_gpt_sector() operation bio: fix page leak bio_add_hw_page failure block: remove CONFIG_DEBUG_BLOCK_EXT_DEVT block: remove a pointless call to MINOR() in device_add_disk null_blk: add error handling support for add_disk() virtio_blk: add error handling support for add_disk() block: add error handling for device_add_disk / add_disk block: return errors from disk_alloc_events block: return errors from blk_integrity_add block: call blk_register_queue earlier in device_add_disk ...	2021-08-30 18:52:11 -07:00
Linus Torvalds	8596e589b7	Updates for timekeeping, timers and related drivers: Core code: - Cure a couple of incorrectness issues in the posix CPU timer code to prevent that the tick dependency for NOHZ full is kept alive for no reason. - Avoid expensive double reprogramming of the clockevent device in hrtimer_start_range_ns(). - Avoid pointless SMP function calls when the clock was set to avoid disturbing CPUs which do not have any affected timers queued. - Make the clocksource watchdog test work correctly when CONFIG_HZ is less than 100. Drivers: - Prefer the ARM architected timer over the Exynos timer which is way more expensive to access. - Add device tree bindings for new Ingenic SoCs - The usual improvements and cleanups all over the place -----BEGIN PGP SIGNATURE----- iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmEsnxcTHHRnbHhAbGlu dXRyb25peC5kZQAKCRCmGPVMDXSYoZAmEAC0R5+9RkWAOpx0JWC7dQxFuIZoUZD1 8Inqs0ZFLX/LNrkjbBmZ/0XFvoU38+Eqd4Gqy3I748TgzMcB/0NHUUnXaugJE35J UzqdHkzhSReVinDHgRcIGMxkdj+JGomhlOM7RjTcdTixUWBfi1iXJBsit/tIWYq9 R9r/cthpEQzFu7BCZCdAsgRBaLNinCH9cCP0qXNS3hDwHFtziszjBhIrElAaEK4J IqrW/tsoxOlX/yQ/5TI8+E9DKgrr4pf+uNVjMJC0QBDodJrXRrkez9lFg6zJJggw adtzSFgL/OrbwsuEKAREpkSTwWSdQGdtLy2i9fx16jmh78YiTilHYe2A1ZD5v3zd dxfHUexnsgXcn4Im9w+sxLGxf2RQ6SsfVgd+R0lyOKLBFltnmbQWcvVl/6ZUa4Cc je+yuh9DTr0ksUiCnm0spP8AMtsSaHKJUP+MqHbXgo83AutVIFXr4zyQimM1lkY1 TPeXtbKlhd76jDgdcKU85tiyLsJrIImEJDzLtOEvSAk37yO6S9PHzLUMbg9Yp/vp Li4aUaMNmytCJTvKeSu6Wivzmyxqf4zSJus/fWLIJJjg/NSlNNhFDc0vkKPzaxM7 mg2VIcPrQKzQ1ZAap1kTdj1JNDWuANlV6pjE+zwaJbMtdHhvELyWnqC6EA05MsVj Txw2VVFaQcqKXA== =eP5w -----END PGP SIGNATURE----- Merge tag 'timers-core-2021-08-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer updates from Thomas Gleixner: "Updates for timekeeping, timers and related drivers: Core code: - Cure a couple of correctness issues in the posix CPU timer code to prevent that the tick dependency for NOHZ full is kept alive for no reason. - Avoid expensive double reprogramming of the clockevent device in hrtimer_start_range_ns(). - Avoid pointless SMP function calls when the clock was set to avoid disturbing CPUs which do not have any affected timers queued. - Make the clocksource watchdog test work correctly when CONFIG_HZ is less than 100. Drivers: - Prefer the ARM architected timer over the Exynos timer which is way more expensive to access. - Add device tree bindings for new Ingenic SoCs - The usual improvements and cleanups all over the place" * tag 'timers-core-2021-08-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (29 commits) clocksource: Make clocksource watchdog test safe for slow-HZ systems dt-bindings: timer: Add ABIs for new Ingenic SoCs clocksource/drivers/fttmr010: Pass around less pointers clocksource/drivers/mediatek: Optimize systimer irq clear flow on shutdown clocksource/drivers/ingenic: Use bitfield macro helpers clocksource/drivers/sh_cmt: Fix wrong setting if don't request IRQ for clock source channel dt-bindings: timer: convert rockchip,rk-timer.txt to YAML clocksource/drivers/exynos_mct: Mark MCT device as CLOCK_EVT_FEAT_PERCPU clocksource/drivers/exynos_mct: Prioritise Arm arch timer on arm64 hrtimer: Unbreak hrtimer_force_reprogram() hrtimer: Use raw_cpu_ptr() in clock_was_set() hrtimer: Avoid more SMP function calls in clock_was_set() hrtimer: Avoid unnecessary SMP function calls in clock_was_set() hrtimer: Add bases argument to clock_was_set() time/timekeeping: Avoid invoking clock_was_set() twice timekeeping: Distangle resume and clock-was-set events timerfd: Provide timerfd_resume() hrtimer: Force clock_was_set() handling for the HIGHRES=n, NOHZ=y case hrtimer: Ensure timerfd notification for HIGHRES=n hrtimer: Consolidate reprogramming code ...	2021-08-30 15:31:33 -07:00
Linus Torvalds	5d3c0db459	Scheduler changes for v5.15 are: - The biggest change in this cycle is scheduler support for asymmetric scheduling affinity, to support the execution of legacy 32-bit tasks on AArch32 systems that also have 64-bit-only CPUs. Architectures can fill in this functionality by defining their own task_cpu_possible_mask(p). When this is done, the scheduler will make sure the task will only be scheduled on CPUs that support it. (The actual arm64 specific changes are not part of this tree.) For other architectures there will be no change in functionality. - Add cgroup SCHED_IDLE support - Increase node-distance flexibility & delay determining it until a CPU is brought online. (This enables platforms where node distance isn't final until the CPU is only.) - Deadline scheduler enhancements & fixes - Misc fixes & cleanups. Signed-off-by: Ingo Molnar <mingo@kernel.org> -----BEGIN PGP SIGNATURE----- iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmEsrDgRHG1pbmdvQGtl cm5lbC5vcmcACgkQEnMQ0APhK1gMxBAAmzXPnDm1pDBBUaEwc+DynNGHNxZcBO5E CaNyfywp4GMA+OC3JzUgDg1B9uvKQRdBGtv6SZ8OcyhJMfmkEvjt5/wYUrcdtQVP TA2lt80/Is8LQMnvcz7X0gmsLt+fXWQTF8ik1KT4wsi/k03Xw8BH11zHct6sV2QN NNQ+7BEjqU1HA1UXJFiaoGtWF0gdh29VyE5dSzfAis79L0XUQadS512LJKin/AK0 wYz8E+L7QIrjhfX9FQdOrR6da4TK6jAXyEY6a9dpaMHnFdtxuwhT4/BPtovNTeeY yxEZm3qSZbpghWHsMEa6Z4GIeLE6aNi3wcHt10fgdZDdotSRsNZuF6gi4A8nhRC+ 6wm+fCcFGEIBCL6eE/16Wms6YMdFfuiEAgtJGNy7GGyfH3/mS6u8eylXbLZncYXn DFHY+xUvmVZSzoPzcnYXEy4FB3kywNL7WBFxyhdXf5/EvWmmtHi4K3jVQ8jaqvhL MDk3NX9Hd0ariff3zUltWhMY5ouj6bIbBZmWWnD3s1xQT68VvE563cq0qH15dlnr j5M71eNRWvoOdZKzflgjRZzmdQtsZQ51tiMA6W6ZRfwYkHjb70qiia0r5GFf41X1 MYelmcaA8+RjKrQ5etxzzDjoXl0xDXiZric6gRQHjG1Y1Zm2rVaoD+vkJGD5TQJ0 2XTOGQgAxh4= =VdGE -----END PGP SIGNATURE----- Merge tag 'sched-core-2021-08-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: - The biggest change in this cycle is scheduler support for asymmetric scheduling affinity, to support the execution of legacy 32-bit tasks on AArch32 systems that also have 64-bit-only CPUs. Architectures can fill in this functionality by defining their own task_cpu_possible_mask(p). When this is done, the scheduler will make sure the task will only be scheduled on CPUs that support it. (The actual arm64 specific changes are not part of this tree.) For other architectures there will be no change in functionality. - Add cgroup SCHED_IDLE support - Increase node-distance flexibility & delay determining it until a CPU is brought online. (This enables platforms where node distance isn't final until the CPU is only.) - Deadline scheduler enhancements & fixes - Misc fixes & cleanups. * tag 'sched-core-2021-08-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (27 commits) eventfd: Make signal recursion protection a task bit sched/fair: Mark tg_is_idle() an inline in the !CONFIG_FAIR_GROUP_SCHED case sched: Introduce dl_task_check_affinity() to check proposed affinity sched: Allow task CPU affinity to be restricted on asymmetric systems sched: Split the guts of sched_setaffinity() into a helper function sched: Introduce task_struct::user_cpus_ptr to track requested affinity sched: Reject CPU affinity changes based on task_cpu_possible_mask() cpuset: Cleanup cpuset_cpus_allowed_fallback() use in select_fallback_rq() cpuset: Honour task_cpu_possible_mask() in guarantee_online_cpus() cpuset: Don't use the cpu_possible_mask as a last resort for cgroup v1 sched: Introduce task_cpu_possible_mask() to limit fallback rq selection sched: Cgroup SCHED_IDLE support sched/topology: Skip updating masks for non-online nodes sched: Replace deprecated CPU-hotplug functions. sched: Skip priority checks with SCHED_FLAG_KEEP_PARAMS sched: Fix UCLAMP_FLAG_IDLE setting sched/deadline: Fix missing clock update in migrate_task_rq_dl() sched/fair: Avoid a second scan of target in select_idle_cpu sched/fair: Use prev instead of new target as recent_used_cpu sched: Don't report SCHED_FLAG_SUGOV in sched_getattr() ...	2021-08-30 13:42:10 -07:00
Linus Torvalds	6f01c935d9	File locking changes for v5.15. -----BEGIN PGP SIGNATURE----- iQJHBAABCAAxFiEES8DXskRxsqGE6vXTAA5oQRlWghUFAmEo3WcTHGpsYXl0b25A a2VybmVsLm9yZwAKCRAADmhBGVaCFbnuD/9Vj3dnXRgZ9LHFuQHp5Vy5yaGfujwP kIMN7bJiL0pCZ1zHapyn0cidZuTScDK2qMzGVR9Wrj/uYHEI+T08IEfaq/2CU3pS PdygHgqQi+WtJj4dks1OphVdlF0+46B/zy2YY0LE39zFUDO38VUIgb/gTiQ8JHIr BaDGtohlXmK8bDXo2XqMunSp7uQoRX92mv+bPjvIpsM60RnhPzhX0HeO/xhXO1PP OpswQG3nOTX1alZOXuSKdH7e3zfO+pLnC5NYeo5R4rys5xRiShU19OSAJfiPwf2w 06OX7uLT7aF5LPQPjOP4MA9dHPWWeIuYnKzUpQZZ+D+zjuaxhF62saJG1Um0yQns qqPZgTWhxzkJ16bj94T2UZW0bufFKM5cEfBPFw9ZyzQshX5jIE2V5ecV+F/AdtTj EA55m/5N8NZMsNgnEOqn6o8TkWoV1seHKJXqja0+ZY9/Xs8GkmjHLhjNBCwGZIbw 8S64Szp+yex3m1mRS+L6T0Gudic5WdFxOKQAzvvJcmT6u+ZGBzU0FjNTNNdnu3tp f9mANYxT1vNfYplRuNhMCp9oioSRpO2vfDQLCBRtsCy0QlKCum6qb7E/BcWRlwn1 SOAFmjeXpr+VXGX+Rjp/bzCdHlNOVbR4ODWJ+h5D7QZaQwQ6FafkEq1D2yjPR0OR i722ECoM++vZ4w== =CfcC -----END PGP SIGNATURE----- Merge tag 'locks-v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux Pull file locking updates from Jeff Layton: "This starts with a couple of fixes for potential deadlocks in the fowner/fasync handling. The next patch removes the old mandatory locking code from the kernel altogether. The last patch cleans up rw_verify_area a bit more after the mandatory locking removal" * tag 'locks-v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux: fs: clean up after mandatory file locking support removal fs: remove mandatory file locking support fcntl: fix potential deadlock for &fasync_struct.fa_lock fcntl: fix potential deadlocks for &fown_struct.lock	2021-08-30 12:38:13 -07:00
Linus Torvalds	aa99f3c2b9	\n -----BEGIN PGP SIGNATURE----- iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAmEmTZcACgkQnJ2qBz9k QNkkmAgArW6XoF1CePds/ZaC9vfg/nk66/zVo0n+J8xXjMWAPxcKbWFfV0uWVixq yk4lcLV47a2Mu/B/1oLNd3vrSmhwU+srWqNwOFn1nv+lP/6wJqr8oztRHn/0L9Q3 ZSRrukSejbQ6AvTL/WzTNnCjjCc2ne3Kyko6W41aU6uyJuzhSM32wbx7qlV6t54Z iint9OrB4gM0avLohNafTUq6I+tEGzBMNwpCG/tqCmkcvDcv3rTDVAnPSCTm0Tx2 hdrYDcY/rLxo93pDBaW1rYA/fohR+mIVye6k2TjkPAL6T1x+rxeT5qnc+YijH5yF sFPDhlD+ZsfOLi8stWXLOJ+8+gLODg== =pDBR -----END PGP SIGNATURE----- Merge tag 'hole_punch_for_v5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull fs hole punching vs cache filling race fixes from Jan Kara: "Fix races leading to possible data corruption or stale data exposure in multiple filesystems when hole punching races with operations such as readahead. This is the series I was sending for the last merge window but with your objection fixed - now filemap_fault() has been modified to take invalidate_lock only when we need to create new page in the page cache and / or bring it uptodate" * tag 'hole_punch_for_v5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: filesystems/locking: fix Malformed table warning cifs: Fix race between hole punch and page fault ceph: Fix race between hole punch and page fault fuse: Convert to using invalidate_lock f2fs: Convert to using invalidate_lock zonefs: Convert to using invalidate_lock xfs: Convert double locking of MMAPLOCK to use VFS helpers xfs: Convert to use invalidate_lock xfs: Refactor xfs_isilocked() ext2: Convert to using invalidate_lock ext4: Convert to use mapping->invalidate_lock mm: Add functions to lock invalidate_lock for two mappings mm: Protect operations adding pages to page cache with invalidate_lock documentation: Sync file_operations members with reality mm: Fix comments mentioning i_mutex	2021-08-30 10:24:50 -07:00
Trond Myklebust	8cfb901528	NFS: Always provide aligned buffers to the RPC read layers Instead of messing around with XDR padding in the RDMA layer, we should just give the RPC layer an aligned buffer. Try to avoid creating extra RPC calls by aligning to the smaller value of ALIGN(len, rsize) and PAGE_SIZE. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2021-08-30 13:21:38 -04:00
Linus Torvalds	a1ca8e7147	\n -----BEGIN PGP SIGNATURE----- iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAmEmPKcACgkQnJ2qBz9k QNk6RwgA2GJlJ6g2I/iT64ii4l8nWVEyp1OI/m2DvEaoci+SLp/oIh/VbEMsdj7D Fvoajo6KRu53/Ucwg+u9i5tygpCcu5X52DB1qZWzNZKxl98EdXRTb9e6n3jNDBti dz4EAd/YZhADiS5sX9kZLbSE1Kzirx8ULsftZ3T31rkU9n1pJD6tH7qyHDot3oV+ XnONRBb2coJ5E7qeRqa8T35uG9a4JGgnDICg3qISPXn+ZNX4u61vXQu/r2YzTGn5 0b6G3X8YTCLti4w14/eGm9bMw3J5TLqRDg/d73B34ju6xcKPP83j/UmHFyxpNnGD mg6mi8PyeEiwPbglSdVBaFQoOroK9Q== =Hket -----END PGP SIGNATURE----- Merge tag 'fs_for_v5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull UDF and isofs updates from Jan Kara: "Several smaller fixes and cleanups in UDF and isofs" * tag 'fs_for_v5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: udf_get_extendedattr() had no boundary checks. isofs: joliet: Fix iocharset=utf8 mount option udf: Fix iocharset=utf8 mount option udf: Get rid of 0-length arrays in struct fileIdentDesc udf: Get rid of 0-length arrays udf: Remove unused declaration udf: Check LVID earlier	2021-08-30 10:18:07 -07:00
Linus Torvalds	63b0c40339	\n -----BEGIN PGP SIGNATURE----- iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAmEmO6kACgkQnJ2qBz9k QNmP4wf+N4n3pDgCfEBPTFBMzghIqLfFSdRx713cj7u6Bi7qwZB90ChdgGC/J2lh KVh9KvzneSqo4CoaYYLECDkM9cLGZuX0YEWX7L0+LivEoIVuoFK0OKDcyptKl5FK fKi5tK287NlgXh9PqxuPlYUsaxQKh7PSqO4+9IgHFX/P7NMbxdRiGg8Nb+rp9XhP IDzfR/qNqlyjYHJUae5GuE2hsbX8lFAkSnmQiz8ya6c3IZdc8GmaVL1jJI+ynre/ CMHysDV2QPS5UcMjO2YnoG3Osl18TOTsutnCDPzYI2qr4UGB+4RxfCgEVVdc6F3v p4VfqfJB09sHpkRBrN9Lz+MgxxkVkQ== =CRIk -----END PGP SIGNATURE----- Merge tag 'fiemap_for_v5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull FIEMAP cleanups from Jan Kara: "FIEMAP cleanups from Christoph transitioning all remaining filesystems supporting FIEMAP (ext2, hpfs) to iomap API and removing the old helper" * tag 'fiemap_for_v5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: fs: remove generic_block_fiemap hpfs: use iomap_fiemap to implement ->fiemap ext2: use iomap_fiemap to implement ->fiemap ext2: make ext2_iomap_ops available unconditionally	2021-08-30 10:13:02 -07:00
Chao Yu	f7db8dd698	f2fs: enable realtime discard iff device supports discard Let's only enable realtime discard if and only if device supports discard functionality. Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2021-08-30 10:12:50 -07:00
Jaegeuk Kim	dddd3d6529	f2fs: guarantee to write dirty data when enabling checkpoint back We must flush all the dirty data when enabling checkpoint back. Let's guarantee that first by adding a retry logic on sync_inodes_sb(). In addition to that, this patch adds to flush data in fsync when checkpoint is disabled, which can mitigate the sync_inodes_sb() failures in advance. Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2021-08-30 10:12:47 -07:00
Chao Yu	c8dc3047c4	f2fs: fix to unmap pages from userspace process in punch_hole() We need to unmap pages from userspace process before removing pagecache in punch_hole() like we did in f2fs_setattr(). Similar change: commit `5e44f8c374` ("ext4: hole-punch use truncate_pagecache_range") Fixes: `fbfa2cc58d` ("f2fs: add file operations") Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2021-08-30 10:12:47 -07:00
Chao Yu	adf9ea89c7	f2fs: fix unexpected ENOENT comes from f2fs_map_blocks() In below path, it will return ENOENT if filesystem is shutdown: - f2fs_map_blocks - f2fs_get_dnode_of_data - f2fs_get_node_page - __get_node_page - read_node_page - is_sbi_flag_set(sbi, SBI_IS_SHUTDOWN) return -ENOENT - force return value from ENOENT to 0 It should be fine for read case, since it indicates a hole condition, and caller could use .m_next_pgofs to skip the hole and continue the lookup. However it may cause confusing for write case, since leaving a hole there, and said nothing was wrong doesn't help. There is at least one case from dax_iomap_actor() will complain that, so fix this in prior to supporting dax in f2fs. xfstest generic/388 reports below warning: ubuntu godown: xfstests-induced forced shutdown of /mnt/scratch_f2fs: ------------[ cut here ]------------ WARNING: CPU: 0 PID: 485833 at fs/dax.c:1127 dax_iomap_actor+0x339/0x370 Call Trace: iomap_apply+0x1c4/0x7b0 ? dax_iomap_rw+0x1c0/0x1c0 dax_iomap_rw+0xad/0x1c0 ? dax_iomap_rw+0x1c0/0x1c0 f2fs_file_write_iter+0x5ab/0x970 [f2fs] do_iter_readv_writev+0x273/0x2e0 do_iter_write+0xab/0x1f0 vfs_iter_write+0x21/0x40 iter_file_splice_write+0x287/0x540 do_splice+0x37c/0xa60 __x64_sys_splice+0x15f/0x3a0 do_syscall_64+0x3b/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xae ubuntu godown: xfstests-induced forced shutdown of /mnt/scratch_f2fs: ------------[ cut here ]------------ RIP: 0010:dax_iomap_pte_fault.isra.0+0x72e/0x14a0 Call Trace: dax_iomap_fault+0x44/0x70 f2fs_dax_huge_fault+0x155/0x400 [f2fs] f2fs_dax_fault+0x18/0x30 [f2fs] __do_fault+0x4e/0x120 do_fault+0x3cf/0x7a0 __handle_mm_fault+0xa8c/0xf20 ? find_held_lock+0x39/0xd0 handle_mm_fault+0x1b6/0x480 do_user_addr_fault+0x320/0xcd0 ? rcu_read_lock_sched_held+0x67/0xc0 exc_page_fault+0x77/0x3f0 ? asm_exc_page_fault+0x8/0x30 asm_exc_page_fault+0x1e/0x30 Fixes: `83a3bfdb5a` ("f2fs: indicate shutdown f2fs to allow unmount successfully") Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2021-08-30 10:12:47 -07:00
Chao Yu	ad126ebdde	f2fs: fix to account missing .skipped_gc_rwsem There is a missing place we forgot to account .skipped_gc_rwsem, fix it. Fixes: `6f8d445506` ("f2fs: avoid fi->i_gc_rwsem[WRITE] lock in f2fs_gc") Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2021-08-30 10:12:47 -07:00
Chao Yu	d75da8c8a4	f2fs: adjust unlock order for cleanup This patch adjusts unlock order of .i_mmap_sem and .i_gc_rwsem for cleanup. Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2021-08-30 10:12:47 -07:00
Fengnan Chang	4d67490498	f2fs: Don't create discard thread when device doesn't support realtime discard Don't create discard thread when device doesn't support realtime discard or user specifies nodiscard mount option. Signed-off-by: Fengnan Chang <changfengnan@vivo.com> Signed-off-by: Yangtao Li <frank.li@vivo.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2021-08-30 10:12:46 -07:00
Linus Torvalds	3513431926	\n -----BEGIN PGP SIGNATURE----- iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAmEmOoUACgkQnJ2qBz9k QNm8XggA6cbU79idj2HgDG5Wj4X3cd1+NlpPP2HytvTomwqYHJmHPAg0b44AsfHT ToHeYhIqb78MbLHn8fl4g5+O01raqihbUyIcRNcgdeQcle7Phuv/hDlsr8EOhfea mec4vottwaoiNn+4o7ciUAIri0stbVL8VPpnOy0X7iCmX7OC2vUvhQQGwvyfmXZm +F/8jppfvfsxl73SbWhG+sgD83D4ASZ+BGwC1Fx9tHitOVTJzg+CR2C1J4seaK6E r8+u/K6YKSrCGgA2dOVgzYyp2vHPGbHS4Z3H1HDXS+mtqxHJYqPR54kRT1BqCEQX Nm4CP7+u6a+86HLpBZi1NhvYrgLTig== =LJaZ -----END PGP SIGNATURE----- Merge tag 'fsnotify_for_v5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull fsnotify updates from Jan Kara: "fsnotify speedups when notification actually isn't used and support for identifying processes which caused fanotify events through pidfd instead of normal pid" * tag 'fsnotify_for_v5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: fsnotify: optimize the case of no marks of any type fsnotify: count all objects with attached connectors fsnotify: count s_fsnotify_inode_refs for attached connectors fsnotify: replace igrab() with ihold() on attach connector fanotify: add pidfd support to the fanotify API fanotify: introduce a generic info record copying helper fanotify: minor cosmetic adjustments to fid labels kernel/pid.c: implement additional checks upon pidfd_create() parameters kernel/pid.c: remove static qualifier from pidfd_create()	2021-08-30 10:04:31 -07:00
Kari Argillander	e8b8e97f91	fs/ntfs3: Restyle comments to better align with kernel-doc Capitalize comments and end with period for better reading. Also function comments are now little more kernel-doc style. This way we can easily convert them to kernel-doc style if we want. Note that these are not yet complete with this style. Example function comments start with /* and in kernel-doc style they start /**. Use imperative mood in function descriptions. Change words like ntfs -> NTFS, linux -> Linux. Use "we" not "I" when commenting code. Signed-off-by: Kari Argillander <kari.argillander@gmail.com> Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>	2021-08-30 18:39:14 +03:00
Jens Axboe	87df7fb922	io-wq: fix wakeup race when adding new work When new work is added, io_wqe_enqueue() checks if we need to wake or create a new worker. But that check is done outside the lock that otherwise synchronizes us with a worker going to sleep, so we can end up in the following situation: CPU0 CPU1 lock insert work unlock atomic_read(nr_running) != 0 lock atomic_dec(nr_running) no wakeup needed Hold the wqe lock around the "need to wakeup" check. Then we can also get rid of the temporary work_flags variable, as we know the work will remain valid as long as we hold the lock. Cc: stable@vger.kernel.org Reported-by: Andres Freund <andres@anarazel.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-30 07:45:47 -06:00
Jens Axboe	a9a4aa9fbf	io-wq: wqe and worker locks no longer need to be IRQ safe io_uring no longer queues async work off completion handlers that run in hard or soft interrupt context, and that use case was the only reason that io-wq had to use IRQ safe locks for wqe and worker locks. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-30 07:32:17 -06:00
Jens Axboe	ecc53c48c1	io-wq: check max_worker limits if a worker transitions bound state For the two places where new workers are created, we diligently check if we are allowed to create a new worker. If we're currently at the limit of how many workers of a given type we can have, then we don't create any new ones. If you have a mixed workload with various types of bound and unbounded work, then it can happen that a worker finishes one type of work and is then transitioned to the other type. For this case, we don't check if we are actually allowed to do so. This can cause io-wq to temporarily exceed the allowed number of workers for a given type. When retrieving work, check that the types match. If they don't, check if we are allowed to transition to the other type. If not, then don't handle the new work. Cc: stable@vger.kernel.org Reported-by: Johannes Lundberg <johalun0@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-30 07:28:19 -06:00
Pavel Begunkov	f1042b6ccb	io_uring: allow updating linked timeouts We allow updating normal timeouts, add support for adjusting timings of linked timeouts as well. Reported-by: Victor Stewart <v@nametag.social> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-29 16:12:21 -06:00
Pavel Begunkov	ef9dd63708	io_uring: keep ltimeouts in a list A preparation patch. Keep all queued linked timeout in a list, so they may be found and updated. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-29 16:12:11 -06:00
Jens Axboe	50c1df2b56	io_uring: support CLOCK_BOOTTIME/REALTIME for timeouts Certain use cases want to use CLOCK_BOOTTIME or CLOCK_REALTIME rather than CLOCK_MONOTONIC, instead of the default CLOCK_MONOTONIC. Add an IORING_TIMEOUT_BOOTTIME and IORING_TIMEOUT_REALTIME flag that allows timeouts and linked timeouts to use the selected clock source. Only one clock source may be selected, and we -EINVAL the request if more than one is given. If neither BOOTIME nor REALTIME are selected, the previous default of MONOTONIC is used. Link: https://github.com/axboe/liburing/issues/369 Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-29 07:57:23 -06:00
Jens Axboe	2e480058dd	io-wq: provide a way to limit max number of workers io-wq divides work into two categories: 1) Work that completes in a bounded time, like reading from a regular file or a block device. This type of work is limited based on the size of the SQ ring. 2) Work that may never complete, we call this unbounded work. The amount of workers here is just limited by RLIMIT_NPROC. For various uses cases, it's handy to have the kernel limit the maximum amount of pending workers for both categories. Provide a way to do with with a new IORING_REGISTER_IOWQ_MAX_WORKERS operation. IORING_REGISTER_IOWQ_MAX_WORKERS takes an array of two integers and sets the max worker count to what is being passed in for each category. The old values are returned into that same array. If 0 is being passed in for either category, it simply returns the current value. The value is capped at RLIMIT_NPROC. This actually isn't that important as it's more of a hint, if we're exceeding the value then our attempt to fork a new worker will fail. This happens naturally already if more than one node is in the system, as these values are per-node internally for io-wq. Reported-by: Johannes Lundberg <johalun0@gmail.com> Link: https://github.com/axboe/liburing/issues/420 Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-29 07:55:55 -06:00
Thomas Gleixner	b542e383d8	eventfd: Make signal recursion protection a task bit The recursion protection for eventfd_signal() is based on a per CPU variable and relies on the !RT semantics of spin_lock_irqsave() for protecting this per CPU variable. On RT kernels spin_lock_irqsave() neither disables preemption nor interrupts which allows the spin lock held section to be preempted. If the preempting task invokes eventfd_signal() as well, then the recursion warning triggers. Paolo suggested to protect the per CPU variable with a local lock, but that's heavyweight and actually not necessary. The goal of this protection is to prevent the task stack from overflowing, which can be achieved with a per task recursion protection as well. Replace the per CPU variable with a per task bit similar to other recursion protection bits like task_struct::in_page_owner. This works on both !RT and RT kernels and removes as a side effect the extra per CPU storage. No functional change for !RT kernels. Reported-by: Daniel Bristot de Oliveira <bristot@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Daniel Bristot de Oliveira <bristot@redhat.com> Acked-by: Jason Wang <jasowang@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Link: https://lore.kernel.org/r/87wnp9idso.ffs@tglx	2021-08-28 01:33:02 +02:00
Olga Kornievskaia	2a7a451a90	NFSv4.1 add network transport when session trunking is detected After trunking is discovered in nfs4_discover_server_trunking(), add the transport to the old client structure if the allowed limit of transports has not been reached. An example: there exists a multi-homed server and client mounts one server address and some volume and then doest another mount to a different address of the same server and perhaps a different volume. Previously, the client checks that this is a session trunkable servers (same server), and removes the newly created client structure along with its transport. Now, the client adds the connection from the 2nd mount into the xprt switch of the existing client (it leads to having 2 available connections). Signed-off-by: Olga Kornievskaia <kolga@netapp.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2021-08-27 16:37:41 -04:00
Olga Kornievskaia	dc48e0abee	SUNRPC enforce creation of no more than max_connect xprts If we are adding new transports via rpc_clnt_test_and_add_xprt() then check if we've reached the limit. Currently only pnfs path adds transports via that function but this is done in preparation when the client would add new transports when session trunking is detected. A warning is logged if the limit is reached. Signed-off-by: Olga Kornievskaia <kolga@netapp.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2021-08-27 16:37:29 -04:00
Olga Kornievskaia	7e134205f6	NFSv4 introduce max_connect mount options This option will control up to how many xprts can the client establish to the server with a distinct address (that means nconnect connections are not counted towards this new limit). This patch is setting up nfs structures to keeep track of the max_connect limit (does not enforce it). The default value is kept at 1 so that no current mounts that don't want any additional connections would be effected. The maximum value is set at 16. Mounts to DS are not limited to default value of 1 but instead set to the maximum default value of 16 (NFS_MAX_TRANSPORTS). Signed-off-by: Olga Kornievskaia <kolga@netapp.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2021-08-27 16:37:17 -04:00
Ye Bin	79d534f8cb	NFSv3: Delete duplicate judgement in nfs3_async_handle_jukebox As `eb96d5c97b` ("SUNRPC handle EKEYEXPIRED in call_refreshresult") commit handle EKEYEXPIRED in call_refreshresult, so there is only handle when "task->tk_status" is equal "-EJUKEBOX" in nfs3_async_handle_jukebox. Signed-off-by: Ye Bin <yebin10@huawei.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2021-08-27 16:36:21 -04:00
Namjae Jeon	7d5d8d7156	ksmbd: fix __write_overflow warning in ndr_read_string Dan reported __write_overflow warning in ndr_read_string. CC [M] fs/ksmbd/ndr.o In file included from ./include/linux/string.h:253, from ./include/linux/bitmap.h:11, from ./include/linux/cpumask.h:12, from ./arch/x86/include/asm/cpumask.h:5, from ./arch/x86/include/asm/msr.h:11, from ./arch/x86/include/asm/processor.h:22, from ./arch/x86/include/asm/cpufeature.h:5, from ./arch/x86/include/asm/thread_info.h:53, from ./include/linux/thread_info.h:60, from ./arch/x86/include/asm/preempt.h:7, from ./include/linux/preempt.h:78, from ./include/linux/spinlock.h:55, from ./include/linux/wait.h:9, from ./include/linux/wait_bit.h:8, from ./include/linux/fs.h:6, from fs/ksmbd/ndr.c:7: In function memcpy, inlined from ndr_read_string at fs/ksmbd/ndr.c:86:2, inlined from ndr_decode_dos_attr at fs/ksmbd/ndr.c:167:2: ./include/linux/fortify-string.h:219:4: error: call to __write_overflow declared with attribute error: detected write beyond size of object __write_overflow(); ^~~~~~~~~~~~~~~~~~ This seems to be a false alarm because hex_attr size is always smaller than n->length. This patch fix this warning by allocation hex_attr with n->length. Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2021-08-27 14:03:49 -05:00
Pavel Begunkov	90499ad00c	io_uring: add build check for buf_index overflows req->buf_index is u16 and so we rely on registered buffers indexes fitting into it. Add a build check, so when the upper limit for the number of buffers is lifted we get a compliation fail but not lurking problems. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/787e8e1a17cea51ca6301426b1c4c4887b8bd676.1629920396.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-27 09:23:11 -06:00
Pavel Begunkov	b18a1a4574	io_uring: clarify io_req_task_cancel() locking It's too easy to forget and misjudge about synchronisation in io_req_task_cancel(), add a comment clarifying it. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/71099083835f983a1fd73d5a3da6391924da8300.1629920396.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-27 09:23:11 -06:00
Dan Carpenter	b8155e95de	fs/ntfs3: Fix error handling in indx_insert_into_root() There are three bugs in this code: 1) If indx_get_root() fails, then return -EINVAL instead of success. 2) On the "/* make root external */" -EOPNOTSUPP; error path it should free "re" but it has a memory leak. 3) If indx_new() fails then it will lead to an error pointer dereference when we call put_indx_node(). I've re-written the error handling to be more clear. Fixes: `82cae269cf` ("fs/ntfs3: Add initialization of super block") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Kari Argillander <kari.argillander@gmail.com> Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>	2021-08-27 17:05:14 +03:00
Dan Carpenter	8c83a4851d	fs/ntfs3: Potential NULL dereference in hdr_find_split() The "e" pointer is dereferenced before it has been checked for NULL. Move the dereference after the NULL check to prevent an Oops. Fixes: `82cae269cf` ("fs/ntfs3: Add initialization of super block") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Kari Argillander <kari.argillander@gmail.com> Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>	2021-08-27 17:05:14 +03:00
Dan Carpenter	04810f000a	fs/ntfs3: Fix error code in indx_add_allocate() Return -EINVAL if ni_find_attr() fails. Don't return success. Fixes: `82cae269cf` ("fs/ntfs3: Add initialization of super block") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Kari Argillander <kari.argillander@gmail.com> Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>	2021-08-27 17:05:14 +03:00
Dan Carpenter	2926e42970	fs/ntfs3: fix an error code in ntfs_get_acl_ex() The ntfs_get_ea() function returns negative error codes or on success it returns the length. In the original code a zero length return was treated as -ENODATA and results in a NULL return. But it should be treated as an invalid length and result in an PTR_ERR(-EINVAL) return. Fixes: `be71b5cba2` ("fs/ntfs3: Add attrib operations") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>	2021-08-27 17:05:13 +03:00
Dan Carpenter	a1b04d380a	fs/ntfs3: add checks for allocation failure Add a check for when the kzalloc() in init_rsttbl() fails. Some of the callers checked for NULL and some did not. I went down the call tree and added NULL checks where ever they were missing. Fixes: `b46acd6a6a` ("fs/ntfs3: Add NTFS journal") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Kari Argillander <kari.argillander@gmail.com> Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>	2021-08-27 17:05:13 +03:00
Kari Argillander	345482bc43	fs/ntfs3: Use kcalloc/kmalloc_array over kzalloc/kmalloc Use kcalloc/kmalloc_array over kzalloc/kmalloc when we allocate array. Checkpatch found these after we did not use our own defined allocation wrappers. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Kari Argillander <kari.argillander@gmail.com> Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>	2021-08-27 17:05:13 +03:00
Kari Argillander	195c52bdd5	fs/ntfs3: Do not use driver own alloc wrappers Problem with these wrapper is that we cannot take off example GFP_NOFS flag. It is not recomended use those in all places. Also if we change one driver specific wrapper to kernel wrapper then it would look really weird. People should be most familiar with kernel wrappers so let's just use those ones. Driver specific alloc wrapper also confuse some static analyzing tools, good example is example kernels checkpatch tool. After we converter these to kernel specific then warnings is showed. Following Coccinelle script was used to automate changing. virtual patch @alloc depends on patch@ expression x; expression y; @@ ( - ntfs_malloc(x) + kmalloc(x, GFP_NOFS) \| - ntfs_zalloc(x) + kzalloc(x, GFP_NOFS) \| - ntfs_vmalloc(x) + kvmalloc(x, GFP_NOFS) \| - ntfs_free(x) + kfree(x) \| - ntfs_vfree(x) + kvfree(x) \| - ntfs_memdup(x, y) + kmemdup(x, y, GFP_NOFS) ) Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Kari Argillander <kari.argillander@gmail.com> Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>	2021-08-27 17:05:12 +03:00
Kari Argillander	fa3cacf544	fs/ntfs3: Use kernel ALIGN macros over driver specific The static checkers (Smatch) were complaining because QuadAlign() was buggy. If you try to align something higher than UINT_MAX it got truncated to a u32. Smatch warning was: fs/ntfs3/attrib.c:383 attr_set_size_res() warn: was expecting a 64 bit value instead of '~7' So that this will not happen again we will change all these macros to kernel made ones. This can also help some other static analyzing tools to give us better warnings. Patch was generated with Coccinelle script and after that some style issue was hand fixed. Coccinelle script: virtual patch @alloc depends on patch@ expression x; @@ ( - #define QuadAlign(n) (((n) + 7u) & (~7u)) \| - QuadAlign(x) + ALIGN(x, 8) \| - #define IsQuadAligned(n) (!((size_t)(n)&7u)) \| - IsQuadAligned(x) + IS_ALIGNED(x, 8) \| - #define Quad2Align(n) (((n) + 15u) & (~15u)) \| - Quad2Align(x) + ALIGN(x, 16) \| - #define IsQuad2Aligned(n) (!((size_t)(n)&15u)) \| - IsQuad2Aligned(x) + IS_ALIGNED(x, 16) \| - #define Quad4Align(n) (((n) + 31u) & (~31u)) \| - Quad4Align(x) + ALIGN(x, 32) \| - #define IsSizeTAligned(n) (!((size_t)(n) & (sizeof(size_t) - 1))) \| - IsSizeTAligned(x) + IS_ALIGNED(x, sizeof(size_t)) \| - #define DwordAlign(n) (((n) + 3u) & (~3u)) \| - DwordAlign(x) + ALIGN(x, 4) \| - #define IsDwordAligned(n) (!((size_t)(n)&3u)) \| - IsDwordAligned(x) + IS_ALIGNED(x, 4) \| - #define WordAlign(n) (((n) + 1u) & (~1u)) \| - WordAlign(x) + ALIGN(x, 2) \| - #define IsWordAligned(n) (!((size_t)(n)&1u)) \| - IsWordAligned(x) + IS_ALIGNED(x, 2) \| ) Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Kari Argillander <kari.argillander@gmail.com> Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>	2021-08-27 17:05:12 +03:00
Kari Argillander	24516d481d	fs/ntfs3: Restyle comment block in ni_parse_reparse() First of this fix one none utf8 char in this comment block. Maybe this happened because error in filesystem ;) Also this block was hard to read because long lines so make it max 80 long. And while we doing this stuff make little better grammer. Signed-off-by: Kari Argillander <kari.argillander@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>	2021-08-27 17:05:12 +03:00
Jiapeng Chong	1263eddfea	fs/ntfs3: Remove unused including <linux/version.h> Eliminate the follow versioncheck warning: ./fs/ntfs3/inode.c: 16 linux/version.h not needed. Reported-by: Abaci Robot <abaci@linux.alibaba.com> Fixes: `82cae269cf` ("fs/ntfs3: Add initialization of super block") Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com> Reviewed-by: Kari Argillander <kari.argillander@gmail.com> Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>	2021-08-27 17:05:11 +03:00
Gustavo A. R. Silva	abfeb2ee21	fs/ntfs3: Fix fall-through warnings for Clang Fix the following fallthrough warnings: fs/ntfs3/inode.c:1792:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough] fs/ntfs3/index.c:178:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough] This helps with the ongoing efforts to globally enable -Wimplicit-fallthrough for Clang. Link: https://github.com/KSPP/linux/issues/115 Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Reviewed-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>	2021-08-27 17:05:11 +03:00
Kari Argillander	be87e821fd	fs/ntfs3: Fix one none utf8 char in source file In one source file there is for some reason non utf8 char. But hey this is fs development so this kind of thing might happen. Signed-off-by: Kari Argillander <kari.argillander@gmail.com> Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>	2021-08-27 17:05:11 +03:00
Nathan Chancellor	8c01308b6d	fs/ntfs3: Remove unused variable cnt in ntfs_security_init() Clang warns: fs/ntfs3/fsntfs.c:1874:9: warning: variable 'cnt' set but not used [-Wunused-but-set-variable] size_t cnt, off; ^ 1 warning generated. It is indeed unused so remove it. Fixes: `82cae269cf` ("fs/ntfs3: Add initialization of super block") Signed-off-by: Nathan Chancellor <nathan@kernel.org> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> Reviewed-by: Kari Argillander <kari.argillander@gmail.com> Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>	2021-08-27 17:05:10 +03:00
Colin Ian King	71eeb6ace8	fs/ntfs3: Fix integer overflow in multiplication The multiplication of the u32 data_size with a int is being performed using 32 bit arithmetic however the results is being assigned to the variable nbits that is a size_t (64 bit) value. Fix a potential integer overflow by casting the u32 value to a size_t before the multiply to use a size_t sized bit multiply operation. Addresses-Coverity: ("Unintentional integer overflow") Fixes: `82cae269cf` ("fs/ntfs3: Add initialization of super block") Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>	2021-08-27 17:05:10 +03:00
Kari Argillander	87790b6534	fs/ntfs3: Add ifndef + define to all header files Add guards so that compiler will only include header files once. Signed-off-by: Kari Argillander <kari.argillander@gmail.com> Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>	2021-08-27 17:05:10 +03:00
Kari Argillander	528c9b3d1e	fs/ntfs3: Use linux/log2 is_power_of_2 function We do not need our own implementation for this function in this driver. It is much better to use generic one. Signed-off-by: Kari Argillander <kari.argillander@gmail.com> Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>	2021-08-27 17:05:09 +03:00
Colin Ian King	f8d87ed9f0	fs/ntfs3: Fix various spelling mistakes There is a spelling mistake in a ntfs_err error message. Also fix various spelling mistakes in comments. Signed-off-by: Colin Ian King <colin.king@canonical.com> Reviewed-by: Kari Argillander <kari.argillander@gmail.com> Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>	2021-08-27 17:04:45 +03:00
Pavel Begunkov	9a10867ae5	io_uring: add task-refs-get helper As we have a more complicated task referencing, which apart from normal task references includes taking tctx->inflight and caching all that, it would be a good idea to have all that isolated in helpers. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/d9114d037f1c195897aa13f38a496078eca2afdb.1630023531.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-27 07:29:41 -06:00
Hao Xu	a8295b982c	io_uring: fix failed linkchain code logic Given a linkchain like this: req0(link_flag)-->req1(link_flag)-->...-->reqn(no link_flag) There is a problem: - if some intermediate linked req like req1 's submittion fails, reqs after it won't be cancelled. - sqpoll disabled: maybe it's ok since users can get the error info of req1 and stop submitting the following sqes. - sqpoll enabled: definitely a problem, the following sqes will be submitted in the next round. The solution is to refactor the code logic to: - if a linked req's submittion fails, just mark it and the head(if it exists) as REQ_F_FAIL. Leverage req->result to indicate whether it is failed or cancelled. - submit or fail the whole chain when we come to the end of it. Signed-off-by: Hao Xu <haoxu@linux.alibaba.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/20210827094609.36052-3-haoxu@linux.alibaba.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-27 07:27:24 -06:00
Hao Xu	14afdd6ee3	io_uring: remove redundant req_set_fail() req_set_fail() in io_submit_sqe() is redundant, remove it. Signed-off-by: Hao Xu <haoxu@linux.alibaba.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/20210827094609.36052-2-haoxu@linux.alibaba.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-27 07:27:24 -06:00
David Howells	20ec197bfa	fscache: Use refcount_t for the cookie refcount instead of atomic_t Use refcount_t for the fscache_cookie refcount instead of atomic_t and rename the 'usage' member to 'ref' in such cases. The tracepoints that reference it change from showing "u=%d" to "r=%d". Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> cc: linux-cachefs@redhat.com Link: https://lore.kernel.org/r/162431204358.2908479.8006938388213098079.stgit@warthog.procyon.org.uk/	2021-08-27 13:34:03 +01:00
David Howells	33cba85922	fscache: Fix fscache_cookie_put() to not deref after dec fscache_cookie_put() accesses the cookie it has just put inside the tracepoint that monitors the change - but this is something it's not allowed to do if we didn't reduce the count to zero. Fix this by dropping most of those values from the tracepoint and grabbing the cookie debug ID before doing the dec. Also take the opportunity to switch over the usage and where arguments on the tracepoint to put the reason last. Fixes: `a18feb5576` ("fscache: Add tracepoints") Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> cc: linux-cachefs@redhat.com Link: https://lore.kernel.org/r/162431203107.2908479.3259582550347000088.stgit@warthog.procyon.org.uk/	2021-08-27 13:34:02 +01:00
David Howells	35b72573e9	fscache: Fix cookie key hashing The current hash algorithm used for hashing cookie keys is really bad, producing almost no dispersion (after a test kernel build, ~30000 files were split over just 18 out of the 32768 hash buckets). Borrow the full_name_hash() hash function into fscache to do the hashing for cookie keys and, in the future, volume keys. I don't want to use full_name_hash() as-is because I want the hash value to be consistent across arches and over time as the hash value produced may get used on disk. I can also optimise parts of it away as the key will always be a padded array of aligned 32-bit words. Fixes: `ec0328e46d` ("fscache: Maintain a catalogue of allocated cookies") Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> cc: linux-cachefs@redhat.com Link: https://lore.kernel.org/r/162431201844.2908479.8293647220901514696.stgit@warthog.procyon.org.uk/	2021-08-27 13:34:02 +01:00
David Howells	8beabdde18	cachefiles: Change %p in format strings to something else Change plain %p in format strings in cachefiles code to something more useful, since %p is now hashed before printing and thus no longer matches the contents of an oops register dump. Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> cc: linux-cachefs@redhat.com Link: https://lore.kernel.org/r/160588476042.3465195.6837847445880367183.stgit@warthog.procyon.org.uk/ # rfc Link: https://lore.kernel.org/r/162431200692.2908479.9253374494073633778.stgit@warthog.procyon.org.uk/	2021-08-27 13:34:02 +01:00
David Howells	c97a72ded9	fscache: Change %p in format strings to something else Change plain %p in format strings in fscache code to something more useful, since %p is now hashed before printing and thus no longer matches the contents of an oops register dump. Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> cc: linux-cachefs@redhat.com Link: https://lore.kernel.org/r/160588474843.3465195.5446072310069374803.stgit@warthog.procyon.org.uk/ # rfc Link: https://lore.kernel.org/r/162431199509.2908479.2950631488219944294.stgit@warthog.procyon.org.uk/	2021-08-27 13:34:02 +01:00
David Howells	58f386a73f	fscache: Remove the object list procfile Remove the object list procfile from fscache as objects will become entirely internal to the cache. Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> cc: linux-cachefs@redhat.com Link: https://lore.kernel.org/r/162431198332.2908479.5847286163455099669.stgit@warthog.procyon.org.uk/	2021-08-27 13:34:02 +01:00
David Howells	6ae9bd8bb0	fscache, cachefiles: Remove the histogram stuff Remove the histogram stuff as it's mostly going to be outdated. Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> cc: linux-cachefs@redhat.com Link: https://lore.kernel.org/r/162431195953.2908479.16770977195634296638.stgit@warthog.procyon.org.uk/	2021-08-27 13:34:02 +01:00
David Howells	884a76881f	fscache: Procfile to display cookies Add /proc/fs/fscache/cookies to display active cookies. Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> cc: linux-cachefs@redhat.com Link: https://lore.kernel.org/r/158861211871.340223.7223853943667440807.stgit@warthog.procyon.org.uk/ # rfc Link: https://lore.kernel.org/r/159465771021.1376105.6933857529128238020.stgit@warthog.procyon.org.uk/ Link: https://lore.kernel.org/r/160588460994.3465195.16963417803501149328.stgit@warthog.procyon.org.uk/ # rfc Link: https://lore.kernel.org/r/162431194785.2908479.786917990782538164.stgit@warthog.procyon.org.uk/	2021-08-27 13:34:02 +01:00
David Howells	2908f5e101	fscache: Add a cookie debug ID and use that in traces Add a cookie debug ID and use that in traces and in procfiles rather than displaying the (hashed) pointer to the cookie. This is easier to correlate and we don't lose anything when interpreting oops output since that shows unhashed addresses and registers that aren't comparable to the hashed values. Changes: ver #2: - Fix the fscache_op tracepoint to handle a NULL cookie pointer. Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> cc: linux-cachefs@redhat.com Link: https://lore.kernel.org/r/158861210988.340223.11688464116498247790.stgit@warthog.procyon.org.uk/ # rfc Link: https://lore.kernel.org/r/159465769844.1376105.14119502774019865432.stgit@warthog.procyon.org.uk/ Link: https://lore.kernel.org/r/160588459097.3465195.1273313637721852165.stgit@warthog.procyon.org.uk/ # rfc Link: https://lore.kernel.org/r/162431193544.2908479.17556704572948300790.stgit@warthog.procyon.org.uk/	2021-08-27 13:24:46 +01:00
Christoph Hellwig	bdd3c50d83	dax: remove bdev_dax_supported All callers already have a dax_device obtained from fs_dax_get_by_bdev at hand, so just pass that to dax_supported() insted of doing another lookup. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Link: https://lore.kernel.org/r/20210826135510.6293-10-hch@lst.de Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2021-08-26 16:52:03 -07:00
Christoph Hellwig	a384f088e4	xfs: factor out a xfs_buftarg_is_dax helper Refactor the DAX setup code in preparation of removing bdev_dax_supported. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Link: https://lore.kernel.org/r/20210826135510.6293-9-hch@lst.de Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2021-08-26 16:52:03 -07:00
Christoph Hellwig	6c97ec172a	fsdax: improve the FS_DAX Kconfig description and help text Rename the main option text to clarify it is for file system access, and add a bit of text that explains how to actually switch a nvdimm to a fsdax capable state. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20210826135510.6293-2-hch@lst.de Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2021-08-26 16:52:02 -07:00
Johannes Berg	1568cb0e6d	hostfs: support splice_write There's really no good reason not to, and e.g. trace-cmd currently requires it for the temporary per-CPU files. Hook up splice_write just like everyone else does. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Richard Weinberger <richard@nod.at>	2021-08-26 22:28:02 +02:00
J. Bruce Fields	0bcc7ca40b	nfsd: fix crash on LOCKT on reexported NFSv3 Unlike other filesystems, NFSv3 tries to use fl_file in the GETLK case. Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2021-08-26 15:32:29 -04:00
J. Bruce Fields	bb0a55bb71	nfs: don't allow reexport reclaims In the reexport case, nfsd is currently passing along locks with the reclaim bit set. The client sends a new lock request, which is granted if there's currently no conflict--even if it's possible a conflicting lock could have been briefly held in the interim. We don't currently have any way to safely grant reclaim, so for now let's just deny them all. I'm doing this by passing the reclaim bit to nfs and letting it fail the call, with the idea that eventually the client might be able to do something more forgiving here. Signed-off-by: J. Bruce Fields <bfields@redhat.com> Acked-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2021-08-26 15:32:28 -04:00
J. Bruce Fields	b840be2f00	lockd: don't attempt blocking locks on nfs reexports As in the v4 case, it doesn't work well to block waiting for a lock on an nfs filesystem. As in the v4 case, that means we're depending on the client to poll. It's probably incorrect to depend on that, but I think clients do poll in practice. In any case, it's an improvement over hanging the lockd thread indefinitely as we currently are. Signed-off-by: J. Bruce Fields <bfields@redhat.com> Acked-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2021-08-26 15:32:18 -04:00
J. Bruce Fields	f657f8eef3	nfs: don't atempt blocking locks on nfs reexports NFS implements blocking locks by blocking inside its lock method. In the reexport case, this blocks the nfs server thread, which could lead to deadlocks since an nfs server thread might be required to unlock the conflicting lock. It also causes a crash, since the nfs server thread assumes it can free the lock when its lm_notify lock callback is called. Ideal would be to make the nfs lock method return without blocking in this case, but for now it works just not to attempt blocking locks. The difference is just that the original client will have to poll (as it does in the v4.0 case) instead of getting a callback when the lock's available. Signed-off-by: J. Bruce Fields <bfields@redhat.com> Acked-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2021-08-26 15:32:10 -04:00
Linus Torvalds	97d8cc2008	Two memory management fixes for the filesystem. -----BEGIN PGP SIGNATURE----- iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAmEntw0THGlkcnlvbW92 QGdtYWlsLmNvbQAKCRBKf944AhHzi8CFB/4/1LVBiC2P9tqIr0S4rLaCBW91Xm4v oYbqoEuzzzl9FPYndSka2hq5x8wg+0nBCXiejafYVbIZsvE/UN+C5+H1mCD5NwyO imXHJ3lqKuZRHrGCkMSM3TJuOijPIU2gqVR+xb0vIfqjr0mU6YgLvvRBcY0QNimQ gLPoMwFGYwGWSLdcBfnHYSGWzmJk4rE94SSkL9Rg1NjkPslBahOrpA/GwNbltGsU +jYIWAZwpfgu2SCWPdUdYpA/Rw518WjGjZ9pOuZmFKg8R2mSJ4LVb5wZ4bsi1b5j CGc4KjNV+koeSMBlBex1EDdVXvqxkviNiWP1jm4FKz/fWcpD5DWv5467 =t1nf -----END PGP SIGNATURE----- Merge tag 'ceph-for-5.14-rc8' of git://github.com/ceph/ceph-client Pull ceph fixes from Ilya Dryomov: "Two memory management fixes for the filesystem" * tag 'ceph-for-5.14-rc8' of git://github.com/ceph/ceph-client: ceph: fix possible null-pointer dereference in ceph_mdsmap_decode() ceph: correctly handle releasing an embedded cap flush	2021-08-26 11:18:30 -07:00
Linus Torvalds	9b49ceb854	for-5.14-rc7-tag -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmEmQy4ACgkQxWXV+ddt WDuCRRAAmuO+6Zsl5MSq0hBnpec/VBN6lTi9VPt184BjW1IWsqwR1Ax8dVQEKgCm gzkGYEuVq2L5p+/ugWKKftAbmUU85Jf3AIsv81SCJQosRkxVXAdbrZOv00yUZy6/ 5YOdO+9u61otvtO6LcZz9l+0LcpSmrBwEszluyIS+nArgQyZwX2aZTjcScDJvB9+ 1y7Eo6eIbqbcJOf4mLDIJh0bHaiA7HB6jYJkbsnz51wBU2ETATzNzAoyP5ReTPGc 1s0uxrpY37kHcUUTd6q8VLDTM6Ei4vF2zQm0jWcrw0K3hM6yPuH+GiEADoV/xsls 6pbtss1E81rHEQjcK8brf6CxbOak8/WXV0gRia/3avkFteVlax+NJxRdVhksuJln siGlQqASX3vYdNL0nG+U0ml1Y9C1ZXTXu4lGjS6rtT9oeV+YSccG2UjIT9LEtuON W/zE4bUMqCddcZFEPH5jNK+ChGS8mmfs+UFFR+W/JzMIO8Uji5/K44FZDFBo0Oc/ 3JEgk7ZV4D+8SblBPMxJx0fZqbE8ggKM+IN5CAyscINOWOxrmRiaFFRygRX0TLDB 2uts9owItW6zvaTRY6RclVeCvJ6ARQli4pv7YxZmH85hhtCbn515imWvLWw4+tSg QwrtDnPVMSJTdzFHvsmeE9lM6Vaw0ur70Ysyd29k/XJu3WwRdkM= =jN7s -----END PGP SIGNATURE----- Merge tag 'for-5.14-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fix from David Sterba: "One more fix that I think qualifies for a late merge. It's a revert of a one-liner fix that meanwhile got backported to stable kernels and we got reports from users. The broken fix prevents creating compressed inline extents, which could be noticeable on space consumption. Technically it's a regression as the patch was merged in 5.14-rc1 but got propagated to several stable kernels and has higher exposure than a 'typical' development cycle bug" * tag 'for-5.14-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: Revert "btrfs: compression: don't try to compress if we don't have enough pages"	2021-08-26 11:05:11 -07:00
Darrick J. Wong	03b8df8d43	iomap: standardize tracepoint formatting and storage Print all the offset, pos, and length quantities in hexadecimal. While we're at it, update the types of the tracepoint structure fields to match the types of the values being recorded in them. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2021-08-26 09:18:53 -07:00
Ronnie Sahlberg	3998f0b8bc	cifs: Do not leak EDEADLK to dgetents64 for STATUS_USER_SESSION_DELETED RHBZ: 1994393 If we hit a STATUS_USER_SESSION_DELETED for the Create part in the Create/QueryDirectory compound that starts a directory scan we will leak EDEADLK back to userspace and surprise glibc and the application. Pick this up initiate_cifs_search() and retry a small number of tries before we return an error to userspace. Cc: stable@vger.kernel.org Reported-by: Xiaoli Feng <xifeng@redhat.com> Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2021-08-25 16:08:38 -05:00
Steve French	38f4910b8b	cifs: cifs_md4 convert to SPDX identifier Add SPDX license identifier and replace license boilerplate for cifs_md4.c Signed-off-by: Steve French <stfrench@microsoft.com>	2021-08-25 15:51:52 -05:00
Ronnie Sahlberg	42c21973fa	cifs: create a MD4 module and switch cifs.ko to use it MD4 support will likely be removed from the crypto directory, but is needed for compression of NTLMSSP in SMB3 mounts. Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2021-08-25 15:48:00 -05:00
Ronnie Sahlberg	71c0286324	cifs: fork arc4 and create a separate module for it for cifs and other users We can not drop ARC4 and basically destroy CIFS connectivity for almost all CIFS users so create a new forked ARC4 module that CIFS and other subsystems that have a hard dependency on ARC4 can use. Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2021-08-25 15:47:57 -05:00
Ronnie Sahlberg	76a3c92ec9	cifs: remove support for NTLM and weaker authentication algorithms for SMB1. This removes the dependency to DES. Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2021-08-25 15:47:06 -05:00
Shyam Prasad N	18d04062f8	cifs: enable fscache usage even for files opened as rw So far, the fscache implementation we had supports only a small set of use cases. Particularly for files opened with O_RDONLY. This commit enables it even for rw based file opens. It also enables the reuse of cached data in case of mount option (cache=singleclient) where it is guaranteed that this is the only client (and server) which operates on the files. There's also a single line change in fscache.c to get around a bug seen in fscache. Signed-off-by: Shyam Prasad N <sprasad@microsoft.com> Acked-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2021-08-25 15:45:10 -05:00
Steve French	7321be2663	smb3: fix posix extensions mount option We were incorrectly initializing the posix extensions in the conversion to the new mount API. CC: <stable@vger.kernel.org> # 5.11+ Reported-by: Christian Brauner <christian.brauner@ubuntu.com> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Suggested-by: Namjae Jeon <namjae.jeon@samsung.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2021-08-25 15:43:12 -05:00
Ding Hui	d72c74197b	cifs: fix wrong release in sess_alloc_buffer() failed path smb_buf is allocated by small_smb_init_no_tc(), and buf type is CIFS_SMALL_BUFFER, so we should use cifs_small_buf_release() to release it in failed path. Signed-off-by: Ding Hui <dinghui@sangfor.com.cn> Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz> Signed-off-by: Steve French <stfrench@microsoft.com>	2021-08-25 15:42:18 -05:00
Len Baker	f980d055a0	CIFS: Fix a potencially linear read overflow strlcpy() reads the entire source buffer first. This read may exceed the destination size limit. This is both inefficient and can lead to linear read overflows if a source string is not NUL-terminated. Also, the strnlen() call does not avoid the read overflow in the strlcpy function when a not NUL-terminated string is passed. So, replace this block by a call to kstrndup() that avoids this type of overflow and does the same. Fixes: `066ce68994` ("cifs: rename cifs_strlcpy_to_host and make it use new functions") Signed-off-by: Len Baker <len.baker@gmx.com> Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>	2021-08-25 15:42:15 -05:00
Hao Xu	0c6e1d7fd5	io_uring: don't free request to slab It's not necessary to free the request back to slab when we fail to get sqe, just move it to state->free_list. Signed-off-by: Hao Xu <haoxu@linux.alibaba.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/20210825175856.194299-1-haoxu@linux.alibaba.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-25 13:04:26 -06:00
Linus Torvalds	fe67f4dd8d	pipe: do FASYNC notifications for every pipe IO, not just state changes It turns out that the SIGIO/FASYNC situation is almost exactly the same as the EPOLLET case was: user space really wants to be notified after every operation. Now, in a perfect world it should be sufficient to only notify user space on "state transitions" when the IO state changes (ie when a pipe goes from unreadable to readable, or from unwritable to writable). User space should then do as much as possible - fully emptying the buffer or what not - and we'll notify it again the next time the state changes. But as with EPOLLET, we have at least one case (stress-ng) where the kernel sent SIGIO due to the pipe being marked for asynchronous notification, but the user space signal handler then didn't actually necessarily read it all before returning (it read more than what was written, but since there could be multiple writes, it could leave data pending). The user space code then expected to get another SIGIO for subsequent writes - even though the pipe had been readable the whole time - and would only then read more. This is arguably a user space bug - and Colin King already fixed the stress-ng code in question - but the kernel regression rules are clear: it doesn't matter if kernel people think that user space did something silly and wrong. What matters is that it used to work. So if user space depends on specific historical kernel behavior, it's a regression when that behavior changes. It's on us: we were silly to have that non-optimal historical behavior, and our old kernel behavior was what user space was tested against. Because of how the FASYNC notification was tied to wakeup behavior, this was first broken by commits `f467a6a664` and `1b6b26ae70` ("pipe: fix and clarify pipe read/write wakeup logic"), but at the time it seems nobody noticed. Probably because the stress-ng problem case ends up being timing-dependent too. It was then unwittingly fixed by commit `3a34b13a88` ("pipe: make pipe writes always wake up readers") only to be broken again when by commit `3b844826b6` ("pipe: avoid unnecessary EPOLLET wakeups under normal loads"). And at that point the kernel test robot noticed the performance refression in the stress-ng.sigio.ops_per_sec case. So the "Fixes" tag below is somewhat ad hoc, but it matches when the issue was noticed. Fix it for good (knock wood) by simply making the kill_fasync() case separate from the wakeup case. FASYNC is quite rare, and we clearly shouldn't even try to use the "avoid unnecessary wakeups" logic for it. Link: https://lore.kernel.org/lkml/20210824151337.GC27667@xsang-OptiPlex-9020/ Fixes: `3b844826b6` ("pipe: avoid unnecessary EPOLLET wakeups under normal loads") Reported-by: kernel test robot <oliver.sang@intel.com> Tested-by: Oliver Sang <oliver.sang@intel.com> Cc: Eric Biederman <ebiederm@xmission.com> Cc: Colin Ian King <colin.king@canonical.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2021-08-25 10:27:16 -07:00
Tuo Li	a9e6ffbc5b	ceph: fix possible null-pointer dereference in ceph_mdsmap_decode() kcalloc() is called to allocate memory for m->m_info, and if it fails, ceph_mdsmap_destroy() behind the label out_err will be called: ceph_mdsmap_destroy(m); In ceph_mdsmap_destroy(), m->m_info is dereferenced through: kfree(m->m_info[i].export_targets); To fix this possible null-pointer dereference, check m->m_info before the for loop to free m->m_info[i].export_targets. [ jlayton: fix up whitespace damage only kfree(m->m_info) if it's non-NULL ] Reported-by: TOTE Robot <oslab@tsinghua.edu.cn> Signed-off-by: Tuo Li <islituo@gmail.com> Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2021-08-25 16:34:11 +02:00
Xiubo Li	b2f9fa1f3b	ceph: correctly handle releasing an embedded cap flush The ceph_cap_flush structures are usually dynamically allocated, but the ceph_cap_snap has an embedded one. When force umounting, the client will try to remove all the session caps. During this, it will free them, but that should not be done with the ones embedded in a capsnap. Fix this by adding a new boolean that indicates that the cap flush is embedded in a capsnap, and skip freeing it if that's set. At the same time, switch to using list_del_init() when detaching the i_list and g_list heads. It's possible for a forced umount to remove these objects but then handle_cap_flushsnap_ack() races in and does the list_del_init() again, corrupting memory. Cc: stable@vger.kernel.org URL: https://tracker.ceph.com/issues/52283 Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2021-08-25 16:34:11 +02:00
David Howells	185981958c	cachefiles: Use file_inode() rather than accessing ->f_inode Use the file_inode() helper rather than accessing ->f_inode directly. Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> cc: linux-cachefs@redhat.com Link: https://lore.kernel.org/r/162431192403.2908479.4590814090994846904.stgit@warthog.procyon.org.uk/	2021-08-25 15:20:25 +01:00
David Howells	a7e20e31f6	netfs: Move cookie debug ID to struct netfs_cache_resources Move the cookie debug ID from struct netfs_read_request to struct netfs_cache_resources and drop the 'cookie_' prefix. This makes it available for things that want to use netfs_cache_resources without having a netfs_read_request. Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> cc: linux-cachefs@redhat.com Link: https://lore.kernel.org/r/162431190784.2908479.13386972676539789127.stgit@warthog.procyon.org.uk/	2021-08-25 15:20:25 +01:00
David Howells	4c5e413994	fscache: Select netfs stats if fscache stats are enabled Unconditionally select the stats produced by the netfs lib if fscache stats are enabled as the former are displayed in the latter's procfile. Signed-off-by: David Howells <dhowells@redhat.com> cc: linux-cachefs@redhat.com Reviewed-by: Jeff Layton <jlayton@redhat.com> Link: https://lore.kernel.org/r/162280352566.3319242.10615341893991206961.stgit@warthog.procyon.org.uk/ Link: https://lore.kernel.org/r/162431189627.2908479.9165349423842063755.stgit@warthog.procyon.org.uk/	2021-08-25 15:20:18 +01:00
Gao Xiang	1266b4a7ec	erofs: fix double free of 'copied' Dan reported a new smatch warning [1] "fs/erofs/inode.c:210 erofs_read_inode() error: double free of 'copied'" Due to new chunk-based format handling logic, the error path can be called after kfree(copied). Set "copied = NULL" after kfree(copied) to fix this. [1] https://lore.kernel.org/r/202108251030.bELQozR7-lkp@intel.com Link: https://lore.kernel.org/r/20210825120757.11034-1-hsiangkao@linux.alibaba.com Fixes: `c5aa903a59` ("erofs: support reading chunk-based uncompressed files") Reported-by: kernel test robot <lkp@intel.com> Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>	2021-08-25 22:05:58 +08:00
Qu Wenruo	4e9655763b	Revert "btrfs: compression: don't try to compress if we don't have enough pages" This reverts commit `f216562731`. [BUG] It's no longer possible to create compressed inline extent after commit `f216562731` ("btrfs: compression: don't try to compress if we don't have enough pages"). [CAUSE] For compression code, there are several possible reasons we have a range that needs to be compressed while it's no more than one page. - Compressed inline write The data is always smaller than one sector and the test lacks the condition to properly recognize a non-inline extent. - Compressed subpage write For the incoming subpage compressed write support, we require page alignment of the delalloc range. And for 64K page size, we can compress just one page into smaller sectors. For those reasons, the requirement for the data to be more than one page is not correct, and is already causing regression for compressed inline data writeback. The idea of skipping one page to avoid wasting CPU time could be revisited in the future. [FIX] Fix it by reverting the offending commit. Reported-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org> Link: https://lore.kernel.org/linux-btrfs/afa2742.c084f5d6.17b6b08dffc@tnonline.net Fixes: `f216562731` ("btrfs: compression: don't try to compress if we don't have enough pages") CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-25 15:08:19 +02:00
Pavel Begunkov	aaa4db12ef	io_uring: accept directly into fixed file table As done with open opcodes, allow accept to skip installing fd into processes' file tables and put it directly into io_uring's fixed file table. Same restrictions and design as for open. Suggested-by: Josh Triplett <josh@joshtriplett.org> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Link: https://lore.kernel.org/r/6d16163f376fac7ac26a656de6b42199143e9721.1629888991.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-25 06:36:56 -06:00
Pavel Begunkov	a7083ad5e3	io_uring: hand code io_accept() fd installing Make io_accept() to handle file descriptor allocations and installation. A preparation patch for bypassing file tables. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Link: https://lore.kernel.org/r/5b73d204caa0ce979ccb98136695b60f52a3d98c.1629888991.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-25 06:36:56 -06:00
Pavel Begunkov	b9445598d8	io_uring: openat directly into fixed fd table Instead of opening a file into a process's file table as usual and then registering the fd within io_uring, some users may want to skip the first step and place it directly into io_uring's fixed file table. This patch adds such a capability for IORING_OP_OPENAT and IORING_OP_OPENAT2. The behaviour is controlled by setting sqe->file_index, where 0 implies the old behaviour using normal file tables. If non-zero value is specified, then it will behave as described and place the file into a fixed file slot sqe->file_index - 1. A file table should be already created, the slot should be valid and empty, otherwise the operation will fail. Keep the error codes consistent with IORING_OP_FILES_UPDATE, ENXIO and EINVAL on inappropriate fixed tables, and return EBADF on collision with already registered file. Note: IOSQE_FIXED_FILE can't be used to switch between modes, because accept takes a file, and it already uses the flag with a different meaning. Suggested-by: Josh Triplett <josh@joshtriplett.org> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Link: https://lore.kernel.org/r/e9b33d1163286f51ea707f87d95bd596dada1e65.1629888991.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-25 06:36:56 -06:00
Sishuai Gong	c42dd069be	configfs: fix a race in configfs_lookup() When configfs_lookup() is executing list_for_each_entry(), it is possible that configfs_dir_lseek() is calling list_del(). Some unfortunate interleavings of them can cause a kernel NULL pointer dereference error Thread 1 Thread 2 //configfs_dir_lseek() //configfs_lookup() list_del(&cursor->s_sibling); list_for_each_entry(sd, ...) Fix this by grabbing configfs_dirent_lock in configfs_lookup() while iterating ->s_children. Signed-off-by: Sishuai Gong <sishuai@purdue.edu> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-08-25 07:58:49 +02:00
Christoph Hellwig	d07f132a22	configfs: fold configfs_attach_attr into configfs_lookup This makes it more clear what gets added to the dcache and prepares for an additional locking fix. Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-08-25 07:58:46 +02:00
Christoph Hellwig	899587c8d0	configfs: simplify the configfs_dirent_is_ready Return the error directly instead of using a goto. Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-08-25 07:43:55 +02:00
Christoph Hellwig	417b962dde	configfs: return -ENAMETOOLONG earlier in configfs_lookup Just like most other file systems: get the simple checks out of the way first. Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-08-25 07:42:44 +02:00
Dave Chinner	f38a032b16	xfs: fix I_DONTCACHE Yup, the VFS hoist broke it, and nobody noticed. Bulkstat workloads make it clear that it doesn't work as it should. Fixes: `dae2f8ed79` ("fs: Lift XFS_IDONTCACHE to the VFS layer") Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2021-08-24 19:13:04 -07:00
Christoph Hellwig	158ee7b656	block: mark blkdev_fsync static blkdev_fsync is only used inside of block_dev.c since the removal of the raw drіver. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20210824151823.1575100-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-24 10:10:33 -06:00
Lukas Bulwahn	2949e8427a	fs: clean up after mandatory file locking support removal Commit 3efee0567b4a ("fs: remove mandatory file locking support") removes some operations in functions rw_verify_area(). As these functions are now simplified, do some syntactic clean-up as follow-up to the removal as well, which was pointed out by compiler warnings and static analysis. No functional change. Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com> Signed-off-by: Jeff Layton <jlayton@kernel.org>	2021-08-24 07:52:45 -04:00
Darrick J. Wong	72a048c105	xfs: only set IOMAP_F_SHARED when providing a srcmap to a write While prototyping a free space defragmentation tool, I observed an unexpected IO error while running a sequence of commands that can be recreated by the following sequence of commands: # xfs_io -f -c "pwrite -S 0x58 -b 10m 0 10m" file1 # cp --reflink=always file1 file2 # punch-alternating -o 1 file2 # xfs_io -c "funshare 0 10m" file2 fallocate: Input/output error I then scraped this (abbreviated) stack trace from dmesg: WARNING: CPU: 0 PID: 30788 at fs/iomap/buffered-io.c:577 iomap_write_begin+0x376/0x450 CPU: 0 PID: 30788 Comm: xfs_io Not tainted 5.14.0-rc6-xfsx #rc6 5ef57b62a900814b3e4d885c755e9014541c8732 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-1ubuntu1.1 04/01/2014 RIP: 0010:iomap_write_begin+0x376/0x450 RSP: 0018:ffffc90000c0fc20 EFLAGS: 00010297 RAX: 0000000000000001 RBX: ffffc90000c0fd10 RCX: 0000000000001000 RDX: ffffc90000c0fc54 RSI: 000000000000000c RDI: 000000000000000c RBP: ffff888005d5dbd8 R08: 0000000000102000 R09: ffffc90000c0fc50 R10: 0000000000b00000 R11: 0000000000101000 R12: ffffea0000336c40 R13: 0000000000001000 R14: ffffc90000c0fd10 R15: 0000000000101000 FS: 00007f4b8f62fe40(0000) GS:ffff88803ec00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000056361c554108 CR3: 000000000524e004 CR4: 00000000001706f0 Call Trace: iomap_unshare_actor+0x95/0x140 iomap_apply+0xfa/0x300 iomap_file_unshare+0x44/0x60 xfs_reflink_unshare+0x50/0x140 [xfs 61947ea9b3a73e79d747dbc1b90205e7987e4195] xfs_file_fallocate+0x27c/0x610 [xfs 61947ea9b3a73e79d747dbc1b90205e7987e4195] vfs_fallocate+0x133/0x330 __x64_sys_fallocate+0x3e/0x70 do_syscall_64+0x35/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xae RIP: 0033:0x7f4b8f79140a Looking at the iomap tracepoints, I saw this: iomap_iter: dev 8:64 ino 0x100 pos 0 length 0 flags WRITE\|0x80 (0x81) ops xfs_buffered_write_iomap_ops caller iomap_file_unshare iomap_iter_dstmap: dev 8:64 ino 0x100 bdev 8:64 addr -1 offset 0 length 131072 type DELALLOC flags SHARED iomap_iter_srcmap: dev 8:64 ino 0x100 bdev 8:64 addr 147456 offset 0 length 4096 type MAPPED flags iomap_iter: dev 8:64 ino 0x100 pos 0 length 4096 flags WRITE\|0x80 (0x81) ops xfs_buffered_write_iomap_ops caller iomap_file_unshare iomap_iter_dstmap: dev 8:64 ino 0x100 bdev 8:64 addr -1 offset 4096 length 4096 type DELALLOC flags SHARED console: WARNING: CPU: 0 PID: 30788 at fs/iomap/buffered-io.c:577 iomap_write_begin+0x376/0x450 The first time funshare calls ->iomap_begin, xfs sees that the first block is shared and creates a 128k delalloc reservation in the COW fork. The delalloc reservation is returned as dstmap, and the shared block is returned as srcmap. So far so good. funshare calls ->iomap_begin to try the second block. This time there's no srcmap (punch-alternating punched it out!) but we still have the delalloc reservation in the COW fork. Therefore, we again return the reservation as dstmap and the hole as srcmap. iomap_unshare_iter incorrectly tries to unshare the hole, which __iomap_write_begin rejects because shared regions must be fully written and therefore cannot require zeroing. Therefore, change the buffered write iomap_begin function not to set IOMAP_F_SHARED when there isn't a source mapping to read from for the unsharing. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>	2021-08-23 17:32:51 -07:00
J. Bruce Fields	7f024fcd5c	Keep read and write fds with each nlm_file We shouldn't really be using a read-only file descriptor to take a write lock. Most filesystems will put up with it. But NFS, for example, won't. Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2021-08-23 18:05:31 -04:00
Dmitry Kadashev	cf30da90bc	io_uring: add support for IORING_OP_LINKAT IORING_OP_LINKAT behaves like linkat(2) and takes the same flags and arguments. In some internal places 'hardlink' is used instead of 'link' to avoid confusion with the SQE links. Name 'link' conflicts with the existing 'link' member of io_kiocb. Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Suggested-by: Christian Brauner <christian.brauner@ubuntu.com> Link: https://lore.kernel.org/io-uring/20210514145259.wtl4xcsp52woi6ab@wittgenstein/ Signed-off-by: Dmitry Kadashev <dkadashev@gmail.com> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Link: https://lore.kernel.org/r/20210708063447.3556403-12-dkadashev@gmail.com [axboe: add splice_fd_in check] Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:48:52 -06:00
Dmitry Kadashev	7a8721f84f	io_uring: add support for IORING_OP_SYMLINKAT IORING_OP_SYMLINKAT behaves like symlinkat(2) and takes the same flags and arguments. Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Suggested-by: Christian Brauner <christian.brauner@ubuntu.com> Link: https://lore.kernel.org/io-uring/20210514145259.wtl4xcsp52woi6ab@wittgenstein/ Signed-off-by: Dmitry Kadashev <dkadashev@gmail.com> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Link: https://lore.kernel.org/r/20210708063447.3556403-11-dkadashev@gmail.com [axboe: add splice_fd_in check] Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:48:33 -06:00
Christoph Hellwig	01cfa28af4	block: use the percpu bio cache in __blkdev_direct_IO Use bio_alloc_kiocb to dip into the percpu cache of bios when the caller asks for it. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:45:00 -06:00
Jens Axboe	394918ebb8	io_uring: enable use of bio alloc cache Mark polled IO as being safe for dipping into the bio allocation cache, in case the targeted bio_set has it enabled. This brings an IOPOLL gen2 Optane QD=128 workload from ~3.2M IOPS to ~3.5M IOPS. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:44:55 -06:00
Pavel Begunkov	dadebc350d	io_uring: fix io_try_cancel_userdata race for iowq WARNING: CPU: 1 PID: 5870 at fs/io_uring.c:5975 io_try_cancel_userdata+0x30f/0x540 fs/io_uring.c:5975 CPU: 0 PID: 5870 Comm: iou-wrk-5860 Not tainted 5.14.0-rc6-next-20210820-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:io_try_cancel_userdata+0x30f/0x540 fs/io_uring.c:5975 Call Trace: io_async_cancel fs/io_uring.c:6014 [inline] io_issue_sqe+0x22d5/0x65a0 fs/io_uring.c:6407 io_wq_submit_work+0x1dc/0x300 fs/io_uring.c:6511 io_worker_handle_work+0xa45/0x1840 fs/io-wq.c:533 io_wqe_worker+0x2cc/0xbb0 fs/io-wq.c:582 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295 io_try_cancel_userdata() can be called from io_async_cancel() executing in the io-wq context, so the warning fires, which is there to alert anyone accessing task->io_uring->io_wq in a racy way. However, io_wq_put_and_exit() always first waits for all threads to complete, so the only detail left is to zero tctx->io_wq after the context is removed. note: one little assumption is that when IO_WQ_WORK_CANCEL, the executor won't touch ->io_wq, because io_wq_destroy() might cancel left pending requests in such a way. Cc: stable@vger.kernel.org Reported-by: syzbot+b0c9d1588ae92866515f@syzkaller.appspotmail.com Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/dfdd37a80cfa9ffd3e59538929c99cdd55d8699e.1629721757.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:41:56 -06:00
Dmitry Kadashev	e34a02dc40	io_uring: add support for IORING_OP_MKDIRAT IORING_OP_MKDIRAT behaves like mkdirat(2) and takes the same flags and arguments. Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Dmitry Kadashev <dkadashev@gmail.com> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Link: https://lore.kernel.org/r/20210708063447.3556403-10-dkadashev@gmail.com [axboe: add splice_fd_in check] Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:41:26 -06:00
Dmitry Kadashev	45f30dab39	namei: update do_() helpers to return ints Update the following to return int rather than long, for uniformity with the rest of the do_ helpers in namei.c: * do_rmdir() * do_unlinkat() * do_mkdirat() * do_mknodat() * do_symlinkat() Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <christian.brauner@ubuntu.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/io-uring/20210514143202.dmzfcgz5hnauy7ze@wittgenstein/ Signed-off-by: Dmitry Kadashev <dkadashev@gmail.com> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Link: https://lore.kernel.org/r/20210708063447.3556403-9-dkadashev@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:41:26 -06:00
Dmitry Kadashev	020250f31c	namei: make do_linkat() take struct filename Pass in the struct filename pointers instead of the user string, for uniformity with do_renameat2, do_unlinkat, do_mknodat, etc. Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <christian.brauner@ubuntu.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/io-uring/20210330071700.kpjoyp5zlni7uejm@wittgenstein/ Signed-off-by: Dmitry Kadashev <dkadashev@gmail.com> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Link: https://lore.kernel.org/r/20210708063447.3556403-8-dkadashev@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:41:26 -06:00
Dmitry Kadashev	8228e2c313	namei: add getname_uflags() There are a couple of places where we already open-code the (flags & AT_EMPTY_PATH) check and io_uring will likely add another one in the future. Let's just add a simple helper getname_uflags() that handles this directly and use it. Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <christian.brauner@ubuntu.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/io-uring/20210415100815.edrn4a7cy26wkowe@wittgenstein/ Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: Dmitry Kadashev <dkadashev@gmail.com> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Link: https://lore.kernel.org/r/20210708063447.3556403-7-dkadashev@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:41:26 -06:00
Dmitry Kadashev	da2d0cede3	namei: make do_symlinkat() take struct filename Pass in the struct filename pointers instead of the user string, for uniformity with the recently converted do_mkdnodat(), do_unlinkat(), do_renameat(), do_mkdirat(). Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <christian.brauner@ubuntu.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/io-uring/20210330071700.kpjoyp5zlni7uejm@wittgenstein/ Signed-off-by: Dmitry Kadashev <dkadashev@gmail.com> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Link: https://lore.kernel.org/r/20210708063447.3556403-6-dkadashev@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:41:26 -06:00
Dmitry Kadashev	7797251bb5	namei: make do_mknodat() take struct filename Pass in the struct filename pointers instead of the user string, for uniformity with the recently converted do_unlinkat(), do_renameat(), do_mkdirat(). Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <christian.brauner@ubuntu.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/io-uring/20210330071700.kpjoyp5zlni7uejm@wittgenstein/ Signed-off-by: Dmitry Kadashev <dkadashev@gmail.com> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Link: https://lore.kernel.org/r/20210708063447.3556403-5-dkadashev@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:41:26 -06:00
Dmitry Kadashev	584d3226d6	namei: make do_mkdirat() take struct filename Pass in the struct filename pointers instead of the user string, and update the three callers to do the same. This is heavily based on commit dbea8d345177 ("fs: make do_renameat2() take struct filename"). This behaves like do_unlinkat() and do_renameat2(). Cc: Al Viro <viro@zeniv.linux.org.uk> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Dmitry Kadashev <dkadashev@gmail.com> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Link: https://lore.kernel.org/r/20210708063447.3556403-4-dkadashev@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:41:26 -06:00
Dmitry Kadashev	0ee50b4753	namei: change filename_parentat() calling conventions Since commit `5c31b6cedb` ("namei: saner calling conventions for filename_parentat()") filename_parentat() had the following behavior WRT the passed in struct filename : On error the name is consumed (putname() is called on it); * On success the name is returned back as the return value; Now there is a need for filename_create() and filename_lookup() variants that do not consume the passed filename, and following the same "consume the name only on error" semantics is proven to be hard to reason about and result in confusing code. Hence this preparation change splits filename_parentat() into two: one that always consumes the name and another that never consumes the name. This will allow to implement two filename_create() variants in the same way, and is a consistent and hopefully easier to reason about approach. Link: https://lore.kernel.org/io-uring/CAOKbgA7MiqZAq3t-HDCpSGUFfco4hMA9ArAE-74fTpU+EkvKPw@mail.gmail.com/ Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <christian.brauner@ubuntu.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Dmitry Kadashev <dkadashev@gmail.com> Link: https://lore.kernel.org/r/20210708063447.3556403-3-dkadashev@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:41:26 -06:00
Dmitry Kadashev	91ef658fb8	namei: ignore ERR/NULL names in putname() Supporting ERR/NULL names in putname() makes callers code cleaner, and is what some other path walking functions already support for the same reason. This also removes a few existing IS_ERR checks before putname(). Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/io-uring/CAHk-=wgCac9hBsYzKMpHk0EbLgQaXR=OUAjHaBtaY+G8A9KhFg@mail.gmail.com/ Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: Dmitry Kadashev <dkadashev@gmail.com> Link: https://lore.kernel.org/r/20210708063447.3556403-2-dkadashev@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:41:26 -06:00
Pavel Begunkov	126180b95f	io_uring: IRQ rw completion batching Employ inline completion logic for read/write completions done via io_req_task_complete(). If ->uring_lock is contended, just do normal request completion, but if not, make tctx_task_work() to grab the lock and do batched inline completions in io_req_task_complete(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/94589c3ce69eaed86a21bb1ec696407a54fab1aa.1629286357.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:13:04 -06:00
Pavel Begunkov	f237c30a56	io_uring: batch task work locking Many task_work handlers either grab ->uring_lock, or may benefit from having it. Move locking logic out of individual handlers to a lazy approach controlled by tctx_task_work(), so we don't keep doing tons of mutex lock/unlock. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/d6a34e147f2507a2f3e2fa1e38a9c541dcad3929.1629286357.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:13:04 -06:00
Pavel Begunkov	5636c00d3e	io_uring: flush completions for fallbacks io_fallback_req_func() doesn't expect anyone creating inline completions, and no one currently does that. Teach the function to flush completions preparing for further changes. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/8b941516921f72e1a64d58932d671736892d7fff.1629286357.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:13:04 -06:00
Pavel Begunkov	26578cda3d	io_uring: add ->splice_fd_in checks ->splice_fd_in is used only by splice/tee, but no other request checks it for validity. Add the check for most of request types excluding reads/writes/sends/recvs, we don't want overhead for them and can leave them be as is until the field is actually used. Cc: stable@vger.kernel.org Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/f44bc2acd6777d932de3d71a5692235b5b2b7397.1629451684.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:13:00 -06:00
Jens Axboe	2c5d763c19	io_uring: add clarifying comment for io_cqring_ev_posted() We've previously had an issue where overflow flush unconditionally calls io_cqring_ev_posted() even if it didn't flush any events to the ring, causing wake and eventfd increment where no new events are available. Some applications don't like that, see commit `b18032bb0a` for details. This came up in discussion for another patch recently, hence add a comment detailing what the relationship between calling the events posted helper and CQ ring entries is. Link: https://lore.kernel.org/io-uring/77a44fce-c831-16a6-8e80-9aee77f496a2@kernel.dk/ Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:10:47 -06:00
Pavel Begunkov	0bea96f59b	io_uring: place fixed tables under memcg limits Fixed tables may be large enough, place all of them together with allocated tags under memcg limits. Cc: stable@vger.kernel.org Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/b3ac9f5da9821bb59837b5fe25e8ef4be982218c.1629451684.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:10:47 -06:00
Pavel Begunkov	3a1b8a4e84	io_uring: limit fixed table size by RLIMIT_NOFILE Limit the number of files in io_uring fixed tables by RLIMIT_NOFILE, that's the first and the simpliest restriction that we should impose. Cc: stable@vger.kernel.org Suggested-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/b2756c340aed7d6c0b302c26dab50c6c5907f4ce.1629451684.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:10:46 -06:00
Hao Xu	99c8bc52d1	io_uring: fix lack of protection for compl_nr coml_nr in ctx_flush_and_put() is not protected by uring_lock, this may cause problems when accessing in parallel: say coml_nr > 0 ctx_flush_and put other context if (compl_nr) get mutex coml_nr > 0 do flush coml_nr = 0 release mutex get mutex do flush () release mutex in () place, we call io_cqring_ev_posted() and users likely get no events there. To avoid spurious events, re-check the value when under the lock. Fixes: `2c32395d81` ("io_uring: fix __tctx_task_work() ctx race") Signed-off-by: Hao Xu <haoxu@linux.alibaba.com> Link: https://lore.kernel.org/r/20210820221954.61815-1-haoxu@linux.alibaba.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:10:46 -06:00
wangyangbo	187f08c12c	io_uring: Add register support for non-4k PAGE_SIZE Now allocated rsrc table uses PAGE_SIZE as the size of 2nd-level, and accessing this table relies on each level index from fixed TABLE_SHIFT (12 - 3) in 4k page case. In order to correctly work in non-4k page, define TABLE_SHIFT as non-fixed (PAGE_SHIFT - shift of data) for 2nd-level table entry number. Signed-off-by: wangyangbo <wangyangbo@uniontech.com> Link: https://lore.kernel.org/r/20210819055657.27327-1-wangyangbo@uniontech.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:10:46 -06:00
Pavel Begunkov	e98e49b2bb	io_uring: extend task put optimisations Now with IRQ completions done via IRQ, almost all requests freeing are done from the context of submitter task, so it makes sense to extend task_put optimisation from io_req_free_batch_finish() to cover all the cases including task_work by moving it into io_put_task(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/824a7cbd745ddeee4a0f3ff85c558a24fd005872.1629302453.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:10:46 -06:00
Jens Axboe	316319e82f	io_uring: add comments on why PF_EXITING checking is safe We have two checks of task->flags & PF_EXITING left: 1) In io_req_task_submit(), which is called in task_work and hence always in the context of the original task. That means that req->task == current, and hence checking ->flags is totally fine. 2) In io_poll_rewait(), where we need to stop re-arming poll to prevent it interfering with cancelation. This is only run from task_work as well, and hence for this case too req->task == current. Add a comment to both spots detailing that. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:10:43 -06:00
Hao Xu	79dca1846f	io-wq: move nr_running and worker_refs out of wqe->lock protection We don't need to protect nr_running and worker_refs by wqe->lock, so narrow the range of raw_spin_lock_irq - raw_spin_unlock_irq Signed-off-by: Hao Xu <haoxu@linux.alibaba.com> Link: https://lore.kernel.org/r/20210810125554.99229-1-haoxu@linux.alibaba.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:10:43 -06:00
Pavel Begunkov	ec3c3d0f3a	io_uring: fix io_timeout_remove locking io_timeout_cancel() posts CQEs so needs ->completion_lock to be held, so grab it in io_timeout_remove(). Fixes: 48ecb6369f1f2 ("io_uring: run timeouts from task_work") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/d6f03d653a4d7bf693ef6f39b6a426b6d97fd96f.1629280204.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:10:43 -06:00
Pavel Begunkov	23a65db83b	io_uring: improve same wq polling Move earlier the check for whether __io_queue_proc() tries to poll already polled waitqueue, and do the same for the second poll entry, if any. Shouldn't really matter, but at least it would have a more predictable behaviour. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/8cb428cfe8ade0fd055859fabb878db8777d4c2f.1629228203.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:10:43 -06:00
Pavel Begunkov	505657bc6c	io_uring: reuse io_req_complete_post() We have io_req_complete_post() to post a CQE and put the request. It takes care of all synchronisation and is more concise and efficent, so replace all hancoded occurrences of "lock; post CQE; unlock; + put_req()" with io_req_complete_post(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/2c83463458a613f9d870e5147eb134da2aa70779.1629228203.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:10:43 -06:00
Pavel Begunkov	ae421d9350	io_uring: better encapsulate buffer select for rw Make io_put_rw_kbuf() to do the REQ_F_BUFFER_SELECTED check, so all the callers don't need to hand code it. The number of places where we call io_put_rw_kbuf() is growing, so saves some pain. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/3df3919e5e7efe03420c44ab4d9317a81a9cf398.1629228203.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:10:43 -06:00
Pavel Begunkov	906c6caaf5	io_uring: optimise io_prep_linked_timeout() Linked timeout handling during issuing is heavy, it adds extra instructions and forces to save the next linked timeout before io_issue_sqe(). Follwing the same reasoning as in refcounting patches, a request can't be freed by the time it returns from io_issue_sqe(), so now we don't need to do io_prep_linked_timeout() in advance, and it can be delayed to colder paths optimising the generic path. Also, it should also save quite a lot for requests with linked timeouts and completed inline on timeout spinlocking + hrtimer_start() + hrtimer_try_to_cancel() and so on. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/19bfc9a0d26c5c5f1e359f7650afe807ca8ef879.1628981736.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:10:43 -06:00
Pavel Begunkov	0756a86910	io_uring: cancel not-armed linked touts separately Adjust io_disarm_next(), so it can detect if there is a linked but not-yet-armed timeout and complete/cancel it separately. Will be used in the following patch. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/ae228cde2c0df3d92d29d5e4852ed9fa8a2a97db.1628981736.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:10:43 -06:00
Pavel Begunkov	4d13d1a4d1	io_uring: simplify io_prep_linked_timeout The link test in io_prep_linked_timeout() is pretty bulky, replace it with a flag. It's better for normal path and linked requests, and also will be used further for request failing. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/3703770bfae8bc1ff370e43ef5767940202cab42.1628981736.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:10:43 -06:00
Pavel Begunkov	b97e736a4b	io_uring: kill REQ_F_LTIMEOUT_ACTIVE Instead of handling double consecutive linked timeouts through tricky flag combinations, just check the submit_state.link during timeout_prep and fail that case in advance. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/04150760b0dc739522264b8abd309409f7421a06.1628981736.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:10:43 -06:00
Pavel Begunkov	fd08e5309b	io_uring: optimise hot path of ltimeout prep io_prep_linked_timeout() grew too heavy and compiler now refuse to inline the function. Help it by splitting in two and annotating with inline. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/560636717a32e9513724f09b9ecaace942dde4d4.1628705069.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:10:37 -06:00
Pavel Begunkov	8cb01fac98	io_uring: deduplicate cancellation code IORING_OP_ASYNC_CANCEL and IORING_OP_LINK_TIMEOUT have enough of overlap, so extract a helper for request cancellation and use in both. Also, removes some amount of ugliness because of success_ret. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/900122b588e65b637e71bfec80a260726c6a54d6.1628981736.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:10:37 -06:00
Pavel Begunkov	a8576af9d1	io_uring: kill not necessary resubmit switch `773af69121` ("io_uring: always reissue from task_work context") makes all resubmission to be made from task_work, so we don't need that hack with resubmit/not-resubmit switch anymore. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/47fa177cca04e5ffd308a35227966c8e15d8525b.1628981736.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:10:37 -06:00
Pavel Begunkov	fb6820998f	io_uring: optimise initial ltimeout refcounting Linked timeouts are never refcounted when it comes to the first call to __io_prep_linked_timeout(), so save an io_ref_get() and set the desired value directly. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/177b24cc62ffbb42d915d6eb9e8876266e4c0d5a.1628981736.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:10:37 -06:00
Pavel Begunkov	761bcac157	io_uring: don't inflight-track linked timeouts Tracking linked timeouts as infligh was needed to make sure that io-wq is not destroyed by io_uring_cancel_generic() racing with io_async_cancel_one() accessing it. Now, cancellations issued by linked timeouts are done in the task context, so it's already synchronised. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/e1b05cf47cb69df2305efdbee8cf7ba36f46c1a3.1628981736.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:10:37 -06:00
Pavel Begunkov	48dcd38d73	io_uring: optimise iowq refcounting If a requests is forwarded into io-wq, there is a good chance it hasn't been refcounted yet and we can save one req_ref_get() by setting the refcount number to the right value directly. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/2d53f4449faaf73b4a4c5de667fc3c176d974860.1628981736.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:10:37 -06:00
Jens Axboe	a141dd896f	io_uring: correct __must_hold annotation io_req_free_batch() has a __must_hold annotation referencing a request being passed in, but we're passing in the context. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:10:37 -06:00
Hao Xu	41a5169c23	io_uring: code clean for completion_lock in io_arm_poll_handler() We can merge two spin_unlock() operations to one since we removed some code not long ago. Signed-off-by: Hao Xu <haoxu@linux.alibaba.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:10:37 -06:00
Hao Xu	f552a27afe	io_uring: remove files pointer in cancellation functions When doing cancellation, we use a parameter to indicate where it's from do_exit or exec. So a boolean value is good enough for this, remove the struct files* as it is not necessary. Signed-off-by: Hao Xu <haoxu@linux.alibaba.com> [axboe: fixup io_uring_files_cancel for !CONFIG_IO_URING] Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:10:37 -06:00
Pavel Begunkov	20e60a3832	io_uring: skip request refcounting As submission references are gone, there is only one initial reference left. Instead of actually doing atomic refcounting, add a flag indicating whether we're going to take more refs or doing any other sync magic. The flag should be set before the request may get used in parallel. Together with the previous patch it saves 2 refcount atomics per request for IOPOLL and IRQ completions, and 1 atomic per req for inline completions, with some exceptions. In particular, currently, there are three cases, when the refcounting have to be enabled: - Polling, including apoll. Because double poll entries takes a ref. Might get relaxed in the near future. - Link timeouts, enabled for both, the timeout and the request it's bound to, because they work in-parallel and we need to synchronise to cancel one of them on completion. - When a request gets in io-wq, because it doesn't hold uring_lock and we need guarantees of submission references. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/8b204b6c5f6643062270a1913d6d3a7f8f795fd9.1628705069.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:10:32 -06:00
Pavel Begunkov	5d5901a343	io_uring: remove submission references Requests are by default given with two references, submission and completion. Completion references are straightforward, they represent request ownership and are put when a request is completed or so. Submission references are a bit more trickier. They're needed when io_issue_sqe() followed deep into the submission stack (e.g. in fs, block, drivers, etc.), request may have given away for concurrent execution or already completed, and the code unwinding back to io_issue_sqe() may be accessing some pieces of our requests, e.g. file or iov. Now, we prevent such async/in-depth completions by pushing requests through task_work. Punting to io-wq is also done through task_works, apart from a couple of cases with a pretty well known context. So, there're two cases: 1) io_issue_sqe() from the task context and protected by ->uring_lock. Either requests return back to io_uring or handed to task_work, which won't be executed because we're currently controlling that task. So, we can be sure that requests are staying alive all the time and we don't need submission references to pin them. 2) io_issue_sqe() from io-wq, which doesn't hold the mutex. The role of submission reference is played by io-wq reference, which is put by io_wq_submit_work(). Hence, it should be fine. Considering that, we can carefully kill the submission reference. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/6b68f1c763229a590f2a27148aee77767a8d7750.1628705069.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:10:32 -06:00
Pavel Begunkov	91c2f69783	io_uring: remove req_ref_sub_and_test() Soon, we won't need to put several references at once, remove req_ref_sub_and_test() and @nr argument from io_put_req_deferred(), and put the rest of the references by hand. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/1868c7554108bff9194fb5757e77be23fadf7fc0.1628705069.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:10:32 -06:00
Pavel Begunkov	21c843d582	io_uring: move req_ref_get() and friends Move all request refcount helpers to avoid forward declarations in the future. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/89fd36f6f3fe5b733dfe4546c24725eee40df605.1628705069.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:10:32 -06:00
Jens Axboe	79ebeaee8a	io_uring: remove IRQ aspect of io_ring_ctx completion lock We have no hard/soft IRQ users of this lock left, remove any IRQ disabling/saving and restoring when grabbing this lock. This is straight forward with no users entering with IRQs disabled anymore, the only thing to look out for is the waitqueue poll head lock which nests inside the completion lock. That needs IRQs disabled, and hence we have to do that now instead of relying on the outer lock doing so. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:10:32 -06:00
Jens Axboe	8ef12efe26	io_uring: run regular file completions from task_work This is in preparation to making the completion lock work outside of hard/soft IRQ context. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:10:32 -06:00
Jens Axboe	89b263f6d5	io_uring: run linked timeouts from task_work This is in preparation to making the completion lock work outside of hard/soft IRQ context. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:10:32 -06:00
Jens Axboe	89850fce16	io_uring: run timeouts from task_work This is in preparation to making the completion lock work outside of hard/soft IRQ context. Add a timeout_lock to handle the ordering of timeout completions or cancelations with the timeouts actually triggering. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:10:32 -06:00
Pavel Begunkov	62906e89e6	io_uring: remove file batch-get optimisation For requests with non-fixed files, instead of grabbing just one reference, we get by the number of left requests, so the following requests using the same file can take it without atomics. However, it's not all win. If there is one request in the middle not using files or having a fixed file, we'll need to put back the left references. Even worse if an application submits requests dealing with different files, it will do a put for each new request, so doubling the number of atomics needed. Also, even if not used, it's still takes some cycles in the submission path. If a file used many times, it rather makes sense to pre-register it, if not, we may fall in the described pitfall. So, this optimisation is a matter of use case. Go with the simpliest code-wise way, remove it. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:10:32 -06:00
Pavel Begunkov	6294f3686b	io_uring: clean up tctx_task_work() After recent fixes, tctx_task_work() always does proper spinlocking before looking into ->task_list, so now we don't need atomics for ->task_state, replace it with non-atomic task_running using the critical section. Tide it up, combine two separate block with spinlocking, and always try to splice in there, so we do less locking when new requests are arriving during the function execution. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> [axboe: fix missing ->task_running reset on task_work_add() failure] Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:10:32 -06:00
Pavel Begunkov	5d70904367	io_uring: inline io_poll_remove_waitqs Inline io_poll_remove_waitqs() into its only user and clean it up. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/2f1a91a19ffcd591531dc4c61e2f11c64a2d6a6d.1628536684.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:10:26 -06:00
Pavel Begunkov	90f67366cb	io_uring: remove extra argument for overflow flush Unlike __io_cqring_overflow_flush(), nobody does forced flushing with io_cqring_overflow_flush(), so removed the argument from it. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/7594f869ca41b7cfb5a35a3c7c2d402242834e9e.1628536684.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:10:19 -06:00
Pavel Begunkov	cd0ca2e048	io_uring: inline struct io_comp_state Inline struct io_comp_state into struct io_submit_state. They are already coupled tightly, together with mixed responsibilities it only brings confusion having them separately. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/e55bba77426b399e3a2e54e3c6c267c6a0fc4b57.1628536684.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:09:48 -06:00
Pavel Begunkov	bb943b8265	io_uring: use inflight_entry instead of compl.list req->compl.list is used to cache freed requests, and so can't overlap in time with req->inflight_entry. So, use inflight_entry to link requests and remove compl.list. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/e430e79d22d70a190d718831bda7bfed1daf8976.1628536684.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:09:43 -06:00
Pavel Begunkov	7255834ed6	io_uring: remove redundant args from cache_free We don't use @tsk argument of io_req_cache_free(), remove it. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/6a28b4a58ee0aaf0db98e2179b9c9f06f9b0cca1.1628536684.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:09:43 -06:00
Pavel Begunkov	c34b025f2d	io_uring: cache __io_free_req()'d requests Don't kfree requests in __io_free_req() but put them back into the internal request cache. That makes allocations more sustainable and will be used for refcounting optimisations. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/9f4950fbe7771c8d41799366d0a3a08ac3040236.1628536684.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:09:43 -06:00
Pavel Begunkov	f56165e62f	io_uring: move io_fallback_req_func() Move io_fallback_req_func() to kill yet another forward declaration. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/d0a8f9d9a0057ed761d6237167d51c9378798d2d.1628536684.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:09:22 -06:00
Pavel Begunkov	e9dbe221f5	io_uring: optimise putting task struct We cache all the reference to task + tctx, so if io_put_task() is called by the corresponding task itself, we can save on atomics and return the refs right back into the cache. It's beneficial for all inline completions, and also iopolling, when polling and submissions are done by the same task, including SQPOLL\|IOPOLL. Note: io_uring_cancel_generic() can return refs to the cache as well, so those should be flushed in the loop for tctx_inflight() to work right. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/6fe9646b3cb70e46aca1f58426776e368c8926b3.1628471125.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:08:06 -06:00
Pavel Begunkov	af066f31eb	io_uring: drop exec checks from io_req_task_submit In case of on-exec io_uring cancellations, tasks already wait for all submitted requests to get completed/cancelled, so we don't need to check for ->in_execve separately. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/be8707049f10df9d20ca03dc4ca3316239b5e8e0.1628471125.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:08:06 -06:00
Pavel Begunkov	bbbca09489	io_uring: kill unused IO_IOPOLL_BATCH IO_IOPOLL_BATCH is not used, delete it. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/b2bdf19dbee2c9fc8865bbab9412135a14e24a64.1628471125.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:08:06 -06:00
Pavel Begunkov	58d3be2c60	io_uring: improve ctx hang handling If io_ring_exit_work() can't get it done in 5 minutes, something is going very wrong, don't keep spinning at HZ / 20 rate, it doesn't help and it may take much of CPU time if there is a lot of workers stuck as such. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/9e2d1ca81d569f6bc628af1a42ff6663bff7ce9c.1628471125.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:07:59 -06:00
Pavel Begunkov	d3fddf6ddd	io_uring: deduplicate open iopoll check Move IORING_SETUP_IOPOLL check into __io_openat_prep(), so both openat and openat2 reuse it. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/9a73ce83e4ee60d011180ef177eecef8e87ff2a2.1628471125.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:07:59 -06:00
Pavel Begunkov	543af3a13d	io_uring: inline io_free_req_deferred Inline io_free_req_deferred(), there is no reason to keep it separated. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/ce04b7180d4eac0d69dd00677b227eefe80c2cc5.1628471125.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:07:59 -06:00
Pavel Begunkov	b9bd2bea0f	io_uring: move io_rsrc_node_alloc() definition Move the function together with io_rsrc_node_ref_zero() in the source file as it is to get rid of forward declarations. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/4d81f6f833e7d017860b24463a9a68b14a8a5ed2.1628471125.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:07:59 -06:00
Pavel Begunkov	6a290a1442	io_uring: move io_put_task() definition Move the function in the source file as it is to get rid of forward declarations. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/33d917d69e4206557c75a5b98fe22bcdf77ce47d.1628471125.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:07:59 -06:00
Pavel Begunkov	e73c5c7cd3	io_uring: extract a helper for ctx quiesce Refactor __io_uring_register() by extracting a helper responsible for ctx queisce. Looks better and will make it easier to add more optimisations. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/0339e0027504176be09237eefa7945bf9a6f153d.1628471125.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:07:59 -06:00
Pavel Begunkov	90291099f2	io_uring: optimise io_cqring_wait() hot path Turns out we always init struct io_wait_queue in io_cqring_wait(), even if it's not used after, i.e. there are already enough of CQEs. And often it's exactly what happens, for instance, requests may have been completed inline, or in case of io_uring_enter(submit=N, wait=1). It shows up in my profiler, so optimise it by delaying the struct init. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/6f1b81c60b947d165583dc333947869c3d85d037.1628471125.git.asml.silence@gmail.com [axboe: fixed up for new cqring wait] Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:07:59 -06:00
Pavel Begunkov	282cdc8693	io_uring: add more locking annotations for submit Add more annotations for submission path functions holding ->uring_lock. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/128ec4185e26fbd661dd3a424aa66108ee8ff951.1628471125.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:07:59 -06:00
Pavel Begunkov	a2416e1ec2	io_uring: don't halt iopoll too early IOPOLL users should care more about getting completions for requests they submitted, but not in "device did/completed something". Currently, io_do_iopoll() may return a positive number, which will instruct io_iopoll_check() to break the loop and end the syscall, even if there is not enough CQEs or none at all. Don't return positive numbers, so io_iopoll_check() exits only when it gets an actual error, need reschedule or got enough CQEs. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/641a88f751623b6758303b3171f0a4141f06726e.1628471125.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:07:59 -06:00
Pavel Begunkov	864ea921b0	io_uring: refactor io_alloc_req Replace the main if of io_flush_cached_reqs() with inverted condition + goto, so all the cases are handled in the same way. And also extract io_preinit_req() to make it cleaner and easier to refer to. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/1abcba1f7b55dc53bf1dbe95036e345ffb1d5b01.1628471125.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:07:59 -06:00
Pavel Begunkov	8724dd8c83	io-wq: improve wq_list_add_tail() Prepare nodes that we're going to add before actually linking them, it's always safer and costs us nothing. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/f7e53f0c84c02ed6748c488ed0789b98f8cc6185.1628471125.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:07:56 -06:00
Pavel Begunkov	2215bed924	io_uring: remove unnecessary PF_EXITING check We prefer nornal task_works even if it would fail requests inside. Kill a PF_EXITING check in io_req_task_work_add(), task_work_add() handles well dying tasks, i.e. return error when can't enqueue due to late stages of do_exit(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/fc14297e8441cd8f5d1743a2488cf0df09bf48ac.1628471125.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:07:56 -06:00
Pavel Begunkov	ebc11b6c6b	io_uring: clean io-wq callbacks Move io-wq callbacks closer to each other, so it's easier to work with them, and rename io_free_work() into io_wq_free_work() for consistency. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/851bbc7f0f86f206d8c1333efee8bcb9c26e419f.1628471125.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:07:56 -06:00
Pavel Begunkov	c97d8a0f68	io_uring: avoid touching inode in rw prep If we use fixed files, we can be sure (almost) that REQ_F_ISREG is set. However, for non-reg files io_prep_rw() still will look into inode to double check, and that's expensive and can be avoided. The only caveat is that it only currently works with 64+ bit architectures, see FFS_ISREG, so we should consider that. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/0a62780c491ca2522cd52db4ae3f16e03aafed0f.1628471125.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:07:56 -06:00
Pavel Begunkov	b191e2dfe5	io_uring: rename io_file_supports_async() io_file_supports_async() checks whether a file supports nowait operations, so "async" in the name is misleading. Rename it. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/33d55b5ce43aa1884c637c1957f1e30d30dc3bec.1628471125.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:07:56 -06:00
Pavel Begunkov	ac177053bb	io_uring: inline fixed part of io_file_get() Optimise io_file_get() with registered files, which is in a hot path, by inlining parts of the function. Saves a function call, and inefficiencies of passing arguments, e.g. evaluating (sqe_flags & IOSQE_FIXED_FILE). It couldn't have been done before as compilers were refusing to inline it because of the function size. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/52115cd6ce28f33bd0923149c0e6cb611084a0b1.1628471125.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:07:56 -06:00
Pavel Begunkov	042b0d85ea	io_uring: use kvmalloc for fixed files Instead of hand-coded two-level tables for registered files, allocate them with kvmalloc(). In many cases small enough tables are enough, and so can be kmalloc()'ed removing an extra memory load and a bunch of bit logic instructions from the hot path. If the table is larger, we trade off all the pros with a TLB-assisted memory lookup. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/280421d3b48775dabab773006bb5588c7b2dabc0.1628471125.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:07:56 -06:00
Jens Axboe	5fd4617840	io_uring: be smarter about waking multiple CQ ring waiters Currently we only wake the first waiter, even if we have enough entries posted to satisfy multiple waiters. Improve that situation so that every waiter knows how much the CQ tail has to advance before they can be safely woken up. With this change, if we have N waiters each asking for 1 event and we get 4 completions, then we wake up 4 waiters. If we have N waiters asking for 2 completions and we get 4 completions, then we wake up the first two. Previously, only the first waiter would've been woken up. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:07:56 -06:00
Jens Axboe	d3e9f732c4	io-wq: remove GFP_ATOMIC allocation off schedule out path Daniel reports that the v5.14-rc4-rt4 kernel throws a BUG when running stress-ng: \| [ 90.202543] BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:35 \| [ 90.202549] in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 2047, name: iou-wrk-2041 \| [ 90.202555] CPU: 5 PID: 2047 Comm: iou-wrk-2041 Tainted: G W 5.14.0-rc4-rt4+ #89 \| [ 90.202559] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-2 04/01/2014 \| [ 90.202561] Call Trace: \| [ 90.202577] dump_stack_lvl+0x34/0x44 \| [ 90.202584] ___might_sleep.cold+0x87/0x94 \| [ 90.202588] rt_spin_lock+0x19/0x70 \| [ 90.202593] ___slab_alloc+0xcb/0x7d0 \| [ 90.202598] ? newidle_balance.constprop.0+0xf5/0x3b0 \| [ 90.202603] ? dequeue_entity+0xc3/0x290 \| [ 90.202605] ? io_wqe_dec_running.isra.0+0x98/0xe0 \| [ 90.202610] ? pick_next_task_fair+0xb9/0x330 \| [ 90.202612] ? __schedule+0x670/0x1410 \| [ 90.202615] ? io_wqe_dec_running.isra.0+0x98/0xe0 \| [ 90.202618] kmem_cache_alloc_trace+0x79/0x1f0 \| [ 90.202621] io_wqe_dec_running.isra.0+0x98/0xe0 \| [ 90.202625] io_wq_worker_sleeping+0x37/0x50 \| [ 90.202628] schedule+0x30/0xd0 \| [ 90.202630] schedule_timeout+0x8f/0x1a0 \| [ 90.202634] ? __bpf_trace_tick_stop+0x10/0x10 \| [ 90.202637] io_wqe_worker+0xfd/0x320 \| [ 90.202641] ? finish_task_switch.isra.0+0xd3/0x290 \| [ 90.202644] ? io_worker_handle_work+0x670/0x670 \| [ 90.202646] ? io_worker_handle_work+0x670/0x670 \| [ 90.202649] ret_from_fork+0x22/0x30 which is due to the RT kernel not liking a GFP_ATOMIC allocation inside a raw spinlock. Besides that not working on RT, doing any kind of allocation from inside schedule() is kind of nasty and should be avoided if at all possible. This particular path happens when an io-wq worker goes to sleep, and we need a new worker to handle pending work. We currently allocate a small data item to hold the information we need to create a new worker, but we can instead include this data in the io_worker struct itself and just protect it with a single bit lock. We only really need one per worker anyway, as we will have run pending work between to sleep cycles. https://lore.kernel.org/lkml/20210804082418.fbibprcwtzyt5qax@beryllium.lan/ Reported-by: Daniel Wagner <dwagner@suse.de> Tested-by: Daniel Wagner <dwagner@suse.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:07:56 -06:00
Chao Yu	94c821fb28	f2fs: rebuild nat_bits during umount If all free_nat_bitmap are available, we can rebuild nat_bits from free_nat_bitmap entirely during umount, let's make another chance to reenable nat_bits for image. Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2021-08-23 10:25:52 -07:00
Daeho Jeong	a4b6817625	f2fs: introduce periodic iostat io latency traces Whenever we notice some sluggish issues on our machines, we are always curious about how well all types of I/O in the f2fs filesystem are handled. But, it's hard to get this kind of real data. First of all, we need to reproduce the issue while turning on the profiling tool like blktrace, but the issue doesn't happen again easily. Second, with the intervention of any tools, the overall timing of the issue will be slightly changed and it sometimes makes us hard to figure it out. So, I added the feature printing out IO latency statistics tracepoint events, which are minimal things to understand filesystem's I/O related behaviors, into F2FS_IOSTAT kernel config. With "iostat_enable" sysfs node on, we can get this statistics info in a periodic way and it would cause the least overhead. [samples] f2fs_ckpt-254:1-507 [003] .... 2842.439683: f2fs_iostat_latency: dev = (254,11), iotype [peak lat.(ms)/avg lat.(ms)/count], rd_data [136/1/801], rd_node [136/1/1704], rd_meta [4/2/4], wr_sync_data [164/16/3331], wr_sync_node [152/3/648], wr_sync_meta [160/2/4243], wr_async_data [24/13/15], wr_async_node [0/0/0], wr_async_meta [0/0/0] f2fs_ckpt-254:1-507 [002] .... 2845.450514: f2fs_iostat_latency: dev = (254,11), iotype [peak lat.(ms)/avg lat.(ms)/count], rd_data [60/3/456], rd_node [60/3/1258], rd_meta [0/0/1], wr_sync_data [120/12/2285], wr_sync_node [88/5/428], wr_sync_meta [52/6/2990], wr_async_data [4/1/3], wr_async_node [0/0/0], wr_async_meta [0/0/0] Signed-off-by: Daeho Jeong <daehojeong@google.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2021-08-23 10:25:51 -07:00
Daeho Jeong	521187439a	f2fs: separate out iostat feature Added F2FS_IOSTAT config option to support getting IO statistics through sysfs and printing out periodic IO statistics tracepoint events and moved I/O statistics related codes into separate files for better maintenance. Signed-off-by: Daeho Jeong <daehojeong@google.com> Reviewed-by: Chao Yu <chao@kernel.org> [Jaegeuk Kim: set default=y] Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2021-08-23 10:25:51 -07:00
J. Bruce Fields	b661601a9f	lockd: update nlm_lookup_file reexport comment Update comment to reflect that we do allow reexport, whether it's a good idea or not.... Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2021-08-23 12:56:17 -04:00
J. Bruce Fields	a81041b7d8	nlm: minor refactoring Make this lookup slightly more concise, and prepare for changing how we look this up in a following patch. Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2021-08-23 12:56:17 -04:00
J. Bruce Fields	2dc6f19e4f	nlm: minor nlm_lookup_file argument change It'll come in handy to get the whole nlm_lock. Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2021-08-23 12:56:03 -04:00
aalexandrovich	11e4e66efd	Merge branch 'torvalds:master' into master	2021-08-23 16:44:22 +03:00
Desmond Cheong Zhi Xi	0d977e0eba	btrfs: reset replace target device to allocation state on close This crash was observed with a failed assertion on device close: BTRFS: Transaction aborted (error -28) WARNING: CPU: 1 PID: 3902 at fs/btrfs/extent-tree.c:2150 btrfs_run_delayed_refs+0x1d2/0x1e0 [btrfs] Modules linked in: btrfs blake2b_generic libcrc32c crc32c_intel xor zstd_decompress zstd_compress xxhash lzo_compress lzo_decompress raid6_pq loop CPU: 1 PID: 3902 Comm: kworker/u8:4 Not tainted 5.14.0-rc5-default+ #1532 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014 Workqueue: events_unbound btrfs_async_reclaim_metadata_space [btrfs] RIP: 0010:btrfs_run_delayed_refs+0x1d2/0x1e0 [btrfs] RSP: 0018:ffffb7a5452d7d80 EFLAGS: 00010282 RAX: 0000000000000000 RBX: 0000000000000003 RCX: 0000000000000000 RDX: 0000000000000001 RSI: ffffffffabee13c4 RDI: 00000000ffffffff RBP: ffff97834176a378 R08: 0000000000000001 R09: 0000000000000001 R10: 0000000000000000 R11: 0000000000000001 R12: ffff97835195d388 R13: 0000000005b08000 R14: ffff978385484000 R15: 000000000000016c FS: 0000000000000000(0000) GS:ffff9783bd800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000056190d003fe8 CR3: 000000002a81e005 CR4: 0000000000170ea0 Call Trace: flush_space+0x197/0x2f0 [btrfs] btrfs_async_reclaim_metadata_space+0x139/0x300 [btrfs] process_one_work+0x262/0x5e0 worker_thread+0x4c/0x320 ? process_one_work+0x5e0/0x5e0 kthread+0x144/0x170 ? set_kthread_struct+0x40/0x40 ret_from_fork+0x1f/0x30 irq event stamp: 19334989 hardirqs last enabled at (19334997): [<ffffffffab0e0c87>] console_unlock+0x2b7/0x400 hardirqs last disabled at (19335006): [<ffffffffab0e0d0d>] console_unlock+0x33d/0x400 softirqs last enabled at (19334900): [<ffffffffaba0030d>] __do_softirq+0x30d/0x574 softirqs last disabled at (19334893): [<ffffffffab0721ec>] irq_exit_rcu+0x12c/0x140 ---[ end trace 45939e308e0dd3c7 ]--- BTRFS: error (device vdd) in btrfs_run_delayed_refs:2150: errno=-28 No space left BTRFS info (device vdd): forced readonly BTRFS warning (device vdd): failed setting block group ro: -30 BTRFS info (device vdd): suspending dev_replace for unmount assertion failed: !test_bit(BTRFS_DEV_STATE_REPLACE_TGT, &device->dev_state), in fs/btrfs/volumes.c:1150 ------------[ cut here ]------------ kernel BUG at fs/btrfs/ctree.h:3431! invalid opcode: 0000 [#1] PREEMPT SMP CPU: 1 PID: 3982 Comm: umount Tainted: G W 5.14.0-rc5-default+ #1532 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014 RIP: 0010:assertfail.constprop.0+0x18/0x1a [btrfs] RSP: 0018:ffffb7a5454c7db8 EFLAGS: 00010246 RAX: 0000000000000068 RBX: ffff978364b91c00 RCX: 0000000000000000 RDX: 0000000000000000 RSI: ffffffffabee13c4 RDI: 00000000ffffffff RBP: ffff9783523a4c00 R08: 0000000000000001 R09: 0000000000000001 R10: 0000000000000000 R11: 0000000000000001 R12: ffff9783523a4d18 R13: 0000000000000000 R14: 0000000000000004 R15: 0000000000000003 FS: 00007f61c8f42800(0000) GS:ffff9783bd800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000056190cffa810 CR3: 0000000030b96002 CR4: 0000000000170ea0 Call Trace: btrfs_close_one_device.cold+0x11/0x55 [btrfs] close_fs_devices+0x44/0xb0 [btrfs] btrfs_close_devices+0x48/0x160 [btrfs] generic_shutdown_super+0x69/0x100 kill_anon_super+0x14/0x30 btrfs_kill_super+0x12/0x20 [btrfs] deactivate_locked_super+0x2c/0xa0 cleanup_mnt+0x144/0x1b0 task_work_run+0x59/0xa0 exit_to_user_mode_loop+0xe7/0xf0 exit_to_user_mode_prepare+0xaf/0xf0 syscall_exit_to_user_mode+0x19/0x50 do_syscall_64+0x4a/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xae This happens when close_ctree is called while a dev_replace hasn't completed. In close_ctree, we suspend the dev_replace, but keep the replace target around so that we can resume the dev_replace procedure when we mount the root again. This is the call trace: close_ctree(): btrfs_dev_replace_suspend_for_unmount(); btrfs_close_devices(): btrfs_close_fs_devices(): btrfs_close_one_device(): ASSERT(!test_bit(BTRFS_DEV_STATE_REPLACE_TGT, &device->dev_state)); However, since the replace target sticks around, there is a device with BTRFS_DEV_STATE_REPLACE_TGT set on close, and we fail the assertion in btrfs_close_one_device. To fix this, if we come across the replace target device when closing, we should properly reset it back to allocation state. This fix also ensures that if a non-target device has a corrupted state and has the BTRFS_DEV_STATE_REPLACE_TGT bit set, the assertion will still catch the error. Reported-by: David Sterba <dsterba@suse.com> Fixes: `b2a6166768` ("btrfs: fix rw device counting in __btrfs_free_extra_devids") CC: stable@vger.kernel.org # 4.19+ Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Desmond Cheong Zhi Xi <desmondcheongzx@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-23 13:57:18 +02:00
Stian Skjelstad	58bc6d1be2	udf_get_extendedattr() had no boundary checks. When parsing the ExtendedAttr data, malicous or corrupt attribute length could cause kernel hangs and buffer overruns in some special cases. Link: https://lore.kernel.org/r/20210822093332.25234-1-stian.skjelstad@gmail.com Signed-off-by: Stian Skjelstad <stian.skjelstad@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz>	2021-08-23 13:35:19 +02:00
Naohiro Aota	939c7feb19	btrfs: zoned: fix ordered extent boundary calculation btrfs_lookup_ordered_extent() is supposed to query the offset in a file instead of the logical address. Pass the file offset from submit_extent_page() to calc_bio_boundaries(). Also, calc_bio_boundaries() relies on the bio's operation flag, so move the call site after setting it. Fixes: `390ed29b81` ("btrfs: refactor submit_extent_page() to make bio and its flag tracing easier") Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-23 13:19:16 +02:00
Josef Bacik	1146239794	btrfs: do not do preemptive flushing if the majority is global rsv A common characteristic of the bug report where preemptive flushing was going full tilt was the fact that the vast majority of the free metadata space was used up by the global reserve. The hard 90% threshold would cover the majority of these cases, but to be even smarter we should take into account how much of the outstanding reservations are covered by the global block reserve. If the global block reserve accounts for the vast majority of outstanding reservations, skip preemptive flushing, as it will likely just cause churn and pain. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=212185 Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-23 13:19:16 +02:00
Josef Bacik	93c60b17f2	btrfs: reduce the preemptive flushing threshold to 90% The preemptive flushing code was added in order to avoid needing to synchronously wait for ENOSPC flushing to recover space. Once we're almost full however we can essentially flush constantly. We were using 98% as a threshold to determine if we were simply full, however in practice this is a really high bar to hit. For example reports of systems running into this problem had around 94% usage and thus continued to flush. Fix this by lowering the threshold to 90%, which is a more sane value, especially for smaller file systems. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=212185 CC: stable@vger.kernel.org # 5.12+ Fixes: `576fa34830` ("btrfs: improve preemptive background space flushing") Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-23 13:19:15 +02:00
Marcos Paulo de Souza	3736127a3a	btrfs: tree-log: check btrfs_lookup_data_extent return value Function btrfs_lookup_data_extent calls btrfs_search_slot to verify if the EXTENT_ITEM exists in the extent tree. btrfs_search_slot can return values bellow zero if an error happened. Function replay_one_extent currently checks if the search found something (0 returned) and increments the reference, and if not, it seems to evaluate as 'not found'. Fix the condition by checking if the value was bellow zero and return early. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-23 13:19:15 +02:00
Filipe Manana	8be2ba2e0e	btrfs: avoid unnecessarily logging directories that had no changes There are several cases where when logging an inode we need to log its parent directories or logging subdirectories when logging a directory. There are cases however where we end up logging a directory even if it was not changed in the current transaction, no dentries added or removed since the last transaction. While this is harmless from a functional point of view, it is a waste time as it brings no advantage. One example where this is triggered is the following: $ mkfs.btrfs -f /dev/sdc $ mount /dev/sdc /mnt $ mkdir /mnt/A $ mkdir /mnt/B $ mkdir /mnt/C $ touch /mnt/A/foo $ ln /mnt/A/foo /mnt/B/bar $ ln /mnt/A/foo /mnt/C/baz $ sync $ rm -f /mnt/A/foo $ xfs_io -c "fsync" /mnt/B/bar This last fsync ends up logging directories A, B and C, however we only need to log directory A, as B and C were not changed since the last transaction commit. So fix this by changing need_log_inode(), to return false in case the given inode is a directory and has a ->last_trans value smaller than the current transaction's ID. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-23 13:19:15 +02:00
Christian Brauner	5b9b26f5d0	btrfs: allow idmapped mount Now that we converted btrfs internally to account for idmapped mounts allow the creation of idmapped mounts on by setting the FS_ALLOW_IDMAP flag. We only need to raise this flag on the btrfs_root_fs_type filesystem since btrfs_mount_root() is ultimately responsible for allocating the superblock and is called into from btrfs_mount() associated with btrfs_fs_type. The conversion of the btrfs inode operations was straightforward. Regarding btrfs specific ioctls that perform checks based on inode permissions only those have been allowed that are not filesystem wide operations and hence can be reasonably charged against a specific mount. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-23 13:19:15 +02:00
Christian Brauner	4a8b34afa9	btrfs: handle ACLs on idmapped mounts Make the ACL code idmapped mount aware. The POSIX default and POSIX access ACLs are the only ACLs other than some specific xattrs that take DAC permissions into account. On an idmapped mount they need to be translated according to the mount's userns. The main change is done to __btrfs_set_acl() which is responsible for translating POSIX ACLs to their final on-disk representation. The btrfs_init_acl() helper does not need to take the idmapped mount into account since it is called in the context of file creation operations (mknod, create, mkdir, symlink, tmpfile) and is used for btrfs_init_inode_security() to copy POSIX default and POSIX access permissions from the parent directory. These ACLs need to be inherited unmodified from the parent directory. This is identical to what we do for ext4 and xfs. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-23 13:19:15 +02:00
Christian Brauner	6623d9a0b0	btrfs: allow idmapped INO_LOOKUP_USER ioctl The INO_LOOKUP_USER is an unprivileged version of the INO_LOOKUP ioctl and has the following restrictions. The main difference between the two is that INO_LOOKUP is filesystem wide operation wheres INO_LOOKUP_USER is scoped beneath the file descriptor passed with the ioctl. Specifically, INO_LOOKUP_USER must adhere to the following restrictions: - The caller must be privileged over each inode of each path component for the path they are trying to lookup. - The path for the subvolume the caller is trying to lookup must be reachable from the inode associated with the file descriptor passed with the ioctl. The second condition makes it possible to scope the lookup of the path to the mount identified by the file descriptor passed with the ioctl. This allows us to enable this ioctl on idmapped mounts. Specifically, this is possible because all child subvolumes of a parent subvolume are reachable when the parent subvolume is mounted. So if the user had access to open the parent subvolume or has been given the fd then they can lookup the path if they had access to it provided they were privileged over each path component. Note, the INO_LOOKUP_USER ioctl allows a user to learn the path and name of a subvolume even though they would otherwise be restricted from doing so via regular VFS-based lookup. So think about a parent subvolume with multiple child subvolumes. Someone could mount he parent subvolume and restrict access to the child subvolumes by overmounting them with empty directories. At this point the user can't traverse the child subvolumes and they can't open files in the child subvolumes. However, they can still learn the path of child subvolumes as long as they have access to the parent subvolume by using the INO_LOOKUP_USER ioctl. The underlying assumption here is that it's ok that the lookup ioctls can't really take mounts into account other than the original mount the fd belongs to during lookup. Since this assumption is baked into the original INO_LOOKUP_USER ioctl we can extend it to idmapped mounts. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-23 13:19:15 +02:00
Christian Brauner	39e1674ff0	btrfs: allow idmapped SUBVOL_SETFLAGS ioctl Setting flags on subvolumes or snapshots are core features of btrfs. The SUBVOL_SETFLAGS ioctl is especially important as it allows to make subvolumes and snapshots read-only or read-write. Allow setting flags on btrfs subvolumes and snapshots on idmapped mounts. This is a fairly straightforward operation since all the permission checking helpers are already capable of handling idmapped mounts. So we just need to pass down the mount's userns. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-23 13:19:14 +02:00
Christian Brauner	e4fed17a32	btrfs: allow idmapped SET_RECEIVED_SUBVOL ioctls The SET_RECEIVED_SUBVOL ioctls are used to set information about a received subvolume. Make it possible to set information about a received subvolume on idmapped mounts. This is a fairly straightforward operation since all the permission checking helpers are already capable of handling idmapped mounts. So we just need to pass down the mount's userns. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-23 13:19:14 +02:00
Christian Brauner	aabb34e7a3	btrfs: relax restrictions for SNAP_DESTROY_V2 with subvolids So far we prevented the deletion of subvolumes and snapshots using subvolume ids possible with the BTRFS_SUBVOL_SPEC_BY_ID flag. This restriction is necessary on idmapped mounts as this allows filesystem wide subvolume and snapshot deletions and thus can escape the scope of what's exposed under the mount identified by the fd passed with the ioctl. Deletion by subvolume id works by looking for an alias of the parent of the subvolume or snapshot to be deleted. The parent alias can be anywhere in the filesystem. However, as long as the alias of the parent that is found is the same as the one identified by the file descriptor passed through the ioctl we can allow the deletion. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-23 13:19:14 +02:00
Christian Brauner	c4ed533bdc	btrfs: allow idmapped SNAP_DESTROY ioctls Destroying subvolumes and snapshots are important features of btrfs. Both operations are available to unprivileged users if the filesystem has been mounted with the "user_subvol_rm_allowed" mount option. Allow subvolume and snapshot deletion on idmapped mounts. This is a fairly straightforward operation since all the permission checking helpers are already capable of handling idmapped mounts. So we just need to pass down the mount's userns. Subvolumes and snapshots can either be deleted by specifying their name or - if BTRFS_IOC_SNAP_DESTROY_V2 is used - by their subvolume or snapshot id if the BTRFS_SUBVOL_SPEC_BY_ID is set. This feature is blocked on idmapped mounts as this allows filesystem wide subvolume deletions and thus can escape the scope of what's exposed under the mount identified by the fd passed with the ioctl. This means that even the root or CAP_SYS_ADMIN capable user can't delete a subvolume via BTRFS_SUBVOL_SPEC_BY_ID. This is intentional. The root user is currently already subject to permission checks in btrfs_may_delete() including whether the inode's i_uid/i_gid of the directory the subvolume is located in have a mapping in the caller's idmapping. For this to fail isn't currently possible since a btrfs filesystem can't be mounted with a non-initial idmapping but it shows that even the root user would fail to delete a subvolume if the relevant inode isn't mapped in their idmapping. The idmapped mount case is the same in principle. This isn't a huge problem a root user wanting to delete arbitrary subvolumes can just always create another (even detached) mount without an idmapping attached. In addition, we will allow BTRFS_SUBVOL_SPEC_BY_ID for cases where the subvolume to delete is directly located under inode referenced by the fd passed for the ioctl() in a follow-up commit. Here is an example where a btrfs subvolume is deleted through a subvolume mount that does not expose the subvolume to be delete but it can still be deleted by using the subvolume id: /* Compile the following program as "delete_by_spec". / #define _GNU_SOURCE #include <fcntl.h> #include <inttypes.h> #include <linux/btrfs.h> #include <stdio.h> #include <stdlib.h> #include <sys/ioctl.h> #include <sys/stat.h> #include <sys/types.h> #include <unistd.h> static int rm_subvolume_by_id(int fd, uint64_t subvolid) { struct btrfs_ioctl_vol_args_v2 args = {}; int ret; args.flags = BTRFS_SUBVOL_SPEC_BY_ID; args.subvolid = subvolid; ret = ioctl(fd, BTRFS_IOC_SNAP_DESTROY_V2, &args); if (ret < 0) return -1; return 0; } int main(int argc, char argv[]) { int subvolid = 0; if (argc < 3) exit(1); fprintf(stderr, "Opening %s\n", argv[1]); int fd = open(argv[1], O_CLOEXEC \| O_DIRECTORY); if (fd < 0) exit(2); subvolid = atoi(argv[2]); fprintf(stderr, "Deleting subvolume with subvolid %d\n", subvolid); int ret = rm_subvolume_by_id(fd, subvolid); if (ret < 0) exit(3); exit(0); } #include <stdio.h>" #include <stdlib.h>" #include <linux/btrfs.h" truncate -s 10G btrfs.img mkfs.btrfs btrfs.img export LOOPDEV=$(sudo losetup -f --show btrfs.img) mount ${LOOPDEV} /mnt sudo chown $(id -u):$(id -g) /mnt btrfs subvolume create /mnt/A btrfs subvolume create /mnt/B/C # Get subvolume id via: sudo btrfs subvolume show /mnt/A # Save subvolid SUBVOLID=<nr> sudo umount /mnt sudo mount ${LOOPDEV} -o subvol=B/C,user_subvol_rm_allowed /mnt ./delete_by_spec /mnt ${SUBVOLID} With idmapped mounts this can potentially be used by users to delete subvolumes/snapshots they would otherwise not have access to as the idmapping would be applied to an inode that is not exposed in the mount of the subvolume. The fact that this is a filesystem wide operation suggests it might be a good idea to expose this under a separate ioctl that clearly indicates this. In essence, the file descriptor passed with the ioctl is merely used to identify the filesystem on which to operate when BTRFS_SUBVOL_SPEC_BY_ID is used. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-23 13:19:14 +02:00
Christian Brauner	4d4340c912	btrfs: allow idmapped SNAP_CREATE/SUBVOL_CREATE ioctls Creating subvolumes and snapshots is one of the core features of btrfs and is even available to unprivileged users. Make it possible to use subvolume and snapshot creation on idmapped mounts. This is a fairly straightforward operation since all the permission checking helpers are already capable of handling idmapped mounts. So we just need to pass down the mount's userns. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-23 13:19:14 +02:00
Christian Brauner	5474bf400f	btrfs: check whether fsgid/fsuid are mapped during subvolume creation When a new subvolume is created btrfs currently doesn't check whether the fsgid/fsuid of the caller actually have a mapping in the user namespace attached to the filesystem. The VFS always checks this to make sure that the caller's fsgid/fsuid can be represented on-disk. This is most relevant for filesystems that can be mounted inside user namespaces but it is in general a good hardening measure to prevent unrepresentable gid/uid from being written to disk. Since we want to support idmapped mounts for btrfs ioctls to create subvolumes in follow-up patches this becomes important since we want to make sure the fsgid/fsuid of the caller as mapped according to the idmapped mount can be represented on-disk. Simply add the missing fsuidgid_has_mapping() line from the VFS may_create() version to btrfs_may_create(). Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-23 13:19:14 +02:00
Christian Brauner	3bc71ba02c	btrfs: allow idmapped permission inode op Enable btrfs_permission() to handle idmapped mounts. This is just a matter of passing down the mount's userns. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-23 13:19:13 +02:00
Christian Brauner	d4d0946461	btrfs: allow idmapped setattr inode op Enable btrfs_setattr() to handle idmapped mounts. This is just a matter of passing down the mount's userns. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-23 13:19:13 +02:00
Christian Brauner	98b6ab5fc0	btrfs: allow idmapped tmpfile inode op Enable btrfs_tmpfile() to handle idmapped mounts. This is just a matter of passing down the mount's userns. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-23 13:19:13 +02:00
Christian Brauner	5a0521086e	btrfs: allow idmapped symlink inode op Enable btrfs_symlink() to handle idmapped mounts. This is just a matter of passing down the mount's userns. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-23 13:19:13 +02:00
Christian Brauner	b0b3e44d34	btrfs: allow idmapped mkdir inode op Enable btrfs_mkdir() to handle idmapped mounts. This is just a matter of passing down the mount's userns. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-23 13:19:13 +02:00
Christian Brauner	e93ca491d0	btrfs: allow idmapped create inode op Enable btrfs_create() to handle idmapped mounts. This is just a matter of passing down the mount's userns. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-23 13:19:13 +02:00
Christian Brauner	72105277dc	btrfs: allow idmapped mknod inode op Enable btrfs_mknod() to handle idmapped mounts. This is just a matter of passing down the mount's userns. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-23 13:19:12 +02:00
Christian Brauner	c020d2eaf1	btrfs: allow idmapped getattr inode op Enable btrfs_getattr() to handle idmapped mounts. This is just a matter of passing down the mount's userns. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-23 13:19:12 +02:00
Christian Brauner	ca07274c3d	btrfs: allow idmapped rename inode op Enable btrfs_rename() to handle idmapped mounts. This is just a matter of passing down the mount's userns. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-23 13:19:12 +02:00
Christian Brauner	b3b6f5b922	btrfs: handle idmaps in btrfs_new_inode() Extend btrfs_new_inode() to take the idmapped mount into account when initializing a new inode. This is just a matter of passing down the mount's userns. The rest is taken care of in inode_init_owner(). This is a preliminary patch to make the individual btrfs inode operations idmapped mount aware. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-23 13:19:12 +02:00
Christian Brauner	c2fd68b6b2	namei: add mapping aware lookup helper Various filesystems rely on the lookup_one_len() helper to lookup a single path component relative to a well-known starting point. Allow such filesystems to support idmapped mounts by adding a version of this helper to take the idmap into account when calling inode_permission(). This change is a required to let btrfs (and other filesystems) support idmapped mounts. Cc: Christoph Hellwig <hch@infradead.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-23 13:19:12 +02:00
Anand Jain	e7849e33cf	btrfs: sysfs: document structures and their associated files Sysfs file has grown big. It takes some time to locate the correct struct attribute to add new files. Create a table and map the struct attribute to its sysfs path. Also, fix the comment about the debug sysfs path. And add the comments to the attributes instead of attribute group, where sysfs file names are defined. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-23 13:19:12 +02:00
Qu Wenruo	e4571b8c5e	btrfs: fix NULL pointer dereference when deleting device by invalid id [BUG] It's easy to trigger NULL pointer dereference, just by removing a non-existing device id: # mkfs.btrfs -f -m single -d single /dev/test/scratch1 \ /dev/test/scratch2 # mount /dev/test/scratch1 /mnt/btrfs # btrfs device remove 3 /mnt/btrfs Then we have the following kernel NULL pointer dereference: BUG: kernel NULL pointer dereference, address: 0000000000000000 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 0 P4D 0 Oops: 0000 [#1] PREEMPT SMP NOPTI CPU: 9 PID: 649 Comm: btrfs Not tainted 5.14.0-rc3-custom+ #35 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 RIP: 0010:btrfs_rm_device+0x4de/0x6b0 [btrfs] btrfs_ioctl+0x18bb/0x3190 [btrfs] ? lock_is_held_type+0xa5/0x120 ? find_held_lock.constprop.0+0x2b/0x80 ? do_user_addr_fault+0x201/0x6a0 ? lock_release+0xd2/0x2d0 ? __x64_sys_ioctl+0x83/0xb0 __x64_sys_ioctl+0x83/0xb0 do_syscall_64+0x3b/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xae [CAUSE] Commit `a27a94c2b0` ("btrfs: Make btrfs_find_device_by_devspec return btrfs_device directly") moves the "missing" device path check into btrfs_rm_device(). But btrfs_rm_device() itself can have case where it only receives @devid, with NULL as @device_path. In that case, calling strcmp() on NULL will trigger the NULL pointer dereference. Before that commit, we handle the "missing" case inside btrfs_find_device_by_devspec(), which will not check @device_path at all if @devid is provided, thus no way to trigger the bug. [FIX] Before calling strcmp(), also make sure @device_path is not NULL. Fixes: `a27a94c2b0` ("btrfs: Make btrfs_find_device_by_devspec return btrfs_device directly") CC: stable@vger.kernel.org # 5.4+ Reported-by: butt3rflyh4ck <butterflyhuangxx@gmail.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-23 13:19:11 +02:00
Naohiro Aota	63fb5879db	btrfs: zoned: add asserts on splitting extent_map We call split_zoned_em() on an extent_map on submitting a bio for it. Thus, we can assume the extent_map is PINNED, not LOGGING, and in the modified list. Add ASSERT()s to ensure the extent_maps after the split also has the proper flags set and are in the modified list. Suggested-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-23 13:19:11 +02:00
Naohiro Aota	0ae79c6fe7	btrfs: zoned: fix block group alloc_offset calculation alloc_offset is offset from the start of a block group and @offset is actually an address in logical space. Thus, we need to consider block_group->start when calculating them. Fixes: `011b41bffa` ("btrfs: zoned: advance allocation pointer after tree log node") CC: stable@vger.kernel.org # 5.12+ Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-23 13:19:11 +02:00
Naohiro Aota	ba86dd9fe6	btrfs: zoned: suppress reclaim error message on EAGAIN btrfs_relocate_chunk() can fail with -EAGAIN when e.g. send operations are running. The message can fail btrfs/187 and it's unnecessary because we anyway add it back to the reclaim list. btrfs_reclaim_bgs_work() `-> btrfs_relocate_chunk() `-> btrfs_relocate_block_group() `-> reloc_chunk_start() `-> if (fs_info->send_in_progress) `-> return -EAGAIN CC: stable@vger.kernel.org # 5.13+ Fixes: `18bb8bbf13` ("btrfs: zoned: automatically reclaim zones") Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-23 13:19:11 +02:00
Johannes Thumshirn	77233c2d2e	btrfs: zoned: allow disabling of zone auto reclaim Automatically reclaiming dirty zones might not always be desired for all workloads, especially as there are currently still some rough edges with the relocation code on zoned filesystems. Allow disabling zone auto reclaim on a per filesystem basis by writing 0 as the threshold value. Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-23 13:19:11 +02:00
Filipe Manana	1f29537302	btrfs: update comment at log_conflicting_inodes() A comment at log_conflicting_inodes() mentions that we check the inode's logged_trans field instead of using btrfs_inode_in_log() because the field last_log_commit is not updated when we log that an inode exists and the inode has the full sync flag (BTRFS_INODE_NEEDS_FULL_SYNC) set. The part about the full sync flag is not true anymore since commit `9acc8103ab` ("btrfs: fix unpersisted i_size on fsync after expanding truncate"), so update the comment to not mention that part anymore. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-23 13:19:11 +02:00
Filipe Manana	d135a53396	btrfs: remove no longer needed full sync flag check at inode_logged() Now that we are checking if the inode's logged_trans is 0 to detect the possibility of the inode having been evicted and reloaded, the test for the full sync flag (BTRFS_INODE_NEEDS_FULL_SYNC) is no longer needed at tree-log.c:inode_logged(). Its purpose was to detect the possibility of a previous eviction as well, since when an inode is loaded the full sync flag is always set on it (and only cleared after the inode is logged). So just remove the check and update the comment. The check for the inode's logged_trans being 0 was added recently by the patch with the subject "btrfs: eliminate some false positives when checking if inode was logged". Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-23 13:19:10 +02:00
Filipe Manana	1c167b87f4	btrfs: remove unnecessary NULL check for the new inode during rename exchange At the very end of btrfs_rename_exchange(), in case an error happened, we are checking if 'new_inode' is NULL, but that is not needed since during a rename exchange, unlike regular renames, 'new_inode' can never be NULL, and if it were, we would have a crashed much earlier when we dereference it multiple times. So remove the check because it is not necessary and because it is causing static checkers to emit a warning. I probably introduced the check by copy-pasting similar code from btrfs_rename(), where 'new_inode' can be NULL, in commit `86e8aa0e77` ("Btrfs: unpin logs if rename exchange operation fails"). Reported-by: kernel test robot <lkp@intel.com> Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-23 13:19:10 +02:00
Goldwyn Rodrigues	dce2815039	btrfs: allocate backref_ctx on stack in find_extent_clone Instead of using kmalloc() to allocate backref_ctx, allocate backref_ctx on stack. The size is reasonably small. sizeof(backref_ctx) = 48 Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-23 13:19:10 +02:00
Goldwyn Rodrigues	c853a5783e	btrfs: allocate btrfs_ioctl_defrag_range_args on stack Instead of using kmalloc() to allocate btrfs_ioctl_defrag_range_args, allocate btrfs_ioctl_defrag_range_args on stack, the size is reasonably small and ioctls are called in process context. sizeof(btrfs_ioctl_defrag_range_args) = 48 Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-23 13:19:10 +02:00
Goldwyn Rodrigues	0afb603afc	btrfs: allocate btrfs_ioctl_quota_rescan_args on stack Instead of using kmalloc() to allocate btrfs_ioctl_quota_rescan_args, allocate btrfs_ioctl_quota_rescan_args on stack, the size is reasonably small and ioctls are called in process context. sizeof(btrfs_ioctl_quota_rescan_args) = 64 Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-23 13:19:10 +02:00
Goldwyn Rodrigues	98caf9531e	btrfs: allocate file_ra_state on stack in readahead_cache Instead of allocating file_ra_state using kmalloc, allocate on stack. sizeof(struct readahead) = 32 bytes. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-23 13:19:10 +02:00
Marcos Paulo de Souza	0ff40a910f	btrfs: introduce btrfs_search_backwards function It's a common practice to start a search using offset (u64)-1, which is the u64 maximum value, meaning that we want the search_slot function to be set in the last item with the same objectid and type. Once we are in this position, it's a matter to start a search backwards by calling btrfs_previous_item, which will check if we'll need to go to a previous leaf and other necessary checks, only to be sure that we are in last offset of the same object and type. The new btrfs_search_backwards function does the all these steps when necessary, and can be used to avoid code duplication. Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-23 13:19:09 +02:00
David Sterba	ea3dc7d2d1	btrfs: print if fsverity support is built in when loading module As fsverity support depends on a config option, print that at module load time like we do for similar features. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-23 13:19:09 +02:00
Boris Burkov	705242538f	btrfs: verity metadata orphan items Writing out the verity data is too large of an operation to do in a single transaction. If we are interrupted before we finish creating fsverity metadata for a file, or fail to clean up already created metadata after a failure, we could leak the verity items that we already committed. To address this issue, we use the orphan mechanism. When we start enabling verity on a file, we also add an orphan item for that inode. When we are finished, we delete the orphan. However, if we are interrupted midway, the orphan will be present at mount and we can cleanup the half-formed verity state. There is a possible race with a normal unlink operation: if unlink and verity run on the same file in parallel, it is possible for verity to succeed and delete the still legitimate orphan added by unlink. Then, if we are interrupted and mount in that state, we will never clean up the inode properly. This is also possible for a file created with O_TMPFILE. Check nlink==0 before deleting to avoid this race. A final thing to note is that this is a resurrection of using orphans to signal an operation besides "delete this inode". The old case was to signal the need to do a truncate. That case still technically applies for mounting very old file systems, so we need to take some care to not clobber it. To that end, we just have to be careful that verity orphan cleanup is a no-op for non-verity files. Signed-off-by: Boris Burkov <boris@bur.io> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-23 13:19:09 +02:00
Boris Burkov	146054090b	btrfs: initial fsverity support Add support for fsverity in btrfs. To support the generic interface in fs/verity, we add two new item types in the fs tree for inodes with verity enabled. One stores the per-file verity descriptor and btrfs verity item and the other stores the Merkle tree data itself. Verity checking is done in end_page_read just before a page is marked uptodate. This naturally handles a variety of edge cases like holes, preallocated extents, and inline extents. Some care needs to be taken to not try to verity pages past the end of the file, which are accessed by the generic buffered file reading code under some circumstances like reading to the end of the last page and trying to read again. Direct IO on a verity file falls back to buffered reads. Verity relies on PageChecked for the Merkle tree data itself to avoid re-walking up shared paths in the tree. For this reason, we need to cache the Merkle tree data. Since the file is immutable after verity is turned on, we can cache it at an index past EOF. Use the new inode ro_flags to store verity on the inode item, so that we can enable verity on a file, then rollback to an older kernel and still mount the file system and read the file. Since we can't safely write the file anymore without ruining the invariants of the Merkle tree, we mark a ro_compat flag on the file system when a file has verity enabled. Acked-by: Eric Biggers <ebiggers@google.com> Co-developed-by: Chris Mason <clm@fb.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Boris Burkov <boris@bur.io> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-23 13:19:09 +02:00
Boris Burkov	77eea05e78	btrfs: add ro compat flags to inodes Currently, inode flags are fully backwards incompatible in btrfs. If we introduce a new inode flag, then tree-checker will detect it and fail. This can even cause us to fail to mount entirely. To make it possible to introduce new flags which can be read-only compatible, like VERITY, we add new ro flags to btrfs without treating them quite so harshly in tree-checker. A read-only file system can survive an unexpected flag, and can be mounted. As for the implementation, it unfortunately gets a little complicated. The on-disk representation of the inode, btrfs_inode_item, has an __le64 for flags but the in-memory representation, btrfs_inode, uses a u32. David Sterba had the nice idea that we could reclaim those wasted 32 bits on disk and use them for the new ro_compat flags. It turns out that the tree-checker code which checks for unknown flags is broken, and ignores the upper 32 bits we are hoping to use. The issue is that the flags use the literal 1 rather than 1ULL, so the flags are signed ints, and one of them is specifically (1 << 31). As a result, the mask which ORs the flags is a negative integer on machines where int is 32 bit twos complement. When tree-checker evaluates the expression: btrfs_inode_flags(leaf, iitem) & ~BTRFS_INODE_FLAG_MASK) The mask is something like 0x80000abc, which gets promoted to u64 with sign extension to 0xffffffff80000abc. Negating that 64 bit mask leaves all the upper bits zeroed, and we can't detect unexpected flags. This suggests that we can't use those bits after all. Luckily, we have good reason to believe that they are zero anyway. Inode flags are metadata, which is always checksummed, so any bit flips that would introduce 1s would cause a checksum failure anyway (excluding the improbable case of the checksum getting corrupted exactly badly). Further, unless the 1 << 31 flag is used, the cast to u64 of the 32 bit inode flag should preserve its value and not add leading zeroes (at least for twos complement). The only place that flag (BTRFS_INODE_ROOT_ITEM_INIT) is used is in a special inode embedded in the root item, and indeed for that inode we see 0xffffffff80000000 as the flags on disk. However, that inode is never seen by tree checker, nor is it used in a context where verity might be meaningful. Theoretically, a future ro flag might cause trouble on that inode, so we should proactively clean up that mess before it does. With the introduction of the new ro flags, keep two separate unsigned masks and check them against the appropriate u32. Since we no longer run afoul of sign extension, this also stops writing out 0xffffffff80000000 in root_item inodes going forward. Signed-off-by: Boris Burkov <boris@bur.io> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-23 13:19:09 +02:00
Anand Jain	efc222f8d7	btrfs: simplify return values in btrfs_check_raid_min_devices Function btrfs_check_raid_min_devices() returns error code from the enum btrfs_err_code and it starts from 1. So there is no need to check if ret is > 0. So drop this check and also drop the local variable ret. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-23 13:19:09 +02:00
Qu Wenruo	7361b4ae03	btrfs: remove the dead comment in writepage_delalloc() When btrfs_run_delalloc_range() failed, we will error out. But there is a strange comment mentioning that btrfs_run_delalloc_range() could have returned value >0 to indicate the IO has already started. Commit `40f765805f` ("Btrfs: split up __extent_writepage to lower stack usage") introduced the comment, but unfortunately at that time, we were already using @page_started to indicate that case, and still return 0. Furthermore, even if that comment was right (which is not), we would return -EIO if the IO had already started. By all means the comment is incorrect, just remove the comment along with the dead check. Just to be extra safe, add an ASSERT() in btrfs_run_delalloc_range() to make sure we either return 0 or error, no positive return value. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-23 13:19:08 +02:00
David Sterba	b2f78e8805	btrfs: allow degenerate raid0/raid10 The data on raid0 and raid10 are supposed to be spread over multiple devices, so the minimum constraints are set to 2 and 4 respectively. This is an artificial limit and there's some interest to remove it. Change this to allow raid0 on one device and raid10 on two devices. This works as expected eg. when converting or removing devices. The only difference is when raid0 on two devices gets one device removed. Unpatched would silently create a single profile, while newly it would be raid0. The motivation is to allow to preserve the profile type as long as it possible for some intermediate state (device removal, conversion), or when there are disks of different size, with raid0 the otherwise unusable space of the last device will be used too. Similarly for raid10, though the two largest devices would need to be the same. Unpatched kernel will mount and use the degenerate profiles just fine but won't allow any operation that would not satisfy the stricter device number constraints, eg. not allowing to go from 3 to 2 devices for raid10 or various profile conversions. Example output: # btrfs fi us -T . Overall: Device size: 10.00GiB Device allocated: 1.01GiB Device unallocated: 8.99GiB Device missing: 0.00B Used: 200.61MiB Free (estimated): 9.79GiB (min: 9.79GiB) Free (statfs, df): 9.79GiB Data ratio: 1.00 Metadata ratio: 1.00 Global reserve: 3.25MiB (used: 0.00B) Multiple profiles: no Data Metadata System Id Path RAID0 single single Unallocated -- ---------- --------- --------- -------- ----------- 1 /dev/sda10 1.00GiB 8.00MiB 1.00MiB 8.99GiB -- ---------- --------- --------- -------- ----------- Total 1.00GiB 8.00MiB 1.00MiB 8.99GiB Used 200.25MiB 352.00KiB 16.00KiB # btrfs dev us . /dev/sda10, ID: 1 Device size: 10.00GiB Device slack: 0.00B Data,RAID0/1: 1.00GiB Metadata,single: 8.00MiB System,single: 1.00MiB Unallocated: 8.99GiB Note "Data,RAID0/1", with btrfs-progs 5.13+ the number of devices per profile is printed. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-23 13:19:08 +02:00
Filipe Manana	bd54f381a1	btrfs: do not pin logs too early during renames During renames we pin the logs of the roots a bit too early, before the calls to btrfs_insert_inode_ref(). We can pin the logs after those calls, since those will not change anything in a log tree. In a scenario where we have multiple and diverse filesystem operations running in parallel, those calls can take a significant amount of time, due to lock contention on extent buffers, and delay log commits from other tasks for longer than necessary. So just pin logs after calls to btrfs_insert_inode_ref() and right before the first operation that can update a log tree. The following script that uses dbench was used for testing: $ cat dbench-test.sh #!/bin/bash DEV=/dev/nvme0n1 MNT=/mnt/nvme0n1 MOUNT_OPTIONS="-o ssd" MKFS_OPTIONS="-m single -d single" echo "performance" \| tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor umount $DEV &> /dev/null mkfs.btrfs -f $MKFS_OPTIONS $DEV mount $MOUNT_OPTIONS $DEV $MNT dbench -D $MNT -t 120 16 umount $MNT The tests were run on a machine with 12 cores, 64G of RAN, a NVMe device and using a non-debug kernel config (Debian's default config). The results compare a branch without this patch and without the previous patch in the series, that has the subject: "btrfs: eliminate some false positives when checking if inode was logged" Versus the same branch with these two patches applied. dbench with 8 clients, results before: Operation Count AvgLat MaxLat ---------------------------------------- NTCreateX 4391359 0.009 249.745 Close 3225882 0.001 3.243 Rename 185953 0.065 240.643 Unlink 886669 0.049 249.906 Deltree 112 2.455 217.433 Mkdir 56 0.002 0.004 Qpathinfo 3980281 0.004 3.109 Qfileinfo 697579 0.001 0.187 Qfsinfo 729780 0.002 2.424 Sfileinfo 357764 0.004 1.415 Find 1538861 0.016 4.863 WriteX `2189666` 0.010 3.327 ReadX 6883443 0.002 0.729 LockX 14298 0.002 0.073 UnlockX 14298 0.001 0.042 Flush 307777 2.447 303.663 Throughput 1149.6 MB/sec 8 clients 8 procs max_latency=303.666 ms dbench with 8 clients, results after: Operation Count AvgLat MaxLat ---------------------------------------- NTCreateX 4269920 0.009 213.532 Close 3136653 0.001 0.690 Rename 180805 0.082 213.858 Unlink 862189 0.050 172.893 Deltree 112 2.998 218.328 Mkdir 56 0.002 0.003 Qpathinfo 3870158 0.004 5.072 Qfileinfo 678375 0.001 0.194 Qfsinfo 709604 0.002 0.485 Sfileinfo 347850 0.004 1.304 Find 1496310 0.017 5.504 WriteX 2129613 0.010 2.882 ReadX 6693066 0.002 1.517 LockX 13902 0.002 0.075 UnlockX 13902 0.001 0.055 Flush 299276 2.511 220.189 Throughput 1187.33 MB/sec 8 clients 8 procs max_latency=220.194 ms +3.2% throughput, -31.8% max latency dbench with 16 clients, results before: Operation Count AvgLat MaxLat ---------------------------------------- NTCreateX 5978334 0.028 156.507 Close 4391598 0.001 1.345 Rename 253136 0.241 155.057 Unlink 1207220 0.182 257.344 Deltree 160 6.123 36.277 Mkdir 80 0.003 0.005 Qpathinfo 5418817 0.012 6.867 Qfileinfo 949929 0.001 0.941 Qfsinfo 993560 0.002 1.386 Sfileinfo 486904 0.004 2.829 Find 2095088 0.059 8.164 WriteX 2982319 0.017 9.029 ReadX `9371484` 0.002 4.052 LockX 19470 0.002 0.461 UnlockX 19470 0.001 0.990 Flush 418936 2.740 347.902 Throughput 1495.31 MB/sec 16 clients 16 procs max_latency=347.909 ms dbench with 16 clients, results after: Operation Count AvgLat MaxLat ---------------------------------------- NTCreateX 5711833 0.029 131.240 Close 4195897 0.001 1.732 Rename 241849 0.204 147.831 Unlink 1153341 0.184 231.322 Deltree 160 6.086 30.198 Mkdir 80 0.003 0.021 Qpathinfo 5177011 0.012 7.150 Qfileinfo 907768 0.001 0.793 Qfsinfo 949205 0.002 1.431 Sfileinfo 465317 0.004 2.454 Find 2001541 0.058 7.819 WriteX 2850661 0.017 9.110 ReadX 8952289 0.002 3.991 LockX 18596 0.002 0.655 UnlockX 18596 0.001 0.179 Flush 400342 2.879 293.607 Throughput 1565.73 MB/sec 16 clients 16 procs max_latency=293.611 ms +4.6% throughput, -16.9% max latency Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-23 13:19:08 +02:00
Filipe Manana	6e8e777deb	btrfs: eliminate some false positives when checking if inode was logged When checking if an inode was previously logged in the current transaction through the helper inode_logged(), we can return some false positives that can be easily eliminated. These correspond to the cases where an inode has a ->logged_trans value that is not zero and its value is smaller then the ID of the current transaction. This means we know exactly that the inode was never logged before in the current transaction, so we can return false and avoid the callers to do extra work: 1) Having btrfs_del_dir_entries_in_log() and btrfs_del_inode_ref_in_log() unnecessarily join a log transaction and do deletion searches in a log tree that will not find anything. This just adds unnecessary contention on extent buffer locks; 2) Having btrfs_log_new_name() unnecessarily log an inode when it is not needed. If the inode was not logged before, we don't need to log it in LOG_INODE_EXISTS mode. So just make sure that any false positive only happens when ->logged_trans has a value of 0. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-23 13:19:08 +02:00
Naohiro Aota	42b5d73b5d	btrfs: drop unnecessary ASSERT from btrfs_submit_direct() When on SINGLE block group, btrfs_get_io_geometry() will return "the size of the block group - the offset of the logical address within the block group" as geom.len. Since we allow up to 8 GiB zone size on zoned filesystem, we can have up to 8 GiB block group, so can have up to 8 GiB geom.len as well. With this setup, we easily hit the "ASSERT(geom.len <= INT_MAX);". The ASSERT looks like to guard btrfs_bio_clone_partial() and bio_trim() which both take "int" (now u64 due to the previous patch). So to be precise the ASSERT should check if clone_len <= UINT_MAX. But actually, clone_len is already capped by bio.bi_iter.bi_size which is unsigned int. So the ASSERT is not necessary. Drop the ASSERT and properly compare submit_len and geom.len in u64. Then, let the implicit casting to convert it to u64. Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-23 13:19:08 +02:00
Chaitanya Kulkarni	21dda654d4	btrfs: fix argument type of btrfs_bio_clone_partial() The offset and can never be negative use unsigned int instead of int type for them. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-23 13:19:08 +02:00
Josef Bacik	5662c967c6	fs: kill sync_inode Now that all users of sync_inode() have been deleted, remove sync_inode(). Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-23 13:19:07 +02:00
Josef Bacik	25d23cd016	9p: migrate from sync_inode to filemap_fdatawrite_wbc We're going to remove sync_inode, so migrate to filemap_fdatawrite_wbc instead. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-23 13:19:07 +02:00
Josef Bacik	b377630527	btrfs: use the filemap_fdatawrite_wbc helper for delalloc shrinking sync_inode() has some holes that can cause problems if we're under heavy ENOSPC pressure. If there's writeback running on a separate thread sync_inode() will skip writing the inode altogether. What we really want is to make sure writeback has been started on all the pages to make sure we can see the ordered extents and wait on them if appropriate. Switch to this new helper which will allow us to accomplish this and avoid ENOSPC'ing early. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-23 13:19:07 +02:00
Josef Bacik	e16460707e	btrfs: wait on async extents when flushing delalloc I've been debugging an early ENOSPC problem in production and finally root caused it to this problem. When we switched to the per-inode in `38d715f494` ("btrfs: use btrfs_start_delalloc_roots in shrink_delalloc") I pulled out the async extent handling, because we were doing the correct thing by calling filemap_flush() if we had async extents set. This would properly wait on any async extents by locking the page in the second flush, thus making sure our ordered extents were properly set up. However when I switched us back to page based flushing, I used sync_inode(), which allows us to pass in our own wbc. The problem here is that sync_inode() is smarter than the filemap_* helpers, it tries to avoid calling writepages at all. This means that our second call could skip calling do_writepages altogether, and thus not wait on the pagelock for the async helpers. This means we could come back before any ordered extents were created and then simply continue on in our flushing mechanisms and ENOSPC out when we have plenty of space to use. Fix this by putting back the async pages logic in shrink_delalloc. This allows us to bulk write out everything that we need to, and then we can wait in one place for the async helpers to catch up, and then wait on any ordered extents that are created. Fixes: `e076ab2a2c` ("btrfs: shrink delalloc pages instead of full inodes") CC: stable@vger.kernel.org # 5.10+ Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-23 13:19:07 +02:00
Josef Bacik	03fe78cc29	btrfs: use delalloc_bytes to determine flush amount for shrink_delalloc We have been hitting some early ENOSPC issues in production with more recent kernels, and I tracked it down to us simply not flushing delalloc as aggressively as we should be. With tracing I was seeing us failing all tickets with all of the block rsvs at or around 0, with very little pinned space, but still around 120MiB of outstanding bytes_may_used. Upon further investigation I saw that we were flushing around 14 pages per shrink call for delalloc, despite having around 2GiB of delalloc outstanding. Consider the example of a 8 way machine, all CPUs trying to create a file in parallel, which at the time of this commit requires 5 items to do. Assuming a 16k leaf size, we have 10MiB of total metadata reclaim size waiting on reservations. Now assume we have 128MiB of delalloc outstanding. With our current math we would set items to 20, and then set to_reclaim to 20 * 256k, or 5MiB. Assuming that we went through this loop all 3 times, for both FLUSH_DELALLOC and FLUSH_DELALLOC_WAIT, and then did the full loop twice, we'd only flush 60MiB of the 128MiB delalloc space. This could leave a fair bit of delalloc reservations still hanging around by the time we go to ENOSPC out all the remaining tickets. Fix this two ways. First, change the calculations to be a fraction of the total delalloc bytes on the system. Prior to this change we were calculating based on dirty inodes so our math made more sense, now it's just completely unrelated to what we're actually doing. Second add a FLUSH_DELALLOC_FULL state, that we hold off until we've gone through the flush states at least once. This will empty the system of all delalloc so we're sure to be truly out of space when we start failing tickets. I'm tagging stable 5.10 and forward, because this is where we started using the page stuff heavily again. This affects earlier kernel versions as well, but would be a pain to backport to them as the flushing mechanisms aren't the same. CC: stable@vger.kernel.org # 5.10+ Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-23 13:19:07 +02:00

... 3 4 5 6 7 ...

72782 Commits