Commit Graph

82736 Commits

Author SHA1 Message Date
Lu Hongfei 302efbef9d fs: iomap: Change the type of blocksize from 'int' to 'unsigned int' in iomap_file_buffered_write_punch_delalloc
The return value type of i_blocksize() is 'unsigned int', so the
type of blocksize has been modified from 'int' to 'unsigned int'
to ensure data type consistency.

Signed-off-by: Lu Hongfei <luhongfei@vivo.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2023-06-29 09:22:50 -07:00
Shyam Prasad N 61986a58bc cifs: new dynamic tracepoint to track ses not found errors
It is perfectly valid to not find session not found errors
when a reconnect of a session happens when requests for the
same session are happening in parallel.

We had these log messages as VFS logs. My last change dumped
these logs as FYI logs.

This change just creates a new dynamic tracepoint to capture
events of this type, just in case it is useful while
debugging issues in the future.

Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-06-29 09:58:09 -05:00
Shyam Prasad N ac615db03b cifs: log session id when a matching ses is not found
We do not log the session id in crypt_setup when a matching
session is not found. Printing the session id helps debugging
here. This change does just that.

This change also changes this log to FYI, since it is normal to
see then during a reconnect. Doing the same for a similar log
in case of signed connections.

The plan is to have a tracepoint for this event, so that we will
be able to see this event if need be. That will be done as
another change.

Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-06-29 09:58:06 -05:00
Arnd Bergmann 9cedc58bdb ksmbd: avoid field overflow warning
clang warns about a possible field overflow in a memcpy:

In file included from fs/smb/server/smb_common.c:7:
include/linux/fortify-string.h:583:4: error: call to '__write_overflow_field' declared with 'warning' attribute: detected write beyond size of field (1st parameter); maybe use struct_group()? [-Werror,-Wattribute-warning]
                        __write_overflow_field(p_size_field, size);

It appears to interpret the "&out[baselen + 4]" as referring to a single
byte of the character array, while the equivalen "out + baselen + 4" is
seen as an offset into the array.

I don't see that kind of warning elsewhere, so just go with the simple
rework.

Fixes: e2f34481b2 ("cifsd: add server-side procedures for SMB3")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-06-28 22:58:28 -05:00
Linus Torvalds 9471f1f2f5 Merge branch 'expand-stack'
This modifies our user mode stack expansion code to always take the
mmap_lock for writing before modifying the VM layout.

It's actually something we always technically should have done, but
because we didn't strictly need it, we were being lazy ("opportunistic"
sounds so much better, doesn't it?) about things, and had this hack in
place where we would extend the stack vma in-place without doing the
proper locking.

And it worked fine.  We just needed to change vm_start (or, in the case
of grow-up stacks, vm_end) and together with some special ad-hoc locking
using the anon_vma lock and the mm->page_table_lock, it all was fairly
straightforward.

That is, it was all fine until Ruihan Li pointed out that now that the
vma layout uses the maple tree code, we *really* don't just change
vm_start and vm_end any more, and the locking really is broken.  Oops.

It's not actually all _that_ horrible to fix this once and for all, and
do proper locking, but it's a bit painful.  We have basically three
different cases of stack expansion, and they all work just a bit
differently:

 - the common and obvious case is the page fault handling. It's actually
   fairly simple and straightforward, except for the fact that we have
   something like 24 different versions of it, and you end up in a maze
   of twisty little passages, all alike.

 - the simplest case is the execve() code that creates a new stack.
   There are no real locking concerns because it's all in a private new
   VM that hasn't been exposed to anybody, but lockdep still can end up
   unhappy if you get it wrong.

 - and finally, we have GUP and page pinning, which shouldn't really be
   expanding the stack in the first place, but in addition to execve()
   we also use it for ptrace(). And debuggers do want to possibly access
   memory under the stack pointer and thus need to be able to expand the
   stack as a special case.

None of these cases are exactly complicated, but the page fault case in
particular is just repeated slightly differently many many times.  And
ia64 in particular has a fairly complicated situation where you can have
both a regular grow-down stack _and_ a special grow-up stack for the
register backing store.

So to make this slightly more manageable, the bulk of this series is to
first create a helper function for the most common page fault case, and
convert all the straightforward architectures to it.

Thus the new 'lock_mm_and_find_vma()' helper function, which ends up
being used by x86, arm, powerpc, mips, riscv, alpha, arc, csky, hexagon,
loongarch, nios2, sh, sparc32, and xtensa.  So we not only convert more
than half the architectures, we now have more shared code and avoid some
of those twisty little passages.

And largely due to this common helper function, the full diffstat of
this series ends up deleting more lines than it adds.

That still leaves eight architectures (ia64, m68k, microblaze, openrisc,
parisc, s390, sparc64 and um) that end up doing 'expand_stack()'
manually because they are doing something slightly different from the
normal pattern.  Along with the couple of special cases in execve() and
GUP.

So there's a couple of patches that first create 'locked' helper
versions of the stack expansion functions, so that there's a obvious
path forward in the conversion.  The execve() case is then actually
pretty simple, and is a nice cleanup from our old "grow-up stackls are
special, because at execve time even they grow down".

The #ifdef CONFIG_STACK_GROWSUP in that code just goes away, because
it's just more straightforward to write out the stack expansion there
manually, instead od having get_user_pages_remote() do it for us in some
situations but not others and have to worry about locking rules for GUP.

And the final step is then to just convert the remaining odd cases to a
new world order where 'expand_stack()' is called with the mmap_lock held
for reading, but where it might drop it and upgrade it to a write, only
to return with it held for reading (in the success case) or with it
completely dropped (in the failure case).

In the process, we remove all the stack expansion from GUP (where
dropping the lock wouldn't be ok without special rules anyway), and add
it in manually to __access_remote_vm() for ptrace().

Thanks to Adrian Glaubitz and Frank Scheiner who tested the ia64 cases.
Everything else here felt pretty straightforward, but the ia64 rules for
stack expansion are really quite odd and very different from everything
else.  Also thanks to Vegard Nossum who caught me getting one of those
odd conditions entirely the wrong way around.

Anyway, I think I want to actually move all the stack expansion code to
a whole new file of its own, rather than have it split up between
mm/mmap.c and mm/memory.c, but since this will have to be backported to
the initial maple tree vma introduction anyway, I tried to keep the
patches _fairly_ minimal.

Also, while I don't think it's valid to expand the stack from GUP, the
final patch in here is a "warn if some crazy GUP user wants to try to
expand the stack" patch.  That one will be reverted before the final
release, but it's left to catch any odd cases during the merge window
and release candidates.

Reported-by: Ruihan Li <lrh2000@pku.edu.cn>

* branch 'expand-stack':
  gup: add warning if some caller would seem to want stack expansion
  mm: always expand the stack with the mmap write lock held
  execve: expand new process stack manually ahead of time
  mm: make find_extend_vma() fail if write lock not held
  powerpc/mm: convert coprocessor fault to lock_mm_and_find_vma()
  mm/fault: convert remaining simple cases to lock_mm_and_find_vma()
  arm/mm: Convert to using lock_mm_and_find_vma()
  riscv/mm: Convert to using lock_mm_and_find_vma()
  mips/mm: Convert to using lock_mm_and_find_vma()
  powerpc/mm: Convert to using lock_mm_and_find_vma()
  arm64/mm: Convert to using lock_mm_and_find_vma()
  mm: make the page fault mmap locking killable
  mm: introduce new 'lock_mm_and_find_vma()' page fault helper
2023-06-28 20:35:21 -07:00
Linus Torvalds 3a8a670eee Networking changes for 6.5.
Core
 ----
 
  - Rework the sendpage & splice implementations. Instead of feeding
    data into sockets page by page extend sendmsg handlers to support
    taking a reference on the data, controlled by a new flag called
    MSG_SPLICE_PAGES. Rework the handling of unexpected-end-of-file
    to invoke an additional callback instead of trying to predict what
    the right combination of MORE/NOTLAST flags is.
    Remove the MSG_SENDPAGE_NOTLAST flag completely.
 
  - Implement SCM_PIDFD, a new type of CMSG type analogous to
    SCM_CREDENTIALS, but it contains pidfd instead of plain pid.
 
  - Enable socket busy polling with CONFIG_RT.
 
  - Improve reliability and efficiency of reporting for ref_tracker.
 
  - Auto-generate a user space C library for various Netlink families.
 
 Protocols
 ---------
 
  - Allow TCP to shrink the advertised window when necessary, prevent
    sk_rcvbuf auto-tuning from growing the window all the way up to
    tcp_rmem[2].
 
  - Use per-VMA locking for "page-flipping" TCP receive zerocopy.
 
  - Prepare TCP for device-to-device data transfers, by making sure
    that payloads are always attached to skbs as page frags.
 
  - Make the backoff time for the first N TCP SYN retransmissions
    linear. Exponential backoff is unnecessarily conservative.
 
  - Create a new MPTCP getsockopt to retrieve all info (MPTCP_FULL_INFO).
 
  - Avoid waking up applications using TLS sockets until we have
    a full record.
 
  - Allow using kernel memory for protocol ioctl callbacks, paving
    the way to issuing ioctls over io_uring.
 
  - Add nolocalbypass option to VxLAN, forcing packets to be fully
    encapsulated even if they are destined for a local IP address.
 
  - Make TCPv4 use consistent hash in TIME_WAIT and SYN_RECV. Ensure
    in-kernel ECMP implementation (e.g. Open vSwitch) select the same
    link for all packets. Support L4 symmetric hashing in Open vSwitch.
 
  - PPPoE: make number of hash bits configurable.
 
  - Allow DNS to be overwritten by DHCPACK in the in-kernel DHCP client
    (ipconfig).
 
  - Add layer 2 miss indication and filtering, allowing higher layers
    (e.g. ACL filters) to make forwarding decisions based on whether
    packet matched forwarding state in lower devices (bridge).
 
  - Support matching on Connectivity Fault Management (CFM) packets.
 
  - Hide the "link becomes ready" IPv6 messages by demoting their
    printk level to debug.
 
  - HSR: don't enable promiscuous mode if device offloads the proto.
 
  - Support active scanning in IEEE 802.15.4.
 
  - Continue work on Multi-Link Operation for WiFi 7.
 
 BPF
 ---
 
  - Add precision propagation for subprogs and callbacks. This allows
    maintaining verification efficiency when subprograms are used,
    or in fact passing the verifier at all for complex programs,
    especially those using open-coded iterators.
 
  - Improve BPF's {g,s}setsockopt() length handling. Previously BPF
    assumed the length is always equal to the amount of written data.
    But some protos allow passing a NULL buffer to discover what
    the output buffer *should* be, without writing anything.
 
  - Accept dynptr memory as memory arguments passed to helpers.
 
  - Add routing table ID to bpf_fib_lookup BPF helper.
 
  - Support O_PATH FDs in BPF_OBJ_PIN and BPF_OBJ_GET commands.
 
  - Drop bpf_capable() check in BPF_MAP_FREEZE command (used to mark
    maps as read-only).
 
  - Show target_{obj,btf}_id in tracing link fdinfo.
 
  - Addition of several new kfuncs (most of the names are self-explanatory):
    - Add a set of new dynptr kfuncs: bpf_dynptr_adjust(),
      bpf_dynptr_is_null(), bpf_dynptr_is_rdonly(), bpf_dynptr_size()
      and bpf_dynptr_clone().
    - bpf_task_under_cgroup()
    - bpf_sock_destroy() - force closing sockets
    - bpf_cpumask_first_and(), rework bpf_cpumask_any*() kfuncs
 
 Netfilter
 ---------
 
  - Relax set/map validation checks in nf_tables. Allow checking
    presence of an entry in a map without using the value.
 
  - Increase ip_vs_conn_tab_bits range for 64BIT builds.
 
  - Allow updating size of a set.
 
  - Improve NAT tuple selection when connection is closing.
 
 Driver API
 ----------
 
  - Integrate netdev with LED subsystem, to allow configuring HW
    "offloaded" blinking of LEDs based on link state and activity
    (i.e. packets coming in and out).
 
  - Support configuring rate selection pins of SFP modules.
 
  - Factor Clause 73 auto-negotiation code out of the drivers, provide
    common helper routines.
 
  - Add more fool-proof helpers for managing lifetime of MDIO devices
    associated with the PCS layer.
 
  - Allow drivers to report advanced statistics related to Time Aware
    scheduler offload (taprio).
 
  - Allow opting out of VF statistics in link dump, to allow more VFs
    to fit into the message.
 
  - Split devlink instance and devlink port operations.
 
 New hardware / drivers
 ----------------------
 
  - Ethernet:
    - Synopsys EMAC4 IP support (stmmac)
    - Marvell 88E6361 8 port (5x1GE + 3x2.5GE) switches
    - Marvell 88E6250 7 port switches
    - Microchip LAN8650/1 Rev.B0 PHYs
    - MediaTek MT7981/MT7988 built-in 1GE PHY driver
 
  - WiFi:
    - Realtek RTL8192FU, 2.4 GHz, b/g/n mode, 2T2R, 300 Mbps
    - Realtek RTL8723DS (SDIO variant)
    - Realtek RTL8851BE
 
  - CAN:
    - Fintek F81604
 
 Drivers
 -------
 
  - Ethernet NICs:
    - Intel (100G, ice):
      - support dynamic interrupt allocation
      - use meta data match instead of VF MAC addr on slow-path
    - nVidia/Mellanox:
      - extend link aggregation to handle 4, rather than just 2 ports
      - spawn sub-functions without any features by default
    - OcteonTX2:
      - support HTB (Tx scheduling/QoS) offload
      - make RSS hash generation configurable
      - support selecting Rx queue using TC filters
    - Wangxun (ngbe/txgbe):
      - add basic Tx/Rx packet offloads
      - add phylink support (SFP/PCS control)
    - Freescale/NXP (enetc):
      - report TAPRIO packet statistics
    - Solarflare/AMD:
      - support matching on IP ToS and UDP source port of outer header
      - VxLAN and GENEVE tunnel encapsulation over IPv4 or IPv6
      - add devlink dev info support for EF10
 
  - Virtual NICs:
    - Microsoft vNIC:
      - size the Rx indirection table based on requested configuration
      - support VLAN tagging
    - Amazon vNIC:
      - try to reuse Rx buffers if not fully consumed, useful for ARM
        servers running with 16kB pages
    - Google vNIC:
      - support TCP segmentation of >64kB frames
 
  - Ethernet embedded switches:
    - Marvell (mv88e6xxx):
      - enable USXGMII (88E6191X)
    - Microchip:
     - lan966x: add support for Egress Stage 0 ACL engine
     - lan966x: support mapping packet priority to internal switch
       priority (based on PCP or DSCP)
 
  - Ethernet PHYs:
    - Broadcom PHYs:
      - support for Wake-on-LAN for BCM54210E/B50212E
      - report LPI counter
    - Microsemi PHYs: support RGMII delay configuration (VSC85xx)
    - Micrel PHYs: receive timestamp in the frame (LAN8841)
    - Realtek PHYs: support optional external PHY clock
    - Altera TSE PCS: merge the driver into Lynx PCS which it is
      a variant of
 
  - CAN: Kvaser PCIEcan:
    - support packet timestamping
 
  - WiFi:
    - Intel (iwlwifi):
      - major update for new firmware and Multi-Link Operation (MLO)
      - configuration rework to drop test devices and split
        the different families
      - support for segmented PNVM images and power tables
      - new vendor entries for PPAG (platform antenna gain) feature
    - Qualcomm 802.11ax (ath11k):
      - Multiple Basic Service Set Identifier (MBSSID) and
        Enhanced MBSSID Advertisement (EMA) support in AP mode
      - support factory test mode
    - RealTek (rtw89):
      - add RSSI based antenna diversity
      - support U-NII-4 channels on 5 GHz band
    - RealTek (rtl8xxxu):
      - AP mode support for 8188f
      - support USB RX aggregation for the newer chips
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmSbJM4ACgkQMUZtbf5S
 IrtoDhAAhEim1+LBIKf4lhPcVdZ2p/TkpnwTz5jsTwSeRBAxTwuNJ2fQhFXg13E3
 MnRq6QaEp8G4/tA/gynLvQop+FEZEnv+horP0zf/XLcC8euU7UrKdrpt/4xxdP07
 IL/fFWsoUGNO+L9LNaHwBo8g7nHvOkPscHEBHc2Xrvzab56TJk6vPySfLqcpKlNZ
 CHWDwTpgRqNZzSKiSpoMVd9OVMKUXcPYHpDmfEJ5l+e8vTXmZzOLHrSELHU5nP5f
 mHV7gxkDCTshoGcaed7UTiOvgu1p6E5EchDJxiLaSUbgsd8SZ3u4oXwRxgj33RK/
 fB2+UaLrRt/DdlHvT/Ph8e8Ygu77yIXMjT49jsfur/zVA0HEA2dFb7V6QlsYRmQp
 J25pnrdXmE15llgqsC0/UOW5J1laTjII+T2T70UOAqQl4LWYAQDG4WwsAqTzU0KY
 dueydDouTp9XC2WYrRUEQxJUzxaOaazskDUHc5c8oHp/zVBT+djdgtvVR9+gi6+7
 yy4elI77FlEEqL0ItdU/lSWINayAlPLsIHkMyhSGKX0XDpKjeycPqkNx4UterXB/
 JKIR5RBWllRft+igIngIkKX0tJGMU0whngiw7d1WLw25wgu4sB53hiWWoSba14hv
 tXMxwZs5iGaPcT38oRVMZz8I1kJM4Dz3SyI7twVvi4RUut64EG4=
 =9i4I
 -----END PGP SIGNATURE-----

Merge tag 'net-next-6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next

Pull networking changes from Jakub Kicinski:
 "WiFi 7 and sendpage changes are the biggest pieces of work for this
  release. The latter will definitely require fixes but I think that we
  got it to a reasonable point.

  Core:

   - Rework the sendpage & splice implementations

     Instead of feeding data into sockets page by page extend sendmsg
     handlers to support taking a reference on the data, controlled by a
     new flag called MSG_SPLICE_PAGES

     Rework the handling of unexpected-end-of-file to invoke an
     additional callback instead of trying to predict what the right
     combination of MORE/NOTLAST flags is

     Remove the MSG_SENDPAGE_NOTLAST flag completely

   - Implement SCM_PIDFD, a new type of CMSG type analogous to
     SCM_CREDENTIALS, but it contains pidfd instead of plain pid

   - Enable socket busy polling with CONFIG_RT

   - Improve reliability and efficiency of reporting for ref_tracker

   - Auto-generate a user space C library for various Netlink families

  Protocols:

   - Allow TCP to shrink the advertised window when necessary, prevent
     sk_rcvbuf auto-tuning from growing the window all the way up to
     tcp_rmem[2]

   - Use per-VMA locking for "page-flipping" TCP receive zerocopy

   - Prepare TCP for device-to-device data transfers, by making sure
     that payloads are always attached to skbs as page frags

   - Make the backoff time for the first N TCP SYN retransmissions
     linear. Exponential backoff is unnecessarily conservative

   - Create a new MPTCP getsockopt to retrieve all info
     (MPTCP_FULL_INFO)

   - Avoid waking up applications using TLS sockets until we have a full
     record

   - Allow using kernel memory for protocol ioctl callbacks, paving the
     way to issuing ioctls over io_uring

   - Add nolocalbypass option to VxLAN, forcing packets to be fully
     encapsulated even if they are destined for a local IP address

   - Make TCPv4 use consistent hash in TIME_WAIT and SYN_RECV. Ensure
     in-kernel ECMP implementation (e.g. Open vSwitch) select the same
     link for all packets. Support L4 symmetric hashing in Open vSwitch

   - PPPoE: make number of hash bits configurable

   - Allow DNS to be overwritten by DHCPACK in the in-kernel DHCP client
     (ipconfig)

   - Add layer 2 miss indication and filtering, allowing higher layers
     (e.g. ACL filters) to make forwarding decisions based on whether
     packet matched forwarding state in lower devices (bridge)

   - Support matching on Connectivity Fault Management (CFM) packets

   - Hide the "link becomes ready" IPv6 messages by demoting their
     printk level to debug

   - HSR: don't enable promiscuous mode if device offloads the proto

   - Support active scanning in IEEE 802.15.4

   - Continue work on Multi-Link Operation for WiFi 7

  BPF:

   - Add precision propagation for subprogs and callbacks. This allows
     maintaining verification efficiency when subprograms are used, or
     in fact passing the verifier at all for complex programs,
     especially those using open-coded iterators

   - Improve BPF's {g,s}setsockopt() length handling. Previously BPF
     assumed the length is always equal to the amount of written data.
     But some protos allow passing a NULL buffer to discover what the
     output buffer *should* be, without writing anything

   - Accept dynptr memory as memory arguments passed to helpers

   - Add routing table ID to bpf_fib_lookup BPF helper

   - Support O_PATH FDs in BPF_OBJ_PIN and BPF_OBJ_GET commands

   - Drop bpf_capable() check in BPF_MAP_FREEZE command (used to mark
     maps as read-only)

   - Show target_{obj,btf}_id in tracing link fdinfo

   - Addition of several new kfuncs (most of the names are
     self-explanatory):
      - Add a set of new dynptr kfuncs: bpf_dynptr_adjust(),
        bpf_dynptr_is_null(), bpf_dynptr_is_rdonly(), bpf_dynptr_size()
        and bpf_dynptr_clone().
      - bpf_task_under_cgroup()
      - bpf_sock_destroy() - force closing sockets
      - bpf_cpumask_first_and(), rework bpf_cpumask_any*() kfuncs

  Netfilter:

   - Relax set/map validation checks in nf_tables. Allow checking
     presence of an entry in a map without using the value

   - Increase ip_vs_conn_tab_bits range for 64BIT builds

   - Allow updating size of a set

   - Improve NAT tuple selection when connection is closing

  Driver API:

   - Integrate netdev with LED subsystem, to allow configuring HW
     "offloaded" blinking of LEDs based on link state and activity
     (i.e. packets coming in and out)

   - Support configuring rate selection pins of SFP modules

   - Factor Clause 73 auto-negotiation code out of the drivers, provide
     common helper routines

   - Add more fool-proof helpers for managing lifetime of MDIO devices
     associated with the PCS layer

   - Allow drivers to report advanced statistics related to Time Aware
     scheduler offload (taprio)

   - Allow opting out of VF statistics in link dump, to allow more VFs
     to fit into the message

   - Split devlink instance and devlink port operations

  New hardware / drivers:

   - Ethernet:
      - Synopsys EMAC4 IP support (stmmac)
      - Marvell 88E6361 8 port (5x1GE + 3x2.5GE) switches
      - Marvell 88E6250 7 port switches
      - Microchip LAN8650/1 Rev.B0 PHYs
      - MediaTek MT7981/MT7988 built-in 1GE PHY driver

   - WiFi:
      - Realtek RTL8192FU, 2.4 GHz, b/g/n mode, 2T2R, 300 Mbps
      - Realtek RTL8723DS (SDIO variant)
      - Realtek RTL8851BE

   - CAN:
      - Fintek F81604

  Drivers:

   - Ethernet NICs:
      - Intel (100G, ice):
         - support dynamic interrupt allocation
         - use meta data match instead of VF MAC addr on slow-path
      - nVidia/Mellanox:
         - extend link aggregation to handle 4, rather than just 2 ports
         - spawn sub-functions without any features by default
      - OcteonTX2:
         - support HTB (Tx scheduling/QoS) offload
         - make RSS hash generation configurable
         - support selecting Rx queue using TC filters
      - Wangxun (ngbe/txgbe):
         - add basic Tx/Rx packet offloads
         - add phylink support (SFP/PCS control)
      - Freescale/NXP (enetc):
         - report TAPRIO packet statistics
      - Solarflare/AMD:
         - support matching on IP ToS and UDP source port of outer
           header
         - VxLAN and GENEVE tunnel encapsulation over IPv4 or IPv6
         - add devlink dev info support for EF10

   - Virtual NICs:
      - Microsoft vNIC:
         - size the Rx indirection table based on requested
           configuration
         - support VLAN tagging
      - Amazon vNIC:
         - try to reuse Rx buffers if not fully consumed, useful for ARM
           servers running with 16kB pages
      - Google vNIC:
         - support TCP segmentation of >64kB frames

   - Ethernet embedded switches:
      - Marvell (mv88e6xxx):
         - enable USXGMII (88E6191X)
      - Microchip:
         - lan966x: add support for Egress Stage 0 ACL engine
         - lan966x: support mapping packet priority to internal switch
           priority (based on PCP or DSCP)

   - Ethernet PHYs:
      - Broadcom PHYs:
         - support for Wake-on-LAN for BCM54210E/B50212E
         - report LPI counter
      - Microsemi PHYs: support RGMII delay configuration (VSC85xx)
      - Micrel PHYs: receive timestamp in the frame (LAN8841)
      - Realtek PHYs: support optional external PHY clock
      - Altera TSE PCS: merge the driver into Lynx PCS which it is a
        variant of

   - CAN: Kvaser PCIEcan:
      - support packet timestamping

   - WiFi:
      - Intel (iwlwifi):
         - major update for new firmware and Multi-Link Operation (MLO)
         - configuration rework to drop test devices and split the
           different families
         - support for segmented PNVM images and power tables
         - new vendor entries for PPAG (platform antenna gain) feature
      - Qualcomm 802.11ax (ath11k):
         - Multiple Basic Service Set Identifier (MBSSID) and Enhanced
           MBSSID Advertisement (EMA) support in AP mode
         - support factory test mode
      - RealTek (rtw89):
         - add RSSI based antenna diversity
         - support U-NII-4 channels on 5 GHz band
      - RealTek (rtl8xxxu):
         - AP mode support for 8188f
         - support USB RX aggregation for the newer chips"

* tag 'net-next-6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1602 commits)
  net: scm: introduce and use scm_recv_unix helper
  af_unix: Skip SCM_PIDFD if scm->pid is NULL.
  net: lan743x: Simplify comparison
  netlink: Add __sock_i_ino() for __netlink_diag_dump().
  net: dsa: avoid suspicious RCU usage for synced VLAN-aware MAC addresses
  Revert "af_unix: Call scm_recv() only after scm_set_cred()."
  phylink: ReST-ify the phylink_pcs_neg_mode() kdoc
  libceph: Partially revert changes to support MSG_SPLICE_PAGES
  net: phy: mscc: fix packet loss due to RGMII delays
  net: mana: use vmalloc_array and vcalloc
  net: enetc: use vmalloc_array and vcalloc
  ionic: use vmalloc_array and vcalloc
  pds_core: use vmalloc_array and vcalloc
  gve: use vmalloc_array and vcalloc
  octeon_ep: use vmalloc_array and vcalloc
  net: usb: qmi_wwan: add u-blox 0x1312 composition
  perf trace: fix MSG_SPLICE_PAGES build error
  ipvlan: Fix return value of ipvlan_queue_xmit()
  netfilter: nf_tables: fix underflow in chain reference counter
  netfilter: nf_tables: unbind non-anonymous set if rule construction fails
  ...
2023-06-28 16:43:10 -07:00
Linus Torvalds 6a8cbd9253 v6.5-rc1-sysctl-next
The changes queued up for v6.5-rc1 for sysctl are in line with
 prior efforts to stop usage of deprecated routines which incur
 recursion and also make it hard to remove the empty array element
 in each sysctl array declaration. The most difficult user to modify
 was parport which required a bit of re-thinking of how to declare shared
 sysctls there, Joel Granados has stepped up to the plate to do most of
 this work and eventual removal of register_sysctl_table(). That work
 ended up saving us about 1465 bytes according to bloat-o-meter. Since
 we gained a few bloat-o-meter karma points I moved two rather small
 sysctl arrays from kernel/sysctl.c leaving us only two more sysctl
 arrays to move left.
 
 Most changes have been tested on linux-next for about a month. The last
 straggler patches are a minor parport fix, changes to the sysctl
 kernel selftest so to verify correctness and prevent regressions for
 the future change he made to provide an alternative solution for the
 special sysctl mount point target which was using the now deprecated
 sysctl child element.
 
 This is all prep work to now finally be able to remove the empty
 array element in all sysctl declarations / registrations which is
 expected to save us a bit of bytes all over the kernel. That work
 will be tested early after v6.5-rc1 is out.
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCgAwFiEENnNq2KuOejlQLZofziMdCjCSiKcFAmSceh0SHG1jZ3JvZkBr
 ZXJuZWwub3JnAAoJEM4jHQowkoinBFQQAK9WdpcU8ODoDzoSls4jsCQpZUCfZ+ED
 pbCgQqUqu9VPs6bnJ+aXVa6Fh3uCr6+TIfNFM55qI/Sbo2issZ7bm0nvKmGgc6/m
 giqDP7btvHqiAsEootci8DVdbBXKkdH4dx3pSwleyN8pdinewH0hrKImaPpahyo6
 1mB1du0iI89yjsZmheHVVSyfXXYAnP0PqRVy5Y+qxY7yYlIegQ5uAZmwRE62lfTf
 TuiV7OFuDZ2DBYOmqIhfGKGRnfOL5ZVF3iHCrfUpX3p+fEFzDmwvm3vr73PTSrFw
 /aRRLa/hOWr5ilw1bvnMcazgQzFEOlQb3DMhBKH7gLl3XHVrM+TaaqYHjUia1+6Y
 e2axz/duA2q9uLMW81daRApvHMCgy0exkpC7prfOxF5bgTe4TjA7ZWvGpqG1kPKT
 PPSxw80XvG5hLZm4tB0ZWJ5rOfFpiUGGneSeRQwyuClBt73SIO+F03jyGpt83slU
 jFE50ac14Zwh1oxpCQtYoR1+bXWdq1QwM5vQBNEuaoTSnJfVjrXqBz/BnqJChtjr
 m1vA27+4/dfki2P3gVWF1lGx43ir3uJvqk+BjWXm2CDDJqpRi3N0qcUwZwLuqAAz
 /LEgFqK61bpHi/C8c2NWAxIoeWRU4NUOaoiKmZwyt0sKAWU1Yzg70xssYeg7VYqZ
 3pvFNVBqkV+F
 =sXUU
 -----END PGP SIGNATURE-----

Merge tag 'v6.5-rc1-sysctl-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux

Pull sysctl updates from Luis Chamberlain:
 "The changes for sysctl are in line with prior efforts to stop usage of
  deprecated routines which incur recursion and also make it hard to
  remove the empty array element in each sysctl array declaration.

  The most difficult user to modify was parport which required a bit of
  re-thinking of how to declare shared sysctls there, Joel Granados has
  stepped up to the plate to do most of this work and eventual removal
  of register_sysctl_table(). That work ended up saving us about 1465
  bytes according to bloat-o-meter. Since we gained a few bloat-o-meter
  karma points I moved two rather small sysctl arrays from
  kernel/sysctl.c leaving us only two more sysctl arrays to move left.

  Most changes have been tested on linux-next for about a month. The
  last straggler patches are a minor parport fix, changes to the sysctl
  kernel selftest so to verify correctness and prevent regressions for
  the future change he made to provide an alternative solution for the
  special sysctl mount point target which was using the now deprecated
  sysctl child element.

  This is all prep work to now finally be able to remove the empty array
  element in all sysctl declarations / registrations which is expected
  to save us a bit of bytes all over the kernel. That work will be
  tested early after v6.5-rc1 is out"

* tag 'v6.5-rc1-sysctl-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux:
  sysctl: replace child with an enumeration
  sysctl: Remove debugging dump_stack
  test_sysclt: Test for registering a mount point
  test_sysctl: Add an option to prevent test skip
  test_sysctl: Add an unregister sysctl test
  test_sysctl: Group node sysctl test under one func
  test_sysctl: Fix test metadata getters
  parport: plug a sysctl register leak
  sysctl: move security keys sysctl registration to its own file
  sysctl: move umh sysctl registration to its own file
  signal: move show_unhandled_signals sysctl to its own file
  sysctl: remove empty dev table
  sysctl: Remove register_sysctl_table
  sysctl: Refactor base paths registrations
  sysctl: stop exporting register_sysctl_table
  parport: Removed sysctl related defines
  parport: Remove register_sysctl_table from parport_default_proc_register
  parport: Remove register_sysctl_table from parport_device_proc_register
  parport: Remove register_sysctl_table from parport_proc_register
  parport: Move magic number "15" to a define
2023-06-28 16:05:21 -07:00
Paulo Alcantara 5f2a0afa98 smb: client: improve DFS mount check
Some servers may return error codes from REQ_GET_DFS_REFERRAL requests
that are unexpected by the client, so to make it easier, assume
non-DFS mounts when the client can't get the initial DFS referral of
@ctx->UNC in dfs_mount_share().

Signed-off-by: Paulo Alcantara (SUSE) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-06-28 15:49:02 -05:00
Paulo Alcantara 3ae872de41 smb: client: fix shared DFS root mounts with different prefixes
When having two DFS root mounts that are connected to same namespace,
same mount options but different prefix paths, we can't really use the
shared @server->origin_fullpath when chasing DFS links in them.

Move the origin_fullpath field to cifs_tcon structure so when having
shared DFS root mounts with different prefix paths, and we need to
chase any DFS links, dfs_get_automount_devname() will pick up the
correct full path out of the @tcon that will be used for the new
mount.

Before patch

  mount.cifs //dom/dfs/dir /mnt/1 -o ...
  mount.cifs //dom/dfs /mnt/2 -o ...
  # shared server, ses, tcon
  # server: origin_fullpath=//dom/dfs/dir

  # @server->origin_fullpath + '/dir/link1'
  $ ls /mnt/2/dir/link1
  ls: cannot open directory '/mnt/2/dir/link1': No such file or directory

After patch

  mount.cifs //dom/dfs/dir /mnt/1 -o ...
  mount.cifs //dom/dfs /mnt/2 -o ...
  # shared server & ses
  # tcon_1: origin_fullpath=//dom/dfs/dir
  # tcon_2: origin_fullpath=//dom/dfs

  # @tcon_2->origin_fullpath + '/dir/link1'
  $ ls /mnt/2/dir/link1
  dir0  dir1  dir10  dir3  dir5  dir6  dir7  dir9  target2_file.txt  tsub

Fixes: 8e3554150d ("cifs: fix sharing of DFS connections")
Signed-off-by: Paulo Alcantara (SUSE) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-06-28 15:48:43 -05:00
Linus Torvalds 77b1a7f7a0 - Arnd Bergmann has fixed a bunch of -Wmissing-prototypes in
top-level directories.
 
 - Douglas Anderson has added a new "buddy" mode to the hardlockup
   detector.  It permits the detector to work on architectures which
   cannot provide the required interrupts, by having CPUs periodically
   perform checks on other CPUs.
 
 - Zhen Lei has enhanced kexec's ability to support two crash regions.
 
 - Petr Mladek has done a lot of cleanup on the hard lockup detector's
   Kconfig entries.
 
 - And the usual bunch of singleton patches in various places.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZJelTAAKCRDdBJ7gKXxA
 juDkAP0VXWynzkXoojdS/8e/hhi+htedmQ3v2dLZD+vBrctLhAEA7rcH58zAVoWa
 2ejqO6wDrRGUC7JQcO9VEjT0nv73UwU=
 =F293
 -----END PGP SIGNATURE-----

Merge tag 'mm-nonmm-stable-2023-06-24-19-23' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull non-mm updates from Andrew Morton:

 - Arnd Bergmann has fixed a bunch of -Wmissing-prototypes in top-level
   directories

 - Douglas Anderson has added a new "buddy" mode to the hardlockup
   detector. It permits the detector to work on architectures which
   cannot provide the required interrupts, by having CPUs periodically
   perform checks on other CPUs

 - Zhen Lei has enhanced kexec's ability to support two crash regions

 - Petr Mladek has done a lot of cleanup on the hard lockup detector's
   Kconfig entries

 - And the usual bunch of singleton patches in various places

* tag 'mm-nonmm-stable-2023-06-24-19-23' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (72 commits)
  kernel/time/posix-stubs.c: remove duplicated include
  ocfs2: remove redundant assignment to variable bit_off
  watchdog/hardlockup: fix typo in config HARDLOCKUP_DETECTOR_PREFER_BUDDY
  powerpc: move arch_trigger_cpumask_backtrace from nmi.h to irq.h
  devres: show which resource was invalid in __devm_ioremap_resource()
  watchdog/hardlockup: define HARDLOCKUP_DETECTOR_ARCH
  watchdog/sparc64: define HARDLOCKUP_DETECTOR_SPARC64
  watchdog/hardlockup: make HAVE_NMI_WATCHDOG sparc64-specific
  watchdog/hardlockup: declare arch_touch_nmi_watchdog() only in linux/nmi.h
  watchdog/hardlockup: make the config checks more straightforward
  watchdog/hardlockup: sort hardlockup detector related config values a logical way
  watchdog/hardlockup: move SMP barriers from common code to buddy code
  watchdog/buddy: simplify the dependency for HARDLOCKUP_DETECTOR_PREFER_BUDDY
  watchdog/buddy: don't copy the cpumask in watchdog_next_cpu()
  watchdog/buddy: cleanup how watchdog_buddy_check_hardlockup() is called
  watchdog/hardlockup: remove softlockup comment in touch_nmi_watchdog()
  watchdog/hardlockup: in watchdog_hardlockup_check() use cpumask_copy()
  watchdog/hardlockup: don't use raw_cpu_ptr() in watchdog_hardlockup_kick()
  watchdog/hardlockup: HAVE_NMI_WATCHDOG must implement watchdog_hardlockup_probe()
  watchdog/hardlockup: keep kernel.nmi_watchdog sysctl as 0444 if probe fails
  ...
2023-06-28 10:59:38 -07:00
Linus Torvalds 6e17c6de3d - Yosry Ahmed brought back some cgroup v1 stats in OOM logs.
- Yosry has also eliminated cgroup's atomic rstat flushing.
 
 - Nhat Pham adds the new cachestat() syscall.  It provides userspace
   with the ability to query pagecache status - a similar concept to
   mincore() but more powerful and with improved usability.
 
 - Mel Gorman provides more optimizations for compaction, reducing the
   prevalence of page rescanning.
 
 - Lorenzo Stoakes has done some maintanance work on the get_user_pages()
   interface.
 
 - Liam Howlett continues with cleanups and maintenance work to the maple
   tree code.  Peng Zhang also does some work on maple tree.
 
 - Johannes Weiner has done some cleanup work on the compaction code.
 
 - David Hildenbrand has contributed additional selftests for
   get_user_pages().
 
 - Thomas Gleixner has contributed some maintenance and optimization work
   for the vmalloc code.
 
 - Baolin Wang has provided some compaction cleanups,
 
 - SeongJae Park continues maintenance work on the DAMON code.
 
 - Huang Ying has done some maintenance on the swap code's usage of
   device refcounting.
 
 - Christoph Hellwig has some cleanups for the filemap/directio code.
 
 - Ryan Roberts provides two patch series which yield some
   rationalization of the kernel's access to pte entries - use the provided
   APIs rather than open-coding accesses.
 
 - Lorenzo Stoakes has some fixes to the interaction between pagecache
   and directio access to file mappings.
 
 - John Hubbard has a series of fixes to the MM selftesting code.
 
 - ZhangPeng continues the folio conversion campaign.
 
 - Hugh Dickins has been working on the pagetable handling code, mainly
   with a view to reducing the load on the mmap_lock.
 
 - Catalin Marinas has reduced the arm64 kmalloc() minimum alignment from
   128 to 8.
 
 - Domenico Cerasuolo has improved the zswap reclaim mechanism by
   reorganizing the LRU management.
 
 - Matthew Wilcox provides some fixups to make gfs2 work better with the
   buffer_head code.
 
 - Vishal Moola also has done some folio conversion work.
 
 - Matthew Wilcox has removed the remnants of the pagevec code - their
   functionality is migrated over to struct folio_batch.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZJejewAKCRDdBJ7gKXxA
 joggAPwKMfT9lvDBEUnJagY7dbDPky1cSYZdJKxxM2cApGa42gEA6Cl8HRAWqSOh
 J0qXCzqaaN8+BuEyLGDVPaXur9KirwY=
 =B7yQ
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2023-06-24-19-15' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull mm updates from Andrew Morton:

 - Yosry Ahmed brought back some cgroup v1 stats in OOM logs

 - Yosry has also eliminated cgroup's atomic rstat flushing

 - Nhat Pham adds the new cachestat() syscall. It provides userspace
   with the ability to query pagecache status - a similar concept to
   mincore() but more powerful and with improved usability

 - Mel Gorman provides more optimizations for compaction, reducing the
   prevalence of page rescanning

 - Lorenzo Stoakes has done some maintanance work on the
   get_user_pages() interface

 - Liam Howlett continues with cleanups and maintenance work to the
   maple tree code. Peng Zhang also does some work on maple tree

 - Johannes Weiner has done some cleanup work on the compaction code

 - David Hildenbrand has contributed additional selftests for
   get_user_pages()

 - Thomas Gleixner has contributed some maintenance and optimization
   work for the vmalloc code

 - Baolin Wang has provided some compaction cleanups,

 - SeongJae Park continues maintenance work on the DAMON code

 - Huang Ying has done some maintenance on the swap code's usage of
   device refcounting

 - Christoph Hellwig has some cleanups for the filemap/directio code

 - Ryan Roberts provides two patch series which yield some
   rationalization of the kernel's access to pte entries - use the
   provided APIs rather than open-coding accesses

 - Lorenzo Stoakes has some fixes to the interaction between pagecache
   and directio access to file mappings

 - John Hubbard has a series of fixes to the MM selftesting code

 - ZhangPeng continues the folio conversion campaign

 - Hugh Dickins has been working on the pagetable handling code, mainly
   with a view to reducing the load on the mmap_lock

 - Catalin Marinas has reduced the arm64 kmalloc() minimum alignment
   from 128 to 8

 - Domenico Cerasuolo has improved the zswap reclaim mechanism by
   reorganizing the LRU management

 - Matthew Wilcox provides some fixups to make gfs2 work better with the
   buffer_head code

 - Vishal Moola also has done some folio conversion work

 - Matthew Wilcox has removed the remnants of the pagevec code - their
   functionality is migrated over to struct folio_batch

* tag 'mm-stable-2023-06-24-19-15' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (380 commits)
  mm/hugetlb: remove hugetlb_set_page_subpool()
  mm: nommu: correct the range of mmap_sem_read_lock in task_mem()
  hugetlb: revert use of page_cache_next_miss()
  Revert "page cache: fix page_cache_next/prev_miss off by one"
  mm/vmscan: fix root proactive reclaim unthrottling unbalanced node
  mm: memcg: rename and document global_reclaim()
  mm: kill [add|del]_page_to_lru_list()
  mm: compaction: convert to use a folio in isolate_migratepages_block()
  mm: zswap: fix double invalidate with exclusive loads
  mm: remove unnecessary pagevec includes
  mm: remove references to pagevec
  mm: rename invalidate_mapping_pagevec to mapping_try_invalidate
  mm: remove struct pagevec
  net: convert sunrpc from pagevec to folio_batch
  i915: convert i915_gpu_error to use a folio_batch
  pagevec: rename fbatch_count()
  mm: remove check_move_unevictable_pages()
  drm: convert drm_gem_put_pages() to use a folio_batch
  i915: convert shmem_sg_free_table() to use a folio_batch
  scatterlist: add sg_set_folio()
  ...
2023-06-28 10:28:11 -07:00
Paulo Alcantara 49024ec879 smb: client: fix parsing of source mount option
Handle trailing and leading separators when parsing UNC and prefix
paths in smb3_parse_devname().  Then, store the sanitised paths in
smb3_fs_context::source.

This fixes the following cases

$ mount //srv/share// /mnt/1 -o ...
$ cat /mnt/1/d0/f0
cat: /mnt/1/d0/f0: Invalid argument

The -EINVAL was returned because the client sent SMB2_CREATE "\\d0\f0"
rather than SMB2_CREATE "\d0\f0".

$ mount //srv//share /mnt/1 -o ...
mount: Invalid argument

The -EINVAL was returned correctly although the client only realised
it after sending a couple of bad requests rather than bailing out
earlier when parsing mount options.

Signed-off-by: Paulo Alcantara (SUSE) <pc@manguebit.com>
Cc: stable@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-06-28 11:48:56 -05:00
Paulo Alcantara d439b29057 smb: client: fix broken file attrs with nodfs mounts
*_get_inode_info() functions expect -EREMOTE when query path info
calls find a DFS link, regardless whether !CONFIG_CIFS_DFS_UPCALL or
'nodfs' mount option.  Otherwise, those files will miss the fake DFS
file attributes.

Before patch

  $ mount.cifs //srv/dfs /mnt/1 -o ...,nodfs
  $ ls -l /mnt/1
  ls: cannot access '/mnt/1/link': Operation not supported
  total 0
  -rwxr-xr-x 1 root root 0 Jul 26  2022 dfstest2_file1.txt
  drwxr-xr-x 2 root root 0 Aug  8  2022 dir1
  d????????? ? ?    ?    ?            ? link

After patch

  $ mount.cifs //srv/dfs /mnt/1 -o ...,nodfs
  $ ls -l /mnt/1
  total 0
  -rwxr-xr-x 1 root root 0 Jul 26  2022 dfstest2_file1.txt
  drwxr-xr-x 2 root root 0 Aug  8  2022 dir1
  drwx--x--x 2 root root 0 Jun 26 20:29 link

Fixes: c877ce47e1 ("cifs: reduce roundtrips on create/qinfo requests")
Signed-off-by: Paulo Alcantara (SUSE) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-06-28 11:47:53 -05:00
Shyam Prasad N 380958ac4f cifs: print client_guid in DebugData
Having the ClientGUID info makes it easier to debug
issues related to a client on a server that serves a
number of clients.

This change prints the ClientGUID in DebugData.

Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Acked-by: Tom Talpey <tom@talpey.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-06-28 11:47:50 -05:00
Winston Wen 66be5c48ee cifs: fix session state check in smb2_find_smb_ses
Chech the session state and skip it if it's exiting.

Signed-off-by: Winston Wen <wentao@uniontech.com>
Reviewed-by: Shyam Prasad N <sprasad@microsoft.com>
Cc: stable@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-06-28 11:47:46 -05:00
Winston Wen 99f280700b cifs: fix session state check in reconnect to avoid use-after-free issue
Don't collect exiting session in smb2_reconnect_server(), because it
will be released soon.

Note that the exiting session will stay in server->smb_ses_list until
it complete the cifs_free_ipc() and logoff() and then delete itself
from the list.

Signed-off-by: Winston Wen <wentao@uniontech.com>
Reviewed-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-06-28 11:47:43 -05:00
Shyam Prasad N 326a8d04f1 cifs: do all necessary checks for credits within or before locking
All the server credits and in-flight info is protected by req_lock.
Once the req_lock is held, and we've determined that we have enough
credits to continue, this lock cannot be dropped till we've made the
changes to credits and in-flight count.

However, we used to drop the lock in order to avoid deadlock with
the recent srv_lock. This could cause the checks already made to be
invalidated.

Fixed it by moving the server status check to before locking req_lock.

Fixes: d7d7a66aac ("cifs: avoid use of global locks for high contention data")
Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-06-28 11:46:58 -05:00
Shyam Prasad N 33f736187d cifs: prevent use-after-free by freeing the cfile later
In smb2_compound_op we have a possible use-after-free
which can cause hard to debug problems later on.

This was revealed during stress testing with KASAN enabled
kernel. Fixing it by moving the cfile free call to
a few lines below, after the usage.

Fixes: 76894f3e2f ("cifs: improve symlink handling for smb2+")
Reviewed-by: Paulo Alcantara (SUSE) <pc@manguebit.com>
Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-06-28 11:46:32 -05:00
Linus Torvalds 582c161cf3 hardening updates for v6.5-rc1
- Fix KMSAN vs FORTIFY in strlcpy/strlcat (Alexander Potapenko)
 
 - Convert strreplace() to return string start (Andy Shevchenko)
 
 - Flexible array conversions (Arnd Bergmann, Wyes Karny, Kees Cook)
 
 - Add missing function prototypes seen with W=1 (Arnd Bergmann)
 
 - Fix strscpy() kerndoc typo (Arne Welzel)
 
 - Replace strlcpy() with strscpy() across many subsystems which were
   either Acked by respective maintainers or were trivial changes that
   went ignored for multiple weeks (Azeem Shaikh)
 
 - Remove unneeded cc-option test for UBSAN_TRAP (Nick Desaulniers)
 
 - Add KUnit tests for strcat()-family
 
 - Enable KUnit tests of FORTIFY wrappers under UML
 
 - Add more complete FORTIFY protections for strlcat()
 
 - Add missed disabling of FORTIFY for all arch purgatories.
 
 - Enable -fstrict-flex-arrays=3 globally
 
 - Tightening UBSAN_BOUNDS when using GCC
 
 - Improve checkpatch to check for strcpy, strncpy, and fake flex arrays
 
 - Improve use of const variables in FORTIFY
 
 - Add requested struct_size_t() helper for types not pointers
 
 - Add __counted_by macro for annotating flexible array size members
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAmSbftQWHGtlZXNjb29r
 QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJj0MD/9X9jzJzCmsAU+yNldeoAzC84Sk
 GVU3RBxGcTNysL1gZXynkIgigw7DWc4htMGeSABHHwQRVP65JCH1Kw/VqIkyumbx
 9LdX6IklMJb4pRT4PVU3azebV4eNmSjlur2UxMeW54Czm91/6I8RHbJOyAPnOUmo
 2oomGdP/hpEHtKR7hgy8Axc6w5ySwQixh2V5sVZG3VbvCS5WKTmTXbs6puuRT5hz
 iHt7v+7VtEg/Qf1W7J2oxfoghvVBsaRrSLrExWT/oZYh1ZxM7DsCAAoG/IsDgHGA
 9LBXiRECgAFThbHVxLvvKZQMXdVk0i8iXLX43XMKC0wTA+NTyH7wlcQQ4RWNMuo8
 sfA9Qm9gMArXaf64aymr3Uwn20Zan0391HdlbhOJZAE6v3PPJbleUnM58AzD2d3r
 5Lz6AIFBxDImy+3f9iDWgacCT5/PkeiXTHzk9QnKhJyKKtRA58XJxj4q2+rPnGJP
 n4haXqoxD5FJbxdXiGKk31RS0U5HBug7wkOcUrTqDHUbc/QNU2b7dxTKUx+zYtCU
 uV5emPzpF4H4z+91WpO47n9gkMAfwV0lt9S2dwS8pxsgqctbmIan+Jgip7rsqZ2G
 OgLXBsb43eEs+6WgO8tVt/ZHYj9ivGMdrcNcsIfikzNs/xweUJ53k2xSEn2xEa5J
 cwANDmkL6QQK7yfeeg==
 =s0j1
 -----END PGP SIGNATURE-----

Merge tag 'hardening-v6.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull hardening updates from Kees Cook:
 "There are three areas of note:

  A bunch of strlcpy()->strscpy() conversions ended up living in my tree
  since they were either Acked by maintainers for me to carry, or got
  ignored for multiple weeks (and were trivial changes).

  The compiler option '-fstrict-flex-arrays=3' has been enabled
  globally, and has been in -next for the entire devel cycle. This
  changes compiler diagnostics (though mainly just -Warray-bounds which
  is disabled) and potential UBSAN_BOUNDS and FORTIFY _warning_
  coverage. In other words, there are no new restrictions, just
  potentially new warnings. Any new FORTIFY warnings we've seen have
  been fixed (usually in their respective subsystem trees). For more
  details, see commit df8fc4e934.

  The under-development compiler attribute __counted_by has been added
  so that we can start annotating flexible array members with their
  associated structure member that tracks the count of flexible array
  elements at run-time. It is possible (likely?) that the exact syntax
  of the attribute will change before it is finalized, but GCC and Clang
  are working together to sort it out. Any changes can be made to the
  macro while we continue to add annotations.

  As an example of that last case, I have a treewide commit waiting with
  such annotations found via Coccinelle:

    https://git.kernel.org/linus/adc5b3cb48a049563dc673f348eab7b6beba8a9b

  Also see commit dd06e72e68 for more details.

  Summary:

   - Fix KMSAN vs FORTIFY in strlcpy/strlcat (Alexander Potapenko)

   - Convert strreplace() to return string start (Andy Shevchenko)

   - Flexible array conversions (Arnd Bergmann, Wyes Karny, Kees Cook)

   - Add missing function prototypes seen with W=1 (Arnd Bergmann)

   - Fix strscpy() kerndoc typo (Arne Welzel)

   - Replace strlcpy() with strscpy() across many subsystems which were
     either Acked by respective maintainers or were trivial changes that
     went ignored for multiple weeks (Azeem Shaikh)

   - Remove unneeded cc-option test for UBSAN_TRAP (Nick Desaulniers)

   - Add KUnit tests for strcat()-family

   - Enable KUnit tests of FORTIFY wrappers under UML

   - Add more complete FORTIFY protections for strlcat()

   - Add missed disabling of FORTIFY for all arch purgatories.

   - Enable -fstrict-flex-arrays=3 globally

   - Tightening UBSAN_BOUNDS when using GCC

   - Improve checkpatch to check for strcpy, strncpy, and fake flex
     arrays

   - Improve use of const variables in FORTIFY

   - Add requested struct_size_t() helper for types not pointers

   - Add __counted_by macro for annotating flexible array size members"

* tag 'hardening-v6.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: (54 commits)
  netfilter: ipset: Replace strlcpy with strscpy
  uml: Replace strlcpy with strscpy
  um: Use HOST_DIR for mrproper
  kallsyms: Replace all non-returning strlcpy with strscpy
  sh: Replace all non-returning strlcpy with strscpy
  of/flattree: Replace all non-returning strlcpy with strscpy
  sparc64: Replace all non-returning strlcpy with strscpy
  Hexagon: Replace all non-returning strlcpy with strscpy
  kobject: Use return value of strreplace()
  lib/string_helpers: Change returned value of the strreplace()
  jbd2: Avoid printing outside the boundary of the buffer
  checkpatch: Check for 0-length and 1-element arrays
  riscv/purgatory: Do not use fortified string functions
  s390/purgatory: Do not use fortified string functions
  x86/purgatory: Do not use fortified string functions
  acpi: Replace struct acpi_table_slit 1-element array with flex-array
  clocksource: Replace all non-returning strlcpy with strscpy
  string: use __builtin_memcpy() in strlcpy/strlcat
  staging: most: Replace all non-returning strlcpy with strscpy
  drm/i2c: tda998x: Replace all non-returning strlcpy with strscpy
  ...
2023-06-27 21:24:18 -07:00
Linus Torvalds 8ad78685ff pstore updates for v6.5-rc1
- Check for out-of-memory condition (Jiasheng Jiang)
 
 - Convert to platform remove callback returning void (Uwe Kleine-König)
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAmSbdsYWHGtlZXNjb29r
 QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJrMrD/wNH1XD1sIb9p6x45hdhoXsz21u
 v5SKptYz0gLPZFbP6O67G24f1jHi2GC498sbbdaETYbJjhqEvdCJ4W1ZccjOPJbS
 pRRejpQ7FLCYUyG107q4C6QtTl0F5Ln1H0PY9e8X47zO9T2l71Qpm9GVXs1CH0JQ
 nuOP+5an6/MXNJd7Xo2p6MoRQJdWx9VSgktRahaQKjGhtTOy1JQWoxDPj0tAa/fw
 Sdcav7QWs+qYW1/X0sOEYkWtAOZLQ4XiHDNORZjcj8wtddT7N8GP2FI29NPxBXzw
 JRnmoqWeVwfRSm90Fn7vMggk+P5/fW/mlXxUp8kWWtlw5aA5oMK28M1C/bfFarxG
 73z630j0gxP+Hhv0/sV/IIcJxpQMD1yl1iE3OOhUDekrjm2OhFKBO15mBAKqWhdR
 ovq2Ho0PEBLfDb1DHC8uiaSlDlhBIDf4bL1NkOdQqVVcYvHtvrh1WYwumcRUsFcM
 6eQS92JeODXtN89aIDVpamQb6bd/0n2Tx/fxW8o81ZisquvInIj+bAJ+GkJkM8nn
 yXXVfPxw6lbHrEhRTxV+U5AYwWZHLRWz5efxp5uCBy7UNIxop14sI+Pm1+pom195
 YJKrK2Jat+9PTn7x7nFMcFescMXMKJyR4CDDox3Jmuz5YTWJ8X8vz0L2wqDFXZm1
 KlCZ4YXgaWvuKMMyFg==
 =ep0z
 -----END PGP SIGNATURE-----

Merge tag 'pstore-v6.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull pstore updates from Kees Cook:

 - Check for out-of-memory condition (Jiasheng Jiang)

 - Convert to platform remove callback returning void (Uwe Kleine-König)

* tag 'pstore-v6.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  pstore/ram: Add check for kstrdup
  pstore/ram: Convert to platform remove callback returning void
2023-06-27 21:21:32 -07:00
Linus Torvalds d416a46c95 execve updates for v6.5-rc1
- Fix a few comments for correctness and typos (Baruch Siach)
 
 - Small simplifications for binfmt (Christophe JAILLET)
 
 - Set p_align to 4 for PT_NOTE in core dump (Fangrui Song)
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAmSbc94WHGtlZXNjb29r
 QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJkWaEACb5idNLPHL5lb7zr1HGXWVnpo9
 3a4gwNpZpDpODRSL5VQ93WUwzN+74hN15h85orXbjQRZRXJVM4gNmgHargqs4LIb
 LrbB04ylCoAnz+MYPAPkD0Yw4WZXqwK33V11V8luUf9q+zFemjQ0rOzFDmrX5a+y
 qxq8CvY/jETmi5GPzKn35w8gJs8X++8GPIQ0ZDeQzYGZsZZ4m0+f1tqsC9bOmn/F
 LRn0ePaSTrYPQILQ2xVjlCv9HHk8MUsJu7+eyOI0NemWlITookofkycjMDe+9LpS
 hRKhdNbni24BEN2eDdJNC5TeXlvAOEQv4n4GFZ37sUQGdDhjdQTYF9sySluX5+Bo
 fx0qCQyWJMoskYBbBTAERv9hBxZctU94k82XaUxZA1bx0f9h+0USqwL9YrOnF5p/
 4734FUjeotGd8uFpUe/B6/SjRQ6WwYzwoEME/5Q/EKuotLrk3SfLu3fH/rl3cGH8
 mlt4vwD0xDfFgb4Pj5wf8lqxpK+mMCusExqvzKVVV4L7Q5gDIntA2BvSRykjQ7VD
 vwRaQDM9mnDD/t0FgDVVHl91UF5Fsctf81UvfoI5eIMKqFo3NRGCV8znFf8G8//i
 8Qmjj4kpxN+zMOi4nKsTjNDFmauRmOrGaDrWZnyO3m/VlYrYodnCKJbdLsqWH4zx
 a2O3oVICYLUeM0NEkQ==
 =01FE
 -----END PGP SIGNATURE-----

Merge tag 'execve-v6.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull execve updates from Kees Cook:

 - Fix a few comments for correctness and typos (Baruch Siach)

 - Small simplifications for binfmt (Christophe JAILLET)

 - Set p_align to 4 for PT_NOTE in core dump (Fangrui Song)

* tag 'execve-v6.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  binfmt_elf: fix comment typo s/reset/regset/
  elf: correct note name comment
  binfmt: Slightly simplify elf_fdpic_map_file()
  binfmt: Use struct_size()
  coredump, vmcore: Set p_align to 4 for PT_NOTE
2023-06-27 21:12:41 -07:00
Linus Torvalds 21953eb16c lsm/stable-6.5 PR 20230626
-----BEGIN PGP SIGNATURE-----
 
 iQJIBAABCAAyFiEES0KozwfymdVUl37v6iDy2pc3iXMFAmSZuh0UHHBhdWxAcGF1
 bC1tb29yZS5jb20ACgkQ6iDy2pc3iXNpuxAAxChGqME9nE7iITx1TaFRrbK49mDF
 1RZh/5cwzde72lLLFkTFKB6ErMSQkrrtA+jFH7vKsrOslBel1+yO80vkXmhYCeZU
 P3m0FeREUpuU4QV0tbQamPeR+SWohmKi2dYWd+VdpLA+1aTK3KNYsi2NFkDIreap
 BqeRq4S0Rqc4u3/5juk6JCGFhTRWaH16YJQrzIKHF/K3DK+gMhAY5sjuAWzFc6ma
 /5bbD55kdVVDfnsxNSe+lzJ7zEf7TYedLG6BN+R9cVrU+El12a38M29kASaAof5w
 vpb92a27hA9Q5EyQ2O9QXnr2L5CShT4bvAZCGkK4cmZerGNTdM0iojhYj1s7FAV/
 USkWgkDmEuSatp0+DdXlfQyUmZZWlw1W0oiEfZwR8w7TY7q9CU7aD8K7+GDSIazB
 g89nYznVjlaC/oA4/owMraoWP3eiDiAcsQdO052Vv63TVyJtTiRiKyBq5EFLrX8L
 iaUCa4cBaYFc94kN1PZeNXZKwqRc2F6oAFT1YuXnFWBGmixN0kUL023C0xjl/J7P
 02jYYSVzLm22aU39GU0DSnaLfAwl3muazOB3XuyGOhUWHFYzjkc9UhmGp0W50DkK
 qigW3ONA8s8CKUS/q7QSGq+Vf+CVZA5f+daDDPGYstPfCTk61eu0wjwfwek3W0o+
 xKzBr2Od3vTOzAs=
 =3nWy
 -----END PGP SIGNATURE-----

Merge tag 'lsm-pr-20230626' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm

Pull lsm updates from Paul Moore:

 - A SafeSetID patch to correct what appears to be a cut-n-paste typo in
   the code causing a UID to be printed where a GID was desired.

   This is coming via the LSM tree because we haven't been able to get a
   response from the SafeSetID maintainer (Micah Morton) in several
   months. Hopefully we are able to get in touch with Micah, but until
   we do I'm going to pick them up in the LSM tree.

 - A small fix to the reiserfs LSM xattr code.

   We're continuing to work through some issues with the reiserfs code
   as we try to fixup the LSM xattr handling, but in the process we're
   uncovering some ugly problems in reiserfs and we may just end up
   removing the LSM xattr support in reiserfs prior to reiserfs'
   removal.

   For better or worse, this shouldn't impact any of the reiserfs users,
   as we discovered that LSM xattrs on reiserfs were completely broken,
   meaning no one is currently using the combo of reiserfs and a file
   labeling LSM.

 - A tweak to how the cap_user_data_t struct/typedef is declared in the
   header file to appease the Sparse gods.

 - In the process of trying to sort out the SafeSetID lost-maintainer
   problem I realized that I needed to update the labeled networking
   entry to "Supported".

 - Minor comment/documentation and spelling fixes.

* tag 'lsm-pr-20230626' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm:
  device_cgroup: Fix kernel-doc warnings in device_cgroup
  SafeSetID: fix UID printed instead of GID
  MAINTAINERS: move labeled networking to "supported"
  capability: erase checker warnings about struct __user_cap_data_struct
  lsm: fix a number of misspellings
  reiserfs: Initialize sec->length in reiserfs_security_init().
  capability: fix kernel-doc warnings in capability.c
2023-06-27 17:24:26 -07:00
Linus Torvalds 26642864f8 Landlock updates for v6.5-rc1
-----BEGIN PGP SIGNATURE-----
 
 iIYEABYIAC4WIQSVyBthFV4iTW/VU1/l49DojIL20gUCZJlO/xAcbWljQGRpZ2lr
 b2QubmV0AAoJEOXj0OiMgvbS/FwA/A5GFGtKnLzFpIAXgc1G3kr8c+J/WF7RVUD+
 PJNEsH6PAP41l3BTqVSeVZ+tdLSC7NbesoX0MTd+rCgAWnr+Pko2Cw==
 =KqpG
 -----END PGP SIGNATURE-----

Merge tag 'landlock-6.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mic/linux

Pull landlock updates from Mickaël Salaün:
 "Add support for Landlock to UML.

  To do this, this fixes the way hostfs manages inodes according to the
  underlying filesystem [1]. They are now properly handled as for other
  filesystems, which enables Landlock support (and probably other
  features).

  This also extends Landlock's tests with 6 pseudo filesystems,
  including hostfs"

[1] https://lore.kernel.org/all/20230612191430.339153-1-mic@digikod.net/

* tag 'landlock-6.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mic/linux:
  selftests/landlock: Add hostfs tests
  selftests/landlock: Add tests for pseudo filesystems
  selftests/landlock: Make mounts configurable
  selftests/landlock: Add supports_filesystem() helper
  selftests/landlock: Don't create useless file layouts
  hostfs: Fix ephemeral inodes
2023-06-27 17:10:27 -07:00
Linus Torvalds 8d7071af89 mm: always expand the stack with the mmap write lock held
This finishes the job of always holding the mmap write lock when
extending the user stack vma, and removes the 'write_locked' argument
from the vm helper functions again.

For some cases, we just avoid expanding the stack at all: drivers and
page pinning really shouldn't be extending any stacks.  Let's see if any
strange users really wanted that.

It's worth noting that architectures that weren't converted to the new
lock_mm_and_find_vma() helper function are left using the legacy
"expand_stack()" function, but it has been changed to drop the mmap_lock
and take it for writing while expanding the vma.  This makes it fairly
straightforward to convert the remaining architectures.

As a result of dropping and re-taking the lock, the calling conventions
for this function have also changed, since the old vma may no longer be
valid.  So it will now return the new vma if successful, and NULL - and
the lock dropped - if the area could not be extended.

Tested-by: Vegard Nossum <vegard.nossum@oracle.com>
Tested-by: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> # ia64
Tested-by: Frank Scheiner <frank.scheiner@web.de> # ia64
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-06-27 09:41:30 -07:00
Tavian Barnes d7dbed457c nfsd: Fix creation time serialization order
In nfsd4_encode_fattr(), TIME_CREATE was being written out after all
other times.  However, they should be written out in an order that
matches the bit flags in bmval1, which in this case are

    #define FATTR4_WORD1_TIME_ACCESS        (1UL << 15)
    #define FATTR4_WORD1_TIME_CREATE        (1UL << 18)
    #define FATTR4_WORD1_TIME_DELTA         (1UL << 19)
    #define FATTR4_WORD1_TIME_METADATA      (1UL << 20)
    #define FATTR4_WORD1_TIME_MODIFY        (1UL << 21)

so TIME_CREATE should come second.

I noticed this on a FreeBSD NFSv4.2 client, which supports creation
times.  On this client, file times were weirdly permuted.  With this
patch applied on the server, times looked normal on the client.

Fixes: e377a3e698 ("nfsd: Add support for the birth time attribute")
Link: https://unix.stackexchange.com/q/749605/56202
Signed-off-by: Tavian Barnes <tavianator@tavianator.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-06-27 12:10:47 -04:00
Theodore Ts'o 2ef6c32a91 ext4: avoid updating the superblock on a r/o mount if not needed
This was noticed by a user who noticied that the mtime of a file
backing a loopback device was getting bumped when the loopback device
is mounted read/only.  Note: This doesn't show up when doing a
loopback mount of a file directly, via "mount -o ro /tmp/foo.img
/mnt", since the loop device is set read-only when mount automatically
creates loop device.  However, this is noticeable for a LUKS loop
device like this:

% cryptsetup luksOpen /tmp/foo.img test
% mount -o ro /dev/loop0 /mnt ; umount /mnt

or, if LUKS is not in use, if the user manually creates the loop
device like this:

% losetup /dev/loop0 /tmp/foo.img
% mount -o ro /dev/loop0 /mnt ; umount /mnt

The modified mtime causes rsync to do a rolling checksum scan of the
file on the local and remote side, incrementally increasing the time
to rsync the not-modified-but-touched image file.

Fixes: eee00237fa ("ext4: commit super block if fs record error when journal record without error")
Cc: stable@kernel.org
Link: https://lore.kernel.org/r/ZIauBR7YiV3rVAHL@glitch
Reported-by: Sean Greenslade <sean@seangreenslade.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-06-26 19:36:45 -04:00
Zhang Yi 31464ab01f jbd2: skip reading super block if it has been verified
We got a NULL pointer dereference issue below while running generic/475
I/O failure pressure test.

 BUG: kernel NULL pointer dereference, address: 0000000000000000
 #PF: supervisor write access in kernel mode
 #PF: error_code(0x0002) - not-present page
 PGD 0 P4D 0
 Oops: 0002 [#1] PREEMPT SMP PTI
 CPU: 1 PID: 15600 Comm: fsstress Not tainted 6.4.0-rc5-xfstests-00055-gd3ab1bca26b4 #190
 RIP: 0010:jbd2_journal_set_features+0x13d/0x430
 ...
 Call Trace:
  <TASK>
  ? __die+0x23/0x60
  ? page_fault_oops+0xa4/0x170
  ? exc_page_fault+0x67/0x170
  ? asm_exc_page_fault+0x26/0x30
  ? jbd2_journal_set_features+0x13d/0x430
  jbd2_journal_revoke+0x47/0x1e0
  __ext4_forget+0xc3/0x1b0
  ext4_free_blocks+0x214/0x2f0
  ext4_free_branches+0xeb/0x270
  ext4_ind_truncate+0x2bf/0x320
  ext4_truncate+0x1e4/0x490
  ext4_handle_inode_extension+0x1bd/0x2a0
  ? iomap_dio_complete+0xaf/0x1d0

The root cause is the journal super block had been failed to write out
due to I/O fault injection, it's uptodate bit was cleared by
end_buffer_write_sync() and didn't reset yet in jbd2_write_superblock().
And it raced by journal_get_superblock()->bh_read(), unfortunately, the
read IO is also failed, so the error handling in
journal_fail_superblock() unexpectedly clear the journal->j_sb_buffer,
finally lead to above NULL pointer dereference issue.

If the journal super block had been read and verified, there is no need
to call bh_read() read it again even if it has been failed to written
out. So the fix could be simply move buffer_verified(bh) in front of
bh_read(). Also remove a stale comment left in
jbd2_journal_check_used_features().

Fixes: 51bacdba23d8 ("jbd2: factor out journal initialization from journal_get_superblock()")
Reported-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230616015547.3155195-1-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-06-26 19:36:45 -04:00
Chao Yu c4d13222af ext4: fix to check return value of freeze_bdev() in ext4_shutdown()
freeze_bdev() can fail due to a lot of reasons, it needs to check its
reason before later process.

Fixes: 783d948544 ("ext4: add EXT4_IOC_GOINGDOWN ioctl")
Cc: stable@kernel.org
Signed-off-by: Chao Yu <chao@kernel.org>
Link: https://lore.kernel.org/r/20230606073203.1310389-1-chao@kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-06-26 19:36:44 -04:00
Baokun Li f3c1c42e0c ext4: refactoring to use the unified helper ext4_quotas_off()
Rename ext4_quota_off_umount() to ext4_quotas_off(), and add type
parameter to replace open code in ext4_enable_quotas().

Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230327141630.156875-3-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-06-26 19:36:44 -04:00
Baokun Li d13f996327 ext4: turn quotas off if mount failed after enabling quotas
Yi found during a review of the patch "ext4: don't BUG on inconsistent
journal feature" that when ext4_mark_recovery_complete() returns an error
value, the error handling path does not turn off the enabled quotas,
which triggers the following kmemleak:

================================================================
unreferenced object 0xffff8cf68678e7c0 (size 64):
comm "mount", pid 746, jiffies 4294871231 (age 11.540s)
hex dump (first 32 bytes):
00 90 ef 82 f6 8c ff ff 00 00 00 00 41 01 00 00  ............A...
c7 00 00 00 bd 00 00 00 0a 00 00 00 48 00 00 00  ............H...
backtrace:
[<00000000c561ef24>] __kmem_cache_alloc_node+0x4d4/0x880
[<00000000d4e621d7>] kmalloc_trace+0x39/0x140
[<00000000837eee74>] v2_read_file_info+0x18a/0x3a0
[<0000000088f6c877>] dquot_load_quota_sb+0x2ed/0x770
[<00000000340a4782>] dquot_load_quota_inode+0xc6/0x1c0
[<0000000089a18bd5>] ext4_enable_quotas+0x17e/0x3a0 [ext4]
[<000000003a0268fa>] __ext4_fill_super+0x3448/0x3910 [ext4]
[<00000000b0f2a8a8>] ext4_fill_super+0x13d/0x340 [ext4]
[<000000004a9489c4>] get_tree_bdev+0x1dc/0x370
[<000000006e723bf1>] ext4_get_tree+0x1d/0x30 [ext4]
[<00000000c7cb663d>] vfs_get_tree+0x31/0x160
[<00000000320e1bed>] do_new_mount+0x1d5/0x480
[<00000000c074654c>] path_mount+0x22e/0xbe0
[<0000000003e97a8e>] do_mount+0x95/0xc0
[<000000002f3d3736>] __x64_sys_mount+0xc4/0x160
[<0000000027d2140c>] do_syscall_64+0x3f/0x90
================================================================

To solve this problem, we add a "failed_mount10" tag, and call
ext4_quota_off_umount() in this tag to release the enabled qoutas.

Fixes: 11215630aa ("ext4: don't BUG on inconsistent journal feature")
Cc: stable@kernel.org
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230327141630.156875-2-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-06-26 19:36:30 -04:00
Zhang Yi 7294505824 ext4: add journal cycled recording support
Always enable 'JBD2_CYCLE_RECORD' journal option on ext4, letting the
jbd2 continue to record new journal transactions from the recovered
journal head or the checkpointed transactions in the previous mount.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230322013353.1843306-3-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-06-26 19:35:13 -04:00
Zhang Yi c7fc605558 jbd2: continue to record log between each mount
For a newly mounted file system, the journal committing thread always
record new transactions from the start of the journal area, no matter
whether the journal was clean or just has been recovered. So the logdump
code in debugfs cannot dump continuous logs between each mount, it is
disadvantageous to analysis corrupted file system image and locate the
file system inconsistency bugs.

If we get a corrupted file system in the running products and want to
find out what has happened, besides lookup the system log, one effective
way is to backtrack the journal log. But we may not always run e2fsck
before each mount and the default fsck -a mode also cannot always
checkout all inconsistencies, so it could left over some inconsistencies
into the next mount until we detect it. Finally, transactions in the
journal may probably discontinuous and some relatively new transactions
has been covered, it becomes hard to analyse. If we could record
transactions continuously between each mount, we could acquire more
useful info from the journal. Like this:

 |Previous mount checkpointed/recovered logs|Current mount logs         |
 |{------}{---}{--------} ... {------}| ... |{======}{========}...000000|

And yes the journal area is limited and cannot record everything, the
problematic transaction may also be covered even if we do this, but
this is still useful for fuzzy tests and short-running products.

This patch save the head blocknr in the superblock after flushing the
journal or unmounting the file system, let the next mount could continue
to record new transaction behind it. This change is backward compatible
because the old kernel does not care about the head blocknr of the
journal. It is also fine if we mount a clean old image without valid
head blocknr, we fail back to set it to s_first just like before.
Finally, for the case of mount an unclean file system, we could also get
the journal head easily after scanning/replaying the journal, it will
continue to record new transaction after the recovered transactions.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230322013353.1843306-2-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-06-26 19:35:13 -04:00
Zhang Yi 04c2e98179 jbd2: remove j_format_version
journal->j_format_version is no longer used, remove it.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230315013128.3911115-7-chengzhihao1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-06-26 19:35:13 -04:00
Zhang Yi 3e5cf02cfa jbd2: factor out journal initialization from journal_get_superblock()
Current journal_get_superblock() couple journal superblock checking and
partial journal initialization, factor out initialization part from it
to make things clear.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230315013128.3911115-6-chengzhihao1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-06-26 19:35:13 -04:00
Zhang Yi 5cf036d4f1 jbd2: switch to check format version in superblock directly
We should only check and set extented features if journal format version
is 2, and now we check the in memory copy of the superblock
'journal->j_format_version', which relys on the parameter initialization
sequence, switch to use the h_blocktype in superblock cloud be more
clear.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230315013128.3911115-5-chengzhihao1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-06-26 19:35:13 -04:00
Zhihao Cheng 93e92cfcc1 ext4: ext4_put_super: Remove redundant checking for 'sbi->s_journal_bdev'
As discussed in [1], 'sbi->s_journal_bdev != sb->s_bdev' will always
become true if sbi->s_journal_bdev exists. Filesystem block device and
journal block device are both opened with 'FMODE_EXCL' mode, so these
two devices can't be same one. Then we can remove the redundant checking
'sbi->s_journal_bdev != sb->s_bdev' if 'sbi->s_journal_bdev' exists.

[1] https://lore.kernel.org/lkml/f86584f6-3877-ff18-47a1-2efaa12d18b2@huawei.com/

Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230315013128.3911115-3-chengzhihao1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-06-26 19:35:13 -04:00
Zhihao Cheng 26fb529024 ext4: Fix reusing stale buffer heads from last failed mounting
Following process makes ext4 load stale buffer heads from last failed
mounting in a new mounting operation:
mount_bdev
 ext4_fill_super
 | ext4_load_and_init_journal
 |  ext4_load_journal
 |   jbd2_journal_load
 |    load_superblock
 |     journal_get_superblock
 |      set_buffer_verified(bh) // buffer head is verified
 |   jbd2_journal_recover // failed caused by EIO
 | goto failed_mount3a // skip 'sb->s_root' initialization
 deactivate_locked_super
  kill_block_super
   generic_shutdown_super
    if (sb->s_root)
    // false, skip ext4_put_super->invalidate_bdev->
    // invalidate_mapping_pages->mapping_evict_folio->
    // filemap_release_folio->try_to_free_buffers, which
    // cannot drop buffer head.
   blkdev_put
    blkdev_put_whole
     if (atomic_dec_and_test(&bdev->bd_openers))
     // false, systemd-udev happens to open the device. Then
     // blkdev_flush_mapping->kill_bdev->truncate_inode_pages->
     // truncate_inode_folio->truncate_cleanup_folio->
     // folio_invalidate->block_invalidate_folio->
     // filemap_release_folio->try_to_free_buffers will be skipped,
     // dropping buffer head is missed again.

Second mount:
ext4_fill_super
 ext4_load_and_init_journal
  ext4_load_journal
   ext4_get_journal
    jbd2_journal_init_inode
     journal_init_common
      bh = getblk_unmovable
       bh = __find_get_block // Found stale bh in last failed mounting
      journal->j_sb_buffer = bh
   jbd2_journal_load
    load_superblock
     journal_get_superblock
      if (buffer_verified(bh))
      // true, skip journal->j_format_version = 2, value is 0
    jbd2_journal_recover
     do_one_pass
      next_log_block += count_tags(journal, bh)
      // According to journal_tag_bytes(), 'tag_bytes' calculating is
      // affected by jbd2_has_feature_csum3(), jbd2_has_feature_csum3()
      // returns false because 'j->j_format_version >= 2' is not true,
      // then we get wrong next_log_block. The do_one_pass may exit
      // early whenoccuring non JBD2_MAGIC_NUMBER in 'next_log_block'.

The filesystem is corrupted here, journal is partially replayed, and
new journal sequence number actually is already used by last mounting.

The invalidate_bdev() can drop all buffer heads even racing with bare
reading block device(eg. systemd-udev), so we can fix it by invalidating
bdev in error handling path in __ext4_fill_super().

Fetch a reproducer in [Link].

Link: https://bugzilla.kernel.org/show_bug.cgi?id=217171
Fixes: 25ed6e8a54 ("jbd2: enable journal clients to enable v2 checksumming")
Cc: stable@vger.kernel.org # v3.5
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230315013128.3911115-2-chengzhihao1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-06-26 19:35:12 -04:00
Brian Foster 310ee0902b ext4: allow concurrent unaligned dio overwrites
We've had reports of significant performance regression of sub-block
(unaligned) direct writes due to the added exclusivity restrictions
in ext4. The purpose of the exclusivity requirement for unaligned
direct writes is to avoid data corruption caused by unserialized
partial block zeroing in the iomap dio layer across overlapping
writes.

XFS has similar requirements for the same underlying reasons, yet
doesn't suffer the extreme performance regression that ext4 does.
The reason for this is that XFS utilizes IOMAP_DIO_OVERWRITE_ONLY
mode, which allows for optimistic submission of concurrent unaligned
I/O and kicks back writes that require partial block zeroing such
that they can be submitted in a safe, exclusive context. Since ext4
already performs most of these checks pre-submission, it can support
something similar without necessarily relying on the iomap flag and
associated retry mechanism.

Update the dio write submission path to allow concurrent submission
of unaligned direct writes that are purely overwrite and so will not
require block zeroing. To improve readability of the various related
checks, move the unaligned I/O handling down into
ext4_dio_write_checks(), where the dio draining and force wait logic
can immediately follow the locking requirement checks. Finally, the
IOMAP_DIO_OVERWRITE_ONLY flag is set to enable a warning check as a
precaution should the ext4 overwrite logic ever become inconsistent
with the zeroing expectations of iomap dio.

The performance improvement of sub-block direct write I/O is shown
in the following fio test on a 64xcpu guest vm:

Test: fio --name=test --ioengine=libaio --direct=1 --group_reporting
--overwrite=1 --thread --size=10G --filename=/mnt/fio
--readwrite=write --ramp_time=10s --runtime=60s --numjobs=8
--blocksize=2k --iodepth=256 --allow_file_create=0

v6.2:		write: IOPS=4328, BW=8724KiB/s
v6.2 (patched):	write: IOPS=801k, BW=1565MiB/s

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230314130759.642710-1-bfoster@redhat.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-06-26 19:35:12 -04:00
Theodore Ts'o 4c0cfebdf3 ext4: clean up mballoc criteria comments
Line wrap and slightly clarify the comments describing mballoc's
cirtiera.

Define EXT4_MB_NUM_CRS as part of the enum, so that it will
automatically get updated when criteria is added or removed.

Also fix a potential unitialized use of 'cr' variable if
CONFIG_EXT4_DEBUG is enabled.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-06-26 19:35:12 -04:00
Baokun Li ab8627e104 ext4: make ext4_zeroout_es() return void
After ext4_es_insert_extent() returns void, the return value in
ext4_zeroout_es() is also unnecessary, so make it return void too.

Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230424033846.4732-13-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-06-26 19:35:12 -04:00
Baokun Li 6c120399cd ext4: make ext4_es_insert_extent() return void
Now ext4_es_insert_extent() never return error, so make it return void.

Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230424033846.4732-12-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-06-26 19:35:12 -04:00
Baokun Li 8782b020cc ext4: make ext4_es_insert_delayed_block() return void
Now it never fails when inserting a delay extent, so the return value in
ext4_es_insert_delayed_block is no longer necessary, let it return void.

[ Fixed bug which caused system hangs during bigalloc test runs.  See
 https://lore.kernel.org/r/20230612030405.GH1436857@mit.edu for more
 details.  -- TYT ]

Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230424033846.4732-11-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-06-26 19:35:12 -04:00
Baokun Li ed5d285b3f ext4: make ext4_es_remove_extent() return void
Now ext4_es_remove_extent() never fails, so make it return void.

Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230424033846.4732-10-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-06-26 19:35:12 -04:00
Baokun Li 2a69c45008 ext4: using nofail preallocation in ext4_es_insert_extent()
Similar to in ext4_es_insert_delayed_block(), we use preallocations that
do not fail to avoid inconsistencies, but we do not care about es that are
not must be kept, and we return 0 even if such es memory allocation fails.

Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230424033846.4732-9-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-06-26 19:35:12 -04:00
Baokun Li 4a2d98447b ext4: using nofail preallocation in ext4_es_insert_delayed_block()
Similar to in ext4_es_remove_extent(), we use a no-fail preallocation
to avoid inconsistencies, except that here we may have to preallocate
two extent_status.

Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230424033846.4732-8-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-06-26 19:35:12 -04:00
Baokun Li e9fe2b882b ext4: using nofail preallocation in ext4_es_remove_extent()
If __es_remove_extent() returns an error it means that when splitting
extent, allocating an extent that must be kept failed, where returning
an error directly would cause the extent tree to be inconsistent. So we
use GFP_NOFAIL to pre-allocate an extent_status and pass it to
__es_remove_extent() to avoid this problem.

In addition, since the allocated memory is outside the i_es_lock, the
extent_status tree may change and the pre-allocated extent_status is
no longer needed, so we release the pre-allocated extent_status when
es->es_len is not initialized.

Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230424033846.4732-7-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-06-26 19:35:12 -04:00
Baokun Li bda3efaf77 ext4: use pre-allocated es in __es_remove_extent()
When splitting extent, if the second extent can not be dropped, we return
-ENOMEM and use GFP_NOFAIL to preallocate an extent_status outside of
i_es_lock and pass it to __es_remove_extent() to be used as the second
extent. This ensures that __es_remove_extent() is executed successfully,
thus ensuring consistency in the extent status tree. If the second extent
is not undroppable, we simply drop it and return 0. Then retry is no longer
necessary, remove it.

Now, __es_remove_extent() will always remove what it should, maybe more.

Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230424033846.4732-6-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-06-26 19:35:12 -04:00
Baokun Li 95f0b32033 ext4: use pre-allocated es in __es_insert_extent()
Pass a extent_status pointer prealloc to __es_insert_extent(). If the
pointer is non-null, it is used directly when a new extent_status is
needed to avoid memory allocation failures.

Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230424033846.4732-5-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-06-26 19:35:12 -04:00
Baokun Li 73a2f03365 ext4: factor out __es_alloc_extent() and __es_free_extent()
Factor out __es_alloc_extent() and __es_free_extent(), which only allocate
and free extent_status in these two helpers.

The ext4_es_alloc_extent() function is split into __es_alloc_extent()
and ext4_es_init_extent(). In __es_alloc_extent() we allocate memory using
GFP_KERNEL | __GFP_NOFAIL | __GFP_ZERO if the memory allocation cannot
fail, otherwise we use GFP_ATOMIC. and the ext4_es_init_extent() is used to
initialize extent_status and update related variables after a successful
allocation.

This is to prepare for the use of pre-allocated extent_status later.

Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230424033846.4732-4-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-06-26 19:35:12 -04:00
Baokun Li 9649eb18c6 ext4: add a new helper to check if es must be kept
In the extent status tree, we have extents which we can just drop without
issues and extents we must not drop - this depends on the extent's status
- currently ext4_es_is_delayed() extents must stay, others may be dropped.

A helper function is added to help determine if the current extent can
be dropped, although only ext4_es_is_delayed() extents cannot be dropped
currently.

Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230424033846.4732-3-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-06-26 19:35:12 -04:00
Baokun Li de25d6e961 ext4: only update i_reserved_data_blocks on successful block allocation
In our fault injection test, we create an ext4 file, migrate it to
non-extent based file, then punch a hole and finally trigger a WARN_ON
in the ext4_da_update_reserve_space():

EXT4-fs warning (device sda): ext4_da_update_reserve_space:369:
ino 14, used 11 with only 10 reserved data blocks

When writing back a non-extent based file, if we enable delalloc, the
number of reserved blocks will be subtracted from the number of blocks
mapped by ext4_ind_map_blocks(), and the extent status tree will be
updated. We update the extent status tree by first removing the old
extent_status and then inserting the new extent_status. If the block range
we remove happens to be in an extent, then we need to allocate another
extent_status with ext4_es_alloc_extent().

       use old    to remove   to add new
    |----------|------------|------------|
              old extent_status

The problem is that the allocation of a new extent_status failed due to a
fault injection, and __es_shrink() did not get free memory, resulting in
a return of -ENOMEM. Then do_writepages() retries after receiving -ENOMEM,
we map to the same extent again, and the number of reserved blocks is again
subtracted from the number of blocks in that extent. Since the blocks in
the same extent are subtracted twice, we end up triggering WARN_ON at
ext4_da_update_reserve_space() because used > ei->i_reserved_data_blocks.

For non-extent based file, we update the number of reserved blocks after
ext4_ind_map_blocks() is executed, which causes a problem that when we call
ext4_ind_map_blocks() to create a block, it doesn't always create a block,
but we always reduce the number of reserved blocks. So we move the logic
for updating reserved blocks to ext4_ind_map_blocks() to ensure that the
number of reserved blocks is updated only after we do succeed in allocating
some new blocks.

Fixes: 5f634d064c ("ext4: Fix quota accounting error with fallocate")
Cc: stable@kernel.org
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230424033846.4732-2-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-06-26 19:34:56 -04:00
Ojaswin Mujoo f52f3d2b9f ext4: Give symbolic names to mballoc criterias
mballoc criterias have historically been called by numbers
like CR0, CR1... however this makes it confusing to understand
what each criteria is about.

Change these criterias from numbers to symbolic names and add
relevant comments. While we are at it, also reformat and add some
comments to ext4_seq_mb_stats_show() for better readability.

Additionally, define CR_FAST which signifies the criteria
below which we can make quicker decisions like:
  * quitting early if (free block < requested len)
  * avoiding to scan free extents smaller than required len.
  * avoiding to initialize buddy cache and work with existing cache
  * limiting prefetches

Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Link: https://lore.kernel.org/r/a2dc6ec5aea5e5e68cf8e788c2a964ffead9c8b0.1685449706.git.ojaswin@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-06-26 19:34:56 -04:00
Ojaswin Mujoo 7e170922f0 ext4: Add allocation criteria 1.5 (CR1_5)
CR1_5 aims to optimize allocations which can't be satisfied in CR1. The
fact that we couldn't find a group in CR1 suggests that it would be
difficult to find a continuous extent to compleltely satisfy our
allocations. So before falling to the slower CR2, in CR1.5 we
proactively trim the the preallocations so we can find a group with
(free / fragments) big enough.  This speeds up our allocation at the
cost of slightly reduced preallocation.

The patch also adds a new sysfs tunable:

* /sys/fs/ext4/<partition>/mb_cr1_5_max_trim_order

This controls how much CR1.5 can trim a request before falling to CR2.
For example, for a request of order 7 and max trim order 2, CR1.5 can
trim this upto order 5.

Suggested-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/150fdf65c8e4cc4dba71e020ce0859bcf636a5ff.1685449706.git.ojaswin@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-06-26 19:34:56 -04:00
Ojaswin Mujoo 856d865c17 ext4: Abstract out logic to search average fragment list
Make the logic of searching average fragment list of a given order reusable
by abstracting it out to a differnet function. This will also avoid
code duplication in upcoming patches.

No functional changes.

Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/028c11d95b17ce0285f45456709a0ca922df1b83.1685449706.git.ojaswin@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-06-26 19:34:56 -04:00
Ojaswin Mujoo 4f3d1e4533 ext4: Ensure ext4_mb_prefetch_fini() is called for all prefetched BGs
Before this patch, the call stack in ext4_run_li_request is as follows:

  /*
   * nr = no. of BGs we want to fetch (=s_mb_prefetch)
   * prefetch_ios = no. of BGs not uptodate after
   * 		    ext4_read_block_bitmap_nowait()
   */
  next_group = ext4_mb_prefetch(sb, group, nr, prefetch_ios);
  ext4_mb_prefetch_fini(sb, next_group prefetch_ios);

ext4_mb_prefetch_fini() will only try to initialize buddies for BGs in
range [next_group - prefetch_ios, next_group). This is incorrect since
sometimes (prefetch_ios < nr), which causes ext4_mb_prefetch_fini() to
incorrectly ignore some of the BGs that might need initialization. This
issue is more notable now with the previous patch enabling "fetching" of
BLOCK_UNINIT BGs which are marked buffer_uptodate by default.

Fix this by passing nr to ext4_mb_prefetch_fini() instead of
prefetch_ios so that it considers the right range of groups.

Similarly, make sure we don't pass nr=0 to ext4_mb_prefetch_fini() in
ext4_mb_regular_allocator() since we might have prefetched BLOCK_UNINIT
groups that would need buddy initialization.

Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/05e648ae04ec5b754207032823e9c1de9a54f87a.1685449706.git.ojaswin@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-06-26 19:34:56 -04:00
Ojaswin Mujoo 3c6296046c ext4: Don't skip prefetching BLOCK_UNINIT groups
Currently, ext4_mb_prefetch() and ext4_mb_prefetch_fini() skip
BLOCK_UNINIT groups since fetching their bitmaps doesn't need disk IO.
As a consequence, we end not initializing the buddy structures and CR0/1
lists for these BGs, even though it can be done without any disk IO
overhead. Hence, don't skip such BGs during prefetch and prefetch_fini.

This improves the accuracy of CR0/1 allocation as earlier, we could have
essentially empty BLOCK_UNINIT groups being ignored by CR0/1 due to their buddy
not being initialized, leading to slower CR2 allocations. With this patch CR0/1
will be able to discover these groups as well, thus improving performance.

Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/dc3130b8daf45ffe63d8a3c1edcf00eb8ba70e1f.1685449706.git.ojaswin@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-06-26 19:34:56 -04:00
Ojaswin Mujoo 1b42001121 ext4: Avoid scanning smaller extents in BG during CR1
When we are inside ext4_mb_complex_scan_group() in CR1, we can be sure
that this group has atleast 1 big enough continuous free extent to satisfy
our request because (free / fragments) > goal length.

Hence, instead of wasting time looping over smaller free extents, only
try to consider the free extent if we are sure that it has enough
continuous free space to satisfy goal length. This is particularly
useful when scanning highly fragmented BGs in CR1 as, without this
patch, the allocator might stop scanning early before reaching the big
enough free extent (due to ac_found > mb_max_to_scan) which causes us to
uncessarily trim the request.

Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/a5473df4517c53ec940bc9b603ef83a547032a32.1685449706.git.ojaswin@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-06-26 19:34:56 -04:00
Ojaswin Mujoo 3ef5d26387 ext4: Add counter to track successful allocation of goal length
Track number of allocations where the length of blocks allocated is equal to the
length of goal blocks (post normalization). This metric could be useful if
making changes to the allocator logic in the future as it could give us
visibility into how often do we trim our requests.

PS: ac_b_ex.fe_len might get modified due to preallocation efforts and
hence we use ac_f_ex.fe_len instead since we want to compare how much the
allocator was able to actually find.

Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/343620e2be8a237239ea2613a7a866ee8607e973.1685449706.git.ojaswin@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-06-26 19:34:56 -04:00
Ojaswin Mujoo fdd9a00943 ext4: Add per CR extent scanned counter
This gives better visibility into the number of extents scanned in each
particular CR. For example, this information can be used to see how out
block group scanning logic is performing when the BG is fragmented.

Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/55bb6d80f6e22ed2a5a830aa045572bdffc8b1b9.1685449706.git.ojaswin@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-06-26 19:34:56 -04:00
Ojaswin Mujoo 4eb7a4a1a3 ext4: Convert mballoc cr (criteria) to enum
Convert criteria to be an enum so it easier to maintain and
update the tracefiles to use enum names. This change also makes
it easier to insert new criterias in the future.

There is no functional change in this patch.

Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/5d82fd467bdf70ea45bdaef810af3b146013946c.1685449706.git.ojaswin@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-06-26 19:34:56 -04:00
Ritesh Harjani 5730cce353 ext4: Remove unused extern variables declaration
ext4_mb_stats & ext4_mb_max_to_scan are never used. We use
sbi->s_mb_stats and sbi->s_mb_max_to_scan instead.
Hence kill these extern declarations.

Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/928b3142062172533b6d1b5a94de94700590fef3.1685449706.git.ojaswin@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-06-26 19:34:56 -04:00
Ritesh Harjani 569f196f1e ext4: mballoc: Remove useless setting of ac_criteria
There will be changes coming in future patches which will introduce a new
criteria for block allocation. This removes the useless setting of ac_criteria.
AFAIU, this might be only used to differentiate between whether a preallocated
blocks was allocated or was regular allocator called for allocating blocks.
Hence this also adds the debug prints to identify what type of block allocation
was done in ext4_mb_show_ac().

Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/1dbae05617519cb6202f1b299c9d1be3e7cda763.1685449706.git.ojaswin@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-06-26 19:34:56 -04:00
Kemeng Shi 2ec6d0a5ea ext4: fix wrong unit use in ext4_mb_new_blocks
Function ext4_free_blocks_simple needs count in cluster. Function
ext4_free_blocks accepts count in block. Convert count to cluster
to fix the mismatch.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Cc: stable@kernel.org
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Link: https://lore.kernel.org/r/20230603150327.3596033-12-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-06-26 19:34:48 -04:00
Kemeng Shi 247c3d214c ext4: fix wrong unit use in ext4_mb_clear_bb
Function ext4_issue_discard need count in cluster. Pass count_clusters
instead of count to fix the mismatch.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Cc: stable@kernel.org
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Link: https://lore.kernel.org/r/20230603150327.3596033-11-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-06-26 19:34:36 -04:00
Kemeng Shi ad78b5efe4 ext4: remove unused parameter from ext4_mb_new_blocks_simple()
Two cleanups for ext4_mb_new_blocks_simple:
Remove unused parameter handle of ext4_mb_new_blocks_simple.
Move ext4_mb_new_blocks_simple definition before ext4_mb_new_blocks to
remove unnecessary forward declaration of ext4_mb_new_blocks_simple.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Link: https://lore.kernel.org/r/20230603150327.3596033-10-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-06-26 19:33:59 -04:00
Linus Torvalds 2c96136a3f - Add support for unaccepted memory as specified in the UEFI spec v2.9.
The gist of it all is that Intel TDX and AMD SEV-SNP confidential
   computing guests define the notion of accepting memory before using it
   and thus preventing a whole set of attacks against such guests like
   memory replay and the like.
 
   There are a couple of strategies of how memory should be accepted
   - the current implementation does an on-demand way of accepting.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmSZ0f4ACgkQEsHwGGHe
 VUpasw//RKoNW9HSU1csY+XnG9uuaT6QKgji+gIEZWWIGPO9iibvbBj6P5WxJE8T
 fe7yb6CGa6d6thoU0v+mQGVVvCd7OjCFwPD5wAo4mXToD7Ig+4mI6jMkaKifqa2f
 N1Uuy8u/zQnGyWrP5Y//WH5bJYfsmds4UGwXI2nLvKlhE7MG90/ePjt7iqnnwZsy
 waLp6a0Q1VeOvnfRszFLHZw/SoER5RSJ4qeVqttkFNmPPEKMK1Kirrl2poR56OQJ
 nMr6LqVtD7erlSJ36VRXOKzLI443A4iIEIg/wBjIOU6L5ZEWJGNqtCDnIqFJ6+TM
 XatsejfRYkkMZH0qXtX9+M0u+HJHbZPCH5rEcA21P3Nbd7od/ANq91qCGoMjtUZ4
 7pZohMG8M6IDvkLiOb8fQVkR5k/9Jbk8UvdN/8jdPx1ERxYMFO3BDvJpV2gzrW4B
 KYtFTPR7j2nY3eKfDpe3flanqYzKUBsKoTlLnlH7UHaiMZ2idwG8AQjlrhC/erCq
 /Lq1LXt4Mq46FyHABc+PSHytu0WWj1nBUftRt+lviY/Uv7TlkBldOTT7wm7itsfF
 HUCTfLWl0CJXKPq8rbbZhAG/exN6Ay6MO3E3OcNq8A72E5y4cXenuG3ic/0tUuOu
 FfjpiMk35qE2Qb4hnj1YtF3XINtd1MpKcuwzGSzEdv9s3J7hrS0=
 =FS95
 -----END PGP SIGNATURE-----

Merge tag 'x86_cc_for_v6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 confidential computing update from Borislav Petkov:

 - Add support for unaccepted memory as specified in the UEFI spec v2.9.

   The gist of it all is that Intel TDX and AMD SEV-SNP confidential
   computing guests define the notion of accepting memory before using
   it and thus preventing a whole set of attacks against such guests
   like memory replay and the like.

   There are a couple of strategies of how memory should be accepted -
   the current implementation does an on-demand way of accepting.

* tag 'x86_cc_for_v6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  virt: sevguest: Add CONFIG_CRYPTO dependency
  x86/efi: Safely enable unaccepted memory in UEFI
  x86/sev: Add SNP-specific unaccepted memory support
  x86/sev: Use large PSC requests if applicable
  x86/sev: Allow for use of the early boot GHCB for PSC requests
  x86/sev: Put PSC struct on the stack in prep for unaccepted memory support
  x86/sev: Fix calculation of end address based on number of pages
  x86/tdx: Add unaccepted memory support
  x86/tdx: Refactor try_accept_one()
  x86/tdx: Make _tdx_hypercall() and __tdx_module_call() available in boot stub
  efi/unaccepted: Avoid load_unaligned_zeropad() stepping into unaccepted memory
  efi: Add unaccepted memory support
  x86/boot/compressed: Handle unaccepted memory
  efi/libstub: Implement support for unaccepted memory
  efi/x86: Get full memory map in allocate_e820()
  mm: Add support for unaccepted memory
2023-06-26 15:32:39 -07:00
Kemeng Shi 11b6890be0 ext4: get block from bh in ext4_free_blocks for fast commit replay
ext4_free_blocks will retrieve block from bh if block parameter is zero.
Retrieve block before ext4_free_blocks_simple to avoid potentially
passing wrong block to ext4_free_blocks_simple.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Cc: stable@kernel.org
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Link: https://lore.kernel.org/r/20230603150327.3596033-9-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-06-26 18:13:05 -04:00
Linus Torvalds a0433f8cae for-6.5/block-2023-06-23
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmSV8dwQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpilGD/9Yys1oxIXJpRf00fzrylAlBthRxMjFQVWw
 zAut106hAQiBHvU8IkmGA3MvEFVHxtzwYhHI7IR8K3aZBIqscweCqmVI9JyogJw9
 U9Twnzel47VmuKdM94FeoN+hbj1fP8EWTjzmy67/zEEfFCdmHvNlMi3lSrGYIpFy
 39LxTB99Y4UarM5PtWbes37GYYljzMSWKuo4AfBkvq1eQa+sZ0Vq2xAABKq3UM7f
 apqhgHtkJooRePDP0eQp+kAyyVMgW2jIK+oIdJDxNF3CKTu2w40RzaYz6fp+jVSU
 H4R/xS59GW4/xql+VBJDh/qJg9K62DPPYjlW8BmSR8+IjvfFpsyH3/MacE50CD3P
 20fs/Mnj49H79fDrQEHJI53cOOb2EmUitbwLbvOcColNTPpt8loBtdQxjF2RMU8R
 Nyort9DJPFclYCxky1LYg1CNEC2Ln4Zy/jD47wPvqRmOQphOoVlV/hPnOEqvjaZC
 49Vn70W2DeE9cXvYI7ha+XIg6/oj+Gs3iusEbV08Ci7EAtXgI+ZUUsQ97K8UNiUh
 h2lqSJtuI7lBpYP9sf+BeCch5UCC+xGYyTdoM5f58lehWBBPtbs0g7S9RyRyOYxe
 n+yxEUo3dAGzJ/xsKAjinbZfeWIpr0b1TkAh4w3Cq/BKzRr9Bp8lBAxYuancbQ+Y
 1ADPteUOTA==
 =zP4Y
 -----END PGP SIGNATURE-----

Merge tag 'for-6.5/block-2023-06-23' of git://git.kernel.dk/linux

Pull block updates from Jens Axboe:

 - NVMe pull request via Keith:
      - Various cleanups all around (Irvin, Chaitanya, Christophe)
      - Better struct packing (Christophe JAILLET)
      - Reduce controller error logs for optional commands (Keith)
      - Support for >=64KiB block sizes (Daniel Gomez)
      - Fabrics fixes and code organization (Max, Chaitanya, Daniel
        Wagner)

 - bcache updates via Coly:
      - Fix a race at init time (Mingzhe Zou)
      - Misc fixes and cleanups (Andrea, Thomas, Zheng, Ye)

 - use page pinning in the block layer for dio (David)

 - convert old block dio code to page pinning (David, Christoph)

 - cleanups for pktcdvd (Andy)

 - cleanups for rnbd (Guoqing)

 - use the unchecked __bio_add_page() for the initial single page
   additions (Johannes)

 - fix overflows in the Amiga partition handling code (Michael)

 - improve mq-deadline zoned device support (Bart)

 - keep passthrough requests out of the IO schedulers (Christoph, Ming)

 - improve support for flush requests, making them less special to deal
   with (Christoph)

 - add bdev holder ops and shutdown methods (Christoph)

 - fix the name_to_dev_t() situation and use cases (Christoph)

 - decouple the block open flags from fmode_t (Christoph)

 - ublk updates and cleanups, including adding user copy support (Ming)

 - BFQ sanity checking (Bart)

 - convert brd from radix to xarray (Pankaj)

 - constify various structures (Thomas, Ivan)

 - more fine grained persistent reservation ioctl capability checks
   (Jingbo)

 - misc fixes and cleanups (Arnd, Azeem, Demi, Ed, Hengqi, Hou, Jan,
   Jordy, Li, Min, Yu, Zhong, Waiman)

* tag 'for-6.5/block-2023-06-23' of git://git.kernel.dk/linux: (266 commits)
  scsi/sg: don't grab scsi host module reference
  ext4: Fix warning in blkdev_put()
  block: don't return -EINVAL for not found names in devt_from_devname
  cdrom: Fix spectre-v1 gadget
  block: Improve kernel-doc headers
  blk-mq: don't insert passthrough request into sw queue
  bsg: make bsg_class a static const structure
  ublk: make ublk_chr_class a static const structure
  aoe: make aoe_class a static const structure
  block/rnbd: make all 'class' structures const
  block: fix the exclusive open mask in disk_scan_partitions
  block: add overflow checks for Amiga partition support
  block: change all __u32 annotations to __be32 in affs_hardblocks.h
  block: fix signed int overflow in Amiga partition support
  block: add capacity validation in bdev_add_partition()
  block: fine-granular CAP_SYS_ADMIN for Persistent Reservation
  block: disallow Persistent Reservation on partitions
  reiserfs: fix blkdev_put() warning from release_journal_dev()
  block: fix wrong mode for blkdev_get_by_dev() from disk_scan_partitions()
  block: document the holder argument to blkdev_get_by_path
  ...
2023-06-26 12:47:20 -07:00
Linus Torvalds 3eccc0c886 for-6.5/splice-2023-06-23
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmSV8QgQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpupIEADKEZvpxDyaxHjYZFFeoSJRkh+AEJHe0Xtr
 J5vUL8t8zmAV3F7i8XaoAEcR0dC0VQcoTc8fAOty71+5hsc7gvtyyNjqU/YWRVqK
 Xr+VJuSJ+OGx3MzpRWEkepagfPyqP5cyyCOK6gqIgqzc3IwqkR/3QHVRc6oR8YbY
 AQd7tqm2fQXK9WDHEy5hcaQeqb9uKZjQQoZejpPPerpJM+9RMgKxpCGtnLLIUhr/
 sgl7KyLIQPBmveO2vfOR+dmsJBqsLqneqkXDKMAIfpeVEEkHHAlCH4E5Ne1XUS+s
 ie4If+reuyn1Ktt5Ry1t7w2wr8cX1fcay3K28tgwjE2Bvremc5YnYgb3pyUDW38f
 tXXkpg/eTXd/Pn0Crpagoa9zJ927tt5JXIO1/PagPEP1XOqUuthshDFsrVqfqbs+
 36gqX2JWB4NJTg9B9KBHA3+iVCJyZLjUqOqws7hOJOvhQytZVm/IwkGBg1Slhe1a
 J5WemBlqX8lTgXz0nM7cOhPYTZeKe6hazCcb5VwxTUTj9SGyYtsMfqqTwRJO9kiF
 j1VzbOAgExDYe+GvfqOFPh9VqZho66+DyOD/Xtca4eH7oYyHSmP66o8nhRyPBPZA
 maBxQhUkPQn4/V/0fL2TwIdWYKsbj8bUyINKPZ2L35YfeICiaYIctTwNJxtRmItB
 M3VxWD3GZQ==
 =KhW4
 -----END PGP SIGNATURE-----

Merge tag 'for-6.5/splice-2023-06-23' of git://git.kernel.dk/linux

Pull splice updates from Jens Axboe:
 "This kills off ITER_PIPE to avoid a race between truncate,
  iov_iter_revert() on the pipe and an as-yet incomplete DMA to a bio
  with unpinned/unref'ed pages from an O_DIRECT splice read. This causes
  memory corruption.

  Instead, we either use (a) filemap_splice_read(), which invokes the
  buffered file reading code and splices from the pagecache into the
  pipe; (b) copy_splice_read(), which bulk-allocates a buffer, reads
  into it and then pushes the filled pages into the pipe; or (c) handle
  it in filesystem-specific code.

  Summary:

   - Rename direct_splice_read() to copy_splice_read()

   - Simplify the calculations for the number of pages to be reclaimed
     in copy_splice_read()

   - Turn do_splice_to() into a helper, vfs_splice_read(), so that it
     can be used by overlayfs and coda to perform the checks on the
     lower fs

   - Make vfs_splice_read() jump to copy_splice_read() to handle
     direct-I/O and DAX

   - Provide shmem with its own splice_read to handle non-existent pages
     in the pagecache. We don't want a ->read_folio() as we don't want
     to populate holes, but filemap_get_pages() requires it

   - Provide overlayfs with its own splice_read to call down to a lower
     layer as overlayfs doesn't provide ->read_folio()

   - Provide coda with its own splice_read to call down to a lower layer
     as coda doesn't provide ->read_folio()

   - Direct ->splice_read to copy_splice_read() in tty, procfs, kernfs
     and random files as they just copy to the output buffer and don't
     splice pages

   - Provide wrappers for afs, ceph, ecryptfs, ext4, f2fs, nfs, ntfs3,
     ocfs2, orangefs, xfs and zonefs to do locking and/or revalidation

   - Make cifs use filemap_splice_read()

   - Replace pointers to generic_file_splice_read() with pointers to
     filemap_splice_read() as DIO and DAX are handled in the caller;
     filesystems can still provide their own alternate ->splice_read()
     op

   - Remove generic_file_splice_read()

   - Remove ITER_PIPE and its paraphernalia as generic_file_splice_read
     was the only user"

* tag 'for-6.5/splice-2023-06-23' of git://git.kernel.dk/linux: (31 commits)
  splice: kdoc for filemap_splice_read() and copy_splice_read()
  iov_iter: Kill ITER_PIPE
  splice: Remove generic_file_splice_read()
  splice: Use filemap_splice_read() instead of generic_file_splice_read()
  cifs: Use filemap_splice_read()
  trace: Convert trace/seq to use copy_splice_read()
  zonefs: Provide a splice-read wrapper
  xfs: Provide a splice-read wrapper
  orangefs: Provide a splice-read wrapper
  ocfs2: Provide a splice-read wrapper
  ntfs3: Provide a splice-read wrapper
  nfs: Provide a splice-read wrapper
  f2fs: Provide a splice-read wrapper
  ext4: Provide a splice-read wrapper
  ecryptfs: Provide a splice-read wrapper
  ceph: Provide a splice-read wrapper
  afs: Provide a splice-read wrapper
  9p: Add splice_read wrapper
  net: Make sock_splice_read() use copy_splice_read() by default
  tty, proc, kernfs, random: Use copy_splice_read()
  ...
2023-06-26 11:52:12 -07:00
Linus Torvalds cc423f6337 for-6.5-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmSZ0YUACgkQxWXV+ddt
 WDuF9g/+OsEZflGYIZj11trG0l5HWApnINKqhZ524J0TNy9KxY0KOTqPCOg0O41E
 Vt7uJCMG06+ifvVEby8srjzUZxUutIuMeGIP91VyXUbt+CleAWunxnL8aKcTPS3T
 QxyGx0VmukO2UHYCXhXQLRXo/+zPcBxnk6UgVzcBCIOecOMTB1KCeblBmqd3q86f
 NVFse+BTkmtm86u/1rzEDqgY6lMHQ+jNoHhpRVGpzmSnSX8GELyOW3QnixS0LCo2
 no0vvW0QXkRJ+S68V5zBWlqa3xr21jOYcmON2Ra2G8Etsjx7W5XKS3I9k/4uonxb
 LbITmBwEZWt/aTzmLFT16S5M9BlRqH5Ffmsw7Ls+NDmdvH/f1zM8XeNAb7kpFTrn
 T3aALjkcd65/JFmgyVmzdt4BSmrUkYm0EEmLirQec86HJ4NlQJpJ2B7cfMWKPyal
 +VgaT4S+fLTc/HJD3nObMXTCxZrMf0sBUhU4/QXqL7TTjqqosSn26mlGNUocw7Ty
 HaESk7j2L9TMPt640r1G98j9ND7sWmyBmiYsah8F3MKZCIS892qhtFs0m5g2tA1F
 sjPv9u6M5Pi6ie5Eo8xs+SqKa7TPPVsbZ9XcMRBuzDc5AtUPAm6ii9QVwref8wTq
 qO379jDepgPj4HZkXMzQKxd6rw6wrF854304XhjHZefk+ChhIc4=
 =nN4X
 -----END PGP SIGNATURE-----

Merge tag 'for-6.5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs updates from David Sterba:
 "Mainly core changes, refactoring and optimizations.

  Performance is improved in some areas, overall there may be a
  cumulative improvement due to refactoring that removed lookups in the
  IO path or simplified IO submission tracking.

  Core:

   - submit IO synchronously for fast checksums (crc32c and xxhash),
     remove high priority worker kthread

   - read extent buffer in one go, simplify IO tracking, bio submission
     and locking

   - remove additional tracking of redirtied extent buffers, originally
     added for zoned mode but actually not needed

   - track ordered extent pointer in bio to avoid rbtree lookups during
     IO

   - scrub, use recovered data stripes as cache to avoid unnecessary
     read

   - in zoned mode, optimize logical to physical mappings of extents

   - remove PageError handling, not set by VFS nor writeback

   - cleanups, refactoring, better structure packing

   - lots of error handling improvements

   - more assertions, lockdep annotations

   - print assertion failure with the exact line where it happens

   - tracepoint updates

   - more debugging prints

  Performance:

   - speedup in fsync(), better tracking of inode logged status can
     avoid transaction commit

   - IO path structures track logical offsets in data structures and
     does not need to look it up

  User visible changes:

   - don't commit transaction for every created subvolume, this can
     reduce time when many subvolumes are created in a batch

   - print affected files when relocation fails

   - trigger orphan file cleanup during START_SYNC ioctl

  Notable fixes:

   - fix crash when disabling quota and relocation

   - fix crashes when removing roots from drity list

   - fix transacion abort during relocation when converting from newer
     profiles not covered by fallback

   - in zoned mode, stop reclaiming block groups if filesystem becomes
     read-only

   - fix rare race condition in tree mod log rewind that can miss some
     btree node slots

   - with enabled fsverity, drop up-to-date page bit in case the
     verification fails"

* tag 'for-6.5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (194 commits)
  btrfs: fix race between quota disable and relocation
  btrfs: add comment to struct btrfs_fs_info::dirty_cowonly_roots
  btrfs: fix race when deleting free space root from the dirty cow roots list
  btrfs: fix race when deleting quota root from the dirty cow roots list
  btrfs: tracepoints: also show actual number of the outstanding extents
  btrfs: update i_version in update_dev_time
  btrfs: make btrfs_compressed_bioset static
  btrfs: add handling for RAID1C23/DUP to btrfs_reduce_alloc_profile
  btrfs: scrub: remove btrfs_fs_info::scrub_wr_completion_workers
  btrfs: scrub: remove scrub_ctx::csum_list member
  btrfs: do not BUG_ON after failure to migrate space during truncation
  btrfs: do not BUG_ON on failure to get dir index for new snapshot
  btrfs: send: do not BUG_ON() on unexpected symlink data extent
  btrfs: do not BUG_ON() when dropping inode items from log root
  btrfs: replace BUG_ON() at split_item() with proper error handling
  btrfs: do not BUG_ON() on tree mod log failures at btrfs_del_ptr()
  btrfs: do not BUG_ON() on tree mod log failures at insert_ptr()
  btrfs: do not BUG_ON() on tree mod log failure at insert_new_root()
  btrfs: do not BUG_ON() on tree mod log failures at push_nodes_for_insert()
  btrfs: abort transaction at update_ref_for_cow() when ref count is zero
  ...
2023-06-26 11:41:38 -07:00
Linus Torvalds e940efa936 zonefs changes for 6.5
- Modify the synchronous direct write path to use iomap instead of
    manually coding issuing zone append write BIOs, from me.
 
  - Use the FMODE_CAN_ODIRECT file flag to indicate support from direct
    IO instead of using the old way with noop direct_io methods, from
    Christoph.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQSRPv8tYSvhwAzJdzjdoc3SxdoYdgUCZJmJMQAKCRDdoc3SxdoY
 dqN9AQDYcY/WzK3eZWqgseXOoLBLibpCRE+SjoMFp4/hN47BDAD+I6owLtp3zyLi
 1+Wcdlsdr/i2NEsSfbYDUPI/33xlTww=
 =mj70
 -----END PGP SIGNATURE-----

Merge tag 'zonefs-6.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs

Pull zonefs updates from Damien Le Moal:

 - Modify the synchronous direct write path to use iomap instead of
   manually coding issuing zone append write BIOs (me)

 - Use the FMODE_CAN_ODIRECT file flag to indicate support from direct
   IO instead of using the old way with noop direct_io methods
   (Christoph)

* tag 'zonefs-6.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs:
  zonefs: set FMODE_CAN_ODIRECT instead of a dummy direct_IO method
  zonefs: use iomap for synchronous direct writes
2023-06-26 11:29:29 -07:00
Linus Torvalds 098c5dd9cf Changes since last update:
- Fix rare I/O hang on deduplicated compressed images due to loop
    hooked chains;
 
  - Fix compact compression layout of 16k blocks on arm64 devices;
 
  - Fix atomic context detection of async decompression;
 
  - Decompression/Xattr code cleanups.
 -----BEGIN PGP SIGNATURE-----
 
 iIcEABYIAC8WIQThPAmQN9sSA0DVxtI5NzHcH7XmBAUCZJjfSREceGlhbmdAa2Vy
 bmVsLm9yZwAKCRA5NzHcH7XmBPXVAQCQqmieRIbW84N6l4eonzoljoxRLCeDsjEn
 R0qTaT66GQEA3ZZvwFPmuOgdSGadsvradYdLuL389U5y12P3l0BwRwY=
 =ffx+
 -----END PGP SIGNATURE-----

Merge tag 'erofs-for-6.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs

Pull erofs updates from Gao Xiang:
 "No outstanding new feature for this cycle.

  Most of these commits are decompression cleanups which are part of the
  ongoing development for subpage/folio compression support as well as
  xattr cleanups for the upcoming xattr bloom filter optimization [1].

  In addition, there are bugfixes to address some corner cases of
  compressed images due to global data de-duplication and arm64 16k
  pages.

  Summary:

   - Fix rare I/O hang on deduplicated compressed images due to loop
     hooked chains

   - Fix compact compression layout of 16k blocks on arm64 devices

   - Fix atomic context detection of async decompression

   - Decompression/Xattr code cleanups"

Link: https://lore.kernel.org/r/20230621083209.116024-1-jefflexu@linux.alibaba.com [1]

* tag 'erofs-for-6.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
  erofs: clean up zmap.c
  erofs: remove unnecessary goto
  erofs: Fix detection of atomic context
  erofs: use separate xattr parsers for listxattr/getxattr
  erofs: unify inline/shared xattr iterators for listxattr/getxattr
  erofs: make the size of read data stored in buffer_ofs
  erofs: unify xattr_iter structures
  erofs: use absolute position in xattr iterator
  erofs: fix compact 4B support for 16k block size
  erofs: convert erofs_read_metabuf() to erofs_bread() for xattr
  erofs: use poison pointer to replace the hard-coded address
  erofs: use struct lockref to replace handcrafted approach
  erofs: adapt managed inode operations into folios
  erofs: kill hooked chains to avoid loops on deduplicated compressed images
  erofs: avoid on-stack pagepool directly passed by arguments
  erofs: allocate extra bvec pages directly instead of retrying
  erofs: clean up z_erofs_pcluster_readmore()
  erofs: remove the member readahead from struct z_erofs_decompress_frontend
  erofs: fold in z_erofs_decompress()
2023-06-26 11:00:18 -07:00
Linus Torvalds 74774e243c fsverity updates for 6.5
Several updates for fs/verity/:
 
 - Do all hashing with the shash API instead of with the ahash API.  This
   simplifies the code and reduces API overhead.  It should also make
   things slightly easier for XFS's upcoming support for fsverity.  It
   does drop fsverity's support for off-CPU hash accelerators, but that
   support was incomplete and not known to be used.
 
 - Update and export fsverity_get_digest() so that it's ready for
   overlayfs's upcoming support for fsverity checking of lowerdata.
 
 - Improve the documentation for builtin signature support.
 
 - Fix a bug in the large folio support.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQSacvsUNc7UX4ntmEPzXCl4vpKOKwUCZJjsWRQcZWJpZ2dlcnNA
 Z29vZ2xlLmNvbQAKCRDzXCl4vpKOK0IsAQCZ9ZP2U0DqLKV025LzcU9epUdS30xJ
 U7WOs8gP63pH4QEAqbU1O6bVhEzdFWGzq5gdzdLePWjOyHrGCVcR2u+fgw4=
 =ptAR
 -----END PGP SIGNATURE-----

Merge tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fsverity/linux

Pull fsverity updates from Eric Biggers:
 "Several updates for fs/verity/:

   - Do all hashing with the shash API instead of with the ahash API.

     This simplifies the code and reduces API overhead. It should also
     make things slightly easier for XFS's upcoming support for
     fsverity. It does drop fsverity's support for off-CPU hash
     accelerators, but that support was incomplete and not known to be
     used

   - Update and export fsverity_get_digest() so that it's ready for
     overlayfs's upcoming support for fsverity checking of lowerdata

   - Improve the documentation for builtin signature support

   - Fix a bug in the large folio support"

* tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fsverity/linux:
  fsverity: improve documentation for builtin signature support
  fsverity: rework fsverity_get_digest() again
  fsverity: simplify error handling in verify_data_block()
  fsverity: don't use bio_first_page_all() in fsverity_verify_bio()
  fsverity: constify fsverity_hash_alg
  fsverity: use shash API instead of ahash API
2023-06-26 10:56:13 -07:00
Linus Torvalds 4d483ab702 fscrypt updates for 6.5
Just one flex array conversion patch.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQSacvsUNc7UX4ntmEPzXCl4vpKOKwUCZJjnMhQcZWJpZ2dlcnNA
 Z29vZ2xlLmNvbQAKCRDzXCl4vpKOK5UTAP97F43BucDp2X1Yq27phZBGGNMuZBDV
 0wFxpW7IDC1WHQD+N9HQmxv+euDHD0elSf/fHWcsLuvApJA/6NTY8YDuugo=
 =Oy2c
 -----END PGP SIGNATURE-----

Merge tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/linux

Pull fscrypt update from Eric Biggers:
 "Just one flex array conversion patch"

* tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/linux:
  fscrypt: Replace 1-element array with flexible array
2023-06-26 10:54:55 -07:00
Linus Torvalds f7976a6493 NFSD 6.5 Release Notes
Fixes and clean-ups include:
 - Clean-ups in the READ path in anticipation of MSG_SPLICE_PAGES
 - Better NUMA awareness when allocating pages and other objects
 - A number of minor clean-ups to XDR encoding
 - Elimination of a race when accepting a TCP socket
 - Numerous observability enhancements
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmSVpEQACgkQM2qzM29m
 f5flNQ/+N+GKwJWKkDDK9l0/OJWLSOV1DaUKs04FW79Xa05bqB9W3mnkQ2QzXmBq
 9BUUeAnn0bD+yDgsBd4+l5HMjK91Qm+e+ljE4Gn+hA2Kg5VsBPbNImXQTuAdC/mw
 2YWNqaQ6EjHWmzauAOiKqKwDsYefZaiS3+1CEuOlmXeDK979yU7zXHUbFvLjmNiP
 5ATDrxfIsQza+Je0sh3JGeJtFZCt127zOd6vEQCiedbC8yy7n1ldWi/OGWPsg/2H
 z/QWbs9Iw8ExESosBKfX6M13izJV9eZ69ZfvxmLFGcfZmHoOCOtIXzDDqmD6JsPY
 l4SBtWkP+OB4jePM1nEpFU65EQrGtRK/roGGXNtCTaZAcDBk2/jJ9b5gejkSDsaU
 B3alN0UwECeEzQ3whYYAGy4m1FPaQcFfxl1RydcvYEfRiuiYxZb/3EO62dDavv14
 WTYUGHNxso48DGngyO2xRBIda0Kwqc4vgkzOww6SP+Eok/22q/CsKHiuzRgnrLq6
 GYdRX0Zmvl/0lzuBsDOzpOIQg2DuGdc84fjUCxqQu0DBSkehWo4i2eGNHQuHbMUb
 Tl58/tYiFazVOX1aAKabzKm2iaCjKdLySDhTj+dSgTXKx6SqZl1cXvhFl9xImgYj
 uW9i+ALZO9PNvtpMIvGaHqQK6xyI9F6isgKyNpXwy9J5D6ck7GA=
 =1EQc
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux

Pull nfsd updates from Chuck Lever:

 - Clean-ups in the READ path in anticipation of MSG_SPLICE_PAGES

 - Better NUMA awareness when allocating pages and other objects

 - A number of minor clean-ups to XDR encoding

 - Elimination of a race when accepting a TCP socket

 - Numerous observability enhancements

* tag 'nfsd-6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: (46 commits)
  nfsd: remove redundant assignments to variable len
  svcrdma: Fix stale comment
  NFSD: Distinguish per-net namespace initialization
  nfsd: move init of percpu reply_cache_stats counters back to nfsd_init_net
  SUNRPC: Address RCU warning in net/sunrpc/svc.c
  SUNRPC: Use sysfs_emit in place of strlcpy/sprintf
  SUNRPC: Remove transport class dprintk call sites
  SUNRPC: Fix comments for transport class registration
  svcrdma: Remove an unused argument from __svc_rdma_put_rw_ctxt()
  svcrdma: trace cc_release calls
  svcrdma: Convert "might sleep" comment into a code annotation
  NFSD: Add an nfsd4_encode_nfstime4() helper
  SUNRPC: Move initialization of rq_stime
  SUNRPC: Optimize page release in svc_rdma_sendto()
  svcrdma: Prevent page release when nothing was received
  svcrdma: Revert 2a1e4f21d8 ("svcrdma: Normalize Send page handling")
  SUNRPC: Revert 579900670a ("svcrdma: Remove unused sc_pages field")
  SUNRPC: Revert cc93ce9529 ("svcrdma: Retain the page backing rq_res.head[0].iov_base")
  NFSD: add encoding of op_recall flag for write delegation
  NFSD: Add "official" reviewers for this subsystem
  ...
2023-06-26 10:48:57 -07:00
Linus Torvalds c0a572d9d3 v6.5/vfs.mount
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZJU3/AAKCRCRxhvAZXjc
 ov1bAQDT+i3l8jS+r1HGVBuZHz2bPPkrk3ch+xGU9V/iBhrGtAD7BaogZ5OaeJP0
 A1CQecON51Tq79Nw+EgEBLcRJ/xhXAQ=
 =v3md
 -----END PGP SIGNATURE-----

Merge tag 'v6.5/vfs.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs mount updates from Christian Brauner:
 "This contains the work to extend move_mount() to allow adding a mount
  beneath the topmost mount of a mount stack.

  There are two LWN articles about this. One covers the original patch
  series in [1]. The other in [2] summarizes the session and roughly the
  discussion between Al and me at LSFMM. The second article also goes
  into some good questions from attendees.

  Since all details are found in the relevant commit with a technical
  dive into semantics and locking at the end I'm only adding the
  motivation and core functionality for this from commit message and
  leave out the invasive details. The code is also heavily commented and
  annotated as well which was explicitly requested.

  TL;DR:

    > mount -t ext4 /dev/sda /mnt
      |
      └─/mnt    /dev/sda    ext4

    > mount --beneath -t xfs /dev/sdb /mnt
      |
      └─/mnt    /dev/sdb    xfs
        └─/mnt  /dev/sda    ext4

    > umount /mnt
      |
      └─/mnt    /dev/sdb    xfs

  The longer motivation is that various distributions are adding or are
  in the process of adding support for system extensions and in the
  future configuration extensions through various tools. A more detailed
  explanation on system and configuration extensions can be found on the
  manpage which is listed below at [3].

  System extension images may – dynamically at runtime — extend the
  /usr/ and /opt/ directory hierarchies with additional files. This is
  particularly useful on immutable system images where a /usr/ and/or
  /opt/ hierarchy residing on a read-only file system shall be extended
  temporarily at runtime without making any persistent modifications.

  When one or more system extension images are activated, their /usr/
  and /opt/ hierarchies are combined via overlayfs with the same
  hierarchies of the host OS, and the host /usr/ and /opt/ overmounted
  with it ("merging"). When they are deactivated, the mount point is
  disassembled — again revealing the unmodified original host version of
  the hierarchy ("unmerging"). Merging thus makes the extension's
  resources suddenly appear below the /usr/ and /opt/ hierarchies as if
  they were included in the base OS image itself. Unmerging makes them
  disappear again, leaving in place only the files that were shipped
  with the base OS image itself.

  System configuration images are similar but operate on directories
  containing system or service configuration.

  On nearly all modern distributions mount propagation plays a crucial
  role and the rootfs of the OS is a shared mount in a peer group
  (usually with peer group id 1):

     TARGET  SOURCE  FSTYPE  PROPAGATION  MNT_ID  PARENT_ID
     /       /       ext4    shared:1     29      1

  On such systems all services and containers run in a separate mount
  namespace and are pivot_root()ed into their rootfs. A separate mount
  namespace is almost always used as it is the minimal isolation
  mechanism services have. But usually they are even much more isolated
  up to the point where they almost become indistinguishable from
  containers.

  Mount propagation again plays a crucial role here. The rootfs of all
  these services is a slave mount to the peer group of the host rootfs.
  This is done so the service will receive mount propagation events from
  the host when certain files or directories are updated.

  In addition, the rootfs of each service, container, and sandbox is
  also a shared mount in its separate peer group:

     TARGET  SOURCE  FSTYPE  PROPAGATION         MNT_ID  PARENT_ID
     /       /       ext4    shared:24 master:1  71      47

  For people not too familiar with mount propagation, the master:1 means
  that this is a slave mount to peer group 1. Which as one can see is
  the host rootfs as indicated by shared:1 above. The shared:24
  indicates that the service rootfs is a shared mount in a separate peer
  group with peer group id 24.

  A service may run other services. Such nested services will also have
  a rootfs mount that is a slave to the peer group of the outer service
  rootfs mount.

  For containers things are just slighly different. A container's rootfs
  isn't a slave to the service's or host rootfs' peer group. The rootfs
  mount of a container is simply a shared mount in its own peer group:

     TARGET                    SOURCE  FSTYPE  PROPAGATION  MNT_ID  PARENT_ID
     /home/ubuntu/debian-tree  /       ext4    shared:99    61      60

  So whereas services are isolated OS components a container is treated
  like a separate world and mount propagation into it is restricted to a
  single well known mount that is a slave to the peer group of the
  shared mount /run on the host:

     TARGET                  SOURCE              FSTYPE  PROPAGATION  MNT_ID  PARENT_ID
     /propagate/debian-tree  /run/host/incoming  tmpfs   master:5     71      68

  Here, the master:5 indicates that this mount is a slave to the peer
  group with peer group id 5. This allows to propagate mounts into the
  container and served as a workaround for not being able to insert
  mounts into mount namespaces directly. But the new mount api does
  support inserting mounts directly. For the interested reader the
  blogpost in [4] might be worth reading where I explain the old and the
  new approach to inserting mounts into mount namespaces.

  Containers of course, can themselves be run as services. They often
  run full systems themselves which means they again run services and
  containers with the exact same propagation settings explained above.

  The whole system is designed so that it can be easily updated,
  including all services in various fine-grained ways without having to
  enter every single service's mount namespace which would be
  prohibitively expensive. The mount propagation layout has been
  carefully chosen so it is possible to propagate updates for system
  extensions and configurations from the host into all services.

  The simplest model to update the whole system is to mount on top of
  /usr, /opt, or /etc on the host. The new mount on /usr, /opt, or /etc
  will then propagate into every service. This works cleanly the first
  time. However, when the system is updated multiple times it becomes
  necessary to unmount the first update on /opt, /usr, /etc and then
  propagate the new update. But this means, there's an interval where
  the old base system is accessible. This has to be avoided to protect
  against downgrade attacks.

  The vfs already exposes a mechanism to userspace whereby mounts can be
  mounted beneath an existing mount. Such mounts are internally referred
  to as "tucked". The patch series exposes the ability to mount beneath
  a top mount through the new MOVE_MOUNT_BENEATH flag for the
  move_mount() system call. This allows userspace to seamlessly upgrade
  mounts. After this series the only thing that will have changed is
  that mounting beneath an existing mount can be done explicitly instead
  of just implicitly.

  The crux is that the proposed mechanism already exists and that it is
  so powerful as to cover cases where mounts are supposed to be updated
  with new versions. Crucially, it offers an important flexibility.
  Namely that updates to a system may either be forced or can be delayed
  and the umount of the top mount be left to a service if it is a
  cooperative one"

Link: https://lwn.net/Articles/927491 [1]
Link: https://lwn.net/Articles/934094 [2]
Link: https://man7.org/linux/man-pages/man8/systemd-sysext.8.html [3]
Link: https://brauner.io/2023/02/28/mounting-into-mount-namespaces.html [4]
Link: https://github.com/flatcar/sysext-bakery
Link: https://fedoraproject.org/wiki/Changes/Unified_Kernel_Support_Phase_1
Link: https://fedoraproject.org/wiki/Changes/Unified_Kernel_Support_Phase_2
Link: https://github.com/systemd/systemd/pull/26013

* tag 'v6.5/vfs.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  fs: allow to mount beneath top mount
  fs: use a for loop when locking a mount
  fs: properly document __lookup_mnt()
  fs: add path_mounted()
2023-06-26 10:27:04 -07:00
Linus Torvalds 1f2300a738 v6.5/vfs.file
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZJU4WgAKCRCRxhvAZXjc
 oofvAQDs9RJwQUyWHJmQA+tWz5cUE5DviVWCwwul5dQRRCqgaQEA2OIO0gPFaVoq
 1OYOeLyUjl/cpS8e3u4uJtw34jttdQA=
 =AwcR
 -----END PGP SIGNATURE-----

Merge tag 'v6.5/vfs.file' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs file handling updates from Christian Brauner:
 "This contains Amir's work to fix a long-standing problem where an
  unprivileged overlayfs mount can be used to avoid fanotify permission
  events that were requested for an inode or superblock on the
  underlying filesystem.

  Some background about files opened in overlayfs. If a file is opened
  in overlayfs @file->f_path will refer to a "fake" path. What this
  means is that while @file->f_inode will refer to inode of the
  underlying layer, @file->f_path refers to an overlayfs
  {dentry,vfsmount} pair. The reasons for doing this are out of scope
  here but it is the reason why the vfs has been providing the
  open_with_fake_path() helper for overlayfs for very long time now. So
  nothing new here.

  This is for sure not very elegant and everyone including the overlayfs
  maintainers agree. Improving this significantly would involve more
  fragile and potentially rather invasive changes.

  In various codepaths access to the path of the underlying filesystem
  is needed for such hybrid file. The best example is fsnotify where
  this becomes security relevant. Passing the overlayfs
  @file->f_path->dentry will cause fsnotify to skip generating fsnotify
  events registered on the underlying inode or superblock.

  To fix this we extend the vfs provided open_with_fake_path() concept
  for overlayfs to create a backing file container that holds the real
  path and to expose a helper that can be used by relevant callers to
  get access to the path of the underlying filesystem through the new
  file_real_path() helper. This pattern is similar to what we do in
  d_real() and d_real_inode().

  The first beneficiary is fsnotify and fixes the security sensitive
  problem mentioned above.

  There's a couple of nice cleanups included as well.

  Over time, the old open_with_fake_path() helper added specifically for
  overlayfs a long time ago started to get used in other places such as
  cachefiles. Even though cachefiles have nothing to do with hybrid
  files.

  The only reason cachefiles used that concept was that files opened
  with open_with_fake_path() aren't charged against the caller's open
  file limit by raising FMODE_NOACCOUNT. It's just mere coincidence that
  both overlayfs and cachefiles need to ensure to not overcharge the
  caller for their internal open calls.

  So this work disentangles FMODE_NOACCOUNT use cases and backing file
  use-cases by adding the FMODE_BACKING flag which indicates that the
  file can be used to retrieve the backing file of another filesystem.
  (Fyi, Jens will be sending you a really nice cleanup from Christoph
  that gets rid of 3 FMODE_* flags otherwise this would be the last
  fmode_t bit we'd be using.)

  So now overlayfs becomes the sole user of the renamed
  open_with_fake_path() helper which is now named backing_file_open().
  For internal kernel users such as cachefiles that are only interested
  in FMODE_NOACCOUNT but not in FMODE_BACKING we add a new
  kernel_file_open() helper which opens a file without being charged
  against the caller's open file limit. All new helpers are properly
  documented and clearly annotated to mention their special uses.

  We also rename vfs_tmpfile_open() to kernel_tmpfile_open() to clearly
  distinguish it from vfs_tmpfile() and align it the other kernel_*()
  internal helpers"

* tag 'v6.5/vfs.file' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  ovl: enable fsnotify events on underlying real files
  fs: use backing_file container for internal files with "fake" f_path
  fs: move kmem_cache_zalloc() into alloc_empty_file*() helpers
  fs: use a helper for opening kernel internal files
  fs: rename {vfs,kernel}_tmpfile_open()
2023-06-26 10:14:36 -07:00
Linus Torvalds 2eedfa9e27 v6.5/vfs.rename.locking
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZJU4NwAKCRCRxhvAZXjc
 ordqAP0RmZEkUA5p98pK+0FEFIsS2S8qChh6YHQHP+hF606FGgEAivb3UPRm9p58
 kRb5yK0/oXDUxGv7A+x+SIMVbcRyLgw=
 =pi6N
 -----END PGP SIGNATURE-----

Merge tag 'v6.5/vfs.rename.locking' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs rename locking updates from Christian Brauner:
 "This contains the work from Jan to fix problems with cross-directory
  renames originally reported in [1].

  To quickly sum it up some filesystems (so far we know at least about
  ext4, udf, f2fs, ocfs2, likely also reiserfs, gfs2 and others) need to
  lock the directory when it is being renamed into another directory.

  This is because we need to update the parent pointer in the directory
  in that case and if that races with other operations on the directory,
  in particular a conversion from one directory format into another, bad
  things can happen.

  So far we've done the locking in the filesystem code but recently
  Darrick pointed out in [2] that the RENAME_EXCHANGE case was missing.
  That one is particularly nasty because RENAME_EXCHANGE can arbitrarily
  mix regular files and directories and proper lock ordering is not
  achievable in the filesystems alone.

  This patch set adds locking into vfs_rename() so that not only parent
  directories but also moved inodes, regardless of whether they are
  directories or not, are locked when calling into the filesystem.

  This means establishing a locking order for unrelated directories. New
  helpers are added for this purpose and our documentation is updated to
  cover this in detail.

  The locking is now actually easier to follow as we now always lock
  source and target. We've always locked the target independent of
  whether it was a directory or file and we've always locked source if
  it was a regular file. The exact details for why this came about can
  be found in [3] and [4]"

Link: https://lore.kernel.org/all/20230117123735.un7wbamlbdihninm@quack3 [1]
Link: https://lore.kernel.org/all/20230517045836.GA11594@frogsfrogsfrogs [2]
Link: https://lore.kernel.org/all/20230526-schrebergarten-vortag-9cd89694517e@brauner [3]
Link: https://lore.kernel.org/all/20230530-seenotrettung-allrad-44f4b00139d4@brauner [4]

* tag 'v6.5/vfs.rename.locking' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  fs: Restrict lock_two_nondirectories() to non-directory inodes
  fs: Lock moved directories
  fs: Establish locking order for unrelated directories
  Revert "f2fs: fix potential corruption when moving a directory"
  Revert "udf: Protect rename against modification of moved directory"
  ext4: Remove ext4 locking of moved directory
2023-06-26 10:01:26 -07:00
Linus Torvalds 64bf6ae93e v6.5/vfs.misc
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZJU4SwAKCRCRxhvAZXjc
 ojOTAP9gT/z1gasIf8OwDHb4inZGnVpHh2ApKLvgMXH6ICtwRgD+OBtOcf438Lx1
 cpFSTVJlh21QXMOOXWHe/LRUV2kZ5wI=
 =zdfx
 -----END PGP SIGNATURE-----

Merge tag 'v6.5/vfs.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull misc vfs updates from Christian Brauner:
 "Miscellaneous features, cleanups, and fixes for vfs and individual fs

  Features:

   - Use mode 0600 for file created by cachefilesd so it can be run by
     unprivileged users. This aligns them with directories which are
     already created with mode 0700 by cachefilesd

   - Reorder a few members in struct file to prevent some false sharing
     scenarios

   - Indicate that an eventfd is used a semaphore in the eventfd's
     fdinfo procfs file

   - Add a missing uapi header for eventfd exposing relevant uapi
     defines

   - Let the VFS protect transitions of a superblock from read-only to
     read-write in addition to the protection it already provides for
     transitions from read-write to read-only. Protecting read-only to
     read-write transitions allows filesystems such as ext4 to perform
     internal writes, keeping writers away until the transition is
     completed

  Cleanups:

   - Arnd removed the architecture specific arch_report_meminfo()
     prototypes and added a generic one into procfs.h. Note, we got a
     report about a warning in amdpgpu codepaths that suggested this was
     bisectable to this change but we concluded it was a false positive

   - Remove unused parameters from split_fs_names()

   - Rename put_and_unmap_page() to unmap_and_put_page() to let the name
     reflect the order of the cleanup operation that has to unmap before
     the actual put

   - Unexport buffer_check_dirty_writeback() as it is not used outside
     of block device aops

   - Stop allocating aio rings from highmem

   - Protecting read-{only,write} transitions in the VFS used open-coded
     barriers in various places. Replace them with proper little helpers
     and document both the helpers and all barrier interactions involved
     when transitioning between read-{only,write} states

   - Use flexible array members in old readdir codepaths

  Fixes:

   - Use the correct type __poll_t for epoll and eventfd

   - Replace all deprecated strlcpy() invocations, whose return value
     isn't checked with an equivalent strscpy() call

   - Fix some kernel-doc warnings in fs/open.c

   - Reduce the stack usage in jffs2's xattr codepaths finally getting
     rid of this: fs/jffs2/xattr.c:887:1: error: the frame size of 1088
     bytes is larger than 1024 bytes [-Werror=frame-larger-than=]
     royally annoying compilation warning

   - Use __FMODE_NONOTIFY instead of FMODE_NONOTIFY where an int and not
     fmode_t is required to avoid fmode_t to integer degradation
     warnings

   - Create coredumps with O_WRONLY instead of O_RDWR. There's a long
     explanation in that commit how O_RDWR is actually a bug which we
     found out with the help of Linus and git archeology

   - Fix "no previous prototype" warnings in the pipe codepaths

   - Add overflow calculations for remap_verify_area() as a signed
     addition overflow could be triggered in xfstests

   - Fix a null pointer dereference in sysv

   - Use an unsigned variable for length calculations in jfs avoiding
     compilation warnings with gcc 13

   - Fix a dangling pipe pointer in the watch queue codepath

   - The legacy mount option parser provided as a fallback by the VFS
     for filesystems not yet converted to the new mount api did prefix
     the generated mount option string with a leading ',' causing issues
     for some filesystems

   - Fix a repeated word in a comment in fs.h

   - autofs: Update the ctime when mtime is updated as mandated by
     POSIX"

* tag 'v6.5/vfs.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (27 commits)
  readdir: Replace one-element arrays with flexible-array members
  fs: Provide helpers for manipulating sb->s_readonly_remount
  fs: Protect reconfiguration of sb read-write from racing writes
  eventfd: add a uapi header for eventfd userspace APIs
  autofs: set ctime as well when mtime changes on a dir
  eventfd: show the EFD_SEMAPHORE flag in fdinfo
  fs/aio: Stop allocating aio rings from HIGHMEM
  fs: Fix comment typo
  fs: unexport buffer_check_dirty_writeback
  fs: avoid empty option when generating legacy mount string
  watch_queue: prevent dangling pipe pointer
  fs.h: Optimize file struct to prevent false sharing
  highmem: Rename put_and_unmap_page() to unmap_and_put_page()
  cachefiles: Allow the cache to be non-root
  init: remove unused names parameter in split_fs_names()
  jfs: Use unsigned variable for length calculations
  fs/sysv: Null check to prevent null-ptr-deref bug
  fs: use UB-safe check for signed addition overflow in remap_verify_area
  procfs: consolidate arch_report_meminfo declaration
  fs: pipe: reveal missing function protoypes
  ...
2023-06-26 09:50:21 -07:00
Linus Torvalds 5c1c88cddb v6.5/fs.ntfs
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZJU4DAAKCRCRxhvAZXjc
 oi8dAP9Rxv13trPMEFB1aJ3YHIMYRKSlG81vNJo+5ur43RFc4gD+NJBBYSb7bMYu
 V8uIKnyFfMtLtHZA9M00Xqipq/sLDA4=
 =C0YR
 -----END PGP SIGNATURE-----

Merge tag 'v6.5/fs.ntfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull ntfs updates from Christian Brauner:
 "A pile of various smaller fixes for ntfs"

* tag 'v6.5/fs.ntfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  ntfs: do not dereference a null ctx on error
  ntfs: Remove unneeded semicolon
  ntfs: Correct spelling
  ntfs: remove redundant initialization to pointer cb_sb_start
2023-06-26 09:47:39 -07:00
Jane Chu 1ea7ca1b09 dax: enable dax fault handler to report VM_FAULT_HWPOISON
When multiple processes mmap() a dax file, then at some point,
a process issues a 'load' and consumes a hwpoison, the process
receives a SIGBUS with si_code = BUS_MCEERR_AR and with si_lsb
set for the poison scope. Soon after, any other process issues
a 'load' to the poisoned page (that is unmapped from the kernel
side by memory_failure), it receives a SIGBUS with
si_code = BUS_ADRERR and without valid si_lsb.

This is confusing to user, and is different from page fault due
to poison in RAM memory, also some helpful information is lost.

Channel dax backend driver's poison detection to the filesystem
such that instead of reporting VM_FAULT_SIGBUS, it could report
VM_FAULT_HWPOISON.

If user level block IO syscalls fail due to poison, the errno will
be converted to EIO to maintain block API consistency.

Signed-off-by: Jane Chu <jane.chu@oracle.com>
Link: https://lore.kernel.org/r/20230615181325.1327259-2-jane.chu@oracle.com
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
2023-06-26 07:54:23 -06:00
Yunlei He cf2423a755 f2fs: remove unneeded page uptodate check/set
This patch remove unneeded page uptodate check/set in
f2fs_vm_page_mkwrite, which already done in set_page_dirty.

Signed-off-by: Yunlei He <heyunlei@oppo.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2023-06-26 06:21:42 -07:00
Yunlei He 396d0a2883 f2fs: update mtime and ctime in move file range method
Mtime and ctime stay old value without update after move
file range ioctl. This patch add time update.

Signed-off-by: Yunlei He <heyunlei@oppo.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2023-06-26 06:21:42 -07:00
Jaegeuk Kim 2724daf6c2 f2fs: compress tmp files given extension
Let's compress tmp files for the given extension list.

This patch does not change the previous behavior, but allow the cases as below.

Extention example: "ext"

- abc.ext : allow
- abc.ext.abc : allow
- abc.extm : not allow

Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2023-06-26 06:21:41 -07:00
Yangtao Li 6201c478de f2fs: refactor struct f2fs_attr macro
This patch provides a large number of variants of F2FS_RW_ATTR
and F2FS_RO_ATTR macros, reducing the number of parameters required
to initialize the f2fs_attr structure.

Reported-by: kernel test robot <lkp@intel.com>
Link: https://lore.kernel.org/oe-kbuild-all/202304152234.wjaY3IYm-lkp@intel.com/
Signed-off-by: Yangtao Li <frank.li@vivo.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2023-06-26 06:21:41 -07:00
Yangtao Li c3355ea9d8 f2fs: convert to use sbi directly
F2FS_I_SB(inode) is redundant.

Signed-off-by: Yangtao Li <frank.li@vivo.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2023-06-26 06:21:41 -07:00
Colin Ian King 3f8ac7da8c f2fs: remove redundant assignment to variable err
The assignment to variable err is redundant since the code jumps to
label next and err is then re-assigned a new value on the call to
sanity_check_node_chain. Remove the assignment.

Cleans up clang scan build warning:
fs/f2fs/recovery.c:464:6: warning: Value stored to 'err' is never read [deadcode.DeadStores]

Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2023-06-26 06:21:41 -07:00
Jaegeuk Kim 9ac00e7cef f2fs: do not issue small discard commands during checkpoint
If there're huge # of small discards, this will increase checkpoint latency
insanely. Let's issue small discards only by trim.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2023-06-26 06:21:41 -07:00
Daeho Jeong c9667b19e2 f2fs: check zone write pointer points to the end of zone
We don't need to report an issue, when the zone write pointer already
points to the end of the zone, since the zone mismatch is already taken
care.

Signed-off-by: Daeho Jeong <daehojeong@google.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2023-06-26 06:07:10 -07:00
Sheng Yong ac1ee161de f2fs: add f2fs_ioc_get_compress_blocks
This patch adds f2fs_ioc_get_compress_blocks() to provide a common
f2fs_get_compress_blocks().

Signed-off-by: Sheng Yong <shengyong@oppo.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2023-06-26 06:07:10 -07:00
Sheng Yong dde38c03b3 f2fs: cleanup MIN_INLINE_XATTR_SIZE
Signed-off-by: Sheng Yong <shengyong@oppo.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2023-06-26 06:07:10 -07:00
Sheng Yong c571fbb5b5 f2fs: add helper to check compression level
This patch adds a helper function to check if compression level is
valid.

Signed-off-by: Sheng Yong <shengyong@oppo.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2023-06-26 06:07:09 -07:00
Christoph Hellwig 94c8431fb4 f2fs: set FMODE_CAN_ODIRECT instead of a dummy direct_IO method
Since commit a2ad63daa8 ("VFS: add FMODE_CAN_ODIRECT file flag") file
systems can just set the FMODE_CAN_ODIRECT flag at open time instead of
wiring up a dummy direct_IO method to indicate support for direct I/O.

Do that for f2fs so that noop_direct_IO can eventually be removed.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2023-06-26 06:07:09 -07:00
Chao Yu f240d3aaf5 f2fs: do more sanity check on inode
There are several issues in sanity_check_inode():
- The code looks not clean, it checks extra_attr related condition
dispersively.
- It missed to check i_extra_isize w/ lower boundary
- It missed to check feature dependency: prjquota, inode_chksum,
inode_crtime, compression features rely on extra_attr feature.
- It's not necessary to check i_extra_isize due to it will only
be assigned to non-zero value if f2fs_has_extra_attr() is true
in do_read_inode().

Fix them all in this patch.

Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2023-06-26 06:07:09 -07:00
Chao Yu 64ee9163fe f2fs: compress: fix to check validity of i_compress_flag field
The last valid compress related field is i_compress_flag, check its
validity instead of i_log_cluster_size.

Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2023-06-26 06:07:09 -07:00
Yangtao Li 698a5c8c8e f2fs: add sanity compress level check for compressed file
Commit 3fde13f817 ("f2fs: compress: support compress level")
forgot to do basic compress level check, let's add it.

Signed-off-by: Yangtao Li <frank.li@vivo.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2023-06-26 06:07:09 -07:00
Jaegeuk Kim 00e120b5e4 f2fs: assign default compression level
Let's avoid any confusion from assigning compress_level=0 for LZ4HC and ZSTD.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2023-06-26 06:07:09 -07:00
Chao Yu ccf3ff2b30 f2fs: introduce F2FS_QUOTA_DEFAULT_FL for cleanup
This patch adds F2FS_QUOTA_DEFAULT_FL to include two default flags:
F2FS_NOATIME_FL and F2FS_IMMUTABLE_FL, and use it to clean up codes.

Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2023-06-26 06:07:08 -07:00
Chao Yu 8bec7dd1b3 f2fs: check return value of freeze_super()
freeze_super() can fail, it needs to check its return value and do
error handling in f2fs_resize_fs().

Fixes: 04f0b2eaa3 ("f2fs: ioctl for removing a range from F2FS")
Fixes: b4b10061ef ("f2fs: refactor resize_fs to avoid meta updates in progress")
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2023-06-26 06:07:08 -07:00
Gustavo A. R. Silva 11d5e2061e ksmbd: Replace one-element array with flexible-array member
One-element arrays are deprecated, and we are replacing them with flexible
array members instead. So, replace one-element array with flexible-array
member in struct smb_negotiate_req.

This results in no differences in binary output.

Link: https://github.com/KSPP/linux/issues/79
Link: https://github.com/KSPP/linux/issues/317
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-06-26 00:07:04 -05:00
Gustavo A. R. Silva 5211cc8727 ksmbd: Use struct_size() helper in ksmbd_negotiate_smb_dialect()
Prefer struct_size() over open-coded versions.

Link: https://github.com/KSPP/linux/issues/160
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-06-26 00:07:04 -05:00
Namjae Jeon 7b7d709ef7 ksmbd: add missing compound request handing in some commands
This patch add the compound request handling to the some commands.
Existing clients do not send these commands as compound requests,
but ksmbd should consider that they may come.

Cc: stable@vger.kernel.org
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-06-26 00:07:04 -05:00
Namjae Jeon 98422bdd4c ksmbd: fix out of bounds read in smb2_sess_setup
ksmbd does not consider the case of that smb2 session setup is
in compound request. If this is the second payload of the compound,
OOB read issue occurs while processing the first payload in
the smb2_sess_setup().

Cc: stable@vger.kernel.org
Reported-by: zdi-disclosures@trendmicro.com # ZDI-CAN-21355
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-06-26 00:07:04 -05:00
Lu Hongfei f65fadb042 ksmbd: Replace the ternary conditional operator with min()
It would be better to replace the traditional ternary conditional
operator with min() in compare_sids.

Signed-off-by: Lu Hongfei <luhongfei@vivo.com>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-06-26 00:07:04 -05:00
Namjae Jeon 81a94b2784 ksmbd: use kvzalloc instead of kvmalloc
Use kvzalloc instead of kvmalloc.

Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-06-26 00:07:04 -05:00
Lu Hongfei ccb5889af9 ksmbd: Change the return value of ksmbd_vfs_query_maximal_access to void
The return value of ksmbd_vfs_query_maximal_access is meaningless,
it is better to modify it to void.

Signed-off-by: Lu Hongfei <luhongfei@vivo.com>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-06-26 00:07:04 -05:00
Namjae Jeon cf5e7f734f ksmbd: return a literal instead of 'err' in ksmbd_vfs_kern_path_locked()
Return a literal instead of 'err' in ksmbd_vfs_kern_path_locked().

Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-06-26 00:07:04 -05:00
Namjae Jeon f87d4f85f4 ksmbd: use kzalloc() instead of __GFP_ZERO
Use kzalloc() instead of __GFP_ZERO.

Reported-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-06-26 00:07:04 -05:00
Namjae Jeon 7bd9f0876f ksmbd: remove unused ksmbd_tree_conn_share function
Remove unused ksmbd_tree_conn_share function.

Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-06-26 00:07:04 -05:00
David Howells dc97391e66 sock: Remove ->sendpage*() in favour of sendmsg(MSG_SPLICE_PAGES)
Remove ->sendpage() and ->sendpage_locked().  sendmsg() with
MSG_SPLICE_PAGES should be used instead.  This allows multiple pages and
multipage folios to be passed through.

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Marc Kleine-Budde <mkl@pengutronix.de> # for net/can
cc: Jens Axboe <axboe@kernel.dk>
cc: Matthew Wilcox <willy@infradead.org>
cc: linux-afs@lists.infradead.org
cc: mptcp@lists.linux.dev
cc: rds-devel@oss.oracle.com
cc: tipc-discussion@lists.sourceforge.net
cc: virtualization@lists.linux-foundation.org
Link: https://lore.kernel.org/r/20230623225513.2732256-16-dhowells@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-24 15:50:13 -07:00
David Howells e52828cc01 ocfs2: Use sendmsg(MSG_SPLICE_PAGES) rather than sendpage()
Switch ocfs2 from using sendpage() to using sendmsg() + MSG_SPLICE_PAGES so
that sendpage can be phased out.

Signed-off-by: David Howells <dhowells@redhat.com>

cc: Mark Fasheh <mark@fasheh.com>
cc: Joel Becker <jlbec@evilplan.org>
cc: Joseph Qi <joseph.qi@linux.alibaba.com>
cc: ocfs2-devel@oss.oracle.com
Link: https://lore.kernel.org/r/20230623225513.2732256-15-dhowells@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-24 15:50:13 -07:00
David Howells 86d7bd6e66 ocfs2: Fix use of slab data with sendpage
ocfs2 uses kzalloc() to allocate buffers for o2net_hand, o2net_keep_req and
o2net_keep_resp and then passes these to sendpage.  This isn't really
allowed as the lifetime of slab objects is not controlled by page ref -
though in this case it will probably work.  sendmsg() with MSG_SPLICE_PAGES
will, however, print a warning and give an error.

Fix it to use folio_alloc() instead to allocate a buffer for the handshake
message, keepalive request and reply messages.

Fixes: 98211489d4 ("[PATCH] OCFS2: The Second Oracle Cluster Filesystem")
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Mark Fasheh <mark@fasheh.com>
cc: Kurt Hackel <kurt.hackel@oracle.com>
cc: Joel Becker <jlbec@evilplan.org>
cc: Joseph Qi <joseph.qi@linux.alibaba.com>
cc: ocfs2-devel@oss.oracle.com
Link: https://lore.kernel.org/r/20230623225513.2732256-14-dhowells@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-24 15:50:13 -07:00
David Howells a1a5e87527 dlm: Use sendmsg(MSG_SPLICE_PAGES) rather than sendpage
When transmitting data, call down a layer using a single sendmsg with
MSG_SPLICE_PAGES to indicate that content should be spliced rather using
sendpage.  This allows ->sendpage() to be replaced by something that can
handle multiple multipage folios in a single transaction.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Christine Caulfield <ccaulfie@redhat.com>
cc: David Teigland <teigland@redhat.com>
cc: Jens Axboe <axboe@kernel.dk>
cc: Matthew Wilcox <willy@infradead.org>
cc: cluster-devel@redhat.com
Link: https://lore.kernel.org/r/20230623225513.2732256-7-dhowells@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-24 15:50:12 -07:00
Linus Torvalds f313c51d26 execve: expand new process stack manually ahead of time
This is a small step towards a model where GUP itself would not expand
the stack, and any user that needs GUP to not look up existing mappings,
but actually expand on them, would have to do so manually before-hand,
and with the mm lock held for writing.

It turns out that execve() already did almost exactly that, except it
didn't take the mm lock at all (it's single-threaded so no locking
technically needed, but it could cause lockdep errors).  And it only did
it for the CONFIG_STACK_GROWSUP case, since in that case GUP has
obviously never expanded the stack downwards.

So just make that CONFIG_STACK_GROWSUP case do the right thing with
locking, and enable it generally.  This will eventually help GUP, and in
the meantime avoids a special case and the lockdep issue.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-06-24 14:13:55 -07:00
Liam R. Howlett f440fa1ac9 mm: make find_extend_vma() fail if write lock not held
Make calls to extend_vma() and find_extend_vma() fail if the write lock
is required.

To avoid making this a flag-day event, this still allows the old
read-locking case for the trivial situations, and passes in a flag to
say "is it write-locked".  That way write-lockers can say "yes, I'm
being careful", and legacy users will continue to work in all the common
cases until they have been fully converted to the new world order.

Co-Developed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-06-24 14:13:54 -07:00
Colin Ian King 7982f97560 ocfs2: remove redundant assignment to variable bit_off
Variable bit_off is being assigned a value that is never read, it is being
re-assigned a new value in the following while loop.  Remove the
assignment.  Cleans up clang scan build warning:

fs/ocfs2/localalloc.c:976:18: warning: Although the value stored to
'bit_off' is used in the enclosing expression, the value is never
actually read from 'bit_off' [deadcode.DeadStores]

Link: https://lkml.kernel.org/r/20230622102736.2831126-1-colin.i.king@gmail.com
Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-23 17:04:04 -07:00
lipeifeng 341d51c886 mm: nommu: correct the range of mmap_sem_read_lock in task_mem()
During the seq_printf,the mmap_sem_read_lock protection is not
required.

Link: https://lkml.kernel.org/r/20230622040152.1173-1-lipeifeng@oppo.com
Signed-off-by: lipeifeng <lipeifeng@oppo.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-23 16:59:32 -07:00
Mike Kravetz fd4aed8d98 hugetlb: revert use of page_cache_next_miss()
Ackerley Tng reported an issue with hugetlbfs fallocate as noted in the
Closes tag.  The issue showed up after the conversion of hugetlb page
cache lookup code to use page_cache_next_miss.  User visible effects are:

- hugetlbfs fallocate incorrectly returns -EEXIST if pages are presnet
  in the file.
- hugetlb pages will not be included in core dumps if they need to be
  brought in via GUP.
- userfaultfd UFFDIO_COPY will not notice pages already present in the
  cache.  It may try to allocate a new page and potentially return
  ENOMEM as opposed to EEXIST.

Revert the use page_cache_next_miss() in hugetlb code.

IMPORTANT NOTE FOR STABLE BACKPORTS:
This patch will apply cleanly to v6.3.  However, due to the change of
filemap_get_folio() return values, it will not function correctly.  This
patch must be modified for stable backports.

[dan.carpenter@linaro.org: fix hugetlbfs_pagecache_present()]
  Link: https://lkml.kernel.org/r/efa86091-6a2c-4064-8f55-9b44e1313015@moroto.mountain
Link: https://lkml.kernel.org/r/20230621212403.174710-2-mike.kravetz@oracle.com
Fixes: d0ce0e47b3 ("mm/hugetlb: convert hugetlb fault paths to use alloc_hugetlb_folio()")
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Reported-by: Ackerley Tng <ackerleytng@google.com>
Closes: https://lore.kernel.org/linux-mm/cover.1683069252.git.ackerleytng@google.com
Reviewed-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Cc: Erdem Aktas <erdemaktas@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Vishal Annapurve <vannapurve@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-23 16:59:32 -07:00
Matthew Wilcox (Oracle) f5f288a023 afs: convert pagevec to folio_batch in afs_extend_writeback()
Patch series "Remove pagevecs".

Removes a folio->page->folio conversion for each folio that's involved. 
More importantly, removes one of the last few uses of a pagevec.

Link: https://lkml.kernel.org/r/20230621164557.3510324-1-willy@infradead.org
Link: https://lkml.kernel.org/r/20230621164557.3510324-2-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-23 16:59:28 -07:00
Andrew Morton 63773d2b59 Merge mm-hotfixes-stable into mm-stable to pick up depended-upon changes. 2023-06-23 16:58:19 -07:00
Linus Torvalds 569fa9392d for-6.4-rc7-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmSUZ6QACgkQxWXV+ddt
 WDvtmRAAnFx3qcM68Enoo7yAWzKoNwx7Kk/CvApeWZdx7JKAH/eL6LpI5NPXukgk
 RII2S8/ijWRf83ZBVVGaCk6c5y2pWd4So1OjFbPoB1T89R0qbcOjn+PXPAX8p3Uz
 /kRHhEHJBQTt/WQXTT1m3IPpUxBfgb8B5Cu9Fx1oEUaH6Mk0cZr9OZEQuhsVyvSu
 TIrLPLSlPjl2gXXpo6teVskYM2erbreP44/GjVQUZWlHzfuRRyHYWNBpc3Ba8kZ5
 qdqnynw14+TRwwqjHOmTx2Vm3NAFCxxaSpODoWW8L1cUo1gt8+PMcQ0QLWpds3ye
 Gs0OVumr9iO4sCkJWiX43shnrKH4d8FOGt0Of30sn3slVeDOkQU0ihFXUG7ZG4dS
 /5qIsAvJxlsOSwdwolcVRGsvd6NLQTAKijJtZqv9w3lo0nPqMRIu0HbdbGqXPFWP
 W8xwxRWD5g6/7KXrJJH4twXsT7O3T+Xl6VTJ7yU+k8+LPvMGJAT6bY8FT5S+CBcu
 wxh7su2jkN7ut2c9U/DydvF7tY4uvCZWF3M6MzgtZ4MjJmiT3HeN5do0Uup42/9h
 2c6hjUX1utTt4sozrWoUjUYeHB9eMjB3Wq5J0knYiMjB6GzvlAe78Bzy+XraMi3U
 IO4r07CatGUb6fChqhnBlAkv9aboy/2mTlL1lFnWqQg5F7b71IU=
 =jy/9
 -----END PGP SIGNATURE-----

Merge tag 'for-6.4-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fix from David Sterba:
 "Unfortunately the recent u32 overflow fix was not complete, there was
  one conversion left, assertion not triggered by my tests but caught by
  Qu's fstests case.

  The "cleanup for later" has been promoted to a proper fix and wraps
  all uses of the stripe left shift so the diffstat has grown but leaves
  no potentially problematic uses.

  We should have done it that way before, sorry"

* tag 'for-6.4-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: fix remaining u32 overflows when left shifting stripe_nr
2023-06-23 16:09:53 -07:00
Baruch Siach aa88054b70 binfmt_elf: fix comment typo s/reset/regset/
Signed-off-by: Baruch Siach <baruch@tkos.co.il>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/0b2967c4a4141875c493e835d5a6f8f2d19ae2d6.1687499804.git.baruch@tkos.co.il
2023-06-23 09:36:30 -07:00
Jan Kara a42fb5a75c ext4: Fix warning in blkdev_put()
ext4_blkdev_remove() passes a wrong holder pointer to blkdev_put() which
triggers a warning there. Fix it.

Fixes: 2736e8eeb0 ("block: use the holder as indication for exclusive opens")
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230622165107.13687-1-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-23 08:14:41 -06:00
Immad Mir 95e2b352c0 FS: JFS: Check for read-only mounted filesystem in txBegin
This patch adds a check for read-only mounted filesystem
 in txBegin before starting a transaction potentially saving
 from NULL pointer deref.

Signed-off-by: Immad Mir <mirimmad17@gmail.com>
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
2023-06-23 09:02:33 -05:00
Immad Mir 47cfdc338d FS: JFS: Fix null-ptr-deref Read in txBegin
Syzkaller reported an issue where txBegin may be called
 on a superblock in a read-only mounted filesystem which leads
 to NULL pointer deref. This could be solved by checking if
 the filesystem is read-only before calling txBegin, and returning
 with appropiate error code.

Reported-By: syzbot+f1faa20eec55e0c8644c@syzkaller.appspotmail.com
Link: https://syzkaller.appspot.com/bug?id=be7e52c50c5182cc09a09ea6fc456446b2039de3

Signed-off-by: Immad Mir <mirimmad17@gmail.com>
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
2023-06-23 09:00:13 -05:00
Colin Ian King dd0c64258a fsdax: remove redundant variable 'error'
The variable 'error' is being assigned a value that is never read,
the assignment and the variable and redundant and can be removed.
Cleans up clang scan build warning:

fs/dax.c:1880:10: warning: Although the value stored to 'error' is
used in the enclosing expression, the value is never actually read
from 'error' [deadcode.DeadStores]

Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Link: https://lore.kernel.org/r/20230621130256.2676126-1-colin.i.king@gmail.com
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
2023-06-23 01:31:31 -06:00
Jakub Kicinski a7384f3918 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR.

Conflicts:

tools/testing/selftests/net/fcnal-test.sh
  d7a2fc1437 ("selftests: net: fcnal-test: check if FIPS mode is enabled")
  dd017c72dd ("selftests: fcnal: Test SO_DONTROUTE on TCP sockets.")
https://lore.kernel.org/all/5007b52c-dd16-dbf6-8d64-b9701bfa498b@tessares.net/
https://lore.kernel.org/all/20230619105427.4a0df9b3@canb.auug.org.au/

No adjacent changes.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-22 18:40:38 -07:00
Yogesh 4e302336d5 fs: jfs: Fix UBSAN: array-index-out-of-bounds in dbAllocDmapLev
Syzkaller reported the following issue:

UBSAN: array-index-out-of-bounds in fs/jfs/jfs_dmap.c:1965:6
index -84 is out of range for type 's8[341]' (aka 'signed char[341]')
CPU: 1 PID: 4995 Comm: syz-executor146 Not tainted 6.4.0-rc6-syzkaller-00037-gb6dad5178cea #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/27/2023
Call Trace:
 <TASK>
 __dump_stack lib/dump_stack.c:88 [inline]
 dump_stack_lvl+0x1e7/0x2d0 lib/dump_stack.c:106
 ubsan_epilogue lib/ubsan.c:217 [inline]
 __ubsan_handle_out_of_bounds+0x11c/0x150 lib/ubsan.c:348
 dbAllocDmapLev+0x3e5/0x430 fs/jfs/jfs_dmap.c:1965
 dbAllocCtl+0x113/0x920 fs/jfs/jfs_dmap.c:1809
 dbAllocAG+0x28f/0x10b0 fs/jfs/jfs_dmap.c:1350
 dbAlloc+0x658/0xca0 fs/jfs/jfs_dmap.c:874
 dtSplitUp fs/jfs/jfs_dtree.c:974 [inline]
 dtInsert+0xda7/0x6b00 fs/jfs/jfs_dtree.c:863
 jfs_create+0x7b6/0xbb0 fs/jfs/namei.c:137
 lookup_open fs/namei.c:3492 [inline]
 open_last_lookups fs/namei.c:3560 [inline]
 path_openat+0x13df/0x3170 fs/namei.c:3788
 do_filp_open+0x234/0x490 fs/namei.c:3818
 do_sys_openat2+0x13f/0x500 fs/open.c:1356
 do_sys_open fs/open.c:1372 [inline]
 __do_sys_openat fs/open.c:1388 [inline]
 __se_sys_openat fs/open.c:1383 [inline]
 __x64_sys_openat+0x247/0x290 fs/open.c:1383
 do_syscall_x64 arch/x86/entry/common.c:50 [inline]
 do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80
 entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7f1f4e33f7e9
Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 51 14 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 c0 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007ffc21129578 EFLAGS: 00000246 ORIG_RAX: 0000000000000101
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f1f4e33f7e9
RDX: 000000000000275a RSI: 0000000020000040 RDI: 00000000ffffff9c
RBP: 00007f1f4e2ff080 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00007f1f4e2ff110
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
 </TASK>

The bug occurs when the dbAllocDmapLev()function attempts to access
dp->tree.stree[leafidx + LEAFIND] while the leafidx value is negative.

To rectify this, the patch introduces a safeguard within the
dbAllocDmapLev() function. A check has been added to verify if leafidx is
negative. If it is, the function immediately returns an I/O error, preventing
any further execution that could potentially cause harm.

Tested via syzbot.

Reported-by: syzbot+853a6f4dfa3cf37d3aea@syzkaller.appspotmail.com
Link: https://syzkaller.appspot.com/bug?extid=ae2f5a27a07ae44b0f17
Signed-off-by: Yogesh <yogi.kernel@gmail.com>
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
2023-06-22 10:58:00 -05:00
Qu Wenruo cb091225a5 btrfs: fix remaining u32 overflows when left shifting stripe_nr
There was regression caused by a97699d1d6 ("btrfs: replace
map_lookup->stripe_len by BTRFS_STRIPE_LEN") and supposedly fixed by
a7299a18a1 ("btrfs: fix u32 overflows when left shifting stripe_nr").
To avoid code churn the fix was open coding the type casts but
unfortunately missed one which was still possible to hit [1].

The missing place was assignment of bioc->full_stripe_logical inside
btrfs_map_block().

Fix it by adding a helper that does the safe calculation of the offset
and use it everywhere even though it may not be strictly necessary due
to already using u64 types.  This replaces all remaining
"<< BTRFS_STRIPE_LEN_SHIFT" calls.

[1] https://lore.kernel.org/linux-btrfs/20230622065438.86402-1-wqu@suse.com/

Fixes: a7299a18a1 ("btrfs: fix u32 overflows when left shifting stripe_nr")
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-22 17:03:55 +02:00
Gao Xiang 8241fdd3cd erofs: clean up zmap.c
Several trivial cleanups which aren't quite necessary to split:

 - Rename lcluster load functions as well as justify full indexes
   since they are typically used for global deduplication for
   compressed data;

 - Avoid unnecessary lines, comments for simplicity.

No logic changes.

Reviewed-by: Guo Xuenan <guoxuenan@huaweicloud.com>
Reviewed-by: Yue Hu <huyue2@coolpad.com>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20230615064421.103178-1-hsiangkao@linux.alibaba.com
2023-06-22 21:16:34 +08:00
Yangtao Li 1990595547 erofs: remove unnecessary goto
It's redundant, let's remove it.

Signed-off-by: Yangtao Li <frank.li@vivo.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Link: https://lore.kernel.org/r/20230615034539.14286-1-frank.li@vivo.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2023-06-22 21:16:34 +08:00
Sandeep Dhavale 12d0a24afd erofs: Fix detection of atomic context
Current check for atomic context is not sufficient as
z_erofs_decompressqueue_endio can be called under rcu lock
from blk_mq_flush_plug_list(). See the stacktrace [1]

In such case we should hand off the decompression work for async
processing rather than trying to do sync decompression in current
context. Patch fixes the detection by checking for
rcu_read_lock_any_held() and while at it use more appropriate
!in_task() check than in_atomic().

Background: Historically erofs would always schedule a kworker for
decompression which would incur the scheduling cost regardless of
the context. But z_erofs_decompressqueue_endio() may not always
be in atomic context and we could actually benefit from doing the
decompression in z_erofs_decompressqueue_endio() if we are in
thread context, for example when running with dm-verity.
This optimization was later added in patch [2] which has shown
improvement in performance benchmarks.

==============================================
[1] Problem stacktrace
[name:core&]BUG: sleeping function called from invalid context at kernel/locking/mutex.c:291
[name:core&]in_atomic(): 0, irqs_disabled(): 0, non_block: 0, pid: 1615, name: CpuMonitorServi
[name:core&]preempt_count: 0, expected: 0
[name:core&]RCU nest depth: 1, expected: 0
CPU: 7 PID: 1615 Comm: CpuMonitorServi Tainted: G S      W  OE      6.1.25-android14-5-maybe-dirty-mainline #1
Hardware name: MT6897 (DT)
Call trace:
 dump_backtrace+0x108/0x15c
 show_stack+0x20/0x30
 dump_stack_lvl+0x6c/0x8c
 dump_stack+0x20/0x48
 __might_resched+0x1fc/0x308
 __might_sleep+0x50/0x88
 mutex_lock+0x2c/0x110
 z_erofs_decompress_queue+0x11c/0xc10
 z_erofs_decompress_kickoff+0x110/0x1a4
 z_erofs_decompressqueue_endio+0x154/0x180
 bio_endio+0x1b0/0x1d8
 __dm_io_complete+0x22c/0x280
 clone_endio+0xe4/0x280
 bio_endio+0x1b0/0x1d8
 blk_update_request+0x138/0x3a4
 blk_mq_plug_issue_direct+0xd4/0x19c
 blk_mq_flush_plug_list+0x2b0/0x354
 __blk_flush_plug+0x110/0x160
 blk_finish_plug+0x30/0x4c
 read_pages+0x2fc/0x370
 page_cache_ra_unbounded+0xa4/0x23c
 page_cache_ra_order+0x290/0x320
 do_sync_mmap_readahead+0x108/0x2c0
 filemap_fault+0x19c/0x52c
 __do_fault+0xc4/0x114
 handle_mm_fault+0x5b4/0x1168
 do_page_fault+0x338/0x4b4
 do_translation_fault+0x40/0x60
 do_mem_abort+0x60/0xc8
 el0_da+0x4c/0xe0
 el0t_64_sync_handler+0xd4/0xfc
 el0t_64_sync+0x1a0/0x1a4

[2] Link: https://lore.kernel.org/all/20210317035448.13921-1-huangjianan@oppo.com/

Reported-by: Will Shiu <Will.Shiu@mediatek.com>
Suggested-by: Gao Xiang <xiang@kernel.org>
Signed-off-by: Sandeep Dhavale <dhavale@google.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Alexandre Mergnat <amergnat@baylibre.com>
Link: https://lore.kernel.org/r/20230621220848.3379029-1-dhavale@google.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2023-06-22 21:16:02 +08:00
Colin Ian King 75bfb70457 nfsd: remove redundant assignments to variable len
There are a few assignments to variable len where the value is not
being read and so the assignments are redundant and can be removed.
In one case, the variable len can be removed completely. Cleans up
4 clang scan warnings of the form:

fs/nfsd/export.c💯7: warning: Although the value stored to 'len'
is used in the enclosing expression, the value is never actually
read from 'len' [deadcode.DeadStores]

Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-06-21 15:05:32 -04:00
Jan Kara 028f6055c9 udf: Fix uninitialized array access for some pathnames
For filenames that begin with . and are between 2 and 5 characters long,
UDF charset conversion code would read uninitialized memory in the
output buffer. The only practical impact is that the name may be prepended a
"unification hash" when it is not actually needed but still it is good
to fix this.

Reported-by: syzbot+cd311b1e43cc25f90d18@syzkaller.appspotmail.com
Link: https://lore.kernel.org/all/000000000000e2638a05fe9dc8f9@google.com
Signed-off-by: Jan Kara <jack@suse.cz>
2023-06-21 11:53:06 +02:00
Gustavo A. R. Silva 2507135e4f
readdir: Replace one-element arrays with flexible-array members
One-element arrays are deprecated, and we are replacing them with flexible
array members instead. So, replace one-element arrays with flexible-array
members in multiple structures.

Address the following -Wstringop-overflow warnings seen when built
m68k architecture with m5307c3_defconfig configuration:
In function '__put_user_fn',
    inlined from 'fillonedir' at fs/readdir.c:170:2:
include/asm-generic/uaccess.h:49:35: warning: writing 1 byte into a region of size 0 [-Wstringop-overflow=]
   49 |                 *(u8 __force *)to = *(u8 *)from;
      |                 ~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~
fs/readdir.c: In function 'fillonedir':
fs/readdir.c:134:25: note: at offset 1 into destination object 'd_name' of size 1
  134 |         char            d_name[1];
      |                         ^~~~~~
In function '__put_user_fn',
    inlined from 'filldir' at fs/readdir.c:257:2:
include/asm-generic/uaccess.h:49:35: warning: writing 1 byte into a region of size 0 [-Wstringop-overflow=]
   49 |                 *(u8 __force *)to = *(u8 *)from;
      |                 ~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~
fs/readdir.c: In function 'filldir':
fs/readdir.c:211:25: note: at offset 1 into destination object 'd_name' of size 1
  211 |         char            d_name[1];
      |                         ^~~~~~

This helps with the ongoing efforts to globally enable
-Wstringop-overflow.

This results in no differences in binary output.

Link: https://github.com/KSPP/linux/issues/79
Link: https://github.com/KSPP/linux/issues/312
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Message-Id: <ZJHiPJkNKwxkKz1c@work>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-06-21 09:06:59 +02:00
Eric Biggers 672d6ef4c7 fsverity: improve documentation for builtin signature support
fsverity builtin signatures (CONFIG_FS_VERITY_BUILTIN_SIGNATURES) aren't
the only way to do signatures with fsverity, and they have some major
limitations.  Yet, more users have tried to use them, e.g. recently by
https://github.com/ostreedev/ostree/pull/2640.  In most cases this seems
to be because users aren't sufficiently familiar with the limitations of
this feature and what the alternatives are.

Therefore, make some updates to the documentation to try to clarify the
properties of this feature and nudge users in the right direction.

Note that the Integrity Policy Enforcement (IPE) LSM, which is not yet
upstream, is planned to use the builtin signatures.  (This differs from
IMA, which uses its own signature mechanism.)  For that reason, my
earlier patch "fsverity: mark builtin signatures as deprecated"
(https://lore.kernel.org/r/20221208033548.122704-1-ebiggers@kernel.org),
which marked builtin signatures as "deprecated", was controversial.

This patch therefore stops short of marking the feature as deprecated.
I've also revised the language to focus on better explaining the feature
and what its alternatives are.

Link: https://lore.kernel.org/r/20230620041937.5809-1-ebiggers@kernel.org
Reviewed-by: Colin Walters <walters@verbum.org>
Reviewed-by: Luca Boccassi <bluca@debian.org>
Signed-off-by: Eric Biggers <ebiggers@google.com>
2023-06-20 22:47:55 -07:00
Paulo Alcantara f0b6a834a8 smb: client: fix warning in generic_ip_connect()
This fixes the following warning reported by kernel test robot

  fs/smb/client/connect.c:2974 generic_ip_connect() error: we
  previously assumed 'socket' could be null (see line 2962)

Link: https://lore.kernel.org/all/202306170124.CtQqzf0I-lkp@intel.com/
Signed-off-by: Paulo Alcantara (SUSE) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-06-20 23:03:03 -05:00
Paulo Alcantara 215533f888 smb: client: fix warning in CIFSFindNext()
This fixes the following warning reported by kernel test robot

  fs/smb/client/cifssmb.c:4216 CIFSFindNext() warn: missing error
  code? 'rc'

Link: https://lore.kernel.org/all/202306170124.CtQqzf0I-lkp@intel.com/
Signed-off-by: Paulo Alcantara (SUSE) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-06-20 23:03:03 -05:00
Paulo Alcantara 032137fe13 smb: client: fix warning in CIFSFindFirst()
This fixes the following warning reported by kernel test robot

  fs/smb/client/cifssmb.c:4089 CIFSFindFirst() warn: missing error
  code? 'rc'

Link: https://lore.kernel.org/all/202306170124.CtQqzf0I-lkp@intel.com/
Signed-off-by: Paulo Alcantara (SUSE) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-06-20 23:03:03 -05:00
Steve French e8eeca0bf4 smb3: do not reserve too many oplock credits
There were cases reported where servers will sometimes return more
credits than requested on oplock break responses, which can lead to
most of the credits being allocated for oplock breaks (instead of
for normal operations like read and write) if number of SMB3 requests
in flight always stays above 0 (the oplock and echo credits are
rebalanced when in flight requests goes down to zero).

If oplock credits gets unexpectedly large (e.g. three is more than it
would ever be expected to be) and in flight requests are greater than
zero, then rebalance the oplock credits and regular credits (go
back to reserving just one oplock credit).

Signed-off-by: Steve French <stfrench@microsoft.com>
2023-06-20 23:03:03 -05:00
Steve French acf35d79ee cifs: print more detail when invalidate_inode_mapping fails
We had seen cases where cifs_invalidate_mapping was logging:
   "Could not invalidate inode ..."
if invalidate_inode_pages2 fails but this message does not show what
the rc is.  Update the logged message to also log the return code.

Suggested-by: Shyam Prasad N <sprasad@microsoft.com>
Reviewed-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-06-20 23:03:03 -05:00
Paulo Alcantara 12c30f33cc smb: client: fix warning in cifs_smb3_do_mount()
This fixes the following warning reported by kernel test robot

  fs/smb/client/cifsfs.c:982 cifs_smb3_do_mount() warn: possible
  memory leak of 'cifs_sb'

Link: https://lore.kernel.org/all/202306170124.CtQqzf0I-lkp@intel.com/
Signed-off-by: Paulo Alcantara (SUSE) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-06-20 23:02:52 -05:00
Linus Torvalds 8ba90f5cc7 19 hotfixes. 8 of these are cc:stable.
This includes a wholesale reversion of the post-6.4 series "make slab shrink
 lockless".  After input from Dave Chinner it has been decided that we
 should go a different way.  Thread starts at
 https://lkml.kernel.org/r/ZH6K0McWBeCjaf16@dread.disaster.area.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZJH/qAAKCRDdBJ7gKXxA
 jq7uAP9AtDGHfvOuW5jlHdYfpUBnbfuQDKjiik71UuIxyhtwQQEAqpOBv7UDuhHj
 NbNIGTIi/xM5vkpjV6CBo9ymR7qTKwo=
 =uGuc
 -----END PGP SIGNATURE-----

Merge tag 'mm-hotfixes-stable-2023-06-20-12-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull hotfixes from Andrew Morton:
 "19 hotfixes.  8 of these are cc:stable.

  This includes a wholesale reversion of the post-6.4 series 'make slab
  shrink lockless'. After input from Dave Chinner it has been decided
  that we should go a different way [1]"

Link: https://lkml.kernel.org/r/ZH6K0McWBeCjaf16@dread.disaster.area [1]

* tag 'mm-hotfixes-stable-2023-06-20-12-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  selftests/mm: fix cross compilation with LLVM
  mailmap: add entries for Ben Dooks
  nilfs2: prevent general protection fault in nilfs_clear_dirty_page()
  Revert "mm: vmscan: make global slab shrink lockless"
  Revert "mm: vmscan: make memcg slab shrink lockless"
  Revert "mm: vmscan: add shrinker_srcu_generation"
  Revert "mm: shrinkers: make count and scan in shrinker debugfs lockless"
  Revert "mm: vmscan: hold write lock to reparent shrinker nr_deferred"
  Revert "mm: vmscan: remove shrinker_rwsem from synchronize_shrinkers()"
  Revert "mm: shrinkers: convert shrinker_rwsem to mutex"
  nilfs2: fix buffer corruption due to concurrent device reads
  scripts/gdb: fix SB_* constants parsing
  scripts: fix the gfp flags header path in gfp-translate
  udmabuf: revert 'Add support for mapping hugepages (v4)'
  mm/khugepaged: fix iteration in collapse_file
  memfd: check for non-NULL file_seals in memfd_create() syscall
  mm/vmalloc: do not output a spurious warning when huge vmalloc() fails
  mm/mprotect: fix do_mprotect_pkey() limit check
  writeback: fix dereferencing NULL mapping->host on writeback_page_template
2023-06-20 17:20:22 -07:00
Linus Torvalds 4b0c7a1ba0 for-6.4-rc7-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmSR56sACgkQxWXV+ddt
 WDvdORAAk3IVO8V6md56wSQBC0S2itaMomSYlzBJL8fqriKZwhaG6C/pJPyt0wyQ
 zj1ZsJqX3WwhDssVWIYNrj6CLr7gOQiMpbX363PYTFc6Hvrw8FqDVk/VjworsSmC
 IyVo3K11HCLIapbYrB1QLxzufDqYieYw6c2jS1elnwUUssvPHxonlfc2IZ1nef9z
 jLS9ZBbEjO7KaUwKrDJ1g4OFljwSnuYCW41fZWFaoXnvlBk84BbBvcRisFY69a++
 Kv0ux1Z2NJXmBOzKG4ZXCCNamwHj/oHqOT2cZm7ZzWrruIo5IP/Q4lzXHxKkbxA0
 7cUfS+O7UqwBDzZZsBEos6Pn82q1rj9RFQkw8B6gCazwRN16TiA2U3AqOkjPZolL
 RXMhLB1oEpQbyxB/PMhk63yV3uHs7VxVvyycuG9qlynFCRVFNxj4hE4TXFJu97DD
 TBOXmUaCBbyRMX72BycziW4hwsbWafkmZqRWV8wHqRXOD4nd7QCj51ZOFTIsNmbq
 Gpo2cJxmzEeYJC/o4Kq+ABCcmrphMhgALugR8T0TBP9gqp0WVs6NmixTzFEc9Zo6
 VbcN721J7R7yYBknJIUQ5GuJJddhu4Ya86vk/N2eBsGVSq0KxCtJoHbM4RBYTaTl
 NiXcdXu1Cr4TojoZN5W6cH6BBSCvIcEBruQtNkoIj5G0Wfrvt0U=
 =IFO0
 -----END PGP SIGNATURE-----

Merge tag 'for-6.4-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fix from David Sterba:
 "One more regression fix for an assertion failure that uncovered a
  nasty problem with stripe calculations. This is caused by a u32
  overflow when there are enough devices. The fstests require 6 so this
  hasn't been caught, I was able to hit it with 8.

  The fix is minimal and only adds u64 casts, we'll clean that up later.
  I did various additional tests to be sure"

* tag 'for-6.4-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: fix u32 overflows when left shifting stripe_nr
2023-06-20 14:38:21 -07:00
Wonguk Lee f3fb462443 fs: jfs: (trivial) Fix typo in dbInitTree function
While trying to fix the jfs UBSAN problem reported in syzkaller,
(https://syzkaller.appspot.com/bug?id=01abadbd6ae6a08b1f1987aa61554c6b3ac19ff2)

I found the typo in the comment of dbInitTree function and fix it.

Signed-off-by: Wonguk Lee <wonguk.lee1023@gmail.com>
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
2023-06-20 14:58:08 -05:00
Linus Torvalds 99ec1ed7c2 four smb3 server fixes, all also for stable
-----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmSNFQcACgkQiiy9cAdy
 T1F/0gwAs396WKBsx5fEiHPucWOdBemT/IqSW+ZTY5LNwB6A4B71W7D8w5by/0MU
 bXprt3XJuPnb21euW2L8BlGg9jJyd3GBcJWv4/GbOzQ//ZMkyT50sZbrWSDP2luq
 jELbe+CS+xOimOSBnIEDxYWqWxetTfDUwlbUHTrHaMGCCZcRJZmEXsJEGelSUx4+
 uipbj0sMUB3x4H7MyEND0bbGxhMKzZXsymRvKTsS3aR1qTioQ8FoNrmDlUBCgN4u
 qbEsG7GGQTj/WrdUrzbQ9DkZa17C+2PNV/j1xKGu/9hBpr2lZSyNmj46MqmPegKD
 4lPwG0GmWEXWuRP7QCXNzS9p9Rkms2C1MnPH7x2NqzoOXsJSZFb4j/fp0oKG4mu1
 zoIjZ5cA+wsOxoJTyOHNdWYFzBcPnUIQ5ZP6696alejA9TtdxqrQdhEBNO9YnsCP
 xreIMADWX8pIjcdDx6DSQ8ZzhyB5Uu46j0K1Ve1e/58HX8FhGrBcOQGFexIXjuMa
 8655VONz
 =yfOS
 -----END PGP SIGNATURE-----

Merge tag '6.4-rc6-smb3-server-fixes' of git://git.samba.org/ksmbd

Pull smb server fixes from Steve French:
 "Four smb3 server fixes, all also for stable:

   - fix potential oops in parsing compounded requests

   - fix various paths (mkdir, create etc) where mnt_want_write was not
     checked first

   - fix slab out of bounds in check_message and write"

* tag '6.4-rc6-smb3-server-fixes' of git://git.samba.org/ksmbd:
  ksmbd: validate session id and tree id in the compound request
  ksmbd: fix out-of-bound read in smb2_write
  ksmbd: add mnt_want_write to ksmbd vfs functions
  ksmbd: validate command payload size
2023-06-20 11:50:40 -07:00
Siddh Raman Pant 11509910c5 jfs: jfs_dmap: Validate db_l2nbperpage while mounting
In jfs_dmap.c at line 381, BLKTODMAP is used to get a logical block
number inside dbFree(). db_l2nbperpage, which is the log2 number of
blocks per page, is passed as an argument to BLKTODMAP which uses it
for shifting.

Syzbot reported a shift out-of-bounds crash because db_l2nbperpage is
too big. This happens because the large value is set without any
validation in dbMount() at line 181.

Thus, make sure that db_l2nbperpage is correct while mounting.

Max number of blocks per page = Page size / Min block size
=> log2(Max num_block per page) = log2(Page size / Min block size)
				= log2(Page size) - log2(Min block size)

=> Max db_l2nbperpage = L2PSIZE - L2MINBLOCKSIZE

Reported-and-tested-by: syzbot+d2cd27dcf8e04b232eb2@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?id=2a70a453331db32ed491f5cbb07e81bf2d225715
Cc: stable@vger.kernel.org
Suggested-by: Dave Kleikamp <dave.kleikamp@oracle.com>
Signed-off-by: Siddh Raman Pant <code@siddh.me>
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
2023-06-20 12:37:50 -05:00
Qu Wenruo a7299a18a1 btrfs: fix u32 overflows when left shifting stripe_nr
[BUG]
David reported an ASSERT() get triggered during fio load on 8 devices
with data/raid6 and metadata/raid1c3:

  fio --rw=randrw --randrepeat=1 --size=3000m \
	  --bsrange=512b-64k --bs_unaligned \
	  --ioengine=libaio --fsync=1024 \
	  --name=job0 --name=job1 \

The ASSERT() is from rbio_add_bio() of raid56.c:

	ASSERT(orig_logical >= full_stripe_start &&
	       orig_logical + orig_len <= full_stripe_start +
	       rbio->nr_data * BTRFS_STRIPE_LEN);

Which is checking if the target rbio is crossing the full stripe
boundary.

  [100.789] assertion failed: orig_logical >= full_stripe_start && orig_logical + orig_len <= full_stripe_start + rbio->nr_data * BTRFS_STRIPE_LEN, in fs/btrfs/raid56.c:1622
  [100.795] ------------[ cut here ]------------
  [100.796] kernel BUG at fs/btrfs/raid56.c:1622!
  [100.797] invalid opcode: 0000 [#1] PREEMPT SMP KASAN
  [100.798] CPU: 1 PID: 100 Comm: kworker/u8:4 Not tainted 6.4.0-rc6-default+ #124
  [100.799] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552-rebuilt.opensuse.org 04/01/2014
  [100.802] Workqueue: writeback wb_workfn (flush-btrfs-1)
  [100.803] RIP: 0010:rbio_add_bio+0x204/0x210 [btrfs]
  [100.806] RSP: 0018:ffff888104a8f300 EFLAGS: 00010246
  [100.808] RAX: 00000000000000a1 RBX: ffff8881075907e0 RCX: ffffed1020951e01
  [100.809] RDX: 0000000000000000 RSI: 0000000000000008 RDI: 0000000000000001
  [100.811] RBP: 0000000141d20000 R08: 0000000000000001 R09: ffff888104a8f04f
  [100.813] R10: ffffed1020951e09 R11: 0000000000000003 R12: ffff88810e87f400
  [100.815] R13: 0000000041d20000 R14: 0000000144529000 R15: ffff888101524000
  [100.817] FS:  0000000000000000(0000) GS:ffff88811ac00000(0000) knlGS:0000000000000000
  [100.821] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [100.822] CR2: 000055d54e44c270 CR3: 000000010a9a1006 CR4: 00000000003706a0
  [100.824] Call Trace:
  [100.825]  <TASK>
  [100.825]  ? die+0x32/0x80
  [100.826]  ? do_trap+0x12d/0x160
  [100.827]  ? rbio_add_bio+0x204/0x210 [btrfs]
  [100.827]  ? rbio_add_bio+0x204/0x210 [btrfs]
  [100.829]  ? do_error_trap+0x90/0x130
  [100.830]  ? rbio_add_bio+0x204/0x210 [btrfs]
  [100.831]  ? handle_invalid_op+0x2c/0x30
  [100.833]  ? rbio_add_bio+0x204/0x210 [btrfs]
  [100.835]  ? exc_invalid_op+0x29/0x40
  [100.836]  ? asm_exc_invalid_op+0x16/0x20
  [100.837]  ? rbio_add_bio+0x204/0x210 [btrfs]
  [100.837]  raid56_parity_write+0x64/0x270 [btrfs]
  [100.838]  btrfs_submit_chunk+0x26e/0x800 [btrfs]
  [100.840]  ? btrfs_bio_init+0x80/0x80 [btrfs]
  [100.841]  ? release_pages+0x503/0x6d0
  [100.842]  ? folio_unlock+0x2f/0x60
  [100.844]  ? __folio_put+0x60/0x60
  [100.845]  ? btrfs_do_readpage+0xae0/0xae0 [btrfs]
  [100.847]  btrfs_submit_bio+0x21/0x60 [btrfs]
  [100.847]  submit_one_bio+0x6a/0xb0 [btrfs]
  [100.849]  extent_write_cache_pages+0x395/0x680 [btrfs]
  [100.850]  ? __extent_writepage+0x520/0x520 [btrfs]
  [100.851]  ? mark_usage+0x190/0x190
  [100.852]  extent_writepages+0xdb/0x130 [btrfs]
  [100.853]  ? extent_write_locked_range+0x480/0x480 [btrfs]
  [100.854]  ? mark_usage+0x190/0x190
  [100.854]  ? attach_extent_buffer_page+0x220/0x220 [btrfs]
  [100.855]  ? reacquire_held_locks+0x178/0x280
  [100.856]  ? writeback_sb_inodes+0x245/0x7f0
  [100.857]  do_writepages+0x102/0x2e0
  [100.858]  ? page_writeback_cpu_online+0x10/0x10
  [100.859]  ? __lock_release.isra.0+0x14a/0x4d0
  [100.860]  ? reacquire_held_locks+0x280/0x280
  [100.861]  ? __lock_acquired+0x1e9/0x3d0
  [100.862]  ? do_raw_spin_lock+0x1b0/0x1b0
  [100.863]  __writeback_single_inode+0x94/0x450
  [100.864]  writeback_sb_inodes+0x372/0x7f0
  [100.864]  ? lock_sync+0xd0/0xd0
  [100.865]  ? do_raw_spin_unlock+0x93/0xf0
  [100.866]  ? sync_inode_metadata+0xc0/0xc0
  [100.867]  ? rwsem_optimistic_spin+0x340/0x340
  [100.868]  __writeback_inodes_wb+0x70/0x130
  [100.869]  wb_writeback+0x2d1/0x530
  [100.869]  ? __writeback_inodes_wb+0x130/0x130
  [100.870]  ? lockdep_hardirqs_on_prepare.part.0+0xf1/0x1c0
  [100.870]  wb_do_writeback+0x3eb/0x480
  [100.871]  ? wb_writeback+0x530/0x530
  [100.871]  ? mark_lock_irq+0xcd0/0xcd0
  [100.872]  wb_workfn+0xe0/0x3f0<

[CAUSE]
Commit a97699d1d6 ("btrfs: replace map_lookup->stripe_len by
BTRFS_STRIPE_LEN") changes how we calculate the map length, to reduce
u64 division.

Function btrfs_max_io_len() is to get the length to the stripe boundary.

It calculates the full stripe start offset (inside the chunk) by the
following code:

		*full_stripe_start =
			rounddown(*stripe_nr, nr_data_stripes(map)) <<
			BTRFS_STRIPE_LEN_SHIFT;

The calculation itself is fine, but the value returned by rounddown() is
dependent on both @stripe_nr (which is u32) and nr_data_stripes() (which
returned int).

Thus the result is also u32, then we do the left shift, which can
overflow u32.

If such overflow happens, @full_stripe_start will be a value way smaller
than @offset, causing later "full_stripe_len - (offset -
*full_stripe_start)" to underflow, thus make later length calculation to
have no stripe boundary limit, resulting a write bio to exceed stripe
boundary.

There are some other locations like this, with a u32 @stripe_nr got left
shift, which can lead to a similar overflow.

[FIX]
Fix all @stripe_nr with left shift with a type cast to u64 before the
left shift.

Those involved @stripe_nr or similar variables are recording the stripe
number inside the chunk, which is small enough to be contained by u32,
but their offset inside the chunk can not fit into u32.

Thus for those specific left shifts, a type cast to u64 is necessary so
this patch does not touch them and the code will be cleaned up in the
future to keep the fix minimal.

Reported-by: David Sterba <dsterba@suse.com>
Fixes: a97699d1d6 ("btrfs: replace map_lookup->stripe_len by BTRFS_STRIPE_LEN")
Tested-by: David Sterba <dsterba@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-20 19:10:31 +02:00
Christian Brauner ceecc2d87f ovl: reserve ability to reconfigure mount options with new mount api
Using the old mount api to remount an overlayfs superblock via
mount(MS_REMOUNT) all mount options will be silently ignored. For
example, if you create an overlayfs mount:

        mount -t overlay overlay -o lowerdir=/mnt/a:/mnt/b,upperdir=/mnt/upper,workdir=/mnt/work /mnt/merged

and then issue a remount via:

        # force mount(8) to use mount(2)
        export LIBMOUNT_FORCE_MOUNT2=always
        mount -t overlay overlay -o remount,WOOTWOOT,lowerdir=/DOESNT-EXIST /mnt/merged

with completely nonsensical mount options whatsoever it will succeed
nonetheless. This prevents us from every changing any mount options we
might introduce in the future that could reasonably be changed during a
remount.

We don't need to carry this issue into the new mount api port. Similar
to FUSE we can use the fs_context::oldapi member to figure out that this
is a request coming through the legacy mount api. If we detect it we
continue silently ignoring all mount options.

But for the new mount api we simply report that mount options cannot
currently be changed. This will allow us to potentially alter mount
properties for new or even old properties. It any case, silently
ignoring everything is not something new apis should do.

Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
2023-06-20 18:28:07 +03:00
Alexander Aring fc4ea4229c fs: dlm: remove filter local comms on close
The current way how lowcomms is configured is due configfs entries. Each
comms configfs entry will create a lowcomms connection. Even the local
connection itself will be stored as a lowcomms connection, although most
functionality for a local lowcomms connection struct is not necessary.

Now in some scenarios we will see that dlm_controld reports a -EEXIST
when configure a node via configfs:

... /sys/kernel/config/dlm/cluster/comms/1/addr: write failed: 17 -1

Doing a:

cat /sys/kernel/config/dlm/cluster/comms/1/addr_list

reported nothing. This was being seen on cluster with nodeid 1 and it's
local configuration. To be sure the configfs entries are in sync with
lowcomms connection structures we always call dlm_midcomms_close() to be
sure the lowcomms connection gets removed when the configfs entry gets
dropped.

Before commit 07ee38674a ("fs: dlm: filter ourself midcomms calls") it
was just doing this by accident and the filter by doing:

if (nodeid == dlm_our_nodeid())
	return 0;

inside dlm_midcomms_close() was never been hit because drop_comm() sets
local_comm to NULL and cause that dlm_our_nodeid() returns always the
invalid nodeid 0.

Fixes: 07ee38674a ("fs: dlm: filter ourself midcomms calls")
Cc: stable@vger.kernel.org
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2023-06-20 09:25:20 -05:00