OpenCloudOS-Kernel

Commit Graph

Author	SHA1	Message	Date
Steve French	61351d6d54	smb3: send backup intent on compounded query info When mounting with backupuid set, we should be setting CREATE_OPEN_BACKUP_INTENT flag on compounded opens as well, especially the case of compounded smb2_query_path_info. Signed-off-by: Steve French <stfrench@microsoft.com> Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>	2018-10-23 21:16:05 -05:00
Steve French	0cb012d1a0	cifs: track writepages in vfs operation counters writepages and readpages operations did not call get/free_xid so the statistics for file copy could get confusing with "vfs operations" not increasing. Add get_xid and free_xid to cifs readpages and writepages functions. Signed-off-by: Steve French <stfrench@microsoft.com> Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>	2018-10-23 21:16:05 -05:00
Gustavo A. R. Silva	f70556c8ca	smb2: fix uninitialized variable bug in smb2_ioctl_query_info There is a potential execution path in which variable resp_buftype is passed as an argument to function free_rsp_buf(), in which it is used in a comparison without being properly initialized previously. Fix this by initializing variable resp_buftype to CIFS_NO_BUFFER in order to avoid unpredictable or unintended results. Addresses-Coverity-ID: 1473971 ("Uninitialized scalar variable") Fixes: c5d25bdb2967 ("cifs: add IOCTL for QUERY_INFO passthrough to userspace") Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Signed-off-by: Steve French <stfrench@microsoft.com> Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>	2018-10-23 21:16:05 -05:00
Ronnie Sahlberg	f5b05d622a	cifs: add IOCTL for QUERY_INFO passthrough to userspace This allows userspace tools to query the raw info levels for cifs files and process the response in userspace. In particular this is useful for many of those data where there is no corresponding native data structure in linux. For example querying the security descriptor for a file and extract the SIDs. Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2018-10-23 21:16:05 -05:00
Steve French	8c1beb9801	cifs: minor clarification in comments Clarify meaning (in comments) meaning of various options for debug messages in cifs.ko. Also fixed trivial formatting/style issue with previous patch. Signed-off-by: Steve French <stfrench@microsoft.com>	2018-10-23 21:16:05 -05:00
Rodrigo Freire	f80eaedd6c	CIFS: Print message when attempting a mount Currently, no messages are printed when mounting a CIFS filesystem and no debug configuration is enabled. However, a CIFS mount information is valuable when troubleshooting and/or forensic analyzing a system and finding out if was a CIFS endpoint mount attempted. Other filesystems such as XFS, EXT* does issue a printk() when mounting their filesystems. A terse log message is printed only if cifsFYI is not enabled. Otherwise, the default full debug message is printed. In order to not clutter and classify correctly the event messages, these are logged as KERN_INFO level. Sample mount operations: [root@corinthians ~]# mount -o user=administrator //172.25.250.18/c$ /mnt (non-existent system) [root@corinthians ~]# mount -o user=administrator //172.25.250.19/c$ /mnt (Valid system) Kernel message log for the mount operations: [ 450.464543] CIFS: Attempting to mount //172.25.250.18/c$ [ 456.478186] CIFS VFS: Error connecting to socket. Aborting operation. [ 456.478381] CIFS VFS: cifs_mount failed w/return code = -113 [ 467.688866] CIFS: Attempting to mount //172.25.250.19/c$ Signed-off-by: Rodrigo Freire <rfreire@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2018-10-23 21:16:05 -05:00
Rodrigo Freire	9a0efeccfa	CIFS: Adds information-level logging function Currently, CIFS lacks a internal logging function that prints out data when CIFS_DEBUG=n. When CIFS_DEBUG=y, the only message level for CIFS events are KERN_ERR or KERN_DEBUG. This patch creates cifs_info(), which is useful for printing non-critical event messges, at either CIFS_DEBUG state. Signed-off-by: Rodrigo Freire <rfreire@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2018-10-23 21:16:05 -05:00
Ronnie Sahlberg	9645759ce6	cifs: OFD locks do not conflict with eachothers RHBZ 1484130 Update cifs_find_fid_lock_conflict() to recognize that ODF locks do not conflict with eachother. Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2018-10-23 21:16:05 -05:00
Long Li	ff526d8605	CIFS: SMBD: Do not call ib_dereg_mr on invalidated memory registration It is not necessary to deregister a memory registration after it has been successfully invalidated. Signed-off-by: Long Li <longli@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2018-10-23 21:16:05 -05:00
Long Li	6d3adb23be	CIFS: pass page offsets on SMB1 read/write When issuing SMB1 read/write, pass the page offset to transport. Signed-off-by: Long Li <longli@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2018-10-23 21:16:05 -05:00
Garry McNulty	ef2298a06d	fs/cifs: fix uninitialised variable warnings In some error conditions, resp_buftype can be passed uninitialised to free_rsp_buf(), potentially resulting in a spurious debug message. If resp_buftype randomly had the value 1 (CIFS_SMALL_BUFFER) then this would log a debug message. The rsp pointer is initialised to NULL so there is no other side-effect. Detected by CoverityScan, CID 1438585 ("Uninitialized scalar variable") Detected by CoverityScan, CID 1438667 ("Uninitialized scalar variable") Detected by CoverityScan, CID 1438764 ("Uninitialized scalar variable") Signed-off-by: Garry McNulty <garrmcnu@gmail.com> Signed-off-by: Steve French <stfrench@microsoft.com> Reviewed-by: Aurelien Aptel <aaptel@suse.com>	2018-10-23 21:16:05 -05:00
Steve French	179e44d49c	smb3: add tracepoint for sending lease break responses to server Be able to log a ftrace message on success and/or failure of sending a lease break response to the server. Example output: TASK-PID CPU# \|\|\|\| TIMESTAMP FUNCTION \| \| \| \|\|\|\| \| \| kworker/1:1-5681 [001] .... 11123.530457: smb3_lease_done: sid=0x291e3e0f tid=0x8ba43071 lease_key=0x1852ca0d3ecd9b55847750a86716fde lease_state=0x0 Signed-off-by: Steve French <stfrench@microsoft.com> Reviewed-by: Aurelien Aptel <aaptel@suse.com>	2018-10-23 21:16:05 -05:00
Steve French	9b9c5bea0b	cifs: do not return atime less than mtime In network file system it is fairly easy for server and client atime vs. mtime to get confused (and atime updated less frequently) which we noticed broke some apps which expect atime >= mtime Also ignore relatime mount option (rather than error on it) since relatime is basically what some network server fs are doing (relatime). Signed-off-by: Steve French <stfrench@microsoft.com> Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>	2018-10-23 21:16:05 -05:00
Steve French	3d621230b8	smb3: update default requested iosize to 4MB from 1MB for recent dialects Modern servers often support 8MB as maximum i/o size, and we see some performance benefits (my testing showed 1 to 13% on write paths, and 1 to 3% on read paths for increasing the default to 4MB). If server doesn't support larger i/o size, during negotiate protocol it is already set correctly to the server's maximum if lower than 4MB. Signed-off-by: Steve French <stfrench@microsoft.com> Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>	2018-10-23 21:16:05 -05:00
Steve French	6e4d3bbe92	smb3: Add debug message later in smb2/smb3 reconnect path As we reset credits later in the reconnect path, useful to have optional (cifsFYI) debug message. Signed-off-by: Steve French <stfrench@microsoft.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>	2018-10-23 21:16:05 -05:00
Aurelien Aptel	8393072bab	CIFS: make 'nodfs' mount opt a superblock flag tcon->Flags is only used by SMB1 code and changing it is not permanent (you lose the setting on tcon reconnect). * Move the setting to superblock flags (per mount-points). * Make automount callback exit early when flag present * Make dfs resolving happening in mount syscall exit early if flag present Signed-off-by: Aurelien Aptel <aaptel@suse.com> Signed-off-by: Steve French <stfrench@microsoft.com> Acked-by: Pavel Shilovsky <pshilov@microsoft.com>	2018-10-23 21:16:05 -05:00
Steve French	9e1a37dad4	smb3: track the instance of each session for debugging Each time we reconnect to the same server, bump an instance counter (and display in /proc/fs/cifs/DebugData) to make it easier to debug. Signed-off-by: Steve French <stfrench@microsoft.com> Reviewed-by: Aurelien Aptel <aaptel@suse.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>	2018-10-23 21:16:04 -05:00
Steve French	37e6a70576	smb3: minor missing defines relating to reparse points Previously reserved dpen response field changed in smb3 Signed-off-by: Steve French <stfrench@microsoft.com> Reviewed-by: Aurelien Aptel <aaptel@suse.com>	2018-10-23 21:16:04 -05:00
Steve French	00778e2294	smb3: add way to control slow response threshold for logging and stats /proc/fs/cifs/Stats when CONFIG_CIFS_STATS2 is enabled logs 'slow' responses, but depending on the server you are debugging a one second timeout may be too fast, so allow setting it to a larger number of seconds via new module parameter /sys/module/cifs/parameters/slow_rsp_threshold or via modprobe: slow_rsp_threshold:Amount of time (in seconds) to wait before logging that a response is delayed. Default: 1 (if set to 0 disables msg). (uint) Recommended values are 0 (disabled) to 32767 (9 hours) with the default remaining as 1 second. Signed-off-by: Steve French <stfrench@microsoft.com> Reviewed-by: Aurelien Aptel <aaptel@suse.com>	2018-10-23 21:16:04 -05:00
Steve French	1c3a13a38a	cifs: minor updates to module description for cifs.ko note smb3 (and common more modern servers) in the module description Signed-off-by: Steve French <stfrench@microsoft.com> Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>	2018-10-23 21:16:04 -05:00
Steve French	5a519bead4	cifs: protect against server returning invalid file system block size For a network file system we generally prefer large i/o, but if the server returns invalid file system block/sector sizes in cifs (vers=1.0) QFSInfo then set block size to a default of a reasonable minimum (4K). Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Steve French <stfrench@microsoft.com> Reviewed-by: Aurelien Aptel <aaptel@suse.com>	2018-10-23 21:16:04 -05:00
Steve French	2c887635cd	smb3: allow stats which track session and share reconnects to be reset Currently, "echo 0 > /proc/fs/cifs/Stats" resets all of the stats except the session and share reconnect counts. Fix it to reset those as well. CC: Stable <stable@vger.kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com> Reviewed-by: Aurelien Aptel <aaptel@suse.com>	2018-10-23 21:16:04 -05:00
Steve French	4d5bdf2869	SMB3: Backup intent flag missing from compounded ops When "backup intent" is requested on the mount (e.g. backupuid or backupgid mount options), the corresponding flag was missing from some of the new compounding operations as well (now that open_query_close is gone). Related to kernel bugzilla #200953 Reported-and-tested-by: <whh@rubrik.com> Signed-off-by: Steve French <stfrench@microsoft.com> Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>	2018-10-23 21:16:04 -05:00
Ronnie Sahlberg	14e562ada2	cifs: create a define for the max number of iov we need for a SMB2 set_info So we don't overflow the io vector arrays accidentally Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2018-10-23 21:16:04 -05:00
Ronnie Sahlberg	bb435512ce	cifs: change SMB2_OP_RENAME and SMB2_OP_HARDLINK to use compounding Get rid of smb2_open_op_close() as all operations are now migrated to smb2_compound_op(). Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2018-10-23 21:16:04 -05:00
Ronnie Sahlberg	3764cbd179	cifs: remove the is_falloc argument to SMB2_set_eof We never pass is_falloc==true here anyway and if we ever need to support is_falloc in the future, SMB2_set_eof is such a trivial wrapper around send_set_info() that we can/should just create a differently named wrapper for that new functionality. Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2018-10-23 21:16:04 -05:00
Ronnie Sahlberg	dcbf910357	cifs: change SMB2_OP_SET_INFO to use compounding Cuts number of network roundtrips significantly for some common syscalls Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2018-10-23 21:16:04 -05:00
Ronnie Sahlberg	f7bfe04bf0	cifs: change SMB2_OP_SET_EOF to use compounding This changes SMB2_OP_SET_EOF to use compounding in some situations. This is part of the path based API to truncate a file. Most of the time this will however not be invoked for SMB2 since cifs_set_file_size() will as far as I can tell almost always just open the file synchronously and switch to the handle based truncate code path, thus bypassing the compounding we add here. Rewriting cifs_set_file_size() and make that whole pile of code more compounding friendly, and also easier to read and understand, is a different project though and not for this patch. Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2018-10-23 21:16:04 -05:00
Ronnie Sahlberg	c2e0fe3f5a	cifs: make rmdir() use compounding This and previous patches drop the number of roundtrips we need for rmdir() from 6 to 2. Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2018-10-23 21:16:04 -05:00
Ronnie Sahlberg	ba8ca11685	cifs: create helpers for SMB2_set_info_init/free() so that we can use these later for compounded set-info calls. Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2018-10-23 21:16:04 -05:00
Ronnie Sahlberg	47dd9597df	cifs: change unlink to use a compound This,and previous patches, drops the number of roundtrips from five to two for unlink() Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2018-10-23 21:16:04 -05:00
Ronnie Sahlberg	f733e3936d	cifs: change mkdir to use a compound This with the previous patch changes mkdir() from needing 6 roundtrips to just 3. Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2018-10-23 21:16:04 -05:00
Ronnie Sahlberg	c5a5f38f07	cifs: add a smb2_compound_op and change QUERY_INFO to use it This turns most open/query-info/close patterns in cifs.ko to become compounds. This changes stat from using 3 roundtrips to just a single one. Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2018-10-23 21:16:04 -05:00
Ronnie Sahlberg	cb5c2e6394	cifs: fix a credits leak for compund commands When processing the mids for compounds we would only add credits based on the last successful mid in the compound which would leak credits and eventually triggering a re-connect. Fix this by splitting the mid processing part into two loops instead of one where the first loop just waits for all mids and then counts how many credits we were granted for the whole compound. Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2018-10-23 21:16:04 -05:00
Steve French	b340a4d4aa	smb3: add tracepoint to catch cases where credit refund of failed op overlaps reconnect Add tracepoint to catch potential cases where a pending operation overlapping a reconnect could fail and incorrectly refund its credits causing the client to think it has more credits available than the server thinks it does. Signed-off-by: Steve French <stfrench@microsoft.com> Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>	2018-10-23 21:16:03 -05:00
YueHaibing	ce7fb50f92	cifs: remove set but not used variable 'cifs_sb' Fixes gcc '-Wunused-but-set-variable' warning: fs/cifs/ioctl.c: In function 'cifs_ioctl': fs/cifs/ioctl.c:164:23: warning: variable 'cifs_sb' set but not used [-Wunused-but-set-variable] Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2018-10-23 21:16:03 -05:00
YueHaibing	d034feeb44	cifs: Use kmemdup rather than duplicating its implementation in smb311_posix_mkdir() Use kmemdup rather than duplicating its implementation Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2018-10-23 21:16:03 -05:00
Steve French	d42c8a87d1	smb3: do not display confusing message on mount to Azure servers Some servers (e.g. Azure) return "STATUS_NOT_IMPLEMENTED" rather than "STATUS_INVALID_DEVICE_REQUEST" on query network interface info at mount. This shouldn't cause us to log a warning message automatically. Don't log this unless noisier cifsFYI is enabled. Signed-off-by: Steve French <stfrench@microsoft.com> Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>	2018-10-23 21:16:03 -05:00
David Howells	3bf0fb6f33	afs: Probe multiple fileservers simultaneously Send probes to all the unprobed fileservers in a fileserver list on all addresses simultaneously in an attempt to find out the fastest route whilst not getting stuck for 20s on any server or address that we don't get a reply from. This alleviates the problem whereby attempting to access a new server can take a long time because the rotation algorithm ends up rotating through all servers and addresses until it finds one that responds. Signed-off-by: David Howells <dhowells@redhat.com>	2018-10-24 00:41:09 +01:00
David Howells	18ac61853c	afs: Fix callback handling In some circumstances, the callback interest pointer is NULL, so in such a case we can't dereference it when checking to see if the callback is broken. This causes an oops in some circumstances. Fix this by replacing the function that worked out the aggregate break counter with one that actually does the comparison, and then make that return true (ie. broken) if there is no callback interest as yet (ie. the pointer is NULL). Fixes: `68251f0a68` ("afs: Fix whole-volume callback handling") Signed-off-by: David Howells <dhowells@redhat.com>	2018-10-24 00:41:09 +01:00
David Howells	2feeaf8433	afs: Eliminate the address pointer from the address list cursor Eliminate the address pointer from the address list cursor as it's redundant (ac->addrs[ac->index] can be used to find the same address) and address lists must be replaced rather than being rearranged, so is of limited value. Signed-off-by: David Howells <dhowells@redhat.com>	2018-10-24 00:41:09 +01:00
David Howells	744bcd713a	afs: Allow dumping of server cursor on operation failure Provide an option to allow the file or volume location server cursor to be dumped if the rotation routine falls off the end without managing to contact a server. Signed-off-by: David Howells <dhowells@redhat.com>	2018-10-24 00:41:09 +01:00
David Howells	30062bd13e	afs: Implement YFS support in the fs client Implement support for talking to YFS-variant fileservers in the cache manager and the filesystem client. These implement upgraded services on the same port as their AFS services. YFS fileservers provide expanded capabilities over AFS. Signed-off-by: David Howells <dhowells@redhat.com>	2018-10-24 00:41:08 +01:00
David Howells	d4936803a9	afs: Expand data structure fields to support YFS Expand fields in various data structures to support the expanded information that YFS is capable of returning. Signed-off-by: David Howells <dhowells@redhat.com>	2018-10-24 00:41:08 +01:00
David Howells	f58db83fd3	afs: Get the target vnode in afs_rmdir() and get a callback on it Get the target vnode in afs_rmdir() and validate it before we attempt the deletion, The vnode pointer will be passed through to the delivery function in a later patch so that the delivery function can mark it deleted. Signed-off-by: David Howells <dhowells@redhat.com>	2018-10-24 00:41:08 +01:00
David Howells	12d8e95a91	afs: Calc callback expiry in op reply delivery Calculate the callback expiration time at the point of operation reply delivery, using the reply time queried from AF_RXRPC on that call as a base. Signed-off-by: David Howells <dhowells@redhat.com>	2018-10-24 00:41:08 +01:00
David Howells	36bb5f490a	afs: Fix FS.FetchStatus delivery from updating wrong vnode The FS.FetchStatus reply delivery function was updating inode of the directory in which a lookup had been done with the status of the looked up file. This corrupts some of the directory state. Fixes: `5cf9dd55a0` ("afs: Prospectively look up extra files when doing a single lookup") Signed-off-by: David Howells <dhowells@redhat.com>	2018-10-24 00:41:08 +01:00
David Howells	35dbfba311	afs: Implement the YFS cache manager service Implement the YFS cache manager service which gives extra capabilities on top of AFS. This is done by listening for an additional service on the same port and indicating that anyone requesting an upgrade should be upgraded to the YFS port. Signed-off-by: David Howells <dhowells@redhat.com>	2018-10-24 00:41:08 +01:00
David Howells	06aeb29714	afs: Remove callback details from afs_callback_break struct Remove unnecessary details of a broken callback, such as version, expiry and type, from the afs_callback_break struct as they're not actually used and make the list take more memory. Signed-off-by: David Howells <dhowells@redhat.com>	2018-10-24 00:41:08 +01:00
David Howells	0067191201	afs: Commit the status on a new file/dir/symlink Call the function to commit the status on a new file, dir or symlink so that the access rights for the caller's key are cached for that object. Without this, the next access to the file will cause a FetchStatus operation to be emitted to retrieve the access rights. Signed-off-by: David Howells <dhowells@redhat.com>	2018-10-24 00:41:08 +01:00
David Howells	3b6492df41	afs: Increase to 64-bit volume ID and 96-bit vnode ID for YFS Increase the sizes of the volume ID to 64 bits and the vnode ID (inode number equivalent) to 96 bits to allow the support of YFS. This requires the iget comparator to check the vnode->fid rather than i_ino and i_generation as i_ino is not sufficiently capacious. It also requires this data to be placed into the vnode cache key for fscache. For the moment, just discard the top 32 bits of the vnode ID when returning it though stat. Signed-off-by: David Howells <dhowells@redhat.com>	2018-10-24 00:41:08 +01:00
David Howells	2a0b4f64c9	afs: Don't invoke the server to read data beyond EOF When writing a new page, clear space in the page rather than attempting to load it from the server if the space is beyond the EOF. Signed-off-by: David Howells <dhowells@redhat.com>	2018-10-24 00:41:08 +01:00
David Howells	f51375cd9e	afs: Add a couple of tracepoints to log I/O errors Add a couple of tracepoints to log the production of I/O errors within the AFS filesystem. Signed-off-by: David Howells <dhowells@redhat.com>	2018-10-24 00:41:07 +01:00
David Howells	4ac15ea536	afs: Handle EIO from delivery function Fix afs_deliver_to_call() to handle -EIO being returned by the operation delivery function, indicating that the call found itself in the wrong state, by printing an error and aborting the call. Currently, an assertion failure will occur. This can happen, say, if the delivery function falls off the end without calling afs_extract_data() with the want_more parameter set to false to collect the end of the Rx phase of a call. The assertion failure looks like: AFS: Assertion failed 4 == 7 is false 0x4 == 0x7 is false ------------[ cut here ]------------ kernel BUG at fs/afs/rxrpc.c:462! and is matched in the trace buffer by a line like: kworker/7:3-3226 [007] ...1 85158.030203: afs_io_error: c=0003be0c r=-5 CM_REPLY Fixes: `98bf40cd99` ("afs: Protect call->state changes against signals") Reported-by: Marc Dionne <marc.dionne@auristor.com> Signed-off-by: David Howells <dhowells@redhat.com>	2018-10-24 00:41:07 +01:00
David Howells	ded2f4c58a	afs: Fix TTL on VL server and address lists Currently the TTL on VL server and address lists isn't set in all circumstances and may be set to poor choices in others, since the TTL is derived from the SRV/AFSDB DNS record if and when available. Fix the TTL by limiting the range to a minimum and maximum from the current time. At some point these can be made into sysctl knobs. Further, use the TTL we obtained from the upcall to set the expiry on negative results too; in future a mechanism can be added to force reloading of such data. Signed-off-by: David Howells <dhowells@redhat.com>	2018-10-24 00:41:07 +01:00
David Howells	0a5143f2f8	afs: Implement VL server rotation Track VL servers as independent entities rather than lumping all their addresses together into one set and implement server-level rotation by: (1) Add the concept of a VL server list, where each server has its own separate address list. This code is similar to the FS server list. (2) Use the DNS resolver to retrieve a set of servers and their associated addresses, ports, preference and weight ratings. (3) In the case of a legacy DNS resolver or an address list given directly through /proc/net/afs/cells, create a list containing just a dummy server record and attach all the addresses to that. (4) Implement a simple rotation policy, for the moment ignoring the priorities and weights assigned to the servers. (5) Show the address list through /proc/net/afs/<cell>/vlservers. This also displays the source and status of the data as indicated by the upcall. Signed-off-by: David Howells <dhowells@redhat.com>	2018-10-24 00:41:07 +01:00
David Howells	e7f680f45b	afs: Improve FS server rotation error handling Improve the error handling in FS server rotation by: (1) Cache the latest useful error value for the fs operation as a whole in struct afs_fs_cursor separately from the error cached in the afs_addr_cursor struct. The one in the address cursor gets clobbered occasionally. Copy over the error to the fs operation only when it's something we'd be interested in passing to userspace. (2) Make it so that EDESTADDRREQ is the default that is seen only if no addresses are available to be accessed. (3) When calling utility functions, such as checking a volume status or probing a fileserver, don't let a successful result clobber the cached error in the cursor; instead, stash the result in a temporary variable until it has been assessed. (4) Don't return ETIMEDOUT or ETIME if a better error, such as ENETUNREACH, is already cached. (5) On leaving the rotation loop, turn any remote abort code into a more useful error than ECONNABORTED. Fixes: `d2ddc776a4` ("afs: Overhaul volume and server record caching and fileserver rotation") Signed-off-by: David Howells <dhowells@redhat.com>	2018-10-24 00:41:07 +01:00
David Howells	12bdcf333f	afs: Set up the iov_iter before calling afs_extract_data() afs_extract_data sets up a temporary iov_iter and passes it to AF_RXRPC each time it is called to describe the remaining buffer to be filled. Instead: (1) Put an iterator in the afs_call struct. (2) Set the iterator for each marshalling stage to load data into the appropriate places. A number of convenience functions are provided to this end (eg. afs_extract_to_buf()). This iterator is then passed to afs_extract_data(). (3) Use the new ITER_DISCARD iterator to discard any excess data provided by FetchData. Signed-off-by: David Howells <dhowells@redhat.com>	2018-10-24 00:41:07 +01:00
David Howells	160cb9574b	afs: Better tracing of protocol errors Include the site of detection of AFS protocol errors in trace lines to better be able to determine what went wrong. Signed-off-by: David Howells <dhowells@redhat.com>	2018-10-24 00:41:07 +01:00
David Howells	aa563d7bca	iov_iter: Separate type from direction and use accessor functions In the iov_iter struct, separate the iterator type from the iterator direction and use accessor functions to access them in most places. Convert a bunch of places to use switch-statements to access them rather then chains of bitwise-AND statements. This makes it easier to add further iterator types. Also, this can be more efficient as to implement a switch of small contiguous integers, the compiler can use ~50% fewer compare instructions than it has to use bitwise-and instructions. Further, cease passing the iterator type into the iterator setup function. The iterator function can set that itself. Only the direction is required. Signed-off-by: David Howells <dhowells@redhat.com>	2018-10-24 00:41:07 +01:00
David Howells	00e2370744	iov_iter: Use accessor function Use accessor functions to access an iterator's type and direction. This allows for the possibility of using some other method of determining the type of iterator than if-chains with bitwise-AND conditions. Signed-off-by: David Howells <dhowells@redhat.com>	2018-10-24 00:40:44 +01:00
Frank Sorenson	86bbd7422a	NFS: change sign of nfs_fh length The filehandle has a length which is defined as a 32-bit "unsigned integer". Change sign of the length appropriately. Signed-off-by: Frank Sorenson <sorenson@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>	2018-10-23 12:22:21 -04:00
Linus Torvalds	99792e0cea	Merge branch 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 mm updates from Ingo Molnar: "Lots of changes in this cycle: - Lots of CPA (change page attribute) optimizations and related cleanups (Thomas Gleixner, Peter Zijstra) - Make lazy TLB mode even lazier (Rik van Riel) - Fault handler cleanups and improvements (Dave Hansen) - kdump, vmcore: Enable kdumping encrypted memory with AMD SME enabled (Lianbo Jiang) - Clean up VM layout documentation (Baoquan He, Ingo Molnar) - ... plus misc other fixes and enhancements" * 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (51 commits) x86/stackprotector: Remove the call to boot_init_stack_canary() from cpu_startup_entry() x86/mm: Kill stray kernel fault handling comment x86/mm: Do not warn about PCI BIOS W+X mappings resource: Clean it up a bit resource: Fix find_next_iomem_res() iteration issue resource: Include resource end in walk_*() interfaces x86/kexec: Correct KEXEC_BACKUP_SRC_END off-by-one error x86/mm: Remove spurious fault pkey check x86/mm/vsyscall: Consider vsyscall page part of user address space x86/mm: Add vsyscall address helper x86/mm: Fix exception table comments x86/mm: Add clarifying comments for user addr space x86/mm: Break out user address space handling x86/mm: Break out kernel address space handling x86/mm: Clarify hardware vs. software "error_code" x86/mm/tlb: Make lazy TLB mode lazier x86/mm/tlb: Add freed_tables element to flush_tlb_info x86/mm/tlb: Add freed_tables argument to flush_tlb_mm_range smp,cpumask: introduce on_each_cpu_cond_mask smp: use __cpumask_set_cpu in on_each_cpu_cond ...	2018-10-23 17:05:28 +01:00
Linus Torvalds	0200fbdd43	Merge branch 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking and misc x86 updates from Ingo Molnar: "Lots of changes in this cycle - in part because locking/core attracted a number of related x86 low level work which was easier to handle in a single tree: - Linux Kernel Memory Consistency Model updates (Alan Stern, Paul E. McKenney, Andrea Parri) - lockdep scalability improvements and micro-optimizations (Waiman Long) - rwsem improvements (Waiman Long) - spinlock micro-optimization (Matthew Wilcox) - qspinlocks: Provide a liveness guarantee (more fairness) on x86. (Peter Zijlstra) - Add support for relative references in jump tables on arm64, x86 and s390 to optimize jump labels (Ard Biesheuvel, Heiko Carstens) - Be a lot less permissive on weird (kernel address) uaccess faults on x86: BUG() when uaccess helpers fault on kernel addresses (Jann Horn) - macrofy x86 asm statements to un-confuse the GCC inliner. (Nadav Amit) - ... and a handful of other smaller changes as well" * 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (57 commits) locking/lockdep: Make global debug_locks* variables read-mostly locking/lockdep: Fix debug_locks off performance problem locking/pvqspinlock: Extend node size when pvqspinlock is configured locking/qspinlock_stat: Count instances of nested lock slowpaths locking/qspinlock, x86: Provide liveness guarantee x86/asm: 'Simplify' GEN_*_RMWcc() macros locking/qspinlock: Rework some comments locking/qspinlock: Re-order code locking/lockdep: Remove duplicated 'lock_class_ops' percpu array x86/defconfig: Enable CONFIG_USB_XHCI_HCD=y futex: Replace spin_is_locked() with lockdep locking/lockdep: Make class->ops a percpu counter and move it under CONFIG_DEBUG_LOCKDEP=y x86/jump-labels: Macrofy inline assembly code to work around GCC inlining bugs x86/cpufeature: Macrofy inline assembly code to work around GCC inlining bugs x86/extable: Macrofy inline assembly code to work around GCC inlining bugs x86/paravirt: Work around GCC inlining bugs when compiling paravirt ops x86/bug: Macrofy the BUG table section handling, to work around GCC inlining bugs x86/alternatives: Macrofy lock prefixes to work around GCC inlining bugs x86/refcount: Work around GCC inlining bug x86/objtool: Use asm macros to work around GCC inlining bugs ...	2018-10-23 13:08:53 +01:00
Ding Xiang	84db119f5a	ubifs: Remove unneeded semicolon delete redundant semicolon Signed-off-by: Ding Xiang <dingxiang@cmss.chinamobile.com> Signed-off-by: Richard Weinberger <richard@nod.at>	2018-10-23 13:49:02 +02:00
Sascha Hauer	d8a22773a1	ubifs: Enable authentication support With the preparations all being done this patch now enables authentication support for UBIFS. Authentication is enabled when the newly introduced auth_key and auth_hash_name mount options are passed. auth_key provides the key which is used for authentication whereas auth_hash_name provides the hashing algorithm used for this FS. Passing these options make authentication mandatory and only UBIFS images that can be authenticated with the given key are allowed. Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de> Signed-off-by: Richard Weinberger <richard@nod.at>	2018-10-23 13:49:01 +02:00
Sascha Hauer	1e76592f2c	ubifs: Do not update inode size in-place in authenticated mode In authenticated mode we cannot fixup the inode sizes in-place during recovery as this would invalidate the hashes and HMACs we stored for this inode. Instead, we just write the updated inodes to the journal. We can only do this after ubifs_rcvry_gc_commit() is done though, so for authenticated mode call ubifs_recover_size() after ubifs_rcvry_gc_commit() and not vice versa as normally done. Calling ubifs_recover_size() after ubifs_rcvry_gc_commit() has the drawback that after a commit the size fixup information is gone, so when a powercut happens while recovering from another powercut we may lose some data written right before the first powercut. This is why we only do this in authenticated mode and leave the behaviour for unauthenticated mode untouched. Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de> Signed-off-by: Richard Weinberger <richard@nod.at>	2018-10-23 13:48:57 +02:00
Sascha Hauer	104115a3eb	ubifs: Add hashes and HMACs to default filesystem This patch calculates the necessary hashes and HMACs for the default filesystem so that the dynamically created default fs can be authenticated. Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de> Signed-off-by: Richard Weinberger <richard@nod.at>	2018-10-23 13:48:57 +02:00
Sascha Hauer	e158e02ff7	ubifs: authentication: Authenticate super block node This adds a HMAC covering the super block node and adds the logic that decides if a filesystem shall be mounted unauthenticated or authenticated. Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de> Signed-off-by: Richard Weinberger <richard@nod.at>	2018-10-23 13:48:57 +02:00
Sascha Hauer	b5b1f08369	ubifs: Create hash for default LPT During creation of the default filesystem on an empty flash the default LPT is created. With this patch a hash over the default LPT is calculated which can be added to the default filesystems master node. Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de> Signed-off-by: Richard Weinberger <richard@nod.at>	2018-10-23 13:48:56 +02:00
Sascha Hauer	625700ccb5	ubfis: authentication: Authenticate master node The master node contains hashes over the root index node and the LPT. This patch adds a HMAC to authenticate the master node itself. Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de> Signed-off-by: Richard Weinberger <richard@nod.at>	2018-10-23 13:48:52 +02:00
Sascha Hauer	a1dc58140f	ubifs: authentication: Authenticate LPT The LPT needs to be authenticated aswell. Since the LPT is only written during commit it is enough to authenticate the whole LPT with a single hash which is stored in the master node. Only the leaf nodes (pnodes) are hashed which makes the implementation much simpler than it would be to hash the complete LPT. Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de> Signed-off-by: Richard Weinberger <richard@nod.at>	2018-10-23 13:48:47 +02:00
Sascha Hauer	da8ef65f95	ubifs: Authenticate replayed journal Make sure that during replay all buds can be authenticated. To do this we calculate the hash chain until we find an authentication node and check the HMAC in that node against the current status of the hash chain. After a power cut it can happen that some nodes have been written, but not yet the authentication node for them. These nodes have to be discarded during replay. Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de> Signed-off-by: Richard Weinberger <richard@nod.at>	2018-10-23 13:48:40 +02:00
Sascha Hauer	6f06d96fdf	ubifs: Add auth nodes to garbage collector journal head To be able to authenticate the garbage collector journal head add authentication nodes to the buds the garbage collector creates. Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de> Signed-off-by: Richard Weinberger <richard@nod.at>	2018-10-23 13:48:40 +02:00
Sascha Hauer	6a98bc4614	ubifs: Add authentication nodes to journal Nodes that are written to flash can only be authenticated through the index after the next commit. When a journal replay is necessary the nodes are not yet referenced by the index and thus can't be authenticated. This patch overcomes this situation by creating a hash over all nodes beginning from the commit start node over the reference node(s) and the buds themselves. From time to time we insert authentication nodes. Authentication nodes contain a HMAC from the current hash state, so that they can be used to authenticate a journal replay up to the point where the authentication node is. The hash is continued afterwards so that theoretically we would only have to check the HMAC of the last authentication node we find. Overall we get this picture: ,,,,,,,, ,......,........................................... ,. CS , hash1.----. hash2.----. ,. \| , . \|hmac . \|hmac ,. v , . v . v ,.REF#0,-> bud -> bud -> bud.-> auth -> bud -> bud.-> auth ... ,..\|...,........................................... , \| , , \| ,,,,,,,,,,,,,,, . \| hash3,----. , \| , \|hmac , v , v , REF#1 -> bud -> bud,-> auth ... ,,,\|,,,,,,,,,,,,,,,,,, v REF#2 -> ... \| V ... Note how hash3 covers CS, REF#0 and REF#1 so that it is not possible to exchange or skip any reference nodes. Unlike the picture suggests the auth nodes themselves are not hashed. With this it is possible for an offline attacker to cut each journal head or to drop the last reference node(s), but not to skip any journal heads or to reorder any operations. Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de> Signed-off-by: Richard Weinberger <richard@nod.at>	2018-10-23 13:48:39 +02:00
Sascha Hauer	16a26b20d2	ubifs: authentication: Add hashes to index nodes With this patch the hashes over the index nodes stored in the tree node cache are written to flash and are checked when read back from flash. The hash of the root index node is stored in the master node. During journal replay the hashes are regenerated from the read nodes and stored in the tree node cache. This means the nodes must previously be authenticated by other means. This is done in a later patch. Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de> Signed-off-by: Richard Weinberger <richard@nod.at>	2018-10-23 13:48:39 +02:00
Sascha Hauer	823838a486	ubifs: Add hashes to the tree node cache As part of the UBIFS authentication support every branch in the index gets a hash covering the referenced node. To make that happen the tree node cache needs hashes over the nodes. This patch adds a hash argument to ubifs_tnc_add() and ubifs_tnc_add_nm(). The hashes are calculated from the callers of these functions which actually prepare the nodes. With this patch all the leaf nodes of the index tree get hashes, but currently nothing is done with these hashes, this is left for a later patch. Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de> Signed-off-by: Richard Weinberger <richard@nod.at>	2018-10-23 13:48:39 +02:00
Sascha Hauer	a384b47e49	ubifs: Create functions to embed a HMAC in a node With authentication support some nodes (master node, super block node) get a HMAC embedded into them. This patch adds functions to prepare and write such a node. The difficulty is that besides the HMAC the nodes also have a CRC which must stay valid. This means we first have to initialize all fields in the node, then calculate the HMAC (not covering the CRC) and finally calculate the CRC. Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de> Signed-off-by: Richard Weinberger <richard@nod.at>	2018-10-23 13:48:37 +02:00
Sascha Hauer	49525e5eec	ubifs: Add helper functions for authentication support This patch adds the various helper functions needed for authentication support. We need functions to hash nodes, to embed HMACs into a node and to compare hashes and HMACs. Most functions first check if this filesystem is authenticated and bail out early if not, which makes the functions safe to be called with disabled authentication. Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de> Signed-off-by: Richard Weinberger <richard@nod.at>	2018-10-23 13:48:33 +02:00
Sascha Hauer	dead97266f	ubifs: Add separate functions to init/crc a node When adding authentication support we will embed a HMAC into some nodes. To prepare these nodes we have to first initialize the nodes, then add a HMAC and finally add a CRC. To accomplish this add separate ubifs_init_node/ubifs_crc_node functions. Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de> Signed-off-by: Richard Weinberger <richard@nod.at>	2018-10-23 13:48:29 +02:00
Sascha Hauer	5125cfdff1	ubifs: Format changes for authentication support This patch adds the changes to the on disk format needed for authentication support. We'll add: * a HMAC covering super block node * a HMAC covering the master node * a hash over the root index node to the master node * a hash over the LPT to the master node * a flag to the filesystem flag indicating the filesystem is authenticated * an authentication node necessary to authenticate the nodes written to the journal heads while they are written. * a HMAC of a well known message to the super block node to be able to check if the correct key is provided And finally, not visible in this patch, nevertheless explained here: * hashes over the referenced child nodes in each branch of a index node Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de> Signed-off-by: Richard Weinberger <richard@nod.at>	2018-10-23 13:48:29 +02:00
Sascha Hauer	fd6150051b	ubifs: Store read superblock node The superblock node is read/modified/written several times throughout the UBIFS code. Instead of reading it from the device each time just keep a copy in memory and write back the modified copy when necessary. This patch helps for authentication support, here we not only have to read the superblock node, but also have to authenticate it, which is easier if we do it once during initialization. Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de> Signed-off-by: Richard Weinberger <richard@nod.at>	2018-10-23 13:48:29 +02:00
Sascha Hauer	83407437bb	ubifs: Drop write_node write_node() is used only once and can easily be replaced with calls to ubifs_prepare_node()/write_head() which makes the code a bit shorter. Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de> Signed-off-by: Richard Weinberger <richard@nod.at>	2018-10-23 13:48:24 +02:00
Sascha Hauer	e635cf8c3b	ubifs: Implement ubifs_lpt_lookup using ubifs_pnode_lookup ubifs_lpt_lookup() starts by looking up the nth pnode in the LPT. We already have this functionality in ubifs_pnode_lookup(). Use this function rather than open coding its functionality. Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de> Signed-off-by: Richard Weinberger <richard@nod.at>	2018-10-23 13:48:21 +02:00
Sascha Hauer	0e26b6e255	ubifs: Export pnode_lookup as ubifs_pnode_lookup ubifs_lpt_lookup could be implemented using pnode_lookup. To make that possible move pnode_lookup from lpt.c to lpt_commit.c. Rename it to ubifs_pnode_lookup since it's now exported. Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de> Signed-off-by: Richard Weinberger <richard@nod.at>	2018-10-23 13:48:17 +02:00
Sascha Hauer	22ceaa8c68	ubifs: Pass ubifs_zbranch to read_znode() read_znode() takes len, lnum and offs arguments which the caller all extracts from the same struct ubifs_zbranch . When adding authentication support we would have to add a pointer to a hash to the arguments which is also part of struct ubifs_zbranch. Pass the ubifs_zbranch instead so that we do not have to add another argument. Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de> Signed-off-by: Richard Weinberger <richard@nod.at>	2018-10-23 13:48:14 +02:00
Sascha Hauer	545bc8f6b0	ubifs: Pass ubifs_zbranch to try_read_node() try_read_node() takes len, lnum and offs arguments which the caller all extracts from the same struct ubifs_zbranch . When adding authentication support we would have to add a pointer to a hash to the arguments which is also part of struct ubifs_zbranch. Pass the ubifs_zbranch instead so that we do not have to add another argument. Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de> Signed-off-by: Richard Weinberger <richard@nod.at>	2018-10-23 13:48:10 +02:00
Sascha Hauer	c4de6d7e43	ubifs: Refactor create_default_filesystem() create_default_filesystem() allocates memory for a node, writes that node and frees the memory directly afterwards. With this patch we allocate memory for all nodes at the beginning of the function and free the memory at the end. This makes it easier to implement authentication support since with authentication support we'll need the contents of some nodes when creating other nodes. Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de> Signed-off-by: Richard Weinberger <richard@nod.at>	2018-10-23 13:48:06 +02:00
Chao Yu	7813081969	f2fs: fix to keep project quota consistent This patch does below changes to keep consistence of project quota data in sudden power-cut case: - update inode.i_projid and project quota atomically under lock_op() in f2fs_ioc_setproject() - recover inode.i_projid and project quota in recover_inode() Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-10-22 17:54:48 -07:00
Chao Yu	af033b2aa8	f2fs: guarantee journalled quota data by checkpoint For journalled quota mode, let checkpoint to flush dquot dirty data and quota file data to guarntee persistence of all quota sysfile in last checkpoint, by this way, we can avoid corrupting quota sysfile when encountering SPO. The implementation is as below: 1. add a global state SBI_QUOTA_NEED_FLUSH to indicate that there is cached dquot metadata changes in quota subsystem, and later checkpoint should: a) flush dquot metadata into quota file. b) flush quota file to storage to keep file usage be consistent. 2. add a global state SBI_QUOTA_NEED_REPAIR to indicate that quota operation failed due to -EIO or -ENOSPC, so later, a) checkpoint will skip syncing dquot metadata. b) CP_QUOTA_NEED_FSCK_FLAG will be set in last cp pack to give a hint for fsck repairing. 3. add a global state SBI_QUOTA_SKIP_FLUSH, in checkpoint, if quota data updating is very heavy, it may cause hungtask in block_operation(). To avoid this, if our retry time exceed threshold, let's just skip flushing and retry in next checkpoint(). Signed-off-by: Weichao Guo <guoweichao@huawei.com> Signed-off-by: Chao Yu <yuchao0@huawei.com> [Jaegeuk Kim: avoid warnings and set fsck flag] Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-10-22 17:54:47 -07:00
Sheng Yong	26b5a07919	f2fs: cleanup dirty pages if recover failed During recover, we will try to create new dentries for inodes with dentry_mark. But if the parent is missing (e.g. killed by fsck), recover will break. But those recovered dirty pages are not cleanup. This will hit f2fs_bug_on: [ 53.519566] F2FS-fs (loop0): Found nat_bits in checkpoint [ 53.539354] F2FS-fs (loop0): recover_inode: ino = 5, name = file, inline = 3 [ 53.539402] F2FS-fs (loop0): recover_dentry: ino = 5, name = file, dir = 0, err = -2 [ 53.545760] F2FS-fs (loop0): Cannot recover all fsync data errno=-2 [ 53.546105] F2FS-fs (loop0): access invalid blkaddr:4294967295 [ 53.546171] WARNING: CPU: 1 PID: 1798 at fs/f2fs/checkpoint.c:163 f2fs_is_valid_blkaddr+0x26c/0x320 [ 53.546174] Modules linked in: [ 53.546183] CPU: 1 PID: 1798 Comm: mount Not tainted 4.19.0-rc2+ #1 [ 53.546186] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006 [ 53.546191] RIP: 0010:f2fs_is_valid_blkaddr+0x26c/0x320 [ 53.546195] Code: 85 bb 00 00 00 48 89 df 88 44 24 07 e8 ad a8 db ff 48 8b 3b 44 89 e1 48 c7 c2 40 03 72 a9 48 c7 c6 e0 01 72 a9 e8 84 3c ff ff <0f> 0b 0f b6 44 24 07 e9 8a 00 00 00 48 8d bf 38 01 00 00 e8 7c a8 [ 53.546201] RSP: 0018:ffff88006c067768 EFLAGS: 00010282 [ 53.546208] RAX: 0000000000000000 RBX: ffff880068844200 RCX: ffffffffa83e1a33 [ 53.546211] RDX: 0000000000000000 RSI: 0000000000000008 RDI: ffff88006d51e590 [ 53.546215] RBP: 0000000000000005 R08: ffffed000daa3cb3 R09: ffffed000daa3cb3 [ 53.546218] R10: 0000000000000001 R11: ffffed000daa3cb2 R12: 00000000ffffffff [ 53.546221] R13: ffff88006a1f8000 R14: 0000000000000200 R15: 0000000000000009 [ 53.546226] FS: 00007fb2f3646840(0000) GS:ffff88006d500000(0000) knlGS:0000000000000000 [ 53.546229] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 53.546234] CR2: 00007f0fd77f0008 CR3: 00000000687e6002 CR4: 00000000000206e0 [ 53.546237] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 53.546240] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 53.546242] Call Trace: [ 53.546248] f2fs_submit_page_bio+0x95/0x740 [ 53.546253] read_node_page+0x161/0x1e0 [ 53.546271] ? truncate_node+0x650/0x650 [ 53.546283] ? add_to_page_cache_lru+0x12c/0x170 [ 53.546288] ? pagecache_get_page+0x262/0x2d0 [ 53.546292] __get_node_page+0x200/0x660 [ 53.546302] f2fs_update_inode_page+0x4a/0x160 [ 53.546306] f2fs_write_inode+0x86/0xb0 [ 53.546317] __writeback_single_inode+0x49c/0x620 [ 53.546322] writeback_single_inode+0xe4/0x1e0 [ 53.546326] sync_inode_metadata+0x93/0xd0 [ 53.546330] ? sync_inode+0x10/0x10 [ 53.546342] ? do_raw_spin_unlock+0xed/0x100 [ 53.546347] f2fs_sync_inode_meta+0xe0/0x130 [ 53.546351] f2fs_fill_super+0x287d/0x2d10 [ 53.546367] ? vsnprintf+0x742/0x7a0 [ 53.546372] ? f2fs_commit_super+0x180/0x180 [ 53.546379] ? up_write+0x20/0x40 [ 53.546385] ? set_blocksize+0x5f/0x140 [ 53.546391] ? f2fs_commit_super+0x180/0x180 [ 53.546402] mount_bdev+0x181/0x200 [ 53.546406] mount_fs+0x94/0x180 [ 53.546411] vfs_kern_mount+0x6c/0x1e0 [ 53.546415] do_mount+0xe5e/0x1510 [ 53.546420] ? fs_reclaim_release+0x9/0x30 [ 53.546424] ? copy_mount_string+0x20/0x20 [ 53.546428] ? fs_reclaim_acquire+0xd/0x30 [ 53.546435] ? __might_sleep+0x2c/0xc0 [ 53.546440] ? ___might_sleep+0x53/0x170 [ 53.546453] ? __might_fault+0x4c/0x60 [ 53.546468] ? _copy_from_user+0x95/0xa0 [ 53.546474] ? memdup_user+0x39/0x60 [ 53.546478] ksys_mount+0x88/0xb0 [ 53.546482] __x64_sys_mount+0x5d/0x70 [ 53.546495] do_syscall_64+0x65/0x130 [ 53.546503] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 53.547639] ---[ end trace b804d1ea2fec893e ]--- So if recover fails, we need to drop all recovered data. Signed-off-by: Sheng Yong <shengyong1@huawei.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-10-22 17:54:47 -07:00
Sahitya Tummala	1e78e8bd9d	f2fs: fix data corruption issue with hardware encryption Direct IO can be used in case of hardware encryption. The following scenario results into data corruption issue in this path - Thread A - Thread B- -> write file#1 in direct IO -> GC gets kicked in -> GC submitted bio on meta mapping for file#1, but pending completion -> write file#1 again with new data in direct IO -> GC bio gets completed now -> GC writes old data to the new location and thus file#1 is corrupted. Fix this by submitting and waiting for pending io on meta mapping for direct IO case in f2fs_map_blocks(). Signed-off-by: Sahitya Tummala <stummala@codeaurora.org> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-10-22 17:54:47 -07:00
Chao Yu	0c093b590e	f2fs: fix to recover inode->i_flags of inode block during POR Testcase to reproduce this bug: 1. mkfs.f2fs /dev/sdd 2. mount -t f2fs /dev/sdd /mnt/f2fs 3. touch /mnt/f2fs/file 4. sync 5. chattr +a /mnt/f2fs/file 6. xfs_io -a /mnt/f2fs/file -c "fsync" 7. godown /mnt/f2fs 8. umount /mnt/f2fs 9. mount -t f2fs /dev/sdd /mnt/f2fs 10. xfs_io /mnt/f2fs/file There is no error when opening this file w/o O_APPEND, but actually, we expect the correct result should be: /mnt/f2fs/file: Operation not permitted The root cause is, in recover_inode(), we recover inode->i_flags more than F2FS_I(inode)->i_flags, so fix it. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-10-22 17:54:47 -07:00
Chao Yu	9149a5eb60	f2fs: spread f2fs_set_inode_flags() This patch changes codes as below: - use f2fs_set_inode_flags() to update i_flags atomically to avoid potential race. - synchronize F2FS_I(inode)->i_flags to inode->i_flags in f2fs_new_inode(). - use f2fs_set_inode_flags() to simply codes in f2fs_quota_{on,off}. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-10-22 17:54:47 -07:00
Chao Yu	2baf078185	f2fs: fix to spread clear_cold_data() We need to drop PG_checked flag on page as well when we clear PG_uptodate flag, in order to avoid treating the page as GCing one later. Signed-off-by: Weichao Guo <guoweichao@huawei.com> Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-10-22 17:54:46 -07:00
Jaegeuk Kim	164a63fa6b	Revert "f2fs: fix to clear PG_checked flag in set_page_dirty()" This reverts commit `66110abc4c`. If we clear the cold data flag out of the writeback flow, we can miscount -1 by end_io, which incurs a deadlock caused by all I/Os being blocked during heavy GC. Balancing F2FS Async: - IO (CP: 1, Data: -1, Flush: ( 0 0 1), Discard: ( ... GC thread: IRQ - move_data_page() - set_page_dirty() - clear_cold_data() - f2fs_write_end_io() - type = WB_DATA_TYPE(page); here, we get wrong type - dec_page_count(sbi, type); - f2fs_wait_on_page_writeback() Cc: <stable@vger.kernel.org> Reported-and-Tested-by: Park Ju Hyung <qkrwngud825@gmail.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-10-22 17:54:46 -07:00
Jaegeuk Kim	5f9abab42b	f2fs: account read IOs and use IO counts for is_idle This patch adds issued read IO counts which is under block layer. Chao modified a bit, since: Below race can cause reversed reference on F2FS_RD_DATA, there is the same issue in f2fs_submit_page_bio(), fix them by relocate __submit_bio() and inc_page_count. Thread A Thread B - f2fs_write_begin - f2fs_submit_page_read - __submit_bio - f2fs_read_end_io - __read_end_io - dec_page_count(, F2FS_RD_DATA) - inc_page_count(, F2FS_RD_DATA) Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-10-22 17:54:46 -07:00
Chao Yu	78efac537d	f2fs: fix to account IO correctly for cgroup writeback Now, we have supported cgroup writeback, it depends on correctly IO account of specified filesystem. But in commit `d1b3e72d54` ("f2fs: submit bio of in-place-update pages"), we split write paths from f2fs_submit_page_mbio() to two: - f2fs_submit_page_bio() for IPU path - f2fs_submit_page_bio() for OPU path But still we account write IO only in f2fs_submit_page_mbio(), result in incorrect IO account, fix it by adding missing IO account in IPU path. Fixes: `d1b3e72d54` ("f2fs: submit bio of in-place-update pages") Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-10-22 17:54:39 -07:00
Chao Yu	4c58ed0768	f2fs: fix to account IO correctly Below race can cause reversed reference on dirty count, fix it by relocating __submit_bio() and inc_page_count(). Thread A Thread B - f2fs_inplace_write_data - f2fs_submit_page_bio - __submit_bio - f2fs_write_end_io - dec_page_count - inc_page_count Cc: <stable@vger.kernel.org> Fixes: `d1b3e72d54` ("f2fs: submit bio of in-place-update pages") Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-10-22 17:53:48 -07:00
Linus Torvalds	a36cf68651	SPI NOR changes: Core changes: * Support non-uniform erase size * Support controllers with limited TX fifo size Driver changes: * m25p80: Re-issue a WREN command after each write access * cadence: Pass a proper dir value to dma_[un]map_single() * fsl-qspi: Check fsl_qspi_get_seqid() return val make sure 4B addressing opcodes are properly handled * intel-spi: Add a new PCI entry for Ice Lake NAND changes: Raw NAND core changes: - Two batchs of cleanups of the NAND API, including: * Deprecating a lot of interfaces (now replaced by ->exec_op()). * Moving code in separate drivers (JEDEC, ONFI), in private files (internals), in platform drivers, etc. * Functions/structures reordering. * Exclusive use of the nand_chip structure instead of the MTD one all across the subsystem. - Addition of the nand_wait_readrdy/rdy_op() helpers. Raw NAND controllers drivers changes: - Various coccinelle patches. - Marvell: * Use regmap_update_bits() for syscon access. * More documentation. * BCH failure path rework. * More layouts to be supported. * IRQ handler complete() condition fixed. - Fsl_ifc: * SRAM initialization fixed for newer controller versions. - Denali: * Fix licenses mismatch and use a SPDX tag. * Set SPARE_AREA_SKIP_BYTES register to 8 if unset. - Qualcomm: * Do not include dma-direct.h. - Docg4: * Removed. - Ams-delta: * Use of a GPIO lookup table * Internal machinery changes. Raw NAND chip drivers changes: - Toshiba: * Add support for Toshiba memory BENAND * Pass a single nand_chip object to the status helper. - ESMT: * New driver to retrieve the ECC requirements from the 5th ID byte. MTD changes: * physmap cleanups/fixe * gpio-addr-flash cleanups/fixes -----BEGIN PGP SIGNATURE----- iQJQBAABCgA6FiEEKmCqpbOU668PNA69Ze02AX4ItwAFAlvNmJocHGJvcmlzLmJy ZXppbGxvbkBib290bGluLmNvbQAKCRBl7TYBfgi3AKr+D/4pH0gYSGDUstZsqUHW gZsPRU4jmOA+OCRbSxE03bOZOYR0UPgdoUYLfAKhZ5qxc7wHd3b47wykP0kUEviu lC8QhSs4DUA+EuMVPDVS4FlXRT0e3dMV7jh/IK6nInshD2YkaovyCqa6GvgwFEcM 7BCizxRhtHV8+fo7GVQWuMLb9ZfbEvz42D0sYOu6hIsF1SnRDvHOvfdDUFEXpJoF a2mC9ove65ChEzc2iZ/r260iZ4aoJYb9XJRJKWxmYeITZSmLNmcrUyGqAaNQ7NRc AIuPXASbeHGjrIuEfXpKYW07AE5MV1nJFSk3v4u5FjgoohOoobPp7npk+Lz/qwe3 y8/uW9qOQ/iEOsRnkvMGNu4Yjhw41L1a+J5wVvUvzmwHy1xMCrRHYB/8gXoZzemR A7XmCPwjAFVWv1WeKV2cs5MLW9yZq9QNMtGLlNs5OgFR6CccLk67/tHIkYLsJ/l9 IDXFhd/YKhTeF361u6Iimmgb427TwM71P9N+OMpv4nk46DxurpxkgGle3nslzKNU DOFNFlMGiPIx3h4X96AKER6u7cxiOXnLGq+XeHa/y8tIrziy3jg03YoTh0RSwKxV J3EXwh1sFLaeebwWEgpE3Ag1LOxpRCqJ2ED71SPzS/DR/938HBVsEoPqUEHNf1in jwsUB3cRzNwf+XX68DRCVT5GFA== =7w4w -----END PGP SIGNATURE----- Merge tag 'mtd/for-4.20' of git://git.infradead.org/linux-mtd Pull mtd updates from Boris Brezillon: "SPI NOR core changes: - Support non-uniform erase size - Support controllers with limited TX fifo size Driver changes: - m25p80: Re-issue a WREN command after each write access - cadence: Pass a proper dir value to dma_[un]map_single() - fsl-qspi: Check fsl_qspi_get_seqid() return val make sure 4B addressing opcodes are properly handled - intel-spi: Add a new PCI entry for Ice Lake Raw NAND core changes: - Two batchs of cleanups of the NAND API, including: * Deprecating a lot of interfaces (now replaced by ->exec_op()). * Moving code in separate drivers (JEDEC, ONFI), in private files (internals), in platform drivers, etc. * Functions/structures reordering. * Exclusive use of the nand_chip structure instead of the MTD one all across the subsystem. - Addition of the nand_wait_readrdy/rdy_op() helpers. Raw NAND controllers drivers changes: - Various coccinelle patches. - Marvell: * Use regmap_update_bits() for syscon access. * More documentation. * BCH failure path rework. * More layouts to be supported. * IRQ handler complete() condition fixed. - Fsl_ifc: * SRAM initialization fixed for newer controller versions. - Denali: * Fix licenses mismatch and use a SPDX tag. * Set SPARE_AREA_SKIP_BYTES register to 8 if unset. - Qualcomm: * Do not include dma-direct.h. - Docg4: * Removed. - Ams-delta: * Use of a GPIO lookup table * Internal machinery changes. Raw NAND chip drivers changes: - Toshiba: * Add support for Toshiba memory BENAND * Pass a single nand_chip object to the status helper. - ESMT: * New driver to retrieve the ECC requirements from the 5th ID byte. MTD changes: - physmap cleanups/fixe - gpio-addr-flash cleanups/fixes" * tag 'mtd/for-4.20' of git://git.infradead.org/linux-mtd: (93 commits) jffs2: free jffs2_sb_info through jffs2_kill_sb() mtd: spi-nor: fsl-quadspi: fix read error for flash size larger than 16MB mtd: spi-nor: intel-spi: Add support for Intel Ice Lake SPI serial flash mtd: maps: gpio-addr-flash: Convert to gpiod mtd: maps: gpio-addr-flash: Replace array with an integer mtd: maps: gpio-addr-flash: Use order instead of size mtd: spi-nor: fsl-quadspi: Don't let -EINVAL on the bus mtd: devices: m25p80: Make sure WRITE_EN is issued before each write mtd: spi-nor: Support controllers with limited TX FIFO size mtd: spi-nor: cadence-quadspi: Use proper enum for dma_[un]map_single mtd: spi-nor: parse SFDP Sector Map Parameter Table mtd: spi-nor: add support to non-uniform SFDP SPI NOR flash memories mtd: rawnand: marvell: fix the IRQ handler complete() condition mtd: rawnand: denali: set SPARE_AREA_SKIP_BYTES register to 8 if unset mtd: rawnand: r852: fix spelling mistake "card_registred" -> "card_registered" mtd: rawnand: toshiba: Pass a single nand_chip object to the status helper mtd: maps: gpio-addr-flash: Use devm_* functions mtd: maps: gpio-addr-flash: Fix ioremapped size mtd: maps: gpio-addr-flash: Replace custom printk mtd: physmap_of: Release resources on error ...	2018-10-23 01:09:22 +01:00
Filipe Manana	9084cb6a24	Btrfs: fix use-after-free when dumping free space We were iterating a block group's free space cache rbtree without locking first the lock that protects it (the free_space_ctl->free_space_offset rbtree is protected by the free_space_ctl->tree_lock spinlock). KASAN reported an use-after-free problem when iterating such a rbtree due to a concurrent rbtree delete: [ 9520.359168] ================================================================== [ 9520.359656] BUG: KASAN: use-after-free in rb_next+0x13/0x90 [ 9520.359949] Read of size 8 at addr ffff8800b7ada500 by task btrfs-transacti/1721 [ 9520.360357] [ 9520.360530] CPU: 4 PID: 1721 Comm: btrfs-transacti Tainted: G L 4.19.0-rc8-nbor #555 [ 9520.360990] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014 [ 9520.362682] Call Trace: [ 9520.362887] dump_stack+0xa4/0xf5 [ 9520.363146] print_address_description+0x78/0x280 [ 9520.363412] kasan_report+0x263/0x390 [ 9520.363650] ? rb_next+0x13/0x90 [ 9520.363873] __asan_load8+0x54/0x90 [ 9520.364102] rb_next+0x13/0x90 [ 9520.364380] btrfs_dump_free_space+0x146/0x160 [btrfs] [ 9520.364697] dump_space_info+0x2cd/0x310 [btrfs] [ 9520.364997] btrfs_reserve_extent+0x1ee/0x1f0 [btrfs] [ 9520.365310] __btrfs_prealloc_file_range+0x1cc/0x620 [btrfs] [ 9520.365646] ? btrfs_update_time+0x180/0x180 [btrfs] [ 9520.365923] ? _raw_spin_unlock+0x27/0x40 [ 9520.366204] ? btrfs_alloc_data_chunk_ondemand+0x2c0/0x5c0 [btrfs] [ 9520.366549] btrfs_prealloc_file_range_trans+0x23/0x30 [btrfs] [ 9520.366880] cache_save_setup+0x42e/0x580 [btrfs] [ 9520.367220] ? btrfs_check_data_free_space+0xd0/0xd0 [btrfs] [ 9520.367518] ? lock_downgrade+0x2f0/0x2f0 [ 9520.367799] ? btrfs_write_dirty_block_groups+0x11f/0x6e0 [btrfs] [ 9520.368104] ? kasan_check_read+0x11/0x20 [ 9520.368349] ? do_raw_spin_unlock+0xa8/0x140 [ 9520.368638] btrfs_write_dirty_block_groups+0x2af/0x6e0 [btrfs] [ 9520.368978] ? btrfs_start_dirty_block_groups+0x870/0x870 [btrfs] [ 9520.369282] ? do_raw_spin_unlock+0xa8/0x140 [ 9520.369534] ? _raw_spin_unlock+0x27/0x40 [ 9520.369811] ? btrfs_run_delayed_refs+0x1b8/0x230 [btrfs] [ 9520.370137] commit_cowonly_roots+0x4b9/0x610 [btrfs] [ 9520.370560] ? commit_fs_roots+0x350/0x350 [btrfs] [ 9520.370926] ? btrfs_run_delayed_refs+0x1b8/0x230 [btrfs] [ 9520.371285] btrfs_commit_transaction+0x5e5/0x10e0 [btrfs] [ 9520.371612] ? btrfs_apply_pending_changes+0x90/0x90 [btrfs] [ 9520.371943] ? start_transaction+0x168/0x6c0 [btrfs] [ 9520.372257] transaction_kthread+0x21c/0x240 [btrfs] [ 9520.372537] kthread+0x1d2/0x1f0 [ 9520.372793] ? btrfs_cleanup_transaction+0xb50/0xb50 [btrfs] [ 9520.373090] ? kthread_park+0xb0/0xb0 [ 9520.373329] ret_from_fork+0x3a/0x50 [ 9520.373567] [ 9520.373738] Allocated by task 1804: [ 9520.373974] kasan_kmalloc+0xff/0x180 [ 9520.374208] kasan_slab_alloc+0x11/0x20 [ 9520.374447] kmem_cache_alloc+0xfc/0x2d0 [ 9520.374731] __btrfs_add_free_space+0x40/0x580 [btrfs] [ 9520.375044] unpin_extent_range+0x4f7/0x7a0 [btrfs] [ 9520.375383] btrfs_finish_extent_commit+0x15f/0x4d0 [btrfs] [ 9520.375707] btrfs_commit_transaction+0xb06/0x10e0 [btrfs] [ 9520.376027] btrfs_alloc_data_chunk_ondemand+0x237/0x5c0 [btrfs] [ 9520.376365] btrfs_check_data_free_space+0x81/0xd0 [btrfs] [ 9520.376689] btrfs_delalloc_reserve_space+0x25/0x80 [btrfs] [ 9520.377018] btrfs_direct_IO+0x42e/0x6d0 [btrfs] [ 9520.377284] generic_file_direct_write+0x11e/0x220 [ 9520.377587] btrfs_file_write_iter+0x472/0xac0 [btrfs] [ 9520.377875] aio_write+0x25c/0x360 [ 9520.378106] io_submit_one+0xaa0/0xdc0 [ 9520.378343] __se_sys_io_submit+0xfa/0x2f0 [ 9520.378589] __x64_sys_io_submit+0x43/0x50 [ 9520.378840] do_syscall_64+0x7d/0x240 [ 9520.379081] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 9520.379387] [ 9520.379557] Freed by task 1802: [ 9520.379782] __kasan_slab_free+0x173/0x260 [ 9520.380028] kasan_slab_free+0xe/0x10 [ 9520.380262] kmem_cache_free+0xc1/0x2c0 [ 9520.380544] btrfs_find_space_for_alloc+0x4cd/0x4e0 [btrfs] [ 9520.380866] find_free_extent+0xa99/0x17e0 [btrfs] [ 9520.381166] btrfs_reserve_extent+0xd5/0x1f0 [btrfs] [ 9520.381474] btrfs_get_blocks_direct+0x60b/0xbd0 [btrfs] [ 9520.381761] __blockdev_direct_IO+0x10ee/0x58a1 [ 9520.382059] btrfs_direct_IO+0x25a/0x6d0 [btrfs] [ 9520.382321] generic_file_direct_write+0x11e/0x220 [ 9520.382623] btrfs_file_write_iter+0x472/0xac0 [btrfs] [ 9520.382904] aio_write+0x25c/0x360 [ 9520.383172] io_submit_one+0xaa0/0xdc0 [ 9520.383416] __se_sys_io_submit+0xfa/0x2f0 [ 9520.383678] __x64_sys_io_submit+0x43/0x50 [ 9520.383927] do_syscall_64+0x7d/0x240 [ 9520.384165] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 9520.384439] [ 9520.384610] The buggy address belongs to the object at ffff8800b7ada500 which belongs to the cache btrfs_free_space of size 72 [ 9520.385175] The buggy address is located 0 bytes inside of 72-byte region [ffff8800b7ada500, ffff8800b7ada548) [ 9520.385691] The buggy address belongs to the page: [ 9520.385957] page:ffffea0002deb680 count:1 mapcount:0 mapping:ffff880108a1d700 index:0x0 compound_mapcount: 0 [ 9520.388030] flags: 0x8100(slab\|head) [ 9520.388281] raw: 0000000000008100 ffffea0002deb608 ffffea0002728808 ffff880108a1d700 [ 9520.388722] raw: 0000000000000000 0000000000130013 00000001ffffffff 0000000000000000 [ 9520.389169] page dumped because: kasan: bad access detected [ 9520.389473] [ 9520.389658] Memory state around the buggy address: [ 9520.389943] ffff8800b7ada400: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [ 9520.390368] ffff8800b7ada480: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [ 9520.390796] >ffff8800b7ada500: fb fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc [ 9520.391223] ^ [ 9520.391461] ffff8800b7ada580: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [ 9520.391885] ffff8800b7ada600: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [ 9520.392313] ================================================================== [ 9520.392772] BTRFS critical (device vdc): entry offset 2258497536, bytes 131072, bitmap no [ 9520.393247] BUG: unable to handle kernel NULL pointer dereference at 0000000000000011 [ 9520.393705] PGD 800000010dbab067 P4D 800000010dbab067 PUD 107551067 PMD 0 [ 9520.394059] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN PTI [ 9520.394378] CPU: 4 PID: 1721 Comm: btrfs-transacti Tainted: G B L 4.19.0-rc8-nbor #555 [ 9520.394858] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014 [ 9520.395350] RIP: 0010:rb_next+0x3c/0x90 [ 9520.396461] RSP: 0018:ffff8801074ff780 EFLAGS: 00010292 [ 9520.396762] RAX: 0000000000000000 RBX: 0000000000000001 RCX: ffffffff81b5ac4c [ 9520.397115] RDX: 0000000000000000 RSI: 0000000000000008 RDI: 0000000000000011 [ 9520.397468] RBP: ffff8801074ff7a0 R08: ffffed0021d64ccc R09: ffffed0021d64ccc [ 9520.397821] R10: 0000000000000001 R11: ffffed0021d64ccb R12: ffff8800b91e0000 [ 9520.398188] R13: ffff8800a3ceba48 R14: ffff8800b627bf80 R15: 0000000000020000 [ 9520.398555] FS: 0000000000000000(0000) GS:ffff88010eb00000(0000) knlGS:0000000000000000 [ 9520.399007] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 9520.399335] CR2: 0000000000000011 CR3: 0000000106b52000 CR4: 00000000000006a0 [ 9520.399679] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 9520.400023] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 9520.400400] Call Trace: [ 9520.400648] btrfs_dump_free_space+0x146/0x160 [btrfs] [ 9520.400974] dump_space_info+0x2cd/0x310 [btrfs] [ 9520.401287] btrfs_reserve_extent+0x1ee/0x1f0 [btrfs] [ 9520.401609] __btrfs_prealloc_file_range+0x1cc/0x620 [btrfs] [ 9520.401952] ? btrfs_update_time+0x180/0x180 [btrfs] [ 9520.402232] ? _raw_spin_unlock+0x27/0x40 [ 9520.402522] ? btrfs_alloc_data_chunk_ondemand+0x2c0/0x5c0 [btrfs] [ 9520.402882] btrfs_prealloc_file_range_trans+0x23/0x30 [btrfs] [ 9520.403261] cache_save_setup+0x42e/0x580 [btrfs] [ 9520.403570] ? btrfs_check_data_free_space+0xd0/0xd0 [btrfs] [ 9520.403871] ? lock_downgrade+0x2f0/0x2f0 [ 9520.404161] ? btrfs_write_dirty_block_groups+0x11f/0x6e0 [btrfs] [ 9520.404481] ? kasan_check_read+0x11/0x20 [ 9520.404732] ? do_raw_spin_unlock+0xa8/0x140 [ 9520.405026] btrfs_write_dirty_block_groups+0x2af/0x6e0 [btrfs] [ 9520.405375] ? btrfs_start_dirty_block_groups+0x870/0x870 [btrfs] [ 9520.405694] ? do_raw_spin_unlock+0xa8/0x140 [ 9520.405958] ? _raw_spin_unlock+0x27/0x40 [ 9520.406243] ? btrfs_run_delayed_refs+0x1b8/0x230 [btrfs] [ 9520.406574] commit_cowonly_roots+0x4b9/0x610 [btrfs] [ 9520.406899] ? commit_fs_roots+0x350/0x350 [btrfs] [ 9520.407253] ? btrfs_run_delayed_refs+0x1b8/0x230 [btrfs] [ 9520.407589] btrfs_commit_transaction+0x5e5/0x10e0 [btrfs] [ 9520.407925] ? btrfs_apply_pending_changes+0x90/0x90 [btrfs] [ 9520.408262] ? start_transaction+0x168/0x6c0 [btrfs] [ 9520.408582] transaction_kthread+0x21c/0x240 [btrfs] [ 9520.408870] kthread+0x1d2/0x1f0 [ 9520.409138] ? btrfs_cleanup_transaction+0xb50/0xb50 [btrfs] [ 9520.409440] ? kthread_park+0xb0/0xb0 [ 9520.409682] ret_from_fork+0x3a/0x50 [ 9520.410508] Dumping ftrace buffer: [ 9520.410764] (ftrace buffer empty) [ 9520.411007] CR2: 0000000000000011 [ 9520.411297] ---[ end trace 01a0863445cf360a ]--- [ 9520.411568] RIP: 0010:rb_next+0x3c/0x90 [ 9520.412644] RSP: 0018:ffff8801074ff780 EFLAGS: 00010292 [ 9520.412932] RAX: 0000000000000000 RBX: 0000000000000001 RCX: ffffffff81b5ac4c [ 9520.413274] RDX: 0000000000000000 RSI: 0000000000000008 RDI: 0000000000000011 [ 9520.413616] RBP: ffff8801074ff7a0 R08: ffffed0021d64ccc R09: ffffed0021d64ccc [ 9520.414007] R10: 0000000000000001 R11: ffffed0021d64ccb R12: ffff8800b91e0000 [ 9520.414349] R13: ffff8800a3ceba48 R14: ffff8800b627bf80 R15: 0000000000020000 [ 9520.416074] FS: 0000000000000000(0000) GS:ffff88010eb00000(0000) knlGS:0000000000000000 [ 9520.416536] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 9520.416848] CR2: 0000000000000011 CR3: 0000000106b52000 CR4: 00000000000006a0 [ 9520.418477] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 9520.418846] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 9520.419204] Kernel panic - not syncing: Fatal exception [ 9520.419666] Dumping ftrace buffer: [ 9520.419930] (ftrace buffer empty) [ 9520.420168] Kernel Offset: disabled [ 9520.420406] ---[ end Kernel panic - not syncing: Fatal exception ]--- Fix this by acquiring the respective lock before iterating the rbtree. Reported-by: Nikolay Borisov <nborisov@suse.com> CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-10-22 20:31:22 +02:00
Linus Torvalds	6ab9e09238	for-4.20/block-20181021 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAlvNQKgQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgps+8D/9Iy6YIeoPwN10gYsqIh0P2fS3wKzL3kiww 3vFsWO78PzgLxUlNmB7teLtNFc/R5mi8becZmAdvs9za5YFZk56o3Ifv1x+e+z00 VY1/gxhiJD6suLeJ6lECnERGDaiWOZVRMo2TE17vxYGW6GGaa0Ts6PUUXmpla1u5 WKctgt0Qv9WVNyiIdLdeHqzKJwsSSwNTt8fK7eFhy3x8e0CwJr+GtXckbbW3LFkY lug0npsTli3EmEPMovZhd25SjZmTk5GTM+ADZQ7Tnv5KXoDWB9jn6TcCSAi3G+5d 5WUVwfnDyYJiH8qvlg5tRJ690muIy3xMOmpr7QBQ0YnR/LQ3EW+1CVfqD+qimgLH TXzlREXQpBP3YlxSDS5nddz4o5z84GZmC9B/43ujPaZKIQ6eBXYdkmQH7tPtSugm C6VGomR5tHotjxIiAsexh/5hAus+wW8bObKGTPTyINT0ub3XNclwCKLh26CgI9ie WvbS9g3j/KPvu/7s6weZpgD+cks0YdWe/XdXXxiHwsGI9h3J2aJna5RQt1rKWDm5 wGCgbc/B8eSwiWx+GXlqdB9/Dy/bGXOnSTDnKpEVl1f5zNjeLwUKXbjvkMefWs4m jEIcquuDETORY+ZYEfa5YbmS4Lhskr0kzMVTVkZ++81tAWpSCU9Xh3IHrR8TNpt+ J0oh0FHBDg== =LRTT -----END PGP SIGNATURE----- Merge tag 'for-4.20/block-20181021' of git://git.kernel.dk/linux-block Pull block layer updates from Jens Axboe: "This is the main pull request for block changes for 4.20. This contains: - Series enabling runtime PM for blk-mq (Bart). - Two pull requests from Christoph for NVMe, with items such as; - Better AEN tracking - Multipath improvements - RDMA fixes - Rework of FC for target removal - Fixes for issues identified by static checkers - Fabric cleanups, as prep for TCP transport - Various cleanups and bug fixes - Block merging cleanups (Christoph) - Conversion of drivers to generic DMA mapping API (Christoph) - Series fixing ref count issues with blkcg (Dennis) - Series improving BFQ heuristics (Paolo, et al) - Series improving heuristics for the Kyber IO scheduler (Omar) - Removal of dangerous bio_rewind_iter() API (Ming) - Apply single queue IPI redirection logic to blk-mq (Ming) - Set of fixes and improvements for bcache (Coly et al) - Series closing a hotplug race with sysfs group attributes (Hannes) - Set of patches for lightnvm: - pblk trace support (Hans) - SPDX license header update (Javier) - Tons of refactoring patches to cleanly abstract the 1.2 and 2.0 specs behind a common core interface. (Javier, Matias) - Enable pblk to use a common interface to retrieve chunk metadata (Matias) - Bug fixes (Various) - Set of fixes and updates to the blk IO latency target (Josef) - blk-mq queue number updates fixes (Jianchao) - Convert a bunch of drivers from the old legacy IO interface to blk-mq. This will conclude with the removal of the legacy IO interface itself in 4.21, with the rest of the drivers (me, Omar) - Removal of the DAC960 driver. The SCSI tree will introduce two replacement drivers for this (Hannes)" * tag 'for-4.20/block-20181021' of git://git.kernel.dk/linux-block: (204 commits) block: setup bounce bio_sets properly blkcg: reassociate bios when make_request() is called recursively blkcg: fix edge case for blk_get_rl() under memory pressure nvme-fabrics: move controller options matching to fabrics nvme-rdma: always have a valid trsvcid mtip32xx: fully switch to the generic DMA API rsxx: switch to the generic DMA API umem: switch to the generic DMA API sx8: switch to the generic DMA API sx8: remove dead IF_64BIT_DMA_IS_POSSIBLE code skd: switch to the generic DMA API ubd: remove use of blk_rq_map_sg nvme-pci: remove duplicate check drivers/block: Remove DAC960 driver nvme-pci: fix hot removal during error handling nvmet-fcloop: suppress a compiler warning nvme-core: make implicit seed truncation explicit nvmet-fc: fix kernel-doc headers nvme-fc: rework the request initialization code nvme-fc: introduce struct nvme_fcp_op_w_sgl ...	2018-10-22 17:46:08 +01:00
Kees Cook	1227daa43b	pstore/ram: Clarify resource reservation labels When ramoops reserved a memory region in the kernel, it had an unhelpful label of "persistent_memory". When reading /proc/iomem, it would be repeated many times, did not hint that it was ramoops in particular, and didn't clarify very much about what each was used for: 400000000-407ffffff : Persistent Memory (legacy) 400000000-400000fff : persistent_memory 400001000-400001fff : persistent_memory ... 4000ff000-4000fffff : persistent_memory Instead, this adds meaningful labels for how the various regions are being used: 400000000-407ffffff : Persistent Memory (legacy) 400000000-400000fff : ramoops:dump(0/252) 400001000-400001fff : ramoops:dump(1/252) ... 4000fc000-4000fcfff : ramoops:dump(252/252) 4000fd000-4000fdfff : ramoops:console 4000fe000-4000fe3ff : ramoops:ftrace(0/3) 4000fe400-4000fe7ff : ramoops:ftrace(1/3) 4000fe800-4000febff : ramoops:ftrace(2/3) 4000fec00-4000fefff : ramoops:ftrace(3/3) 4000ff000-4000fffff : ramoops:pmsg Signed-off-by: Kees Cook <keescook@chromium.org> Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org> Tested-by: Sai Prakash Ranjan <saiprakash.ranjan@codeaurora.org> Tested-by: Guenter Roeck <groeck@chromium.org>	2018-10-22 07:11:58 -07:00
Kees Cook	95047b0519	pstore: Refactor compression initialization This refactors compression initialization slightly to better handle getting potentially called twice (via early pstore_register() calls and later pstore_init()) and improves the comments and reporting to be more verbose. Signed-off-by: Kees Cook <keescook@chromium.org> Tested-by: Guenter Roeck <groeck@chromium.org>	2018-10-22 07:11:58 -07:00
Joel Fernandes (Google)	416031653e	pstore: Allocate compression during late_initcall() ramoops's call of pstore_register() was recently moved to run during late_initcall() because the crypto backend may not have been ready during postcore_initcall(). This meant early-boot crash dumps were not getting caught by pstore any more. Instead, lets allow calls to pstore_register() earlier, and once crypto is ready we can initialize the compression. Reported-by: Sai Prakash Ranjan <saiprakash.ranjan@codeaurora.org> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Tested-by: Sai Prakash Ranjan <saiprakash.ranjan@codeaurora.org> Fixes: `cb3bee0369` ("pstore: Use crypto compress API") [kees: trivial rebase] Signed-off-by: Kees Cook <keescook@chromium.org> Tested-by: Guenter Roeck <groeck@chromium.org>	2018-10-22 07:11:58 -07:00
Kees Cook	cb095afd44	pstore: Centralize init/exit routines In preparation for having additional actions during init/exit, this moves the init/exit into platform.c, centralizing the logic to make call outs to the fs init/exit. Signed-off-by: Kees Cook <keescook@chromium.org> Tested-by: Guenter Roeck <groeck@chromium.org>	2018-10-22 07:11:58 -07:00
Luis Henriques	ea4cdc548e	ceph: new mount option to disable usage of copy-from op Add a new mount option 'nocopyfrom' that will prevent the usage of the RADOS 'copy-from' operation in cephfs. This could be useful, for example, for an administrator to temporarily mitigate any possible bugs in the 'copy-from' implementation. Currently, only copy_file_range uses this RADOS operation. Setting this mount option will result in this syscall reverting to the default VFS implementation, i.e. to perform the copies locally instead of doing remote object copies. Signed-off-by: Luis Henriques <lhenriques@suse.com> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2018-10-22 10:28:24 +02:00
Luis Henriques	503f82a993	ceph: support copy_file_range file operation This commit implements support for the copy_file_range syscall in cephfs. It is implemented using the RADOS 'copy-from' operation, which allows to do a remote object copy, without the need to download/upload data from/to the OSDs. Some manual copy may however be required if the source/destination file offsets aren't object aligned or if the copy length is smaller than the object size. Signed-off-by: Luis Henriques <lhenriques@suse.com> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2018-10-22 10:28:24 +02:00
Luis Henriques	2ee9dd958d	ceph: add non-blocking parameter to ceph_try_get_caps() ceph_try_get_caps currently calls try_get_cap_refs with the nonblock parameter always set to 'true'. This change adds a new parameter that allows to set it's value. This will be useful for a follow-up patch that will need to get two sets of capabilities for two different inodes without risking a deadlock. Signed-off-by: Luis Henriques <lhenriques@suse.com> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2018-10-22 10:28:23 +02:00
Ilya Dryomov	0d9c1ab3be	libceph: preallocate message data items Currently message data items are allocated with ceph_msg_data_create() in setup_request_data() inside send_request(). send_request() has never been allowed to fail, so each allocation is followed by a BUG_ON: data = ceph_msg_data_create(...); BUG_ON(!data); It's been this way since support for multiple message data items was added in commit `6644ed7b7e` ("libceph: make message data be a pointer") in 3.10. There is no reason to delay the allocation of message data items until the last possible moment and we certainly don't need a linked list of them as they are only ever appended to the end and never erased. Make ceph_msg_new2() take max_data_items and adapt the rest of the code. Reported-by: Jerry Lee <leisurelysw24@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2018-10-22 10:28:22 +02:00
Ilya Dryomov	26f887e0a3	libceph, rbd, ceph: move ceph_osdc_alloc_messages() calls The current requirement is that ceph_osdc_alloc_messages() should be called after oid and oloc are known. In preparation for preallocating message data items, move ceph_osdc_alloc_messages() further down, so that it is called when OSD op codes are known. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2018-10-22 10:28:22 +02:00
Ilya Dryomov	61d2f85504	ceph: num_ops is off by one in ceph_aio_retry_work() Two OSD op slots are allocated, but only one is ever used. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2018-10-22 10:28:21 +02:00
Xuehan Xu	6680288441	ceph: set timeout conditionally in __cap_delay_requeue __cap_delay_requeue could be invoked through ceph_check_caps when there exists caps that needs to be sent and are delayed by "i_hold_caps_min" or "i_hold_caps_max". If __cap_delay_requeue sets timeout unconditionally, there could be a chance that some "wanted" caps can not be release for a long since their timeouts are reset every time they get delayed. Fixes: http://tracker.ceph.com/issues/36369 Signed-off-by: Xuehan Xu <xuxuehan@360.cn> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2018-10-22 10:28:21 +02:00
Ilya Dryomov	894868330a	libceph: don't consume a ref on pagelist in ceph_msg_data_add_pagelist() Because send_mds_reconnect() wants to send a message with a pagelist and pass the ownership to the messenger, ceph_msg_data_add_pagelist() consumes a ref which is then put in ceph_msg_data_destroy(). This makes managing pagelists in the OSD client (where they are wrapped in ceph_osd_data) unnecessarily hard because the handoff only happens in ceph_osdc_start_request() instead of when the pagelist is passed to ceph_osd_data_pagelist_init(). I counted several memory leaks on various error paths. Fix up ceph_msg_data_add_pagelist() and carry a pagelist ref in ceph_osd_data. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2018-10-22 10:28:21 +02:00
Ilya Dryomov	33165d4723	libceph: introduce ceph_pagelist_alloc() struct ceph_pagelist cannot be embedded into anything else because it has its own refcount. Merge allocation and initialization together. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2018-10-22 10:28:21 +02:00
Luis Henriques	bddff633ab	ceph: only allow punch hole mode in fallocate Current implementation of cephfs fallocate isn't correct as it doesn't really reserve the space in the cluster, which means that a subsequent call to a write may actually fail due to lack of space. In fact, it is currently possible to fallocate an amount space that is larger than the free space in the cluster. It has behaved this way since the initial commit `ad7a60de88` ("ceph: punch hole support"). Since there's no easy solution to fix this at the moment, this patch simply removes support for all fallocate operations but FALLOC_FL_PUNCH_HOLE (which implies FALLOC_FL_KEEP_SIZE). Link: https://tracker.ceph.com/issues/36317 Signed-off-by: Luis Henriques <lhenriques@suse.com> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2018-10-22 10:28:20 +02:00
Yan, Zheng	fce7a9744b	ceph: refactor ceph_sync_read() Avoid allocating memory for the entire user request: striped_read() does a synchronous OSD request per object, so it doesn't need more than object size worth of pages at a time. [ Preserve the comment, changelog. ] Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2018-10-22 10:28:20 +02:00
Yan, Zheng	74c9e6bf4c	ceph: check if LOOKUPNAME request was aborted when filling trace d_lookup()/d_alloc() require parent inode locked. Parent inode is not locked if request is aborted. Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2018-10-22 10:28:20 +02:00
Yan, Zheng	c58f450bd6	ceph: fix dentry leak in ceph_readdir_prepopulate Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2018-10-22 10:28:20 +02:00
Yan, Zheng	efe328230d	Revert "ceph: fix dentry leak in splice_dentry()" This reverts commit `8b8f53af1e`. splice_dentry() is used by three places. For two places, req->r_dentry is passed to splice_dentry(). In the case of error, req->r_dentry does not get updated. So splice_dentry() should not drop reference. Cc: stable@vger.kernel.org # 4.18+ Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2018-10-22 10:28:20 +02:00
Chengguang Xu	5da207993e	ceph: check snap first in ceph_set_acl() Do the snap check first in ceph_set_acl(), so we can avoid unnecessary operations when the inode has snap. Signed-off-by: Chengguang Xu <cgxu519@gmx.com> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2018-10-22 10:28:20 +02:00
Chengguang Xu	3167893ae6	ceph: reset cap hold timeout only for requeued inode __cap_delay_requeue() only requeue inode which does not have CEPH_I_FLUSH flag, so avoid reset cap hold timeout for that inode. Signed-off-by: Chengguang Xu <cgxu519@gmx.com> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2018-10-22 10:28:19 +02:00
Matthew Wilcox	a283348629	page cache: Finish XArray conversion With no more radix tree API users left, we can drop the GFP flags and use xa_init() instead of INIT_RADIX_TREE(). Signed-off-by: Matthew Wilcox <willy@infradead.org>	2018-10-21 10:46:44 -04:00
Matthew Wilcox	b15cd80068	dax: Convert page fault handlers to XArray This is the last part of DAX to be converted to the XArray so remove all the old helper functions. Signed-off-by: Matthew Wilcox <willy@infradead.org>	2018-10-21 10:46:44 -04:00
Matthew Wilcox	9f32d22130	dax: Convert dax_lock_mapping_entry to XArray Instead of always retrying when we slept, only retry if the page has moved. Signed-off-by: Matthew Wilcox <willy@infradead.org>	2018-10-21 10:46:44 -04:00
Matthew Wilcox	9fc747f68d	dax: Convert dax writeback to XArray Use XArray iteration instead of a pagevec. Signed-off-by: Matthew Wilcox <willy@infradead.org>	2018-10-21 10:46:43 -04:00
Matthew Wilcox	07f2d89cc2	dax: Convert __dax_invalidate_entry to XArray Avoids walking the radix tree multiple times looking for tags. Signed-off-by: Matthew Wilcox <willy@infradead.org>	2018-10-21 10:46:43 -04:00
Matthew Wilcox	084a899008	dax: Convert dax_layout_busy_page to XArray Instead of using a pagevec, just use the XArray iterators. Add a conditional rescheduling point which probably should have been there in the original. Signed-off-by: Matthew Wilcox <willy@infradead.org>	2018-10-21 10:46:43 -04:00
Matthew Wilcox	cfc93c6c6c	dax: Convert dax_insert_pfn_mkwrite to XArray Add some XArray-based helper functions to replace the radix tree based metaphors currently in use. The biggest change is that converted code doesn't see its own lock bit; get_unlocked_entry() always returns an entry with the lock bit clear. So we don't have to mess around loading the current entry and clearing the lock bit; we can just store the unlocked entry that we already have. Signed-off-by: Matthew Wilcox <willy@infradead.org>	2018-10-21 10:46:43 -04:00
Matthew Wilcox	ec4907ff69	dax: Hash on XArray instead of mapping Since the XArray is embedded in the struct address_space, its address contains exactly as much entropy as the address of the mapping. This patch is purely preparatory for later patches which will simplify the wait/wake interfaces. Signed-off-by: Matthew Wilcox <willy@infradead.org>	2018-10-21 10:46:43 -04:00
Matthew Wilcox	a77d19f46a	dax: Rename some functions Remove mentions of 'radix' and 'radix tree'. Simplify some names by dropping the word 'mapping'. Signed-off-by: Matthew Wilcox <willy@infradead.org>	2018-10-21 10:46:42 -04:00
Matthew Wilcox	5ec2d99de7	f2fs: Convert to XArray This is a straightforward conversion. Signed-off-by: Matthew Wilcox <willy@infradead.org>	2018-10-21 10:46:42 -04:00
Matthew Wilcox	f611ff6375	nilfs2: Convert to XArray This is close to a 1:1 replacement of radix tree APIs with their XArray equivalents. It would be possible to optimise nilfs_copy_back_pages(), but that doesn't seem to be in the performance path. Also, I think it has a pre-existing bug, and I've added a note to that effect in the source code. Signed-off-by: Matthew Wilcox <willy@infradead.org>	2018-10-21 10:46:42 -04:00
Matthew Wilcox	04edf02cdd	fs: Convert writeback to XArray A couple of short loops. Signed-off-by: Matthew Wilcox <willy@infradead.org>	2018-10-21 10:46:42 -04:00
Matthew Wilcox	ec82e1c1c8	fs: Convert buffer to XArray Mostly comment fixes, but one use of __xa_set_mark. Signed-off-by: Matthew Wilcox <willy@infradead.org>	2018-10-21 10:46:41 -04:00
Matthew Wilcox	0a943c65e7	btrfs: Convert page cache to XArray Signed-off-by: Matthew Wilcox <willy@infradead.org> Acked-by: David Sterba <dsterba@suse.com>	2018-10-21 10:46:41 -04:00
Matthew Wilcox	10bbd23585	pagevec: Use xa_mark_t Removes sparse warnings. Signed-off-by: Matthew Wilcox <willy@infradead.org>	2018-10-21 10:46:39 -04:00
Matthew Wilcox	0d3f929666	page cache: Convert hole search to XArray The page cache offers the ability to search for a miss in the previous or next N locations. Rather than teach the XArray about the page cache's definition of a miss, use xas_prev() and xas_next() to search the page array. This should be more efficient as it does not have to start the lookup from the top for each index. Signed-off-by: Matthew Wilcox <willy@infradead.org>	2018-10-21 10:46:33 -04:00
David S. Miller	2e2d6f0342	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net net/sched/cls_api.c has overlapping changes to a call to nlmsg_parse(), one (from 'net') added rtm_tca_policy instead of NULL to the 5th argument, and another (from 'net-next') added cb->extack instead of NULL to the 6th argument. net/ipv4/ipmr_base.c is a case of a bug fix in 'net' being done to code which moved (to mr_table_dump)) in 'net-next'. Thanks to David Ahern for the heads up. Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-19 11:03:06 -07:00
Bob Peterson	8e31582a9a	gfs2: Fix minor typo: couln't versus couldn't. Signed-off-by: Bob Peterson <rpeterso@redhat.com>	2018-10-19 11:34:04 -05:00
Filipe Manana	421f0922a2	Btrfs: fix use-after-free during inode eviction At inode.c:evict_inode_truncate_pages(), when we iterate over the inode's extent states, we access an extent state record's "state" field after we unlocked the inode's io tree lock. This can lead to a use-after-free issue because after we unlock the io tree that extent state record might have been freed due to being merged into another adjacent extent state record (a previous inflight bio for a read operation finished in the meanwhile which unlocked a range in the io tree and cause a merge of extent state records, as explained in the comment before the while loop added in commit `6ca0709756` ("Btrfs: fix hang during inode eviction due to concurrent readahead")). Fix this by keeping a copy of the extent state's flags in a local variable and using it after unlocking the io tree. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=201189 Fixes: `b9d0b38928` ("btrfs: Add handler for invalidate page") CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-10-19 12:25:33 +02:00
Josef Bacik	c495144bc6	btrfs: move the dio_sem higher up the callchain We're getting a lockdep splat because we take the dio_sem under the log_mutex. What we really need is to protect fsync() from logging an extent map for an extent we never waited on higher up, so just guard the whole thing with dio_sem. ====================================================== WARNING: possible circular locking dependency detected 4.18.0-rc4-xfstests-00025-g5de5edbaf1d4 #411 Not tainted ------------------------------------------------------ aio-dio-invalid/30928 is trying to acquire lock: 0000000092621cfd (&mm->mmap_sem){++++}, at: get_user_pages_unlocked+0x5a/0x1e0 but task is already holding lock: 00000000cefe6b35 (&ei->dio_sem){++++}, at: btrfs_direct_IO+0x3be/0x400 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #5 (&ei->dio_sem){++++}: lock_acquire+0xbd/0x220 down_write+0x51/0xb0 btrfs_log_changed_extents+0x80/0xa40 btrfs_log_inode+0xbaf/0x1000 btrfs_log_inode_parent+0x26f/0xa80 btrfs_log_dentry_safe+0x50/0x70 btrfs_sync_file+0x357/0x540 do_fsync+0x38/0x60 __ia32_sys_fdatasync+0x12/0x20 do_fast_syscall_32+0x9a/0x2f0 entry_SYSENTER_compat+0x84/0x96 -> #4 (&ei->log_mutex){+.+.}: lock_acquire+0xbd/0x220 __mutex_lock+0x86/0xa10 btrfs_record_unlink_dir+0x2a/0xa0 btrfs_unlink+0x5a/0xc0 vfs_unlink+0xb1/0x1a0 do_unlinkat+0x264/0x2b0 do_fast_syscall_32+0x9a/0x2f0 entry_SYSENTER_compat+0x84/0x96 -> #3 (sb_internal#2){.+.+}: lock_acquire+0xbd/0x220 __sb_start_write+0x14d/0x230 start_transaction+0x3e6/0x590 btrfs_evict_inode+0x475/0x640 evict+0xbf/0x1b0 btrfs_run_delayed_iputs+0x6c/0x90 cleaner_kthread+0x124/0x1a0 kthread+0x106/0x140 ret_from_fork+0x3a/0x50 -> #2 (&fs_info->cleaner_delayed_iput_mutex){+.+.}: lock_acquire+0xbd/0x220 __mutex_lock+0x86/0xa10 btrfs_alloc_data_chunk_ondemand+0x197/0x530 btrfs_check_data_free_space+0x4c/0x90 btrfs_delalloc_reserve_space+0x20/0x60 btrfs_page_mkwrite+0x87/0x520 do_page_mkwrite+0x31/0xa0 __handle_mm_fault+0x799/0xb00 handle_mm_fault+0x7c/0xe0 __do_page_fault+0x1d3/0x4a0 async_page_fault+0x1e/0x30 -> #1 (sb_pagefaults){.+.+}: lock_acquire+0xbd/0x220 __sb_start_write+0x14d/0x230 btrfs_page_mkwrite+0x6a/0x520 do_page_mkwrite+0x31/0xa0 __handle_mm_fault+0x799/0xb00 handle_mm_fault+0x7c/0xe0 __do_page_fault+0x1d3/0x4a0 async_page_fault+0x1e/0x30 -> #0 (&mm->mmap_sem){++++}: __lock_acquire+0x42e/0x7a0 lock_acquire+0xbd/0x220 down_read+0x48/0xb0 get_user_pages_unlocked+0x5a/0x1e0 get_user_pages_fast+0xa4/0x150 iov_iter_get_pages+0xc3/0x340 do_direct_IO+0xf93/0x1d70 __blockdev_direct_IO+0x32d/0x1c20 btrfs_direct_IO+0x227/0x400 generic_file_direct_write+0xcf/0x180 btrfs_file_write_iter+0x308/0x58c aio_write+0xf8/0x1d0 io_submit_one+0x3a9/0x620 __ia32_compat_sys_io_submit+0xb2/0x270 do_int80_syscall_32+0x5b/0x1a0 entry_INT80_compat+0x88/0xa0 other info that might help us debug this: Chain exists of: &mm->mmap_sem --> &ei->log_mutex --> &ei->dio_sem Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&ei->dio_sem); lock(&ei->log_mutex); lock(&ei->dio_sem); lock(&mm->mmap_sem); * DEADLOCK * 1 lock held by aio-dio-invalid/30928: #0: 00000000cefe6b35 (&ei->dio_sem){++++}, at: btrfs_direct_IO+0x3be/0x400 stack backtrace: CPU: 0 PID: 30928 Comm: aio-dio-invalid Not tainted 4.18.0-rc4-xfstests-00025-g5de5edbaf1d4 #411 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7 04/01/2014 Call Trace: dump_stack+0x7c/0xbb print_circular_bug.isra.37+0x297/0x2a4 check_prev_add.constprop.45+0x781/0x7a0 ? __lock_acquire+0x42e/0x7a0 validate_chain.isra.41+0x7f0/0xb00 __lock_acquire+0x42e/0x7a0 lock_acquire+0xbd/0x220 ? get_user_pages_unlocked+0x5a/0x1e0 down_read+0x48/0xb0 ? get_user_pages_unlocked+0x5a/0x1e0 get_user_pages_unlocked+0x5a/0x1e0 get_user_pages_fast+0xa4/0x150 iov_iter_get_pages+0xc3/0x340 do_direct_IO+0xf93/0x1d70 ? __alloc_workqueue_key+0x358/0x490 ? __blockdev_direct_IO+0x14b/0x1c20 __blockdev_direct_IO+0x32d/0x1c20 ? btrfs_run_delalloc_work+0x40/0x40 ? can_nocow_extent+0x490/0x490 ? kvm_clock_read+0x1f/0x30 ? can_nocow_extent+0x490/0x490 ? btrfs_run_delalloc_work+0x40/0x40 btrfs_direct_IO+0x227/0x400 ? btrfs_run_delalloc_work+0x40/0x40 generic_file_direct_write+0xcf/0x180 btrfs_file_write_iter+0x308/0x58c aio_write+0xf8/0x1d0 ? kvm_clock_read+0x1f/0x30 ? __might_fault+0x3e/0x90 io_submit_one+0x3a9/0x620 ? io_submit_one+0xe5/0x620 __ia32_compat_sys_io_submit+0xb2/0x270 do_int80_syscall_32+0x5b/0x1a0 entry_INT80_compat+0x88/0xa0 CC: stable@vger.kernel.org # 4.14+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-10-19 12:20:04 +02:00
Josef Bacik	30928e9baa	btrfs: don't run delayed_iputs in commit This could result in a really bad case where we do something like evict evict_refill_and_join btrfs_commit_transaction btrfs_run_delayed_iputs evict evict_refill_and_join btrfs_commit_transaction ... forever We have plenty of other places where we run delayed iputs that are much safer, let those do the work. CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-10-19 12:20:03 +02:00
Josef Bacik	80ee54bfe8	btrfs: fix insert_reserved error handling We were not handling the reserved byte accounting properly for data references. Metadata was fine, if it errored out the error paths would free the bytes_reserved count and pin the extent, but it even missed one of the error cases. So instead move this handling up into run_one_delayed_ref so we are sure that both cases are properly cleaned up in case of a transaction abort. CC: stable@vger.kernel.org # 4.18+ Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-10-19 12:20:03 +02:00
Josef Bacik	49940bdd57	btrfs: only free reserved extent if we didn't insert it When we insert the file extent once the ordered extent completes we free the reserved extent reservation as it'll have been migrated to the bytes_used counter. However if we error out after this step we'll still clear the reserved extent reservation, resulting in a negative accounting of the reserved bytes for the block group and space info. Fix this by only doing the free if we didn't successfully insert a file extent for this extent. CC: stable@vger.kernel.org # 4.14+ Reviewed-by: Omar Sandoval <osandov@fb.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-10-19 12:20:03 +02:00
Josef Bacik	fb5c39d7a8	btrfs: don't use ctl->free_space for max_extent_size max_extent_size is supposed to be the largest contiguous range for the space info, and ctl->free_space is the total free space in the block group. We need to keep track of these separately and _only_ use the max_free_space if we don't have a max_extent_size, as that means our original request was too large to search any of the block groups for and therefore wouldn't have a max_extent_size set. CC: stable@vger.kernel.org # 4.14+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-10-19 12:20:03 +02:00
Josef Bacik	ad22cf6ea4	btrfs: set max_extent_size properly We can't use entry->bytes if our entry is a bitmap entry, we need to use entry->max_extent_size in that case. Fix up all the logic to make this consistent. CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Josef Bacik <jbacik@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-10-19 12:20:03 +02:00
Josef Bacik	21a94f7acf	btrfs: reset max_extent_size properly If we use up our block group before allocating a new one we'll easily get a max_extent_size that's set really really low, which will result in a lot of fragmentation. We need to make sure we're resetting the max_extent_size when we add a new chunk or add new space. CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-10-19 12:20:03 +02:00
Boris Brezillon	042c1a5a60	NAND core changes: - Two batchs of cleanups of the NAND API, including: * Deprecating a lot of interfaces (now replaced by ->exec_op()). * Moving code in separate drivers (JEDEC, ONFI), in private files (internals), in platform drivers, etc. * Functions/structures reordering. * Exclusive use of the nand_chip structure instead of the MTD one all across the subsystem. - Addition of the nand_wait_readrdy/rdy_op() helpers. Raw NAND controllers drivers changes: - Various coccinelle patches. - Marvell: * Use regmap_update_bits() for syscon access. * More documentation. * BCH failure path rework. * More layouts to be supported. * IRQ handler complete() condition fixed. - Fsl_ifc: * SRAM initialization fixed for newer controller versions. - Denali: * Fix licenses mismatch and use a SPDX tag. * Set SPARE_AREA_SKIP_BYTES register to 8 if unset. - Qualcomm: * Do not include dma-direct.h. - Docg4: * Removed. - Ams-delta: * Use of a GPIO lookup table * Internal machinery changes. Raw NAND chip drivers changes: - Toshiba: * Add support for Toshiba memory BENAND * Pass a single nand_chip object to the status helper. - ESMT: * New driver to retrieve the ECC requirements from the 5th ID byte. -----BEGIN PGP SIGNATURE----- iQEzBAABCgAdFiEE9HuaYnbmDhq/XIDIJWrqGEe9VoQFAlvDdvgACgkQJWrqGEe9 VoQ0zQf/QfXYjLDbC/1FXckEkEeVZPRbXQMf+ptdTo0inJefy9LMOvhRKmuABCxn wwPa99aqQ+aHESnU/s2d8FUrhNRUf35lmRIAP512lJk79tBfgo73l5WSc6UTg6hO w3KbigOvTi7wCKQYGDFOBZuoFFCvpNsIEtWFuodCbYG1GmeAaL7L2fhbri9JbcE2 zPqmCv0w5BUrPR7i/taI5gD+KDtRdmCEyfb7wQoJy2XZVO0SKqGsBhDhYWqwCpck eP2UzQ4cFT8X1S0/AfUpbowc6cftdqIwW9VRY4inx3ALQ+sKxrkJ+0E3gL9lnZ+F p/is0gTdvDmnOjXfwGcL7GXGy0nN8Q== =EKv1 -----END PGP SIGNATURE----- Merge tag 'nand/for-4.20' of git://git.infradead.org/linux-mtd into mtd/next NAND core changes: - Two batchs of cleanups of the NAND API, including: * Deprecating a lot of interfaces (now replaced by ->exec_op()). * Moving code in separate drivers (JEDEC, ONFI), in private files (internals), in platform drivers, etc. * Functions/structures reordering. * Exclusive use of the nand_chip structure instead of the MTD one all across the subsystem. - Addition of the nand_wait_readrdy/rdy_op() helpers. Raw NAND controllers drivers changes: - Various coccinelle patches. - Marvell: * Use regmap_update_bits() for syscon access. * More documentation. * BCH failure path rework. * More layouts to be supported. * IRQ handler complete() condition fixed. - Fsl_ifc: * SRAM initialization fixed for newer controller versions. - Denali: * Fix licenses mismatch and use a SPDX tag. * Set SPARE_AREA_SKIP_BYTES register to 8 if unset. - Qualcomm: * Do not include dma-direct.h. - Docg4: * Removed. - Ams-delta: * Use of a GPIO lookup table * Internal machinery changes. Raw NAND chip drivers changes: - Toshiba: * Add support for Toshiba memory BENAND * Pass a single nand_chip object to the status helper. - ESMT: * New driver to retrieve the ECC requirements from the 5th ID byte.	2018-10-19 09:20:09 +02:00
Benjamin Coddington	fc187514d8	nfs: remove redundant call to nfs_context_set_write_error() We don't need to call this in the direct, read, or pnfs resend paths and the only other caller is the write path in nfs_page_async_flush() which already checks and sets the pg_error on the context. Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>	2018-10-18 17:20:57 -04:00
Benjamin Coddington	fdbd1a2e4a	nfs: Fix a missed page unlock after pg_doio() We must check pg_error and call error_cleanup after any call to pg_doio. Currently, we are skipping the unlock of a page if we encounter an error in nfs_pageio_complete() before handing off the work to the RPC layer. Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>	2018-10-18 17:20:57 -04:00
Mike Marshall	22fc9db296	orangefs: no need to check for service_operation returns > 0 service_operation returns > 0 is undefined. Signed-off-by: Mike Marshall <hubcap@omnibond.com>	2018-10-18 14:05:46 -04:00
Mike Marshall	34e6148a2c	orangefs: some error code paths missed kmem_cache_free If a slab cache object is allocated, it needs to be freed eventually, certainly before anyone unloads the module that allocated it. Signed-off-by: Mike Marshall <hubcap@omnibond.com>	2018-10-18 13:58:25 -04:00
Mike Marshall	b5d72cdc53	orangefs: don't let orangefs_iget return NULL. Suggested by Dan Carpenter. Signed-off-by: Mike Marshall <hubcap@omnibond.com>	2018-10-18 13:52:23 -04:00
Mike Marshall	56249998b2	orangefs: don't let orangefs_new_inode return NULL Suggested by Dan Carpenter Signed-off-by: Mike Marshall <hubcap@omnibond.com>	2018-10-18 13:47:16 -04:00
Eric Sandeen	fa520c47ea	fscache: Fix out of bound read in long cookie keys fscache_set_key() can incur an out-of-bounds read, reported by KASAN: BUG: KASAN: slab-out-of-bounds in fscache_alloc_cookie+0x5b3/0x680 [fscache] Read of size 4 at addr ffff88084ff056d4 by task mount.nfs/32615 and also reported by syzbot at https://lkml.org/lkml/2018/7/8/236 BUG: KASAN: slab-out-of-bounds in fscache_set_key fs/fscache/cookie.c:120 [inline] BUG: KASAN: slab-out-of-bounds in fscache_alloc_cookie+0x7a9/0x880 fs/fscache/cookie.c:171 Read of size 4 at addr ffff8801d3cc8bb4 by task syz-executor907/4466 This happens for any index_key_len which is not divisible by 4 and is larger than the size of the inline key, because the code allocates exactly index_key_len for the key buffer, but the hashing loop is stepping through it 4 bytes (u32) at a time in the buf[] array. Fix this by calculating how many u32 buffers we'll need by using DIV_ROUND_UP, and then using kcalloc() to allocate a precleared allocation buffer to hold the index_key, then using that same count as the hashing index limit. Fixes: `ec0328e46d` ("fscache: Maintain a catalogue of allocated cookies") Reported-by: syzbot+a95b989b2dde8e806af8@syzkaller.appspotmail.com Signed-off-by: Eric Sandeen <sandeen@redhat.com> Cc: stable <stable@vger.kernel.org> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2018-10-18 11:32:21 +02:00
David Howells	1ff22883b0	fscache: Fix incomplete initialisation of inline key space The inline key in struct rxrpc_cookie is insufficiently initialized, zeroing only 3 of the 4 slots, therefore an index_key_len between 13 and 15 bytes will end up hashing uninitialized memory because the memcpy only partially fills the last buf[] element. Fix this by clearing fscache_cookie objects on allocation rather than using the slab constructor to initialise them. We're going to pretty much fill in the entire struct anyway, so bringing it into our dcache writably shouldn't incur much overhead. This removes the need to do clearance in fscache_set_key() (where we aren't doing it correctly anyway). Also, we don't need to set cookie->key_len in fscache_set_key() as we already did it in the only caller, so remove that. Fixes: `ec0328e46d` ("fscache: Maintain a catalogue of allocated cookies") Reported-by: syzbot+a95b989b2dde8e806af8@syzkaller.appspotmail.com Reported-by: Eric Sandeen <sandeen@redhat.com> Cc: stable <stable@vger.kernel.org> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2018-10-18 11:32:21 +02:00
Al Viro	169b803397	cachefiles: fix the race between cachefiles_bury_object() and rmdir(2) the victim might've been rmdir'ed just before the lock_rename(); unlike the normal callers, we do not look the source up after the parents are locked - we know it beforehand and just recheck that it's still the child of what used to be its parent. Unfortunately, the check is too weak - we don't spot a dead directory since its ->d_parent is unchanged, dentry is positive, etc. So we sail all the way to ->rename(), with hosting filesystems _not_ expecting to be asked renaming an rmdir'ed subdirectory. The fix is easy, fortunately - the lock on parent is sufficient for making IS_DEADDIR() on child safe. Cc: stable@vger.kernel.org Fixes: `9ae326a690` (CacheFiles: A cache that backs onto a mounted filesystem) Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2018-10-18 11:32:21 +02:00
Christoph Hellwig	96987eea53	xfs: cancel COW blocks before swapext We need to make sure we have no outstanding COW blocks before we swap extents, as there is nothing preventing us from having preallocated COW delalloc on either inode that swapext is called on. That case can easily be reproduced by running generic/324 in always_cow mode: [ 620.760572] XFS: Assertion failed: tip->i_delayed_blks == 0, file: fs/xfs/xfs_bmap_util.c, line: 1669 [ 620.761608] ------------[ cut here ]------------ [ 620.762171] kernel BUG at fs/xfs/xfs_message.c:102! [ 620.762732] invalid opcode: 0000 [#1] SMP PTI [ 620.763272] CPU: 0 PID: 24153 Comm: xfs_fsr Tainted: G W 4.19.0-rc1+ #4182 [ 620.764203] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.1-1 04/01/2014 [ 620.765202] RIP: 0010:assfail+0x20/0x28 [ 620.765646] Code: 31 ff e8 83 fc ff ff 0f 0b c3 48 89 f1 41 89 d0 48 c7 c6 48 ca 8d 82 48 89 fa 38 [ 620.767758] RSP: 0018:ffffc9000898bc10 EFLAGS: 00010202 [ 620.768359] RAX: 0000000000000000 RBX: ffff88012f14ba40 RCX: 0000000000000000 [ 620.769174] RDX: 00000000ffffffc0 RSI: 000000000000000a RDI: ffffffff828560d9 [ 620.769982] RBP: ffff88012f14b300 R08: 0000000000000000 R09: 0000000000000000 [ 620.770788] R10: 000000000000000a R11: f000000000000000 R12: ffffc9000898bc98 [ 620.771638] R13: ffffc9000898bc9c R14: ffff880130b5e2b8 R15: ffff88012a1fa2a8 [ 620.772504] FS: 00007fdc36e0fbc0(0000) GS:ffff88013ba00000(0000) knlGS:0000000000000000 [ 620.773475] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 620.774168] CR2: 00007fdc3604d000 CR3: 0000000132afc000 CR4: 00000000000006f0 [ 620.774978] Call Trace: [ 620.775274] xfs_swap_extent_forks+0x2a0/0x2e0 [ 620.775792] xfs_swap_extents+0x38b/0xab0 [ 620.776256] xfs_ioc_swapext+0x121/0x140 [ 620.776709] xfs_file_ioctl+0x328/0xc90 [ 620.777154] ? rcu_read_lock_sched_held+0x50/0x60 [ 620.777694] ? xfs_iunlock+0x233/0x260 [ 620.778127] ? xfs_setattr_nonsize+0x3be/0x6a0 [ 620.778647] do_vfs_ioctl+0x9d/0x680 [ 620.779071] ? ksys_fchown+0x47/0x80 [ 620.779552] ksys_ioctl+0x35/0x70 [ 620.780040] __x64_sys_ioctl+0x11/0x20 [ 620.780530] do_syscall_64+0x4b/0x190 [ 620.780927] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 620.781467] RIP: 0033:0x7fdc364d0f07 [ 620.781900] Code: b3 66 90 48 8b 05 81 5f 2c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 28 [ 620.784044] RSP: 002b:00007ffe2a766038 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 [ 620.784896] RAX: ffffffffffffffda RBX: 0000000000000025 RCX: 00007fdc364d0f07 [ 620.785667] RDX: 0000560296ca2fc0 RSI: 00000000c0c0586d RDI: 0000000000000005 [ 620.786398] RBP: 0000000000000025 R08: 0000000000001200 R09: 0000000000000000 [ 620.787283] R10: 0000000000000432 R11: 0000000000000246 R12: 0000000000000005 [ 620.788051] R13: 0000000000000000 R14: 0000000000001000 R15: 0000000000000006 [ 620.788927] Modules linked in: [ 620.789340] ---[ end trace 9503b7417ffdbdb0 ]--- [ 620.790065] RIP: 0010:assfail+0x20/0x28 [ 620.790642] Code: 31 ff e8 83 fc ff ff 0f 0b c3 48 89 f1 41 89 d0 48 c7 c6 48 ca 8d 82 48 89 fa 38 [ 620.793038] RSP: 0018:ffffc9000898bc10 EFLAGS: 00010202 [ 620.793609] RAX: 0000000000000000 RBX: ffff88012f14ba40 RCX: 0000000000000000 [ 620.794317] RDX: 00000000ffffffc0 RSI: 000000000000000a RDI: ffffffff828560d9 [ 620.795025] RBP: ffff88012f14b300 R08: 0000000000000000 R09: 0000000000000000 [ 620.795778] R10: 000000000000000a R11: f000000000000000 R12: ffffc9000898bc98 [ 620.796675] R13: ffffc9000898bc9c R14: ffff880130b5e2b8 R15: ffff88012a1fa2a8 [ 620.797782] FS: 00007fdc36e0fbc0(0000) GS:ffff88013ba00000(0000) knlGS:0000000000000000 [ 620.798908] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 620.799594] CR2: 00007fdc3604d000 CR3: 0000000132afc000 CR4: 00000000000006f0 [ 620.800424] Kernel panic - not syncing: Fatal exception [ 620.801191] Kernel Offset: disabled [ 620.801597] ---[ end Kernel panic - not syncing: Fatal exception ]--- Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2018-10-18 17:21:55 +11:00
Brian Foster	efc3289cf8	xfs: clear ail delwri queued bufs on unmount of shutdown fs In the typical unmount case, the AIL is forced out by the unmount sequence before the xfsaild task is stopped. Since AIL items are removed on writeback completion, this means that the AIL ->ail_buf_list delwri queue has been drained. This is not always true in the shutdown case, however. It's possible for buffers to sit on a delwri queue for a period of time across submission attempts if said items are locked or have been relogged and pinned since first added to the queue. If the attempt to log such an item results in a log I/O error, the error processing can shutdown the fs, remove the item from the AIL, stale the buffer (dropping the LRU reference) and clear its delwri queue state. The latter bit means the buffer will be released from a delwri queue on the next submission attempt, but this might never occur if the filesystem has shutdown and the AIL is empty. This means that such buffers are held indefinitely by the AIL delwri queue across destruction of the AIL. Aside from being a memory leak, these buffers can also hold references to in-core perag structures. The latter problem manifests as a generic/475 failure, reproducing the following asserts at unmount time: XFS: Assertion failed: atomic_read(&pag->pag_ref) == 0, file: fs/xfs/xfs_mount.c, line: 151 XFS: Assertion failed: atomic_read(&pag->pag_ref) == 0, file: fs/xfs/xfs_mount.c, line: 132 To prevent this problem, clear the AIL delwri queue as a final step before xfsaild() exit. The !empty state should never occur in the normal case, so add an assert to catch unexpected problems going forward. [dgc: add comment explaining need for xfs_buf_delwri_cancel() after calling xfs_buf_delwri_submit_nowait().] Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2018-10-18 17:21:49 +11:00
Carlos Maiolino	26ca39015e	xfs: use offsetof() in place of offset macros for __xfsstats Most offset macro mess is used in xfs_stats_format() only, and we can simply get the right offsets using offsetof(), instead of several macros to mark the offsets inside __xfsstats structure. Replace all XFSSTAT_END_* macros by a single helper macro to get the right offset into __xfsstats, and use this helper in xfs_stats_format() directly. The quota stats code, still looks a bit cleaner when using XFSSTAT_* macros, so, this patch also defines XFSSTAT_START_XQMSTAT and XFSSTAT_END_XQMSTAT locally to that code. This also should prevent offset mistakes when updates are done into __xfsstats. Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2018-10-18 17:21:39 +11:00
Carlos Maiolino	41657e5507	xfs: Fix xqmstats offsets in /proc/fs/xfs/xqmstat The addition of FIBT, RMAP and REFCOUNT changed the offsets into __xfssats structure. This caused xqmstat_proc_show() to display garbage data via /proc/fs/xfs/xqmstat, once it relies on the offsets marked via macros. Fix it. Fixes: `00f4e4f9` xfs: add rmap btree stats infrastructure Fixes: `aafc3c24` xfs: support the XFS_BTNUM_FINOBT free inode btree type Fixes: `46eeb521` xfs: introduce refcount btree definitions Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2018-10-18 17:21:34 +11:00
Dave Chinner	37fd167824	xfs: fix use-after-free race in xfs_buf_rele When looking at a 4.18 based KASAN use after free report, I noticed that racing xfs_buf_rele() may race on dropping the last reference to the buffer and taking the buffer lock. This was the symptom displayed by the KASAN report, but the actual issue that was reported had already been fixed in 4.19-rc1 by commit `e339dd8d8b` ("xfs: use sync buffer I/O for sync delwri queue submission"). Despite this, I think there is still an issue with xfs_buf_rele() in this code: release = atomic_dec_and_lock(&bp->b_hold, &pag->pag_buf_lock); spin_lock(&bp->b_lock); if (!release) { ..... If two threads race on the b_lock after both dropping a reference and one getting dropping the last reference so release = true, we end up with: CPU 0 CPU 1 atomic_dec_and_lock() atomic_dec_and_lock() spin_lock(&bp->b_lock) spin_lock(&bp->b_lock) <spins> <release = true bp->b_lru_ref = 0> <remove from lists> freebuf = true spin_unlock(&bp->b_lock) xfs_buf_free(bp) <gets lock, reading and writing freed memory> <accesses freed memory> spin_unlock(&bp->b_lock) <reads/writes freed memory> IOWs, we can't safely take bp->b_lock after dropping the hold reference because the buffer may go away at any time after we drop that reference. However, this can be fixed simply by taking the bp->b_lock before we drop the reference. It is safe to nest the pag_buf_lock inside bp->b_lock as the pag_buf_lock is only used to serialise against lookup in xfs_buf_find() and no other locks are held over or under the pag_buf_lock there. Make this clear by documenting the buffer lock orders at the top of the file. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com Signed-off-by: Dave Chinner <david@fromorbit.com>	2018-10-18 17:21:29 +11:00
Allison Henderson	068f985a9e	xfs: Add attibute remove and helper functions This patch adds xfs_attr_remove_args. These sub-routines remove the attributes specified in @args. We will use this later for setting parent pointers as a deferred attribute operation. Signed-off-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2018-10-18 17:21:23 +11:00
Allison Henderson	2f3cd80919	xfs: Add attibute set and helper functions This patch adds xfs_attr_set_args and xfs_bmap_set_attrforkoff. These sub-routines set the attributes specified in @args. We will use this later for setting parent pointers as a deferred attribute operation. [dgc: remove attr fork init code from xfs_attr_set_args().] [dgc: xfs_attr_try_sf_addname() NULLs args.trans after commit.] [dgc: correct sf add error handling.] Signed-off-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2018-10-18 17:21:16 +11:00
Allison Henderson	4c74a56b9d	xfs: Add helper function xfs_attr_try_sf_addname This patch adds a subroutine xfs_attr_try_sf_addname used by xfs_attr_set. This subrotine will attempt to add the attribute name specified in args in shortform, as well and perform error handling previously done in xfs_attr_set. This patch helps to pre-simplify xfs_attr_set for reviewing purposes and reduce indentation. New function will be added in the next patch. [dgc: moved commit to helper function, too.] Signed-off-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2018-10-18 17:20:50 +11:00
Allison Henderson	e2421f0b5f	xfs: Move fs/xfs/xfs_attr.h to fs/xfs/libxfs/xfs_attr.h This patch moves fs/xfs/xfs_attr.h to fs/xfs/libxfs/xfs_attr.h since xfs_attr.c is in libxfs. We will need these later in xfsprogs. Signed-off-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2018-10-18 17:20:45 +11:00
Dave Chinner	56668a5cc4	xfs: issue log message on user force shutdown The kernel only issues a log message that it's been shut down when the filesystem triggers a shutdown itself. Hence there is no trace in the log when a shutdown is triggered manually from userspace. This can make it hard to see sequence of events in the log when things go wrong, so make sure we always log a message when a shutdown is run. While there, clean up the logic flow so we don't have to continually check if the shutdown trigger was user initiated before logging shutdown messages. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2018-10-18 17:20:39 +11:00
Darrick J. Wong	38b6238eb6	xfs: fix buffer state management in xrep_findroot_block We don't handle buffer state properly in online repair's findroot routine. If a buffer already has b_ops set, we don't ever want to touch that, and we don't want to call the read verifiers on a buffer that could be dirty (CRCs are only recomputed during log checkpoints). Therefore, be more careful about what we do with a buffer -- if someone else already attached ops that are not the ones for this btree type, just ignore the buffer. We only attach our btree type's buf ops if it matches the magic/uuid and structure checks. We also modify xfs_buf_read_map to allow callers to set buffer ops on a DONE buffer with NULL ops so that repair doesn't leave behind buffers which won't have buffers attached to them. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2018-10-18 17:20:35 +11:00
Darrick J. Wong	1aff5696f3	xfs: always assign buffer verifiers when one is provided If a caller supplies buffer ops when trying to read a buffer and the buffer doesn't already have buf ops assigned, ensure that the ops are assigned to the buffer and the verifier is run on that buffer. Note that current XFS code is careful to assign buffer ops after a xfs_{trans_,}buf_read call in which ops were not supplied. However, we should apply ops defensively in case there is ever a coding mistake; and an upcoming repair patch will need to be able to read a buffer without assigning buf ops. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>	2018-10-18 17:20:30 +11:00
Darrick J. Wong	1002ff45ef	xfs: xrep_findroot_block should reject root blocks with siblings In xrep_findroot_block, if we find a candidate root block with sibling pointers or sibling blocks on the same tree level, we should not return that block as a tree root because root blocks cannot have siblings. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2018-10-18 17:20:26 +11:00
Adam Borowski	dddde68b8f	xfs: add a define for statfs magic to uapi Needed by userspace programs that call fstatfs(). It'd be natural to publish XFS_SB_MAGIC in uapi, but while these two have identical values, they have different semantic meaning: one is an enum cookie meant for statfs, the other a signature of the on-disk format. Signed-off-by: Adam Borowski <kilobyte@angband.pl> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2018-10-18 17:20:19 +11:00
Christoph Hellwig	4831822ff1	xfs: print dangling delalloc extents Instead of just asserting that we have no delalloc space dangling in an inode that gets freed print the actual offenders for debug mode. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2018-10-18 17:20:11 +11:00
Christoph Hellwig	032dc923b2	xfs: fix fork selection in xfs_find_trim_cow_extent We should want to write directly into the data fork for blocks that don't have an extent in the COW fork covering them yet. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2018-10-18 17:19:58 +11:00
Christoph Hellwig	d392bc81bb	xfs: remove the unused trimmed argument from xfs_reflink_trim_around_shared Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2018-10-18 17:19:48 +11:00
Christoph Hellwig	fc439464e3	xfs: remove the unused shared argument to xfs_reflink_reserve_cow Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2018-10-18 17:19:37 +11:00
Christoph Hellwig	0365c5d6c3	xfs: handle zeroing in xfs_file_iomap_begin_delay We only need to allocate blocks for zeroing for reflink inodes, and for we currently have a special case for reflink files in the otherwise direct I/O path that I'd like to get rid of. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2018-10-18 17:19:26 +11:00
Christoph Hellwig	daa79baefc	xfs: remove suport for filesystems without unwritten extent flag The option to enable unwritten extents was made default in 2003, removed from mkfs in 2007, and cannot be disabled in v5. We also rely on it for a lot of common functionality, so filesystems without it will run a completely untested and buggy code path. Enabling the support also is a simple bit flip using xfs_db, so legacy file systems can still be brought forward. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2018-10-18 17:18:58 +11:00
Christoph Hellwig	97e5a6e6dc	xfs: remove XFS_IO_INVALID The invalid state isn't any different from a hole, so merge the two states. Use the more descriptive hole name, but keep it as the first value of the enum to catch uninitialized fields. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2018-10-18 17:17:50 +11:00
Chengguang Xu	3642b29a63	fs/exofs: only use true/false for asignment of bool type variable Signed-off-by: Chengguang Xu <cgxu519@gmx.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2018-10-18 02:04:59 -04:00
Chengguang Xu	515f1867ad	fs/exofs: fix potential memory leak in mount option parsing There are some cases can cause memory leak when parsing option 'osdname'. Signed-off-by: Chengguang Xu <cgxu519@gmx.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2018-10-18 02:04:59 -04:00
nixiaoming	55338ac2a9	Delete invalid assignment statements in do_sendfile Assigning value -EINVAL to "retval" here, but that stored value is overwritten before it can be used. retval = -EINVAL; .... retval = rw_verify_area(WRITE, out.file, &out_pos, count); value_overwrite: Overwriting previous write to "retval" with value from rw_verify_area delete invalid assignment statements Signed-off-by: n00202754 <nixiaoming@huawei.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2018-10-18 01:42:49 -04:00
Yue Haibing	d65b1f2029	iomap: remove duplicated include from iomap.c Remove duplicated include. Signed-off-by: Yue Haibing <yuehaibing@huawei.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2018-10-18 01:18:55 -04:00
Mark Fasheh	85c95f208f	vfs: dedupe should return EPERM if permission is not granted Right now we return EINVAL if a process does not have permission to dedupe a file. This was an oversight on my part. EPERM gives a true description of the nature of our error, and EINVAL is already used for the case that the filesystem does not support dedupe. Signed-off-by: Mark Fasheh <mfasheh@suse.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Acked-by: David Sterba <dsterba@suse.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2018-10-17 21:15:39 -04:00
Mark Fasheh	5de4480ae7	vfs: allow dedupe of user owned read-only files The permission check in vfs_dedupe_file_range_one() is too coarse - We only allow dedupe of the destination file if the user is root, or they have the file open for write. This effectively limits a non-root user from deduping their own read-only files. In addition, the write file descriptor that the user is forced to hold open can prevent execution of files. As file data during a dedupe does not change, the behavior is unexpected and this has caused a number of issue reports. For an example, see: https://github.com/markfasheh/duperemove/issues/129 So change the check so we allow dedupe on the target if: - the root or admin is asking for it - the process has write access - the owner of the file is asking for the dedupe - the process could get write access That way users can open read-only and still get dedupe. Signed-off-by: Mark Fasheh <mfasheh@suse.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2018-10-17 21:15:39 -04:00
Lu Fengqi	0a9df0df17	btrfs: delayed-ref: extract find_first_ref_head from find_ref_head The find_ref_head shouldn't return the first entry even if no exact match is found. So move the hidden behavior to higher level. Besides, remove the useless local variables in the btrfs_select_ref_head. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> [ reformat comment ] Signed-off-by: David Sterba <dsterba@suse.com>	2018-10-17 19:21:00 +02:00
Filipe Manana	5ce555578e	Btrfs: fix deadlock when writing out free space caches When writing out a block group free space cache we can end deadlocking with ourselves on an extent buffer lock resulting in a warning like the following: [245043.379979] WARNING: CPU: 4 PID: 2608 at fs/btrfs/locking.c:251 btrfs_tree_lock+0x1be/0x1d0 [btrfs] [245043.392792] CPU: 4 PID: 2608 Comm: btrfs-transacti Tainted: G W I 4.16.8 #1 [245043.395489] RIP: 0010:btrfs_tree_lock+0x1be/0x1d0 [btrfs] [245043.396791] RSP: 0018:ffffc9000424b840 EFLAGS: 00010246 [245043.398093] RAX: 0000000000000a30 RBX: ffff8807e20a3d20 RCX: 0000000000000001 [245043.399414] RDX: 0000000000000001 RSI: 0000000000000002 RDI: ffff8807e20a3d20 [245043.400732] RBP: 0000000000000001 R08: ffff88041f39a700 R09: ffff880000000000 [245043.402021] R10: 0000000000000040 R11: ffff8807e20a3d20 R12: ffff8807cb220630 [245043.403296] R13: 0000000000000001 R14: ffff8807cb220628 R15: ffff88041fbdf000 [245043.404780] FS: 0000000000000000(0000) GS:ffff88082fc80000(0000) knlGS:0000000000000000 [245043.406050] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [245043.407321] CR2: 00007fffdbdb9f10 CR3: 0000000001c09005 CR4: 00000000000206e0 [245043.408670] Call Trace: [245043.409977] btrfs_search_slot+0x761/0xa60 [btrfs] [245043.411278] btrfs_insert_empty_items+0x62/0xb0 [btrfs] [245043.412572] btrfs_insert_item+0x5b/0xc0 [btrfs] [245043.413922] btrfs_create_pending_block_groups+0xfb/0x1e0 [btrfs] [245043.415216] do_chunk_alloc+0x1e5/0x2a0 [btrfs] [245043.416487] find_free_extent+0xcd0/0xf60 [btrfs] [245043.417813] btrfs_reserve_extent+0x96/0x1e0 [btrfs] [245043.419105] btrfs_alloc_tree_block+0xfb/0x4a0 [btrfs] [245043.420378] __btrfs_cow_block+0x127/0x550 [btrfs] [245043.421652] btrfs_cow_block+0xee/0x190 [btrfs] [245043.422979] btrfs_search_slot+0x227/0xa60 [btrfs] [245043.424279] ? btrfs_update_inode_item+0x59/0x100 [btrfs] [245043.425538] ? iput+0x72/0x1e0 [245043.426798] write_one_cache_group.isra.49+0x20/0x90 [btrfs] [245043.428131] btrfs_start_dirty_block_groups+0x102/0x420 [btrfs] [245043.429419] btrfs_commit_transaction+0x11b/0x880 [btrfs] [245043.430712] ? start_transaction+0x8e/0x410 [btrfs] [245043.432006] transaction_kthread+0x184/0x1a0 [btrfs] [245043.433341] kthread+0xf0/0x130 [245043.434628] ? btrfs_cleanup_transaction+0x4e0/0x4e0 [btrfs] [245043.435928] ? kthread_create_worker_on_cpu+0x40/0x40 [245043.437236] ret_from_fork+0x1f/0x30 [245043.441054] ---[ end trace 15abaa2aaf36827f ]--- This is because at write_one_cache_group() when we are COWing a leaf from the extent tree we end up allocating a new block group (chunk) and, because we have hit a threshold on the number of bytes reserved for system chunks, we attempt to finalize the creation of new block groups from the current transaction, by calling btrfs_create_pending_block_groups(). However here we also need to modify the extent tree in order to insert a block group item, and if the location for this new block group item happens to be in the same leaf that we were COWing earlier, we deadlock since btrfs_search_slot() tries to write lock the extent buffer that we locked before at write_one_cache_group(). We have already hit similar cases in the past and commit `d9a0540a79` ("Btrfs: fix deadlock when finalizing block group creation") fixed some of those cases by delaying the creation of pending block groups at the known specific spots that could lead to a deadlock. This change reworks that commit to be more generic so that we don't have to add similar logic to every possible path that can lead to a deadlock. This is done by making __btrfs_cow_block() disallowing the creation of new block groups (setting the transaction's can_flush_pending_bgs to false) before it attempts to allocate a new extent buffer for either the extent, chunk or device trees, since those are the trees that pending block creation modifies. Once the new extent buffer is allocated, it allows creation of pending block groups to happen again. This change depends on a recent patch from Josef which is not yet in Linus' tree, named "btrfs: make sure we create all new block groups" in order to avoid occasional warnings at btrfs_trans_release_chunk_metadata(). Fixes: `d9a0540a79` ("Btrfs: fix deadlock when finalizing block group creation") CC: stable@vger.kernel.org # 4.4+ Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=199753 Link: https://lore.kernel.org/linux-btrfs/CAJtFHUTHna09ST-_EEiyWmDH6gAqS6wa=zMNMBsifj8ABu99cw@mail.gmail.com/ Reported-by: E V <eliventer@gmail.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-10-17 17:46:24 +02:00
Filipe Manana	7ed586d0a8	Btrfs: fix assertion on fsync of regular file when using no-holes feature When using the NO_HOLES feature and logging a regular file, we were expecting that if we find an inline extent, that either its size in RAM (uncompressed and unenconded) matches the size of the file or if it does not, that it matches the sector size and it represents compressed data. This assertion does not cover a case where the length of the inline extent is smaller than the sector size and also smaller the file's size, such case is possible through fallocate. Example: $ mkfs.btrfs -f -O no-holes /dev/sdb $ mount /dev/sdb /mnt $ xfs_io -f -c "pwrite -S 0xb60 0 21" /mnt/foobar $ xfs_io -c "falloc 40 40" /mnt/foobar $ xfs_io -c "fsync" /mnt/foobar In the above example we trigger the assertion because the inline extent's length is 21 bytes while the file size is 80 bytes. The fallocate() call merely updated the file's size and did not touch the existing inline extent, as expected. So fix this by adjusting the assertion so that an inline extent length smaller than the file size is valid if the file size is smaller than the filesystem's sector size. A test case for fstests follows soon. Reported-by: Anatoly Trosinenko <anatoly.trosinenko@gmail.com> Fixes: `a89ca6f24f` ("Btrfs: fix fsync after truncate when no_holes feature is enabled") CC: stable@vger.kernel.org # 4.14+ Link: https://lore.kernel.org/linux-btrfs/CAE5jQCfRSBC7n4pUTFJcmHh109=gwyT9mFkCOL+NKfzswmR=_Q@mail.gmail.com/ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-10-17 17:44:53 +02:00
Filipe Manana	3527a018c0	Btrfs: fix null pointer dereference on compressed write path error At inode.c:compress_file_range(), under the "free_pages_out" label, we can end up dereferencing the "pages" pointer when it has a NULL value. This case happens when "start" has a value of 0 and we fail to allocate memory for the "pages" pointer. When that happens we jump to the "cont" label and then enter the "if (start == 0)" branch where we immediately call the cow_file_range_inline() function. If that function returns 0 (success creating an inline extent) or an error (like -ENOMEM for example) we jump to the "free_pages_out" label and then access "pages[i]" leading to a NULL pointer dereference, since "nr_pages" has a value greater than zero at that point. Fix this by setting "nr_pages" to 0 when we fail to allocate memory for the "pages" pointer. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=201119 Fixes: `771ed689d2` ("Btrfs: Optimize compressed writeback and reads") CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-10-17 17:42:46 +02:00
Jens Axboe	b93f654d73	f2fs: remove request_list check in is_idle() This doesn't work on stacked devices, and it doesn't work on blk-mq devices. The request_list is only used on legacy, which we don't have much of anymore, and soon won't have any of. Kill the check. Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: linux-f2fs-devel@lists.sourceforge.net Signed-off-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-10-16 19:09:39 -07:00
Jaegeuk Kim	730746ce88	f2fs: allow to mount, if quota is failed Since we can use the filesystem without quotas till next boot. Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-10-16 09:37:00 -07:00
Sahitya Tummala	6390398ec7	f2fs: update REQ_TIME in f2fs_cross_rename() Update REQ_TIME in the missing path - f2fs_cross_rename(). Signed-off-by: Sahitya Tummala <stummala@codeaurora.org> [Jaegeuk Kim: add it in f2fs_rename()] Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-10-16 09:37:00 -07:00
Sahitya Tummala	c75f2feb80	f2fs: do not update REQ_TIME in case of error conditions The REQ_TIME should be updated only in case of success cases as followed at all other places in the file system. Signed-off-by: Sahitya Tummala <stummala@codeaurora.org> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-10-16 09:37:00 -07:00
Chao Yu	3b30eb19dc	f2fs: remove unneeded disable_nat_bits() Commit `7735730d39` ("f2fs: fix to propagate error from __get_meta_page()") added disable_nat_bits() in error path of __get_nat_bitmaps(), but it's unneeded, beause we will fail mount, we won't have chance to change nid usage status w/o nat full/empty bitmaps. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-10-16 09:36:59 -07:00
Chao Yu	850971b23f	f2fs: remove unused sbi->trigger_ssr_threshold Commit `a2a12b679f` ("f2fs: export SSR allocation threshold") introduced two threshold .min_ssr_sections and .trigger_ssr_threshold, but only .min_ssr_sections is used, so just remove redundant one for cleanup. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-10-16 09:36:59 -07:00
Chao Yu	ed15ba1415	f2fs: shrink sbi->sb_lock coverage in set_file_temperature() file_set_{cold,hot} doesn't need holding sbi->sb_lock, so moving them out of the lock. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-10-16 09:36:59 -07:00
Chao Yu	4dada3fd70	f2fs: use rb__cached friends As rbtree supports caching leftmost node natively, update f2fs codes to use rb__cached helpers to speed up leftmost node visiting. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-10-16 09:36:59 -07:00
Chao Yu	ef2a007134	f2fs: fix to recover cold bit of inode block during POR Testcase to reproduce this bug: 1. mkfs.f2fs /dev/sdd 2. mount -t f2fs /dev/sdd /mnt/f2fs 3. touch /mnt/f2fs/file 4. sync 5. chattr +A /mnt/f2fs/file 6. xfs_io -f /mnt/f2fs/file -c "fsync" 7. godown /mnt/f2fs 8. umount /mnt/f2fs 9. mount -t f2fs /dev/sdd /mnt/f2fs 10. chattr -A /mnt/f2fs/file 11. xfs_io -f /mnt/f2fs/file -c "fsync" 12. umount /mnt/f2fs 13. mount -t f2fs /dev/sdd /mnt/f2fs 14. lsattr /mnt/f2fs/file -----------------N- /mnt/f2fs/file But actually, we expect the corrct result is: -------A---------N- /mnt/f2fs/file The reason is in step 9) we missed to recover cold bit flag in inode block, so later, in fsync, we will skip write inode block due to below condition check, result in lossing data in another SPOR. f2fs_fsync_node_pages() if (!IS_DNODE(page) \|\| !is_cold_node(page)) continue; Note that, I guess that some non-dir inode has already lost cold bit during POR, so in order to reenable recovery for those inode, let's try to recover cold bit in f2fs_iget() to save more fsynced data. Fixes: `c56675750d` ("f2fs: remove unneeded set_cold_node()") Cc: <stable@vger.kernel.org> 4.17+ Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-10-16 09:36:59 -07:00
Chao Yu	48018b4cfd	f2fs: submit cached bio to avoid endless PageWriteback When migrating encrypted block from background GC thread, we only add them into f2fs inner bio cache, but forget to submit the cached bio, it may cause potential deadlock when we are waiting page writebacked, fix it. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-10-16 09:36:59 -07:00
Daniel Rosenberg	4354994f09	f2fs: checkpoint disabling Note that, it requires "f2fs: return correct errno in f2fs_gc". This adds a lightweight non-persistent snapshotting scheme to f2fs. To use, mount with the option checkpoint=disable, and to return to normal operation, remount with checkpoint=enable. If the filesystem is shut down before remounting with checkpoint=enable, it will revert back to its apparent state when it was first mounted with checkpoint=disable. This is useful for situations where you wish to be able to roll back the state of the disk in case of some critical failure. Signed-off-by: Daniel Rosenberg <drosen@google.com> [Jaegeuk Kim: use SB_RDONLY instead of MS_RDONLY] Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-10-16 09:36:39 -07:00
Hou Tao	92e2921f7e	jffs2: free jffs2_sb_info through jffs2_kill_sb() When an invalid mount option is passed to jffs2, jffs2_parse_options() will fail and jffs2_sb_info will be freed, but then jffs2_sb_info will be used (use-after-free) and freeed (double-free) in jffs2_kill_sb(). Fix it by removing the buggy invocation of kfree() when getting invalid mount options. Fixes: `92abc475d8` ("jffs2: implement mount option parsing and compression overriding") Cc: stable@kernel.org Signed-off-by: Hou Tao <houtao1@huawei.com> Reviewed-by: Richard Weinberger <richard@nod.at> Signed-off-by: Boris Brezillon <boris.brezillon@bootlin.com>	2018-10-16 10:34:28 +02:00
Bob Peterson	c9e58fb2aa	gfs2: write revokes should traverse sd_ail1_list in reverse All the other functions that deal with the sd_ail_list run the list from the tail back to the head, iow, in reverse. We should do the same while writing revokes, otherwise we might miss removing entries properly from the list when we hit the limit of how many revokes we can write at one time (based on block size, which determines how many block pointers will fit in the revoke block). Signed-off-by: Bob Peterson <rpeterso@redhat.com>	2018-10-15 12:17:30 -05:00
Lu Fengqi	d9352794da	btrfs: switch return_bigger to bool in find_ref_head Using bool is more suitable than int here, and add the comment about the return_bigger. Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-10-15 17:23:41 +02:00
Lu Fengqi	7c8616278b	btrfs: remove fs_info from btrfs_should_throttle_delayed_refs The avg_delayed_ref_runtime can be referenced from the transaction handle. Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-10-15 17:23:41 +02:00
Lu Fengqi	af9b8a0e20	btrfs: remove fs_info from btrfs_check_space_for_delayed_refs It can be referenced from the transaction handle. Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-10-15 17:23:41 +02:00
Lu Fengqi	9e920a6f03	btrfs: delayed-ref: pass delayed_refs directly to btrfs_delayed_ref_lock Since trans is only used for referring to delayed_refs, there is no need to pass it instead of delayed_refs to btrfs_delayed_ref_lock(). No functional change. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-10-15 17:23:41 +02:00
Lu Fengqi	5637c74b01	btrfs: delayed-ref: pass delayed_refs directly to btrfs_select_ref_head Since trans is only used for referring to delayed_refs, there is no need to pass it instead of delayed_refs to btrfs_select_ref_head(). No functional change. Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-10-15 17:23:40 +02:00
Lu Fengqi	b90e22ba48	btrfs: qgroup: move the qgroup->members check out from (!qgroup)'s else branch There is no reason to put this check in (!qgroup)'s else branch because if qgroup is null, it will goto out directly. So move it out to reduce indentation level. No functional change. Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-10-15 17:23:40 +02:00
Qu Wenruo	06bbf67244	btrfs: relocation: Remove redundant tree level check Commit `581c176041` ("btrfs: Validate child tree block's level and first key") has made tree block level check mandatory. So if tree block level doesn't match, we won't get a valid extent buffer. The extra WARN_ON() check can be removed completely. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-10-15 17:23:40 +02:00
Qu Wenruo	98ff7b94e4	btrfs: relocation: Cleanup while loop using rbtree_postorder_for_each_entry_safe And add one line comment explaining what we're doing for each loop. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-10-15 17:23:40 +02:00
Qu Wenruo	3628b4ca64	btrfs: qgroup: Avoid calling qgroup functions if qgroup is not enabled Some qgroup trace events like btrfs_qgroup_release_data() and btrfs_qgroup_free_delayed_ref() can still be triggered even if qgroup is not enabled. This is caused by the lack of qgroup status check before calling some qgroup functions. Thankfully the functions can handle quota disabled case well and just do nothing for qgroup disabled case. This patch will do earlier check before triggering related trace events. And for enabled <-> disabled race case: 1) For enabled->disabled case Disable will wipe out all qgroups data including reservation and excl/rfer. Even if we leak some reservation or numbers, it will still be cleared, so nothing will go wrong. 2) For disabled -> enabled case Current btrfs_qgroup_release_data() will use extent_io tree to ensure we won't underflow reservation. And for delayed_ref we use head->qgroup_reserved to record the reserved space, so in that case head->qgroup_reserved should be 0 and we won't underflow. CC: stable@vger.kernel.org # 4.14+ Reported-by: Chris Murphy <lists@colorremedies.com> Link: https://lore.kernel.org/linux-btrfs/CAJCQCtQau7DtuUUeycCkZ36qjbKuxNzsgqJ7+sJ6W0dK_NLE3w@mail.gmail.com/ Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-10-15 17:23:40 +02:00
Filipe Manana	0f375eed92	Btrfs: fix wrong dentries after fsync of file that got its parent replaced In a scenario like the following: mkdir /mnt/A # inode 258 mkdir /mnt/B # inode 259 touch /mnt/B/bar # inode 260 sync mv /mnt/B/bar /mnt/A/bar mv -T /mnt/A /mnt/B fsync /mnt/B/bar <power fail> After replaying the log we end up with file bar having 2 hard links, both with the name 'bar' and one in the directory with inode number 258 and the other in the directory with inode number 259. Also, we end up with the directory inode 259 still existing and with the directory inode 258 still named as 'A', instead of 'B'. In this scenario, file 'bar' should only have one hard link, located at directory inode 258, the directory inode 259 should not exist anymore and the name for directory inode 258 should be 'B'. This incorrect behaviour happens because when attempting to log the old parents of an inode, we skip any parents that no longer exist. Fix this by forcing a full commit if an old parent no longer exists. A test case for fstests follows soon. CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-10-15 17:23:39 +02:00
Filipe Manana	f2d72f42d5	Btrfs: fix warning when replaying log after fsync of a tmpfile When replaying a log which contains a tmpfile (which necessarily has a link count of 0) we end up calling inc_nlink(), at fs/btrfs/tree-log.c:replay_one_buffer(), which produces a warning like the following: [195191.943673] WARNING: CPU: 0 PID: 6924 at fs/inode.c:342 inc_nlink+0x33/0x40 [195191.943723] CPU: 0 PID: 6924 Comm: mount Not tainted 4.19.0-rc6-btrfs-next-38 #1 [195191.943724] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014 [195191.943726] RIP: 0010:inc_nlink+0x33/0x40 [195191.943728] RSP: 0018:ffffb96e425e3870 EFLAGS: 00010246 [195191.943730] RAX: 0000000000000000 RBX: ffff8c0d1e6af4f0 RCX: 0000000000000006 [195191.943731] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8c0d1e6af4f0 [195191.943731] RBP: 0000000000000097 R08: 0000000000000001 R09: 0000000000000000 [195191.943732] R10: 0000000000000000 R11: 0000000000000000 R12: ffffb96e425e3a60 [195191.943733] R13: ffff8c0d10cff0c8 R14: ffff8c0d0d515348 R15: ffff8c0d78a1b3f8 [195191.943735] FS: 00007f570ee24480(0000) GS:ffff8c0dfb200000(0000) knlGS:0000000000000000 [195191.943736] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [195191.943737] CR2: 00005593286277c8 CR3: 00000000bb8f2006 CR4: 00000000003606f0 [195191.943739] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [195191.943740] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [195191.943741] Call Trace: [195191.943778] replay_one_buffer+0x797/0x7d0 [btrfs] [195191.943802] walk_up_log_tree+0x1c1/0x250 [btrfs] [195191.943809] ? rcu_read_lock_sched_held+0x3f/0x70 [195191.943825] walk_log_tree+0xae/0x1d0 [btrfs] [195191.943840] btrfs_recover_log_trees+0x1d7/0x4d0 [btrfs] [195191.943856] ? replay_dir_deletes+0x280/0x280 [btrfs] [195191.943870] open_ctree+0x1c3b/0x22a0 [btrfs] [195191.943887] btrfs_mount_root+0x6b4/0x800 [btrfs] [195191.943894] ? rcu_read_lock_sched_held+0x3f/0x70 [195191.943899] ? pcpu_alloc+0x55b/0x7c0 [195191.943906] ? mount_fs+0x3b/0x140 [195191.943908] mount_fs+0x3b/0x140 [195191.943912] ? __init_waitqueue_head+0x36/0x50 [195191.943916] vfs_kern_mount+0x62/0x160 [195191.943927] btrfs_mount+0x134/0x890 [btrfs] [195191.943936] ? rcu_read_lock_sched_held+0x3f/0x70 [195191.943938] ? pcpu_alloc+0x55b/0x7c0 [195191.943943] ? mount_fs+0x3b/0x140 [195191.943952] ? btrfs_remount+0x570/0x570 [btrfs] [195191.943954] mount_fs+0x3b/0x140 [195191.943956] ? __init_waitqueue_head+0x36/0x50 [195191.943960] vfs_kern_mount+0x62/0x160 [195191.943963] do_mount+0x1f9/0xd40 [195191.943967] ? memdup_user+0x4b/0x70 [195191.943971] ksys_mount+0x7e/0xd0 [195191.943974] __x64_sys_mount+0x21/0x30 [195191.943977] do_syscall_64+0x60/0x1b0 [195191.943980] entry_SYSCALL_64_after_hwframe+0x49/0xbe [195191.943983] RIP: 0033:0x7f570e4e524a [195191.943986] RSP: 002b:00007ffd83589478 EFLAGS: 00000206 ORIG_RAX: 00000000000000a5 [195191.943989] RAX: ffffffffffffffda RBX: 0000563f335b2060 RCX: 00007f570e4e524a [195191.943990] RDX: 0000563f335b2240 RSI: 0000563f335b2280 RDI: 0000563f335b2260 [195191.943992] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000020 [195191.943993] R10: 00000000c0ed0000 R11: 0000000000000206 R12: 0000563f335b2260 [195191.943994] R13: 0000563f335b2240 R14: 0000000000000000 R15: 00000000ffffffff [195191.944002] irq event stamp: 8688 [195191.944010] hardirqs last enabled at (8687): [<ffffffff9cb004c3>] console_unlock+0x503/0x640 [195191.944012] hardirqs last disabled at (8688): [<ffffffff9ca037dd>] trace_hardirqs_off_thunk+0x1a/0x1c [195191.944018] softirqs last enabled at (8638): [<ffffffff9cc0a5d1>] __set_page_dirty_nobuffers+0x101/0x150 [195191.944020] softirqs last disabled at (8634): [<ffffffff9cc26bbe>] wb_wakeup_delayed+0x2e/0x60 [195191.944022] ---[ end trace 5d6e873a9a0b811a ]--- This happens because the inode does not have the flag I_LINKABLE set, which is a runtime only flag, not meant to be persisted, set when the inode is created through open(2) if the flag O_EXCL is not passed to it. Except for the warning, there are no other consequences (like corruptions or metadata inconsistencies). Since it's pointless to replay a tmpfile as it would be deleted in a later phase of the log replay procedure (it has a link count of 0), fix this by not logging tmpfiles and if a tmpfile is found in a log (created by a kernel without this change), skip the replay of the inode. A test case for fstests follows soon. Fixes: `471d557afe` ("Btrfs: fix loss of prealloc extents past i_size after fsync log replay") CC: stable@vger.kernel.org # 4.18+ Reported-by: Martin Steigerwald <martin@lichtvoll.de> Link: https://lore.kernel.org/linux-btrfs/3666619.NTnn27ZJZE@merkaba/ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-10-15 17:23:39 +02:00
Josef Bacik	ad80cf50c3	btrfs: drop min_size from evict_refill_and_join We don't need it, rsv->size is set once and never changes throughout its lifetime, so just use that for the reserve size. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-10-15 17:23:39 +02:00
Josef Bacik	e187831e18	btrfs: assert on non-empty delayed iputs I ran into an issue where there was some reference being held on an inode that I couldn't track. This assert wasn't triggered, but it at least rules out we're doing something stupid. Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-10-15 17:23:39 +02:00
Josef Bacik	545e3366db	btrfs: make sure we create all new block groups Allocating new chunks modifies both the extent and chunk tree, which can trigger new chunk allocations. So instead of doing list_for_each_safe, just do while (!list_empty()) so we make sure we don't exit with other pending bg's still on our list. CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Omar Sandoval <osandov@fb.com> Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-10-15 17:23:39 +02:00
Josef Bacik	553cceb496	btrfs: reset max_extent_size on clear in a bitmap We need to clear the max_extent_size when we clear bits from a bitmap since it could have been from the range that contains the max_extent_size. CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-10-15 17:23:39 +02:00
Josef Bacik	84de76a2fb	btrfs: protect space cache inode alloc with GFP_NOFS If we're allocating a new space cache inode it's likely going to be under a transaction handle, so we need to use memalloc_nofs_save() in order to avoid deadlocks, and more importantly lockdep messages that make xfstests fail. CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-10-15 17:23:38 +02:00
Josef Bacik	f45c752b65	btrfs: release metadata before running delayed refs We want to release the unused reservation we have since it refills the delayed refs reserve, which will make everything go smoother when running the delayed refs if we're short on our reservation. CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Omar Sandoval <osandov@fb.com> Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-10-15 17:23:38 +02:00
Liu Bo	5239834016	Btrfs: kill btrfs_clear_path_blocking Btrfs's btree locking has two modes, spinning mode and blocking mode, while searching btree, locking is always acquired in spinning mode and then converted to blocking mode if necessary, and in some hot paths we may switch the locking back to spinning mode by btrfs_clear_path_blocking(). When acquiring locks, both of reader and writer need to wait for blocking readers and writers to complete before doing read_lock()/write_lock(). The problem is that btrfs_clear_path_blocking() needs to switch nodes in the path to blocking mode at first (by btrfs_set_path_blocking) to make lockdep happy before doing its actual clearing blocking job. When switching to blocking mode from spinning mode, it consists of step 1) bumping up blocking readers counter and step 2) read_unlock()/write_unlock(), this has caused serious ping-pong effect if there're a great amount of concurrent readers/writers, as waiters will be woken up and go to sleep immediately. 1) Killing this kind of ping-pong results in a big improvement in my 1600k files creation script, MNT=/mnt/btrfs mkfs.btrfs -f /dev/sdf mount /dev/def $MNT time fsmark -D 10000 -S0 -n 100000 -s 0 -L 1 -l /tmp/fs_log.txt \ -d $MNT/0 -d $MNT/1 \ -d $MNT/2 -d $MNT/3 \ -d $MNT/4 -d $MNT/5 \ -d $MNT/6 -d $MNT/7 \ -d $MNT/8 -d $MNT/9 \ -d $MNT/10 -d $MNT/11 \ -d $MNT/12 -d $MNT/13 \ -d $MNT/14 -d $MNT/15 w/o patch: real 2m27.307s user 0m12.839s sys 13m42.831s w/ patch: real 1m2.273s user 0m15.802s sys 8m16.495s 1.1) latency histogram from funclatency[1] Overall with the patch, there're ~50% less write lock acquisition and the 95% max latency that write lock takes also reduces to ~100ms from >500ms. -------------------------------------------- w/o patch: -------------------------------------------- Function = btrfs_tree_lock msecs : count distribution 0 -> 1 : 2385222 \|**************************************\| 2 -> 3 : 37147 \| \| 4 -> 7 : 20452 \| \| 8 -> 15 : 13131 \| \| 16 -> 31 : 3877 \| \| 32 -> 63 : 3900 \| \| 64 -> 127 : 2612 \| \| 128 -> 255 : 974 \| \| 256 -> 511 : 165 \| \| 512 -> 1023 : 13 \| \| Function = btrfs_tree_read_lock msecs : count distribution 0 -> 1 : 6743860 \|************************************\| 2 -> 3 : 2146 \| \| 4 -> 7 : 190 \| \| 8 -> 15 : 38 \| \| 16 -> 31 : 4 \| \| -------------------------------------------- w/ patch: -------------------------------------------- Function = btrfs_tree_lock msecs : count distribution 0 -> 1 : 1318454 \|************************************\| 2 -> 3 : 6800 \| \| 4 -> 7 : 3664 \| \| 8 -> 15 : 2145 \| \| 16 -> 31 : 809 \| \| 32 -> 63 : 219 \| \| 64 -> 127 : 10 \| \| Function = btrfs_tree_read_lock msecs : count distribution 0 -> 1 : 6854317 \|**************************************\| 2 -> 3 : 2383 \| \| 4 -> 7 : 601 \| \| 8 -> 15 : 92 \| \| 2) dbench also proves the improvement, dbench -t 120 -D /mnt/btrfs 16 w/o patch: Throughput 158.363 MB/sec w/ patch: Throughput 449.52 MB/sec 3) xfstests didn't show any additional failures. One thing to note is that callers may set path->leave_spinning to have all nodes in the path stay in spinning mode, which means callers are ready to not sleep before releasing the path, but it won't cause problems if they don't want to sleep in blocking mode. [1]: https://github.com/iovisor/bcc/blob/master/tools/funclatency.py Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-10-15 17:23:38 +02:00
David Sterba	9b142115ed	btrfs: dev-replace: remove pointless assert in write unlock The value of blocking_readers is increased only when the lock is taken for read, no way we can fail the condition with the write lock. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-10-15 17:23:38 +02:00
David Sterba	7f8d236ae1	btrfs: dev-replace: move replace members out of fs_info The replace_wait and bio_counter were mistakenly added to fs_info in commit `c404e0dc2c` ("Btrfs: fix use-after-free in the finishing procedure of the device replace"), but they logically belong to fs_info::dev_replace. Besides, bio_counter is a very generic name and is confusing in bare fs_info context. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-10-15 17:23:38 +02:00
David Sterba	aa144bfeaa	btrfs: dev-replace: avoid useless lock on error handling path The exit sequence in btrfs_dev_replace_start does not allow to simply add a label to the right place so the error handling after starting transaction failure jumps there. Currently there's a lock that pairs with the unlock in the section, which is unnecessary and only raises questions. Add a variable to track the locking status and avoid the extra locking. Reviewed-by: Omar Sandoval <osandov@fb.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-10-15 17:23:38 +02:00
David Sterba	9f6cbcbb09	btrfs: open code btrfs_after_dev_replace_commit Too trivial, the purpose can be simply documented in a comment. Reviewed-by: Omar Sandoval <osandov@fb.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-10-15 17:23:37 +02:00
David Sterba	e37abe9725	btrfs: open code btrfs_dev_replace_stats_inc The wrapper is too trivial, open coding does not make it less readable. Reviewed-by: Omar Sandoval <osandov@fb.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-10-15 17:23:37 +02:00
David Sterba	7fb2eced10	btrfs: open code btrfs_dev_replace_clear_lock_blocking There's a single caller and the function name does not say it's actually taking the lock, so open coding makes it more explicit. For now, btrfs_dev_replace_read_lock is used instead of read_lock so it's paired with the unlocking wrapper in the same block. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-10-15 17:23:37 +02:00
David Sterba	3280f87457	btrfs: remove btrfs_dev_replace::read_locks This member seems to be copied from the extent_buffer locking scheme and is at least used to assert that the read lock/unlock is properly nested. In some way. While the _inc/_dec are called inside the read lock section, the asserts are both inside and outside, so the ordering is not guaranteed and we can see read/inc/dec ordered in any way (theoretically). A missing call of btrfs_dev_replace_clear_lock_blocking could cause unexpected read_locks count, so this at least looks like a valid assertion, but this will become unnecessary with later updates. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-10-15 17:23:37 +02:00
Qu Wenruo	f556faa46e	btrfs: tree-checker: Check level for leaves and nodes Although we have tree level check at tree read runtime, it's completely based on its parent level. We still need to do accurate level check to avoid invalid tree blocks sneak into kernel space. The check itself is simple, for leaf its level should always be 0. For nodes its level should be in range [1, BTRFS_MAX_LEVEL - 1]. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Su Yue <suy.fnst@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-10-15 17:23:37 +02:00
Qu Wenruo	3d0174f78e	btrfs: qgroup: Only trace data extents in leaves if we're relocating data block group For qgroup_trace_extent_swap(), if we find one leaf that needs to be traced, we will also iterate all file extents and trace them. This is OK if we're relocating data block groups, but if we're relocating metadata block groups, balance code itself has ensured that both subtree of file tree and reloc tree contain the same contents. That's to say, if we're relocating metadata block groups, all file extents in reloc and file tree should match, thus no need to trace them. This should reduce the total number of dirty extents processed in metadata block group balance. [[Benchmark]] (with all previous enhancement) Hardware: VM 4G vRAM, 8 vCPUs, disk is using 'unsafe' cache mode, backing device is SAMSUNG 850 evo SSD. Host has 16G ram. Mkfs parameter: --nodesize 4K (To bump up tree size) Initial subvolume contents: 4G data copied from /usr and /lib. (With enough regular small files) Snapshots: 16 snapshots of the original subvolume. each snapshot has 3 random files modified. balance parameter: -m So the content should be pretty similar to a real world root fs layout. \| v4.19-rc1 \| w/ patchset \| diff (*) --------------------------------------------------------------- relocated extents \| 22929 \| 22851 \| -0.3% qgroup dirty extents \| 227757 \| 140886 \| -38.1% time (sys) \| 65.253s \| 37.464s \| -42.6% time (real) \| 74.032s \| 44.722s \| -39.6% Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-10-15 17:23:36 +02:00
Qu Wenruo	2cd86d309b	btrfs: qgroup: Don't trace subtree if we're dropping reloc tree Reloc tree doesn't contribute to qgroup numbers, as we have accounted them at balance time (see replace_path()). Skipping the unneeded subtree tracing should reduce the overhead. [[Benchmark]] Hardware: VM 4G vRAM, 8 vCPUs, disk is using 'unsafe' cache mode, backing device is SAMSUNG 850 evo SSD. Host has 16G ram. Mkfs parameter: --nodesize 4K (To bump up tree size) Initial subvolume contents: 4G data copied from /usr and /lib. (With enough regular small files) Snapshots: 16 snapshots of the original subvolume. each snapshot has 3 random files modified. balance parameter: -m So the content should be pretty similar to a real world root fs layout. \| v4.19-rc1 \| w/ patchset \| diff (*) --------------------------------------------------------------- relocated extents \| 22929 \| 22900 \| -0.1% qgroup dirty extents \| 227757 \| 167139 \| -26.6% time (sys) \| 65.253s \| 50.123s \| -23.2% time (real) \| 74.032s \| 52.551s \| -29.0% Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-10-15 17:23:36 +02:00
Qu Wenruo	5f527822be	btrfs: qgroup: Use generation-aware subtree swap to mark dirty extents Before this patch, with quota enabled during balance, we need to mark the whole subtree dirty for quota. E.g. OO = Old tree blocks (from file tree) NN = New tree blocks (from reloc tree) File tree (src) Reloc tree (dst) OO (a) NN (a) / \ / \ (b) OO OO (c) (b) NN NN (c) / \ / \ / \ / \ OO OO OO OO (d) OO OO OO NN (d) For old balance + quota case, quota will mark the whole src and dst tree dirty, including all the 3 old tree blocks in reloc tree. It's doable for small file tree or new tree blocks are all located at lower level. But for large file tree or new tree blocks are all located at higher level, this will lead to mark the whole tree dirty, and be unbelievably slow. This patch will change how we handle such balance with quota enabled case. Now we will search from (b) and (c) for any new tree blocks whose generation is equal to @last_snapshot, and only mark them dirty. In above case, we only need to trace tree blocks NN(b), NN(c) and NN(d). (NN(a) will be traced when COW happens for nodeptr modification). And also for tree blocks OO(b), OO(c), OO(d). (OO(a) will be traced when COW happens for nodeptr modification.) For above case, we could skip 3 tree blocks, but for larger tree, we can skip tons of unmodified tree blocks, and hugely speed up balance. This patch will introduce a new function, btrfs_qgroup_trace_subtree_swap(), which will do the following main work: 1) Read out real root eb And setup basic dst_path for later calls 2) Call qgroup_trace_new_subtree_blocks() To trace all new tree blocks in reloc tree and their counter parts in the file tree. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-10-15 17:23:36 +02:00
Qu Wenruo	ea49f3e73c	btrfs: qgroup: Introduce function to find all new tree blocks of reloc tree Introduce new function, qgroup_trace_new_subtree_blocks(), to iterate all new tree blocks in a reloc tree. So that qgroup could skip unrelated tree blocks during balance, which should hugely speedup balance speed when quota is enabled. The function qgroup_trace_new_subtree_blocks() itself only cares about new tree blocks in reloc tree. All its main works are: 1) Read out tree blocks according to parent pointers 2) Do recursive depth-first search Will call the same function on all its children tree blocks, with search level set to current level -1. And will also skip all children whose generation is smaller than @last_snapshot. 3) Call qgroup_trace_extent_swap() to trace tree blocks So although we have parameter list related to source file tree, it's not used at all, but only passed to qgroup_trace_extent_swap(). Thus despite the tree read code, the core should be pretty short and all about recursive depth-first search. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-10-15 17:23:36 +02:00
Qu Wenruo	25982561db	btrfs: qgroup: Introduce function to trace two swaped extents Introduce a new function, qgroup_trace_extent_swap(), which will be used later for balance qgroup speedup. The basis idea of balance is swapping tree blocks between reloc tree and the real file tree. The swap will happen in highest tree block, but there may be a lot of tree blocks involved. For example: OO = Old tree blocks NN = New tree blocks allocated during balance File tree (257) Reloc tree for 257 L2 OO NN / \ / \ L1 OO OO (a) OO NN (a) / \ / \ / \ / \ L0 OO OO OO OO OO OO NN NN (b) (c) (b) (c) When calling qgroup_trace_extent_swap(), we will pass: @src_eb = OO(a) @dst_path = [ nodes[1] = NN(a), nodes[0] = NN(c) ] @dst_level = 0 @root_level = 1 In that case, qgroup_trace_extent_swap() will search from OO(a) to reach OO(c), then mark both OO(c) and NN(c) as qgroup dirty. The main work of qgroup_trace_extent_swap() can be split into 3 parts: 1) Tree search from @src_eb It should acts as a simplified btrfs_search_slot(). The key for search can be extracted from @dst_path->nodes[dst_level] (first key). 2) Mark the final tree blocks in @src_path and @dst_path qgroup dirty NOTE: In above case, OO(a) and NN(a) won't be marked qgroup dirty. They should be marked during preivous (@dst_level = 1) iteration. 3) Mark file extents in leaves dirty We don't have good way to pick out new file extents only. So we still follow the old method by scanning all file extents in the leave. This function can free us from keeping two pathes, thus later we only need to care about how to iterate all new tree blocks in reloc tree. Signed-off-by: Qu Wenruo <wqu@suse.com> [ copy changelog to function comment ] Signed-off-by: David Sterba <dsterba@suse.com>	2018-10-15 17:23:36 +02:00
Qu Wenruo	c337e7b02f	btrfs: qgroup: Introduce trace event to analyse the number of dirty extents accounted Number of qgroup dirty extents is directly linked to the performance overhead, so add a new trace event, trace_qgroup_num_dirty_extents(), to record how many dirty extents is processed in btrfs_qgroup_account_extents(). This will be pretty handy to analyze later balance performance improvement. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-10-15 17:23:36 +02:00
Qu Wenruo	fa6ac71524	btrfs: relocation: Add basic extent backref related comments for build_backref_tree fs/btrfs/relocation.c:build_backref_tree() is some code from 2009 era, although it works pretty fine, it's not that easy to understand. Especially combined with the complex btrfs backref format. This patch adds some basic comment for the backref build part of the code, making it less hard to read, at least for backref searching part. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-10-15 17:23:35 +02:00
Omar Sandoval	4779cc0424	Btrfs: get rid of btrfs_symlink_aops The only aops we define for symlinks are identical to the aops for regular files. This has been the case since symlink support was added in commit `2b8d99a723` ("Btrfs: symlinks and hard links"). As far as I can tell, there wasn't a good reason to have separate aops then, and there isn't now, so let's just do what most other filesystems do and reuse the same structure. Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-10-15 17:23:35 +02:00
Chris Mason	7703bdd8d2	Btrfs: don't clean dirty pages during buffered writes During buffered writes, we follow this basic series of steps: again: lock all the pages wait for writeback on all the pages Take the extent range lock wait for ordered extents on the whole range clean all the pages if (copy_from_user_in_atomic() hits a fault) { drop our locks goto again; } dirty all the pages release all the locks The extra waiting, cleaning and locking are there to make sure we don't modify pages in flight to the drive, after they've been crc'd. If some of the pages in the range were already dirty when the write began, and we need to goto again, we create a window where a dirty page has been cleaned and unlocked. It may be reclaimed before we're able to lock it again, which means we'll read the old contents off the drive and lose any modifications that had been pending writeback. We don't actually need to clean the pages. All of the other locking in place makes sure we don't start IO on the pages, so we can just leave them dirty for the duration of the write. Fixes: `73d59314e6` (the original btrfs merge) CC: stable@vger.kernel.org # v4.4+ Signed-off-by: Chris Mason <clm@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-10-15 17:23:35 +02:00
David Sterba	818255feec	btrfs: use common helper instead of open coding a bit test The helper does the same math and we take care about the special case when flags is 0 too. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-10-15 17:23:35 +02:00
Nikolay Borisov	0110a4c434	btrfs: refactor __btrfs_run_delayed_refs loop Refactor the delayed refs loop by using the newly introduced btrfs_run_delayed_refs_for_head function. This greatly simplifies __btrfs_run_delayed_refs and makes it more obvious what is happening. We now have 1 loop which iterates the existing delayed_heads and then each selected ref head is processed by the new helper. All existing semantics of the code are preserved so no functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-10-15 17:23:35 +02:00
Nikolay Borisov	e726138676	btrfs: Factor out loop processing all refs of a head This patch introduces a new helper encompassing the implicit inner loop in __btrfs_run_delayed_refs which processes all the refs for a given head. The code is mostly copy/paste, the only difference is that if we detect a newer reference then -EAGAIN is returned so that callers can react correctly. Also, at the end of the loop the head is relocked and btrfs_merge_delayed_refs is run again to retain the pre-refactoring semantics. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-10-15 17:23:35 +02:00
Nikolay Borisov	b1cdbcb53a	btrfs: Factor out ref head locking code in __btrfs_run_delayed_refs This is in preparation to refactor the giant loop in __btrfs_run_delayed_refs. As a first step define a new function which implements acquiring a reference to a btrfs_delayed_refs_head and use it. No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-10-15 17:23:34 +02:00
David Sterba	b2fa11547b	btrfs: tests: polish ifdefs around testing helper Avoid the inline ifdefs and use two sections for self-tests enabled and disabled. Though there could be no ifdef and unconditional test_bit of BTRFS_FS_STATE_DUMMY_FS_INFO, the static inline can help to optimize out any code that would depend on conditions using btrfs_is_testing. As this is only for the testing code, drop unlikely(). Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-10-15 17:23:34 +02:00
David Sterba	a654666a34	btrfs: tests: group declarations of self-test helpers Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-10-15 17:23:34 +02:00
David Sterba	57ec5fb478	btrfs: tests: move testing members of struct btrfs_root to the end The data used only for tests are better placed at the end of the structure so that they don't change the structure layout. All new members of btrfs_root should be placed before. Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-10-15 17:23:34 +02:00
David Sterba	9c36396c2a	btrfs: tests: add separate stub for find_lock_delalloc_range The helper find_lock_delalloc_range is now conditionally built static, dpending on whether the self-tests are enabled or not. There's a macro that is supposed to hide the export, used only once. To discourage further use, drop it an add a public wrapper for the helper needed by tests. Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-10-15 17:23:34 +02:00
Liu Bo	ecf160b424	Btrfs: preftree: use rb_first_cached rb_first_cached() trades an extra pointer "leftmost" for doing the same job as rb_first() but in O(1). While resolving indirect refs and missing refs, it always looks for the first rb entry in a while loop, it's helpful to use rb_first_cached instead. For more details about the optimization see patch "Btrfs: delayed-refs: use rb_first_cached for href_root". Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-10-15 17:23:33 +02:00
Liu Bo	07e1ce096d	Btrfs: extent_map: use rb_first_cached rb_first_cached() trades an extra pointer "leftmost" for doing the same job as rb_first() but in O(1). As evict_inode_truncate_pages() removes all extent mapping by always looking for the first rb entry, it's helpful to use rb_first_cached instead. For more details about the optimization see patch "Btrfs: delayed-refs: use rb_first_cached for href_root". Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-10-15 17:23:33 +02:00
Liu Bo	03a1d4c891	Btrfs: delayed-inode: use rb_first_cached for ins_root and del_root rb_first_cached() trades an extra pointer "leftmost" for doing the same job as rb_first() but in O(1). Functions manipulating delayed_item need to get the first entry, this converts it to use rb_first_cached(). For more details about the optimization see patch "Btrfs: delayed-refs: use rb_first_cached for href_root". Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-10-15 17:23:33 +02:00
Liu Bo	e3d0396563	Btrfs: delayed-refs: use rb_first_cached for ref_tree rb_first_cached() trades an extra pointer "leftmost" for doing the same job as rb_first() but in O(1). Functions manipulating href->ref_tree need to get the first entry, this converts href->ref_tree to use rb_first_cached(). For more details about the optimization see patch "Btrfs: delayed-refs: use rb_first_cached for href_root". Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-10-15 17:23:33 +02:00
Liu Bo	5c9d028b3b	Btrfs: delayed-refs: use rb_first_cached for href_root rb_first_cached() trades an extra pointer "leftmost" for doing the same job as rb_first() but in O(1). Functions manipulating href_root need to get the first entry, this converts href_root to use rb_first_cached(). This patch is first in the sequenct of similar updates to other rbtrees and this is analysis of the expected behaviour and improvements. There's a common pattern: while (node = rb_first) { entry = rb_entry(node) next = rb_next(node) rb_erase(node) cleanup(entry) } rb_first needs to traverse the tree up to logN depth, rb_erase can completely reshuffle the tree. With the caching we'll skip the traversal in rb_first. That's a cached memory access vs looped pointer dereference trade-off that IMHO has a clear winner. Measurements show there's not much difference in a sample tree with 10000 nodes: 4.5s / rb_first and 4.8s / rb_first_cached. Real effects of caching and pointer chasing are unpredictable though. Further optimzations can be done to avoid the expensive rb_erase step. In some cases it's ok to process the nodes in any order, so the tree can be traversed in post-order, not rebalancing the children nodes and just calling free. Care must be taken regarding the next node. Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com> Reviewed-by: David Sterba <dsterba@suse.com> [ update changelog from mail discussions ] Signed-off-by: David Sterba <dsterba@suse.com>	2018-10-15 17:23:33 +02:00

... 3 4 5 6 7 ...

56409 Commits