2018-01-29 19:41:30 +08:00
/* SPDX-License-Identifier: GPL-2.0 */
#ifndef _LINUX_IVERSION_H
#define _LINUX_IVERSION_H

#include <linux/fs.h>

/*
fs: handle inode->i_version more efficiently
Since i_version is mostly treated as an opaque value, we can exploit that
fact to avoid incrementing it when no one is watching. With that change,
we can avoid incrementing the counter on writes, unless someone has
queried for it since it was last incremented. If the a/c/mtime don't
change, and the i_version hasn't changed, then there's no need to dirty
the inode metadata on a write.
Convert the i_version counter to an atomic64_t, and use the lowest order
bit to hold a flag that will tell whether anyone has queried the value
since it was last incremented.
When we go to maybe increment it, we fetch the value and check the flag
bit. If it's clear then we don't need to do anything if the update
isn't being forced.
If we do need to update, then we increment the counter by 2, and clear
the flag bit, and then use a CAS op to swap it into place. If that
works, we return true. If it doesn't then do it again with the value
that we fetch from the CAS operation.
On the query side, if the flag is already set, then we just shift the
value down by 1 bit and return it. Otherwise, we set the flag in our
on-stack value and again use cmpxchg to swap it into place if it hasn't
changed. If it has, then we use the value from the cmpxchg as the new
"old" value and try again.
This method allows us to avoid incrementing the counter on writes (and
dirtying the metadata) under typical workloads. We only need to increment
if it has been queried since it was last changed.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Dave Chinner <dchinner@redhat.com>
Tested-by: Krzysztof Kozlowski <krzk@kernel.org>
2017-12-21 20:45:44 +08:00
 * The inode->i_version field:
 * ---------------------------
 * The change attribute (i_version) is mandated by NFSv4 and is mostly for
 * knfsd, but is also used for other purposes (e.g. IMA). The i_version must
 * appear different to observers if there was a change to the inode's data or
 * metadata since it was last queried.
 *
 * Observers see the i_version as a 64-bit number that never decreases. If it
 * remains the same since it was last checked, then nothing has changed in the
 * inode. If it's different then something has changed. Observers cannot infer
 * anything about the nature or magnitude of the changes from the value, only
 * that the inode has changed in some fashion.
 *
 * Not all filesystems properly implement the i_version counter. Subsystems that
 * want to use the i_version field on an inode should first check whether the
 * filesystem sets the SB_I_VERSION flag (usually via the IS_I_VERSION macro).
 *
 * Those that set SB_I_VERSION will automatically have their i_version counter
 * incremented on writes to normal files. If SB_I_VERSION is not set, then the
 * VFS will not touch it on writes, and the filesystem can use it however it
 * wishes. Note that the filesystem is always responsible for updating the
 * i_version on namespace changes in directories (mkdir, rmdir, unlink, etc.).
 * We consider filesystems that set SB_I_VERSION to have a kernel-managed
 * i_version.
 *
 * It may be impractical for filesystems to keep i_version updates atomic with
 * respect to the changes that cause them. They should, however, guarantee
 * that i_version updates are never visible before the changes that caused
 * them. Also, i_version updates should never be delayed longer than it takes
 * the original change to reach disk.
 *
 * This implementation uses the low bit in the i_version field as a flag to
 * track when the value has been queried. If it has not been queried since it
 * was last incremented, we can skip the increment in most cases.
 *
 * In the event that we're updating the ctime, we will usually go ahead and
 * bump the i_version anyway. Since the ctime change has to go to stable
 * storage in some fashion, we might as well increment the i_version too.
 *
 * With this implementation, the value should always appear to observers to
 * increase over time if the file has changed. It's recommended to use the
 * inode_eq_iversion() helper to compare values.
 *
 * Note that some filesystems (e.g. NFS and AFS) just use the field to store
 * a server-provided value (for the most part). For that reason, those
 * filesystems do not set SB_I_VERSION. These filesystems are considered to
 * have a self-managed i_version.
 *
 * Persistently storing the i_version
 * ----------------------------------
 * Queries of the i_version field are not gated on them hitting the backing
 * store. It's always possible that the host could crash after allowing
 * a query of the value but before it has made it to disk.
 *
 * To mitigate this problem, filesystems should always use
 * inode_set_iversion_queried when loading an existing inode from disk. This
 * ensures that the next attempted inode increment will result in the value
 * changing.
 *
 * Storing the value to disk therefore does not count as a query, so those
 * filesystems should use inode_peek_iversion to grab the value to be stored.
 * There is no need to flag the value as having been queried in that case.
 */

/*
 * We borrow the lowest bit in the i_version to use as a flag to tell whether
 * it has been queried since we last incremented it. If it has, then we must
 * increment it on the next change. After that, we can clear the flag and
 * avoid incrementing it again until it has again been queried.
 */
#define I_VERSION_QUERIED_SHIFT	(1)
#define I_VERSION_QUERIED	(1ULL << (I_VERSION_QUERIED_SHIFT - 1))
#define I_VERSION_INCREMENT	(1ULL << I_VERSION_QUERIED_SHIFT)

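/*
 * Illustration only, not part of the kernel header: a minimal userspace
 * sketch of the queried-flag scheme, assuming C11 <stdatomic.h> stands in
 * for atomic64_t. Every sketch_* function and SKETCH_* macro is a
 * hypothetical name that mirrors the I_VERSION_* layout; none of them is
 * kernel API. It shows a non-forced update being skipped until someone
 * queries the value, and the counter moving in steps of 2 so that bit 0
 * stays free for the flag.
 */

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the I_VERSION_* macros. */
#define SKETCH_QUERIED_SHIFT	1
#define SKETCH_QUERIED		(1ULL << (SKETCH_QUERIED_SHIFT - 1))
#define SKETCH_INCREMENT	(1ULL << SKETCH_QUERIED_SHIFT)

struct sketch_inode {
	_Atomic uint64_t version;	/* counter in bits 63..1, flag in bit 0 */
};

/* Bump the counter only if forced or if the value was queried. */
static bool sketch_maybe_inc(struct sketch_inode *ino, bool force)
{
	uint64_t cur = atomic_load(&ino->version);
	uint64_t new;

	do {
		if (!force && !(cur & SKETCH_QUERIED))
			return false;	/* nobody looked: skip the increment */
		/* Clear the flag and add 2, leaving bit 0 clear. */
		new = (cur & ~SKETCH_QUERIED) + SKETCH_INCREMENT;
		/* On failure, cur is reloaded with the current stored value. */
	} while (!atomic_compare_exchange_weak(&ino->version, &cur, new));

	return true;
}

/* Return the counter without the flag bit and mark it as queried. */
static uint64_t sketch_query(struct sketch_inode *ino)
{
	uint64_t cur = atomic_load(&ino->version);

	while (!(cur & SKETCH_QUERIED)) {
		if (atomic_compare_exchange_weak(&ino->version, &cur,
						 cur | SKETCH_QUERIED))
			break;
	}
	return cur >> SKETCH_QUERIED_SHIFT;
}

/* Walk through the skip/increment behaviour on a single inode. */
static void sketch_demo(void)
{
	struct sketch_inode ino;

	atomic_init(&ino.version, 5ULL << SKETCH_QUERIED_SHIFT);
	assert(!sketch_maybe_inc(&ino, false));	/* never queried: skipped */
	assert(sketch_query(&ino) == 5);	/* sets the flag */
	assert(sketch_maybe_inc(&ino, false));	/* queried: must bump */
	assert(sketch_query(&ino) == 6);
}
```
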
/**
 * inode_set_iversion_raw - set i_version to the specified raw value
 * @inode: inode to set
 * @val: new i_version value to set
 *
 * Set @inode's i_version field to @val. This function is for use by
 * filesystems that self-manage the i_version.
 *
 * For example, the NFS client stores its NFSv4 change attribute in this way,
 * and the AFS client stores the data_version from the server here.
 */
static inline void
inode_set_iversion_raw(struct inode *inode, u64 val)
{
	atomic64_set(&inode->i_version, val);
}

/**
 * inode_peek_iversion_raw - grab a "raw" iversion value
 * @inode: inode from which i_version should be read
 *
 * Grab a "raw" inode->i_version value and return it. The i_version is not
 * flagged or converted in any way. This is mostly used to access a self-managed
 * i_version.
 *
 * With those filesystems, we want to treat the i_version as an entirely
 * opaque value.
 */
static inline u64
inode_peek_iversion_raw(const struct inode *inode)
{
	return atomic64_read(&inode->i_version);
}

/**
 * inode_set_max_iversion_raw - update i_version if the new value is larger
 * @inode: inode to set
 * @val: new i_version to set
 *
 * Some self-managed filesystems (e.g. Ceph) will only update the i_version
 * value if the new value is larger than the one we already have.
 */
static inline void
inode_set_max_iversion_raw(struct inode *inode, u64 val)
{
	u64 cur, old;

	cur = inode_peek_iversion_raw(inode);
	for (;;) {
		/* Never move the stored value backwards. */
		if (cur > val)
			break;
		old = atomic64_cmpxchg(&inode->i_version, cur, val);
		if (likely(old == cur))
			break;
		/* Lost a race: retry against the value that was stored. */
		cur = old;
	}
}

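/*
 * Illustration only, not part of the kernel header: a minimal userspace
 * sketch of the same "only ever raise the value" compare-and-swap loop,
 * assuming C11 <stdatomic.h> in place of atomic64_cmpxchg().
 * sketch_set_max() is a hypothetical name, not kernel API.
 */

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical helper: raise *v to val, never lower it. */
static void sketch_set_max(_Atomic uint64_t *v, uint64_t val)
{
	uint64_t cur = atomic_load(v);

	/* Stop as soon as the stored value is already larger than val. */
	while (cur <= val) {
		/* On failure, cur is refreshed with the value now stored. */
		if (atomic_compare_exchange_weak(v, &cur, val))
			break;
	}
}

/* A smaller value is ignored; a larger one replaces the stored value. */
static void sketch_set_max_demo(void)
{
	_Atomic uint64_t v;

	atomic_init(&v, 10);
	sketch_set_max(&v, 7);
	assert(atomic_load(&v) == 10);
	sketch_set_max(&v, 42);
	assert(atomic_load(&v) == 42);
}
```
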
/**
 * inode_set_iversion - set i_version to a particular value
 * @inode: inode to set
 * @val: new i_version value to set
 *
 * Set @inode's i_version field to @val. This function is for filesystems with
 * a kernel-managed i_version, for initializing a newly-created inode from
 * scratch.
 *
 * In this case, we do not set the QUERIED flag since we know that this value
 * has never been queried.
 */
static inline void
inode_set_iversion(struct inode *inode, u64 val)
{
	inode_set_iversion_raw(inode, val << I_VERSION_QUERIED_SHIFT);
}

/**
 * inode_set_iversion_queried - set i_version to a particular value as queried
 * @inode: inode to set
the inode metadata on a write.
Convert the i_version counter to an atomic64_t, and use the lowest order
bit to hold a flag that will tell whether anyone has queried the value
since it was last incremented.
When we go to maybe increment it, we fetch the value and check the flag
bit. If it's clear then we don't need to do anything if the update
isn't being forced.
If we do need to update, then we increment the counter by 2, and clear
the flag bit, and then use a CAS op to swap it into place. If that
works, we return true. If it doesn't then do it again with the value
that we fetch from the CAS operation.
On the query side, if the flag is already set, then we just shift the
value down by 1 bit and return it. Otherwise, we set the flag in our
on-stack value and again use cmpxchg to swap it into place if it hasn't
changed. If it has, then we use the value from the cmpxchg as the new
"old" value and try again.
This method allows us to avoid incrementing the counter on writes (and
dirtying the metadata) under typical workloads. We only need to increment
if it has been queried since it was last changed.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Dave Chinner <dchinner@redhat.com>
Tested-by: Krzysztof Kozlowski <krzk@kernel.org>
2017-12-21 20:45:44 +08:00
|
|
|
* @val: new i_version value to set
|
2018-01-29 19:41:30 +08:00
|
|
|
*
|
fs: handle inode->i_version more efficiently
Since i_version is mostly treated as an opaque value, we can exploit that
fact to avoid incrementing it when no one is watching. With that change,
we can avoid incrementing the counter on writes, unless someone has
queried for it since it was last incremented. If the a/c/mtime don't
change, and the i_version hasn't changed, then there's no need to dirty
the inode metadata on a write.
Convert the i_version counter to an atomic64_t, and use the lowest order
bit to hold a flag that will tell whether anyone has queried the value
since it was last incremented.
When we go to maybe increment it, we fetch the value and check the flag
bit. If it's clear then we don't need to do anything if the update
isn't being forced.
If we do need to update, then we increment the counter by 2, and clear
the flag bit, and then use a CAS op to swap it into place. If that
works, we return true. If it doesn't then do it again with the value
that we fetch from the CAS operation.
On the query side, if the flag is already set, then we just shift the
value down by 1 bit and return it. Otherwise, we set the flag in our
on-stack value and again use cmpxchg to swap it into place if it hasn't
changed. If it has, then we use the value from the cmpxchg as the new
"old" value and try again.
This method allows us to avoid incrementing the counter on writes (and
dirtying the metadata) under typical workloads. We only need to increment
if it has been queried since it was last changed.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Dave Chinner <dchinner@redhat.com>
Tested-by: Krzysztof Kozlowski <krzk@kernel.org>
2017-12-21 20:45:44 +08:00
|
|
|
* Set @inode's i_version field to @val, and flag it for increment on the next
|
|
|
|
* change.
|
2018-01-29 19:41:30 +08:00
|
|
|
*
|
fs: handle inode->i_version more efficiently
Since i_version is mostly treated as an opaque value, we can exploit that
fact to avoid incrementing it when no one is watching. With that change,
we can avoid incrementing the counter on writes, unless someone has
queried for it since it was last incremented. If the a/c/mtime don't
change, and the i_version hasn't changed, then there's no need to dirty
the inode metadata on a write.
Convert the i_version counter to an atomic64_t, and use the lowest order
bit to hold a flag that will tell whether anyone has queried the value
since it was last incremented.
When we go to maybe increment it, we fetch the value and check the flag
bit. If it's clear then we don't need to do anything if the update
isn't being forced.
If we do need to update, then we increment the counter by 2, and clear
the flag bit, and then use a CAS op to swap it into place. If that
works, we return true. If it doesn't then do it again with the value
that we fetch from the CAS operation.
On the query side, if the flag is already set, then we just shift the
value down by 1 bit and return it. Otherwise, we set the flag in our
on-stack value and again use cmpxchg to swap it into place if it hasn't
changed. If it has, then we use the value from the cmpxchg as the new
"old" value and try again.
This method allows us to avoid incrementing the counter on writes (and
dirtying the metadata) under typical workloads. We only need to increment
if it has been queried since it was last changed.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Dave Chinner <dchinner@redhat.com>
Tested-by: Krzysztof Kozlowski <krzk@kernel.org>
2017-12-21 20:45:44 +08:00
|
|
|
* Filesystems that persistently store the i_version on disk should use this
|
|
|
|
* when loading an existing inode from disk.
|
2018-01-29 19:41:30 +08:00
|
|
|
*
|
fs: handle inode->i_version more efficiently
Since i_version is mostly treated as an opaque value, we can exploit that
fact to avoid incrementing it when no one is watching. With that change,
we can avoid incrementing the counter on writes, unless someone has
queried for it since it was last incremented. If the a/c/mtime don't
change, and the i_version hasn't changed, then there's no need to dirty
the inode metadata on a write.
Convert the i_version counter to an atomic64_t, and use the lowest order
bit to hold a flag that will tell whether anyone has queried the value
since it was last incremented.
When we go to maybe increment it, we fetch the value and check the flag
bit. If it's clear then we don't need to do anything if the update
isn't being forced.
If we do need to update, then we increment the counter by 2, and clear
the flag bit, and then use a CAS op to swap it into place. If that
works, we return true. If it doesn't then do it again with the value
that we fetch from the CAS operation.
On the query side, if the flag is already set, then we just shift the
value down by 1 bit and return it. Otherwise, we set the flag in our
on-stack value and again use cmpxchg to swap it into place if it hasn't
changed. If it has, then we use the value from the cmpxchg as the new
"old" value and try again.
This method allows us to avoid incrementing the counter on writes (and
dirtying the metadata) under typical workloads. We only need to increment
if it has been queried since it was last changed.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Dave Chinner <dchinner@redhat.com>
Tested-by: Krzysztof Kozlowski <krzk@kernel.org>
2017-12-21 20:45:44 +08:00
|
|
|
* When loading in an i_version value from a backing store, we can't be certain
|
|
|
|
* that it wasn't previously viewed before being stored. Thus, we must assume
|
|
|
|
* that it was, to ensure that we don't end up handing out the same value for
|
|
|
|
* different versions of the same inode.
|
2018-01-29 19:41:30 +08:00
|
|
|
*/
|
|
|
|
static inline void
|
fs: handle inode->i_version more efficiently
Since i_version is mostly treated as an opaque value, we can exploit that
fact to avoid incrementing it when no one is watching. With that change,
we can avoid incrementing the counter on writes, unless someone has
queried for it since it was last incremented. If the a/c/mtime don't
change, and the i_version hasn't changed, then there's no need to dirty
the inode metadata on a write.
Convert the i_version counter to an atomic64_t, and use the lowest order
bit to hold a flag that will tell whether anyone has queried the value
since it was last incremented.
When we go to maybe increment it, we fetch the value and check the flag
bit. If it's clear then we don't need to do anything if the update
isn't being forced.
If we do need to update, then we increment the counter by 2, and clear
the flag bit, and then use a CAS op to swap it into place. If that
works, we return true. If it doesn't then do it again with the value
that we fetch from the CAS operation.
On the query side, if the flag is already set, then we just shift the
value down by 1 bit and return it. Otherwise, we set the flag in our
on-stack value and again use cmpxchg to swap it into place if it hasn't
changed. If it has, then we use the value from the cmpxchg as the new
"old" value and try again.
This method allows us to avoid incrementing the counter on writes (and
dirtying the metadata) under typical workloads. We only need to increment
if it has been queried since it was last changed.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Dave Chinner <dchinner@redhat.com>
Tested-by: Krzysztof Kozlowski <krzk@kernel.org>
2017-12-21 20:45:44 +08:00
|
|
|
inode_set_iversion_queried(struct inode *inode, u64 val)
|
2018-01-29 19:41:30 +08:00
|
|
|
{
|
fs: handle inode->i_version more efficiently
Since i_version is mostly treated as an opaque value, we can exploit that
fact to avoid incrementing it when no one is watching. With that change,
we can avoid incrementing the counter on writes, unless someone has
queried for it since it was last incremented. If the a/c/mtime don't
change, and the i_version hasn't changed, then there's no need to dirty
the inode metadata on a write.
Convert the i_version counter to an atomic64_t, and use the lowest order
bit to hold a flag that will tell whether anyone has queried the value
since it was last incremented.
When we go to maybe increment it, we fetch the value and check the flag
bit. If it's clear then we don't need to do anything if the update
isn't being forced.
If we do need to update, then we increment the counter by 2, and clear
the flag bit, and then use a CAS op to swap it into place. If that
works, we return true. If it doesn't then do it again with the value
that we fetch from the CAS operation.
On the query side, if the flag is already set, then we just shift the
value down by 1 bit and return it. Otherwise, we set the flag in our
on-stack value and again use cmpxchg to swap it into place if it hasn't
changed. If it has, then we use the value from the cmpxchg as the new
"old" value and try again.
This method allows us to avoid incrementing the counter on writes (and
dirtying the metadata) under typical workloads. We only need to increment
if it has been queried since it was last changed.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Dave Chinner <dchinner@redhat.com>
Tested-by: Krzysztof Kozlowski <krzk@kernel.org>
2017-12-21 20:45:44 +08:00
|
|
|
inode_set_iversion_raw(inode, (val << I_VERSION_QUERIED_SHIFT) |
|
|
|
|
I_VERSION_QUERIED);
|
2018-01-29 19:41:30 +08:00
|
|
|
}
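
The encoding used by the two setters can be illustrated with a small user-space sketch. This is only a model under stated assumptions: the `LAYOUT_*` constants and `layout_*` helpers are hypothetical names, not part of the header, and they assume the flag lives in the lowest bit with the counter shifted up by one, as the comments describe.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins: flag in bit 0, counter in the remaining bits. */
#define LAYOUT_QUERIED_SHIFT 1
#define LAYOUT_QUERIED       (1ULL << (LAYOUT_QUERIED_SHIFT - 1))

/* Encode a counter value with the flag clear (nobody has looked yet). */
static uint64_t layout_encode(uint64_t val)
{
	return val << LAYOUT_QUERIED_SHIFT;
}

/* Encode a value loaded from disk: assume it was already seen. */
static uint64_t layout_encode_queried(uint64_t val)
{
	return (val << LAYOUT_QUERIED_SHIFT) | LAYOUT_QUERIED;
}

/* Recover the externally visible counter from the raw word. */
static uint64_t layout_decode(uint64_t raw)
{
	return raw >> LAYOUT_QUERIED_SHIFT;
}

/* Has the value been handed out since the last increment? */
static bool layout_is_queried(uint64_t raw)
{
	return (raw & LAYOUT_QUERIED) != 0;
}

/* Small self-check of the round trip. */
static void layout_demo(void)
{
	uint64_t raw = layout_encode_queried(5);

	assert(raw == 11);                 /* (5 << 1) | 1 */
	assert(layout_decode(raw) == 5);   /* the flag never leaks into the value */
	assert(layout_is_queried(raw));
	assert(!layout_is_queried(layout_encode(5)));
}
```

Both encodings decode to the same visible counter; only the flag bit differs, and that bit is what decides whether the next change must bump the counter.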

/**
 * inode_maybe_inc_iversion - increments i_version
 * @inode: inode with the i_version that should be updated
 * @force: increment the counter even if it's not necessary?
 *
 * Every time the inode is modified, the i_version field must be seen to have
 * changed by any observer.
 *
 * If "force" is set or the QUERIED flag is set, then ensure that we increment
 * the value, and clear the queried flag.
 *
 * In the common case where neither is set, then we can return "false" without
 * updating i_version.
 *
 * If this function returns false, and no other metadata has changed, then we
 * can avoid logging the metadata.
 */
static inline bool
inode_maybe_inc_iversion(struct inode *inode, bool force)
{
	u64 cur, old, new;

	/*
	 * The i_version field is not strictly ordered with any other inode
	 * information, but the legacy inode_inc_iversion code used a spinlock
	 * to serialize increments.
	 *
	 * Here, we add full memory barriers to ensure that any de-facto
	 * ordering with other info is preserved.
	 *
	 * This barrier pairs with the barrier in inode_query_iversion()
	 */
	smp_mb();
	cur = inode_peek_iversion_raw(inode);
	for (;;) {
		/* If flag is clear then we needn't do anything */
		if (!force && !(cur & I_VERSION_QUERIED))
			return false;

		/* Since lowest bit is flag, add 2 to avoid it */
		new = (cur & ~I_VERSION_QUERIED) + I_VERSION_INCREMENT;

		old = atomic64_cmpxchg(&inode->i_version, cur, new);
		if (likely(old == cur))
			break;
		cur = old;
	}
	return true;
}
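
The compare-and-swap loop above can be tried outside the kernel with C11 atomics. The following is a minimal sketch, not the kernel code: `cas_maybe_inc`, `cas_query`, and the `CAS_*` constants are hypothetical names, the explicit `smp_mb()` is replaced by the default sequentially consistent ordering of `<stdatomic.h>`, and the query side is modelled from the description of the flag protocol rather than copied from the header.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define CAS_QUERIED   1ULL  /* assumed: flag in the lowest bit */
#define CAS_INCREMENT 2ULL  /* assumed: one counter step skips the flag bit */

/* Bump the counter only if forced or if someone has queried it. */
static bool cas_maybe_inc(_Atomic uint64_t *ver, bool force)
{
	uint64_t cur = atomic_load(ver);

	for (;;) {
		if (!force && !(cur & CAS_QUERIED))
			return false;   /* nobody is watching: skip the update */

		uint64_t next = (cur & ~CAS_QUERIED) + CAS_INCREMENT;

		/* On failure, cur is refreshed with the value actually found. */
		if (atomic_compare_exchange_strong(ver, &cur, next))
			return true;
	}
}

/* Return the visible counter and make sure the flag is set afterwards. */
static uint64_t cas_query(_Atomic uint64_t *ver)
{
	uint64_t cur = atomic_load(ver);

	while (!(cur & CAS_QUERIED)) {
		uint64_t flagged = cur | CAS_QUERIED;

		if (atomic_compare_exchange_strong(ver, &cur, flagged))
			break;          /* cur still holds the pre-flag value */
	}
	return cur >> 1;
}

/* Walk through the skip, bump, and force cases on one thread. */
static void cas_demo(void)
{
	_Atomic uint64_t ver = 0;

	assert(!cas_maybe_inc(&ver, false)); /* never queried: no bump */
	assert(cas_query(&ver) == 0);        /* query sets the flag */
	assert(cas_maybe_inc(&ver, false));  /* now the write must bump */
	assert(cas_query(&ver) == 1);
}
```

The first unforced call leaves the word untouched, which corresponds to the case where the caller can avoid dirtying the inode metadata.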

/**
 * inode_inc_iversion - forcibly increment i_version
 * @inode: inode that needs to be updated
 *
 * Forcibly increment the i_version field. This always results in a change to
 * the observable value.
 */
static inline void
inode_inc_iversion(struct inode *inode)
{
	inode_maybe_inc_iversion(inode, true);
}

/**
 * inode_iversion_need_inc - is the i_version in need of being incremented?
 * @inode: inode to check
 *
 * Returns whether the inode->i_version counter needs incrementing on the next
 * change. Just fetch the value and check the QUERIED flag.
 */
static inline bool
inode_iversion_need_inc(struct inode *inode)
{
	return inode_peek_iversion_raw(inode) & I_VERSION_QUERIED;
}

/**
 * inode_inc_iversion_raw - forcibly increment raw i_version
 * @inode: inode that needs to be updated
 *
 * Forcibly increment the raw i_version field. This always results in a change
 * to the raw value.
 *
 * NFS will use the i_version field to store the value from the server. It
 * mostly treats it as opaque, but in the case where it holds a write
 * delegation, it must increment the value itself. This function does that.
 */
static inline void
inode_inc_iversion_raw(struct inode *inode)
{
	atomic64_inc(&inode->i_version);
}

/**
 * inode_peek_iversion - read i_version without flagging it to be incremented
 * @inode: inode from which i_version should be read
 *
 * Read the i_version counter for an inode without registering it as a
 * query.
 *
 * This is typically used by local filesystems that need to store an i_version
 * on disk. In that situation, it's not necessary to flag it as having been
 * viewed, as the result won't be used to gauge changes from that point.
 */
static inline u64
inode_peek_iversion(const struct inode *inode)
{
	return inode_peek_iversion_raw(inode) >> I_VERSION_QUERIED_SHIFT;
}
|
|
|
|
|
|
|
|
/**
 * inode_query_iversion - read i_version for later use
 * @inode: inode from which i_version should be read
 *
 * Read the inode i_version counter. This should be used by callers that wish
 * to store the returned i_version for later comparison. This will guarantee
 * that a later query of the i_version will result in a different value if
 * anything has changed.
 *
 * In this implementation, we fetch the current value, set the QUERIED flag and
 * then try to swap it into place with a cmpxchg, if it wasn't already set. If
 * that fails, we try again with the newly fetched value from the cmpxchg.
 */
static inline u64
inode_query_iversion(struct inode *inode)
{
	u64 cur, old, new;

	cur = inode_peek_iversion_raw(inode);
	for (;;) {
		/* If flag is already set, then no need to swap */
		if (cur & I_VERSION_QUERIED) {
			/*
			 * This barrier (and the implicit barrier in the
			 * cmpxchg below) pairs with the barrier in
			 * inode_maybe_inc_iversion().
			 */
			smp_mb();
			break;
		}

		new = cur | I_VERSION_QUERIED;
		old = atomic64_cmpxchg(&inode->i_version, cur, new);
		if (likely(old == cur))
			break;
		cur = old;
	}
	return cur >> I_VERSION_QUERIED_SHIFT;
}

/**
 * inode_eq_iversion_raw - check whether the raw i_version counter has changed
 * @inode: inode to check
 * @old: old value to check against its i_version
 *
 * Compare the current raw i_version counter with a previous one. Returns true
 * if they are the same or false if they are different.
 */
static inline bool
inode_eq_iversion_raw(const struct inode *inode, u64 old)
{
	return inode_peek_iversion_raw(inode) == old;
}

/**
 * inode_eq_iversion - check whether the i_version counter has changed
 * @inode: inode to check
 * @old: old value to check against its i_version
 *
 * Compare an i_version counter with a previous one. Returns true if they are
 * the same, and false if they are different.
 *
 * Note that we don't need to set the QUERIED flag in this case, as the value
 * in the inode is not being recorded for later use.
 */
static inline bool
inode_eq_iversion(const struct inode *inode, u64 old)
{
	return inode_peek_iversion(inode) == old;
}
#endif
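
The interplay between the query side and the "maybe increment" side described in the comments above can be modelled in userspace. What follows is a minimal sketch, not kernel code: it assumes C11 `<stdatomic.h>` with the default sequentially consistent ordering in place of the kernel's `atomic64_cmpxchg()` and `smp_mb()`, and the names `model_query`, `model_maybe_inc`, `model_peek`, `iversion_demo`, and the `QUERIED*`/`INCREMENT` macros are hypothetical, chosen only for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel's I_VERSION_* constants. */
#define QUERIED_SHIFT 1
#define QUERIED       (1ULL << (QUERIED_SHIFT - 1))	/* flag in bit 0 */
#define INCREMENT     (1ULL << QUERIED_SHIFT)		/* counter step: 2 */

/* Read the counter without marking it as queried. */
static uint64_t model_peek(_Atomic uint64_t *v)
{
	return atomic_load(v) >> QUERIED_SHIFT;
}

/* Read the counter and make sure the QUERIED flag ends up set. */
static uint64_t model_query(_Atomic uint64_t *v)
{
	uint64_t cur = atomic_load(v);

	/*
	 * A failed compare-exchange stores the value it found into cur,
	 * so each retry re-tests the flag on fresh data.
	 */
	while (!(cur & QUERIED) &&
	       !atomic_compare_exchange_weak(v, &cur, cur | QUERIED))
		;
	return cur >> QUERIED_SHIFT;
}

/* Bump the counter only if it was queried since the last bump, or if forced. */
static bool model_maybe_inc(_Atomic uint64_t *v, bool force)
{
	uint64_t cur = atomic_load(v);
	uint64_t new;

	do {
		if (!force && !(cur & QUERIED))
			return false;	/* nobody is watching: skip */
		/* Clear the flag and step the counter by one increment. */
		new = (cur & ~QUERIED) + INCREMENT;
	} while (!atomic_compare_exchange_weak(v, &cur, new));
	return true;
}

/* Walk through one write/query/write/write sequence; returns the final counter. */
uint64_t iversion_demo(void)
{
	_Atomic uint64_t v = 0;

	assert(!model_maybe_inc(&v, false));	/* never queried: no bump */
	assert(model_query(&v) == 0);		/* sets QUERIED */
	assert(model_maybe_inc(&v, false));	/* queried: counter becomes 1 */
	assert(!model_maybe_inc(&v, false));	/* flag was cleared: skipped */
	return model_peek(&v);
}
```

In this model, only the write that follows a query changes the stored value; the other two writes leave it untouched, which is the case where the inode metadata would not need to be dirtied.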