Remove BLKIO_POLICY_* enums and let blkio_policy_register() allocate
@pol->plid dynamically on registration. The maximum number of blkcg
policies which can be registered at the same time is defined by
BLKCG_MAX_POLS constant added to include/linux/blkdev.h.
Note that blkio_policy_register() now may fail. Policy init functions
updated accordingly and unnecessary ifdefs removed from cfq_init().
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The two functions were taking "enum blkio_policy_id plid". Make them
take "const struct blkio_policy_type *pol" instead.
This is to prepare for per-queue policy activation and doesn't cause
any functional difference.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
With blkio_policy[], blkio_list is redundant and hinders with
per-queue policy activation. Remove it. Also, replace
blkio_list_lock with a mutex blkcg_pol_mutex and let it protect the
whole [un]registration.
This is to prepare for per-queue policy activation and doesn't cause
any functional difference.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Now that all stat handling code lives in policy implementations,
there's no need to encode policy ID in cft->private.
* Export blkcg_prfill_[rw]stat() from blkcg, remove
blkcg_print_[rw]stat(), and implement cfqg_print_[rw]stat() which
use hard-code BLKIO_POLICY_PROP.
* Use cft->private for offset of the target field directly and drop
BLKCG_STAT_{PRIV|POL|OFF}().
Signed-off-by: Tejun Heo <tj@kernel.org>
Now that all conf and stat fields are moved into policy specific
blkio_policy_data->pdata areas, there's no reason to use
blkio_policy_data itself in prfill functions. Pass around @pd->pdata
instead of @pd.
Signed-off-by: Tejun Heo <tj@kernel.org>
blkio_cgroup_conf->iops and ->bps are owned by blk-throttle and has no
reason to be defined in blkcg core. Drop them and let conf setting
functions directly manipulate throtl_grp->bps[] and ->iops[].
This makes blkio_group_conf empty. Drop it.
Signed-off-by: Tejun Heo <tj@kernel.org>
blkio_group_conf->weight is owned by cfq and has no reason to be
defined in blkcg core. Replace it with cfq_group->dev_weight and let
conf setting functions directly set it. If dev_weight is zero, the
cfqg doesn't have device specific weight configured.
Also, rename BLKIO_WEIGHT_* constants to CFQ_WEIGHT_* and rename
blkio_cgroup->weight to blkio_cgroup->cfq_weight. We eventually want
per-policy storage in blkio_cgroup but just mark the ownership of the
field for now.
Signed-off-by: Tejun Heo <tj@kernel.org>
blkio_group_stats_cpu is used only by blk-throtl and has no reason to
be defined in blkcg core.
* Move blkio_group_stats_cpu to blk-throttle.c and rename it to
tg_stats_cpu.
* blkg_policy_data->stats_cpu is replaced with throtl_grp->stats_cpu.
prfill functions updated accordingly.
* All related macros / functions are renamed so that they have tg_
prefix and the unnecessary @pol arguments are dropped.
* Per-cpu stats allocation code is also moved from blk-cgroup.c to
blk-throttle.c and gets simplified to only deal with
BLKIO_POLICY_THROTL. percpu stat free is performed by the exit
method throtl_exit_blkio_group().
* throtl_reset_group_stats() implemented for
blkio_reset_group_stats_fn method so that tg->stats_cpu can be
reset.
Signed-off-by: Tejun Heo <tj@kernel.org>
blkio_group_stats contains only fields used by cfq and has no reason
to be defined in blkcg core.
* Move blkio_group_stats to cfq-iosched.c and rename it to cfqg_stats.
* blkg_policy_data->stats is replaced with cfq_group->stats.
blkg_prfill_[rw]stat() are updated to use offset against pd->pdata
instead.
* All related macros / functions are renamed so that they have cfqg_
prefix and the unnecessary @pol arguments are dropped.
* All stat functions now take cfq_group * instead of blkio_group *.
* lockdep assertion on queue lock dropped. Elevator runs under queue
lock by default. There isn't much to be gained by adding lockdep
assertions at stat function level.
* cfqg_stats_reset() implemented for blkio_reset_group_stats_fn method
so that cfqg->stats can be reset.
Signed-off-by: Tejun Heo <tj@kernel.org>
Add blkio_policy_ops->blkio_exit_group_fn() and
->blkio_reset_group_stats_fn(). These will be used to further
modularize blkcg policy implementation.
Signed-off-by: Tejun Heo <tj@kernel.org>
blkio_group_stats_cpu is used to count dispatch stats using per-cpu
counters. This is used by both blk-throtl and cfq-iosched but the
sharing is rather silly.
* cfq-iosched doesn't need per-cpu dispatch stats. cfq always updates
those stats while holding queue_lock.
* blk-throtl needs per-cpu dispatch stats but only service_bytes and
serviced. It doesn't make use of sectors.
This patch makes cfq add and use global stats for service_bytes,
serviced and sectors, removes per-cpu sectors counter and moves
per-cpu stat printing code to blk-throttle.c.
Signed-off-by: Tejun Heo <tj@kernel.org>
As with conf/stats file handling code, there's no reason for stat
update code to live in blkcg core with policies calling into update
them. The current organization is both inflexible and complex.
This patch moves stat update code to specific policies. All
blkiocg_update_*_stats() functions which deal with BLKIO_POLICY_PROP
stats are collapsed into their cfq_blkiocg_update_*_stats()
counterparts. blkiocg_update_dispatch_stats() is used by both
policies and duplicated as throtl_update_dispatch_stats() and
cfq_blkiocg_update_dispatch_stats(). This will be cleaned up later.
Signed-off-by: Tejun Heo <tj@kernel.org>
blkcg conf/stat handling is convoluted in that details which belong to
specific policy implementations are all out in blkcg core and then
policies hook into core layer to access and manipulate confs and
stats. This sadly achieves both inflexibility (confs/stats can't be
modified without messing with blkcg core) and complexity (all the
call-ins and call-backs).
The previous patches restructured conf and stat handling code such
that they can be separated out. This patch relocates the file
handling part. All conf/stat file handling code which belongs to
BLKIO_POLICY_PROP is moved to cfq-iosched.c and all
BKLIO_POLICY_THROTL code to blk-throtl.c.
The move is verbatim except for blkio_update_group_{weight|bps|iops}()
callbacks which relays conf changes to policies. The configuration
settings are handled in policies themselves so the relaying isn't
necessary. Conf setting functions are modified to directly call
per-policy update functions and the relaying mechanism is dropped.
Signed-off-by: Tejun Heo <tj@kernel.org>
Add blkiop->cftypes which is added and removed together with the
policy. This will be used to move conf/stat handling to the policies.
Signed-off-by: Tejun Heo <tj@kernel.org>
conf/stat handling is about to be moved to policy implementation from
blkcg core. Export conf/stat helpers from blkcg core so that
blk-throttle and cfq-iosched can use them.
Signed-off-by: Tejun Heo <tj@kernel.org>
As part of userland interface restructuring, this patch updates
per-blkio_group configuration setting. Instead of funneling
everything through a master function which has hard-coded cases for
each config file it may handle, the common part is factored into
blkg_conf_prep() and blkg_conf_finish() and different configuration
setters are implemented using the helpers.
While this doesn't result in immediate LOC reduction, this enables
further cleanups and more modular implementation.
Signed-off-by: Tejun Heo <tj@kernel.org>
Similarly to the previous stat restructuring, this patch restructures
conf printing code such that,
* Conf printing uses the same helpers as stat.
* Printing function doesn't require hardcoded switching on the config
being printed. Note that this isn't complete yet for throttle
confs. The next patch will convert setting for these confs and will
complete the transition.
* Printing uses read_seq_string callback (other methods will be phased
out).
Note that blkio_group_conf.iops[2] is changed to u64 so that they can
be manipulated with the same functions. This is transitional and will
go away later.
After this patch, per-device configurations - weight, bps and iops -
use __blkg_prfill_u64() for printing which uses white space as
delimiter instead of tab.
Signed-off-by: Tejun Heo <tj@kernel.org>
blkcg stats handling is a mess. None of the stats has much to do with
blkcg core but they are all implemented in blkcg core. Code sharing
is achieved by mixing common code with hard-coded cases for each stat
counter.
This patch restructures statistics printing such that
* Common logic exists as helper functions and specific print functions
use the helpers to implement specific cases.
* Printing functions serving multiple counters don't require hardcoded
switching on specific counters.
* Printing uses read_seq_string callback (other methods will be phased
out).
This change enables further cleanups and relocating stats code to the
policy implementation it belongs to.
Signed-off-by: Tejun Heo <tj@kernel.org>
blkcg uses u64_stats_sync to avoid reading wrong u64 statistic values
on 32bit archs and some stat counters have subtypes to distinguish
read/writes and sync/async IOs. The stat code paths are confusing and
involve a lot of going back and forth between blkcg core and specific
policy implementations, and synchronization and subtype handling are
open coded in blkcg core.
This patch introduces struct blkg_stat and blkg_rwstat which, with
accompanying operations, encapsulate stat updating and accessing with
proper synchronization.
blkg_stat is simple u64 counter with 64bit read-access protection.
blkg_rwstat is the one with rw and [a]sync subcounters and takes @rw
flags to distinguish IO subtypes (%REQ_WRITE and %REQ_SYNC) and
replaces stat_sub_type indexed arrays.
All counters in blkio_group_stats and blkio_group_stats_cpu are
replaced with either blkg_stat or blkg_rwstat along with all users.
This does add one u64_stats_sync per counter and increase stats_sync
operations but they're empty/noops on 64bit archs and blkcg doesn't
have too many counters, especially with DEBUG_BLK_CGROUP off.
While the currently resulting code isn't necessarily simpler at the
moment, this will enable further clean up of blkcg stats code.
- BLKIO_STAT_{READ|WRITE|SYNC|ASYNC|TOTAL} renamed to
BLKG_RWSTAT_{READ|WRITE|SYNC|ASYNC|TOTAL}.
- blkg_stat_add() replaces blkio_add_stat() and
blkio_check_and_dec_stat(). Note that BUG_ON() on underflow in the
latter function no longer exists. It's *way* better to have
underflowed stat counters than oopsing.
- blkio_group_stats->dequeue is now a proper u64 stat counter instead
of ulong.
- reset_stats() updated to clear each stat counters individually and
BLKG_STATS_DEBUG_CLEAR_{START|SIZE} are removed.
- Some functions reconstruct rw flags from direction and sync
booleans. This will be removed by future patches.
Signed-off-by: Tejun Heo <tj@kernel.org>
BLKIO_STAT_CPU_SECTORS doesn't need read/write/sync/async subcounters
and is counted by blkio_group_stats_cpu->sectors; however, it still
holds a member in blkio_group_stats_cpu->stat_arr_cpu.
Rearrange stat_type_cpu and define BLKIO_STAT_CPU_ARR_NR and use it
for stat_arr_cpu[] size so that only SERVICE_BYTES and SERVICED have
subcounters.
Signed-off-by: Tejun Heo <tj@kernel.org>
Add 64bit unique id to blkcg. This will be used by policies which
want blkcg identity test to tell whether the associated blkcg has
changed.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
With recent plug merge updates, all non-percpu stat updates happen
under queue_lock making stats_lock unnecessary to synchronize stat
updates. The only synchronization necessary is stat reading, which
can be done using u64_stats_sync instead.
This patch removes blkio_group->stats_lock and adds
blkio_group_stats->syncp for reader synchronization.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Restructure blkio_get_stat() to prepare for removal of stats_lock.
* Define BLKIO_STAT_ARR_NR explicitly to denote which stats have
subtypes instead of using BLKIO_STAT_QUEUED.
* Separate out stat acquisition and printing. After this, there are
only two users of blkio_fill_stat(). Just open code it.
* The code was mixing MAX_KEY_LEN and MAX_KEY_LEN - 1. There's no
need to subtract one. Use MAX_KEY_LEN consistently.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
blkiocg_reset_stats() implements stat reset for blkio.reset_stats
cgroupfs file. This feature is very unconventional and something
which shouldn't have been merged. It's only useful when there's only
one user or tool looking at the stats. As soon as multiple users
and/or tools are involved, it becomes useless as resetting disrupts
other usages. There are very good reasons why all other stats expect
readers to read values at the start and end of a period and subtract
to determine delta over the period.
The implementation is rather complex - some fields shouldn't be
cleared and it saves some fields, resets whole and restores for some
reason. Reset of percpu stats is also racy. The comment points to
64bit store atomicity for the reason but even without that stores for
zero can simply race with other CPUs doing RMW and get clobbered.
Simplify reset by
* Clear selectively instead of resetting and restoring.
* Grouping debug stat fields to be reset and using memset() over them.
* Not caring about stats_lock.
* Using memset() to reset percpu stats.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
With recent plug merge updates, merged stats are no longer called for
plug merges and now only updated while holding queue_lock. As
stats_lock is scheduled to be removed, there's no reason to use percpu
for merged stats. Don't use percpu for merged stats.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Current per cpu stat allocation assumes GFP_KERNEL allocation flag. But in
IO path there are times when we want GFP_NOIO semantics. As there is no
way to pass the allocation flags to alloc_percpu(), this patch delays the
allocation of stats using a worker thread.
v2-> tejun suggested following changes. Changed the patch accordingly.
- move alloc_node location in structure
- reduce the size of names of some of the fields
- Reduce the scope of locking of alloc_list_lock
- Simplified stat_alloc_fn() by allocating stats for all
policies in one go and then assigning these to a group.
v3 -> Andrew suggested to put some comments in the code. Also raised
concerns about trying to allocate infinitely in case of allocation
failure. I have changed the logic to sleep for 10ms before retrying.
That should take care of non-preemptible UP kernels.
v4 -> Tejun had more suggestions.
- drop list_for_each_entry_all()
- instead of msleep() use queue_delayed_work()
- Some cleanups realted to more compact coding.
v5-> tejun suggested more cleanups leading to more compact code.
tj: - Relocated pcpu_stats into blkio_stat_alloc_fn().
- Minor comment update.
- This also fixes suspicious RCU usage warning caused by invoking
cgroup_path() from blkg_alloc() without holding RCU read lock.
Now that blkg_alloc() doesn't require sleepable context, RCU
read lock from blkg_lookup_create() is maintained throughout
blkg_alloc().
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Implement bio_blkio_cgroup() which returns the blkcg associated with
the bio if exists or %current's blkcg, and use it in blk-throttle and
cfq-iosched propio. This makes both cgroup policies honor task
association for the bio instead of always assuming %current.
As nobody is using bio_set_task() yet, this doesn't introduce any
behavior change.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Now that blkg additions / removals are always done under both q and
blkcg locks, the only places RCU locking is necessary are
blkg_lookup[_create]() for lookup w/o blkcg lock. This patch drops
unncessary RCU locking replacing it with plain blkcg locking as
necessary.
* blkiocg_pre_destroy() already perform proper locking and don't need
RCU. Dropped.
* blkio_read_blkg_stats() now uses blkcg->lock instead of RCU read
lock. This isn't a hot path.
* Now unnecessary synchronize_rcu() from queue exit paths removed.
This makes q->nr_blkgs unnecessary. Dropped.
* RCU annotation on blkg->q removed.
-v2: Vivek pointed out that blkg_lookup_create() still needs to be
called under rcu_read_lock(). Updated.
-v3: After the update, stats_lock locking in blkio_read_blkg_stats()
shouldn't be using _irq variant as it otherwise ends up enabling
irq while blkcg->lock is locked. Fixed.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
blkgs are chained from both blkcgs and request_queues and thus
subjected to two locks - blkcg->lock and q->queue_lock. As both blkcg
and q can go away anytime, locking during removal is tricky. It's
currently solved by wrapping removal inside RCU, which makes the
synchronization complex. There are three locks to worry about - the
outer RCU, q lock and blkcg lock, and it leads to nasty subtle
complications like conditional synchronize_rcu() on queue exit paths.
For all other paths, blkcg lock is naturally nested inside q lock and
the only exception is blkcg removal path, which is a very cold path
and can be implemented as clumsy but conceptually-simple reverse
double lock dancing.
This patch updates blkg removal path such that blkgs are removed while
holding both q and blkcg locks, which is trivial for request queue
exit path - blkg_destroy_all(). The blkcg removal path,
blkiocg_pre_destroy(), implements reverse double lock dancing
essentially identical to ioc_release_fn().
This simplifies blkg locking - no half-dead blkgs to worry about. Now
unnecessary RCU annotations will be removed by the next patch.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently, blkg is per cgroup-queue-policy combination. This is
unnatural and leads to various convolutions in partially used
duplicate fields in blkg, config / stat access, and general management
of blkgs.
This patch make blkg's per cgroup-queue and let them serve all
policies. blkgs are now created and destroyed by blkcg core proper.
This will allow further consolidation of common management logic into
blkcg core and API with better defined semantics and layering.
As a transitional step to untangle blkg management, elvswitch and
policy [de]registration, all blkgs except the root blkg are being shot
down during elvswitch and bypass. This patch adds blkg_root_update()
to update root blkg in place on policy change. This is hacky and racy
but should be good enough as interim step until we get locking
simplified and switch over to proper in-place update for all blkgs.
-v2: Root blkgs need to be updated on elvswitch too and blkg_alloc()
comment wasn't updated according to the function change. Fixed.
Both pointed out by Vivek.
-v3: v2 updated blkg_destroy_all() to invoke update_root_blkg_pd() for
all policies. This freed root pd during elvswitch before the
last queue finished exiting and led to oops. Directly invoke
update_root_blkg_pd() only on BLKIO_POLICY_PROP from
cfq_exit_queue(). This also is closer to what will be done with
proper in-place blkg update. Reported by Vivek.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
With the previous patch to move blkg list heads and counters to
request_queue and blkg, logic to manage them in both policies are
almost identical and can be moved to blkcg core.
This patch moves blkg link logic into blkg_lookup_create(), implements
common blkg unlink code in blkg_destroy(), and updates
blkg_destory_all() so that it's policy specific and can skip root
group. The updated blkg_destroy_all() is now used to both clear queue
for bypassing and elv switching, and release all blkgs on q exit.
This patch introduces a race window where policy [de]registration may
race against queue blkg clearing. This can only be a problem on cfq
unload and shouldn't be a real problem in practice (and we have many
other places where this race already exists). Future patches will
remove these unlikely races.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently, specific policy implementations are responsible for
maintaining list and number of blkgs. This duplicates code
unnecessarily, and hinders factoring common code and providing blkcg
API with better defined semantics.
After this patch, request_queue hosts list heads and counters and blkg
has list nodes for both policies. This patch only relocates the
necessary fields and the next patch will actually move management code
into blkcg core.
Note that request_queue->blkg_list[] and ->nr_blkgs[] are hardcoded to
have 2 elements. This is to avoid include dependency and will be
removed by the next patch.
This patch doesn't introduce any behavior change.
-v2: Now unnecessary conditional on CONFIG_BLK_CGROUP_MODULE removed
as pointed out by Vivek.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
blkg is scheduled to be unified for all policies and thus there won't
be one-to-one mapping from blkg to policy. Update stat related
functions to take explicit @pol or @plid arguments and not use
blkg->plid.
This is painful for now but most of specific stat interface functions
will be replaced with a handful of generic helpers.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
To prepare for unifying blkgs for different policies, make blkg->pd an
array with BLKIO_NR_POLICIES elements and move blkg->conf, ->stats,
and ->stats_cpu into blkg_policy_data.
This patch doesn't introduce any functional difference.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently, blkcg policy implementations manage blkg refcnt duplicating
mostly identical code in both policies. This patch moves refcnt to
blkg and let blkcg core handle refcnt and freeing of blkgs.
* cfq blkgs now also get freed via RCU.
* cfq blkgs lose RB_EMPTY_ROOT() sanity check on blkg free. If
necessary, we can add blkio_exit_group_fn() to resurrect this.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently, blkg's are embedded in private data blkcg policy private
data structure and thus allocated and freed by policies. This leads
to duplicate codes in policies, hinders implementing common part in
blkcg core with strong semantics, and forces duplicate blkg's for the
same cgroup-q association.
This patch introduces struct blkg_policy_data which is a separate data
structure chained from blkg. Policies specifies the amount of private
data it needs in its blkio_policy_type->pdata_size and blkcg core
takes care of allocating them along with blkg which can be accessed
using blkg_to_pdata(). blkg can be determined from pdata using
pdata_to_blkg(). blkio_alloc_group_fn() method is accordingly updated
to blkio_init_group_fn().
For consistency, tg_of_blkg() and cfqg_of_blkg() are replaced with
blkg_to_tg() and blkg_to_cfqg() respectively, and functions to map in
the reverse direction are added.
Except that policy specific data now lives in a separate data
structure from blkg, this patch doesn't introduce any functional
difference.
This will be used to unify blkg's for different policies.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently block core calls directly into blk-throttle for init, drain
and exit. This patch adds blkcg_{init|drain|exit}_queue() which wraps
the blk-throttle functions. This is to give more control and
visiblity to blkcg core layer for proper layering. Further patches
will add logic common to blkcg policies to the functions.
While at it, collapse blk_throtl_release() into blk_throtl_exit().
There's no reason to keep them separate.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently, blkg points to the associated blkcg via its css_id. This
unnecessarily complicates dereferencing blkcg. Let blkg hold a
reference to the associated blkcg and point directly to it and disable
css_id on blkio_subsys.
This change requires splitting blkiocg_destroy() into
blkiocg_pre_destroy() and blkiocg_destroy() so that all blkg's can be
destroyed and all the blkcg references held by them dropped during
cgroup removal.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
blkg->dev is dev_t recording the device number of the block device for
the associated request_queue. It is used to identify the associated
block device when printing out configuration or stats.
This is redundant to begin with. A blkg is an association between a
cgroup and a request_queue and it of course is possible to reach
request_queue from blkg and synchronization conventions are in place
for safe q dereferencing, so this shouldn't be necessary from the
beginning. Furthermore, it's initialized by sscanf()ing the device
name of backing_dev_info. The mind boggles.
Anyways, if blkg is visible under rcu lock, we *know* that the
associated request_queue hasn't gone away yet and its bdi is
registered and alive - blkg can't be created for request_queue which
hasn't been fully initialized and it can't go away before blkg is
removed.
Let stat and conf read functions get device name from
blkg->q->backing_dev_info.dev and pass it down to printing functions
and remove blkg->dev.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Now that blkcg configuration lives in blkg's, blkio_policy_node is no
longer necessary. Kill it.
blkio_policy_parse_and_set() now fails if invoked for missing device
and functions to print out configurations are updated to print from
blkg's.
cftype_blkg_same_policy() is dropped along with other policy functions
for consistency. Its one line is open coded in the only user -
blkio_read_blkg_stats().
-v2: Update to reflect the retry-on-bypass logic change of the
previous patch.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
blkcg is very peculiar in that it allows setting and remembering
configurations for non-existent devices by maintaining separate data
structures for configuration.
This behavior is completely out of the usual norms and outright
confusing; furthermore, it uses dev_t number to match the
configuration to devices, which is unpredictable to begin with and
becomes completely unuseable if EXT_DEVT is fully used.
It is wholely unnecessary - we already have fully functional userland
mechanism to program devices being hotplugged which has full access to
device identification, connection topology and filesystem information.
Add a new struct blkio_group_conf which contains all blkcg
configurations to blkio_group and let blkio_group, which can be
created iff the associated device exists and is removed when the
associated device goes away, carry all configurations.
Note that, after this patch, all newly created blkg's will always have
the default configuration (unlimited for throttling and blkcg's weight
for propio).
This patch makes blkio_policy_node meaningless but doesn't remove it.
The next patch will.
-v2: Updated to retry after short sleep if blkg lookup/creation failed
due to the queue being temporarily bypassed as indicated by
-EBUSY return. Pointed out by Vivek.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently both blk-throttle and cfq-iosched implement their own
blkio_group creation code in throtl_get_tg() and cfq_get_cfqg(). This
patch factors out the common code into blkg_lookup_create(), which
returns ERR_PTR value so that transitional failures due to queue
bypass can be distinguished from other failures.
* New plkio_policy_ops methods blkio_alloc_group_fn() and
blkio_link_group_fn added. Both are transitional and will be
removed once the blkg management code is fully moved into
blk-cgroup.c.
* blkio_alloc_group_fn() allocates policy-specific blkg which is
usually a larger data structure with blkg as the first entry and
intiailizes it. Note that initialization of blkg proper, including
percpu stats, is responsibility of blk-cgroup proper.
Note that default config (weight, bps...) initialization is done
from this method; otherwise, we end up violating locking order
between blkcg and q locks via blkcg_get_CONF() functions.
* blkio_link_group_fn() is called under queue_lock and responsible for
linking the blkg to the queue. blkcg side is handled by blk-cgroup
proper.
* The common blkg creation function is named blkg_lookup_create() and
blkiocg_lookup_group() is renamed to blkg_lookup() for consistency.
Also, throtl / cfq related functions are similarly [re]named for
consistency.
This simplifies blkcg policy implementations and enables further
cleanup.
-v2: Vivek noticed that blkg_lookup_create() incorrectly tested
blk_queue_dead() instead of blk_queue_bypass() leading a user of
the function ending up creating a new blkg on bypassing queue.
This is a bug introduced while relocating bypass patches before
this one. Fixed.
-v3: ERR_PTR patch folded into this one. @for_root added to
blkg_lookup_create() to allow creating root group on a bypassed
queue during elevator switch.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Block cgroup policies are maintained in a linked list and,
theoretically, multiple policies sharing the same policy ID are
allowed.
This patch temporarily restricts one policy per plid and adds
blkio_policy[] array which indexes registered policy types by plid.
Both the restriction and blkio_policy[] array are transitional and
will be removed once API cleanup is complete.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
blkgio_group is association between a block cgroup and a queue for a
given policy. Using opaque void * for association makes things
confusing and hinders factoring of common code. Use request_queue *
and, if necessary, policy id instead.
This will help block cgroup API cleanup.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Elevator switch may involve changes to blkcg policies. Implement
shoot down of blkio_groups.
Combined with the previous bypass updates, the end goal is updating
blkcg core such that it can ensure that blkcg's being affected become
quiescent and don't have any per-blkg data hanging around before
commencing any policy updates. Until queues are made aware of the
policies that applies to them, as an interim step, all per-policy blkg
data will be shot down.
* blk-throtl doesn't need this change as it can't be disabled for a
live queue; however, update it anyway as the scheduled blkg
unification requires this behavior change. This means that
blk-throtl configuration will be unnecessarily lost over elevator
switch. This oddity will be removed after blkcg learns to associate
individual policies with request_queues.
* blk-throtl dosen't shoot down root_tg. This is to ease transition.
Unified blkg will always have persistent root group and not shooting
down root_tg for now eases transition to that point by avoiding
having to update td->root_tg and is safe as blk-throtl can never be
disabled
-v2: Vivek pointed out that group list is not guaranteed to be empty
on return from clear function if it raced cgroup removal and
lost. Fix it by waiting a bit and retrying. This kludge will
soon be removed once locking is updated such that blkg is never
in limbo state between blkcg and request_queue locks.
blk-throtl no longer shoots down root_tg to avoid breaking
td->root_tg.
Also, Nest queue_lock inside blkio_list_lock not the other way
around to avoid introduce possible deadlock via blkcg lock.
-v3: blkcg_clear_queue() repositioned and renamed to
blkg_destroy_all() to increase consistency with later changes.
cfq_clear_queue() updated to check q->elevator before
dereferencing it to avoid NULL dereference on not fully
initialized queues (used by later change).
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Block cgroup core can be built as module; however, it isn't too useful
as blk-throttle can only be built-in and cfq-iosched is usually the
default built-in scheduler. Scheduled blkcg cleanup requires calling
into blkcg from block core. To simplify that, disallow building blkcg
as module by making CONFIG_BLK_CGROUP bool.
If building blkcg core as module really matters, which I doubt, we can
revisit it after blkcg API cleanup.
-v2: Vivek pointed out that IOSCHED_CFQ was incorrectly updated to
depend on BLK_CGROUP. Fixed.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Make BLKIO_STAT_MERGED per cpu hence gettring rid of need of taking
blkg->stats_lock.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Some of the stats are 64bit and updation will be non atomic on 32bit
architecture. Use sequence counters on 32bit arch to make reading
of stats safe.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Currently we take blkg_stat lock for even updating the stats. So even if
a group has no throttling rules (common case for root group), we end
up taking blkg_lock, for updating the stats.
Make dispatch stats per cpu so that these can be updated without taking
blkg lock.
If cpu goes offline, these stats simply disappear. No protection has
been provided for that yet. Do we really need anything for that?
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
cgroup unaccounted_time file is created only if CONFIG_DEBUG_BLK_CGROUP=y.
there are some fields which are out side this config option. Fix that.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Currentlly we first map the task to cgroup and then cgroup to
blkio_cgroup. There is a more direct way to get to blkio_cgroup
from task using task_subsys_state(). Use that.
The real reason for the fix is that it also avoids a race in generic
cgroup code. During remount/umount rebind_subsystems() is called and
it can do following with and rcu protection.
cgrp->subsys[i] = NULL;
That means if somebody got hold of cgroup under rcu and then it tried
to do cgroup->subsys[] to get to blkio_cgroup, it would get NULL which
is wrong. I was running into this race condition with ltp running on a
upstream derived kernel and that lead to crash.
So ideally we should also fix cgroup generic code to wait for rcu
grace period before setting pointer to NULL. Li Zefan is not very keen
on introducing synchronize_wait() as he thinks it will slow
down moun/remount/umount operations.
So for the time being atleast fix the kernel crash by taking a more
direct route to blkio_cgroup.
One tester had reported a crash while running LTP on a derived kernel
and with this fix crash is no more seen while the test has been
running for over 6 days.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
There are two kind of times that tasks are not charged for: the first
seek and the extra time slice used over the allocated timeslice. Both
of these exported as a new unaccounted_time stat.
I think it would be good to have this reported in 'time' as well, but
that is probably a separate discussion.
Signed-off-by: Justin TerAvest <teravest@google.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
We've found that we still get good, useful isolation at weights this
low. I'd like to adjust the minimum so that any other changes can take
these values into account.
Signed-off-by: Justin TerAvest <teravest@google.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
- Limit max iops value to UINT_MAX and return error to user if value is more
than that instead of accepting bigger values and truncating implicitly.
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
o Currently any cgroup throttle limit changes are processed asynchronousy and
the change does not take affect till a new bio is dispatched from same group.
o It might happen that a user sets a redicuously low limit on throttling.
Say 1 bytes per second on reads. In such cases simple operations like mount
a disk can wait for a very long time.
o Once bio is throttled, there is no easy way to come out of that wait even if
user increases the read limit later.
o This patch fixes it. Now if a user changes the cgroup limits, we recalculate
the bio dispatch time according to new limits.
o Can't take queueu lock under blkcg_lock, hence after the change I wake
up the dispatch thread again which recalculates the time. So there are some
variables being synchronized across two threads without lock and I had to
make use of barriers. Hoping I have used barriers correctly. Any review of
memory barrier code especially will help.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
o cgroup chagnes for throttle policy.
o Introduces READ and WRITE bytes per second throttling rules.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
o This patch prepares the base for introducing new IO control policies.
Currently all the code is written knowing there is only one policy
and that is proportional bandwidth. Creating infrastructure for newer
policies to come in.
o Also there were many functions which were generated using macro. It was
very confusing. Got rid of those.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
This patch fixes few usability and configurability issues.
o All the cgroup based controller options are configurable from
"Genral Setup/Control Group Support/" menu. blkio is the only exception.
Hence make this option visible in above menu and make it configurable from
there to bring it inline with rest of the cgroup based controllers.
o Get rid of CONFIG_DEBUG_CFQ_IOSCHED.
This option currently does two things.
- Enable printing of cgroup paths in blktrace
- Enables CONFIG_DEBUG_BLK_CGROUP, which in turn displays additional stat
files in cgroup.
If we are using group scheduling, blktrace data is of not really much use
if cgroup information is not present. To get this data, currently one has to
also enable CONFIG_DEBUG_CFQ_IOSCHED, which in turn brings the overhead of
all the additional debug stat files which is not desired.
Hence, this patch moves printing of cgroup paths under
CONFIG_CFQ_GROUP_IOSCHED.
This allows us to get rid of CONFIG_DEBUG_CFQ_IOSCHED completely. Now all
the debug stat files are controlled only by CONFIG_DEBUG_BLK_CGROUP which
can be enabled through config menu.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Acked-by: Divyesh Shah <dpshah@google.com>
Reviewed-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Changelog from v1:
o Call blkiocg_update_idle_time_stats() at cfq_rq_enqueued() instead of at
dispatch time.
Changelog from original patchset: (in response to Vivek Goyal's comments)
o group blkiocg_update_blkio_group_dequeue_stats() with other DEBUG functions
o rename blkiocg_update_set_active_queue_stats() to
blkiocg_update_avg_queue_size_stats()
o s/request/io/ in blkiocg_update_request_add_stats() and
blkiocg_update_request_remove_stats()
o Call cfq_del_timer() at request dispatch() instead of
blkiocg_update_idle_time_stats()
Signed-off-by: Divyesh Shah<dpshah@google.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Currently, IO Controller makes use of blkio.weight to assign weight for
all devices. Here a new user interface "blkio.weight_device" is introduced to
assign different weights for different devices. blkio.weight becomes the
default value for devices which are not configured by "blkio.weight_device"
You can use the following format to assigned specific weight for a given
device:
#echo "major:minor weight" > blkio.weight_device
major:minor represents device number.
And you can remove weight for a given device as following:
#echo "major:minor 0" > blkio.weight_device
V1->V2 changes:
- use user interface "weight_device" instead of "policy" suggested by Vivek
- rename some struct suggested by Vivek
- rebase to 2.6-block "for-linus" branch
- remove an useless list_empty check pointed out by Li Zefan
- some trivial typo fix
V2->V3 changes:
- Move policy_*_node() functions up to get rid of forward declarations
- rename related functions by adding prefix "blkio_"
Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
1) group_wait_time - This is the amount of time the cgroup had to wait to get a
timeslice for one of its queues from when it became busy, i.e., went from 0
to 1 request queued. This is different from the io_wait_time which is the
cumulative total of the amount of time spent by each IO in that cgroup waiting
in the scheduler queue. This stat is a great way to find out any jobs in the
fleet that are being starved or waiting for longer than what is expected (due
to an IO controller bug or any other issue).
2) empty_time - This is the amount of time a cgroup spends w/o any pending
requests. This stat is useful when a job does not seem to be able to use its
assigned disk share by helping check if that is happening due to an IO
controller bug or because the job is not submitting enough IOs.
3) idle_time - This is the amount of time spent by the IO scheduler idling
for a given cgroup in anticipation of a better request than the exising ones
from other queues/cgroups.
All these stats are recorded using start and stop events. When reading these
stats, we do not add the delta between the current time and the last start time
if we're between the start and stop events. We avoid doing this to make sure
that these numbers are always monotonically increasing when read. Since we're
using sched_clock() which may use the tsc as its source, it may induce some
inconsistency (due to tsc resync across cpus) if we included the current delta.
Signed-off-by: Divyesh Shah<dpshah@google.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
These stats are useful for getting a feel for the queue depth of the cgroup,
i.e., how filled up its queues are at a given instant and over the existence of
the cgroup. This ability is useful when debugging problems in the wild as it
helps understand the application's IO pattern w/o having to read through the
userspace code (coz its tedious or just not available) or w/o the ability
to run blktrace (since you may not have root access and/or not want to disturb
performance).
Signed-off-by: Divyesh Shah<dpshah@google.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
This includes both the number of bios merged into requests belonging to this
cgroup as well as the number of requests merged together.
In the past, we've observed different merging behavior across upstream kernels,
some by design some actual bugs. This stat helps a lot in debugging such
problems when applications report decreased throughput with a new kernel
version.
This needed adding an extra elevator function to capture bios being merged as I
did not want to pollute elevator code with blkiocg knowledge and hence needed
the accounting invocation to come from CFQ.
Signed-off-by: Divyesh Shah<dpshah@google.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
that include some minor fixes and addresses all comments.
Changelog: (most based on Vivek Goyal's comments)
o renamed blkiocg_reset_write to blkiocg_reset_stats
o more clarification in the documentation on io_service_time and io_wait_time
o Initialize blkg->stats_lock
o rename io_add_stat to blkio_add_stat and declare it static
o use bool for direction and sync
o derive direction and sync info from existing rq methods
o use 12 for major:minor string length
o define io_service_time better to cover the NCQ case
o add a separate reset_stats interface
o make the indexed stats a 2d array to simplify macro and function pointer code
o blkio.time now exports in jiffies as before
o Added stats description in patch description and
Documentation/cgroup/blkio-controller.txt
o Prefix all stats functions with blkio and make them static as applicable
o replace IO_TYPE_MAX with IO_TYPE_TOTAL
o Moved #define constant to top of blk-cgroup.c
o Pass dev_t around instead of char *
o Add note to documentation file about resetting stats
o use BLK_CGROUP_MODULE in addition to BLK_CGROUP config option in #ifdef
statements
o Avoid struct request specific knowledge in blk-cgroup. blk-cgroup.h now has
rq_direction() and rq_sync() functions which are used by CFQ and when using
io-controller at a higher level, bio_* functions can be added.
Signed-off-by: Divyesh Shah<dpshah@google.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
We also add start_time_ns and io_start_time_ns fields to struct request
here to record the time when a request is created and when it is
dispatched to device. We use ns uints here as ms and jiffies are
not very useful for non-rotational media.
Signed-off-by: Divyesh Shah<dpshah@google.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
- io_service_time
- io_wait_time
- io_serviced
- io_service_bytes
These stats are accumulated per operation type helping us to distinguish between
read and write, and sync and async IO. This patch does not increment any of
these stats.
Signed-off-by: Divyesh Shah<dpshah@google.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
that info at request dispatch with other stats now. This patch removes the
existing support for accounting sectors for a blkio_group. This will be added
back differently in the next two patches.
Signed-off-by: Divyesh Shah<dpshah@google.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Modify the Block I/O cgroup subsystem to be able to be built as a module.
As the CFQ disk scheduler optionally depends on blk-cgroup, config options
in block/Kconfig, block/Kconfig.iosched, and block/blk-cgroup.h are
enhanced to support the new module dependency.
Signed-off-by: Ben Blum <bblum@andrew.cmu.edu>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There's no need to take css reference here, for the caller
has already called rcu_read_lock() to prevent cgroup from
being removed.
Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
o One of the goals of block IO controller is that it should be able to
support mulitple io control policies, some of which be operational at
higher level in storage hierarchy.
o To begin with, we had one io controlling policy implemented by CFQ, and
I hard coded the CFQ functions called by blkio. This created issues when
CFQ is compiled as module.
o This patch implements a basic dynamic io controlling policy registration
functionality in blkio. This is similar to elevator functionality where
ioschedulers register the functions dynamically.
o Now in future, when more IO controlling policies are implemented, these
can dynakically register with block IO controller.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
o blkio controller is inside the kernel and cfq makes use of interfaces
exported by blkio. CFQ can be a module too, hence export symbols used
by CFQ.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
o Export disk time and sector used by a group to user space through cgroup
interface.
o Also export a "dequeue" interface to cgroup which keeps track of how many
a times a group was deleted from service tree. Helps in debugging.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
o One can choose to change elevator or delete a cgroup. Implement group
reference counting so that both elevator exit and cgroup deletion can
take place gracefully.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Nauman Rafique <nauman@google.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
o This is basic implementation of blkio controller cgroup interface. This is
the common interface visible to user space and should be used by different
IO control policies as we implement those.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>