Commit Graph

2245 Commits

Author SHA1 Message Date
Tejun Heo b276a876a0 blkcg: add blkg_policy_data->plid
Add pd->plid so that the policy a pd belongs to can be identified
easily.  This will be used to implement hierarchical blkg_[rw]stats.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
2013-01-09 08:05:12 -08:00
Tejun Heo d02f7aa8dc cfq-iosched: enable full blkcg hierarchy support
With the previous two patches, all cfqg scheduling decisions are based
on vfraction and ready for hierarchy support.  The only thing which
keeps the behavior flat is cfqg_flat_parent() which makes vfraction
calculation consider all non-root cfqgs children of the root cfqg.

Replace it with cfqg_parent() which returns the real parent.  This
enables full blkcg hierarchy support for cfq-iosched.  For example,
consider the following hierarchy.

        root
      /      \
   A:500      B:250
  /     \
 AA:500  AB:1000

For simplicity, let's say all the leaf nodes have active tasks and are
on service tree.  For each leaf node, vfraction would be

 AA: (500  / 1500) * (500 / 750) =~ 0.2222
 AB: (1000 / 1500) * (500 / 750) =~ 0.4444
  B:                 (250 / 750) =~ 0.3333

and vdisktime will be distributed accordingly.  For more detail,
please refer to Documentation/block/cfq-iosched.txt.

v2: cfq-iosched.txt updated to describe group scheduling as suggested
    by Vivek.

v3: blkio-controller.txt updated.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
2013-01-09 08:05:11 -08:00
Tejun Heo 41cad6ab2c cfq-iosched: convert cfq_group_slice() to use cfqg->vfraction
cfq_group_slice() calculates slice by taking a fraction of
cfq_target_latency according to the ratio of cfqg->weight against
service_tree->total_weight.  This currently works only because all
cfqgs are treated to be at the same level.

To prepare for proper hierarchy support, convert cfq_group_slice() to
base the calculation on cfqg->vfraction.  As cfqg->vfraction is always
a fraction of 1 and represents the fraction allocated to the cfqg with
hierarchy considered, the slice can be simply calculated by
multiplying cfqg->vfraction to cfq_target_latency (with fixed point
shift factored in).

As vfraction calculation currently treats all non-root cfqgs as
children of the root cfqg, this patch doesn't introduce noticeable
behavior difference.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
2013-01-09 08:05:11 -08:00
Tejun Heo 1d3650f713 cfq-iosched: implement hierarchy-ready cfq_group charge scaling
Currently, cfqg charges are scaled directly according to cfqg->weight.
Regardless of the number of active cfqgs or the amount of active
weights, a given weight value always scales charge the same way.  This
works fine as long as all cfqgs are treated equally regardless of
their positions in the hierarchy, which is what cfq currently
implements.  It can't work in hierarchical settings because the
interpretation of a given weight value depends on where the weight is
located in the hierarchy.

This patch reimplements cfqg charge scaling so that it can be used to
support hierarchy properly.  The scheme is fairly simple and
light-weight.

* When a cfqg is added to the service tree, v(disktime)weight is
  calculated.  It walks up the tree to root calculating the fraction
  it has in the hierarchy.  At each level, the fraction can be
  calculated as

    cfqg->weight / parent->level_weight

  By compounding these, the global fraction of vdisktime the cfqg has
  claim to - vfraction - can be determined.

* When the cfqg needs to be charged, the charge is scaled inversely
  proportionally to the vfraction.

The new scaling scheme uses the same CFQ_SERVICE_SHIFT for fixed point
representation as before; however, the smallest scaling factor is now
1 (ie. 1 << CFQ_SERVICE_SHIFT).  This is different from before where 1
was for CFQ_WEIGHT_DEFAULT and higher weight would result in smaller
scaling factor.

While this shifts the global scale of vdisktime a bit, it doesn't
change the relative relationships among cfqgs and the scheduling
result isn't different.

cfq_group_notify_queue_add uses fixed CFQ_IDLE_DELAY when appending
new cfqg to the service tree.  The specific value of CFQ_IDLE_DELAY
didn't have any relevance to vdisktime before and is unlikely to cause
any visible behavior difference now especially as the scale shift
isn't that large.

As the new scheme now makes proper distinction between cfqg->weight
and ->leaf_weight, reverse the weight aliasing for root cfqgs.  For
root, both weights are now mapped to ->leaf_weight instead of the
other way around.

Because we're still using cfqg_flat_parent(), this patch shouldn't
change the scheduling behavior in any noticeable way.

v2: Beefed up comments on vfraction as requested by Vivek.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
2013-01-09 08:05:11 -08:00
Tejun Heo 7918ffb5b8 cfq-iosched: implement cfq_group->nr_active and ->children_weight
To prepare for blkcg hierarchy support, add cfqg->nr_active and
->children_weight.  cfqg->nr_active counts the number of active cfqgs
at the cfqg's level and ->children_weight is sum of weights of those
cfqgs.  The level covers itself (cfqg->leaf_weight) and immediate
children.

The two values are updated when a cfqg enters and leaves the group
service tree.  Unless the hierarchy is very deep, the added overhead
should be negligible.

Currently, the parent is determined using cfqg_flat_parent() which
makes the root cfqg the parent of all other cfqgs.  This is to make
the transition to hierarchy-aware scheduling gradual.  Scheduling
logic will be converted to use cfqg->children_weight without actually
changing the behavior.  When everything is ready,
blkcg_weight_parent() will be replaced with proper parent function.

This patch doesn't introduce any behavior chagne.

v2: s/cfqg->level_weight/cfqg->children_weight/ as per Vivek.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
2013-01-09 08:05:11 -08:00
Tejun Heo e71357e118 cfq-iosched: add leaf_weight
cfq blkcg is about to grow proper hierarchy handling, where a child
blkg's weight would nest inside the parent's.  This makes tasks in a
blkg to compete against both tasks in the sibling blkgs and the tasks
of child blkgs.

We're gonna use the existing weight as the group weight which decides
the blkg's weight against its siblings.  This patch introduces a new
weight - leaf_weight - which decides the weight of a blkg against the
child blkgs.

It's named leaf_weight because another way to look at it is that each
internal blkg nodes have a hidden child leaf node which contains all
its tasks and leaf_weight is the weight of the leaf node and handled
the same as the weight of the child blkgs.

This patch only adds leaf_weight fields and exposes it to userland.
The new weight isn't actually used anywhere yet.  Note that
cfq-iosched currently offcially supports only single level hierarchy
and root blkgs compete with the first level blkgs - ie. root weight is
basically being used as leaf_weight.  For root blkgs, the two weights
are kept in sync for backward compatibility.

v2: cfqd->root_group->leaf_weight initialization was missing from
    cfq_init_queue() causing divide by zero when
    !CONFIG_CFQ_GROUP_SCHED.  Fix it.  Reported by Fengguang.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Fengguang Wu <fengguang.wu@intel.com>
2013-01-09 08:05:10 -08:00
Tejun Heo 3c54786590 blkcg: make blkcg_gq's hierarchical
Currently a child blkg (blkcg_gq) can be created even if its parent
doesn't exist.  ie. Given a blkg, it's not guaranteed that its
ancestors will exist.  This makes it difficult to implement proper
hierarchy support for blkcg policies.

Always create blkgs recursively and make a child blkg hold a reference
to its parent.  blkg->parent is added so that finding the parent is
easy.  blkcg_parent() is also added in the process.

This change can be visible to userland.  e.g. while issuing IO in a
nested cgroup didn't affect the ancestors at all, now it will
initialize all ancestor blkgs and zero stats for the request_queue
will always appear on them.  While this is userland visible, this
shouldn't cause any functional difference.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
2013-01-09 08:05:10 -08:00
Tejun Heo 93e6d5d8f5 blkcg: cosmetic updates to blkg_create()
* Rename out_* labels to err_*.

* Do ERR_PTR() conversion once in the error return path.

This patch is cosmetic and to prepare for the hierarchy support.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
2013-01-09 08:05:10 -08:00
Tejun Heo 86cde6b623 blkcg: reorganize blkg_lookup_create() and friends
Reorganize such that

* __blkg_lookup() takes bool param @update_hint to determine whether
  to update hint.

* __blkg_lookup_create() no longer performs lookup before trying to
  create.  Renamed to blkg_create().

* blkg_lookup_create() now performs lookup and then invokes
  blkg_create() if lookup fails.

* root_blkg creation in blkcg_activate_policy() updated accordingly.
  Note that blkcg_activate_policy() no longer updates lookup hint if
  root_blkg already exists.

Except for the last lookup hint bit which is immaterial, this is pure
reorganization and doesn't introduce any visible behavior change.
This is to prepare for proper hierarchy support.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
2013-01-09 08:05:10 -08:00
Tejun Heo 356d2e5810 blkcg: fix minor bug in blkg_alloc()
blkg_alloc() was mistakenly checking blkcg_policy_enabled() twice.
The latter test should have been on whether pol->pd_init_fn() exists.
This doesn't cause actual problems because both blkcg policies
implement pol->pd_init_fn().  Fix it.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
2013-01-09 08:05:10 -08:00
Vivek Goyal b226e5c411 cfq-iosched: Print sync-noidle information in blktrace messages
Currently we attach a character "S" or "A" to the cfqq<pid>, to represent
whether queues is sync or async. Add one more character "N" to represent
whether it is sync-noidle queue or sync queue. So now three different
type of queues will look as follows.

cfq1234S   --> sync queus
cfq1234SN  --> sync noidle queue
cfq1234A   --> Async queue

Previously S/A classification was being printed only if group scheduling
was enabled. This patch also makes sure that this classification is
displayed even if group idling is disabled.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2013-01-09 08:05:09 -08:00
Vivek Goyal 1f23f12151 cfq-iosched: Get rid of unnecessary local variable
Use of local varibale "n" seems to be unnecessary. Remove it. This brings
it inline with function __cfq_group_st_add(), which is also doing the
similar operation of adding a group to a rb tree.

No functionality change here.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2013-01-09 08:05:09 -08:00
Vivek Goyal 6d816ec7c8 cfq-iosched: Rename few functions related to selecting workload
choose_service_tree() selects/sets both wl_class and wl_type.  Rename it to
choose_wl_class_and_type() to make it very clear.

cfq_choose_wl() only selects and sets wl_type. It is easy to confuse
it with choose_st(). So rename it to cfq_choose_wl_type() to make
it clear what does it do.

Just renaming. No functionality change.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2013-01-09 08:05:09 -08:00
Vivek Goyal 34b98d03bd cfq-iosched: Rename "service_tree" to "st" at some places
At quite a few places we use the keyword "service_tree". At some places,
especially local variables, I have abbreviated it to "st".

Also at couple of places moved binary operator "+" from beginning of line
to end of previous line, as per Tejun's feedback.

v2:
 Reverted most of the service tree name change based on Jeff Moyer's feedback.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2013-01-09 08:05:09 -08:00
Vivek Goyal 4d2ceea4cb cfq-iosched: More renaming to better represent wl_class and wl_type
Some more renaming. Again making the code uniform w.r.t use of
wl_class/class to represent IO class (RT, BE, IDLE) and using
wl_type/type to represent subclass (SYNC, SYNC-IDLE, ASYNC).

At places this patch shortens the string "workload" to "wl".
Renamed "saved_workload" to "saved_wl_type". Renamed
"saved_serving_class" to "saved_wl_class".

For uniformity with "saved_wl_*" variables, renamed "serving_class"
to "serving_wl_class" and renamed "serving_type" to "serving_wl_type".

Again, just trying to improve upon code uniformity and improve
readability. No functional change.

v2:
- Restored the usage of keyword "service" based on Jeff Moyer's feedback.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2013-01-09 08:05:09 -08:00
Vivek Goyal 3bf10fea3b cfq-iosched: Properly name all references to IO class
Currently CFQ has three IO classes, RT, BE and IDLE. At many a places we
are calling workloads belonging to these classes as "prio". This gets
very confusing as one starts to associate it with ioprio.

So this patch just does bunch of renaming so that reading code becomes
easier. All reference to RT, BE and IDLE workload are done using keyword
"class" and all references to subclass, SYNC, SYNC-IDLE, ASYNC are made
using keyword "type".

This makes me feel much better while I am reading the code. There is no
functionality change due to this patch.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
2013-01-09 08:05:08 -08:00
Derek Basehore 12c2bdb232 block: prevent race/cleanup
Remove a race condition which causes a warning in disk_clear_events.  This
is a race between disk_clear_events() and disk_flush_events().
ev->clearing will be altered by disk_flush_events() even though we are
blocking event checking through disk_flush_events().  If this happens
after ev->clearing was cleared for disk_clear_events(), this can cause the
WARN_ON_ONCE() in that function to be triggered.

This change also has disk_clear_events() not go through a workqueue.
Since we have to wait for the work to complete, we should just call the
function directly.  Also, since this work cannot be put on a freezable
workqueue, it will have to contend with increased demand, so calling the
function directly avoids this.

[akpm@linux-foundation.org: fix spello in comment]
Signed-off-by: Derek Basehore <dbasehore@chromium.org>
Cc: Mandeep Singh Baines <msb@chromium.org>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-12-19 20:36:10 +01:00
Derek Basehore aea24a8bbc block: remove deadlock in disk_clear_events
In disk_clear_events, do not put work on system_nrt_freezable_wq.
Instead, put it on system_nrt_wq.

There is a race between probing a usb and suspending the device.  Since
probing a usb calls disk_clear_events, which puts work on a frozen
workqueue, probing cannot finish after the workqueue is frozen.  However,
suspending cannot finish until the usb probe is finished, so we get a
deadlock, causing the system to reboot.

The way to reproduce this bug is to wake up from suspend with a usb
storage device plugged in, or plugging in a usb storage device right
before suspend.  The window of time is on the order of time it takes to
probe the usb device.  As long as the workqueues are frozen before the
call to add_disk within sd_probe_async finishes, there will be a deadlock
(which calls blkdev_get, sd_open, check_disk_change, then
disk_clear_events).  This is not difficult to reproduce after figuring out
the timings.

[akpm@linux-foundation.org: fix up comment]
Signed-off-by: Derek Basehore <dbasehore@chromium.org>
Reviewed-by: Mandeep Singh Baines <msb@chromium.org>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-12-19 20:36:05 +01:00
Linus Torvalds 848b81415c Merge branch 'akpm' (Andrew's patch-bomb)
Merge misc patches from Andrew Morton:
 "Incoming:

   - lots of misc stuff

   - backlight tree updates

   - lib/ updates

   - Oleg's percpu-rwsem changes

   - checkpatch

   - rtc

   - aoe

   - more checkpoint/restart support

  I still have a pile of MM stuff pending - Pekka should be merging
  later today after which that is good to go.  A number of other things
  are twiddling thumbs awaiting maintainer merges."

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (180 commits)
  scatterlist: don't BUG when we can trivially return a proper error.
  docs: update documentation about /proc/<pid>/fdinfo/<fd> fanotify output
  fs, fanotify: add @mflags field to fanotify output
  docs: add documentation about /proc/<pid>/fdinfo/<fd> output
  fs, notify: add procfs fdinfo helper
  fs, exportfs: add exportfs_encode_inode_fh() helper
  fs, exportfs: escape nil dereference if no s_export_op present
  fs, epoll: add procfs fdinfo helper
  fs, eventfd: add procfs fdinfo helper
  procfs: add ability to plug in auxiliary fdinfo providers
  tools/testing/selftests/kcmp/kcmp_test.c: print reason for failure in kcmp_test
  breakpoint selftests: print failure status instead of cause make error
  kcmp selftests: print fail status instead of cause make error
  kcmp selftests: make run_tests fix
  mem-hotplug selftests: print failure status instead of cause make error
  cpu-hotplug selftests: print failure status instead of cause make error
  mqueue selftests: print failure status instead of cause make error
  vm selftests: print failure status instead of cause make error
  ubifs: use prandom_bytes
  mtd: nandsim: use prandom_bytes
  ...
2012-12-17 20:58:12 -08:00
Oleg Nesterov 22b361d1df percpu_rw_semaphore: introduce CONFIG_PERCPU_RWSEM
Currently only block_dev and uprobes use percpu_rw_semaphore,
add the config option selected by BLOCK || UPROBES.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Anton Arapov <anton@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michal Marek <mmarek@suse.cz>
Cc: Mikulas Patocka <mpatocka@redhat.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-17 17:15:18 -08:00
Linus Torvalds 9228ff9038 Merge branch 'for-3.8/drivers' of git://git.kernel.dk/linux-block
Pull block driver update from Jens Axboe:
 "Now that the core bits are in, here are the driver bits for 3.8.  The
  branch contains:

   - A huge pile of drbd bits that were dumped from the 3.7 merge
     window.  Following that, it was both made perfectly clear that
     there is going to be no more over-the-wall pulls and how the
     situation on individual pulls can be improved.

   - A few cleanups from Akinobu Mita for drbd and cciss.

   - Queue improvement for loop from Lukas.  This grew into adding a
     generic interface for waiting/checking an even with a specific
     lock, allowing this to be pulled out of md and now loop and drbd is
     also using it.

   - A few fixes for xen back/front block driver from Roger Pau Monne.

   - Partition improvements from Stephen Warren, allowing partiion UUID
     to be used as an identifier."

* 'for-3.8/drivers' of git://git.kernel.dk/linux-block: (609 commits)
  drbd: update Kconfig to match current dependencies
  drbd: Fix drbdsetup wait-connect, wait-sync etc... commands
  drbd: close race between drbd_set_role and drbd_connect
  drbd: respect no-md-barriers setting also when changed online via disk-options
  drbd: Remove obsolete check
  drbd: fixup after wait_even_lock_irq() addition to generic code
  loop: Limit the number of requests in the bio list
  wait: add wait_event_lock_irq() interface
  xen-blkfront: free allocated page
  xen-blkback: move free persistent grants code
  block: partition: msdos: provide UUIDs for partitions
  init: reduce PARTUUID min length to 1 from 36
  block: store partition_meta_info.uuid as a string
  cciss: use check_signature()
  cciss: cleanup bitops usage
  drbd: use copy_highpage
  drbd: if the replication link breaks during handshake, keep retrying
  drbd: check return of kmalloc in receive_uuids
  drbd: Broadcast sync progress no more often than once per second
  drbd: don't try to clear bits once the disk has failed
  ...
2012-12-17 13:39:11 -08:00
Linus Torvalds 60da5bf47d Merge branch 'for-3.8/core' of git://git.kernel.dk/linux-block
Pull block layer core updates from Jens Axboe:
 "Here are the core block IO bits for 3.8.  The branch contains:

   - The final version of the surprise device removal fixups from Bart.

   - Don't hide EFI partitions under advanced partition types.  It's
     fairly wide spread these days.  This is especially dangerous for
     systems that have both msdos and efi partition tables, where you
     want to keep them in sync.

   - Cleanup of using -1 instead of the proper NUMA_NO_NODE

   - Export control of bdi flusher thread CPU mask and default to using
     the home node (if known) from Jeff.

   - Export unplug tracepoint for MD.

   - Core improvements from Shaohua.  Reinstate the recursive merge, as
     the original bug has been fixed.  Add plugging for discard and also
     fix a problem handling non pow-of-2 discard limits.

  There's a trivial merge in block/blk-exec.c due to a fix that went
  into 3.7-rc at a later point than -rc4 where this is based."

* 'for-3.8/core' of git://git.kernel.dk/linux-block:
  block: export block_unplug tracepoint
  block: add plug for blkdev_issue_discard
  block: discard granularity might not be power of 2
  deadline: Allow 0ms deadline latency, increase the read speed
  partitions: enable EFI/GPT support by default
  bsg: Remove unused function bsg_goose_queue()
  block: Make blk_cleanup_queue() wait until request_fn finished
  block: Avoid scheduling delayed work on a dead queue
  block: Avoid that request_fn is invoked on a dead queue
  block: Let blk_drain_queue() caller obtain the queue lock
  block: Rename queue dead flag
  bdi: add a user-tunable cpu_list for the bdi flusher threads
  block: use NUMA_NO_NODE instead of -1
  block: recursive merge requests
  block CFQ: avoid moving request to different queue
2012-12-17 08:27:23 -08:00
NeilBrown cbae8d45d6 block: export block_unplug tracepoint
This allows stacked devices (like md/raid5) to provide blktrace
tracing, including unplug events.

Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-12-14 20:49:27 +01:00
Shaohua Li 0cfbcafcae block: add plug for blkdev_issue_discard
Last post of this patch appears lost, so I resend this.

Now discard merge works, add plug for blkdev_issue_discard. This will help
discard request merge especially for raid0 case. In raid0, a big discard
request is split to small requests, and if correct plug is added, such small
requests can be merged in low layer.

Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-12-14 20:46:04 +01:00
Shaohua Li 8dd2cb7e88 block: discard granularity might not be power of 2
In MD raid case, discard granularity might not be power of 2, for example, a
4-disk raid5 has 3*chunk_size discard granularity. Correct the calculation for
such cases.

Reported-by: Neil Brown <neilb@suse.de>
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-12-14 20:46:04 +01:00
Linus Torvalds d206e09036 Merge branch 'for-3.8' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup changes from Tejun Heo:
 "A lot of activities on cgroup side.  The big changes are focused on
  making cgroup hierarchy handling saner.

   - cgroup_rmdir() had peculiar semantics - it allowed cgroup
     destruction to be vetoed by individual controllers and tried to
     drain refcnt synchronously.  The vetoing never worked properly and
     caused good deal of contortions in cgroup.  memcg was the last
     reamining user.  Michal Hocko removed the usage and cgroup_rmdir()
     path has been simplified significantly.  This was done in a
     separate branch so that the memcg people can base further memcg
     changes on top.

   - The above allowed cleaning up cgroup lifecycle management and
     implementation of generic cgroup iterators which are used to
     improve hierarchy support.

   - cgroup_freezer updated to allow migration in and out of a frozen
     cgroup and handle hierarchy.  If a cgroup is frozen, all descendant
     cgroups are frozen.

   - netcls_cgroup and netprio_cgroup updated to handle hierarchy
     properly.

   - Various fixes and cleanups.

   - Two merge commits.  One to pull in memcg and rmdir cleanups (needed
     to build iterators).  The other pulled in cgroup/for-3.7-fixes for
     device_cgroup fixes so that further device_cgroup patches can be
     stacked on top."

Fixed up a trivial conflict in mm/memcontrol.c as per Tejun (due to
commit bea8c150a7 ("memcg: fix hotplugged memory zone oops") in master
touching code close to commit 2ef37d3fe4 ("memcg: Simplify
mem_cgroup_force_empty_list error handling") in for-3.8)

* 'for-3.8' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (65 commits)
  cgroup: update Documentation/cgroups/00-INDEX
  cgroup_rm_file: don't delete the uncreated files
  cgroup: remove subsystem files when remounting cgroup
  cgroup: use cgroup_addrm_files() in cgroup_clear_directory()
  cgroup: warn about broken hierarchies only after css_online
  cgroup: list_del_init() on removed events
  cgroup: fix lockdep warning for event_control
  cgroup: move list add after list head initilization
  netprio_cgroup: allow nesting and inherit config on cgroup creation
  netprio_cgroup: implement netprio[_set]_prio() helpers
  netprio_cgroup: use cgroup->id instead of cgroup_netprio_state->prioidx
  netprio_cgroup: reimplement priomap expansion
  netprio_cgroup: shorten variable names in extend_netdev_table()
  netprio_cgroup: simplify write_priomap()
  netcls_cgroup: move config inheritance to ->css_online() and remove .broken_hierarchy marking
  cgroup: remove obsolete guarantee from cgroup_task_migrate.
  cgroup: add cgroup->id
  cgroup, cpuset: remove cgroup_subsys->post_clone()
  cgroup: s/CGRP_CLONE_CHILDREN/CGRP_CPUSET_CLONE_CHILDREN/
  cgroup: rename ->create/post_create/pre_destroy/destroy() to ->css_alloc/online/offline/free()
  ...
2012-12-12 08:18:24 -08:00
xiaobing tu 75274551c8 deadline: Allow 0ms deadline latency, increase the read speed
Change a timer compare from after to after-equals, thus allowing
0 timeout and making deadline schedule FIFO.

Signed-off-by: xiaobing tu <xiaobing.tu@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-12-09 19:19:23 +01:00
Diego Calleja 5f6f38dbb0 partitions: enable EFI/GPT support by default
The Kconfig currently enables MSDOS partitions by default because they
are assumed to be essential, but it's necessary to enable "advanced
partition selection" in order to get GPT support. IMO GPT partitions
are becoming common enought to deserve the same treatment MSDOS
partitions get.

(Side note: I got bit by a disk that had MSDOS and GPT partition
tables, but for some reason the MSDOS table was different from the
GPT one. I was stupid enought to disable "advanced partition
selection" in my .config, which disabled GPT partitioning and made
my btrfs pool unbootable because it couldn't find the partitions)

Signed-off-by: Diego Calleja <diegocg@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-12-06 14:34:58 +01:00
Bart Van Assche 80729beb33 bsg: Remove unused function bsg_goose_queue()
The function bsg_goose_queue() does not have any in-tree callers,
so let's remove it.

Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-12-06 14:33:02 +01:00
Bart Van Assche 24faf6f604 block: Make blk_cleanup_queue() wait until request_fn finished
Some request_fn implementations, e.g. scsi_request_fn(), unlock
the queue lock internally. This may result in multiple threads
executing request_fn for the same queue simultaneously. Keep
track of the number of active request_fn calls and make sure that
blk_cleanup_queue() waits until all active request_fn invocations
have finished. A block driver may start cleaning up resources
needed by its request_fn as soon as blk_cleanup_queue() finished,
so blk_cleanup_queue() must wait for all outstanding request_fn
invocations to finish.

Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reported-by: Chanho Min <chanho.min@lge.com>
Cc: James Bottomley <JBottomley@Parallels.com>
Cc: Mike Christie <michaelc@cs.wisc.edu>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-12-06 14:33:00 +01:00
Bart Van Assche 704605711e block: Avoid scheduling delayed work on a dead queue
Running a queue must continue after it has been marked dying until
it has been marked dead. So the function blk_run_queue_async() must
not schedule delayed work after blk_cleanup_queue() has marked a queue
dead. Hence add a test for that queue state in blk_run_queue_async()
and make sure that queue_unplugged() invokes that function with the
queue lock held. This avoids that the queue state can change after
it has been tested and before mod_delayed_work() is invoked. Drop
the queue dying test in queue_unplugged() since it is now
superfluous: __blk_run_queue() already tests whether or not the
queue is dead.

Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Cc: Mike Christie <michaelc@cs.wisc.edu>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-12-06 14:32:30 +01:00
Bart Van Assche c246e80d86 block: Avoid that request_fn is invoked on a dead queue
A block driver may start cleaning up resources needed by its
request_fn as soon as blk_cleanup_queue() finished, so request_fn
must not be invoked after draining finished. This is important
when blk_run_queue() is invoked without any requests in progress.
As an example, if blk_drain_queue() and scsi_run_queue() run in
parallel, blk_drain_queue() may have finished all requests after
scsi_run_queue() has taken a SCSI device off the starved list but
before that last function has had a chance to run the queue.

Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Cc: James Bottomley <JBottomley@Parallels.com>
Cc: Mike Christie <michaelc@cs.wisc.edu>
Cc: Chanho Min <chanho.min@lge.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-12-06 14:32:01 +01:00
Bart Van Assche 807592a4fa block: Let blk_drain_queue() caller obtain the queue lock
Let the caller of blk_drain_queue() obtain the queue lock to improve
readability of the patch called "Avoid that request_fn is invoked on
a dead queue".

Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: James Bottomley <JBottomley@Parallels.com>
Cc: Mike Christie <michaelc@cs.wisc.edu>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Chanho Min <chanho.min@lge.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-12-06 14:30:59 +01:00
Bart Van Assche 3f3299d5c0 block: Rename queue dead flag
QUEUE_FLAG_DEAD is used to indicate that queuing new requests must
stop. After this flag has been set queue draining starts. However,
during the queue draining phase it is still safe to invoke the
queue's request_fn, so QUEUE_FLAG_DYING is a better name for this
flag.

This patch has been generated by running the following command
over the kernel source tree:

git grep -lEw 'blk_queue_dead|QUEUE_FLAG_DEAD' |
    xargs sed -i.tmp -e 's/blk_queue_dead/blk_queue_dying/g'      \
        -e 's/QUEUE_FLAG_DEAD/QUEUE_FLAG_DYING/g';                \
sed -i.tmp -e "s/QUEUE_FLAG_DYING$(printf \\t)*5/QUEUE_FLAG_DYING$(printf \\t)5/g" \
    include/linux/blkdev.h;                                       \
sed -i.tmp -e 's/ DEAD/ DYING/g' -e 's/dead queue/a dying queue/' \
    -e 's/Dead queue/A dying queue/' block/blk-core.c

Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: James Bottomley <JBottomley@Parallels.com>
Cc: Mike Christie <michaelc@cs.wisc.edu>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Chanho Min <chanho.min@lge.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-12-06 14:30:58 +01:00
Roland Dreier 893d290f1d block: Don't access request after it might be freed
After we've done __elv_add_request() and __blk_run_queue() in
blk_execute_rq_nowait(), the request might finish and be freed
immediately.  Therefore checking if the type is REQ_TYPE_PM_RESUME
isn't safe afterwards, because if it isn't, rq might be gone.
Instead, check beforehand and stash the result in a temporary.

This fixes crashes in blk_execute_rq_nowait() I get occasionally when
running with lots of memory debugging options enabled -- I think this
race is usually harmless because the window for rq to be reallocated
is so small.

Signed-off-by: Roland Dreier <roland@purestorage.com>
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-11-23 14:32:55 +01:00
Stephen Warren d33b98fc82 block: partition: msdos: provide UUIDs for partitions
The MSDOS/MBR partition table includes a 32-bit unique ID, often referred
to as the NT disk signature.  When combined with a partition number within
the table, this can form a unique ID similar in concept to EFI/GPT's
partition UUID.  Constructing and recording this value in struct
partition_meta_info allows MSDOS partitions to be referred to on the
kernel command-line using the following syntax:

root=PARTUUID=0002dd75-01

Signed-off-by: Stephen Warren <swarren@nvidia.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Will Drewry <wad@chromium.org>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-11-23 14:28:58 +01:00
Stephen Warren 1ad7e89940 block: store partition_meta_info.uuid as a string
This will allow other types of UUID to be stored here, aside from true
UUIDs.  This also simplifies code that uses this field, since it's usually
constructed from a, used as a, or compared to other, strings.

Note: A simplistic approach here would be to set uuid_str[36]=0 whenever a
/PARTNROFF option was found to be present.  However, this modifies the
input string, and causes subsequent calls to devt_from_partuuid() not to
see the /PARTNROFF option, which causes different results.  In order to
avoid misleading future maintainers, this parameter is marked const.

Signed-off-by: Stephen Warren <swarren@nvidia.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Will Drewry <wad@chromium.org>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-11-23 14:28:53 +01:00
Tejun Heo 92fb97487a cgroup: rename ->create/post_create/pre_destroy/destroy() to ->css_alloc/online/offline/free()
Rename cgroup_subsys css lifetime related callbacks to better describe
what their roles are.  Also, update documentation.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2012-11-19 08:13:38 -08:00
Ezequiel Garcia c304a51bf4 block: use NUMA_NO_NODE instead of -1
Signed-off-by: Ezequiel Garcia <elezegarcia@gmail.com>

Modified by me to cover blk_init_queue() as well.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-11-10 10:41:13 +01:00
Shaohua Li bee0393cc1 block: recursive merge requests
In a workload, thread 1 accesses a, a+2, ..., thread 2 accesses a+1, a+3,....
When the requests are flushed to queue, a and a+1 are merged to (a, a+1), a+2
and a+3 too to (a+2, a+3), but (a, a+1) and (a+2, a+3) aren't merged.

If we do recursive merge for such interleave access, some workloads throughput
get improvement. A recent worload I'm checking on is swap, below change
boostes the throughput around 5% ~ 10%.

Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-11-09 08:44:27 +01:00
Tejun Heo 5b805f2a76 Merge branch 'cgroup/for-3.7-fixes' into cgroup/for-3.8
This is to receive device_cgroup fixes so that further device_cgroup
changes can be made in cgroup/for-3.8.

Signed-off-by: Tejun Heo <tj@kernel.org>
2012-11-06 12:26:23 -08:00
Shaohua Li 3d106fba2e block CFQ: avoid moving request to different queue
request is queued in cfqq->fifo list. Looks it's possible we are moving a
request from one cfqq to another in request merge case. In such case, adjusting
the fifo list order doesn't make sense and is impossible if we don't iterate
the whole fifo list.

My test does hit one case the two cfqq are different, but didn't cause kernel
crash, maybe it's because fifo list isn't used frequently. Anyway, from the
code logic, this is buggy.

I thought we can re-enable the recusive merge logic after this is fixed.

Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-11-06 12:39:51 +01:00
Tejun Heo 1db1e31b1e Merge branch 'cgroup-rmdir-updates' into cgroup/for-3.8
Pull rmdir updates into for-3.8 so that further callback updates can
be put on top.  This pull created a trivial conflict between the
following two commits.

  8c7f6edbda ("cgroup: mark subsystems with broken hierarchy support and whine if cgroups are nested for them")
  ed95779340 ("cgroup: kill cgroup_subsys->__DEPRECATED_clear_css_refs")

The former added a field to cgroup_subsys and the latter removed one
from it.  They happen to be colocated causing the conflict.  Keeping
what's added and removing what's removed resolves the conflict.

Signed-off-by: Tejun Heo <tj@kernel.org>
2012-11-05 09:21:51 -08:00
Tejun Heo bcf6de1b91 cgroup: make ->pre_destroy() return void
All ->pre_destory() implementations return 0 now, which is the only
allowed return value.  Make it return void.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
2012-11-05 09:16:59 -08:00
Jianpeng Ma 975927b942 block: Add blk_rq_pos(rq) to sort rq when plushing
My workload is a raid5 which had 16 disks. And used our filesystem to
write using direct-io mode.

I used the blktrace to find those message:
8,16   0     6647     2.453665504  2579  M   W 7493152 + 8 [md0_raid5]
8,16   0     6648     2.453672411  2579  Q   W 7493160 + 8 [md0_raid5]
8,16   0     6649     2.453672606  2579  M   W 7493160 + 8 [md0_raid5]
8,16   0     6650     2.453679255  2579  Q   W 7493168 + 8 [md0_raid5]
8,16   0     6651     2.453679441  2579  M   W 7493168 + 8 [md0_raid5]
8,16   0     6652     2.453685948  2579  Q   W 7493176 + 8 [md0_raid5]
8,16   0     6653     2.453686149  2579  M   W 7493176 + 8 [md0_raid5]
8,16   0     6654     2.453693074  2579  Q   W 7493184 + 8 [md0_raid5]
8,16   0     6655     2.453693254  2579  M   W 7493184 + 8 [md0_raid5]
8,16   0     6656     2.453704290  2579  Q   W 7493192 + 8 [md0_raid5]
8,16   0     6657     2.453704482  2579  M   W 7493192 + 8 [md0_raid5]
8,16   0     6658     2.453715016  2579  Q   W 7493200 + 8 [md0_raid5]
8,16   0     6659     2.453715247  2579  M   W 7493200 + 8 [md0_raid5]
8,16   0     6660     2.453721730  2579  Q   W 7493208 + 8 [md0_raid5]
8,16   0     6661     2.453721974  2579  M   W 7493208 + 8 [md0_raid5]
8,16   0     6662     2.453728202  2579  Q   W 7493216 + 8 [md0_raid5]
8,16   0     6663     2.453728436  2579  M   W 7493216 + 8 [md0_raid5]
8,16   0     6664     2.453734782  2579  Q   W 7493224 + 8 [md0_raid5]
8,16   0     6665     2.453735019  2579  M   W 7493224 + 8 [md0_raid5]
8,16   0     6666     2.453741401  2579  Q   W 7493232 + 8 [md0_raid5]
8,16   0     6667     2.453741632  2579  M   W 7493232 + 8 [md0_raid5]
8,16   0     6668     2.453748148  2579  Q   W 7493240 + 8 [md0_raid5]
8,16   0     6669     2.453748386  2579  M   W 7493240 + 8 [md0_raid5]
8,16   0     6670     2.453851843  2579  I   W 7493144 + 104 [md0_raid5]
8,16   0        0     2.453853661     0  m   N cfq2579 insert_request
8,16   0     6671     2.453854064  2579  I   W 7493120 + 24 [md0_raid5]
8,16   0        0     2.453854439     0  m   N cfq2579 insert_request
8,16   0     6672     2.453854793  2579  U   N [md0_raid5] 2
8,16   0        0     2.453855513     0  m   N cfq2579 Not idling.st->count:1
8,16   0        0     2.453855927     0  m   N cfq2579 dispatch_insert
8,16   0        0     2.453861771     0  m   N cfq2579 dispatched a request
8,16   0        0     2.453862248     0  m   N cfq2579 activate rq,drv=1
8,16   0     6673     2.453862332  2579  D   W 7493120 + 24 [md0_raid5]
8,16   0        0     2.453865957     0  m   N cfq2579 Not idling.st->count:1
8,16   0        0     2.453866269     0  m   N cfq2579 dispatch_insert
8,16   0        0     2.453866707     0  m   N cfq2579 dispatched a request
8,16   0        0     2.453867061     0  m   N cfq2579 activate rq,drv=2
8,16   0     6674     2.453867145  2579  D   W 7493144 + 104 [md0_raid5]
8,16   0     6675     2.454147608     0  C   W 7493120 + 24 [0]
8,16   0        0     2.454149357     0  m   N cfq2579 complete rqnoidle 0
8,16   0     6676     2.454791505     0  C   W 7493144 + 104 [0]
8,16   0        0     2.454794803     0  m   N cfq2579 complete rqnoidle 0
8,16   0        0     2.454795160     0  m   N cfq schedule dispatch

From above messages,we can find rq[W 7493144 + 104] and rq[W
7493120 + 24] do not merge.
Because the bio order is:
  8,16   0     6638     2.453619407  2579  Q   W 7493144 + 8 [md0_raid5]
  8,16   0     6639     2.453620460  2579  G   W 7493144 + 8 [md0_raid5]
  8,16   0     6640     2.453639311  2579  Q   W 7493120 + 8 [md0_raid5]
  8,16   0     6641     2.453639842  2579  G   W 7493120 + 8 [md0_raid5]
The bio(7493144) first and bio(7493120) later.So the subsequent
bios will be divided into two parts.
When flushing plug-list,because elv_attempt_insert_merge only support
backmerge,not supporting frontmerge.
So rq[7493120 + 24] can't merge with rq[7493144 + 104].

From my test,i found those situation can count 25% in our system.
Using this patch, there is no this situation.

Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
CC:Shaohua Li <shli@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-10-25 21:58:17 +02:00
Kees Cook 8e42e0a23d block: remove CONFIG_EXPERIMENTAL
This config item has not carried much meaning for a while now and is
almost always enabled by default. As agreed during the Linux kernel
summit, remove it.

CC: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-10-23 22:30:34 +02:00
Jun'ichi Nomura 65c77fd9e8 blkcg: stop iteration early if root_rl is the only request list
__blk_queue_next_rl() finds next request list based on blkg_list
while skipping root_blkg in the list.
OTOH, root_rl is special as it may exist even without root_blkg.

Though the later part of the function handles such a case correctly,
exiting early is good for readability of the code.

Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Cc: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-10-22 22:00:26 +02:00
Jun'ichi Nomura 65635cbc37 blkcg: Fix use-after-free of q->root_blkg and q->root_rl.blkg
blk_put_rl() does not call blkg_put() for q->root_rl because we
don't take request list reference on q->root_blkg.
However, if root_blkg is once attached then detached (freed),
blk_put_rl() is confused by the bogus pointer in q->root_blkg.

For example, with !CONFIG_BLK_DEV_THROTTLING &&
CONFIG_CFQ_GROUP_IOSCHED,
switching IO scheduler from cfq to deadline will cause system stall
after the following warning with 3.6:

> WARNING: at /work/build/linux/block/blk-cgroup.h:250
> blk_put_rl+0x4d/0x95()
> Modules linked in: bridge stp llc sunrpc acpi_cpufreq freq_table mperf
> ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4
> Pid: 0, comm: swapper/0 Not tainted 3.6.0 #1
> Call Trace:
>  <IRQ>  [<ffffffff810453bd>] warn_slowpath_common+0x85/0x9d
>  [<ffffffff810453ef>] warn_slowpath_null+0x1a/0x1c
>  [<ffffffff811d5f8d>] blk_put_rl+0x4d/0x95
>  [<ffffffff811d614a>] __blk_put_request+0xc3/0xcb
>  [<ffffffff811d71a3>] blk_finish_request+0x232/0x23f
>  [<ffffffff811d76c3>] ? blk_end_bidi_request+0x34/0x5d
>  [<ffffffff811d76d1>] blk_end_bidi_request+0x42/0x5d
>  [<ffffffff811d7728>] blk_end_request+0x10/0x12
>  [<ffffffff812cdf16>] scsi_io_completion+0x207/0x4d5
>  [<ffffffff812c6fcf>] scsi_finish_command+0xfa/0x103
>  [<ffffffff812ce2f8>] scsi_softirq_done+0xff/0x108
>  [<ffffffff811dcea5>] blk_done_softirq+0x8d/0xa1
>  [<ffffffff810915d5>] ?
>  generic_smp_call_function_single_interrupt+0x9f/0xd7
>  [<ffffffff8104cf5b>] __do_softirq+0x102/0x213
>  [<ffffffff8108a5ec>] ? lock_release_holdtime+0xb6/0xbb
>  [<ffffffff8104d2b4>] ? raise_softirq_irqoff+0x9/0x3d
>  [<ffffffff81424dfc>] call_softirq+0x1c/0x30
>  [<ffffffff81011beb>] do_softirq+0x4b/0xa3
>  [<ffffffff8104cdb0>] irq_exit+0x53/0xd5
>  [<ffffffff8102d865>] smp_call_function_single_interrupt+0x34/0x36
>  [<ffffffff8142486f>] call_function_single_interrupt+0x6f/0x80
>  <EOI>  [<ffffffff8101800b>] ? mwait_idle+0x94/0xcd
>  [<ffffffff81018002>] ? mwait_idle+0x8b/0xcd
>  [<ffffffff81017811>] cpu_idle+0xbb/0x114
>  [<ffffffff81401fbd>] rest_init+0xc1/0xc8
>  [<ffffffff81401efc>] ? csum_partial_copy_generic+0x16c/0x16c
>  [<ffffffff81cdbd3d>] start_kernel+0x3d4/0x3e1
>  [<ffffffff81cdb79e>] ? kernel_init+0x1f7/0x1f7
>  [<ffffffff81cdb2dd>] x86_64_start_reservations+0xb8/0xbd
>  [<ffffffff81cdb3e3>] x86_64_start_kernel+0x101/0x110

This patch clears q->root_blkg and q->root_rl.blkg when root blkg
is destroyed.

Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-10-22 22:00:26 +02:00
Linus Torvalds ce40be7a82 Merge branch 'for-3.7/core' of git://git.kernel.dk/linux-block
Pull block IO update from Jens Axboe:
 "Core block IO bits for 3.7.  Not a huge round this time, it contains:

   - First series from Kent cleaning up and generalizing bio allocation
     and freeing.

   - WRITE_SAME support from Martin.

   - Mikulas patches to prevent O_DIRECT crashes when someone changes
     the block size of a device.

   - Make bio_split() work on data-less bio's (like trim/discards).

   - A few other minor fixups."

Fixed up silent semantic mis-merge as per Mikulas Patocka and Andrew
Morton.  It is due to the VM no longer using a prio-tree (see commit
6b2dbba8b6ac: "mm: replace vma prio_tree with an interval tree").

So make set_blocksize() use mapping_mapped() instead of open-coding the
internal VM knowledge that has changed.

* 'for-3.7/core' of git://git.kernel.dk/linux-block: (26 commits)
  block: makes bio_split support bio without data
  scatterlist: refactor the sg_nents
  scatterlist: add sg_nents
  fs: fix include/percpu-rwsem.h export error
  percpu-rw-semaphore: fix documentation typos
  fs/block_dev.c:1644:5: sparse: symbol 'blkdev_mmap' was not declared
  blockdev: turn a rw semaphore into a percpu rw semaphore
  Fix a crash when block device is read and block size is changed at the same time
  block: fix request_queue->flags initialization
  block: lift the initial queue bypass mode on blk_register_queue() instead of blk_init_allocated_queue()
  block: ioctl to zero block ranges
  block: Make blkdev_issue_zeroout use WRITE SAME
  block: Implement support for WRITE SAME
  block: Consolidate command flag and queue limit checks for merges
  block: Clean up special command handling logic
  block/blk-tag.c: Remove useless kfree
  block: remove the duplicated setting for congestion_threshold
  block: reject invalid queue attribute values
  block: Add bio_clone_bioset(), bio_clone_kmalloc()
  block: Consolidate bio_alloc_bioset(), bio_kmalloc()
  ...
2012-10-11 09:04:23 +09:00
Linus Torvalds 68d47a137c Merge branch 'for-3.7-hierarchy' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup hierarchy update from Tejun Heo:
 "Currently, different cgroup subsystems handle nested cgroups
  completely differently.  There's no consistency among subsystems and
  the behaviors often are outright broken.

  People at least seem to agree that the broken hierarhcy behaviors need
  to be weeded out if any progress is gonna be made on this front and
  that the fallouts from deprecating the broken behaviors should be
  acceptable especially given that the current behaviors don't make much
  sense when nested.

  This patch makes cgroup emit warning messages if cgroups for
  subsystems with broken hierarchy behavior are nested to prepare for
  fixing them in the future.  This was put in a separate branch because
  more related changes were expected (didn't make it this round) and the
  memory cgroup wanted to pull in this and make changes on top."

* 'for-3.7-hierarchy' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup: mark subsystems with broken hierarchy support and whine if cgroups are nested for them
2012-10-02 10:52:28 -07:00
Linus Torvalds 033d9959ed Merge branch 'for-3.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue changes from Tejun Heo:
 "This is workqueue updates for v3.7-rc1.  A lot of activities this
  round including considerable API and behavior cleanups.

   * delayed_work combines a timer and a work item.  The handling of the
     timer part has always been a bit clunky leading to confusing
     cancelation API with weird corner-case behaviors.  delayed_work is
     updated to use new IRQ safe timer and cancelation now works as
     expected.

   * Another deficiency of delayed_work was lack of the counterpart of
     mod_timer() which led to cancel+queue combinations or open-coded
     timer+work usages.  mod_delayed_work[_on]() are added.

     These two delayed_work changes make delayed_work provide interface
     and behave like timer which is executed with process context.

   * A work item could be executed concurrently on multiple CPUs, which
     is rather unintuitive and made flush_work() behavior confusing and
     half-broken under certain circumstances.  This problem doesn't
     exist for non-reentrant workqueues.  While non-reentrancy check
     isn't free, the overhead is incurred only when a work item bounces
     across different CPUs and even in simulated pathological scenario
     the overhead isn't too high.

     All workqueues are made non-reentrant.  This removes the
     distinction between flush_[delayed_]work() and
     flush_[delayed_]_work_sync().  The former is now as strong as the
     latter and the specified work item is guaranteed to have finished
     execution of any previous queueing on return.

   * In addition to the various bug fixes, Lai redid and simplified CPU
     hotplug handling significantly.

   * Joonsoo introduced system_highpri_wq and used it during CPU
     hotplug.

  There are two merge commits - one to pull in IRQ safe timer from
  tip/timers/core and the other to pull in CPU hotplug fixes from
  wq/for-3.6-fixes as Lai's hotplug restructuring depended on them."

Fixed a number of trivial conflicts, but the more interesting conflicts
were silent ones where the deprecated interfaces had been used by new
code in the merge window, and thus didn't cause any real data conflicts.

Tejun pointed out a few of them, I fixed a couple more.

* 'for-3.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: (46 commits)
  workqueue: remove spurious WARN_ON_ONCE(in_irq()) from try_to_grab_pending()
  workqueue: use cwq_set_max_active() helper for workqueue_set_max_active()
  workqueue: introduce cwq_set_max_active() helper for thaw_workqueues()
  workqueue: remove @delayed from cwq_dec_nr_in_flight()
  workqueue: fix possible stall on try_to_grab_pending() of a delayed work item
  workqueue: use hotcpu_notifier() for workqueue_cpu_down_callback()
  workqueue: use __cpuinit instead of __devinit for cpu callbacks
  workqueue: rename manager_mutex to assoc_mutex
  workqueue: WORKER_REBIND is no longer necessary for idle rebinding
  workqueue: WORKER_REBIND is no longer necessary for busy rebinding
  workqueue: reimplement idle worker rebinding
  workqueue: deprecate __cancel_delayed_work()
  workqueue: reimplement cancel_delayed_work() using try_to_grab_pending()
  workqueue: use mod_delayed_work() instead of __cancel + queue
  workqueue: use irqsafe timer for delayed_work
  workqueue: clean up delayed_work initializers and add missing one
  workqueue: make deferrable delayed_work initializer names consistent
  workqueue: cosmetic whitespace updates for macro definitions
  workqueue: deprecate system_nrt[_freezable]_wq
  workqueue: deprecate flush[_delayed]_work_sync()
  ...
2012-10-02 09:54:49 -07:00
Stefan Weinhuber 46e8894786 s390/partitions: make partition detection independent from DASD ioctls
In some usage scenarios it is desireable to work with disk images or
virtualized DASD devices. One problem that prevents such applications
is the partition detection in ibm.c. Currently it works only for
devices that support the BIODASDINFO2 ioctl, in other words, it only
works for devices that belong to the DASD device driver.

The information gained from the BIODASDINFO2 ioctl is only for a small
set of legacy cases abolutely necessary. All current VOL1, LNX1 and
CMS1 type of disk labels can be interpreted correctly without this
information, as long as the generic HDIO_GETGEO ioctl works and
provides a correct disk geometry.

This patch makes the ibm.c partition detection as independent as
possible from the BIODASDINFO2 ioctl. Only the following two cases are
still restricted to real DASDs:
- An FBA DASD, or LDL formatted ECKD DASD without any disk label.
- An old style LNX1 label (without large volume support) on a disk
  with inconsistent device geometry.

Signed-off-by: Stefan Weinhuber <wein@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2012-09-26 15:45:05 +02:00
Tejun Heo 60ea8226cb block: fix request_queue->flags initialization
A queue newly allocated with blk_alloc_queue_node() has only
QUEUE_FLAG_BYPASS set.  For request-based drivers,
blk_init_allocated_queue() is called and q->queue_flags is overwritten
with QUEUE_FLAG_DEFAULT which doesn't include BYPASS even though the
initial bypass is still in effect.

In blk_init_allocated_queue(), or QUEUE_FLAG_DEFAULT to q->queue_flags
instead of overwriting.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: stable@vger.kernel.org
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-09-21 15:33:12 +02:00
Tejun Heo 749fefe677 block: lift the initial queue bypass mode on blk_register_queue() instead of blk_init_allocated_queue()
b82d4b197c ("blkcg: make request_queue bypassing on allocation") made
request_queues bypassed on allocation to avoid switching on and off
bypass mode on a queue being initialized.  Some drivers allocate and
then destroy a lot of queues without fully initializing them and
incurring bypass latency overhead on each of them could add upto
significant overhead.

Unfortunately, blk_init_allocated_queue() is never used by queues of
bio-based drivers, which means that all bio-based driver queues are in
bypass mode even after initialization and registration complete
successfully.

Due to the limited way request_queues are used by bio drivers, this
problem is hidden pretty well but it shows up when blk-throttle is
used in combination with a bio-based driver.  Trying to configure
(echoing to cgroupfs file) blk-throttle for a bio-based driver hangs
indefinitely in blkg_conf_prep() waiting for bypass mode to end.

This patch moves the initial blk_queue_bypass_end() call from
blk_init_allocated_queue() to blk_register_queue() which is called for
any userland-visible queues regardless of its type.

I believe this is correct because I don't think there is any block
driver which needs or wants working elevator and blk-cgroup on a queue
which isn't visible to userland.  If there are such users, we need a
different solution.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Joseph Glanville <joseph.glanville@orionvm.com.au>
Cc: stable@vger.kernel.org
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-09-21 15:32:57 +02:00
Martin K. Petersen 66ba32dc16 block: ioctl to zero block ranges
Introduce a BLKZEROOUT ioctl which can be used to clear block ranges by
way of blkdev_issue_zeroout().

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-09-20 14:31:53 +02:00
Martin K. Petersen 579e8f3c7b block: Make blkdev_issue_zeroout use WRITE SAME
If the device supports WRITE SAME, use that to optimize zeroing of
blocks. If the device does not support WRITE SAME or if the operation
fails, fall back to writing zeroes the old-fashioned way.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-09-20 14:31:49 +02:00
Martin K. Petersen 4363ac7c13 block: Implement support for WRITE SAME
The WRITE SAME command supported on some SCSI devices allows the same
block to be efficiently replicated throughout a block range. Only a
single logical block is transferred from the host and the storage device
writes the same data to all blocks described by the I/O.

This patch implements support for WRITE SAME in the block layer. The
blkdev_issue_write_same() function can be used by filesystems and block
drivers to replicate a buffer across a block range. This can be used to
efficiently initialize software RAID devices, etc.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-09-20 14:31:45 +02:00
Martin K. Petersen f31dc1cd49 block: Consolidate command flag and queue limit checks for merges
- blk_check_merge_flags() verifies that cmd_flags / bi_rw are
   compatible. This function is called for both req-req and req-bio
   merging.

 - blk_rq_get_max_sectors() and blk_queue_get_max_sectors() can be used
   to query the maximum sector count for a given request or queue. The
   calls will return the right value from the queue limits given the
   type of command (RW, discard, write same, etc.)

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-09-20 14:31:41 +02:00
Martin K. Petersen e2a60da74f block: Clean up special command handling logic
Remove special-casing of non-rw fs style requests (discard). The nomerge
flags are consolidated in blk_types.h, and rq_mergeable() and
bio_mergeable() have been modified to use them.

bio_is_rw() is used in place of bio_has_data() a few places. This is
done to to distinguish true reads and writes from other fs type requests
that carry a payload (e.g. write same).

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-09-20 14:31:38 +02:00
Alan Cox 2bd6efad25 blk: add an upper sanity check on partition adding
65536 should be ludicrous anyway but without it we overflow the
memory computation doing the allocation and badness occurs.

Signed-off-by: Alan Cox <alan@linux.intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-09-18 11:56:29 +02:00
Tejun Heo 8c7f6edbda cgroup: mark subsystems with broken hierarchy support and whine if cgroups are nested for them
Currently, cgroup hierarchy support is a mess.  cpu related subsystems
behave correctly - configuration, accounting and control on a parent
properly cover its children.  blkio and freezer completely ignore
hierarchy and treat all cgroups as if they're directly under the root
cgroup.  Others show yet different behaviors.

These differing interpretations of cgroup hierarchy make using cgroup
confusing and it impossible to co-mount controllers into the same
hierarchy and obtain sane behavior.

Eventually, we want full hierarchy support from all subsystems and
probably a unified hierarchy.  Users using separate hierarchies
expecting completely different behaviors depending on the mounted
subsystem is deterimental to making any progress on this front.

This patch adds cgroup_subsys.broken_hierarchy and sets it to %true
for controllers which are lacking in hierarchy support.  The goal of
this patch is two-fold.

* Move users away from using hierarchy on currently non-hierarchical
  subsystems, so that implementing proper hierarchy support on those
  doesn't surprise them.

* Keep track of which controllers are broken how and nudge the
  subsystems to implement proper hierarchy support.

For now, start with a single warning message.  We can whine louder
later on.

v2: Fixed a typo spotted by Michal. Warning message updated.

v3: Updated memcg part so that it doesn't generate warning in the
    cases where .use_hierarchy=false doesn't make the behavior
    different from root.use_hierarchy=true.  Fixed a typo spotted by
    Glauber.

v4: Check ->broken_hierarchy after cgroup creation is complete so that
    ->create() can affect the result per Michal.  Dropped unnecessary
    memcg root handling per Michal.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Li Zefan <lizefan@huawei.com>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
Cc: Glauber Costa <glommer@parallels.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul Turner <pjt@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Thomas Graf <tgraf@suug.ch>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
2012-09-14 12:01:16 -07:00
Peter Senna Tschudin d41570b746 block/blk-tag.c: Remove useless kfree
Remove useless kfree() and clean up code related to the removal.

The semantic patch that finds this problem is as follows:
(http://coccinelle.lip6.fr/)

// <smpl>
@r exists@
position p1,p2;
expression x;
@@

if (x@p1 == NULL) { ... kfree@p2(x); ... return ...; }

@unchanged exists@
position r.p1,r.p2;
expression e <= r.x,x,e1;
iterator I;
statement S;
@@

if (x@p1 == NULL) { ... when != I(x,...) S
                        when != e = e1
                        when != e += e1
                        when != e -= e1
                        when != ++e
                        when != --e
                        when != e++
                        when != e--
                        when != &e
   kfree@p2(x); ... return ...; }

@ok depends on unchanged exists@
position any r.p1;
position r.p2;
expression x;
@@

... when != true x@p1 == NULL
kfree@p2(x);

@depends on !ok && unchanged@
position r.p2;
expression x;
@@

*kfree@p2(x);
// </smpl>

Signed-off-by: Peter Senna Tschudin <peter.senna@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-09-12 22:25:12 +02:00
Jaehoon Chung e32463b2f7 block: remove the duplicated setting for congestion_threshold
Before call the blk_queue_congestion_threshold(),
the blk_queue_congestion_threshold() is already called at blk_queue_make_rquest().
Because this code is the duplicated, it has removed.

Signed-off-by: Jaehoon Chung <jh80.chung@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-09-09 12:44:10 +02:00
Dave Reisner b1f3b64d76 block: reject invalid queue attribute values
Instead of using simple_strtoul which "converts" invalid numbers to 0,
use strict_strtoul and perform error checking to ensure that userspace
passes us a valid unsigned long. This addresses problems with functions
such as writev, which might want to write a trailing newline -- the
newline should rightfully be rejected, but the value preceeding it
should be preserved.

Fixes BZ#46981.

Signed-off-by: Dave Reisner <dreisner@archlinux.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-09-09 10:39:18 +02:00
Kent Overstreet bf800ef181 block: Add bio_clone_bioset(), bio_clone_kmalloc()
Previously, there was bio_clone() but it only allocated from the fs bio
set; as a result various users were open coding it and using
__bio_clone().

This changes bio_clone() to become bio_clone_bioset(), and then we add
bio_clone() and bio_clone_kmalloc() as wrappers around it, making use of
the functionality the last patch adedd.

This will also help in a later patch changing how bio cloning works.

Signed-off-by: Kent Overstreet <koverstreet@google.com>
CC: Jens Axboe <axboe@kernel.dk>
CC: NeilBrown <neilb@suse.de>
CC: Alasdair Kergon <agk@redhat.com>
CC: Boaz Harrosh <bharrosh@panasas.com>
CC: Jeff Garzik <jeff@garzik.org>
Acked-by: Jeff Garzik <jgarzik@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-09-09 10:35:39 +02:00
Kent Overstreet 4254bba17d block: Kill bi_destructor
Now that we've got generic code for freeing bios allocated from bio
pools, this isn't needed anymore.

This patch also makes bio_free() static, since without bi_destructor
there should be no need for it to be called anywhere else.

bio_free() is now only called from bio_put, so we can refactor those a
bit - move some code from bio_put() to bio_free() and kill the redundant
bio->bi_next = NULL.

v5: Switch to BIO_KMALLOC_POOL ((void *)~0), per Boaz
v6: BIO_KMALLOC_POOL now NULL, drop bio_free's EXPORT_SYMBOL
v7: No #define BIO_KMALLOC_POOL anymore

Signed-off-by: Kent Overstreet <koverstreet@google.com>
CC: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-09-09 10:35:39 +02:00
Kent Overstreet 1e2a410ff7 block: Ues bi_pool for bio_integrity_alloc()
Now that bios keep track of where they were allocated from,
bio_integrity_alloc_bioset() becomes redundant.

Remove bio_integrity_alloc_bioset() and drop bio_set argument from the
related functions and make them use bio->bi_pool.

Signed-off-by: Kent Overstreet <koverstreet@google.com>
CC: Jens Axboe <axboe@kernel.dk>
CC: Martin K. Petersen <martin.petersen@oracle.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-09-09 10:35:38 +02:00
Yi Zou 37d7b34f05 block: rate-limit the error message from failing commands
When performing a cable pull test w/ active stress I/O using fio over
a dual port Intel 82599 FCoE CNA, w/ 256LUNs on one port and about 32LUNs
on the other, it is observed that the system becomes not usable due to
scsi-ml being busy printing the error messages for all the failing commands.
I don't believe this problem is specific to FCoE and these commands are
anyway failing due to link being down (DID_NO_CONNECT), just rate-limit
the messages here to solve this issue.

v2->v1: use __ratelimit() as Tomas Henzl mentioned as the proper way for
rate-limit per function. However, in this case, the failed i/o gets to
blk_end_request_err() and then blk_update_request(), which also has to
be rate-limited, as added in the v2 of this patch.

v3-v2: resolved conflict to apply on current 3.6-rc3 upstream tip.

Signed-off-by: Yi Zou <yi.zou@intel.com>
Cc: www.Open-FCoE.org <devel@open-fcoe.org>
Cc: Tomas Henzl <thenzl@redhat.com>
Cc: <linux-scsi@vger.kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-08-30 16:26:25 -07:00
Tejun Heo 136b5721d7 workqueue: deprecate __cancel_delayed_work()
Now that cancel_delayed_work() can be safely called from IRQ handlers,
there's no reason to use __cancel_delayed_work().  Use
cancel_delayed_work() instead of __cancel_delayed_work() and mark the
latter deprecated.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Jens Axboe <axboe@kernel.dk>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Roland Dreier <roland@kernel.org>
Cc: Tomi Valkeinen <tomi.valkeinen@ti.com>
2012-08-21 13:18:24 -07:00
Tejun Heo e7c2f96744 workqueue: use mod_delayed_work() instead of __cancel + queue
Now that mod_delayed_work() is safe to call from IRQ handlers,
__cancel_delayed_work() followed by queue_delayed_work() can be
replaced with mod_delayed_work().

Most conversions are straight-forward except for the following.

* net/core/link_watch.c: linkwatch_schedule_work() was doing a quite
  elaborate dancing around its delayed_work.  Collapse it such that
  linkwatch_work is queued for immediate execution if LW_URGENT and
  existing timer is kept otherwise.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Tomi Valkeinen <tomi.valkeinen@ti.com>
2012-08-21 13:18:24 -07:00
Tejun Heo 3b07e9ca26 workqueue: deprecate system_nrt[_freezable]_wq
system_nrt[_freezable]_wq are now spurious.  Mark them deprecated and
convert all users to system[_freezable]_wq.

If you're cc'd and wondering what's going on: Now all workqueues are
non-reentrant, so there's no reason to use system_nrt[_freezable]_wq.
Please use system[_freezable]_wq instead.

This patch doesn't make any functional difference.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-By: Lai Jiangshan <laijs@cn.fujitsu.com>

Cc: Jens Axboe <axboe@kernel.dk>
Cc: David Airlie <airlied@linux.ie>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: David Howells <dhowells@redhat.com>
2012-08-20 14:51:24 -07:00
Tejun Heo 41f63c5359 workqueue: use mod_delayed_work() instead of cancel + queue
Convert delayed_work users doing cancel_delayed_work() followed by
queue_delayed_work() to mod_delayed_work().

Most conversions are straight-forward.  Ones worth mentioning are,

* drivers/edac/edac_mc.c: edac_mc_workq_setup() converted to always
  use mod_delayed_work() and cancel loop in
  edac_mc_reset_delay_period() is dropped.

* drivers/platform/x86/thinkpad_acpi.c: No need to remember whether
  watchdog is active or not.  @fan_watchdog_active and related code
  dropped.

* drivers/power/charger-manager.c: Seemingly a lot of
  delayed_work_pending() abuse going on here.
  [delayed_]work_pending() are unsynchronized and racy when used like
  this.  I converted one instance in fullbatt_handler().  Please
  conver the rest so that it invokes workqueue APIs for the intended
  target state rather than trying to game work item pending state
  transitions.  e.g. if timer should be modified - call
  mod_delayed_work(), canceled - call cancel_delayed_work[_sync]().

* drivers/thermal/thermal_sys.c: thermal_zone_device_set_polling()
  simplified.  Note that round_jiffies() calls in this function are
  meaningless.  round_jiffies() work on absolute jiffies not delta
  delay used by delayed_work.

v2: Tomi pointed out that __cancel_delayed_work() users can't be
    safely converted to mod_delayed_work().  They could be calling it
    from irq context and if that happens while delayed_work_timer_fn()
    is running, it could deadlock.  __cancel_delayed_work() users are
    dropped.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Henrique de Moraes Holschuh <hmh@hmh.eng.br>
Acked-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Acked-by: Anton Vorontsov <cbouatmailru@gmail.com>
Acked-by: David Howells <dhowells@redhat.com>
Cc: Tomi Valkeinen <tomi.valkeinen@ti.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Doug Thompson <dougthompson@xmission.com>
Cc: David Airlie <airlied@linux.ie>
Cc: Roland Dreier <roland@kernel.org>
Cc: "John W. Linville" <linville@tuxdriver.com>
Cc: Zhang Rui <rui.zhang@intel.com>
Cc: Len Brown <len.brown@intel.com>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Johannes Berg <johannes@sipsolutions.net>
2012-08-13 16:27:37 -07:00
Jianpeng Ma 0676806707 block: Don't use static to define "void *p" in show_partition_start()
I met a odd prblem:read /proc/partitions may return zero.

I wrote a file test.c:
int main()
{
	char buff[4096];
	int ret;
	int fd;
	printf("pid=%d\n",getpid());
	while (1) {
		fd = open("/proc/partitions", O_RDONLY);
		if (fd < 0) {
			printf("open error %s\n", strerror(errno));
			return 0;
		}
		ret = read(fd, buff, 4096);
		if (ret <= 0)
			printf("ret=%d, %s, %ld\n", ret,
				strerror(errno), lseek(fd,0,SEEK_CUR));
		close(fd);
	}
	exit(0);
}

You can reproduce by:
1:while true;do cat /proc/partitions > /dev/null ;done
2:./test

I reviewed the code and found:

>> static void *show_partition_start(struct seq_file *seqf, loff_t *pos)
>> {
>> 	static void *p;
>>
>> 	p = disk_seqf_start(seqf, pos);
>> 	if (!IS_ERR_OR_NULL(p) && !*pos)
>> 		seq_puts(seqf, "major minor  #blocks  name\n\n");
>> 	return p;
>> }
		test								cat /proc/partitions
	p = disk_seqf_start()(Not NULL)
									p = disk_seqf_start()(NULL because pos)
	if (!IS_ERR_OR_NULL(p) && !*pos)

Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-08-03 10:42:00 +02:00
Asias He 85b9f66a41 block: Add blk_bio_map_sg() helper
Add a helper to map a bio to a scatterlist, modelled after
blk_rq_map_sg.

This helper is useful for any driver that wants to create
a scatterlist from its ->make_request_fn method.

Changes in v2:
 - Use __blk_segment_map_sg to avoid duplicated code
 - Add cocbook style function comment

Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Tejun Heo <tj@kernel.org>
Cc: Shaohua Li <shli@kernel.org>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: kvm@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: virtualization@lists.linux-foundation.org
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Asias He <asias@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-08-02 23:42:04 +02:00
Asias He 963ab9e5da block: Introduce __blk_segment_map_sg() helper
Split the mapping code in blk_rq_map_sg() to a helper
__blk_segment_map_sg(), so that other mapping function, e.g.
blk_bio_map_sg(), can share the code.

Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Tejun Heo <tj@kernel.org>
Cc: Shaohua Li <shli@kernel.org>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: kvm@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: virtualization@lists.linux-foundation.org
Suggested-by: Jens Axboe <axboe@kernel.dk>
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Asias He <asias@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-08-02 23:42:03 +02:00
Paolo Bonzini c6e666345e block: split discard into aligned requests
When a disk has large discard_granularity and small max_discard_sectors,
discards are not split with optimal alignment.  In the limit case of
discard_granularity == max_discard_sectors, no request could be aligned
correctly, so in fact you might end up with no discarded logical blocks
at all.

Another example that helps showing the condition in the patch is with
discard_granularity == 64, max_discard_sectors == 128.  A request that is
submitted for 256 sectors 2..257 will be split in two: 2..129, 130..257.
However, only 2 aligned blocks out of 3 are included in the request;
128..191 may be left intact and not discarded.  With this patch, the
first request will be truncated to ensure good alignment of what's left,
and the split will be 2..127, 128..255, 256..257.  The patch will also
take into account the discard_alignment.

At most one extra request will be introduced, because the first request
will be reduced by at most granularity-1 sectors, and granularity
must be less than max_discard_sectors.  Subsequent requests will run
on round_down(max_discard_sectors, granularity) sectors, as in the
current code.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Tested-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-08-02 09:48:50 +02:00
Paolo Bonzini f6ff53d361 block: reorganize rounding of max_discard_sectors
Mostly a preparation for the next patch.

In principle this fixes an infinite loop if max_discard_sectors < granularity,
but that really shouldn't happen.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Tested-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-08-02 09:48:49 +02:00
Linus Torvalds eff0d13f38 Merge branch 'for-3.6/drivers' of git://git.kernel.dk/linux-block
Pull block driver changes from Jens Axboe:

 - Making the plugging support for drivers a bit more sane from Neil.
   This supersedes the plugging change from Shaohua as well.

 - The usual round of drbd updates.

 - Using a tail add instead of a head add in the request completion for
   ndb, making us find the most completed request more quickly.

 - A few floppy changes, getting rid of a duplicated flag and also
   running the floppy init async (since it takes forever in boot terms)
   from Andi.

* 'for-3.6/drivers' of git://git.kernel.dk/linux-block:
  floppy: remove duplicated flag FD_RAW_NEED_DISK
  blk: pass from_schedule to non-request unplug functions.
  block: stack unplug
  blk: centralize non-request unplug handling.
  md: remove plug_cnt feature of plugging.
  block/nbd: micro-optimization in nbd request completion
  drbd: announce FLUSH/FUA capability to upper layers
  drbd: fix max_bio_size to be unsigned
  drbd: flush drbd work queue before invalidate/invalidate remote
  drbd: fix potential access after free
  drbd: call local-io-error handler early
  drbd: do not reset rs_pending_cnt too early
  drbd: reset congestion information before reporting it in /proc/drbd
  drbd: report congestion if we are waiting for some userland callback
  drbd: differentiate between normal and forced detach
  drbd: cleanup, remove two unused global flags
  floppy: Run floppy initialization asynchronous
2012-08-01 09:06:47 -07:00
Linus Torvalds 8cf1a3fce0 Merge branch 'for-3.6/core' of git://git.kernel.dk/linux-block
Pull core block IO bits from Jens Axboe:
 "The most complicated part if this is the request allocation rework by
  Tejun, which has been queued up for a long time and has been in
  for-next ditto as well.

  There are a few commits from yesterday and today, mostly trivial and
  obvious fixes.  So I'm pretty confident that it is sound.  It's also
  smaller than usual."

* 'for-3.6/core' of git://git.kernel.dk/linux-block:
  block: remove dead func declaration
  block: add partition resize function to blkpg ioctl
  block: uninitialized ioc->nr_tasks triggers WARN_ON
  block: do not artificially constrain max_sectors for stacking drivers
  blkcg: implement per-blkg request allocation
  block: prepare for multiple request_lists
  block: add q->nr_rqs[] and move q->rq.elvpriv to q->nr_rqs_elvpriv
  blkcg: inline bio_blkcg() and friends
  block: allocate io_context upfront
  block: refactor get_request[_wait]()
  block: drop custom queue draining used by scsi_transport_{iscsi|fc}
  mempool: add @gfp_mask to mempool_create_node()
  blkcg: make root blkcg allocation use %GFP_KERNEL
  blkcg: __blkg_lookup_create() doesn't need radix preload
2012-08-01 09:02:41 -07:00
Yuanhan Liu 80799fbb7d block: remove dead func declaration
__generic_unplug_device() function is removed with commit
7eaceaccab, which forgot to
remove the declaration at meantime. Here remove it.

Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-08-01 12:25:54 +02:00
Vivek Goyal c83f6bf98d block: add partition resize function to blkpg ioctl
Add a new operation code (BLKPG_RESIZE_PARTITION) to the BLKPG ioctl that
allows altering the size of an existing partition, even if it is currently
in use.

This patch converts hd_struct->nr_sects into sequence counter because
One might extend a partition while IO is happening to it and update of
nr_sects can be non-atomic on 32bit machines with 64bit sector_t. This
can lead to issues like reading inconsistent size of a partition. Sequence
counter have been used so that readers don't have to take bdev mutex lock
as we call sector_in_part() very frequently.

Now all the access to hd_struct->nr_sects should happen using sequence
counter read/update helper functions part_nr_sects_read/part_nr_sects_write.
There is one exception though, set_capacity()/get_capacity(). I think
theoritically race should exist there too but this patch does not
modify set_capacity()/get_capacity() due to sheer number of call sites
and I am afraid that change might break something. I have left that as a
TODO item. We can handle it later if need be. This patch does not introduce
any new races as such w.r.t set_capacity()/get_capacity().

v2: Add CONFIG_LBDAF test to UP preempt case as suggested by Phillip.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Phillip Susi <psusi@ubuntu.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-08-01 12:24:18 +02:00
Olof Johansson 4638a83e86 block: uninitialized ioc->nr_tasks triggers WARN_ON
Hi,

I'm using the old-fashioned 'dump' backup tool, and I noticed that it spews the
below warning as of 3.5-rc1 and later (3.4 is fine):

[   10.886893] ------------[ cut here ]------------
[   10.886904] WARNING: at include/linux/iocontext.h:140 copy_process+0x1488/0x1560()
[   10.886905] Hardware name: Bochs
[   10.886906] Modules linked in:
[   10.886908] Pid: 2430, comm: dump Not tainted 3.5.0-rc7+ #27
[   10.886908] Call Trace:
[   10.886911]  [<ffffffff8107ce8a>] warn_slowpath_common+0x7a/0xb0
[   10.886912]  [<ffffffff8107ced5>] warn_slowpath_null+0x15/0x20
[   10.886913]  [<ffffffff8107c088>] copy_process+0x1488/0x1560
[   10.886914]  [<ffffffff8107c244>] do_fork+0xb4/0x340
[   10.886918]  [<ffffffff8108effa>] ? recalc_sigpending+0x1a/0x50
[   10.886919]  [<ffffffff8108f6b2>] ? __set_task_blocked+0x32/0x80
[   10.886920]  [<ffffffff81091afa>] ? __set_current_blocked+0x3a/0x60
[   10.886923]  [<ffffffff81051db3>] sys_clone+0x23/0x30
[   10.886925]  [<ffffffff8179bd73>] stub_clone+0x13/0x20
[   10.886927]  [<ffffffff8179baa2>] ? system_call_fastpath+0x16/0x1b
[   10.886928] ---[ end trace 32a14af7ee6a590b ]---

Reproducing is easy, I can hit it on a KVM system with a very basic
config (x86_64 make defconfig + enable the drivers needed). To hit it,
just install dump (on debian/ubuntu, not sure what the package might be
called on Fedora), and:

dump -o -f /tmp/foo /

You'll see the warning in dmesg once it forks off the I/O process and
starts dumping filesystem contents.

I bisected it down to the following commit:

commit f6e8d01bee
Author: Tejun Heo <tj@kernel.org>
Date:   Mon Mar 5 13:15:26 2012 -0800

    block: add io_context->active_ref

    Currently ioc->nr_tasks is used to decide two things - whether an ioc
    is done issuing IOs and whether it's shared by multiple tasks.  This
    patch separate out the first into ioc->active_ref, which is acquired
    and released using {get|put}_io_context_active() respectively.

    This will be used to associate bio's with a given task.  This patch
    doesn't introduce any visible behavior change.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Cc: Vivek Goyal <vgoyal@redhat.com>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

It seems like the init of ioc->nr_tasks was removed in that patch,
so it starts out at 0 instead of 1.

Tejun, is the right thing here to add back the init, or should something else
be done?

The below patch removes the warning, but I haven't done any more extensive
testing on it.

Signed-off-by: Olof Johansson <olof@lixom.net>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-08-01 12:17:27 +02:00
Mike Snitzer fe86cdcef7 block: do not artificially constrain max_sectors for stacking drivers
blk_set_stacking_limits is intended to allow stacking drivers to build
up the limits of the stacked device based on the underlying devices'
limits.  But defaulting 'max_sectors' to BLK_DEF_MAX_SECTORS (1024)
doesn't allow the stacking driver to inherit a max_sectors larger than
1024 -- due to blk_stack_limits' use of min_not_zero.

It is now clear that this artificial limit is getting in the way so
change blk_set_stacking_limits's max_sectors to UINT_MAX (which allows
stacking drivers like dm-multipath to inherit 'max_sectors' from the
underlying paths).

Reported-by: Vijay Chauhan <vijay.chauhan@netapp.com>
Tested-by: Vijay Chauhan <vijay.chauhan@netapp.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-08-01 10:44:28 +02:00
NeilBrown 74018dc306 blk: pass from_schedule to non-request unplug functions.
This will allow md/raid to know why the unplug was called,
and will be able to act according - if !from_schedule it
is safe to perform tasks which could themselves schedule.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-07-31 09:08:15 +02:00
Shaohua Li 2a7d5559b3 block: stack unplug
MD raid1 prepares to dispatch request in unplug callback. If make_request in
low level queue also uses unplug callback to dispatch request, the low level
queue's unplug callback will not be called. Recheck the callback list helps
this case.

Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-07-31 09:08:15 +02:00
NeilBrown 9cbb175088 blk: centralize non-request unplug handling.
Both md and umem has similar code for getting notified on an
blk_finish_plug event.
Centralize this code in block/ and allow each driver to
provide its distinctive difference.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-07-31 09:08:14 +02:00
Muthukumar Ratty e81ca6fe85 [SCSI] block: Fix blk_execute_rq_nowait() dead queue handling
If the queue is dead blk_execute_rq_nowait() doesn't invoke the done()
callback function. That will result in blk_execute_rq() being stuck
in wait_for_completion(). Avoid this by initializing rq->end_io to the
done() callback before we check the queue state. Also, make sure the
queue lock is held around the invocation of the done() callback. Found
this through source code review.

Signed-off-by: Muthukumar Ratty <muthur@gmail.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Tejun Heo <tj@kernel.org>
Acked-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>
2012-07-20 08:58:39 +01:00
Tejun Heo a051661ca6 blkcg: implement per-blkg request allocation
Currently, request_queue has one request_list to allocate requests
from regardless of blkcg of the IO being issued.  When the unified
request pool is used up, cfq proportional IO limits become meaningless
- whoever grabs the next request being freed wins the race regardless
of the configured weights.

This can be easily demonstrated by creating a blkio cgroup w/ very low
weight, put a program which can issue a lot of random direct IOs there
and running a sequential IO from a different cgroup.  As soon as the
request pool is used up, the sequential IO bandwidth crashes.

This patch implements per-blkg request_list.  Each blkg has its own
request_list and any IO allocates its request from the matching blkg
making blkcgs completely isolated in terms of request allocation.

* Root blkcg uses the request_list embedded in each request_queue,
  which was renamed to @q->root_rl from @q->rq.  While making blkcg rl
  handling a bit harier, this enables avoiding most overhead for root
  blkcg.

* Queue fullness is properly per request_list but bdi isn't blkcg
  aware yet, so congestion state currently just follows the root
  blkcg.  As writeback isn't aware of blkcg yet, this works okay for
  async congestion but readahead may get the wrong signals.  It's
  better than blkcg completely collapsing with shared request_list but
  needs to be improved with future changes.

* After this change, each block cgroup gets a full request pool making
  resource consumption of each cgroup higher.  This makes allowing
  non-root users to create cgroups less desirable; however, note that
  allowing non-root users to directly manage cgroups is already
  severely broken regardless of this patch - each block cgroup
  consumes kernel memory and skews IO weight (IO weights are not
  hierarchical).

v2: queue-sysfs.txt updated and patch description udpated as suggested
    by Vivek.

v3: blk_get_rl() wasn't checking error return from
    blkg_lookup_create() and may cause oops on lookup failure.  Fix it
    by falling back to root_rl on blkg lookup failures.  This problem
    was spotted by Rakesh Iyer <rni@google.com>.

v4: Updated to accomodate 458f27a982 "block: Avoid missed wakeup in
    request waitqueue".  blk_drain_queue() now wakes up waiters on all
    blkg->rl on the target queue.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-06-26 18:42:49 -04:00
Tejun Heo 5b788ce3e2 block: prepare for multiple request_lists
Request allocation is about to be made per-blkg meaning that there'll
be multiple request lists.

* Make queue full state per request_list.  blk_*queue_full() functions
  are renamed to blk_*rl_full() and takes @rl instead of @q.

* Rename blk_init_free_list() to blk_init_rl() and make it take @rl
  instead of @q.  Also add @gfp_mask parameter.

* Add blk_exit_rl() instead of destroying rl directly from
  blk_release_queue().

* Add request_list->q and make request alloc/free functions -
  blk_free_request(), [__]freed_request(), __get_request() - take @rl
  instead of @q.

This patch doesn't introduce any functional difference.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-06-25 11:53:52 +02:00
Tejun Heo 8a5ecdd428 block: add q->nr_rqs[] and move q->rq.elvpriv to q->nr_rqs_elvpriv
Add q->nr_rqs[] which currently behaves the same as q->rq.count[] and
move q->rq.elvpriv to q->nr_rqs_elvpriv.  blk_drain_queue() is updated
to use q->nr_rqs[] instead of q->rq.count[].

These counters separates queue-wide request statistics from the
request list and allow implementation of per-queue request allocation.

While at it, properly indent fields of struct request_list.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-06-25 11:53:51 +02:00
Tejun Heo b1208b56f3 blkcg: inline bio_blkcg() and friends
Make bio_blkcg() and friends inline.  They all are very simple and
used only in few places.

This patch is to prepare for further updates to request allocation
path.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-06-25 11:53:50 +02:00
Tejun Heo 7f4b35d155 block: allocate io_context upfront
Block layer very lazy allocation of ioc.  It waits until the moment
ioc is absolutely necessary; unfortunately, that time could be inside
queue lock and __get_request() performs unlock - try alloc - retry
dancing.

Just allocate it up-front on entry to block layer.  We're not saving
the rain forest by deferring it to the last possible moment and
complicating things unnecessarily.

This patch is to prepare for further updates to request allocation
path.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-06-25 11:53:50 +02:00
Tejun Heo a06e05e6af block: refactor get_request[_wait]()
Currently, there are two request allocation functions - get_request()
and get_request_wait().  The former tries to allocate a request once
and the latter keeps retrying until it succeeds.  The latter wraps the
former and keeps retrying until allocation succeeds.

The combination of two functions deliver fallible non-wait allocation,
fallible wait allocation and unfailing wait allocation.  However,
given that forward progress is guaranteed, fallible wait allocation
isn't all that useful and in fact nobody uses it.

This patch simplifies the interface as follows.

* get_request() is renamed to __get_request() and is only used by the
  wrapper function.

* get_request_wait() is renamed to get_request().  It now takes
  @gfp_mask and retries iff it contains %__GFP_WAIT.

This patch doesn't introduce any functional change and is to prepare
for further updates to request allocation path.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-06-25 11:53:49 +02:00
Tejun Heo 86072d8112 block: drop custom queue draining used by scsi_transport_{iscsi|fc}
iscsi_remove_host() uses bsg_remove_queue() which implements custom
queue draining.  fc_bsg_remove() open-codes mostly identical logic.

The draining logic isn't correct in that blk_stop_queue() doesn't
prevent new requests from being queued - it just stops processing, so
nothing prevents new requests to be queued after the logic determines
that the queue is drained.

blk_cleanup_queue() now implements proper queue draining and these
custom draining logics aren't necessary.  Drop them and use
bsg_unregister_queue() + blk_cleanup_queue() instead.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Mike Christie <michaelc@cs.wisc.edu>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: James Smart <james.smart@emulex.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-06-25 11:53:48 +02:00
Tejun Heo a91a5ac685 mempool: add @gfp_mask to mempool_create_node()
mempool_create_node() currently assumes %GFP_KERNEL.  Its only user,
blk_init_free_list(), is about to be updated to use other allocation
flags - add @gfp_mask argument to the function.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-06-25 11:53:47 +02:00
Tejun Heo 159749937a blkcg: make root blkcg allocation use %GFP_KERNEL
Currently, blkcg_activate_policy() depends on %GFP_ATOMIC allocation
from __blkg_lookup_create() for root blkcg creation.  This could make
policy fail unnecessarily.

Make blkg_alloc() take @gfp_mask, __blkg_lookup_create() take an
optional @new_blkg for preallocated blkg, and blkcg_activate_policy()
preload radix tree and preallocate blkg with %GFP_KERNEL before trying
to create the root blkg.

v2: __blkg_lookup_create() was returning %NULL on blkg alloc failure
   instead of ERR_PTR() value.  Fixed.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-06-25 11:53:46 +02:00
Tejun Heo 13589864be blkcg: __blkg_lookup_create() doesn't need radix preload
There's no point in calling radix_tree_preload() if preloading doesn't
use more permissible GFP mask.  Drop preloading from
__blkg_lookup_create().

While at it, drop sparse locking annotation which no longer applies.

v2: Vivek pointed out the odd preload usage.  Instead of updating,
    just drop it.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-06-25 11:53:45 +02:00
Jan Kara 6d93592807 scsi: Silence unnecessary warnings about ioctl to partition
Sometimes, warnings about ioctls to partition happen often enough that they
form majority of the warnings in the kernel log and users complain. In some
cases warnings are about ioctls such as SG_IO so it's not good to get rid of
the warnings completely as they can ease debugging of userspace problems
when ioctl is refused.

Since I have seen warnings from lots of commands, including some proprietary
userspace applications, I don't think disallowing the ioctls for processes
with CAP_SYS_RAWIO will happen in the near future if ever. So lets just
stop warning for processes with CAP_SYS_RAWIO for which ioctl is allowed.

CC: Paolo Bonzini <pbonzini@redhat.com>
CC: James Bottomley <JBottomley@parallels.com>
CC: linux-scsi@vger.kernel.org
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-06-15 12:52:46 +02:00
Asias He 76aaa5101f block: Drop dead function blk_abort_queue()
This function was only used by btrfs code in btrfs_abort_devices()
(seems in a wrong way).

It was removed in commit d07eb91170,
So, Let's remove the dead code to avoid any confusion.

Changes in v2: update commit log, btrfs_abort_devices() was removed
already.

Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-kernel@vger.kernel.org
Cc: Chris Mason <chris.mason@oracle.com>
Cc: linux-btrfs@vger.kernel.org
Cc: David Sterba <dave@jikos.cz>
Signed-off-by: Asias He <asias@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-06-15 08:46:23 +02:00
Asias He 5e5cfac0c6 block: Mitigate lock unbalance caused by lock switching
Commit 777eb1bf15 disconnects externally
supplied queue_lock before blk_drain_queue(). Switching the lock would
introduce lock unbalance because theads which have taken the external
lock might unlock the internal lock in the during the queue drain. This
patch mitigate this by disconnecting the lock after the queue draining
since queue draining makes a lot of request_queue users go away.

However, please note, this patch only makes the problem less likely to
happen. Anyone who still holds a ref might try to issue a new request on
a dead queue after the blk_cleanup_queue() finishes draining, the lock
unbalance might still happen in this case.

 =====================================
 [ BUG: bad unlock balance detected! ]
 3.4.0+ #288 Not tainted
 -------------------------------------
 fio/17706 is trying to release lock (&(&q->__queue_lock)->rlock) at:
 [<ffffffff81329372>] blk_queue_bio+0x2a2/0x380
 but there are no more locks to release!

 other info that might help us debug this:
 1 lock held by fio/17706:
  #0:  (&(&vblk->lock)->rlock){......}, at: [<ffffffff81327f1a>]
 get_request_wait+0x19a/0x250

 stack backtrace:
 Pid: 17706, comm: fio Not tainted 3.4.0+ #288
 Call Trace:
  [<ffffffff81329372>] ? blk_queue_bio+0x2a2/0x380
  [<ffffffff810dea49>] print_unlock_inbalance_bug+0xf9/0x100
  [<ffffffff810dfe4f>] lock_release_non_nested+0x1df/0x330
  [<ffffffff811dae24>] ? dio_bio_end_aio+0x34/0xc0
  [<ffffffff811d6935>] ? bio_check_pages_dirty+0x85/0xe0
  [<ffffffff811daea1>] ? dio_bio_end_aio+0xb1/0xc0
  [<ffffffff81329372>] ? blk_queue_bio+0x2a2/0x380
  [<ffffffff81329372>] ? blk_queue_bio+0x2a2/0x380
  [<ffffffff810e0079>] lock_release+0xd9/0x250
  [<ffffffff81a74553>] _raw_spin_unlock_irq+0x23/0x40
  [<ffffffff81329372>] blk_queue_bio+0x2a2/0x380
  [<ffffffff81328faa>] generic_make_request+0xca/0x100
  [<ffffffff81329056>] submit_bio+0x76/0xf0
  [<ffffffff8115470c>] ? set_page_dirty_lock+0x3c/0x60
  [<ffffffff811d69e1>] ? bio_set_pages_dirty+0x51/0x70
  [<ffffffff811dd1a8>] do_blockdev_direct_IO+0xbf8/0xee0
  [<ffffffff811d8620>] ? blkdev_get_block+0x80/0x80
  [<ffffffff811dd4e5>] __blockdev_direct_IO+0x55/0x60
  [<ffffffff811d8620>] ? blkdev_get_block+0x80/0x80
  [<ffffffff811d92e7>] blkdev_direct_IO+0x57/0x60
  [<ffffffff811d8620>] ? blkdev_get_block+0x80/0x80
  [<ffffffff8114c6ae>] generic_file_aio_read+0x70e/0x760
  [<ffffffff810df7c5>] ? __lock_acquire+0x215/0x5a0
  [<ffffffff811e9924>] ? aio_run_iocb+0x54/0x1a0
  [<ffffffff8114bfa0>] ? grab_cache_page_nowait+0xc0/0xc0
  [<ffffffff811e82cc>] aio_rw_vect_retry+0x7c/0x1e0
  [<ffffffff811e8250>] ? aio_fsync+0x30/0x30
  [<ffffffff811e9936>] aio_run_iocb+0x66/0x1a0
  [<ffffffff811ea9b0>] do_io_submit+0x6f0/0xb80
  [<ffffffff8134de2e>] ? trace_hardirqs_on_thunk+0x3a/0x3f
  [<ffffffff811eae50>] sys_io_submit+0x10/0x20
  [<ffffffff81a7c9e9>] system_call_fastpath+0x16/0x1b

Changes since v2: Update commit log to explain how the code is still
                  broken even if we delay the lock switching after the drain.
Changes since v1: Update commit log as Tejun suggested.

Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Asias He <asias@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-06-15 08:46:22 +02:00
Asias He 458f27a982 block: Avoid missed wakeup in request waitqueue
After hot-unplug a stressed disk, I found that rl->wait[] is not empty
while rl->count[] is empty and there are theads still sleeping on
get_request after the queue cleanup. With simple debug code, I found
there are exactly nr_sleep - nr_wakeup of theads in D state. So there
are missed wakeup.

  $ dmesg | grep nr_sleep
  [   52.917115] ---> nr_sleep=1046, nr_wakeup=873, delta=173
  $ vmstat 1
  1 173  0 712640  24292  96172 0 0  0  0  419  757  0  0  0 100  0

To quote Tejun:

  Ah, okay, freed_request() wakes up single waiter with the assumption
  that after the wakeup there will at least be one successful allocation
  which in turn will continue the wakeup chain until the wait list is
  empty - ie. waiter wakeup is dependent on successful request
  allocation happening after each wakeup.  With queue marked dead, any
  woken up waiter fails the allocation path, so the wakeup chaining is
  lost and we're left with hung waiters. What we need is wake_up_all()
  after drain completion.

This patch fixes the missed wakeup by waking up all the theads which
are sleeping on wait queue after queue drain.

Changes in v2: Drop waitqueue_active() optimization

Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Asias He <asias@redhat.com>

Fixed a bug by me, where stacked devices would oops on calling
blk_drain_queue() since ->rq.wait[] do not get initialized unless
it's a full queue setup.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-06-15 08:45:25 +02:00
Tejun Heo 27e1f9d1cc blkcg: drop local variable @q from blkg_destroy()
blkg_destroy() caches @blkg->q in local variable @q.  While there are
two places which needs @blkg->q, only lockdep_assert_held() used the
local variable leading to unused local variable warning if lockdep is
configured out.  Drop the local variable and just use @blkg->q
directly.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Rakesh Iyer <rni@google.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-06-06 08:35:31 +02:00
Tejun Heo 9b2ea86bc9 blkcg: fix blkg_alloc() failure path
When policy data allocation fails in the middle, blkg_alloc() invokes
blkg_free() to destroy the half constructed blkg.  This ends up
calling pd_exit_fn() on policy datas which didn't go through
pd_init_fn().  Fix it by making blkg_alloc() call pd_init_fn()
immediately after each policy data allocation.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-06-04 10:03:21 +02:00
Tejun Heo ffea73fc72 block: blkcg_policy_cfq shouldn't be used if !CONFIG_CFQ_GROUP_IOSCHED
cfq may be built w/ or w/o blkcg support depending on
CONFIG_CFQ_CGROUP_IOSCHED.  If blkcg support is disabled, most of
related code is ifdef'd out but some part is left dangling -
blkcg_policy_cfq is left zero-filled and blkcg_policy_[un]register()
calls are made on it.

Feeding zero filled policy to blkcg_policy_register() is incorrect and
triggers the following WARN_ON() if CONFIG_BLK_CGROUP &&
!CONFIG_CFQ_GROUP_IOSCHED.

 ------------[ cut here ]------------
 WARNING: at block/blk-cgroup.c:867
 Modules linked in:
 Modules linked in:
 CPU: 3 Not tainted 3.4.0-09547-gfb21aff #1
 Process swapper/0 (pid: 1, task: 000000003ff80000, ksp: 000000003ff7f8b8)
 Krnl PSW : 0704100180000000 00000000003d76ca (blkcg_policy_register+0xca/0xe0)
	    R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:0 CC:1 PM:0 EA:3
 Krnl GPRS: 0000000000000000 00000000014b85ec 00000000014b85b0 0000000000000000
	    000000000096fb60 0000000000000000 00000000009a8e78 0000000000000048
	    000000000099c070 0000000000b6f000 0000000000000000 000000000099c0b8
	    00000000014b85b0 0000000000667580 000000003ff7fd98 000000003ff7fd70
 Krnl Code: 00000000003d76be: a7280001           lhi     %r2,1
	    00000000003d76c2: a7f4ffdf           brc     15,3d7680
	   #00000000003d76c6: a7f40001           brc     15,3d76c8
	   >00000000003d76ca: a7c8ffea           lhi     %r12,-22
	    00000000003d76ce: a7f4ffce           brc     15,3d766a
	    00000000003d76d2: a7f40001           brc     15,3d76d4
	    00000000003d76d6: a7c80000           lhi     %r12,0
	    00000000003d76da: a7f4ffc2           brc     15,3d765e
 Call Trace:
 ([<0000000000b6f000>] initcall_debug+0x0/0x4)
  [<0000000000989e8a>] cfq_init+0x62/0xd4
  [<00000000001000ba>] do_one_initcall+0x3a/0x170
  [<000000000096fb60>] kernel_init+0x214/0x2bc
  [<0000000000623202>] kernel_thread_starter+0x6/0xc
  [<00000000006231fc>] kernel_thread_starter+0x0/0xc
 no locks held by swapper/0/1.
 Last Breaking-Event-Address:
  [<00000000003d76c6>] blkcg_policy_register+0xc6/0xe0
 ---[ end trace b8ef4903fcbf9dd3 ]---

This patch fixes the problem by ensuring all blkcg support code is
inside CONFIG_CFQ_GROUP_IOSCHED.

* blkcg_policy_cfq declaration and blkg_to_cfqg() definition are moved
  inside the first CONFIG_CFQ_GROUP_IOSCHED block.  __maybe_unused is
  dropped from blkcg_policy_cfq decl.

* blkcg_deactivate_poilcy() invocation is moved inside ifdef.  This
  also makes the activation logic match cfq_init_queue().

* All blkcg_policy_[un]register() invocations are moved inside ifdef.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Heiko Carstens <heiko.carstens@de.ibm.com>
LKML-Reference: <20120601112954.GC3535@osiris.boeblingen.de.ibm.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-06-04 10:02:29 +02:00
Tejun Heo fd7949564c block: fix return value on cfq_init() failure
cfq_init() would return zero after kmem cache creation failure.  Fix
so that it returns -ENOMEM.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-06-04 10:01:38 +02:00
Eric Dumazet 3c9c708c9f block: avoid infinite loop in get_task_io_context()
Calling get_task_io_context() on a exiting task which isn't %current can
loop forever. This triggers at boot time on my dev machine.

BUG: soft lockup - CPU#3 stuck for 22s ! [mountall.1603]

Fix this by making create_task_io_context() returns -EBUSY in this case
to break the loop.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Alan Cox <alan@linux.intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-05-31 13:39:05 +02:00
Linus Torvalds 0d167518e0 Merge branch 'for-3.5/core' of git://git.kernel.dk/linux-block
Merge block/IO core bits from Jens Axboe:
 "This is a bit bigger on the core side than usual, but that is purely
  because we decided to hold off on parts of Tejun's submission on 3.4
  to give it a bit more time to simmer.  As a consequence, it's seen a
  long cycle in for-next.

  It contains:

   - Bug fix from Dan, wrong locking type.
   - Relax splice gifting restriction from Eric.
   - A ton of updates from Tejun, primarily for blkcg.  This improves
     the code a lot, making the API nicer and cleaner, and also includes
     fixes for how we handle and tie policies and re-activate on
     switches.  The changes also include generic bug fixes.
   - A simple fix from Vivek, along with a fix for doing proper delayed
     allocation of the blkcg stats."

Fix up annoying conflict just due to different merge resolution in
Documentation/feature-removal-schedule.txt

* 'for-3.5/core' of git://git.kernel.dk/linux-block: (92 commits)
  blkcg: tg_stats_alloc_lock is an irq lock
  vmsplice: relax alignement requirements for SPLICE_F_GIFT
  blkcg: use radix tree to index blkgs from blkcg
  blkcg: fix blkcg->css ref leak in __blkg_lookup_create()
  block: fix elvpriv allocation failure handling
  block: collapse blk_alloc_request() into get_request()
  blkcg: collapse blkcg_policy_ops into blkcg_policy
  blkcg: embed struct blkg_policy_data in policy specific data
  blkcg: mass rename of blkcg API
  blkcg: style cleanups for blk-cgroup.h
  blkcg: remove blkio_group->path[]
  blkcg: blkg_rwstat_read() was missing inline
  blkcg: shoot down blkgs if all policies are deactivated
  blkcg: drop stuff unused after per-queue policy activation update
  blkcg: implement per-queue policy activation
  blkcg: add request_queue->root_blkg
  blkcg: make request_queue bypassing on allocation
  blkcg: make sure blkg_lookup() returns %NULL if @q is bypassing
  blkcg: make blkg_conf_prep() take @pol and return with queue lock held
  blkcg: remove static policy ID enums
  ...
2012-05-30 08:52:42 -07:00
Tejun Heo ff26eaadf4 blkcg: tg_stats_alloc_lock is an irq lock
tg_stats_alloc_lock nests inside queue lock and should always be held
with irq disabled.  throtl_pd_{init|exit}() were using non-irqsafe
spinlock ops which triggered inverse lock ordering via irq warning via
RCU freeing of blkg invoking throtl_pd_exit() w/o disabling IRQ.

Update both functions to use irq safe operations.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Sasha Levin <sasha.levin@oracle.com>
LKML-Reference: <1335339396.16988.80.camel@lappy>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-05-23 12:16:21 +02:00
Linus Torvalds 88d6ae8dc3 Merge branch 'for-3.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo:
 "cgroup file type addition / removal is updated so that file types are
  added and removed instead of individual files so that dynamic file
  type addition / removal can be implemented by cgroup and used by
  controllers.  blkio controller changes which will come through block
  tree are dependent on this.  Other changes include res_counter cleanup
  and disallowing kthread / PF_THREAD_BOUND threads to be attached to
  non-root cgroups.

  There's a reported bug with the file type addition / removal handling
  which can lead to oops on cgroup umount.  The issue is being looked
  into.  It shouldn't cause problems for most setups and isn't a
  security concern."

Fix up trivial conflict in Documentation/feature-removal-schedule.txt

* 'for-3.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (21 commits)
  res_counter: Account max_usage when calling res_counter_charge_nofail()
  res_counter: Merge res_counter_charge and res_counter_charge_nofail
  cgroups: disallow attaching kthreadd or PF_THREAD_BOUND threads
  cgroup: remove cgroup_subsys->populate()
  cgroup: get rid of populate for memcg
  cgroup: pass struct mem_cgroup instead of struct cgroup to socket memcg
  cgroup: make css->refcnt clearing on cgroup removal optional
  cgroup: use negative bias on css->refcnt to block css_tryget()
  cgroup: implement cgroup_rm_cftypes()
  cgroup: introduce struct cfent
  cgroup: relocate __d_cgrp() and __d_cft()
  cgroup: remove cgroup_add_file[s]()
  cgroup: convert memcg controller to the new cftype interface
  memcg: always create memsw files if CONFIG_CGROUP_MEM_RES_CTLR_SWAP
  cgroup: convert all non-memcg controllers to the new cftype interface
  cgroup: relocate cftype and cgroup_subsys definitions in controllers
  cgroup: merge cft_release_agent cftype array into the base files array
  cgroup: implement cgroup_add_cftypes() and friends
  cgroup: build list of all cgroups under a given cgroupfs_root
  cgroup: move cgroup_clear_directory() call out of cgroup_populate_dir()
  ...
2012-05-22 17:40:19 -07:00
Linus Torvalds e60b9a0346 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull s390 updates from Martin Schwidefsky:
 "Just a random collection of bug-fixes and cleanups, nothing new in
  this merge request."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (46 commits)
  s390/ap: Fix wrong or missing comments
  s390/ap: move receive callback to message struct
  s390/dasd: re-prioritize partition detection message
  s390/qeth: reshuffle initialization
  s390/qeth: cleanup drv attr usage
  s390/claw: cleanup drv attr usage
  s390/lcs: cleanup drv attr usage
  s390/ctc: cleanup drv attr usage
  s390/ccwgroup: remove ccwgroup_create_from_string
  s390/qeth: stop using struct ccwgroup driver for discipline callbacks
  s390/qeth: switch to ccwgroup_create_dev
  s390/claw: switch to ccwgroup_create_dev
  s390/lcs: switch to ccwgroup_create_dev
  s390/ctcm: switch to ccwgroup_create_dev
  s390/ccwgroup: exploit ccwdev_by_dev_id
  s390/ccwgroup: introduce ccwgroup_create_dev
  s390: fix race on TIF_MCCK_PENDING
  s390/barrier: make use of fast-bcr facility
  s390/barrier: cleanup barrier functions
  s390/claw: remove "eieio" calls
  ...
2012-05-21 12:41:17 -07:00
Stefan Haberland 505e5ecfd3 s390/dasd: re-prioritize partition detection message
To avoid confusion while formatting a DASD device change the level of
the "Expected VOL1 label not found" message from warning to info.

Signed-off-by: Stefan Haberland <stefan.haberland@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2012-05-16 14:42:51 +02:00
Tejun Heo 05c69d298c block: fix buffer overflow when printing partition UUIDs
6d1d8050b4 "block, partition: add partition_meta_info to hd_struct"
added part_unpack_uuid() which assumes that the passed in buffer has
enough space for sprintfing "%pU" - 37 characters including '\0'.

Unfortunately, b5af921ec0 "init: add support for root devices
specified by partition UUID" supplied 33 bytes buffer to the function
leading to the following panic with stackprotector enabled.

  Kernel panic - not syncing: stack-protector: Kernel stack corrupted in: ffffffff81b14c7e

  [<ffffffff815e226b>] panic+0xba/0x1c6
  [<ffffffff81b14c7e>] ? printk_all_partitions+0x259/0x26xb
  [<ffffffff810566bb>] __stack_chk_fail+0x1b/0x20
  [<ffffffff81b15c7e>] printk_all_paritions+0x259/0x26xb
  [<ffffffff81aedfe0>] mount_block_root+0x1bc/0x27f
  [<ffffffff81aee0fa>] mount_root+0x57/0x5b
  [<ffffffff81aee23b>] prepare_namespace+0x13d/0x176
  [<ffffffff8107eec0>] ? release_tgcred.isra.4+0x330/0x30
  [<ffffffff81aedd60>] kernel_init+0x155/0x15a
  [<ffffffff81087b97>] ? schedule_tail+0x27/0xb0
  [<ffffffff815f4d24>] kernel_thread_helper+0x5/0x10
  [<ffffffff81aedc0b>] ? start_kernel+0x3c5/0x3c5
  [<ffffffff815f4d20>] ? gs_change+0x13/0x13

Increase the buffer size, remove the dangerous part_unpack_uuid() and
use snprintf() directly from printk_all_partitions().

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Szymon Gruszczynski <sz.gruszczynski@googlemail.com>
Cc: Will Drewry <wad@chromium.org>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-05-15 08:22:04 +02:00
Jens Axboe 0b7877d4ee Linux 3.4-rc5
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.18 (GNU/Linux)
 
 iQEcBAABAgAGBQJPnb50AAoJEHm+PkMAQRiGAE0H/A4zFZIUGmF3miKPDYmejmrZ
 oVDYxVAu6JHjHWhu8E3VsinvyVscowjV8dr15eSaQzmDmRkSHAnUQ+dB7Di7jLC2
 MNopxsWjwyZ8zvvr3rFR76kjbWKk/1GYytnf7GPZLbJQzd51om2V/TY/6qkwiDSX
 U8Tt7ihSgHAezefqEmWp2X/1pxDCEt+VFyn9vWpkhgdfM1iuzF39MbxSZAgqDQ/9
 JJrBHFXhArqJguhENwL7OdDzkYqkdzlGtS0xgeY7qio2CzSXxZXK4svT6FFGA8Za
 xlAaIvzslDniv3vR2ZKd6wzUwFHuynX222hNim3QMaYdXm012M+Nn1ufKYGFxI0=
 =4d4w
 -----END PGP SIGNATURE-----

Merge tag 'v3.4-rc5' into for-3.5/core

The core branch is behind driver commits that we want to build
on for 3.5, hence I'm pulling in a later -rc.

Linux 3.4-rc5

Conflicts:
	Documentation/feature-removal-schedule.txt

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-05-01 14:29:55 +02:00
Tejun Heo a637120e49 blkcg: use radix tree to index blkgs from blkcg
blkg lookup is currently performed by traversing linked list anchored
at blkcg->blkg_list.  This is very unscalable and with blk-throttle
enabled and enough request queues on the system, this can get very
ugly quickly (blk-throttle performs look up on every bio submission).

This patch makes blkcg use radix tree to index blkgs combined with
simple last-looked-up hint.  This is mostly identical to how icqs are
indexed from ioc.

Note that because __blkg_lookup() may be invoked without holding queue
lock, hint is only updated from __blkg_lookup_create().  Due to cfq's
cfqq caching, this makes hint updates overly lazy.  This will be
improved with scheduled blkcg aware request allocation.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-04-20 10:06:40 +02:00
Tejun Heo 496fb7806d blkcg: fix blkcg->css ref leak in __blkg_lookup_create()
__blkg_lookup_create() leaked blkcg->css ref if blkg allocation
failed.  Fix it.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-04-20 10:06:40 +02:00
Tejun Heo aaf7c68068 block: fix elvpriv allocation failure handling
Request allocation is mempool backed to guarantee forward progress
under memory pressure; unfortunately, this property got broken while
adding elvpriv data.  Failures during elvpriv allocation, including
ioc and icq creation failures, currently make get_request() fail as
whole.  There's no forward progress guarantee for these allocations -
they may fail indefinitely under memory pressure stalling IO and
deadlocking the system.

This patch updates get_request() such that elvpriv allocation failure
doesn't make the whole function fail.  If elvpriv allocation fails,
the allocation is degraded into !ELVPRIV.  This will force the request
to ELEVATOR_INSERT_BACK disturbing scheduling but elvpriv alloc
failures should be rare (nothing is per-request) and anything is
better than deadlocking.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-04-20 10:06:40 +02:00
Tejun Heo 29e2b09ab5 block: collapse blk_alloc_request() into get_request()
Allocation failure handling in get_request() is about to be updated.
To ease the update, collapse blk_alloc_request() into get_request().

This patch doesn't introduce any functional change.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-04-20 10:06:40 +02:00
Tejun Heo f9fcc2d391 blkcg: collapse blkcg_policy_ops into blkcg_policy
There's no reason to keep blkcg_policy_ops separate.  Collapse it into
blkcg_policy.

This patch doesn't introduce any functional change.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-04-20 10:06:17 +02:00
Tejun Heo f95a04afa8 blkcg: embed struct blkg_policy_data in policy specific data
Currently blkg_policy_data carries policy specific data as char flex
array instead of being embedded in policy specific data.  This was
forced by oddities around blkg allocation which are all gone now.

This patch makes blkg_policy_data embedded in policy specific data -
throtl_grp and cfq_group so that it's more conventional and consistent
with how io_cq is handled.

* blkcg_policy->pdata_size is renamed to ->pd_size.

* Functions which used to take void *pdata now takes struct
  blkg_policy_data *pd.

* blkg_to_pdata/pdata_to_blkg() updated to blkg_to_pd/pd_to_blkg().

* Dummy struct blkg_policy_data definition added.  Dummy
  pdata_to_blkg() definition was unused and inconsistent with the
  non-dummy version - correct dummy pd_to_blkg() added.

* throtl and cfq updated accordingly.

* As dummy blkg_to_pd/pd_to_blkg() are provided,
  blkg_to_cfqg/cfqg_to_blkg() don't need to be ifdef'd.  Moved outside
  ifdef block.

This patch doesn't introduce any functional change.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-04-20 10:06:17 +02:00
Tejun Heo 3c798398e3 blkcg: mass rename of blkcg API
During the recent blkcg cleanup, most of blkcg API has changed to such
extent that mass renaming wouldn't cause any noticeable pain.  Take
the chance and cleanup the naming.

* Rename blkio_cgroup to blkcg.

* Drop blkio / blkiocg prefixes and consistently use blkcg.

* Rename blkio_group to blkcg_gq, which is consistent with io_cq but
  keep the blkg prefix / variable name.

* Rename policy method type and field names to signify they're dealing
  with policy data.

* Rename blkio_policy_type to blkcg_policy.

This patch doesn't cause any functional change.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-04-20 10:06:17 +02:00
Tejun Heo 36558c8a30 blkcg: style cleanups for blk-cgroup.h
* Update indentation on struct field declarations.

* Uniformly don't use "extern" on function declarations.

* Merge the two #ifdef CONFIG_BLK_CGROUP blocks.

All changes in this patch are cosmetic.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-04-20 10:06:16 +02:00
Tejun Heo 54e7ed12ba blkcg: remove blkio_group->path[]
blkio_group->path[] stores the path of the associated cgroup and is
used only for debug messages.  Just format the path from blkg->cgroup
when printing debug messages.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-04-20 10:06:16 +02:00
Tejun Heo c94bed8999 blkcg: blkg_rwstat_read() was missing inline
blkg_rwstat_read() in blk-cgroup.h was missing inline modifier causing
compile warning depending on configuration.  Add it.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-04-20 10:06:16 +02:00
Tejun Heo 6d18b008da blkcg: shoot down blkgs if all policies are deactivated
There's no reason to keep blkgs around if no policy is activated for
the queue.  This patch moves queue locking out of blkg_destroy_all()
and call it from blkg_deactivate_policy() on deactivation of the last
policy on the queue.

This change was suggested by Vivek.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-04-20 10:06:06 +02:00
Tejun Heo 3c96cb32d3 blkcg: drop stuff unused after per-queue policy activation update
* All_q_list is unused.  Drop all_q_{mutex|list}.

* @for_root of blkg_lookup_create() is always %false when called from
  outside blk-cgroup.c proper.  Factor out __blkg_lookup_create() so
  that it doesn't check whether @q is bypassing and use the
  underscored version for the @for_root callsite.

* blkg_destroy_all() is used only from blkcg proper and @destroy_root
  is always %true.  Make it static and drop @destroy_root.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-04-20 10:06:06 +02:00
Tejun Heo a2b1693bac blkcg: implement per-queue policy activation
All blkcg policies were assumed to be enabled on all request_queues.
Due to various implementation obstacles, during the recent blkcg core
updates, this was temporarily implemented as shooting down all !root
blkgs on elevator switch and policy [de]registration combined with
half-broken in-place root blkg updates.  In addition to being buggy
and racy, this meant losing all blkcg configurations across those
events.

Now that blkcg is cleaned up enough, this patch replaces the temporary
implementation with proper per-queue policy activation.  Each blkcg
policy should call the new blkcg_[de]activate_policy() to enable and
disable the policy on a specific queue.  blkcg_activate_policy()
allocates and installs policy data for the policy for all existing
blkgs.  blkcg_deactivate_policy() does the reverse.  If a policy is
not enabled for a given queue, blkg printing / config functions skip
the respective blkg for the queue.

blkcg_activate_policy() also takes care of root blkg creation, and
cfq_init_queue() and blk_throtl_init() are updated accordingly.

This replaces blkcg_bypass_{start|end}() and update_root_blkg_pd()
unnecessary.  Dropped.

v2: cfq_init_queue() was returning uninitialized @ret on root_group
    alloc failure if !CONFIG_CFQ_GROUP_IOSCHED.  Fixed.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-04-20 10:06:06 +02:00
Tejun Heo 03d8e11142 blkcg: add request_queue->root_blkg
With per-queue policy activation, root blkg creation will be moved to
blkcg core.  Add q->root_blkg in preparation.  For blk-throtl, this
replaces throtl_data->root_tg; however, cfq needs to keep
cfqd->root_group for !CONFIG_CFQ_GROUP_IOSCHED.

This is to prepare for per-queue policy activation and doesn't cause
any functional difference.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-04-20 10:06:06 +02:00
Tejun Heo b82d4b197c blkcg: make request_queue bypassing on allocation
With the previous change to guarantee bypass visiblity for RCU read
lock regions, entering bypass mode involves non-trivial overhead and
future changes are scheduled to make use of bypass mode during init
path.  Combined it may end up adding noticeable delay during boot.

This patch makes request_queue start its life in bypass mode, which is
ended on queue init completion at the end of
blk_init_allocated_queue(), and updates blk_queue_bypass_start() such
that draining and RCU synchronization are performed only when the
queue actually enters bypass mode.

This avoids unnecessarily switching in and out of bypass mode during
init avoiding the overhead and any nasty surprises which may step from
leaving bypass mode on half-initialized queues.

The boot time overhead was pointed out by Vivek.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-04-20 10:06:06 +02:00
Tejun Heo 80fd99792b blkcg: make sure blkg_lookup() returns %NULL if @q is bypassing
Currently, blkg_lookup() doesn't check @q bypass state.  This patch
updates blk_queue_bypass_start() to do synchronize_rcu() before
returning and updates blkg_lookup() to check blk_queue_bypass() and
return %NULL if bypassing.  This ensures blkg_lookup() returns %NULL
if @q is bypassing.

This is to guarantee that nobody is accessing policy data while @q is
bypassing, which is necessary to allow replacing blkio_cgroup->pd[] in
place on policy [de]activation.

v2: Added more comments explaining bypass guarantees as suggested by
    Vivek.

v3: Added more comments explaining why there's no synchronize_rcu() in
    blk_cleanup_queue() as suggested by Vivek.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-04-20 10:06:06 +02:00
Tejun Heo da8b066262 blkcg: make blkg_conf_prep() take @pol and return with queue lock held
Add @pol to blkg_conf_prep() and let it return with queue lock held
(to be released by blkg_conf_finish()).  Note that @pol isn't used
yet.

This is to prepare for per-queue policy activation and doesn't cause
any visible difference.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-04-20 10:06:06 +02:00
Tejun Heo 8bd435b30e blkcg: remove static policy ID enums
Remove BLKIO_POLICY_* enums and let blkio_policy_register() allocate
@pol->plid dynamically on registration.  The maximum number of blkcg
policies which can be registered at the same time is defined by
BLKCG_MAX_POLS constant added to include/linux/blkdev.h.

Note that blkio_policy_register() now may fail.  Policy init functions
updated accordingly and unnecessary ifdefs removed from cfq_init().

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-04-20 10:06:06 +02:00
Tejun Heo ec399347d3 blkcg: use @pol instead of @plid in update_root_blkg_pd() and blkcg_print_blkgs()
The two functions were taking "enum blkio_policy_id plid".  Make them
take "const struct blkio_policy_type *pol" instead.

This is to prepare for per-queue policy activation and doesn't cause
any functional difference.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-04-20 10:06:06 +02:00
Tejun Heo bc0d6501a8 blkcg: kill blkio_list and replace blkio_list_lock with a mutex
With blkio_policy[], blkio_list is redundant and hinders with
per-queue policy activation.  Remove it.  Also, replace
blkio_list_lock with a mutex blkcg_pol_mutex and let it protect the
whole [un]registration.

This is to prepare for per-queue policy activation and doesn't cause
any functional difference.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-04-20 10:06:06 +02:00
Tejun Heo f48ec1d788 cfq: fix build breakage & warnings
* CFQ_WEIGHT_* defined inside CONFIG_BLK_CGROUP causes cfq-iosched.c
  compile failure when the config is disabled.  Move it outside the
  ifdef block.

* Dummy cfqg_stats_*() definitions were lacking inline modifiers
  causing unused functions warning if !CONFIG_CFQ_GROUP_IOSCHED.  Add
  them.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-04-20 10:06:06 +02:00
Linus Torvalds d8dd0b6d48 Merge branch 'for-3.4/core' of git://git.kernel.dk/linux-block
Pull block core bits from Jens Axboe:
 "It's a nice and quiet round this time, since most of the tricky stuff
  has been pushed to 3.5 to give it more time to mature.  After a few
  hectic block IO core changes for 3.3 and 3.2, I'm quite happy with a
  slow round.

  Really minor stuff in here, the only real functional change is making
  the auto-unplug threshold a per-queue entity.  The threshold is set so
  that it's low enough that we don't hold off IO for too long, but still
  big enough to get a nice benefit from the batched insert (and hence
  queue lock cost reduction).  For raid configurations, this currently
  breaks down."

* 'for-3.4/core' of git://git.kernel.dk/linux-block:
  block: make auto block plug flush threshold per-disk based
  Documentation: Add sysfs ABI change for cfq's target latency.
  block: Make cfq_target_latency tunable through sysfs.
  block: use lockdep_assert_held for queue locking
  block: blk_alloc_queue_node(): use caller's GFP flags instead of GFP_KERNEL
2012-04-13 18:07:19 -07:00
Shaohua Li 1b2e19f17e block: make auto block plug flush threshold per-disk based
We do auto block plug flush to reduce latency, the threshold is 16
requests. This works well if the task is accessing one or two drives.
The problem is if the task is accessing a raid 0 device and the raid
disk number is big, say 8 or 16, 16/8 = 2 or 16/16=1, we will have
heavy lock contention.

This patch makes the threshold per-disk based. The latency should be
still ok accessing one or two drives. The setup with application
accessing a lot of drives in the meantime uaually is big machine,
avoiding lock contention is more important, because any contention
will actually increase latency.

Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-04-06 11:37:47 -06:00
Tejun Heo 5bc4afb1ec blkcg: drop BLKCG_STAT_{PRIV|POL|OFF} macros
Now that all stat handling code lives in policy implementations,
there's no need to encode policy ID in cft->private.

* Export blkcg_prfill_[rw]stat() from blkcg, remove
  blkcg_print_[rw]stat(), and implement cfqg_print_[rw]stat() which
  use hard-code BLKIO_POLICY_PROP.

* Use cft->private for offset of the target field directly and drop
  BLKCG_STAT_{PRIV|POL|OFF}().

Signed-off-by: Tejun Heo <tj@kernel.org>
2012-04-01 14:38:45 -07:00
Tejun Heo d366e7ec41 blkcg: pass around pd->pdata instead of pd itself in prfill functions
Now that all conf and stat fields are moved into policy specific
blkio_policy_data->pdata areas, there's no reason to use
blkio_policy_data itself in prfill functions.  Pass around @pd->pdata
instead of @pd.

Signed-off-by: Tejun Heo <tj@kernel.org>
2012-04-01 14:38:44 -07:00
Tejun Heo af133ceb26 blkcg: move blkio_group_conf->iops and ->bps to blk-throttle
blkio_cgroup_conf->iops and ->bps are owned by blk-throttle and has no
reason to be defined in blkcg core.  Drop them and let conf setting
functions directly manipulate throtl_grp->bps[] and ->iops[].

This makes blkio_group_conf empty.  Drop it.

Signed-off-by: Tejun Heo <tj@kernel.org>
2012-04-01 14:38:44 -07:00
Tejun Heo 3381cb8d2e blkcg: move blkio_group_conf->weight to cfq
blkio_group_conf->weight is owned by cfq and has no reason to be
defined in blkcg core.  Replace it with cfq_group->dev_weight and let
conf setting functions directly set it.  If dev_weight is zero, the
cfqg doesn't have device specific weight configured.

Also, rename BLKIO_WEIGHT_* constants to CFQ_WEIGHT_* and rename
blkio_cgroup->weight to blkio_cgroup->cfq_weight.  We eventually want
per-policy storage in blkio_cgroup but just mark the ownership of the
field for now.

Signed-off-by: Tejun Heo <tj@kernel.org>
2012-04-01 14:38:44 -07:00
Tejun Heo 8a3d26151f blkcg: move blkio_group_stats_cpu and friends to blk-throttle.c
blkio_group_stats_cpu is used only by blk-throtl and has no reason to
be defined in blkcg core.

* Move blkio_group_stats_cpu to blk-throttle.c and rename it to
  tg_stats_cpu.

* blkg_policy_data->stats_cpu is replaced with throtl_grp->stats_cpu.
  prfill functions updated accordingly.

* All related macros / functions are renamed so that they have tg_
  prefix and the unnecessary @pol arguments are dropped.

* Per-cpu stats allocation code is also moved from blk-cgroup.c to
  blk-throttle.c and gets simplified to only deal with
  BLKIO_POLICY_THROTL.  percpu stat free is performed by the exit
  method throtl_exit_blkio_group().

* throtl_reset_group_stats() implemented for
  blkio_reset_group_stats_fn method so that tg->stats_cpu can be
  reset.

Signed-off-by: Tejun Heo <tj@kernel.org>
2012-04-01 14:38:44 -07:00
Tejun Heo 155fead9b6 blkcg: move blkio_group_stats to cfq-iosched.c
blkio_group_stats contains only fields used by cfq and has no reason
to be defined in blkcg core.

* Move blkio_group_stats to cfq-iosched.c and rename it to cfqg_stats.

* blkg_policy_data->stats is replaced with cfq_group->stats.
  blkg_prfill_[rw]stat() are updated to use offset against pd->pdata
  instead.

* All related macros / functions are renamed so that they have cfqg_
  prefix and the unnecessary @pol arguments are dropped.

* All stat functions now take cfq_group * instead of blkio_group *.

* lockdep assertion on queue lock dropped.  Elevator runs under queue
  lock by default.  There isn't much to be gained by adding lockdep
  assertions at stat function level.

* cfqg_stats_reset() implemented for blkio_reset_group_stats_fn method
  so that cfqg->stats can be reset.

Signed-off-by: Tejun Heo <tj@kernel.org>
2012-04-01 14:38:44 -07:00
Tejun Heo 9ade5ea4ce blkcg: add blkio_policy_ops operations for exit and stat reset
Add blkio_policy_ops->blkio_exit_group_fn() and
->blkio_reset_group_stats_fn().  These will be used to further
modularize blkcg policy implementation.

Signed-off-by: Tejun Heo <tj@kernel.org>
2012-04-01 14:38:44 -07:00
Tejun Heo 41b38b6d54 blkcg: cfq doesn't need per-cpu dispatch stats
blkio_group_stats_cpu is used to count dispatch stats using per-cpu
counters.  This is used by both blk-throtl and cfq-iosched but the
sharing is rather silly.

* cfq-iosched doesn't need per-cpu dispatch stats.  cfq always updates
  those stats while holding queue_lock.

* blk-throtl needs per-cpu dispatch stats but only service_bytes and
  serviced.  It doesn't make use of sectors.

This patch makes cfq add and use global stats for service_bytes,
serviced and sectors, removes per-cpu sectors counter and moves
per-cpu stat printing code to blk-throttle.c.

Signed-off-by: Tejun Heo <tj@kernel.org>
2012-04-01 14:38:44 -07:00
Tejun Heo 629ed0b102 blkcg: move statistics update code to policies
As with conf/stats file handling code, there's no reason for stat
update code to live in blkcg core with policies calling into update
them.  The current organization is both inflexible and complex.

This patch moves stat update code to specific policies.  All
blkiocg_update_*_stats() functions which deal with BLKIO_POLICY_PROP
stats are collapsed into their cfq_blkiocg_update_*_stats()
counterparts.  blkiocg_update_dispatch_stats() is used by both
policies and duplicated as throtl_update_dispatch_stats() and
cfq_blkiocg_update_dispatch_stats().  This will be cleaned up later.

Signed-off-by: Tejun Heo <tj@kernel.org>
2012-04-01 14:38:44 -07:00
Tejun Heo 2ce4d50f9c cfq: collapse cfq.h into cfq-iosched.c
block/cfq.h contains some functions which interact with blkcg;
however, this is only part of it and cfq-iosched.c already has quite
some #ifdef CONFIG_CFQ_GROUP_IOSCHED.  With conf/stat handling being
moved to specific policies, having these relay functions isolated in
cfq.h doesn't make much sense.  Collapse cfq.h into cfq-iosched.c for
now.  Let's split blkcg support properly later if necessary.

Signed-off-by: Tejun Heo <tj@kernel.org>
2012-04-01 14:38:43 -07:00
Tejun Heo 60c2bc2d5a blkcg: move conf/stat file handling code to policies
blkcg conf/stat handling is convoluted in that details which belong to
specific policy implementations are all out in blkcg core and then
policies hook into core layer to access and manipulate confs and
stats.  This sadly achieves both inflexibility (confs/stats can't be
modified without messing with blkcg core) and complexity (all the
call-ins and call-backs).

The previous patches restructured conf and stat handling code such
that they can be separated out.  This patch relocates the file
handling part.  All conf/stat file handling code which belongs to
BLKIO_POLICY_PROP is moved to cfq-iosched.c and all
BKLIO_POLICY_THROTL code to blk-throtl.c.

The move is verbatim except for blkio_update_group_{weight|bps|iops}()
callbacks which relays conf changes to policies.  The configuration
settings are handled in policies themselves so the relaying isn't
necessary.  Conf setting functions are modified to directly call
per-policy update functions and the relaying mechanism is dropped.

Signed-off-by: Tejun Heo <tj@kernel.org>
2012-04-01 14:38:43 -07:00
Tejun Heo 44ea53de46 blkcg: implement blkio_policy_type->cftypes
Add blkiop->cftypes which is added and removed together with the
policy.  This will be used to move conf/stat handling to the policies.

Signed-off-by: Tejun Heo <tj@kernel.org>
2012-04-01 14:38:43 -07:00
Tejun Heo 829fdb5000 blkcg: export conf/stat helpers to prepare for reorganization
conf/stat handling is about to be moved to policy implementation from
blkcg core.  Export conf/stat helpers from blkcg core so that
blk-throttle and cfq-iosched can use them.

Signed-off-by: Tejun Heo <tj@kernel.org>
2012-04-01 14:38:43 -07:00
Tejun Heo 726fa6945e blkcg: simplify blkg_conf_prep()
blkg_conf_prep() implements "MAJ:MIN VAL" parsing manually, which is
unnecessary.  Just use sscanf("%u:%u %llu").  This might not reject
some malformed input (extra input at the end) but we don't care.

Signed-off-by: Tejun Heo <tj@kernel.org>
2012-04-01 14:38:43 -07:00
Tejun Heo 3a8b31d396 blkcg: restructure blkio_group configruation setting
As part of userland interface restructuring, this patch updates
per-blkio_group configuration setting.  Instead of funneling
everything through a master function which has hard-coded cases for
each config file it may handle, the common part is factored into
blkg_conf_prep() and blkg_conf_finish() and different configuration
setters are implemented using the helpers.

While this doesn't result in immediate LOC reduction, this enables
further cleanups and more modular implementation.

Signed-off-by: Tejun Heo <tj@kernel.org>
2012-04-01 14:38:43 -07:00
Tejun Heo c4682aec9c blkcg: restructure configuration printing
Similarly to the previous stat restructuring, this patch restructures
conf printing code such that,

* Conf printing uses the same helpers as stat.

* Printing function doesn't require hardcoded switching on the config
  being printed.  Note that this isn't complete yet for throttle
  confs.  The next patch will convert setting for these confs and will
  complete the transition.

* Printing uses read_seq_string callback (other methods will be phased
  out).

Note that blkio_group_conf.iops[2] is changed to u64 so that they can
be manipulated with the same functions.  This is transitional and will
go away later.

After this patch, per-device configurations - weight, bps and iops -
use __blkg_prfill_u64() for printing which uses white space as
delimiter instead of tab.

Signed-off-by: Tejun Heo <tj@kernel.org>
2012-04-01 14:38:43 -07:00
Tejun Heo 627f29f481 blkcg: drop blkiocg_file_write_u64()
blkiocg_file_write_u64() has single switch case.  Drop
blkiocg_file_write_u64(), rename blkio_weight_write() to
blkcg_set_weight() and use it directly for .write_u64 callback.

Signed-off-by: Tejun Heo <tj@kernel.org>
2012-04-01 14:38:43 -07:00
Tejun Heo d3d32e69fa blkcg: restructure statistics printing
blkcg stats handling is a mess.  None of the stats has much to do with
blkcg core but they are all implemented in blkcg core.  Code sharing
is achieved by mixing common code with hard-coded cases for each stat
counter.

This patch restructures statistics printing such that

* Common logic exists as helper functions and specific print functions
  use the helpers to implement specific cases.

* Printing functions serving multiple counters don't require hardcoded
  switching on specific counters.

* Printing uses read_seq_string callback (other methods will be phased
  out).

This change enables further cleanups and relocating stats code to the
policy implementation it belongs to.

Signed-off-by: Tejun Heo <tj@kernel.org>
2012-04-01 14:38:42 -07:00
Tejun Heo edcb0722c6 blkcg: introduce blkg_stat and blkg_rwstat
blkcg uses u64_stats_sync to avoid reading wrong u64 statistic values
on 32bit archs and some stat counters have subtypes to distinguish
read/writes and sync/async IOs.  The stat code paths are confusing and
involve a lot of going back and forth between blkcg core and specific
policy implementations, and synchronization and subtype handling are
open coded in blkcg core.

This patch introduces struct blkg_stat and blkg_rwstat which, with
accompanying operations, encapsulate stat updating and accessing with
proper synchronization.

blkg_stat is simple u64 counter with 64bit read-access protection.
blkg_rwstat is the one with rw and [a]sync subcounters and takes @rw
flags to distinguish IO subtypes (%REQ_WRITE and %REQ_SYNC) and
replaces stat_sub_type indexed arrays.

All counters in blkio_group_stats and blkio_group_stats_cpu are
replaced with either blkg_stat or blkg_rwstat along with all users.

This does add one u64_stats_sync per counter and increase stats_sync
operations but they're empty/noops on 64bit archs and blkcg doesn't
have too many counters, especially with DEBUG_BLK_CGROUP off.

While the currently resulting code isn't necessarily simpler at the
moment, this will enable further clean up of blkcg stats code.

- BLKIO_STAT_{READ|WRITE|SYNC|ASYNC|TOTAL} renamed to
  BLKG_RWSTAT_{READ|WRITE|SYNC|ASYNC|TOTAL}.

- blkg_stat_add() replaces blkio_add_stat() and
  blkio_check_and_dec_stat().  Note that BUG_ON() on underflow in the
  latter function no longer exists.  It's *way* better to have
  underflowed stat counters than oopsing.

- blkio_group_stats->dequeue is now a proper u64 stat counter instead
  of ulong.

- reset_stats() updated to clear each stat counters individually and
  BLKG_STATS_DEBUG_CLEAR_{START|SIZE} are removed.

- Some functions reconstruct rw flags from direction and sync
  booleans.  This will be removed by future patches.

Signed-off-by: Tejun Heo <tj@kernel.org>
2012-04-01 14:38:42 -07:00
Tejun Heo 2aa4a1523b blkcg: BLKIO_STAT_CPU_SECTORS doesn't have subcounters
BLKIO_STAT_CPU_SECTORS doesn't need read/write/sync/async subcounters
and is counted by blkio_group_stats_cpu->sectors; however, it still
holds a member in blkio_group_stats_cpu->stat_arr_cpu.

Rearrange stat_type_cpu and define BLKIO_STAT_CPU_ARR_NR and use it
for stat_arr_cpu[] size so that only SERVICE_BYTES and SERVICED have
subcounters.

Signed-off-by: Tejun Heo <tj@kernel.org>
2012-04-01 14:38:42 -07:00
Tejun Heo aaec55a002 blkcg: remove unused @pol and @plid parameters
@pol to blkg_to_pdata() and @plid to blkg_lookup_create() are no
longer necessary.  Drop them.

Signed-off-by: Tejun Heo <tj@kernel.org>
2012-04-01 14:38:42 -07:00
Tao Ma 5bf14c0727 block: Make cfq_target_latency tunable through sysfs.
In cfq, when we calculate a time slice for a process(or a cfqq to
be precise), we have to consider the cfq_target_latency so that all the
sync request have an estimated latency(300ms) and it is controlled by
cfq_target_latency. But in some hadoop test, we have found that if
there are many processes doing sequential read(24 for example), the
throughput is bad because every process can only work for about 25ms
and the cfqq is switched. That leads to a higher disk seek. We can
achive the good throughput by setting low_latency=0, but then some
read's latency is too much for the application.

So this patch makes cfq_target_latency tunable through sysfs so that
we can tune it and find some magic number which is not bad for both
the throughput and the read latency.

Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-04-01 14:33:39 -07:00
Tejun Heo 959d851caa Merge branch 'for-3.5' of ../cgroup into block/for-3.5/core-merged
cgroup/for-3.5 contains the following changes which blk-cgroup needs
to proceed with the on-going cleanup.

* Dynamic addition and removal of cftypes to make config/stat file
  handling modular for policies.

* cgroup removal update to not wait for css references to drain to fix
  blkcg removal hang caused by cfq caching cfqgs.

Pull in cgroup/for-3.5 into block/for-3.5/core.  This causes the
following conflicts in block/blk-cgroup.c.

* 761b3ef50e "cgroup: remove cgroup_subsys argument from callbacks"
  conflicts with blkiocg_pre_destroy() addition and blkiocg_attach()
  removal.  Resolved by removing @subsys from all subsys methods.

* 676f7c8f84 "cgroup: relocate cftype and cgroup_subsys definitions in
  controllers" conflicts with ->pre_destroy() and ->attach() updates
  and removal of modular config.  Resolved by dropping forward
  declarations of the methods and applying updates to the relocated
  blkio_subsys.

* 4baf6e3325 "cgroup: convert all non-memcg controllers to the new
  cftype interface" builds upon the previous item.  Resolved by adding
  ->base_cftypes to the relocated blkio_subsys.

Signed-off-by: Tejun Heo <tj@kernel.org>
2012-04-01 12:55:00 -07:00
Tejun Heo 4baf6e3325 cgroup: convert all non-memcg controllers to the new cftype interface
Convert debug, freezer, cpuset, cpu_cgroup, cpuacct, net_prio, blkio,
net_cls and device controllers to use the new cftype based interface.
Termination entry is added to cftype arrays and populate callbacks are
replaced with cgroup_subsys->base_cftypes initializations.

This is functionally identical transformation.  There shouldn't be any
visible behavior change.

memcg is rather special and will be converted separately.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Paul Menage <paul@paulmenage.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Vivek Goyal <vgoyal@redhat.com>
2012-04-01 12:09:55 -07:00
Tejun Heo 676f7c8f84 cgroup: relocate cftype and cgroup_subsys definitions in controllers
blk-cgroup, netprio_cgroup, cls_cgroup and tcp_memcontrol
unnecessarily define cftype array and cgroup_subsys structures at the
top of the file, which is unconventional and necessiates forward
declaration of methods.

This patch relocates those below the definitions of the methods and
removes the forward declarations.  Note that forward declaration of
tcp_files[] is added in tcp_memcontrol.c for tcp_init_cgroup().  This
will be removed soon by another patch.

This patch doesn't introduce any functional change.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
2012-04-01 12:09:55 -07:00
Andi Kleen 8bcb6c7d48 block: use lockdep_assert_held for queue locking
Instead of an ugly open coded variant.

Cc: axboe@kernel.dk
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-30 12:33:28 +02:00
Dan Carpenter a5567932fc blkcg: change a spin_lock() to spin_lock_irq()
Smatch complains that we re-enable IRQs twice.  It looks like we forgot
to disable them here on the spin_trylock() failure path.  This was added
in 9f13ef678e "blkcg: use double locking instead of RCU for blkg
synchronization".

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>`
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-29 20:57:08 +02:00
Tejun Heo eb7d8c07f9 cfq: fix cfqg ref handling when BLK_CGROUP && !CFQ_GROUP_IOSCHED
When BLK_CGROUP is enabled but CFQ_GROUP_IOSCHED is, cfq ends up
calling blkg_get/put() on dummy cfqg leading to the following crash.

  BUG: unable to handle kernel NULL pointer dereference at 00000000000000b0
  IP: [<ffffffff813d44d8>] cfq_init_queue+0x258/0x430
  PGD 0
  Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
  CPU 0
  Modules linked in:

  Pid: 1, comm: swapper/0 Not tainted 3.3.0-rc6-work+ #125 Bochs Bochs
  RIP: 0010:[<ffffffff813d44d8>]  [<ffffffff813d44d8>] cfq_init_queue+0x258/0x430
  RSP: 0018:ffff88001f9dfd80  EFLAGS: 00010046
  RAX: ffff88001aefbbf0 RBX: ffff88001aeedbf0 RCX: 0000000000000100
  RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffff820ffd40
  RBP: ffff88001f9dfdd0 R08: 0000000000000000 R09: 0000000000000001
  R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000000
  R13: 0000000000000009 R14: ffff88001aefbc30 R15: 0000000000000003
  FS:  0000000000000000(0000) GS:ffff88001fc00000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
  CR2: 00000000000000b0 CR3: 000000000206f000 CR4: 00000000000006f0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
  Process swapper/0 (pid: 1, threadinfo ffff88001f9de000, task ffff88001f9dc040)
  Stack:
   ffff88001aeedbf0 ffff88001aefbdb0 ffff88001aef1548 ffff88001aefbbf0
   ffff88001f9dfdd0 ffff88001aef1548 ffffffff820d6320 ffffffff8165ce30
   ffffffff82c555e0 ffff88001aeebbf0 ffff88001f9dfe00 ffffffff813b0507
  Call Trace:
   [<ffffffff813b0507>] elevator_init+0xd7/0x140
   [<ffffffff813b83d5>] blk_init_allocated_queue+0x125/0x150
   [<ffffffff813b94d3>] blk_init_queue_node+0x43/0x80
   [<ffffffff813b9523>] blk_init_queue+0x13/0x20
   [<ffffffff821aec00>] floppy_init+0x82/0xec7
   [<ffffffff810001d2>] do_one_initcall+0x42/0x170
   [<ffffffff821835fc>] kernel_init+0xcb/0x14f
   [<ffffffff81b40b24>] kernel_thread_helper+0x4/0x10
  Code: 00 e8 1d 9e 76 00 48 8b 43 48 48 85 c0 48 89 83 28 03 00 00 74 07 4c 8b a0 10 ff ff ff 8b 15 b0 2e d0 00 85 d2 0f 85 49 01 00 00 <41> 8b 84 24 b0 00 00 00 85 c0 0f 8e 8c 01 00 00 83 e8 01 85 c0
  RIP  [<ffffffff813d44d8>] cfq_init_queue+0x258/0x430

Because cfq's blkcg support has a on/off switch, CFQ_GROUP_IOSCHED,
separate from BLK_CGROUP, blkg access through cfqg needs to be
conditioned on it.

* Make blkg_to_cfqg() and cfqg_to_blkg() conditioned on
  CFQ_GROUP_IOSCHED.  If disabled, they always return %NULL.

* Introduce cfqg_get() and cfqg_put() conditioned on
  CFQ_GROUP_IOSCHED.  If disabled, they are noops.

Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-23 14:02:53 +01:00
Dan Carpenter 00380a404f block: blk_alloc_queue_node(): use caller's GFP flags instead of GFP_KERNEL
We should use the GFP flags that the caller specified instead of picking
our own.  All the callers specify GFP_KERNEL so this doesn't make a
difference to how the kernel runs, it's just a cleanup.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-23 09:58:54 +01:00
Linus Torvalds 0d9cabdcce Merge branch 'for-3.4' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup changes from Tejun Heo:
 "Out of the 8 commits, one fixes a long-standing locking issue around
  tasklist walking and others are cleanups."

* 'for-3.4' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup: Walk task list under tasklist_lock in cgroup_enable_task_cg_list
  cgroup: Remove wrong comment on cgroup_enable_task_cg_list()
  cgroup: remove cgroup_subsys argument from callbacks
  cgroup: remove extra calls to find_existing_css_set
  cgroup: replace tasklist_lock with rcu_read_lock
  cgroup: simplify double-check locking in cgroup_attach_proc
  cgroup: move struct cgroup_pidlist out from the header file
  cgroup: remove cgroup_attach_task_current_cg()
2012-03-20 18:11:21 -07:00
Linus Torvalds 2ba68940c8 Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler changes for v3.4 from Ingo Molnar

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (27 commits)
  printk: Make it compile with !CONFIG_PRINTK
  sched/x86: Fix overflow in cyc2ns_offset
  sched: Fix nohz load accounting -- again!
  sched: Update yield() docs
  printk/sched: Introduce special printk_sched() for those awkward moments
  sched/nohz: Correctly initialize 'next_balance' in 'nohz' idle balancer
  sched: Cleanup cpu_active madness
  sched: Fix load-balance wreckage
  sched: Clean up parameter passing of proc_sched_autogroup_set_nice()
  sched: Ditch per cgroup task lists for load-balancing
  sched: Rename load-balancing fields
  sched: Move load-balancing arguments into helper struct
  sched/rt: Do not submit new work when PI-blocked
  sched/rt: Prevent idle task boosting
  sched/wait: Add __wake_up_all_locked() API
  sched/rt: Document scheduler related skip-resched-check sites
  sched/rt: Use schedule_preempt_disabled()
  sched/rt: Add schedule_preempt_disabled()
  sched/rt: Do not throttle when PI boosting
  sched/rt: Keep period timer ticking when rt throttling is active
  ...
2012-03-20 10:31:44 -07:00
Tejun Heo 2b566fa55b block: remove ioc_*_changed()
After the previous patch to cfq, there's no ioc_get_changed() user
left.  This patch yanks out ioc_{ioprio|cgroup|get}_changed() and all
related stuff.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-20 12:47:48 +01:00
Tejun Heo 598971bfbd cfq: don't use icq_get_changed()
cfq caches the associated cfqq's for a given cic.  The cache needs to
be flushed if the cic's ioprio or blkcg has changed.  It is currently
done by requiring the changing action to set the respective
ICQ_*_CHANGED bit in the icq and testing it from cfq_set_request(),
which involves iterating through all the affected icqs.

All cfq wants to know is whether ioprio and/or blkcg have changed
since the last flush and can be easily achieved by just remembering
the current ioprio and blkcg ID in cic.

This patch adds cic->{ioprio|blkcg_id}, updates all ioprio users to
use the remembered value instead, and updates cfq_set_request() path
such that, instead of using icq_get_changed(), the current values are
compared against the remembered ones and trigger appropriate flush
action if not.  Condition tests are moved inside both _changed
functions which are now named check_ioprio_changed() and
check_blkcg_changed().

ioprio.h::task_ioprio*() can't be used anymore and replaced with
open-coded IOPRIO_CLASS_NONE case in cfq_async_queue_prio().

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-20 12:47:47 +01:00
Tejun Heo abede6da27 cfq: pass around cfq_io_cq instead of io_context
Now that io_cq is managed by block core and guaranteed to exist for
any in-flight request, it is easier and carries more information to
pass around cfq_io_cq than io_context.

This patch updates cfq_init_prio_data(), cfq_find_alloc_queue() and
cfq_get_queue() to take @cic instead of @ioc.  This change removes a
duplicate cfq_cic_lookup() from cfq_find_alloc_queue().

This change enables the use of cic-cached ioprio in the next patch.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-20 12:47:47 +01:00
Tejun Heo 9a9e8a26da blkcg: add blkcg->id
Add 64bit unique id to blkcg.  This will be used by policies which
want blkcg identity test to tell whether the associated blkcg has
changed.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-20 12:47:47 +01:00
Tejun Heo edf1b879e3 blkcg: remove blkio_group->stats_lock
With recent plug merge updates, all non-percpu stat updates happen
under queue_lock making stats_lock unnecessary to synchronize stat
updates.  The only synchronization necessary is stat reading, which
can be done using u64_stats_sync instead.

This patch removes blkio_group->stats_lock and adds
blkio_group_stats->syncp for reader synchronization.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-20 12:45:37 +01:00
Tejun Heo c4c76a0538 blkcg: restructure blkio_get_stat()
Restructure blkio_get_stat() to prepare for removal of stats_lock.

* Define BLKIO_STAT_ARR_NR explicitly to denote which stats have
  subtypes instead of using BLKIO_STAT_QUEUED.

* Separate out stat acquisition and printing.  After this, there are
  only two users of blkio_fill_stat().  Just open code it.

* The code was mixing MAX_KEY_LEN and MAX_KEY_LEN - 1.  There's no
  need to subtract one.  Use MAX_KEY_LEN consistently.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-20 12:45:37 +01:00
Tejun Heo 997a026c80 blkcg: simplify stat reset
blkiocg_reset_stats() implements stat reset for blkio.reset_stats
cgroupfs file.  This feature is very unconventional and something
which shouldn't have been merged.  It's only useful when there's only
one user or tool looking at the stats.  As soon as multiple users
and/or tools are involved, it becomes useless as resetting disrupts
other usages.  There are very good reasons why all other stats expect
readers to read values at the start and end of a period and subtract
to determine delta over the period.

The implementation is rather complex - some fields shouldn't be
cleared and it saves some fields, resets whole and restores for some
reason.  Reset of percpu stats is also racy.  The comment points to
64bit store atomicity for the reason but even without that stores for
zero can simply race with other CPUs doing RMW and get clobbered.

Simplify reset by

* Clear selectively instead of resetting and restoring.

* Grouping debug stat fields to be reset and using memset() over them.

* Not caring about stats_lock.

* Using memset() to reset percpu stats.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-20 12:45:37 +01:00
Tejun Heo 5fe224d2d5 blkcg: don't use percpu for merged stats
With recent plug merge updates, merged stats are no longer called for
plug merges and now only updated while holding queue_lock.  As
stats_lock is scheduled to be removed, there's no reason to use percpu
for merged stats.  Don't use percpu for merged stats.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-20 12:45:37 +01:00
Vivek Goyal 1cd9e039fc blkcg: alloc per cpu stats from worker thread in a delayed manner
Current per cpu stat allocation assumes GFP_KERNEL allocation flag. But in
IO path there are times when we want GFP_NOIO semantics. As there is no
way to pass the allocation flags to alloc_percpu(), this patch delays the
allocation of stats using a worker thread.

v2-> tejun suggested following changes. Changed the patch accordingly.
	- move alloc_node location in structure
	- reduce the size of names of some of the fields
	- Reduce the scope of locking of alloc_list_lock
	- Simplified stat_alloc_fn() by allocating stats for all
	  policies in one go and then assigning these to a group.

v3 -> Andrew suggested to put some comments in the code. Also raised
      concerns about trying to allocate infinitely in case of allocation
      failure. I have changed the logic to sleep for 10ms before retrying.
      That should take care of non-preemptible UP kernels.

v4 -> Tejun had more suggestions.
	- drop list_for_each_entry_all()
	- instead of msleep() use queue_delayed_work()
	- Some cleanups realted to more compact coding.

v5-> tejun suggested more cleanups leading to more compact code.

tj: - Relocated pcpu_stats into blkio_stat_alloc_fn().
    - Minor comment update.
    - This also fixes suspicious RCU usage warning caused by invoking
      cgroup_path() from blkg_alloc() without holding RCU read lock.
      Now that blkg_alloc() doesn't require sleepable context, RCU
      read lock from blkg_lookup_create() is maintained throughout
      blkg_alloc().

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-20 12:45:37 +01:00
Linus Torvalds f1cbd03f5e Merge branch 'for-linus' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
 "Been sitting on this for a while, but lets get this out the door.
  This fixes various important bugs for 3.3 final, along with a few more
  trivial ones.  Please pull!"

* 'for-linus' of git://git.kernel.dk/linux-block:
  block: fix ioc leak in put_io_context
  block, sx8: fix pointer math issue getting fw version
  Block: use a freezable workqueue for disk-event polling
  drivers/block/DAC960: fix -Wuninitialized warning
  drivers/block/DAC960: fix DAC960_V2_IOCTL_Opcode_T -Wenum-compare warning
  block: fix __blkdev_get and add_disk race condition
  block: Fix setting bio flags in drivers (sd_dif/floppy)
  block: Fix NULL pointer dereference in sd_revalidate_disk
  block: exit_io_context() should call elevator_exit_icq_fn()
  block: simplify ioc_release_fn()
  block: replace icq->changed with icq->flags
2012-03-14 17:16:45 -07:00
Xiaotian Feng ff8c1474cc block: fix ioc leak in put_io_context
When put_io_context is called, if ioc->icq_list is empty and refcount
is 1, kernel will not free the ioc.

This is caught by following kmemleak:

unreferenced object 0xffff880036349fe0 (size 216):
  comm "sh", pid 2137, jiffies 4294931140 (age 290579.412s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    01 00 01 00 ad 4e ad de ff ff ff ff 00 00 00 00  .....N..........
  backtrace:
    [<ffffffff8169f926>] kmemleak_alloc+0x26/0x50
    [<ffffffff81195a9c>] kmem_cache_alloc_node+0x1cc/0x2a0
    [<ffffffff81356b67>] create_io_context_slowpath+0x27/0x130
    [<ffffffff81356d2b>] get_task_io_context+0xbb/0xf0
    [<ffffffff81055f0e>] copy_process+0x188e/0x18b0
    [<ffffffff8105609b>] do_fork+0x11b/0x420
    [<ffffffff810247f8>] sys_clone+0x28/0x30
    [<ffffffff816d3373>] stub_clone+0x13/0x20
    [<ffffffffffffffff>] 0xffffffffffffffff

ioc should be freed if ioc->icq_list is empty.
Signed-off-by: Xiaotian Feng <dannyfeng@tencent.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-14 15:34:48 +01:00
Tejun Heo 671058fb2a block: make blk-throttle preserve the issuing task on delayed bios
Make blk-throttle call bio_associate_current() on bios being delayed
such that they get issued to block layer with the original io_context.
This allows stacking blk-throttle and cfq-iosched propio policies.
bios will always be issued with the correct ioc and blkcg whether it
gets delayed by blk-throttle or not.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-06 21:27:24 +01:00
Tejun Heo 4f85cb96d9 block: make block cgroup policies follow bio task association
Implement bio_blkio_cgroup() which returns the blkcg associated with
the bio if exists or %current's blkcg, and use it in blk-throttle and
cfq-iosched propio.  This makes both cgroup policies honor task
association for the bio instead of always assuming %current.

As nobody is using bio_set_task() yet, this doesn't introduce any
behavior change.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-06 21:27:24 +01:00
Tejun Heo 852c788f83 block: implement bio_associate_current()
IO scheduling and cgroup are tied to the issuing task via io_context
and cgroup of %current.  Unfortunately, there are cases where IOs need
to be routed via a different task which makes scheduling and cgroup
limit enforcement applied completely incorrectly.

For example, all bios delayed by blk-throttle end up being issued by a
delayed work item and get assigned the io_context of the worker task
which happens to serve the work item and dumped to the default block
cgroup.  This is double confusing as bios which aren't delayed end up
in the correct cgroup and makes using blk-throttle and cfq propio
together impossible.

Any code which punts IO issuing to another task is affected which is
getting more and more common (e.g. btrfs).  As both io_context and
cgroup are firmly tied to task including userland visible APIs to
manipulate them, it makes a lot of sense to match up tasks to bios.

This patch implements bio_associate_current() which associates the
specified bio with %current.  The bio will record the associated ioc
and blkcg at that point and block layer will use the recorded ones
regardless of which task actually ends up issuing the bio.  bio
release puts the associated ioc and blkcg.

It grabs and remembers ioc and blkcg instead of the task itself
because task may already be dead by the time the bio is issued making
ioc and blkcg inaccessible and those are all block layer cares about.

elevator_set_req_fn() is updated such that the bio elvdata is being
allocated for is available to the elevator.

This doesn't update block cgroup policies yet.  Further patches will
implement the support.

-v2: #ifdef CONFIG_BLK_CGROUP added around bio->bi_ioc dereference in
     rq_ioc() to fix build breakage.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Kent Overstreet <koverstreet@google.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-06 21:27:24 +01:00
Tejun Heo f6e8d01bee block: add io_context->active_ref
Currently ioc->nr_tasks is used to decide two things - whether an ioc
is done issuing IOs and whether it's shared by multiple tasks.  This
patch separate out the first into ioc->active_ref, which is acquired
and released using {get|put}_io_context_active() respectively.

This will be used to associate bio's with a given task.  This patch
doesn't introduce any visible behavior change.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-06 21:27:24 +01:00
Tejun Heo 24acfc34fb block: interface update for ioc/icq creation functions
Make the following interface updates to prepare for future ioc related
changes.

* create_io_context() returning ioc only works for %current because it
  doesn't increment ref on the ioc.  Drop @task parameter from it and
  always assume %current.

* Make create_io_context_slowpath() return 0 or -errno and rename it
  to create_task_io_context().

* Make ioc_create_icq() take @ioc as parameter instead of assuming
  that of %current.  The caller, get_request(), is updated to create
  ioc explicitly and then pass it into ioc_create_icq().

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-06 21:27:24 +01:00
Tejun Heo b679281a64 block: restructure get_request()
get_request() is structured a bit unusually in that failure path is
inlined in the usual flow with goto labels atop and inside it.
Relocate the error path to the end of the function.

This is to prepare for icq handling changes in get_request() and
doesn't introduce any behavior change.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-06 21:27:24 +01:00
Tejun Heo c875f4d025 blkcg: drop unnecessary RCU locking
Now that blkg additions / removals are always done under both q and
blkcg locks, the only places RCU locking is necessary are
blkg_lookup[_create]() for lookup w/o blkcg lock.  This patch drops
unncessary RCU locking replacing it with plain blkcg locking as
necessary.

* blkiocg_pre_destroy() already perform proper locking and don't need
  RCU.  Dropped.

* blkio_read_blkg_stats() now uses blkcg->lock instead of RCU read
  lock.  This isn't a hot path.

* Now unnecessary synchronize_rcu() from queue exit paths removed.
  This makes q->nr_blkgs unnecessary.  Dropped.

* RCU annotation on blkg->q removed.

-v2: Vivek pointed out that blkg_lookup_create() still needs to be
     called under rcu_read_lock().  Updated.

-v3: After the update, stats_lock locking in blkio_read_blkg_stats()
     shouldn't be using _irq variant as it otherwise ends up enabling
     irq while blkcg->lock is locked.  Fixed.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-06 21:27:24 +01:00
Tejun Heo 9f13ef678e blkcg: use double locking instead of RCU for blkg synchronization
blkgs are chained from both blkcgs and request_queues and thus
subjected to two locks - blkcg->lock and q->queue_lock.  As both blkcg
and q can go away anytime, locking during removal is tricky.  It's
currently solved by wrapping removal inside RCU, which makes the
synchronization complex.  There are three locks to worry about - the
outer RCU, q lock and blkcg lock, and it leads to nasty subtle
complications like conditional synchronize_rcu() on queue exit paths.

For all other paths, blkcg lock is naturally nested inside q lock and
the only exception is blkcg removal path, which is a very cold path
and can be implemented as clumsy but conceptually-simple reverse
double lock dancing.

This patch updates blkg removal path such that blkgs are removed while
holding both q and blkcg locks, which is trivial for request queue
exit path - blkg_destroy_all().  The blkcg removal path,
blkiocg_pre_destroy(), implements reverse double lock dancing
essentially identical to ioc_release_fn().

This simplifies blkg locking - no half-dead blkgs to worry about.  Now
unnecessary RCU annotations will be removed by the next patch.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-06 21:27:24 +01:00
Tejun Heo e8989fae38 blkcg: unify blkg's for blkcg policies
Currently, blkg is per cgroup-queue-policy combination.  This is
unnatural and leads to various convolutions in partially used
duplicate fields in blkg, config / stat access, and general management
of blkgs.

This patch make blkg's per cgroup-queue and let them serve all
policies.  blkgs are now created and destroyed by blkcg core proper.
This will allow further consolidation of common management logic into
blkcg core and API with better defined semantics and layering.

As a transitional step to untangle blkg management, elvswitch and
policy [de]registration, all blkgs except the root blkg are being shot
down during elvswitch and bypass.  This patch adds blkg_root_update()
to update root blkg in place on policy change.  This is hacky and racy
but should be good enough as interim step until we get locking
simplified and switch over to proper in-place update for all blkgs.

-v2: Root blkgs need to be updated on elvswitch too and blkg_alloc()
     comment wasn't updated according to the function change.  Fixed.
     Both pointed out by Vivek.

-v3: v2 updated blkg_destroy_all() to invoke update_root_blkg_pd() for
     all policies.  This freed root pd during elvswitch before the
     last queue finished exiting and led to oops.  Directly invoke
     update_root_blkg_pd() only on BLKIO_POLICY_PROP from
     cfq_exit_queue().  This also is closer to what will be done with
     proper in-place blkg update.  Reported by Vivek.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-06 21:27:23 +01:00
Tejun Heo 03aa264ac1 blkcg: let blkcg core manage per-queue blkg list and counter
With the previous patch to move blkg list heads and counters to
request_queue and blkg, logic to manage them in both policies are
almost identical and can be moved to blkcg core.

This patch moves blkg link logic into blkg_lookup_create(), implements
common blkg unlink code in blkg_destroy(), and updates
blkg_destory_all() so that it's policy specific and can skip root
group.  The updated blkg_destroy_all() is now used to both clear queue
for bypassing and elv switching, and release all blkgs on q exit.

This patch introduces a race window where policy [de]registration may
race against queue blkg clearing.  This can only be a problem on cfq
unload and shouldn't be a real problem in practice (and we have many
other places where this race already exists).  Future patches will
remove these unlikely races.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-06 21:27:23 +01:00
Tejun Heo 4eef304998 blkcg: move per-queue blkg list heads and counters to queue and blkg
Currently, specific policy implementations are responsible for
maintaining list and number of blkgs.  This duplicates code
unnecessarily, and hinders factoring common code and providing blkcg
API with better defined semantics.

After this patch, request_queue hosts list heads and counters and blkg
has list nodes for both policies.  This patch only relocates the
necessary fields and the next patch will actually move management code
into blkcg core.

Note that request_queue->blkg_list[] and ->nr_blkgs[] are hardcoded to
have 2 elements.  This is to avoid include dependency and will be
removed by the next patch.

This patch doesn't introduce any behavior change.

-v2: Now unnecessary conditional on CONFIG_BLK_CGROUP_MODULE removed
     as pointed out by Vivek.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-06 21:27:23 +01:00
Tejun Heo c1768268f9 blkcg: don't use blkg->plid in stat related functions
blkg is scheduled to be unified for all policies and thus there won't
be one-to-one mapping from blkg to policy.  Update stat related
functions to take explicit @pol or @plid arguments and not use
blkg->plid.

This is painful for now but most of specific stat interface functions
will be replaced with a handful of generic helpers.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-06 21:27:23 +01:00
Tejun Heo 549d3aa872 blkcg: make blkg->pd an array and move configuration and stats into it
To prepare for unifying blkgs for different policies, make blkg->pd an
array with BLKIO_NR_POLICIES elements and move blkg->conf, ->stats,
and ->stats_cpu into blkg_policy_data.

This patch doesn't introduce any functional difference.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-06 21:27:23 +01:00
Tejun Heo 1adaf3dde3 blkcg: move refcnt to blkcg core
Currently, blkcg policy implementations manage blkg refcnt duplicating
mostly identical code in both policies.  This patch moves refcnt to
blkg and let blkcg core handle refcnt and freeing of blkgs.

* cfq blkgs now also get freed via RCU.

* cfq blkgs lose RB_EMPTY_ROOT() sanity check on blkg free.  If
  necessary, we can add blkio_exit_group_fn() to resurrect this.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-06 21:27:23 +01:00
Tejun Heo 0381411e4b blkcg: let blkcg core handle policy private data allocation
Currently, blkg's are embedded in private data blkcg policy private
data structure and thus allocated and freed by policies.  This leads
to duplicate codes in policies, hinders implementing common part in
blkcg core with strong semantics, and forces duplicate blkg's for the
same cgroup-q association.

This patch introduces struct blkg_policy_data which is a separate data
structure chained from blkg.  Policies specifies the amount of private
data it needs in its blkio_policy_type->pdata_size and blkcg core
takes care of allocating them along with blkg which can be accessed
using blkg_to_pdata().  blkg can be determined from pdata using
pdata_to_blkg().  blkio_alloc_group_fn() method is accordingly updated
to blkio_init_group_fn().

For consistency, tg_of_blkg() and cfqg_of_blkg() are replaced with
blkg_to_tg() and blkg_to_cfqg() respectively, and functions to map in
the reverse direction are added.

Except that policy specific data now lives in a separate data
structure from blkg, this patch doesn't introduce any functional
difference.

This will be used to unify blkg's for different policies.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-06 21:27:23 +01:00
Tejun Heo 923adde1be blkcg: clear all request_queues on blkcg policy [un]registrations
Keep track of all request_queues which have blkcg initialized and turn
on bypass and invoke blkcg_clear_queue() on all before making changes
to blkcg policies.

This is to prepare for moving blkg management into blkcg core.  Note
that this uses more brute force than necessary.  Finer grained shoot
down will be implemented later and given that policy [un]registration
almost never happens on running systems (blk-throtl can't be built as
a module and cfq usually is the builtin default iosched), this
shouldn't be a problem for the time being.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-06 21:27:23 +01:00
Tejun Heo 5efd611351 blkcg: add blkcg_{init|drain|exit}_queue()
Currently block core calls directly into blk-throttle for init, drain
and exit.  This patch adds blkcg_{init|drain|exit}_queue() which wraps
the blk-throttle functions.  This is to give more control and
visiblity to blkcg core layer for proper layering.  Further patches
will add logic common to blkcg policies to the functions.

While at it, collapse blk_throtl_release() into blk_throtl_exit().
There's no reason to keep them separate.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-06 21:27:23 +01:00
Tejun Heo 7ee9c56205 blkcg: let blkio_group point to blkio_cgroup directly
Currently, blkg points to the associated blkcg via its css_id.  This
unnecessarily complicates dereferencing blkcg.  Let blkg hold a
reference to the associated blkcg and point directly to it and disable
css_id on blkio_subsys.

This change requires splitting blkiocg_destroy() into
blkiocg_pre_destroy() and blkiocg_destroy() so that all blkg's can be
destroyed and all the blkcg references held by them dropped during
cgroup removal.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-06 21:27:23 +01:00
Vivek Goyal 92616b5b3a blkcg: skip blkg printing if q isn't associated with disk
blk-cgroup printing code currently assumes that there is a device/disk
associated with every queue in the system, but modules like floppy,
can instantiate request queues without registering disk which can lead
to oops.

Skip the queue/blkg which don't have dev/disk associated with them.

-tj: Factored out backing_dev_info check into blkg_dev_name().

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-06 21:27:22 +01:00
Tejun Heo 7a4dd281ec blkcg: kill the mind-bending blkg->dev
blkg->dev is dev_t recording the device number of the block device for
the associated request_queue.  It is used to identify the associated
block device when printing out configuration or stats.

This is redundant to begin with.  A blkg is an association between a
cgroup and a request_queue and it of course is possible to reach
request_queue from blkg and synchronization conventions are in place
for safe q dereferencing, so this shouldn't be necessary from the
beginning.  Furthermore, it's initialized by sscanf()ing the device
name of backing_dev_info.  The mind boggles.

Anyways, if blkg is visible under rcu lock, we *know* that the
associated request_queue hasn't gone away yet and its bdi is
registered and alive - blkg can't be created for request_queue which
hasn't been fully initialized and it can't go away before blkg is
removed.

Let stat and conf read functions get device name from
blkg->q->backing_dev_info.dev and pass it down to printing functions
and remove blkg->dev.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-06 21:27:22 +01:00
Tejun Heo 4bfd482e73 blkcg: kill blkio_policy_node
Now that blkcg configuration lives in blkg's, blkio_policy_node is no
longer necessary.  Kill it.

blkio_policy_parse_and_set() now fails if invoked for missing device
and functions to print out configurations are updated to print from
blkg's.

cftype_blkg_same_policy() is dropped along with other policy functions
for consistency.  Its one line is open coded in the only user -
blkio_read_blkg_stats().

-v2: Update to reflect the retry-on-bypass logic change of the
     previous patch.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-06 21:27:22 +01:00
Tejun Heo e56da7e287 blkcg: don't allow or retain configuration of missing devices
blkcg is very peculiar in that it allows setting and remembering
configurations for non-existent devices by maintaining separate data
structures for configuration.

This behavior is completely out of the usual norms and outright
confusing; furthermore, it uses dev_t number to match the
configuration to devices, which is unpredictable to begin with and
becomes completely unuseable if EXT_DEVT is fully used.

It is wholely unnecessary - we already have fully functional userland
mechanism to program devices being hotplugged which has full access to
device identification, connection topology and filesystem information.

Add a new struct blkio_group_conf which contains all blkcg
configurations to blkio_group and let blkio_group, which can be
created iff the associated device exists and is removed when the
associated device goes away, carry all configurations.

Note that, after this patch, all newly created blkg's will always have
the default configuration (unlimited for throttling and blkcg's weight
for propio).

This patch makes blkio_policy_node meaningless but doesn't remove it.
The next patch will.

-v2: Updated to retry after short sleep if blkg lookup/creation failed
     due to the queue being temporarily bypassed as indicated by
     -EBUSY return.  Pointed out by Vivek.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-06 21:27:22 +01:00
Tejun Heo cd1604fab4 blkcg: factor out blkio_group creation
Currently both blk-throttle and cfq-iosched implement their own
blkio_group creation code in throtl_get_tg() and cfq_get_cfqg().  This
patch factors out the common code into blkg_lookup_create(), which
returns ERR_PTR value so that transitional failures due to queue
bypass can be distinguished from other failures.

* New plkio_policy_ops methods blkio_alloc_group_fn() and
  blkio_link_group_fn added.  Both are transitional and will be
  removed once the blkg management code is fully moved into
  blk-cgroup.c.

* blkio_alloc_group_fn() allocates policy-specific blkg which is
  usually a larger data structure with blkg as the first entry and
  intiailizes it.  Note that initialization of blkg proper, including
  percpu stats, is responsibility of blk-cgroup proper.

  Note that default config (weight, bps...) initialization is done
  from this method; otherwise, we end up violating locking order
  between blkcg and q locks via blkcg_get_CONF() functions.

* blkio_link_group_fn() is called under queue_lock and responsible for
  linking the blkg to the queue.  blkcg side is handled by blk-cgroup
  proper.

* The common blkg creation function is named blkg_lookup_create() and
  blkiocg_lookup_group() is renamed to blkg_lookup() for consistency.
  Also, throtl / cfq related functions are similarly [re]named for
  consistency.

This simplifies blkcg policy implementations and enables further
cleanup.

-v2: Vivek noticed that blkg_lookup_create() incorrectly tested
     blk_queue_dead() instead of blk_queue_bypass() leading a user of
     the function ending up creating a new blkg on bypassing queue.
     This is a bug introduced while relocating bypass patches before
     this one.  Fixed.

-v3: ERR_PTR patch folded into this one.  @for_root added to
     blkg_lookup_create() to allow creating root group on a bypassed
     queue during elevator switch.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-06 21:27:22 +01:00
Tejun Heo f51b802c17 blkcg: use the usual get blkg path for root blkio_group
For root blkg, blk_throtl_init() was using throtl_alloc_tg()
explicitly and cfq_init_queue() was manually initializing embedded
cfqd->root_group, adding unnecessarily different code paths to blkg
handling.

Make both use the usual blkio_group get functions - throtl_get_tg()
and cfq_get_cfqg() - for the root blkio_group too.  Note that
blk_throtl_init() callsite is pushed downwards in
blk_alloc_queue_node() so that @q is sufficiently initialized for
throtl_get_tg().

This simplifies root blkg handling noticeably for cfq and will allow
further modularization of blkcg API.

-v2: Vivek pointed out that using cfq_get_cfqg() won't work if
     CONFIG_CFQ_GROUP_IOSCHED is disabled.  Fix it by factoring out
     initialization of base part of cfqg into cfq_init_cfqg_base() and
     alloc/init/free explicitly if !CONFIG_CFQ_GROUP_IOSCHED.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-06 21:27:22 +01:00
Tejun Heo 035d10b2fa blkcg: add blkio_policy[] array and allow one policy per policy ID
Block cgroup policies are maintained in a linked list and,
theoretically, multiple policies sharing the same policy ID are
allowed.

This patch temporarily restricts one policy per plid and adds
blkio_policy[] array which indexes registered policy types by plid.
Both the restriction and blkio_policy[] array are transitional and
will be removed once API cleanup is complete.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-06 21:27:22 +01:00
Tejun Heo ca32aefc7f blkcg: use q and plid instead of opaque void * for blkio_group association
blkgio_group is association between a block cgroup and a queue for a
given policy.  Using opaque void * for association makes things
confusing and hinders factoring of common code.  Use request_queue *
and, if necessary, policy id instead.

This will help block cgroup API cleanup.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-06 21:27:22 +01:00
Tejun Heo 0a5a7d0e32 blkcg: update blkg get functions take blkio_cgroup as parameter
In both blkg get functions - throtl_get_tg() and cfq_get_cfqg(),
instead of obtaining blkcg of %current explicitly, let the caller
specify the blkcg to use as parameter and make both functions hold on
to the blkcg.

This is part of block cgroup interface cleanup and will help making
blkcg API more modular.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-06 21:27:22 +01:00
Tejun Heo 2a7f124414 blkcg: move rcu_read_lock() outside of blkio_group get functions
rcu_read_lock() in throtl_get_tb() and cfq_get_cfqg() holds onto
@blkcg while looking up blkg.  For API cleanup, the next patch will
make the caller responsible for determining @blkcg to look blkg from
and let them specify it as a parameter.  Move rcu read locking out to
the callers to prepare for the change.

-v2: Originally this patch was described as a fix for RCU read locking
     bug around @blkg, which Vivek pointed out to be incorrect.  It
     was from misunderstanding the role of rcu locking as protecting
     @blkg not @blkcg.  Patch description updated.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-06 21:27:22 +01:00
Tejun Heo 72e06c2551 blkcg: shoot down blkio_groups on elevator switch
Elevator switch may involve changes to blkcg policies.  Implement
shoot down of blkio_groups.

Combined with the previous bypass updates, the end goal is updating
blkcg core such that it can ensure that blkcg's being affected become
quiescent and don't have any per-blkg data hanging around before
commencing any policy updates.  Until queues are made aware of the
policies that applies to them, as an interim step, all per-policy blkg
data will be shot down.

* blk-throtl doesn't need this change as it can't be disabled for a
  live queue; however, update it anyway as the scheduled blkg
  unification requires this behavior change.  This means that
  blk-throtl configuration will be unnecessarily lost over elevator
  switch.  This oddity will be removed after blkcg learns to associate
  individual policies with request_queues.

* blk-throtl dosen't shoot down root_tg.  This is to ease transition.
  Unified blkg will always have persistent root group and not shooting
  down root_tg for now eases transition to that point by avoiding
  having to update td->root_tg and is safe as blk-throtl can never be
  disabled

-v2: Vivek pointed out that group list is not guaranteed to be empty
     on return from clear function if it raced cgroup removal and
     lost.  Fix it by waiting a bit and retrying.  This kludge will
     soon be removed once locking is updated such that blkg is never
     in limbo state between blkcg and request_queue locks.

     blk-throtl no longer shoots down root_tg to avoid breaking
     td->root_tg.

     Also, Nest queue_lock inside blkio_list_lock not the other way
     around to avoid introduce possible deadlock via blkcg lock.

-v3: blkcg_clear_queue() repositioned and renamed to
     blkg_destroy_all() to increase consistency with later changes.
     cfq_clear_queue() updated to check q->elevator before
     dereferencing it to avoid NULL dereference on not fully
     initialized queues (used by later change).

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-06 21:27:22 +01:00
Tejun Heo 6ecf23afab block: extend queue bypassing to cover blkcg policies
Extend queue bypassing such that dying queue is always bypassing and
blk-throttle is drained on bypass.  With blkcg policies updated to
test blk_queue_bypass() instead of blk_queue_dead(), this ensures that
no bio or request is held by or going through blkcg policies on a
bypassing queue.

This will be used to implement blkg cleanup on elevator switches and
policy changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-06 21:27:22 +01:00
Tejun Heo d732580b4e block: implement blk_queue_bypass_start/end()
Rename and extend elv_queisce_start/end() to
blk_queue_bypass_start/end() which are exported and supports nesting
via @q->bypass_depth.  Also add blk_queue_bypass() to test bypass
state.

This will be further extended and used for blkio_group management.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-06 21:27:21 +01:00
Tejun Heo b2fab5acd2 elevator: make elevator_init_fn() return 0/-errno
elevator_ops->elevator_init_fn() has a weird return value.  It returns
a void * which the caller should assign to q->elevator->elevator_data
and %NULL return denotes init failure.

Update such that it returns integer 0/-errno and sets elevator_data
directly as necessary.

This makes the interface more conventional and eases further cleanup.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-06 21:27:21 +01:00
Tejun Heo 5a5bafdc39 elevator: clear auxiliary data earlier during elevator switch
Elevator switch tries hard to keep as much as context until new
elevator is ready so that it can revert to the original state if
initializing the new elevator fails for some reason.  Unfortunately,
with more auxiliary contexts to manage, this makes elevator init and
exit paths too complex and fragile.

This patch makes elevator_switch() unregister the current elevator and
flush icq's before start initializing the new one.  As we still keep
the old elevator itself, the only difference is that we lose icq's on
rare occassions of switching failure, which isn't critical at all.

Note that this makes explicit elevator parameter to
elevator_init_queue() and __elv_register_queue() unnecessary as they
always can use the current elevator.

This patch enables block cgroup cleanups.

-v2: blk_add_trace_msg() prints elevator name from @new_e instead of
     @e->type as the local variable no longer exists.  This caused
     build failure on CONFIG_BLK_DEV_IO_TRACE.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-06 21:27:21 +01:00
Tejun Heo b95ada558c cfq: don't register propio policy if !CONFIG_CFQ_GROUP_IOSCHED
cfq has been registering zeroed blkio_poilcy_cfq if CFQ_GROUP_IOSCHED
is disabled.  This fortunately doesn't collide with blk-throtl as
BLKIO_POLICY_PROP is zero but is unnecessary and risky.  Just don't
register it if not enabled.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-06 21:27:21 +01:00
Tejun Heo 32e380aedc blkcg: make CONFIG_BLK_CGROUP bool
Block cgroup core can be built as module; however, it isn't too useful
as blk-throttle can only be built-in and cfq-iosched is usually the
default built-in scheduler.  Scheduled blkcg cleanup requires calling
into blkcg from block core.  To simplify that, disallow building blkcg
as module by making CONFIG_BLK_CGROUP bool.

If building blkcg core as module really matters, which I doubt, we can
revisit it after blkcg API cleanup.

-v2: Vivek pointed out that IOSCHED_CFQ was incorrectly updated to
     depend on BLK_CGROUP.  Fixed.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-06 21:27:21 +01:00
Tejun Heo b855b04a0b block: blk-throttle should be drained regardless of q->elevator
Currently, blk_cleanup_queue() doesn't call elv_drain_elevator() if
q->elevator doesn't exist; however, bio based drivers don't have
elevator initialized but can still use blk-throttle.  This patch moves
q->elevator test inside blk_drain_queue() such that only
elv_drain_elevator() is skipped if !q->elevator.

-v2: loop can have registered queue which has NULL request_fn.  Make
     sure we don't call into __blk_run_queue() in such cases.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Vivek Goyal <vgoyal@redhat.com>

Fold in bug fix from Vivek.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-06 21:24:55 +01:00
Alan Stern 62d3c5439c Block: use a freezable workqueue for disk-event polling
This patch (as1519) fixes a bug in the block layer's disk-events
polling.  The polling is done by a work routine queued on the
system_nrt_wq workqueue.  Since that workqueue isn't freezable, the
polling continues even in the middle of a system sleep transition.

Obviously, polling a suspended drive for media changes and such isn't
a good thing to do; in the case of USB mass-storage devices it can
lead to real problems requiring device resets and even re-enumeration.

The patch fixes things by creating a new system-wide, non-reentrant,
freezable workqueue and using it for disk-events polling.

Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
CC: <stable@kernel.org>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-02 10:51:00 +01:00
Stanislaw Gruszka 9f53d2fe81 block: fix __blkdev_get and add_disk race condition
The following situation might occur:

__blkdev_get:			add_disk:

				register_disk()
get_gendisk()

disk_block_events()
	disk->ev == NULL

				disk_add_events()

__disk_unblock_events()
	disk->ev != NULL
	--ev->block

Then we unblock events, when they are suppose to be blocked. This can
trigger events related block/genhd.c warnings, but also can crash in
sd_check_events() or other places.

I'm able to reproduce crashes with the following scripts (with
connected usb dongle as sdb disk).

<snip>
DEV=/dev/sdb
ENABLE=/sys/bus/usb/devices/1-2/bConfigurationValue

function stop_me()
{
	for i in `jobs -p` ; do kill $i 2> /dev/null ; done
	exit
}

trap stop_me SIGHUP SIGINT SIGTERM

for ((i = 0; i < 10; i++)) ; do
	while true; do fdisk -l $DEV  2>&1 > /dev/null ; done &
done

while true ; do
echo 1 > $ENABLE
sleep 1
echo 0 > $ENABLE
done
</snip>

I use the script to verify patch fixing oops in sd_revalidate_disk
http://marc.info/?l=linux-scsi&m=132935572512352&w=2
Without Jun'ichi Nomura patch titled "Fix NULL pointer dereference in
sd_revalidate_disk" or this one, script easily crash kernel within
a few seconds. With both patches applied I do not observe crash.
Unfortunately after some time (dozen of minutes), script will hung in:

[ 1563.906432]  [<c08354f5>] schedule_timeout_uninterruptible+0x15/0x20
[ 1563.906437]  [<c04532d5>] msleep+0x15/0x20
[ 1563.906443]  [<c05d60b2>] blk_drain_queue+0x32/0xd0
[ 1563.906447]  [<c05d6e00>] blk_cleanup_queue+0xd0/0x170
[ 1563.906454]  [<c06d278f>] scsi_free_queue+0x3f/0x60
[ 1563.906459]  [<c06d7e6e>] __scsi_remove_device+0x6e/0xb0
[ 1563.906463]  [<c06d4aff>] scsi_forget_host+0x4f/0x60
[ 1563.906468]  [<c06cd84a>] scsi_remove_host+0x5a/0xf0
[ 1563.906482]  [<f7f030fb>] quiesce_and_remove_host+0x5b/0xa0 [usb_storage]
[ 1563.906490]  [<f7f03203>] usb_stor_disconnect+0x13/0x20 [usb_storage]

Anyway I think this patch is some step forward.

As drawback, I do not teardown on sysfs file create error, because I do
not know how to nullify disk->ev (since it can be used). However add_disk
error handling practically does not exist too, and things will work
without this sysfs file, except events will not be exported to user
space.

Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-02 10:44:17 +01:00
Jun'ichi Nomura fe316bf2d5 block: Fix NULL pointer dereference in sd_revalidate_disk
Since 2.6.39 (1196f8b), when a driver returns -ENOMEDIUM for open(),
__blkdev_get() calls rescan_partitions() to remove
in-kernel partition structures and raise KOBJ_CHANGE uevent.

However it ends up calling driver's revalidate_disk without open
and could cause oops.

In the case of SCSI:

  process A                  process B
  ----------------------------------------------
  sys_open
    __blkdev_get
      sd_open
        returns -ENOMEDIUM
                             scsi_remove_device
                               <scsi_device torn down>
      rescan_partitions
        sd_revalidate_disk
          <oops>
Oopses are reported here:
http://marc.info/?l=linux-scsi&m=132388619710052

This patch separates the partition invalidation from rescan_partitions()
and use it for -ENOMEDIUM case.

Reported-by: Huajun Li <huajun.li.lee@gmail.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-02 10:38:33 +01:00
Ingo Molnar 7e4d960993 Merge branch 'linus' into sched/core
Merge reason: we'll queue up dependent patches.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-03-01 10:26:43 +01:00
Anton Altaparmakov 97387e3baa LDM: Fix reassembly of extended VBLKs.
From: Ben Hutchings <ben@decadent.org.uk>

Extended VBLKs (those larger than the preset VBLK size) are divided
into fragments, each with its own VBLK header.  Our LDM implementation
generally assumes that each VBLK is contiguous in memory, so these
fragments must be assembled before further processing.

Currently the reassembly seems to be done quite wrongly - no VBLK
header is copied into the contiguous buffer, and the length of the
header is subtracted twice from each fragment.  Also the total
length of the reassembled VBLK is calculated incorrectly.

Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Anton Altaparmakov <anton@tuxera.com>
2012-02-24 09:37:42 +00:00
Tejun Heo 621032ad6e block: exit_io_context() should call elevator_exit_icq_fn()
While updating locking, b2efa05265 "block, cfq: unlink
cfq_io_context's immediately" moved elevator_exit_icq_fn() invocation
from exit_io_context() to the final ioc put.  While this doesn't cause
catastrophic failure, it effectively removes task exit notification to
elevator and cause noticeable IO performance degradation with CFQ.

On task exit, CFQ used to immediately expire the slice if it was being
used by the exiting task as no more IO would be issued by the task;
however, after b2efa05265, the notification is lost and disk could sit
idle needlessly, leading to noticeable IO performance degradation for
certain workloads.

This patch renames ioc_exit_icq() to ioc_destroy_icq(), separates
elevator_exit_icq_fn() invocation into ioc_exit_icq() and invokes it
from exit_io_context().  ICQ_EXITED flag is added to avoid invoking
the callback more than once for the same icq.

Walking icq_list from ioc side and invoking elevator callback requires
reverse double locking.  This may be better implemented using RCU;
unfortunately, using RCU isn't trivial.  e.g. RCU protection would
need to cover request_queue and queue_lock switch on cleanup makes
grabbing queue_lock from RCU unsafe.  Reverse double locking should
do, at least for now.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-and-bisected-by: Shaohua Li <shli@kernel.org>
LKML-Reference: <CANejiEVzs=pUhQSTvUppkDcc2TNZyfohBRLygW5zFmXyk5A-xQ@mail.gmail.com>
Tested-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-02-15 09:45:53 +01:00
Tejun Heo 2274b029f6 block: simplify ioc_release_fn()
Reverse double lock dancing in ioc_release_fn() can be simplified by
just using trylock on the queue_lock and back out from ioc lock on
trylock failure.  Simplify it.

Signed-off-by: Tejun Heo <tj@kernel.org>
Tested-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-02-15 09:45:52 +01:00
Tejun Heo d705ae6b13 block: replace icq->changed with icq->flags
icq->changed was used for ICQ_*_CHANGED bits.  Rename it to flags and
access it under ioc->lock instead of using atomic bitops.
ioc_get_changed() is added so that the changed part can be fetched and
cleared as before.

icq->flags will be used to carry other flags.

Signed-off-by: Tejun Heo <tj@kernel.org>
Tested-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-02-15 09:45:49 +01:00
Tejun Heo d8c66c5d59 block: fix lockdep warning on io_context release put_io_context()
11a3122f6c "block: strip out locking optimization in put_io_context()"
removed ioc_lock depth lockdep annoation along with locking
optimization; however, while recursing from put_io_context() is no
longer possible, ioc_release_fn() may still end up putting the last
reference of another ioc through elevator, which wlil grab ioc->lock
triggering spurious (as the ioc is always different one) A-A deadlock
warning.

As this can only happen one time from ioc_release_fn(), using non-zero
subclass from ioc_release_fn() is enough.  Use subclass 1.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-02-11 12:37:25 +01:00
Stanislaw Gruszka 37b40adf2d bsg: fix sysfs link remove warning
We create "bsg" link if q->kobj.sd is not NULL, so remove it only
when the same condition is true.

Fixes:

WARNING: at fs/sysfs/inode.c:323 sysfs_hash_and_remove+0x2b/0x77()
sysfs: can not remove 'bsg', no directory
Call Trace:
  [<c0429683>] warn_slowpath_common+0x6a/0x7f
  [<c0537a68>] ? sysfs_hash_and_remove+0x2b/0x77
  [<c042970b>] warn_slowpath_fmt+0x2b/0x2f
  [<c0537a68>] sysfs_hash_and_remove+0x2b/0x77
  [<c053969a>] sysfs_remove_link+0x20/0x23
  [<c05d88f1>] bsg_unregister_queue+0x40/0x6d
  [<c0692263>] __scsi_remove_device+0x31/0x9d
  [<c069149f>] scsi_forget_host+0x41/0x52
  [<c0689fa9>] scsi_remove_host+0x71/0xe0
  [<f7de5945>] quiesce_and_remove_host+0x51/0x83 [usb_storage]
  [<f7de5a1e>] usb_stor_disconnect+0x18/0x22 [usb_storage]
  [<c06c29de>] usb_unbind_interface+0x4e/0x109
  [<c067a80f>] __device_release_driver+0x6b/0xa6
  [<c067a861>] device_release_driver+0x17/0x22
  [<c067a46a>] bus_remove_device+0xd6/0xe6
  [<c06785e2>] device_del+0xf2/0x137
  [<c06c101f>] usb_disable_device+0x94/0x1a0

Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-02-08 20:02:03 +01:00
Tejun Heo 07c2bd3735 block: don't call elevator callbacks for plug merges
Plug merge calls two elevator callbacks outside queue lock -
elevator_allow_merge_fn() and elevator_bio_merged_fn().  Although
attempt_plug_merge() suggests that elevator is guaranteed to be there
through the existing request on the plug list, nothing prevents plug
merge from calling into dying or initializing elevator.

For regular merges, bypass ensures elvpriv count to reach zero, which
in turn prevents merges as all !ELVPRIV requests get REQ_SOFTBARRIER
from forced back insertion.  Plug merge doesn't check ELVPRIV, and, as
the requests haven't gone through elevator insertion yet, it doesn't
have SOFTBARRIER set allowing merges on a bypassed queue.

This, for example, leads to the following crash during elevator
switch.

 BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
 IP: [<ffffffff813b34e9>] cfq_allow_merge+0x49/0xa0
 PGD 112cbc067 PUD 115d5c067 PMD 0
 Oops: 0000 [#1] PREEMPT SMP
 CPU 1
 Modules linked in: deadline_iosched

 Pid: 819, comm: dd Not tainted 3.3.0-rc2-work+ #76 Bochs Bochs
 RIP: 0010:[<ffffffff813b34e9>]  [<ffffffff813b34e9>] cfq_allow_merge+0x49/0xa0
 RSP: 0018:ffff8801143a38f8  EFLAGS: 00010297
 RAX: 0000000000000000 RBX: ffff88011817ce28 RCX: ffff880116eb6cc0
 RDX: 0000000000000000 RSI: ffff880118056e20 RDI: ffff8801199512f8
 RBP: ffff8801143a3908 R08: 0000000000000000 R09: 0000000000000000
 R10: 0000000000000001 R11: 0000000000000000 R12: ffff880118195708
 R13: ffff880118052aa0 R14: ffff8801143a3d50 R15: ffff880118195708
 FS:  00007f19f82cb700(0000) GS:ffff88011fc80000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
 CR2: 0000000000000008 CR3: 0000000112c6a000 CR4: 00000000000006e0
 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
 Process dd (pid: 819, threadinfo ffff8801143a2000, task ffff880116eb6cc0)
 Stack:
  ffff88011817ce28 ffff880118195708 ffff8801143a3928 ffffffff81391bba
  ffff88011817ce28 ffff880118195708 ffff8801143a3948 ffffffff81391bf1
  ffff88011817ce28 0000000000000000 ffff8801143a39a8 ffffffff81398e3e
 Call Trace:
  [<ffffffff81391bba>] elv_rq_merge_ok+0x4a/0x60
  [<ffffffff81391bf1>] elv_try_merge+0x21/0x40
  [<ffffffff81398e3e>] blk_queue_bio+0x8e/0x390
  [<ffffffff81396a5a>] generic_make_request+0xca/0x100
  [<ffffffff81396b04>] submit_bio+0x74/0x100
  [<ffffffff811d45c2>] __blockdev_direct_IO+0x1ce2/0x3450
  [<ffffffff811d0dc7>] blkdev_direct_IO+0x57/0x60
  [<ffffffff811460b5>] generic_file_aio_read+0x6d5/0x760
  [<ffffffff811986b2>] do_sync_read+0xe2/0x120
  [<ffffffff81199345>] vfs_read+0xc5/0x180
  [<ffffffff81199501>] sys_read+0x51/0x90
  [<ffffffff81aeac12>] system_call_fastpath+0x16/0x1b

There are multiple ways to fix this including making plug merge check
ELVPRIV; however,

* Calling into elevator outside queue lock is confusing and
  error-prone.

* Requests on plug list aren't known to the elevator.  They aren't on
  the elevator yet, so there's no elevator specific state to update.

* Given the nature of plug merges - collecting bio's for the same
  purpose from the same issuer - elevator specific restrictions aren't
  applicable.

So, simply don't call into elevator methods from plug merge by moving
elv_bio_merged() from bio_attempt_*_merge() to blk_queue_bio(), and
using blk_try_merge() in attempt_plug_merge().

This is based on Jens' patch to skip elevator_allow_merge_fn() from
plug merge.

Note that this makes per-cgroup merged stats skip plug merging.

Signed-off-by: Tejun Heo <tj@kernel.org>
LKML-Reference: <4F16F3CA.90904@kernel.dk>
Original-patch-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-02-08 09:19:42 +01:00
Tejun Heo 050c8ea80e block: separate out blk_rq_merge_ok() and blk_try_merge() from elevator functions
blk_rq_merge_ok() is the elevator-neutral part of merge eligibility
test.  blk_try_merge() determines merge direction and expects the
caller to have tested elv_rq_merge_ok() previously.

elv_rq_merge_ok() now wraps blk_rq_merge_ok() and then calls
elv_iosched_allow_merge().  elv_try_merge() is removed and the two
callers are updated to call elv_rq_merge_ok() explicitly followed by
blk_try_merge().  While at it, make rq_merge_ok() functions return
bool.

This is to prepare for plug merge update and doesn't introduce any
behavior change.

This is based on Jens' patch to skip elevator_allow_merge_fn() from
plug merge.

Signed-off-by: Tejun Heo <tj@kernel.org>
LKML-Reference: <4F16F3CA.90904@kernel.dk>
Original-patch-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-02-08 09:19:38 +01:00
Tejun Heo 11a3122f6c block: strip out locking optimization in put_io_context()
put_io_context() performed a complex trylock dancing to avoid
deferring ioc release to workqueue.  It was also broken on UP because
trylock was always assumed to succeed which resulted in unbalanced
preemption count.

While there are ways to fix the UP breakage, even the most
pathological microbench (forced ioc allocation and tight fork/exit
loop) fails to show any appreciable performance benefit of the
optimization.  Strip it out.  If there turns out to be workloads which
are affected by this change, simpler optimization from the discussion
thread can be applied later.

Signed-off-by: Tejun Heo <tj@kernel.org>
LKML-Reference: <1328514611.21268.66.camel@sli10-conroe>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-02-07 07:51:30 +01:00
Shaohua Li 9fa73472dd block: fix ioc locking warning
Meelis reported a warning:

WARNING: at kernel/timer.c:1122 run_timer_softirq+0x199/0x1ec()
Hardware name: 939Dual-SATA2
timer: cfq_idle_slice_timer+0x0/0xaa preempt leak: 00000102 -> 00000103
Modules linked in: sr_mod cdrom videodev media drm_kms_helper ohci_hcd ehci_hcd v4l2_compat_ioctl32 usbcore i2c_ali15x3 snd_seq drm snd_timer snd_seq
Pid: 0, comm: swapper Not tainted 3.3.0-rc2-00110-gd125666 #176
Call Trace:
 <IRQ>  [<ffffffff81022aaa>] warn_slowpath_common+0x7e/0x96
 [<ffffffff8114c485>] ? cfq_slice_expired+0x1d/0x1d
 [<ffffffff81022b56>] warn_slowpath_fmt+0x41/0x43
 [<ffffffff8114c526>] ? cfq_idle_slice_timer+0xa1/0xaa
 [<ffffffff8114c485>] ? cfq_slice_expired+0x1d/0x1d
 [<ffffffff8102c124>] run_timer_softirq+0x199/0x1ec
 [<ffffffff81047a53>] ? timekeeping_get_ns+0x12/0x31
 [<ffffffff810145fd>] ? apic_write+0x11/0x13
 [<ffffffff81027475>] __do_softirq+0x74/0xfa
 [<ffffffff812f337a>] call_softirq+0x1a/0x30
 [<ffffffff81002ff9>] do_softirq+0x31/0x68
 [<ffffffff810276cf>] irq_exit+0x3d/0xa3
 [<ffffffff81014aca>] smp_apic_timer_interrupt+0x6b/0x77
 [<ffffffff812f2de9>] apic_timer_interrupt+0x69/0x70
 <EOI>  [<ffffffff81040136>] ? sched_clock_cpu+0x73/0x7d
 [<ffffffff81040136>] ? sched_clock_cpu+0x73/0x7d
 [<ffffffff8100801f>] ? default_idle+0x1e/0x32
 [<ffffffff81008019>] ? default_idle+0x18/0x32
 [<ffffffff810008b1>] cpu_idle+0x87/0xd1
 [<ffffffff812de861>] rest_init+0x85/0x89
 [<ffffffff81659a4d>] start_kernel+0x2eb/0x2f8
 [<ffffffff8165926e>] x86_64_start_reservations+0x7e/0x82
 [<ffffffff81659362>] x86_64_start_kernel+0xf0/0xf7

this_q == locked_q is possible. There are two problems here:
1. In UP case, there is preemption counter issue as spin_trylock always
successes.
2. In SMP case, the loop breaks too earlier.

Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Reported-by: Meelis Roos <mroos@linux.ee>
Reported-by: Knut Petersen <Knut_Petersen@t-online.de>
Tested-by: Knut Petersen <Knut_Petersen@t-online.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-02-06 08:57:29 +01:00
Li Zefan 761b3ef50e cgroup: remove cgroup_subsys argument from callbacks
The argument is not used at all, and it's not necessary, because
a specific callback handler of course knows which subsys it
belongs to.

Now only ->pupulate() takes this argument, because the handlers of
this callback always call cgroup_add_file()/cgroup_add_files().

So we reduce a few lines of code, though the shrinking of object size
is minimal.

 16 files changed, 113 insertions(+), 162 deletions(-)

   text    data     bss     dec     hex filename
5486240  656987 7039960 13183187         c928d3 vmlinux.o.orig
5486170  656987 7039960 13183117         c9288d vmlinux.o

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2012-02-02 09:20:22 -08:00
Peter Zijlstra 39be350127 sched, block: Unify cache detection
The block layer has some code trying to determine if two CPUs share a
cache, the scheduler has a similar function. Expose the function used
by the scheduler and make the block layer use it, thereby removing the
block layers usage of CONFIG_SCHED* and topology bits.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Jens Axboe <axboe@kernel.dk>
Link: http://lkml.kernel.org/r/1327579450.2446.95.camel@twins
2012-01-27 13:28:48 +01:00
Shaohua Li 05c30b9551 block: fix NULL icq_cache reference
Vivek reported a kernel crash:
[   94.217015] BUG: unable to handle kernel NULL pointer dereference at 000000000000001c
[   94.218004] IP: [<ffffffff81142fae>] kmem_cache_free+0x5e/0x200
[   94.218004] PGD 13abda067 PUD 137d52067 PMD 0
[   94.218004] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
[   94.218004] CPU 0
[   94.218004] Modules linked in: [last unloaded: scsi_wait_scan]
[   94.218004]
[   94.218004] Pid: 0, comm: swapper/0 Not tainted 3.2.0+ #16 Hewlett-Packard HP xw6600 Workstation/0A9Ch
[   94.218004] RIP: 0010:[<ffffffff81142fae>]  [<ffffffff81142fae>] kmem_cache_free+0x5e/0x200
[   94.218004] RSP: 0018:ffff88013fc03de0  EFLAGS: 00010006
[   94.218004] RAX: ffffffff81e0d020 RBX: ffff880138b3c680 RCX: 00000001801c001b
[   94.218004] RDX: 00000000003aac1d RSI: ffff880138b3c680 RDI: ffffffff81142fae
[   94.218004] RBP: ffff88013fc03e10 R08: ffff880137830238 R09: 0000000000000001
[   94.218004] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[   94.218004] R13: ffffea0004e2cf00 R14: ffffffff812f6eb6 R15: 0000000000000246
[   94.218004] FS:  0000000000000000(0000) GS:ffff88013fc00000(0000) knlGS:0000000000000000
[   94.218004] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[   94.218004] CR2: 000000000000001c CR3: 00000001395ab000 CR4: 00000000000006f0
[   94.218004] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   94.218004] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[   94.218004] Process swapper/0 (pid: 0, threadinfo ffffffff81e00000, task ffffffff81e0d020)
[   94.218004] Stack:
[   94.218004]  0000000000000102 ffff88013fc0db20 ffffffff81e22700 ffff880139500f00
[   94.218004]  0000000000000001 000000000000000a ffff88013fc03e20 ffffffff812f6eb6
[   94.218004]  ffff88013fc03e90 ffffffff810c8da2 ffffffff81e01fd8 ffff880137830240
[   94.218004] Call Trace:
[   94.218004]  <IRQ>
[   94.218004]  [<ffffffff812f6eb6>] icq_free_icq_rcu+0x16/0x20
[   94.218004]  [<ffffffff810c8da2>] __rcu_process_callbacks+0x1c2/0x420
[   94.218004]  [<ffffffff810c9038>] rcu_process_callbacks+0x38/0x250
[   94.218004]  [<ffffffff810405ee>] __do_softirq+0xce/0x3e0
[   94.218004]  [<ffffffff8108ed04>] ? clockevents_program_event+0x74/0x100
[   94.218004]  [<ffffffff81090104>] ? tick_program_event+0x24/0x30
[   94.218004]  [<ffffffff8183ed1c>] call_softirq+0x1c/0x30
[   94.218004]  [<ffffffff8100422d>] do_softirq+0x8d/0xc0
[   94.218004]  [<ffffffff81040c3e>] irq_exit+0xae/0xe0
[   94.218004]  [<ffffffff8183f4be>] smp_apic_timer_interrupt+0x6e/0x99
[   94.218004]  [<ffffffff8183e330>] apic_timer_interrupt+0x70/0x80

Once a queue is quiesced, it's not supposed to have any elvpriv data or
icq's, and elevator switching depends on that.  Request alloc path
followed the rule for elvpriv data but forgot apply it to icq's
leading to the following crash during elevator switch. Fix it by not
allocating icq's if ELVPRIV is not set for the request.

Reported-by: Vivek Goyal <vgoyal@redhat.com>
Tested-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-01-19 09:20:10 +01:00
Shaohua Li df0793abb9 block,cfq: change code order
cfq_slice_expired will change saved_workload_slice. It should be called
first so saved_workload_slice is correctly set to 0 after workload type
is changed.
This fixes the code order changed by 54b466e44b.

Tested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-01-19 09:20:09 +01:00
Jens Axboe 54b466e44b cfq-iosched: fix use-after-free of cfqq
With the changes in life time management between the cfq IO contexts
and the cfq queues, we now risk having cfqd->active_queue being
freed when cfq_slice_expired() is being called. cfq_preempt_queue()
caches this queue and uses it after calling said function, causing
a use-after-free condition. This triggers the following oops,
when cfqq_type() attempts to dereference it:

BUG: unable to handle kernel paging request at ffff8800746c4f0c
IP: [<ffffffff81266d59>] cfqq_type+0xb/0x20
PGD 18d4063 PUD 1fe15067 PMD 1ffb9067 PTE 80000000746c4160
Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
CPU 3
Modules linked in:

Pid: 1, comm: init Not tainted 3.2.0-josef+ #367 Bochs Bochs
RIP: 0010:[<ffffffff81266d59>]  [<ffffffff81266d59>] cfqq_type+0xb/0x20
RSP: 0018:ffff880079c11778  EFLAGS: 00010046
RAX: 0000000000000000 RBX: ffff880076f3df08 RCX: 0000000000000000
RDX: 0000000000000006 RSI: ffff880074271888 RDI: ffff8800746c4f08
RBP: ffff880079c11778 R08: 0000000000000078 R09: 0000000000000001
R10: 09f911029d74e35b R11: 09f911029d74e35b R12: ffff880076f337f0
R13: ffff8800746c4f08 R14: ffff8800746c4f08 R15: 0000000000000002
FS:  00007f62fd44f700(0000) GS:ffff88007cd80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffff8800746c4f0c CR3: 0000000076c21000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process init (pid: 1, threadinfo ffff880079c10000, task ffff880079c0a040)
Stack:
 ffff880079c117c8 ffffffff812683d8 ffff880079c117a8 ffffffff8125de43
 ffff8800744fcf48 ffff880074b43e98 ffff8800770c8828 ffff880074b43e98
 0000000000000003 0000000000000000 ffff880079c117f8 ffffffff81254149
Call Trace:
 [<ffffffff812683d8>] cfq_insert_request+0x3f5/0x47c
 [<ffffffff8125de43>] ? blk_recount_segments+0x20/0x31
 [<ffffffff81254149>] __elv_add_request+0x1ca/0x200
 [<ffffffff8125aa99>] blk_queue_bio+0x2ef/0x312
 [<ffffffff81258f7b>] generic_make_request+0x9f/0xe0
 [<ffffffff8125907b>] submit_bio+0xbf/0xca
 [<ffffffff81136ec7>] submit_bh+0xdf/0xfe
 [<ffffffff81176d04>] ext3_bread+0x50/0x99
 [<ffffffff811785b3>] dx_probe+0x38/0x291
 [<ffffffff81178864>] ext3_dx_find_entry+0x58/0x219
 [<ffffffff81178ad5>] ext3_find_entry+0xb0/0x406
 [<ffffffff8110c4d5>] ? cache_alloc_debugcheck_after.isra.46+0x14d/0x1a0
 [<ffffffff8110cfbd>] ? kmem_cache_alloc+0xef/0x191
 [<ffffffff8117a330>] ext3_lookup+0x39/0xe1
 [<ffffffff81119461>] d_alloc_and_lookup+0x45/0x6c
 [<ffffffff8111ac41>] do_lookup+0x1e4/0x2f5
 [<ffffffff8111aef6>] link_path_walk+0x1a4/0x6ef
 [<ffffffff8111b557>] path_lookupat+0x59/0x5ea
 [<ffffffff8127406c>] ? __strncpy_from_user+0x30/0x5a
 [<ffffffff8111bce0>] do_path_lookup+0x23/0x59
 [<ffffffff8111cfd6>] user_path_at_empty+0x53/0x99
 [<ffffffff8107b37b>] ? remove_wait_queue+0x51/0x56
 [<ffffffff8111d02d>] user_path_at+0x11/0x13
 [<ffffffff811141f5>] vfs_fstatat+0x3a/0x64
 [<ffffffff8111425a>] vfs_stat+0x1b/0x1d
 [<ffffffff81114359>] sys_newstat+0x1a/0x33
 [<ffffffff81060e12>] ? task_stopped_code+0x42/0x42
 [<ffffffff815d6712>] system_call_fastpath+0x16/0x1b
Code: 89 e6 48 89 c7 e8 fa ca fe ff 85 c0 74 06 4c 89 2b 41 b6 01 5b 44 89 f0 41 5c 41 5d 41 5e 5d c3 55 48 89 e5 66 66 66 66 90 31 c0 <8b> 57 04 f6 c6 01 74 0b 83 e2 20 83 fa 01 19 c0 83 c0 02 5d c3
RIP  [<ffffffff81266d59>] cfqq_type+0xb/0x20
 RSP <ffff880079c11778>
CR2: ffff8800746c4f0c

Get rid of the caching of cfqd->active_queue, and reorder the
check so that it happens before we expire the active queue.

Thanks to Tejun for pin pointing the error location.

Reported-by: Chris Mason <chris.mason@oracle.com>
Tested-by: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-01-17 21:26:11 +01:00
Linus Torvalds b3c9dd182e Merge branch 'for-3.3/core' of git://git.kernel.dk/linux-block
* 'for-3.3/core' of git://git.kernel.dk/linux-block: (37 commits)
  Revert "block: recursive merge requests"
  block: Stop using macro stubs for the bio data integrity calls
  blockdev: convert some macros to static inlines
  fs: remove unneeded plug in mpage_readpages()
  block: Add BLKROTATIONAL ioctl
  block: Introduce blk_set_stacking_limits function
  block: remove WARN_ON_ONCE() in exit_io_context()
  block: an exiting task should be allowed to create io_context
  block: ioc_cgroup_changed() needs to be exported
  block: recursive merge requests
  block, cfq: fix empty queue crash caused by request merge
  block, cfq: move icq creation and rq->elv.icq association to block core
  block, cfq: restructure io_cq creation path for io_context interface cleanup
  block, cfq: move io_cq exit/release to blk-ioc.c
  block, cfq: move icq cache management to block core
  block, cfq: move io_cq lookup to blk-ioc.c
  block, cfq: move cfqd->icq_list to request_queue and add request->elv.icq
  block, cfq: reorganize cfq_io_context into generic and cfq specific parts
  block: remove elevator_queue->ops
  block: reorder elevator switch sequence
  ...

Fix up conflicts in:
 - block/blk-cgroup.c
	Switch from can_attach_task to can_attach
 - block/cfq-iosched.c
	conflict with now removed cic index changes (we now use q->id instead)
2012-01-15 12:24:45 -08:00
Jens Axboe 5d381efb3d Revert "block: recursive merge requests"
This reverts commit 274193224c.

We have some problems related to selection of empty queues
that need to be resolved, evidence so far points to the
recursive merge logic making either being the cause or at
least the accelerator for this. So revert it for now, until
we figure this out.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-01-15 10:29:48 +01:00
Paolo Bonzini 0bfc96cb77 block: fail SCSI passthrough ioctls on partition devices
Linux allows executing the SG_IO ioctl on a partition or LVM volume, and
will pass the command to the underlying block device.  This is
well-known, but it is also a large security problem when (via Unix
permissions, ACLs, SELinux or a combination thereof) a program or user
needs to be granted access only to part of the disk.

This patch lets partitions forward a small set of harmless ioctls;
others are logged with printk so that we can see which ioctls are
actually sent.  In my tests only CDROM_GET_CAPABILITY actually occurred.
Of course it was being sent to a (partition on a) hard disk, so it would
have failed with ENOTTY and the patch isn't changing anything in
practice.  Still, I'm treating it specially to avoid spamming the logs.

In principle, this restriction should include programs running with
CAP_SYS_RAWIO.  If for example I let a program access /dev/sda2 and
/dev/sdb, it still should not be able to read/write outside the
boundaries of /dev/sda2 independent of the capabilities.  However, for
now programs with CAP_SYS_RAWIO will still be allowed to send the
ioctls.  Their actions will still be logged.

This patch does not affect the non-libata IDE driver.  That driver
however already tests for bd != bd->bd_contains before issuing some
ioctl; it could be restricted further to forbid these ioctls even for
programs running with CAP_SYS_ADMIN/CAP_SYS_RAWIO.

Cc: linux-scsi@vger.kernel.org
Cc: Jens Axboe <axboe@kernel.dk>
Cc: James Bottomley <JBottomley@parallels.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
[ Make it also print the command name when warning - Linus ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-01-14 15:07:24 -08:00
Paolo Bonzini 577ebb374c block: add and use scsi_blk_cmd_ioctl
Introduce a wrapper around scsi_cmd_ioctl that takes a block device.

The function will then be enhanced to detect partition block devices
and, in that case, subject the ioctls to whitelisting.

Cc: linux-scsi@vger.kernel.org
Cc: Jens Axboe <axboe@kernel.dk>
Cc: James Bottomley <JBottomley@parallels.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-01-14 15:07:24 -08:00
Martin K. Petersen ef00f59c95 block: Add BLKROTATIONAL ioctl
Introduce an ioctl which permits applications to query whether a block
device is rotational.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-01-11 16:29:31 +01:00
Martin K. Petersen b1bd055d39 block: Introduce blk_set_stacking_limits function
Stacking driver queue limits are typically bounded exclusively by the
capabilities of the low level devices, not by the stacking driver
itself.

This patch introduces blk_set_stacking_limits() which has more liberal
metrics than the default queue limits function. This allows us to
inherit topology parameters from bottom devices without manually
tweaking the default limits in each driver prior to calling the stacking
function.

Since there is now a clear distinction between stacking and low-level
devices, blk_set_default_limits() has been modified to carry the more
conservative values that we used to manually set in
blk_queue_make_request().

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-01-11 16:27:11 +01:00
Linus Torvalds db0c2bf69a Merge branch 'for-3.3' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
* 'for-3.3' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (21 commits)
  cgroup: fix to allow mounting a hierarchy by name
  cgroup: move assignement out of condition in cgroup_attach_proc()
  cgroup: Remove task_lock() from cgroup_post_fork()
  cgroup: add sparse annotation to cgroup_iter_start() and cgroup_iter_end()
  cgroup: mark cgroup_rmdir_waitq and cgroup_attach_proc() as static
  cgroup: only need to check oldcgrp==newgrp once
  cgroup: remove redundant get/put of task struct
  cgroup: remove redundant get/put of old css_set from migrate
  cgroup: Remove unnecessary task_lock before fetching css_set on migration
  cgroup: Drop task_lock(parent) on cgroup_fork()
  cgroups: remove redundant get/put of css_set from css_set_check_fetched()
  resource cgroups: remove bogus cast
  cgroup: kill subsys->can_attach_task(), pre_attach() and attach_task()
  cgroup, cpuset: don't use ss->pre_attach()
  cgroup: don't use subsys->can_attach_task() or ->attach_task()
  cgroup: introduce cgroup_taskset and use it in subsys->can_attach(), cancel_attach() and attach()
  cgroup: improve old cgroup handling in cgroup_attach_proc()
  cgroup: always lock threadgroup during migration
  threadgroup: extend threadgroup_lock() to cover exit and exec
  threadgroup: rename signal->threadgroup_fork_lock to ->group_rwsem
  ...

Fix up conflict in kernel/cgroup.c due to commit e0197aae59e5: "cgroups:
fix a css_set not found bug in cgroup_attach_proc" that already
mentioned that the bug is fixed (differently) in Tejun's cgroup
patchset. This one, in other words.
2012-01-09 12:59:24 -08:00
Linus Torvalds 972b2c7199 Merge branch 'for-linus2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
* 'for-linus2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (165 commits)
  reiserfs: Properly display mount options in /proc/mounts
  vfs: prevent remount read-only if pending removes
  vfs: count unlinked inodes
  vfs: protect remounting superblock read-only
  vfs: keep list of mounts for each superblock
  vfs: switch ->show_options() to struct dentry *
  vfs: switch ->show_path() to struct dentry *
  vfs: switch ->show_devname() to struct dentry *
  vfs: switch ->show_stats to struct dentry *
  switch security_path_chmod() to struct path *
  vfs: prefer ->dentry->d_sb to ->mnt->mnt_sb
  vfs: trim includes a bit
  switch mnt_namespace ->root to struct mount
  vfs: take /proc/*/mounts and friends to fs/proc_namespace.c
  vfs: opencode mntget() mnt_set_mountpoint()
  vfs: spread struct mount - remaining argument of next_mnt()
  vfs: move fsnotify junk to struct mount
  vfs: move mnt_devname
  vfs: move mnt_list to struct mount
  vfs: switch pnode.h macros to struct mount *
  ...
2012-01-08 12:19:57 -08:00
Al Viro ece2ccb668 Merge branches 'vfsmount-guts', 'umode_t' and 'partitions' into Z 2012-01-06 23:15:54 -05:00
Linus Torvalds 07d106d0a3 vfs: fix up ENOIOCTLCMD error handling
We're doing some odd things there, which already messes up various users
(see the net/socket.c code that this removes), and it was going to add
yet more crud to the block layer because of the incorrect error code
translation.

ENOIOCTLCMD is not an error return that should be returned to user mode
from the "ioctl()" system call, but it should *not* be translated as
EINVAL ("Invalid argument").  It should be translated as ENOTTY
("Inappropriate ioctl for device").

That EINVAL confusion has apparently so permeated some code that the
block layer actually checks for it, which is sad.  We continue to do so
for now, but add a big comment about how wrong that is, and we should
remove it entirely eventually.  In the meantime, this tries to keep the
changes localized to just the EINVAL -> ENOTTY fix, and removing code
that makes it harder to do the right thing.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-01-05 15:40:12 -08:00
Al Viro 2c9ede55ec switch device_get_devnode() and ->devnode() to umode_t *
both callers of device_get_devnode() are only interested in lower 16bits
and nobody tries to return anything wider than 16bit anyway.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-01-03 22:54:55 -05:00
Al Viro ff01bb4832 fs: move code out of buffer.c
Move invalidate_bdev, block_sync_page into fs/block_dev.c.  Export
kill_bdev as well, so brd doesn't have to open code it.  Reduce
buffer_head.h requirement accordingly.

Removed a rather large comment from invalidate_bdev, as it looked a bit
obsolete to bother moving.  The small comment replacing it says enough.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-01-03 22:54:07 -05:00
Al Viro 94ea4158f1 separate partition format handling from generic code
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-01-03 22:54:06 -05:00
Al Viro 9be96f3fd1 move fs/partitions to block/
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-01-03 22:54:06 -05:00
Al Viro 4752bc309b make register_disk() static
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-01-03 22:54:05 -05:00
Dan Williams f2b20d4365 block: fix blk_queue_end_tag()
Commit 5e081591 "block: warn if tag is greater than real_max_depth"
cleaned up blk_queue_end_tag() to warn when the tag is truly invalid
(greater than real_max_depth).  However, it changed behavior in the tag <
max_depth case to not end the request.  Leading to triggering of
BUG_ON(blk_queued_rq(rq)) in the request completion path:

  http://marc.info/?l=linux-kernel&m=132204370518629&w=2

In order to allow blk_queue_resize_tags() to shrink the tag space
blk_queue_end_tag() must always complete tags with a value less than
real_max_depth regardless of the current max_depth.  The comment about
"handling the shrink case" seems to be what prompted changes in this
space, so remove it and BUG on all invalid tags (made even simpler by
Matthew's suggestion to use an unsigned compare).

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Cc: Tao Ma <boyu.mt@taobao.com>
Cc: Matthew Wilcox <matthew@wil.cx>
Reported-by: Meelis Roos <mroos@ut.ee>
Reported-by: Ed Nadolski <edmund.nadolski@intel.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-12-29 09:16:28 +01:00
Tejun Heo c98b2cc29a block: remove WARN_ON_ONCE() in exit_io_context()
6e736be7 "block: make ioc get/put interface more conventional and fix
race on alloction" added WARN_ON_ONCE() in exit_io_context() which
triggers if !PF_EXITING.  All tasks hitting exit_io_context() from
task exit should have PF_EXITING set but task struct tearing down
after fork failure calls into the function without PF_EXITING,
triggering the condition.

  WARNING: at block/blk-ioc.c:234 exit_io_context+0x40/0x92()
  Pid: 17090, comm: trinity Not tainted 3.2.0-rc6-next-20111222-sasha-dirty #77
  Call Trace:
   [<ffffffff810b69a3>] warn_slowpath_common+0x8f/0xb2
   [<ffffffff810b6a77>] warn_slowpath_null+0x18/0x1a
   [<ffffffff8181a7a2>] exit_io_context+0x40/0x92
   [<ffffffff810b58c9>] copy_process+0x126f/0x1453
   [<ffffffff810b5c1b>] do_fork+0x120/0x3e9
   [<ffffffff8106242f>] sys_clone+0x26/0x28
   [<ffffffff82425803>] stub_clone+0x13/0x20
  ---[ end trace a2e4eb670b375238 ]---

Reported-by: Sasha Levin <levinsasha928@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-12-27 18:52:16 +01:00
Tejun Heo fd63836811 block: an exiting task should be allowed to create io_context
While fixing io_context creation / task exit race condition,
6e736be7f2 "block: make ioc get/put interface more conventional and
fix race on alloction" also prevented an exiting (%PF_EXITING) task
from creating its own io_context.  This is incorrect as exit path may
issue IOs, e.g. from exit_files(), and if those IOs are the first ones
issued by the task, io_context needs to be created to process the IOs.

Combined with the existing problem of io_context / io_cq creation
failure having the possibility of stalling IO, this problem results in
deterministic full IO lockup with certain workloads.

Fix it by allowing io_context creation regardless of %PF_EXITING for
%current.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Andrew Morton <akpm@linux-foundation.org>
Reported-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-12-25 14:29:14 +01:00
majianpeng 609f6ea1c9 block: re-use existing 'reading' variable instead of checking direction again
Signed-off-by: majianpeng <majianpeng@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-12-21 15:27:24 +01:00
Jens Axboe 64c42998f1 block: ioc_cgroup_changed() needs to be exported
With the ioc changed, ioc_cgroup_changed() can be used by modular
code. So ensure that it is exported.

Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-12-19 10:36:44 +01:00
Shaohua Li 6ae0516b8a block, cfq: fix empty queue crash caused by request merge
All requests of a queue could be merged to other requests of other queue.
Such queue will not have request in it, but it's in service tree. This
will cause kernel oops.
I encounter a BUG_ON() in cfq_dispatch_request() with next patch, but the
issue should exist without the patch.

Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-12-16 14:04:23 +01:00
Shaohua Li 274193224c block: recursive merge requests
In my workload, thread 1 accesses a, a+2, ..., thread 2 accesses a+1,
a+3,.... When the requests are flushed to queue, a and a+1 are merged
to (a, a+1), a+2 and a+3 too to (a+2, a+3), but (a, a+1) and (a+2, a+3)
aren't merged.
With recursive merge below, the workload throughput gets improved 20%
and context switch drops 60%.

Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-12-16 14:00:31 +01:00
Shaohua Li 4a0b75c7d0 block, cfq: fix empty queue crash caused by request merge
All requests of a queue could be merged to other requests of other queue.
Such queue will not have request in it, but it's in service tree. This
will cause kernel oops.
I encounter a BUG_ON() in cfq_dispatch_request() with next patch, but the
issue should exist without the patch.

Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-12-16 14:00:22 +01:00
Tejun Heo 4eabc94125 block: don't kick empty queue in blk_drain_queue()
While probing, fd sets up queue, probes hardware and tears down the
queue if probing fails.  In the process, blk_drain_queue() kicks the
queue which failed to finish initialization and fd is unhappy about
that.

  floppy0: no floppy controllers found
  ------------[ cut here ]------------
  WARNING: at drivers/block/floppy.c:2929 do_fd_request+0xbf/0xd0()
  Hardware name: To Be Filled By O.E.M.
  VFS: do_fd_request called on non-open device
  Modules linked in:
  Pid: 1, comm: swapper Not tainted 3.2.0-rc4-00077-g5983fe2 #2
  Call Trace:
   [<ffffffff81039a6a>] warn_slowpath_common+0x7a/0xb0
   [<ffffffff81039b41>] warn_slowpath_fmt+0x41/0x50
   [<ffffffff813d657f>] do_fd_request+0xbf/0xd0
   [<ffffffff81322b95>] blk_drain_queue+0x65/0x80
   [<ffffffff81322c93>] blk_cleanup_queue+0xe3/0x1a0
   [<ffffffff818a809d>] floppy_init+0xdeb/0xe28
   [<ffffffff818a72b2>] ? daring+0x6b/0x6b
   [<ffffffff810002af>] do_one_initcall+0x3f/0x170
   [<ffffffff81884b34>] kernel_init+0x9d/0x11e
   [<ffffffff810317c2>] ? schedule_tail+0x22/0xa0
   [<ffffffff815dbb14>] kernel_thread_helper+0x4/0x10
   [<ffffffff81884a97>] ? start_kernel+0x2be/0x2be
   [<ffffffff815dbb10>] ? gs_change+0xb/0xb

Avoid it by making blk_drain_queue() kick queue iff dispatch queue has
something on it.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Ralf Hildebrandt <Ralf.Hildebrandt@charite.de>
Reported-by: Wu Fengguang <fengguang.wu@intel.com>
Tested-by: Sergei Trofimovich <slyich@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-12-15 20:03:04 +01:00
Tejun Heo f1f8cc9465 block, cfq: move icq creation and rq->elv.icq association to block core
Now block layer knows everything necessary to create and associate
icq's with requests.  Move ioc_create_icq() to blk-ioc.c and update
get_request() such that, if elevator_type->icq_size is set, requests
are automatically associated with their matching icq's before
elv_set_request().  io_context reference is also managed by block core
on request alloc/free.

* Only ioprio/cgroup changed handling remains from cfq_get_cic().
  Collapsed into cfq_set_request().

* This removes queue kicking on icq allocation failure (for now).  As
  icq allocation failure is rare and the only effect of queue kicking
  achieved was possibily accelerating queue processing, this change
  shouldn't be noticeable.

  There is a larger underlying problem.  Unlike request allocation,
  icq allocation is not guaranteed to succeed eventually after
  retries.  The number of icq is unbound and thus mempool can't be the
  solution either.  This effectively adds allocation dependency on
  memory free path and thus possibility of deadlock.

  This usually wouldn't happen because icq allocation is not a hot
  path and, even when the condition triggers, it's highly unlikely
  that none of the writeback workers already has icq.

  However, this is still possible especially if elevator is being
  switched under high memory pressure, so we better get it fixed.
  Probably the only solution is just bypassing elevator and appending
  to dispatch queue on any elevator allocation failure.

* Comment added to explain how icq's are managed and synchronized.

This completes cleanup of io_context interface.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-12-14 00:33:42 +01:00
Tejun Heo 9b84cacd01 block, cfq: restructure io_cq creation path for io_context interface cleanup
Add elevator_ops->elevator_init_icq_fn() and restructure
cfq_create_cic() and rename it to ioc_create_icq().

The new function expects its caller to pass in io_context, uses
elevator_type->icq_cache, handles generic init, calls the new elevator
operation for elevator specific initialization, and returns pointer to
created or looked up icq.  This leaves cfq_icq_pool variable without
any user.  Removed.

This prepares for io_context interface cleanup and doesn't introduce
any functional difference.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-12-14 00:33:42 +01:00
Tejun Heo 7e5a879449 block, cfq: move io_cq exit/release to blk-ioc.c
With kmem_cache managed by blk-ioc, io_cq exit/release can be moved to
blk-ioc too.  The odd ->io_cq->exit/release() callbacks are replaced
with elevator_ops->elevator_exit_icq_fn() with unlinking from both ioc
and q, and freeing automatically handled by blk-ioc.  The elevator
operation only need to perform exit operation specific to the elevator
- in cfq's case, exiting the cfqq's.

Also, clearing of io_cq's on q detach is moved to block core and
automatically performed on elevator switch and q release.

Because the q io_cq points to might be freed before RCU callback for
the io_cq runs, blk-ioc code should remember to which cache the io_cq
needs to be freed when the io_cq is released.  New field
io_cq->__rcu_icq_cache is added for this purpose.  As both the new
field and rcu_head are used only after io_cq is released and the
q/ioc_node fields aren't, they are put into unions.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-12-14 00:33:42 +01:00
Tejun Heo 3d3c2379fe block, cfq: move icq cache management to block core
Let elevators set ->icq_size and ->icq_align in elevator_type and
elv_register() and elv_unregister() respectively create and destroy
kmem_cache for icq.

* elv_register() now can return failure.  All callers updated.

* icq caches are automatically named "ELVNAME_io_cq".

* cfq_slab_setup/kill() are collapsed into cfq_init/exit().

* While at it, minor indentation change for iosched_cfq.elevator_name
  for consistency.

This will help moving icq management to block core.  This doesn't
introduce any functional change.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-12-14 00:33:42 +01:00
Tejun Heo 47fdd4ca96 block, cfq: move io_cq lookup to blk-ioc.c
Now that all io_cq related data structures are in block core layer,
io_cq lookup can be moved from cfq-iosched.c to blk-ioc.c.

Lookup logic from cfq_cic_lookup() is moved to ioc_lookup_icq() with
parameter return type changes (cfqd -> request_queue, cfq_io_cq ->
io_cq) and cfq_cic_lookup() becomes thin wrapper around
cfq_cic_lookup().

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-12-14 00:33:42 +01:00
Tejun Heo a612fddf0d block, cfq: move cfqd->icq_list to request_queue and add request->elv.icq
Most of icq management is about to be moved out of cfq into blk-ioc.
This patch prepares for it.

* Move cfqd->icq_list to request_queue->icq_list

* Make request explicitly point to icq instead of through elevator
  private data.  ->elevator_private[3] is replaced with sub struct elv
  which contains icq pointer and priv[2].  cfq is updated accordingly.

* Meaningless clearing of ->elevator_private[0] removed from
  elv_set_request().  At that point in code, the field was guaranteed
  to be %NULL anyway.

This patch doesn't introduce any functional change.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-12-14 00:33:41 +01:00
Tejun Heo c586980732 block, cfq: reorganize cfq_io_context into generic and cfq specific parts
Currently io_context and cfq logics are mixed without clear boundary.
Most of io_context is independent from cfq but cfq_io_context handling
logic is dispersed between generic ioc code and cfq.

cfq_io_context represents association between an io_context and a
request_queue, which is a concept useful outside of cfq, but it also
contains fields which are useful only to cfq.

This patch takes out generic part and put it into io_cq (io
context-queue) and the rest into cfq_io_cq (cic moniker remains the
same) which contains io_cq.  The following changes are made together.

* cfq_ttime and cfq_io_cq now live in cfq-iosched.c.

* All related fields, functions and constants are renamed accordingly.

* ioc->ioc_data is now "struct io_cq *" instead of "void *" and
  renamed to icq_hint.

This prepares for io_context API cleanup.  Documentation is currently
sparse.  It will be added later.

Changes in this patch are mechanical and don't cause functional
change.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-12-14 00:33:41 +01:00
Tejun Heo 22f746e235 block: remove elevator_queue->ops
elevator_queue->ops points to the same ops struct ->elevator_type.ops
is pointing to.  The only effect of caching it in elevator_queue is
shorter notation - it doesn't save any indirect derefence.

Relocate elevator_type->list which used only during module init/exit
to the end of the structure, rename elevator_queue->elevator_type to
->type, and replace elevator_queue->ops with elevator_queue->type.ops.

This doesn't introduce any functional difference.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-12-14 00:33:41 +01:00
Tejun Heo f8fc877d3c block: reorder elevator switch sequence
Elevator switch sequence first attached the new elevator, then tried
registering it (sysfs) and if that failed attached back the old
elevator.  However, sysfs registration doesn't require the elevator to
be attached, so there is no reason to do the "detach, attach new,
register, maybe re-attach old" sequence.  It can just do "register,
detach, attach".

* elevator_init_queue() is updated to set ->elevator_data directly and
  return 0 / -errno.  This allows elevator_exit() on an unattached
  elevator.

* __elv_unregister_queue() which was necessary to unregister
  unattached q is removed in favor of __elv_register_queue() which can
  register unattached q.

* elevator_attach() becomes a single assignment and obscures more then
  it helps.  Dropped.

This will help cleaning up io_context handling across elevator switch.

This patch doesn't introduce visible behavior change.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-12-14 00:33:40 +01:00
Tejun Heo f2dbd76a0a block, cfq: replace current_io_context() with create_io_context()
When called under queue_lock, current_io_context() triggers lockdep
warning if it hits allocation path.  This is because io_context
installation is protected by task_lock which is not IRQ safe, so it
triggers irq-unsafe-lock -> irq -> irq-safe-lock -> irq-unsafe-lock
deadlock warning.

Given the restriction, accessor + creator rolled into one doesn't work
too well.  Drop current_io_context() and let the users access
task->io_context directly inside queue_lock combined with explicit
creation using create_io_context().

Future ioc updates will further consolidate ioc access and the create
interface will be unexported.

While at it, relocate ioc internal interface declarations in blk.h and
add section comments before and after.

This patch does not introduce functional change.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-12-14 00:33:40 +01:00
Tejun Heo 1238033c79 block, cfq: kill cic->key
Now that lazy paths are removed, cfqd_dead_key() is meaningless and
cic->q can be used whereever cic->key is used.  Kill cic->key.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-12-14 00:33:40 +01:00
Tejun Heo b50b636bce block, cfq: kill ioc_gone
Now that cic's are immediately unlinked under both locks, there's no
need to count and drain cic's before module unload.  RCU callback
completion is waited with rcu_barrier().

While at it, remove residual RCU operations on cic_list.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-12-14 00:33:39 +01:00
Tejun Heo b9a1920837 block, cfq: remove delayed unlink
Now that all cic's are immediately unlinked from both ioc and queue,
lazy dropping from lookup path and trimming on elevator unregister are
unnecessary.  Kill them and remove now unused elevator_ops->trim().

This also leaves call_for_each_cic() without any user.  Removed.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-12-14 00:33:39 +01:00
Tejun Heo b2efa05265 block, cfq: unlink cfq_io_context's immediately
cic is association between io_context and request_queue.  A cic is
linked from both ioc and q and should be destroyed when either one
goes away.  As ioc and q both have their own locks, locking becomes a
bit complex - both orders work for removal from one but not from the
other.

Currently, cfq tries to circumvent this locking order issue with RCU.
ioc->lock nests inside queue_lock but the radix tree and cic's are
also protected by RCU allowing either side to walk their lists without
grabbing lock.

This rather unconventional use of RCU quickly devolves into extremely
fragile convolution.  e.g. The following is from cfqd going away too
soon after ioc and q exits raced.

 general protection fault: 0000 [#1] PREEMPT SMP
 CPU 2
 Modules linked in:
 [   88.503444]
 Pid: 599, comm: hexdump Not tainted 3.1.0-rc10-work+ #158 Bochs Bochs
 RIP: 0010:[<ffffffff81397628>]  [<ffffffff81397628>] cfq_exit_single_io_context+0x58/0xf0
 ...
 Call Trace:
  [<ffffffff81395a4a>] call_for_each_cic+0x5a/0x90
  [<ffffffff81395ab5>] cfq_exit_io_context+0x15/0x20
  [<ffffffff81389130>] exit_io_context+0x100/0x140
  [<ffffffff81098a29>] do_exit+0x579/0x850
  [<ffffffff81098d5b>] do_group_exit+0x5b/0xd0
  [<ffffffff81098de7>] sys_exit_group+0x17/0x20
  [<ffffffff81b02f2b>] system_call_fastpath+0x16/0x1b

The only real hot path here is cic lookup during request
initialization and avoiding extra locking requires very confined use
of RCU.  This patch makes cic removal from both ioc and request_queue
perform double-locking and unlink immediately.

* From q side, the change is almost trivial as ioc->lock nests inside
  queue_lock.  It just needs to grab each ioc->lock as it walks
  cic_list and unlink it.

* From ioc side, it's a bit more difficult because of inversed lock
  order.  ioc needs its lock to walk its cic_list but can't grab the
  matching queue_lock and needs to perform unlock-relock dancing.

  Unlinking is now wholly done from put_io_context() and fast path is
  optimized by using the queue_lock the caller already holds, which is
  by far the most common case.  If the ioc accessed multiple devices,
  it tries with trylock.  In unlikely cases of fast path failure, it
  falls back to full double-locking dance from workqueue.

Double-locking isn't the prettiest thing in the world but it's *far*
simpler and more understandable than RCU trick without adding any
meaningful overhead.

This still leaves a lot of now unnecessary RCU logics.  Future patches
will trim them.

-v2: Vivek pointed out that cic->q was being dereferenced after
     cic->release() was called.  Updated to use local variable @this_q
     instead.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-12-14 00:33:39 +01:00
Tejun Heo f1a4f4d35f block, cfq: fix cic lookup locking
* cfq_cic_lookup() may be called without queue_lock and multiple tasks
  can execute it simultaneously for the same shared ioc.  Nothing
  prevents them racing each other and trying to drop the same dead cic
  entry multiple times.

* smp_wmb() in cfq_exit_cic() doesn't really do anything and nothing
  prevents cfq_cic_lookup() seeing stale cic->key.  This usually
  doesn't blow up because by the time cic is exited, all requests have
  been drained and new requests are terminated before going through
  elevator.  However, it can still be triggered by plug merge path
  which doesn't grab queue_lock and thus can't check DEAD state
  reliably.

This patch updates lookup locking such that,

* Lookup is always performed under queue_lock.  This doesn't add any
  more locking.  The only issue is cfq_allow_merge() which can be
  called from plug merge path without holding any lock.  For now, this
  is worked around by using cic of the request to merge into, which is
  guaranteed to have the same ioc.  For longer term, I think it would
  be best to separate out plug merge method from regular one.

* Spurious ioc->lock locking around cic lookup hint assignment
  dropped.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-12-14 00:33:39 +01:00
Tejun Heo 216284c352 block, cfq: fix race condition in cic creation path and tighten locking
cfq_get_io_context() would fail if multiple tasks race to insert cic's
for the same association.  This patch restructures
cfq_get_io_context() such that slow path insertion race is handled
properly.

Note that the restructuring also makes cfq_get_io_context() called
under queue_lock and performs both ioc and cfqd insertions while
holding both ioc and queue locks.  This is part of on-going locking
tightening and will be used to simplify synchronization rules.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-12-14 00:33:38 +01:00
Tejun Heo dc86900e0a block, cfq: move ioc ioprio/cgroup changed handling to cic
ioprio/cgroup change was handled by marking the changed state in ioc
and, on the following access to the ioc, performing RCU-protected
iteration through all cic's grabbing the matching queue_lock.

This patch moves the changed state to each cic.  When ioprio or cgroup
changes, the respective bit is set on all cic's of the ioc and when
each of those cic (not ioc) is accessed, change is applied for that
specific ioc-queue pair.

This also fixes the following two race conditions between setting and
clearing of changed states.

* Missing barrier between assign/load of ioprio and ioprio_changed
  allowed applying old ioprio.

* Change requests could happen between application of change and
  clearing of changed variables.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-12-14 00:33:38 +01:00
Tejun Heo 283287a52e block, cfq: misc updates to cfq_io_context
Make the following changes to prepare for ioc/cic management cleanup.

* Add cic->q so that ioc can determine the associated queue without
  querying cfq.  This will eventually replace ->key.

* Factor out cfq_release_cic() from cic_free_func().  This function
  assumes that the caller handled locking.

* Rename __cfq_exit_single_io_context() to cfq_exit_cic() and make it
  take only @cic.

* Restructure cfq_cic_link() for future updates.

This patch doesn't introduce any functional changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-12-14 00:33:38 +01:00
Tejun Heo 09ac46c429 block: misc updates to blk_get_queue()
* blk_get_queue() is peculiar in that it returns 0 on success and 1 on
  failure instead of 0 / -errno or boolean.  Update it such that it
  returns %true on success and %false on failure.

* Make sure the caller checks for the return value.

* Separate out __blk_get_queue() which doesn't check whether @q is
  dead and put it in blk.h.  This will be used later.

This patch doesn't introduce any functional changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-12-14 00:33:38 +01:00
Tejun Heo 6e736be7f2 block: make ioc get/put interface more conventional and fix race on alloction
Ignoring copy_io() during fork, io_context can be allocated from two
places - current_io_context() and set_task_ioprio().  The former is
always called from local task while the latter can be called from
different task.  The synchornization between them are peculiar and
dubious.

* current_io_context() doesn't grab task_lock() and assumes that if it
  saw %NULL ->io_context, it would stay that way until allocation and
  assignment is complete.  It has smp_wmb() between alloc/init and
  assignment.

* set_task_ioprio() grabs task_lock() for assignment and does
  smp_read_barrier_depends() between "ioc = task->io_context" and "if
  (ioc)".  Unfortunately, this doesn't achieve anything - the latter
  is not a dependent load of the former.  ie, if ioc itself were being
  dereferenced "ioc->xxx", it would mean something (not sure what tho)
  but as the code currently stands, the dependent read barrier is
  noop.

As only one of the the two test-assignment sequences is task_lock()
protected, the task_lock() can't do much about race between the two.
Nothing prevents current_io_context() and set_task_ioprio() allocating
its own ioc for the same task and overwriting the other's.

Also, set_task_ioprio() can race with exiting task and create a new
ioc after exit_io_context() is finished.

ioc get/put doesn't have any reason to be complex.  The only hot path
is accessing the existing ioc of %current, which is simple to achieve
given that ->io_context is never destroyed as long as the task is
alive.  All other paths can happily go through task_lock() like all
other task sub structures without impacting anything.

This patch updates ioc get/put so that it becomes more conventional.

* alloc_io_context() is replaced with get_task_io_context().  This is
  the only interface which can acquire access to ioc of another task.
  On return, the caller has an explicit reference to the object which
  should be put using put_io_context() afterwards.

* The functionality of current_io_context() remains the same but when
  creating a new ioc, it shares the code path with
  get_task_io_context() and always goes through task_lock().

* get_io_context() now means incrementing ref on an ioc which the
  caller already has access to (be that an explicit refcnt or implicit
  %current one).

* PF_EXITING inhibits creation of new io_context and once
  exit_io_context() is finished, it's guaranteed that both ioc
  acquisition functions return %NULL.

* All users are updated.  Most are trivial but
  smp_read_barrier_depends() removal from cfq_get_io_context() needs a
  bit of explanation.  I suppose the original intention was to ensure
  ioc->ioprio is visible when set_task_ioprio() allocates new
  io_context and installs it; however, this wouldn't have worked
  because set_task_ioprio() doesn't have wmb between init and install.
  There are other problems with this which will be fixed in another
  patch.

* While at it, use NUMA_NO_NODE instead of -1 for wildcard node
  specification.

-v2: Vivek spotted contamination from debug patch.  Removed.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-12-14 00:33:38 +01:00
Tejun Heo 42ec57a8f6 block: misc ioc cleanups
* int return from put_io_context() wasn't used by anybody.  Make it
  return void like other put functions and docbook-fy the function
  comment.

* Reorder dummy declarations for !CONFIG_BLOCK case a bit.

* Make alloc_ioc_context() use __GFP_ZERO allocation, take init out of
  if block and drop 0'ing.

* Docbook-fy current_io_context() comment.

This patch doesn't introduce any functional change.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-12-14 00:33:37 +01:00
Tejun Heo a73f730d01 block, cfq: move cfqd->cic_index to q->id
cfq allocates per-queue id using ida and uses it to index cic radix
tree from io_context.  Move it to q->id and allocate on queue init and
free on queue release.  This simplifies cfq a bit and will allow for
further improvements of io context life-cycle management.

This patch doesn't introduce any functional difference.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-12-14 00:33:37 +01:00
Tejun Heo 8ba61435d7 block: add missing blk_queue_dead() checks
blk_insert_cloned_request(), blk_execute_rq_nowait() and
blk_flush_plug_list() either didn't check whether the queue was dead
or did it without holding queue_lock.  Update them so that dead state
is checked while holding queue_lock.

AFAICS, this plugs all holes (requeue doesn't matter as the request is
transitioning atomically from in_flight to queued).

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-12-14 00:33:37 +01:00
Tejun Heo 481a7d6479 block: fix drain_all condition in blk_drain_queue()
When trying to drain all requests, blk_drain_queue() checked only
q->rq.count[]; however, this only tracks REQ_ALLOCED requests.  This
patch updates blk_drain_queue() such that it looks at all the counters
and queues so that request_queue is actually empty on completion.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-12-14 00:33:37 +01:00
Tejun Heo 34f6055c80 block: add blk_queue_dead()
There are a number of QUEUE_FLAG_DEAD tests.  Add blk_queue_dead()
macro and use it.

This patch doesn't introduce any functional difference.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-12-14 00:33:37 +01:00
Tejun Heo 1ba64edef6 block, sx8: kill blk_insert_request()
The only user left for blk_insert_request() is sx8 and it can be
trivially switched to use blk_execute_rq_nowait() - special requests
aren't included in io stat and sx8 doesn't use block layer tagging.
Switch sx8 and kill blk_insert_requeset().

This patch doesn't introduce any functional difference.

Only compile tested.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Jeff Garzik <jgarzik@pobox.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-12-14 00:33:37 +01:00
Tejun Heo bb9d97b6df cgroup: don't use subsys->can_attach_task() or ->attach_task()
Now that subsys->can_attach() and attach() take @tset instead of
@task, they can handle per-task operations.  Convert
->can_attach_task() and ->attach_task() users to use ->can_attach()
and attach() instead.  Most converions are straight-forward.
Noteworthy changes are,

* In cgroup_freezer, remove unnecessary NULL assignments to unused
  methods.  It's useless and very prone to get out of sync, which
  already happened.

* In cpuset, PF_THREAD_BOUND test is checked for each task.  This
  doesn't make any practical difference but is conceptually cleaner.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Paul Menage <paul@paulmenage.org>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: James Morris <jmorris@namei.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <peterz@infradead.org>
2011-12-12 18:12:21 -08:00
Yasuaki Ishimatsu 5eb46851de cfq-iosched: fix cfq_cic_link() race confition
cfq_cic_link() has race condition. When some processes which shared ioc
issue I/O to same block device simultaneously, cfq_cic_link() returns -EEXIST
sometimes. The race condition might stop I/O by following steps:

step  1: Process A: Issue an I/O to /dev/sda
step  2: Process A: Get an ioc (iocA here) in get_io_context() which does not
		    linked with a cic for the device
step  3: Process A: Get a new cic for the device (cicA here) in
		    cfq_alloc_io_context()

step  4: Process B: Issue an I/O to /dev/sda
step  5: Process B: Get iocA in get_io_context() since process A and B share the
		    same ioc
step  6: Process B: Get a new cic for the device (cicB here) in
		    cfq_alloc_io_context() since iocA has not been linked with a
		    cic for the device yet

step  7: Process A: Link cicA to iocA in cfq_cic_link()
step  8: Process A: Dispatch I/O to driver and finish it

step  9: Process B: Try to link cicB to iocA in cfq_cic_link()
		    But it fails with showing "cfq: cic link failed!" kernel
		    message, since iocA has already linked with cicA at step 7.
step 10: Process B: Wait for finishig I/O in get_request_wait()
		    The function does not wake up, when there is no I/O to the
		    device.

When cfq_cic_link() returns -EEXIST, it means ioc has already linked with cic.
So when cfq_cic_link() return -EEXIST, retry cfq_cic_lookup().

Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-12-02 10:07:07 +01:00
majianpeng 2984ff38cc cfq-iosched: free cic_index if blkio_alloc_blkg_stats fails
If we fail allocating the blkpg stats, we free cfqd and cfgq.
But we need to free the IDA cfqd->cic_index as well.

Signed-off-by: majianpeng <majianpeng@gmail.com>
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-11-30 15:47:48 +01:00
Mike Snitzer 5151412dd4 block: initialize request_queue's numa node during
struct request_queue is allocated with __GFP_ZERO so its "node" field is
zero before initialization.  This causes an oops if node 0 is offline in
the page allocator because its zonelists are not initialized.  From Dave
Young's dmesg:

	SRAT: Node 1 PXM 2 0-d0000000
	SRAT: Node 1 PXM 2 100000000-330000000
	SRAT: Node 0 PXM 1 330000000-630000000
	Initmem setup node 1 0000000000000000-000000000affb000
	...
	Built 1 zonelists in Node order, mobility grouping on.
	...
	BUG: unable to handle kernel paging request at 0000000000001c08
	IP: [<ffffffff8111c355>] __alloc_pages_nodemask+0xb5/0x870

and __alloc_pages_nodemask+0xb5 translates to a NULL pointer on
zonelist->_zonerefs.

The fix is to initialize q->node at the time of allocation so the correct
node is passed to the slab allocator later.

Since blk_init_allocated_queue_node() is no longer needed, merge it with
blk_init_allocated_queue().

[rientjes@google.com: changelog, initializing q->node]
Cc: stable@vger.kernel.org [2.6.37+]
Reported-by: Dave Young <dyoung@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Tested-by: Dave Young <dyoung@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-11-23 10:59:13 +01:00
Shaohua Li 019ceb7d5d block: add missed trace_block_plug
After flush plug list, the list has no request, so we need to add a
trace_block_plug().

Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Reviewed-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-11-16 09:21:50 +01:00
Shaohua Li 3540d5e89b block: avoid unnecessary plug list flush
get_request_wait() could sleep and flush the plug list.  If the list is
already flushed, don't flush again.

Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Reviewed-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-11-16 09:21:50 +01:00
Ben Hutchings 6b76106d8e block: Always check length of all iov entries in blk_rq_map_user_iov()
Even after commit 5478755616
("block: check for proper length of iov entries earlier ...")
we still won't check for zero-length entries after an unaligned
entry.  Remove the break-statement, so all entries are checked.

Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-11-13 19:58:09 +01:00
Tejun Heo d0985394e7 block: Revert "[SCSI] genhd: add a new attribute "alias" in gendisk"
This reverts commit a72c5e5eb7.

The commit introduced alias for block devices which is intended to be
used during logging although actual usage hasn't been committed yet.
This approach adds very limited benefit (raw log might be easier to
follow) which can be trivially implemented in userland but has a lot
of problems.

It is much worse than netif renames because it doesn't rename the
actual device but just adds conveninence name which isn't used
universally or enforced.  Everything internal including device lookup
and sysfs still uses the internal name and nothing prevents two
devices from using conflicting alias - ie. sda can have sdb as its
alias.

This has been nacked by people working on device driver core, block
layer and kernel-userland interface and shouldn't have been
upstreamed.  Revert it.

 http://thread.gmane.org/gmane.linux.kernel/1155104
 http://thread.gmane.org/gmane.linux.scsi/68632
 http://thread.gmane.org/gmane.linux.scsi/69776

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Greg Kroah-Hartman <gregkh@suse.de>
 Acked-by: Kay Sievers <kay.sievers@vrfy.org>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: Nao Nishijima <nao.nishijima.xt@hitachi.com>
Cc: Alan Cox <alan@linux.intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-11-10 09:03:55 +01:00
Linus Torvalds 32aaeffbd4 Merge branch 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux
* 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux: (230 commits)
  Revert "tracing: Include module.h in define_trace.h"
  irq: don't put module.h into irq.h for tracking irqgen modules.
  bluetooth: macroize two small inlines to avoid module.h
  ip_vs.h: fix implicit use of module_get/module_put from module.h
  nf_conntrack.h: fix up fallout from implicit moduleparam.h presence
  include: replace linux/module.h with "struct module" wherever possible
  include: convert various register fcns to macros to avoid include chaining
  crypto.h: remove unused crypto_tfm_alg_modname() inline
  uwb.h: fix implicit use of asm/page.h for PAGE_SIZE
  pm_runtime.h: explicitly requires notifier.h
  linux/dmaengine.h: fix implicit use of bitmap.h and asm/page.h
  miscdevice.h: fix up implicit use of lists and types
  stop_machine.h: fix implicit use of smp.h for smp_processor_id
  of: fix implicit use of errno.h in include/linux/of.h
  of_platform.h: delete needless include <linux/module.h>
  acpi: remove module.h include from platform/aclinux.h
  miscdevice.h: delete unnecessary inclusion of module.h
  device_cgroup.h: delete needless include <linux/module.h>
  net: sch_generic remove redundant use of <linux/module.h>
  net: inet_timewait_sock doesnt need <linux/module.h>
  ...

Fix up trivial conflicts (other header files, and  removal of the ab3550 mfd driver) in
 - drivers/media/dvb/frontends/dibx000_common.c
 - drivers/media/video/{mt9m111.c,ov6650.c}
 - drivers/mfd/ab3550-core.c
 - include/linux/dmaengine.h
2011-11-06 19:44:47 -08:00
Linus Torvalds 3d0a8d10cf Merge branch 'for-3.2/drivers' of git://git.kernel.dk/linux-block
* 'for-3.2/drivers' of git://git.kernel.dk/linux-block: (30 commits)
  virtio-blk: use ida to allocate disk index
  hpsa: add small delay when using PCI Power Management to reset for kump
  cciss: add small delay when using PCI Power Management to reset for kump
  xen/blkback: Fix two races in the handling of barrier requests.
  xen/blkback: Check for proper operation.
  xen/blkback: Fix the inhibition to map pages when discarding sector ranges.
  xen/blkback: Report VBD_WSECT (wr_sect) properly.
  xen/blkback: Support 'feature-barrier' aka old-style BARRIER requests.
  xen-blkfront: plug device number leak in xlblk_init() error path
  xen-blkfront: If no barrier or flush is supported, use invalid operation.
  xen-blkback: use kzalloc() in favor of kmalloc()+memset()
  xen-blkback: fixed indentation and comments
  xen-blkfront: fix a deadlock while handling discard response
  xen-blkfront: Handle discard requests.
  xen-blkback: Implement discard requests ('feature-discard')
  xen-blkfront: add BLKIF_OP_DISCARD and discard request struct
  drivers/block/loop.c: remove unnecessary bdev argument from loop_clr_fd()
  drivers/block/loop.c: emit uevent on auto release
  drivers/block/cpqarray.c: use pci_dev->revision
  loop: always allow userspace partitions and optionally support automatic scanning
  ...

Fic up trivial header file includsion conflict in drivers/block/loop.c
2011-11-04 17:22:14 -07:00
Linus Torvalds b4fdcb02f1 Merge branch 'for-3.2/core' of git://git.kernel.dk/linux-block
* 'for-3.2/core' of git://git.kernel.dk/linux-block: (29 commits)
  block: don't call blk_drain_queue() if elevator is not up
  blk-throttle: use queue_is_locked() instead of lockdep_is_held()
  blk-throttle: Take blkcg->lock while traversing blkcg->policy_list
  blk-throttle: Free up policy node associated with deleted rule
  block: warn if tag is greater than real_max_depth.
  block: make gendisk hold a reference to its queue
  blk-flush: move the queue kick into
  blk-flush: fix invalid BUG_ON in blk_insert_flush
  block: Remove the control of complete cpu from bio.
  block: fix a typo in the blk-cgroup.h file
  block: initialize the bounce pool if high memory may be added later
  block: fix request_queue lifetime handling by making blk_queue_cleanup() properly shutdown
  block: drop @tsk from attempt_plug_merge() and explain sync rules
  block: make get_request[_wait]() fail if queue is dead
  block: reorganize throtl_get_tg() and blk_throtl_bio()
  block: reorganize queue draining
  block: drop unnecessary blk_get/put_queue() in scsi_cmd_ioctl() and blk_get_tg()
  block: pass around REQ_* flags instead of broken down booleans during request alloc/free
  block: move blk_throtl prototypes to block/blk.h
  block: fix genhd refcounting in blkio_policy_parse_and_set()
  ...

Fix up trivial conflicts due to "mddev_t" -> "struct mddev" conversion
and making the request functions be of type "void" instead of "int" in
 - drivers/md/{faulty.c,linear.c,md.c,md.h,multipath.c,raid0.c,raid1.c,raid10.c,raid5.c}
 - drivers/staging/zram/zram_drv.c
2011-11-04 17:06:58 -07:00
Tejun Heo 6dd9ad7df2 block: don't call blk_drain_queue() if elevator is not up
blk_cleanup_queue() may be called before elevator is set up on a
queue which triggers the following oops.

 BUG: unable to handle kernel NULL pointer dereference at           (null)
 IP: [<ffffffff8125a69c>] elv_drain_elevator+0x1c/0x70
 ...
 Pid: 830, comm: kworker/0:2 Not tainted 3.1.0-next-20111025_64+ #1590
 Bochs Bochs
 RIP: 0010:[<ffffffff8125a69c>]  [<ffffffff8125a69c>] elv_drain_elevator+0x1c/0x70
 ...
 Call Trace:
  [<ffffffff8125da92>] blk_drain_queue+0x42/0x70
  [<ffffffff8125db90>] blk_cleanup_queue+0xd0/0x1c0
  [<ffffffff81469640>] md_free+0x50/0x70
  [<ffffffff8126f43b>] kobject_release+0x8b/0x1d0
  [<ffffffff81270d56>] kref_put+0x36/0xa0
  [<ffffffff8126f2b7>] kobject_put+0x27/0x60
  [<ffffffff814693af>] mddev_delayed_delete+0x2f/0x40
  [<ffffffff81083450>] process_one_work+0x100/0x3b0
  [<ffffffff8108527f>] worker_thread+0x15f/0x3a0
  [<ffffffff81089937>] kthread+0x87/0x90
  [<ffffffff81621834>] kernel_thread_helper+0x4/0x10

Fix it by making blk_cleanup_queue() check whether q->elevator is set
up before invoking blk_drain_queue.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-and-tested-by: Jiri Slaby <jslaby@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-11-03 18:52:11 +01:00
Paul Gortmaker 6adb1236b5 block: Change module.h -> export.h in bsg-lib.c
This file isn't using full modular functionality, and hence
can be "downgraded" to just using the export.h header.

Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2011-10-31 19:31:13 -04:00
Paul Gortmaker d5decd3b95 block: add export.h to files using EXPORT_SYMBOL/THIS_MODULE macros
These files were getting <linux/module.h> via an implicit include
path, but we want to crush those out of existence since they cost
time during compiles of processing thousands of lines of headers
for no reason.  Give them the lightweight header that just contains
the EXPORT_SYMBOL infrastructure.

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2011-10-31 19:31:12 -04:00
Linus Torvalds ec7ae51753 Merge git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6: (204 commits)
  [SCSI] qla4xxx: export address/port of connection (fix udev disk names)
  [SCSI] ipr: Fix BUG on adapter dump timeout
  [SCSI] megaraid_sas: Fix instance access in megasas_reset_timer
  [SCSI] hpsa: change confusing message to be more clear
  [SCSI] iscsi class: fix vlan configuration
  [SCSI] qla4xxx: fix data alignment and use nl helpers
  [SCSI] iscsi class: fix link local mispelling
  [SCSI] iscsi class: Replace iscsi_get_next_target_id with IDA
  [SCSI] aacraid: use lower snprintf() limit
  [SCSI] lpfc 8.3.27: Change driver version to 8.3.27
  [SCSI] lpfc 8.3.27: T10 additions for SLI4
  [SCSI] lpfc 8.3.27: Fix queue allocation failure recovery
  [SCSI] lpfc 8.3.27: Change algorithm for getting physical port name
  [SCSI] lpfc 8.3.27: Changed worst case mailbox timeout
  [SCSI] lpfc 8.3.27: Miscellanous logic and interface fixes
  [SCSI] megaraid_sas: Changelog and version update
  [SCSI] megaraid_sas: Add driver workaround for PERC5/1068 kdump kernel panic
  [SCSI] megaraid_sas: Add multiple MSI-X vector/multiple reply queue support
  [SCSI] megaraid_sas: Add support for MegaRAID 9360/9380 12GB/s controllers
  [SCSI] megaraid_sas: Clear FUSION_IN_RESET before enabling interrupts
  ...
2011-10-28 16:44:18 -07:00
Jens Axboe 334c2b0b8b blk-throttle: use queue_is_locked() instead of lockdep_is_held()
We can't use the latter if !CONFIG_LOCKDEP.

Reported-by: Sedat Dilek <sedat.dilek@googlemail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-10-25 15:51:48 +02:00
Vivek Goyal a38eb630fa blk-throttle: Take blkcg->lock while traversing blkcg->policy_list
blkcg->policy_list is protected by blkcg->lock. Its not rcu protected
list. So even for readers, they need to take blkcg->lock. There are
few functions which were reading the list without taking lock. Fix it.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-10-25 15:48:12 +02:00
Vivek Goyal e060f00bee blk-throttle: Free up policy node associated with deleted rule
If a rule is being deleted, free up associated policy node. Otherwise
that memory is leaked.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-10-25 15:48:12 +02:00
Tao Ma 5e08159197 block: warn if tag is greater than real_max_depth.
In case tag depth is reduced, it is max_depth not real_max_depth.
So we should allow a request with tag >= max_depth, but for a
tag >= real_max_depth, there really should be some problem.

Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-10-25 10:20:05 +02:00
Jens Axboe 83157223de Merge branch 'for-linus' into for-3.2/core 2011-10-24 16:24:38 +02:00
Tejun Heo f992ae801a block: make gendisk hold a reference to its queue
The following command sequence triggers an oops.

# mount /dev/sdb1 /mnt
# echo 1 > /sys/class/scsi_device/0\:0\:1\:0/device/delete
# umount /mnt

 general protection fault: 0000 [#1] PREEMPT SMP
 CPU 2
 Modules linked in:

 Pid: 791, comm: umount Not tainted 3.1.0-rc3-work+ #8 Bochs Bochs
 RIP: 0010:[<ffffffff810d0879>]  [<ffffffff810d0879>] __lock_acquire+0x389/0x1d60
...
 Call Trace:
  [<ffffffff810d2845>] lock_acquire+0x95/0x140
  [<ffffffff81aed87b>] _raw_spin_lock+0x3b/0x50
  [<ffffffff811573bc>] bdi_lock_two+0x5c/0x70
  [<ffffffff811c2f6c>] bdev_inode_switch_bdi+0x4c/0xf0
  [<ffffffff811c3fcb>] __blkdev_put+0x11b/0x1d0
  [<ffffffff811c4010>] __blkdev_put+0x160/0x1d0
  [<ffffffff811c40df>] blkdev_put+0x5f/0x190
  [<ffffffff8118f18d>] kill_block_super+0x4d/0x80
  [<ffffffff8118f4a5>] deactivate_locked_super+0x45/0x70
  [<ffffffff8119003a>] deactivate_super+0x4a/0x70
  [<ffffffff811ac4ad>] mntput_no_expire+0xed/0x130
  [<ffffffff811acf2e>] sys_umount+0x7e/0x3a0
  [<ffffffff81aeeeab>] system_call_fastpath+0x16/0x1b

This is because bdev holds on to disk but disk doesn't pin the
associated queue.  If a SCSI device is removed while the device is
still open, the sdev puts the base reference to the queue on release.
When the bdev is finally released, the associated queue is already
gone along with the bdi and bdev_inode_switch_bdi() ends up
dereferencing already freed bdi.

Even if it were not for this bug, disk not holding onto the associated
queue is very unusual and error-prone.

Fix it by making add_disk() take an extra reference to its queue and
put it on disk_release() and ensuring that disk and its fops owner are
put in that order after all accesses to the disk and queue are
complete.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-10-24 16:24:31 +02:00
Jeff Moyer e67b77c791 blk-flush: move the queue kick into
A dm-multipath user reported[1] a problem when trying to boot
a kernel with commit 4853abaae7
(block: fix flush machinery for stacking drivers with differring
flush flags) applied.  It turns out that an empty flush request
can be sent into blk_insert_flush.  When the BUG_ON was fixed
to allow for this, I/O on the underlying device would stall.  The
reason is that blk_insert_cloned_request does not kick the queue.
In the aforementioned commit, I had added a special case to
kick the queue if data was sent down but the queue flags did
not require a flush.  A better solution is to push the queue
kick up into blk_insert_cloned_request.

This patch, along with a follow-on which fixes the BUG_ON, fixes
the issue reported.

[1] http://www.redhat.com/archives/dm-devel/2011-September/msg00154.html

Reported-by: Christophe Saout <christophe@saout.de>
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>

Stable note: 3.1
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-10-24 16:24:31 +02:00
Jeff Moyer 834f9f61a5 blk-flush: fix invalid BUG_ON in blk_insert_flush
A user reported a regression due to commit
4853abaae7 (block: fix flush
machinery for stacking drivers with differring flush flags).
Part of the problem is that blk_insert_flush required a
single bio be attached to the request.  In reality, having
no attached bio is also a valid case, as can be observed with
an empty flush.

[1] http://www.redhat.com/archives/dm-devel/2011-September/msg00154.html

Reported-by: Christophe Saout <christophe@saout.de>
Signed-off-by: Jeff Moyer <jmoyer@redhat.com
Acked-by: Tejun Heo <tj@kernel.org>

Stable note: 3.1
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-10-24 16:24:30 +02:00
Tao Ma 9562ad9ab3 block: Remove the control of complete cpu from bio.
bio originally has the functionality to set the complete cpu, but
it is broken.

Chirstoph said that "This code is unused, and from the all the
discussions lately pretty obviously broken.  The only thing keeping
it serves is creating more confusion and possibly more bugs."

And Jens replied with "We can kill bio_set_completion_cpu(). I'm fine
with leaving cpu control to the request based drivers, they are the
only ones that can toggle the setting anyway".

So this patch tries to remove all the work of controling complete cpu
from a bio.

Cc: Shaohua Li <shaohua.li@intel.com>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-10-24 16:11:30 +02:00
Jie Liu e890413af4 block: fix a typo in the blk-cgroup.h file
byptes -> bytes.

Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-10-24 16:08:38 +02:00
Tejun Heo c9a929dde3 block: fix request_queue lifetime handling by making blk_queue_cleanup() properly shutdown
request_queue is refcounted but actually depdends on lifetime
management from the queue owner - on blk_cleanup_queue(), block layer
expects that there's no request passing through request_queue and no
new one will.

This is fundamentally broken.  The queue owner (e.g. SCSI layer)
doesn't have a way to know whether there are other active users before
calling blk_cleanup_queue() and other users (e.g. bsg) don't have any
guarantee that the queue is and would stay valid while it's holding a
reference.

With delay added in blk_queue_bio() before queue_lock is grabbed, the
following oops can be easily triggered when a device is removed with
in-flight IOs.

 sd 0:0:1:0: [sdb] Stopping disk
 ata1.01: disabled
 general protection fault: 0000 [#1] PREEMPT SMP
 CPU 2
 Modules linked in:

 Pid: 648, comm: test_rawio Not tainted 3.1.0-rc3-work+ #56 Bochs Bochs
 RIP: 0010:[<ffffffff8137d651>]  [<ffffffff8137d651>] elv_rqhash_find+0x61/0x100
 ...
 Process test_rawio (pid: 648, threadinfo ffff880019efa000, task ffff880019ef8a80)
 ...
 Call Trace:
  [<ffffffff8137d774>] elv_merge+0x84/0xe0
  [<ffffffff81385b54>] blk_queue_bio+0xf4/0x400
  [<ffffffff813838ea>] generic_make_request+0xca/0x100
  [<ffffffff81383994>] submit_bio+0x74/0x100
  [<ffffffff811c53ec>] dio_bio_submit+0xbc/0xc0
  [<ffffffff811c610e>] __blockdev_direct_IO+0x92e/0xb40
  [<ffffffff811c39f7>] blkdev_direct_IO+0x57/0x60
  [<ffffffff8113b1c5>] generic_file_aio_read+0x6d5/0x760
  [<ffffffff8118c1ca>] do_sync_read+0xda/0x120
  [<ffffffff8118ce55>] vfs_read+0xc5/0x180
  [<ffffffff8118cfaa>] sys_pread64+0x9a/0xb0
  [<ffffffff81afaf6b>] system_call_fastpath+0x16/0x1b

This happens because blk_queue_cleanup() destroys the queue and
elevator whether IOs are in progress or not and DEAD tests are
sprinkled in the request processing path without proper
synchronization.

Similar problem exists for blk-throtl.  On queue cleanup, blk-throtl
is shutdown whether it has requests in it or not.  Depending on
timing, it either oopses or throttled bios are lost putting tasks
which are waiting for bio completion into eternal D state.

The way it should work is having the usual clear distinction between
shutdown and release.  Shutdown drains all currently pending requests,
marks the queue dead, and performs partial teardown of the now
unnecessary part of the queue.  Even after shutdown is complete,
reference holders are still allowed to issue requests to the queue
although they will be immmediately failed.  The rest of teardown
happens on release.

This patch makes the following changes to make blk_queue_cleanup()
behave as proper shutdown.

* QUEUE_FLAG_DEAD is now set while holding both q->exit_mutex and
  queue_lock.

* Unsynchronized DEAD check in generic_make_request_checks() removed.
  This couldn't make any meaningful difference as the queue could die
  after the check.

* blk_drain_queue() updated such that it can drain all requests and is
  now called during cleanup.

* blk_throtl updated such that it checks DEAD on grabbing queue_lock,
  drains all throttled bios during cleanup and free td when queue is
  released.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-10-19 14:42:16 +02:00
Tejun Heo bd87b5898a block: drop @tsk from attempt_plug_merge() and explain sync rules
attempt_plug_merge() accesses elevator without holding queue_lock and
may call into ->elevator_bio_merge_fn().  The elvator is guaranteed to
be valid because it's accessed iff the plugged list has requests and
elevator is never exited with live requests, so as long as the
elevator method can deal with unlocked access, this is safe.

Explain the sync rules around attempt_plug_merge() and drop the
unnecessary @tsk parameter.

This patch doesn't introduce any functional change.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-10-19 14:33:08 +02:00
Tejun Heo da8303c63b block: make get_request[_wait]() fail if queue is dead
Currently get_request[_wait]() allocates request whether queue is dead
or not.  This patch makes get_request[_wait]() return NULL if @q is
dead.  blk_queue_bio() is updated to fail the submitted bio if request
allocation fails.  While at it, add docbook comments for
get_request[_wait]().

Note that the current code has rather unclear (there are spurious DEAD
tests scattered around) assumption that the owner of a queue
guarantees that no request travels block layer if the queue is dead
and this patch in itself doesn't change much; however, this will allow
fixing the broken assumption in the next patch.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-10-19 14:33:05 +02:00
Tejun Heo bc16a4f933 block: reorganize throtl_get_tg() and blk_throtl_bio()
blk_throtl_bio() and throtl_get_tg() have rather unusual interface.

* throtl_get_tg() returns pointer to a valid tg or ERR_PTR(-ENODEV),
  and drops queue_lock in the latter case.  Different locking context
  depending on return value is error-prone and DEAD state is scheduled
  to be protected by queue_lock anyway.  Move DEAD check inside
  queue_lock and return valid tg or NULL.

* blk_throtl_bio() indicates return status both with its return value
  and in/out param **@bio.  The former is used to indicate whether
  queue is found to be dead during throtl processing.  The latter
  whether the bio is throttled.

  There's no point in returning DEAD check result from
  blk_throtl_bio().  The queue can die after blk_throtl_bio() is
  finished but before make_request_fn() grabs queue lock.

  Make it take *@bio instead and return boolean result indicating
  whether the request is throttled or not.

This patch doesn't cause any visible functional difference.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-10-19 14:33:01 +02:00
Tejun Heo e3c78ca524 block: reorganize queue draining
Reorganize queue draining related code in preparation of queue exit
changes.

* Factor out actual draining from elv_quiesce_start() to
  blk_drain_queue().

* Make elv_quiesce_start/end() responsible for their own locking.

* Replace open-coded ELVSWITCH clearing in elevator_switch() with
  elv_quiesce_end().

This patch doesn't cause any visible functional difference.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-10-19 14:32:38 +02:00
Tejun Heo 315fceee81 block: drop unnecessary blk_get/put_queue() in scsi_cmd_ioctl() and blk_get_tg()
blk_get/put_queue() in scsi_cmd_ioctl() and throtl_get_tg() are
completely bogus.  The caller must have a reference to the queue on
entry and taking an extra reference doesn't change anything.

For scsi_cmd_ioctl(), the only effect is that it ends up checking
QUEUE_FLAG_DEAD on entry; however, this is bogus as queue can die
right after blk_get_queue().  Dead queue should be and is handled in
request issue path (it's somewhat broken now but that's a separate
problem and doesn't affect this one much).

throtl_get_tg() incorrectly assumes that q is rcu freed.  Also, it
doesn't check return value of blk_get_queue().  If the queue is
already dead, it ends up doing an extra put.

Drop them.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-10-19 14:31:25 +02:00
Tejun Heo 75eb6c372d block: pass around REQ_* flags instead of broken down booleans during request alloc/free
blk_alloc_request() and freed_request() take different combinations of
REQ_* @flags, @priv and @is_sync when @flags is superset of the latter
two.  Make them take @flags only.  This cleans up the code a bit and
will ease updating allocation related REQ_* flags.

This patch doesn't introduce any functional difference.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-10-19 14:31:22 +02:00
Tejun Heo bc9fcbf9cb block: move blk_throtl prototypes to block/blk.h
blk_throtl interface is block internal and there's no reason to have
them in linux/blkdev.h.  Move them to block/blk.h.

This patch doesn't introduce any functional change.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-10-19 14:31:18 +02:00
Tejun Heo ece84241b9 block: fix genhd refcounting in blkio_policy_parse_and_set()
blkio_policy_parse_and_set() calls blkio_check_dev_num() to check
whether the given dev_t is valid.  blkio_check_dev_num() uses
get_gendisk() for verification but never puts the returned genhd
leaking the reference.

This patch collapses blkio_check_dev_num() into its caller and updates
it such that the genhd is put before returning.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-10-19 14:31:15 +02:00
Tejun Heo 523e1d399c block: make gendisk hold a reference to its queue
The following command sequence triggers an oops.

# mount /dev/sdb1 /mnt
# echo 1 > /sys/class/scsi_device/0\:0\:1\:0/device/delete
# umount /mnt

 general protection fault: 0000 [#1] PREEMPT SMP
 CPU 2
 Modules linked in:

 Pid: 791, comm: umount Not tainted 3.1.0-rc3-work+ #8 Bochs Bochs
 RIP: 0010:[<ffffffff810d0879>]  [<ffffffff810d0879>] __lock_acquire+0x389/0x1d60
...
 Call Trace:
  [<ffffffff810d2845>] lock_acquire+0x95/0x140
  [<ffffffff81aed87b>] _raw_spin_lock+0x3b/0x50
  [<ffffffff811573bc>] bdi_lock_two+0x5c/0x70
  [<ffffffff811c2f6c>] bdev_inode_switch_bdi+0x4c/0xf0
  [<ffffffff811c3fcb>] __blkdev_put+0x11b/0x1d0
  [<ffffffff811c4010>] __blkdev_put+0x160/0x1d0
  [<ffffffff811c40df>] blkdev_put+0x5f/0x190
  [<ffffffff8118f18d>] kill_block_super+0x4d/0x80
  [<ffffffff8118f4a5>] deactivate_locked_super+0x45/0x70
  [<ffffffff8119003a>] deactivate_super+0x4a/0x70
  [<ffffffff811ac4ad>] mntput_no_expire+0xed/0x130
  [<ffffffff811acf2e>] sys_umount+0x7e/0x3a0
  [<ffffffff81aeeeab>] system_call_fastpath+0x16/0x1b

This is because bdev holds on to disk but disk doesn't pin the
associated queue.  If a SCSI device is removed while the device is
still open, the sdev puts the base reference to the queue on release.
When the bdev is finally released, the associated queue is already
gone along with the bdi and bdev_inode_switch_bdi() ends up
dereferencing already freed bdi.

Even if it were not for this bug, disk not holding onto the associated
queue is very unusual and error-prone.

Fix it by making add_disk() take an extra reference to its queue and
put it on disk_release() and ensuring that disk and its fops owner are
put in that order after all accesses to the disk and queue are
complete.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-10-19 14:31:07 +02:00
Jens Axboe 5c04b426f2 Merge branch 'v3.1-rc10' into for-3.2/core
Conflicts:
	block/blk-core.c
	include/linux/blkdev.h

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-10-19 14:30:42 +02:00
Hannes Reinecke 777eb1bf15 block: Free queue resources at blk_release_queue()
A kernel crash is observed when a mounted ext3/ext4 filesystem is
physically removed. The problem is that blk_cleanup_queue() frees up
some resources eg by calling elevator_exit(), which are not checked for
in normal operation. So we should rather move these calls to the
destructor function blk_release_queue() as at that point all remaining
references are gone. However, in doing so we have to ensure that any
externally supplied queue_lock is disconnected as the driver might free
up the lock after the call of blk_cleanup_queue(),

Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-09-28 08:07:01 -06:00
Wanlong Gao d11bb4462c blk-cgroup: be able to remove the record of unplugged device
The bug is we're not able to remove the device from blkio cgroup's
per-device control files if it gets unplugged.

To reproduce the bug:

  # mount -t cgroup -o blkio xxx /cgroup
  # cd /cgroup
  # echo "8:0 1000" > blkio.throttle.read_bps_device
  # unplug the device
  # cat blkio.throttle.read_bps_device
  8:0	1000
  # echo "8:0 0" > blkio.throttle.read_bps_device
  -bash: echo: write error: No such device

After patching, the device removal will succeed.

Thanks for the comments of Paul, Zefan, and Vivek.

Signed-off-by: Wanlong Gao <gaowanlong@cn.fujitsu.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Paul Menage <paul@paulmenage.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-09-21 10:22:10 +02:00
Andrew Morton 499337bb65 block/blk-sysfs.c: fix kerneldoc references
The kerneldoc for blk_release_queue() is referring to blk_cleanup_queue().

Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@google.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-09-21 10:01:22 +02:00
Suresh Jayaraman 75df713627 block: document blk-plug
Thus spake Andrew Morton:

"And I have the usual maintainability whine.  If someone comes up to
vmscan.c and sees it calling blk_start_plug(), how are they supposed to
work out why that call is there?  They go look at the blk_start_plug()
definition and it is undocumented.  I think we can do better than this?"

Adapted from the LWN article - http://lwn.net/Articles/438256/ by Jens
Axboe and from an earlier attempt by Shaohua Li to document blk-plug.

[akpm@linux-foundation.org: grammatical and spelling tweaks]
Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
Cc: Shaohua Li <shaohua.li@intel.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@google.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-09-21 10:00:16 +02:00
Christoph Hellwig 27a84d54c0 block: refactor generic_make_request
Move all the checks performed on a bio into a new helper, and call it as
soon as bio is submitted even if it is a re-submission from ->make_request.

We explicitly mark the new helper as beeing non-inlined as the stack
usage for printing the block device name in the failure case is quite
high and this a patch where we have to be extremely conservative about
stack usage.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-09-15 14:01:40 +02:00
Tao Ma 8ad6a56f56 block: Don't check QUEUE_FLAG_SAME_COMP in __blk_complete_request
In __blk_complete_request, we check both QUEUE_FLAG_SAME_COMP and req->cpu
to decide whether we should use req->cpu. Actually the user can also
select the complete cpu by either setting BIO_CPU_AFFINE or by calling
bio_set_completion_cpu. Current solution makes these 2 ways don't work
any more. So we'd better just check req->cpu.

Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-09-14 09:31:01 +02:00
Christoph Hellwig 5a7bbad27a block: remove support for bio remapping from ->make_request
There is very little benefit in allowing to let a ->make_request
instance update the bios device and sector and loop around it in
__generic_make_request when we can archive the same through calling
generic_make_request from the driver and letting the loop in
generic_make_request handle it.

Note that various drivers got the return value from ->make_request and
returned non-zero values for errors.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: NeilBrown <neilb@suse.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-09-12 12:12:01 +02:00
Jens Axboe c20e8de27f block: rename __make_request() to blk_queue_bio()
Now that it's exported, lets put it in a more sane namespace.

Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-09-12 12:08:31 +02:00
Christoph Hellwig 166e1f901b block: export __make_request
Avoid the hacks need for request based device mappers currently by simply
exporting the symbol instead of trying to get it through the back door.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-09-12 12:08:27 +02:00
Wang Sheng-Hui 484fc254b8 elevator: use ELV_NAME_MAX instead of magic number 16 for chosen_elevator
We have ELV_NAME_MAX defined to 16, and hence we should use it
instead of the magic nubmer 16 for elevator's name string.

Signed-off-by: Wang Sheng-Hui <shhuiw@gmail.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-09-12 08:59:20 +02:00
Nao Nishijima a72c5e5eb7 [SCSI] genhd: add a new attribute "alias" in gendisk
This patch allows the user to set an "alias" of the disk via sysfs interface.

This patch only adds a new attribute "alias" in gendisk structure.
To show the alias instead of the device name in kernel messages,
we need to revise printk messages and use alias_name() in them.

Example:
(current) printk("disk name is %s\n", disk->disk_name);
(new)     printk("disk name is %s\n", alias_name(disk));

Users can use alphabets, numbers, '-' and '_' in "alias" attribute. A disk can
have an "alias" which length is up to 255 bytes. This attribute is write-once.

Suggested-by: James Bottomley <James.Bottomley@HansenPartnership.com>
Suggested-by: Jon Masters <jcm@redhat.com>
Signed-off-by: Nao Nishijima <nao.nishijima.xt@hitachi.com>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>
2011-08-29 00:16:19 -07:00
Shaohua Li 56ebdaf2fa block: simplify force plug flush code a little bit
Cleaning up the code a little bit. attempt_plug_merge() traverses the plug
list anyway, we can do the request counting there, so stack size is reduced
a little bit.
The motivation here is I suspect if we should count the requests for each
queue (task could handle multiple disks in the meantime), but my test doesn't
show it's worthy doing. If somebody proves we should do it, below change
will make that more easier.

Signed-off-by: Shaohua Li <shli@kernel.org>
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-08-24 16:04:34 +02:00
Shaohua Li a632716275 block: change force plug flush call order
Do blk_flush_plug_list() first and then add new request aDo blk_flush_plug_list() first and then add new request aDo blk_flush_plug_list() first and then add new request at the tail. New
request can't be merged to existing requests, but later new requests might
be merged with this new one. If blk_flush_plug_list() is done later, the
merge doesn't happen.
Believe it or not, this fixes a 10% regression running sysbench workload.

Signed-off-by: Shaohua Li <shli@kernel.org>
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-08-24 16:04:32 +02:00
Eric Seppanen e8037d4983 block: Fix queue_flag update when rq_affinity goes from 2 to 1
Commit 5757a6d76c added the QUEUE_FLAG_SAME_FORCE flag, but fails to
clear that flag when the current state is '2' (SAME_COMP + SAME_FORCE)
and the new state is '1' (SAME_COMP).

Acked-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Roland Dreier <roland@purestorage.com>
Signed-off-by: Eric Seppanen <eric@purestorage.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-08-24 08:51:34 +02:00
Tejun Heo d27769ec3d block: add GENHD_FL_NO_PART_SCAN
There are cases where suppressing partition scan is useful - e.g. for
lo devices and pseudo SATA devices which advertise to be a disk but
get upset on partition scan (some port multiplier control devices show
such behavior).

This patch adds GENHD_FL_NO_PART_SCAN which suppresses partition scan
regardless of the number of possible partitions.  disk_partitionable()
is renamed to disk_part_scan_enabled() as suppressing partition scan
doesn't imply the device can't be partitioned using
BLKPG_ADD/DEL_PARTITION calls from userland.  show_partition() now
directly tests disk_max_parts() to maintain backward-compatibility.

-v2: Updated to make it clear that only partition scan is suppressed
     not partitioning itself as suggested by Kay Sievers.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-08-23 20:01:04 +02:00
Christoph Hellwig 65299a3b78 block: separate priority boosting from REQ_META
Add a new REQ_PRIO to let requests preempt others in the cfq I/O schedule,
and lave REQ_META purely for marking requests as metadata in blktrace.

All existing callers of REQ_META except for XFS are updated to also
set REQ_PRIO for now.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-08-23 14:50:29 +02:00
Linus Torvalds 5ccc38740a Merge branch 'for-linus' of git://git.kernel.dk/linux-block
* 'for-linus' of git://git.kernel.dk/linux-block: (23 commits)
  Revert "cfq: Remove special treatment for metadata rqs."
  block: fix flush machinery for stacking drivers with differring flush flags
  block: improve rq_affinity placement
  blktrace: add FLUSH/FUA support
  Move some REQ flags to the common bio/request area
  allow blk_flush_policy to return REQ_FSEQ_DATA independent of *FLUSH
  xen/blkback: Make description more obvious.
  cfq-iosched: Add documentation about idling
  block: Make rq_affinity = 1 work as expected
  block: swim3: fix unterminated of_device_id table
  block/genhd.c: remove useless cast in diskstats_show()
  drivers/cdrom/cdrom.c: relax check on dvd manufacturer value
  drivers/block/drbd/drbd_nl.c: use bitmap_parse instead of __bitmap_parse
  bsg-lib: add module.h include
  cfq-iosched: Reduce linked group count upon group destruction
  blk-throttle: correctly determine sync bio
  loop: fix deadlock when sysfs and LOOP_CLR_FD race against each other
  loop: add BLK_DEV_LOOP_MIN_COUNT=%i to allow distros 0 pre-allocated loop devices
  loop: add management interface for on-demand device allocation
  loop: replace linked list of allocated devices with an idr index
  ...
2011-08-19 10:47:07 -07:00
Jens Axboe b53d1ed734 Revert "cfq: Remove special treatment for metadata rqs."
We have a kernel build regression since 3.1-rc1, which is about 10%
regression. The kernel source is in an ext3 filesystem.
Alex Shi bisect it to commit:
commit a07405b780
Author: Justin TerAvest <teravest@google.com>
Date:   Sun Jul 10 22:09:19 2011 +0200

    cfq: Remove special treatment for metadata rqs.

Apparently this is caused by lack metadata preemption, where ext3/ext4
do use READ_META. I didn't see a way to fix the issue, so suggest
reverting the patch.

This reverts commit a07405b780.

Reported-by: Alex Shi<alex.shi@intel.com>
Reported-by: Shaohua Li<shaohua.li@intel.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-08-19 08:34:48 +02:00
Jeff Moyer 4853abaae7 block: fix flush machinery for stacking drivers with differring flush flags
Commit ae1b153962, block: reimplement
FLUSH/FUA to support merge, introduced a performance regression when
running any sort of fsyncing workload using dm-multipath and certain
storage (in our case, an HP EVA).  The test I ran was fs_mark, and it
dropped from ~800 files/sec on ext4 to ~100 files/sec.  It turns out
that dm-multipath always advertised flush+fua support, and passed
commands on down the stack, where those flags used to get stripped off.
The above commit changed that behavior:

static inline struct request *__elv_next_request(struct request_queue *q)
{
        struct request *rq;

        while (1) {
-               while (!list_empty(&q->queue_head)) {
+               if (!list_empty(&q->queue_head)) {
                        rq = list_entry_rq(q->queue_head.next);
-                       if (!(rq->cmd_flags & (REQ_FLUSH | REQ_FUA)) ||
-                           (rq->cmd_flags & REQ_FLUSH_SEQ))
-                               return rq;
-                       rq = blk_do_flush(q, rq);
-                       if (rq)
-                               return rq;
+                       return rq;
                }

Note that previously, a command would come in here, have
REQ_FLUSH|REQ_FUA set, and then get handed off to blk_do_flush:

struct request *blk_do_flush(struct request_queue *q, struct request *rq)
{
        unsigned int fflags = q->flush_flags; /* may change, cache it */
        bool has_flush = fflags & REQ_FLUSH, has_fua = fflags & REQ_FUA;
        bool do_preflush = has_flush && (rq->cmd_flags & REQ_FLUSH);
        bool do_postflush = has_flush && !has_fua && (rq->cmd_flags &
        REQ_FUA);
        unsigned skip = 0;
...
        if (blk_rq_sectors(rq) && !do_preflush && !do_postflush) {
                rq->cmd_flags &= ~REQ_FLUSH;
		if (!has_fua)
			rq->cmd_flags &= ~REQ_FUA;
	        return rq;
	}

So, the flush machinery was bypassed in such cases (q->flush_flags == 0
&& rq->cmd_flags & (REQ_FLUSH|REQ_FUA)).

Now, however, we don't get into the flush machinery at all.  Instead,
__elv_next_request just hands a request with flush and fua bits set to
the scsi_request_fn, even if the underlying request_queue does not
support flush or fua.

The agreed upon approach is to fix the flush machinery to allow
stacking.  While this isn't used in practice (since there is only one
request-based dm target, and that target will now reflect the flush
flags of the underlying device), it does future-proof the solution, and
make it function as designed.

In order to make this work, I had to add a field to the struct request,
inside the flush structure (to store the original req->end_io).  Shaohua
had suggested overloading the union with rb_node and completion_data,
but the completion data is used by device mapper and can also be used by
other drivers.  So, I didn't see a way around the additional field.

I tested this patch on an HP EVA with both ext4 and xfs, and it recovers
the lost performance.  Comments and other testers, as always, are
appreciated.

Cheers,
Jeff

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-08-15 21:37:25 +02:00
Shaohua Li bcf30e75b7 block: improve rq_affinity placement
This patch reverts commit 35ae66e0a09ab70ed(block: Make rq_affinity = 1
work as expected). The purpose is to avoid an unnecessary IPI.
Let's take an example. My test box has cpu 0-7, one socket. Say request is
added from CPU 1, blk_complete_request() occurs at CPU 7. Without the reverted
patch, softirq will be done at CPU 7. With it, an IPI will be directed to CPU
0, and softirq will be done at CPU 0. In this case, doing softirq at CPU 0 and
CPU 7 have no difference from cache sharing point view and we can avoid an
ipi if doing it in CPU 7.
An immediate concern is this is just like QUEUE_FLAG_SAME_FORCE, but actually
not. blk_complete_request() is running in interrupt handler, and currently
I/O controller doesn't support multiple interrupts (I checked several LSI
cards and AHCI), so only one CPU can run blk_complete_request(). This is
still quite different as QUEUE_FLAG_SAME_FORCE.
Since only one CPU runs softirq, the only difference with below patch is
softirq not always runs at the first CPU of a group.

Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-08-11 10:39:04 +02:00
Jeff Moyer fa1bf42ff9 allow blk_flush_policy to return REQ_FSEQ_DATA independent of *FLUSH
blk_insert_flush has the following check:

	/*
	 * If there's data but flush is not necessary, the request can be
	 * processed directly without going through flush machinery.  Queue
	 * for normal execution.
	 */
	if ((policy & REQ_FSEQ_DATA) &&
	    !(policy & (REQ_FSEQ_PREFLUSH | REQ_FSEQ_POSTFLUSH))) {
		list_add_tail(&rq->queuelist, &q->queue_head);
		return;
	}

However, blk_flush_policy will not return with policy set to only
REQ_FSEQ_DATA:

static unsigned int blk_flush_policy(unsigned int fflags, struct request *rq)
{
	unsigned int policy = 0;

	if (fflags & REQ_FLUSH) {
		if (rq->cmd_flags & REQ_FLUSH)
			policy |= REQ_FSEQ_PREFLUSH;
		if (blk_rq_sectors(rq))
			policy |= REQ_FSEQ_DATA;
		if (!(fflags & REQ_FUA) && (rq->cmd_flags & REQ_FUA))
			policy |= REQ_FSEQ_POSTFLUSH;
	}
	return policy;
}

Notice that REQ_FSEQ_DATA is only set if REQ_FLUSH is set.  Fix this
mismatch by moving the setting of REQ_FSEQ_DATA outside of the REQ_FLUSH
check.

Tejun notes:

  Hmmm... yes, this can become a correctness issue if (and only if)
  blk_queue_flush() is called to change q->flush_flags while requests
  are in-flight; otherwise, requests wouldn't reach the function at all.
  Also, I think it would be a generally good idea to always set
  FSEQ_DATA if the request has data.

Cheers,
Jeff

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-08-09 20:32:09 +02:00
Tao Ma 35ae66e0a0 block: Make rq_affinity = 1 work as expected
Commit 5757a6d76c introduced a new rq_affinity = 2 so as to make
the request completed in the __make_request cpu. But it makes the
old rq_affinity = 1 not work any more. The root cause is that
if the 'cpu' and 'req->cpu' is in the same group and cpu != req->cpu,
ccpu will be the same as group_cpu, so the completion will be
excuted in the 'cpu' not 'group_cpu'.

This patch fix problem by simpling removing group_cpu and the codes
are more explicit now. If ccpu == cpu, we complete in cpu, otherwise
we raise_blk_irq to ccpu.

Cc: Christoph Hellwig <hch@infradead.org>
Cc: Roland Dreier <roland@purestorage.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Jens Axboe <jaxboe@fusionio.com>
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Reviewed-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-08-05 09:37:47 +02:00
Akinobu Mita dd48c085c1 fault-injection: add ability to export fault_attr in arbitrary directory
init_fault_attr_dentries() is used to export fault_attr via debugfs.
But it can only export it in debugfs root directory.

Per Forlin is working on mmc_fail_request which adds support to inject
data errors after a completed host transfer in MMC subsystem.

The fault_attr for mmc_fail_request should be defined per mmc host and
export it in debugfs directory per mmc host like
/sys/kernel/debug/mmc0/mmc_fail_request.

init_fault_attr_dentries() doesn't help for mmc_fail_request.  So this
introduces fault_create_debugfs_attr() which is able to create a
directory in the arbitrary directory and replace
init_fault_attr_dentries().

[akpm@linux-foundation.org: extraneous semicolon, per Randy]
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Tested-by: Per Forlin <per.forlin@linaro.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Randy Dunlap <rdunlap@xenotime.net>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-08-03 14:25:20 -10:00
Herbert Poetzl f95fe9cfb4 block/genhd.c: remove useless cast in diskstats_show()
Remove the (unsigned long long) cast in diskstats_show() and adjusts the
seq_printf() format string to 'unsigned long'

diskstats_show() uses part_stat_read() to get the stats, which either
accesses the specified field in the struct disk_stats directly (non SMP)
or sums up the per CPU values in a variable of the same type as the field,
so in any case the result will have the same type and range as the
specified field which for all disk_stats entries is unsigned long

Also, for unsigned long ranges the output of %lu should be identical to
the one of %llu, so no change in the actual proc entry contents.

Signed-off-by: Herbert Poetzl <herbert@13thfloor.at>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-08-02 12:43:50 +02:00
Jens Axboe e2a5429ff7 bsg-lib: add module.h include
Due to conflicts with the moduleh tree in linux-next, we
run into an include file mess. We really need export.h
in that tree, but if we add module.h locally then the
issue is easier to resolve.

Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-08-02 10:43:35 +02:00
Vivek Goyal a5395b83b7 cfq-iosched: Reduce linked group count upon group destruction
FQ keeps track of number of groups which are linked on blkcg->blkg_list.
This is useful to avoid races between queue exit and cgroup exit code
paths. So if at the request queue exit time linked group count is not
zero, that means there are some group out there which is yet to be
deleted under rcu read period and queue exit code should wait for
on rcu period.

In my previous patch I forgot to decrease the number of group count.
So in current form, we nr_blkcg_linked_grps is always non-zero and
we will always wait one rcu period (if BLK_CGROUP=y). The side effect
of this is that it can increase boot time. I am surprised, nobody
complained so far.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-08-02 09:24:09 +02:00
Shaohua Li e5a94f5684 blk-throttle: correctly determine sync bio
read request is always sync. Using rw_is_sync() to determine
if a bio is sync.

Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-08-01 10:31:06 +02:00
Mike Christie aa387cc895 block: add bsg helper library
This moves the FC classes bsg code to the block layer and
makes it a lib so that other classes like iscsi and SAS can use it.

It is helpful because working with the request queue, bios,
creating scatterlists, etc are a pain that the LLD does not
have to worry about with normal IOs and should not have to
worry about for bsg requests.

Signed-off-by: Mike Christie <michaelc@cs.wisc.edu>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-07-31 22:05:09 +02:00
Akinobu Mita b2c9cd3793 fail_make_request: cleanup should_fail_request
This changes should_fail_request() to more usable wrapper function of
should_fail().  It can avoid putting #ifdef CONFIG_FAIL_MAKE_REQUEST in
the middle of a function.

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-07-26 16:49:46 -07:00
Jens Axboe 11ccf116d0 block: fix warning with calling smp_processor_id() in preemptible section
After commit 5757a6d7 introduced an unsafe calling of
smp_processor_id(), with preempt debuggin turned on we spew a lot of:

BUG: using smp_processor_id() in preemptible [00000000] code: kjournald/514
caller is __make_request+0x1b8/0x308
[<c0019f44>] (unwind_backtrace+0x0/0xe8) from [<c024b4cc>] (debug_smp_processor_id+0xbc/0xf0)
[<c024b4cc>] (debug_smp_processor_id+0xbc/0xf0) from [<c0223d14>] (__make_request+0x1b8/0x308)
[<c0223d14>] (__make_request+0x1b8/0x308) from [<c02215ac>] (generic_make_request+0x4dc/0x558)
[<c02215ac>] (generic_make_request+0x4dc/0x558) from [<c022173c>] (submit_bio+0x114/0x138)
[<c022173c>] (submit_bio+0x114/0x138) from [<c011f504>] (submit_bh+0x148/0x16c)
[<c011f504>] (submit_bh+0x148/0x16c) from [<c0121ed8>] (__sync_dirty_buffer+0x88/0xd8)
[<c0121ed8>] (__sync_dirty_buffer+0x88/0xd8) from [<c01aff78>] (journal_commit_transaction+0x1198/0x1688)
[<c01aff78>] (journal_commit_transaction+0x1198/0x1688) from [<c01b4034>] (kjournald+0xb4/0x224)
[<c01b4034>] (kjournald+0xb4/0x224) from [<c0069ea0>] (kthread+0x8c/0x94)
[<c0069ea0>] (kthread+0x8c/0x94) from [<c00137f8>] (kernel_thread_exit+0x0/0x8)

Fix this by just using raw_smp_processor_id(), it's just a hint
after all. There's no pinning of the CPU or accessing per-cpu
structures involved.

Reported-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-07-26 15:01:15 +02:00
Linus Torvalds 8ded371f81 Merge branch 'for-3.1/drivers' of git://git.kernel.dk/linux-block
* 'for-3.1/drivers' of git://git.kernel.dk/linux-block:
  cciss: do not attempt to read from a write-only register
  xen/blkback: Add module alias for autoloading
  xen/blkback: Don't let in-flight requests defer pending ones.
  bsg: fix address space warning from sparse
  bsg: remove unnecessary conditional expressions
  bsg: fix bsg_poll() to return POLLOUT properly
2011-07-25 10:38:18 -07:00
Linus Torvalds 096a705bbc Merge branch 'for-3.1/core' of git://git.kernel.dk/linux-block
* 'for-3.1/core' of git://git.kernel.dk/linux-block: (24 commits)
  block: strict rq_affinity
  backing-dev: use synchronize_rcu_expedited instead of synchronize_rcu
  block: fix patch import error in max_discard_sectors check
  block: reorder request_queue to remove 64 bit alignment padding
  CFQ: add think time check for group
  CFQ: add think time check for service tree
  CFQ: move think time check variables to a separate struct
  fixlet: Remove fs_excl from struct task.
  cfq: Remove special treatment for metadata rqs.
  block: document blk_plug list access
  block: avoid building too big plug list
  compat_ioctl: fix make headers_check regression
  block: eliminate potential for infinite loop in blkdev_issue_discard
  compat_ioctl: fix warning caused by qemu
  block: flush MEDIA_CHANGE from drivers on close(2)
  blk-throttle: Make total_nr_queued unsigned
  block: Add __attribute__((format(printf...) and fix fallout
  fs/partitions/check.c: make local symbols static
  block:remove some spare spaces in genhd.c
  block:fix the comment error in blkdev.h
  ...
2011-07-25 10:33:36 -07:00
Dan Williams 5757a6d76c block: strict rq_affinity
Some systems benefit from completions always being steered to the strict
requester cpu rather than the looser "per-socket" steering that
blk_cpu_to_group() attempts by default. This is because the first
CPU in the group mask ends up being completely overloaded with work,
while the others (including the original submitter) has power left
to spare.

Allow the strict mode to be set by writing '2' to the sysfs control
file. This is identical to the scheme used for the nomerges file,
where '2' is a more aggressive setting than just being turned on.

echo 2 > /sys/block/<bdev>/queue/rq_affinity

Cc: Christoph Hellwig <hch@infradead.org>
Cc: Roland Dreier <roland@purestorage.com>
Tested-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-07-23 20:44:25 +02:00
Jens Axboe 4c64500ead block: fix patch import error in max_discard_sectors check
A '!' snuck in before the unlikely, rendering it useless.

Reported-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-07-23 20:34:59 +02:00
Linus Torvalds d4e06701b8 Merge git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6: (77 commits)
  [SCSI] fix crash in scsi_dispatch_cmd()
  [SCSI] sr: check_events() ignore GET_EVENT when TUR says otherwise
  [SCSI] bnx2i: Fixed kernel panic due to illegal usage of sc->request->cpu
  [SCSI] bfa: Update the driver version to 3.0.2.1
  [SCSI] bfa: Driver and BSG enhancements.
  [SCSI] bfa: Added support to query PHY.
  [SCSI] bfa: Added HBA diagnostics support.
  [SCSI] bfa: Added support for flash configuration
  [SCSI] bfa: Added support to obtain SFP info.
  [SCSI] bfa: Added support for CEE info and stats query.
  [SCSI] bfa: Extend BSG interface.
  [SCSI] bfa: FCS bug fixes.
  [SCSI] bfa: DMA memory allocation enhancement.
  [SCSI] bfa: Brocade-1860 Fabric Adapter vHBA support.
  [SCSI] bfa: Brocade-1860 Fabric Adapter PLL init fixes.
  [SCSI] bfa: Added Fabric Assigned Address(FAA) support
  [SCSI] bfa: IOC bug fixes.
  [SCSI] bfa: Enable ASIC block configuration and query.
  [SCSI] bnx2i: Updated copyright and bump version
  [SCSI] bnx2i: Modified to skip CNIC registration if iSCSI is not supported
  ...

Fix up some trivial conflicts in:
 - drivers/scsi/bnx2fc/{bnx2fc.h,bnx2fc_fcoe.c}:
	Crazy broadcom version number conflicts
 - drivers/target/tcm_fc/tfc_cmd.c
	Just trivial cleanups done on adjacent lines
2011-07-23 11:13:11 -07:00
James Bottomley bfe159a512 [SCSI] fix crash in scsi_dispatch_cmd()
USB surprise removal of sr is triggering an oops in
scsi_dispatch_command().  What seems to be happening is that USB is
hanging on to a queue reference until the last close of the upper
device, so the crash is caused by surprise remove of a mounted CD
followed by attempted unmount.

The problem is that USB doesn't issue its final commands as part of
the SCSI teardown path, but on last close when the block queue is long
gone.  The long term fix is probably to make sr do the teardown in the
same way as sd (so remove all the lower bits on ejection, but keep the
upper disk alive until last close of user space).  However, the
current oops can be simply fixed by not allowing any commands to be
sent to a dead queue.

Cc: stable@kernel.org
Signed-off-by: James Bottomley <JBottomley@Parallels.com>
2011-07-21 14:21:18 -07:00
Lai Jiangshan 57bdfbf9ee block,rcu: Convert call_rcu(disk_free_ptbl_rcu_cb) to kfree_rcu()
The rcu callback disk_free_ptbl_rcu_cb() just calls a kfree(),
so we use kfree_rcu() instead of the call_rcu(disk_free_ptbl_rcu_cb).

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2011-07-20 14:10:13 -07:00
Shaohua Li 7700fc4f67 CFQ: add think time check for group
Currently when the last queue of a group has no request, we don't expire
the queue to hope request from the group comes soon, so the group doesn't
miss its share. But if the think time is big, the assumption isn't correct
and we just waste bandwidth. In such case, we don't do idle.

[global]
runtime=30
direct=1

[test1]
cgroup=test1
cgroup_weight=1000
rw=randread
ioengine=libaio
size=500m
runtime=30
directory=/mnt
filename=file1
thinktime=9000

[test2]
cgroup=test2
cgroup_weight=1000
rw=randread
ioengine=libaio
size=500m
runtime=30
directory=/mnt
filename=file2

	patched		base
test1	64k		39k
test2	548k		540k
total	604k		578k

group1 gets much better throughput because it waits less time.

To check if the patch changes behavior of queue without think time. I also
tried to give test1 2ms think time or no think time. The test result is stable.
The thoughput doesn't change with/without the patch.

Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-07-12 14:24:56 +02:00
Shaohua Li f5f2b6ceb2 CFQ: add think time check for service tree
Currently when the last queue of a service tree has no request, we don't
expire the queue to hope request from the service tree comes soon, so the
service tree doesn't miss its share. But if the think time is big, the
assumption isn't correct and we just waste bandwidth. In such case, we
don't do idle.

[global]
runtime=10
direct=1

[test1]
rw=randread
ioengine=libaio
size=500m
directory=/mnt
filename=file1
thinktime=9000

[test2]
rw=read
ioengine=libaio
size=1G
directory=/mnt
filename=file2

	patched		base
test1	41k/s		33k/s
test2	15868k/s	15789k/s
total	15902k/s	15817k/s

A slightly better

To check if the patch changes behavior of queue without think time. I also
tried to give test1 2ms think time or no think time. The test has variation
even without the patch, but the average throughput doesn't change with/without
the patch.

Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-07-12 14:24:55 +02:00
Shaohua Li 383cd7213f CFQ: move think time check variables to a separate struct
Move the variables to do think time check to a sepatate struct. This is
to prepare adding think time check for service tree and group. No
functional change.

Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-07-12 14:24:35 +02:00
Justin TerAvest 4aede84b33 fixlet: Remove fs_excl from struct task.
fs_excl is a poor man's priority inheritance for filesystems to hint to
the block layer that an operation is important. It was never clearly
specified, not widely adopted, and will not prevent starvation in many
cases (like across cgroups).

fs_excl was introduced with the time sliced CFQ IO scheduler, to
indicate when a process held FS exclusive resources and thus needed
a boost.

It doesn't cover all file systems, and it was never fully complete.
Lets kill it.

Signed-off-by: Justin TerAvest <teravest@google.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-07-12 08:35:10 +02:00
Justin TerAvest a07405b780 cfq: Remove special treatment for metadata rqs.
There is no consistency among filesystems from what bios (or requests)
are marked as being metadata. It's interesting to expose this in traces,
but we shouldn't schedule the requests differently based on whether or
not they're marked as being metadata.

Signed-off-by: Justin TerAvest <teravest@google.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-07-10 22:09:19 +02:00
Shaohua Li 55c022bbdd block: avoid building too big plug list
When I test fio script with big I/O depth, I found the total throughput drops
compared to some relative small I/O depth. The reason is the thread accumulates
big requests in its plug list and causes some delays (surely this depends
on CPU speed).
I thought we'd better have a threshold for requests. When a threshold reaches,
this means there is no request merge and queue lock contention isn't severe
when pushing per-task requests to queue, so the main advantages of blk plug
don't exist. We can force a plug list flush in this case.
With this, my test throughput actually increases and almost equals to small
I/O depth. Another side effect is irq off time decreases in blk_flush_plug_list()
for big I/O depth.
The BLK_MAX_REQUEST_COUNT is choosen arbitarily, but 16 is efficiently to
reduce lock contention to me. But I'm open here, 32 is ok in my test too.

Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-07-08 08:19:20 +02:00
Mike Snitzer 0f79960391 block: eliminate potential for infinite loop in blkdev_issue_discard
Due to the recently identified overflow in read_capacity_16() it was
possible for max_discard_sectors to be zero but still have discards
enabled on the associated device's queue.

Eliminate the possibility for blkdev_issue_discard to infinitely loop.

Interestingly this issue wasn't identified until a device, whose
discard_granularity was 0 due to read_capacity_16 overflow, was consumed
by blk_stack_limits() to construct limits for a higher-level DM
multipath device.  The multipath device's resulting limits never had the
discard limits stacked because blk_stack_limits() will only do so if
the bottom device's discard_granularity != 0.  This resulted in the
multipath device's limits.max_discard_sectors being 0.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-07-06 21:32:02 +02:00
Johannes Stezenbach 390192b300 compat_ioctl: fix warning caused by qemu
On Linux x86_64 host with 32bit userspace, running
qemu or even just "qemu-img create -f qcow2 some.img 1G"
causes a kernel warning:

ioctl32(qemu-img:5296): Unknown cmd fd(3) cmd(00005326){t:'S';sz:0} arg(7fffffff) on some.img
ioctl32(qemu-img:5296): Unknown cmd fd(3) cmd(801c0204){t:02;sz:28} arg(fff77350) on some.img

ioctl 00005326 is CDROM_DRIVE_STATUS,
ioctl 801c0204 is FDGETPRM.

The warning appears because the Linux compat-ioctl handler for these
ioctls only applies to block devices, while qemu also uses the ioctls on
plain files.

Signed-off-by: Johannes Stezenbach <js@sig21.net>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-07-01 22:32:26 +02:00
Tejun Heo 85ef06d1d2 block: flush MEDIA_CHANGE from drivers on close(2)
Currently, only open(2) is defined as the 'clearing' point.  It has
two roles - first, it's an acknowledgement from userland indicating
that the event has been received and kernel can clear pending states
and proceed to generate more events.  Secondly, it's passed on to
device drivers as a hint indicating that a synchronization point has
been reached and it might want to take a deeper look at the device.

The latter currently is only used by sr which uses two different
mechanisms - GET_EVENT_MEDIA_STATUS_NOTIFICATION and TEST_UNIT_READY
to discover events, where the former is lighter weight and safe to be
used repeatedly but may not provide full coverage.  Among other
things, GET_EVENT can't detect media removal while TUR can.

This patch makes close(2) - blkdev_put() - indicate clearing hint for
MEDIA_CHANGE to drivers.  disk_check_events() is renamed to
disk_flush_events() and updated to take @mask for events to flush
which is or'd to ev->clearing and will be passed to the driver on the
next ->check_events() invocation.

This change makes sr generate MEDIA_CHANGE when media is ejected from
userland - e.g. with eject(1).

Note: Given the current usage, it seems @clearing hint is needlessly
complex.  disk_clear_events() can simply clear all events and the hint
can be boolean @flush.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-07-01 16:17:47 +02:00
Jens Axboe 04bf7869ca Merge branch 'for-linus' into for-3.1/core
Conflicts:
	block/blk-throttle.c
	block/cfq-iosched.c

Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-07-01 16:17:13 +02:00
Shaohua Li 726e99ab88 cfq-iosched: make code consistent
ioc->ioc_data is rcu protectd, so uses correct API to access it.
This doesn't change any behavior, but just make code consistent.

Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Cc: stable@kernel.org # after ab4bd22d
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-06-27 09:36:06 +02:00
Shaohua Li 3181faa85b cfq-iosched: fix a rcu warning
I got a rcu warnning at boot. the ioc->ioc_data is rcu_deferenced, but
doesn't hold rcu_read_lock.

Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Cc: stable@kernel.org # after ab4bd22d
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-06-27 09:36:06 +02:00
Namhyung Kim 2b727c6300 bsg: fix address space warning from sparse
copy_from/to_user() and blk_rq_map_user() want __user pointer.
This patch fixes following warnings from sparse:

   CHECK   block/bsg.c
 block/bsg.c:185:38: warning: incorrect type in argument 2 (different address spaces)
 block/bsg.c:185:38:    expected void const [noderef] <asn:1>*from
 block/bsg.c:185:38:    got void *<noident>
 block/bsg.c:295:58: warning: incorrect type in argument 4 (different address spaces)
 block/bsg.c:295:58:    expected void [noderef] <asn:1>*<noident>
 block/bsg.c:295:58:    got void *[assigned] dxferp
 block/bsg.c:311:52: warning: incorrect type in argument 4 (different address spaces)
 block/bsg.c:311:52:    expected void [noderef] <asn:1>*<noident>
 block/bsg.c:311:52:    got void *[assigned] dxferp
 block/bsg.c:448:37: warning: incorrect type in argument 1 (different address spaces)
 block/bsg.c:448:37:    expected void [noderef] <asn:1>*dst
 block/bsg.c:448:37:    got void *<noident>

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Acked-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-06-20 13:27:45 +02:00
Namhyung Kim 44194e3e88 bsg: remove unnecessary conditional expressions
Second condition in OR always implies first condition is false
thus bytes_read in the second is not needed. The same goes to
bytes_written.

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Acked-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-06-20 13:27:44 +02:00
Namhyung Kim 80ceb05713 bsg: fix bsg_poll() to return POLLOUT properly
POLLOUT should be returned only if bd->queued_cmds < bd->max_queue
so that bsg_alloc_command() can proceed.

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Acked-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-06-20 13:27:44 +02:00
Joe Perches d2f31a5fd6 blk-throttle: Make total_nr_queued unsigned
The total of two unsigned values should also be unsigned.

Update throtl_log output to unsigned.
Update total_nr_queued test to non-zero to be the
same as the other total_nr_queued tests.

Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-06-13 20:19:27 +02:00
Joe Perches fd16d26319 block: Add __attribute__((format(printf...) and fix fallout
Use the compiler to verify format strings and arguments.

Fix fallout.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-06-13 20:18:49 +02:00
Wanlong Gao 9f5e486550 block:remove some spare spaces in genhd.c
Remove the end-of-line spaces in genhd.c.

Signed-off-by: Wanlong Gao <wanlong.gao@gmail.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-06-13 10:45:43 +02:00
Joe Perches 08e8138ade block: Add __attribute__((format(printf...) and fix fallout
Use the compiler to verify format strings and arguments.

Fix fallout.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-06-13 10:42:49 +02:00
Tejun Heo fdd514e16b block: make disk_block_events() properly wait for work cancellation
disk_block_events() should guarantee that the event work is not in
flight on return and once blocked it shouldn't issue further
cancellations.

Because there was no synchronization between the first blocker doing
cancel_delayed_work_sync() and the following blockers, the following
blockers could finish before cancellation was complete, which broke
both guarantees - event work could be in flight and cancellation could
happen after return.

This bug triggered WARN_ON_ONCE() in disk_clear_events() reported in
bug#34662.

  https://bugzilla.kernel.org/show_bug.cgi?id=34662

Fix it by adding an outer mutex which protects both block count
manipulation and work cancellation.

-v2: Use outer mutex instead of bit waitqueue per Linus.

Signed-off-by: Tejun Heo <tj@kernel.org>
Tested-by: Sitsofe Wheeler <sitsofe@yahoo.com>
Reported-by: Sitsofe Wheeler <sitsofe@yahoo.com>
Reported-by: Borislav Petkov <bp@alien8.de>
Reported-by: Meelis Roos <mroos@linux.ee>
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-06-09 20:43:59 +02:00
Tejun Heo c3af54afba block: remove non-syncing __disk_block_events() and fold it into disk_block_events()
After the previous update to disk_check_events(), nobody is using
non-syncing __disk_block_events().  Remove @sync and, as this makes
__disk_block_events() virtually identical to disk_block_events(),
remove the underscore prefixed version.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-06-09 20:43:55 +02:00
Tejun Heo a9dce2a3b4 block: don't use non-syncing event blocking in disk_check_events()
This patch is part of fix for triggering of WARN_ON_ONCE() in
disk_clear_events() reported in bug#34662.

  https://bugzilla.kernel.org/show_bug.cgi?id=34662

disk_clear_events() blocks events, schedules and flushes the event
work.  It expects the work to have started execution on schedule and
finished on return from flush.  WARN_ON_ONCE() triggers if the event
work hasn't executed as expected.  This problem happens because
__disk_block_events() fails to guarantee that the event work item is
not in flight on return from the function in race-free manner.  The
problem is two-fold and this patch addresses one of them.

When __disk_block_events() is called with @sync == %false, it bumps
event block count, calls cancel_delayed_work() and return.  This makes
it impossible to guarantee that event polling is not in flight on
return from syncing __disk_block_events() - if the first blocker was
non-syncing, polling could still be in progress and later syncing ones
would assume that the first blocker already canceled it.

Making __disk_block_events() cancel_sync regardless of block count
isn't feasible either as it may race with forced event checking in
disk_clear_events().

As disk_check_events() is the only user of non-syncing
__disk_block_events(), updating it to directly cancel and schedule
event work is the easiest way to solve the issue.

Note that there's another bug in __disk_block_events() and this patch
doesn't fix the issue completely.  Later patch will fix the other bug.

Signed-off-by: Tejun Heo <tj@kernel.org>
Tested-by: Sitsofe Wheeler <sitsofe@yahoo.com>
Reported-by: Sitsofe Wheeler <sitsofe@yahoo.com>
Reported-by: Borislav Petkov <bp@alien8.de>
Reported-by: Meelis Roos <mroos@linux.ee>
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-06-09 20:43:54 +02:00
Paul Bolle df4156569d block: rename the return of two functions
If we rename the return of alloc_io_context() and get_io_context() from
"ret" to "ioc" the code get's (a bit) more readable and (a lot) more
grepable.

Signed-off-by: Paul Bolle <pebolle@tiscali.nl>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-06-06 05:57:25 +02:00
Paul Bolle 8aea45451b CFQ: make two functions static
Correctly suggested by sparse.

Signed-off-by: Paul Bolle <pebolle@tiscali.nl>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-06-06 05:57:25 +02:00
Jens Axboe 9b50902db5 cfq-iosched: fix locking around ioc->ioc_data assignment
Since we are modifying this RCU pointer, we need to hold
the lock protecting it around it.

This fixes a potential reuse and double free of a cfq
io_context structure. The bug has been in CFQ for a long
time, it hit very few people but those it did hit seemed
to see it a lot.

Tracked in RH bugzilla here:

https://bugzilla.redhat.com/show_bug.cgi?id=577968

Credit goes to Paul Bolle for figuring out that the issue
was around the one-hit ioc->ioc_data cache. Thanks to his
hard work the issue is now fixed.

Cc: stable@kernel.org
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-06-06 05:57:21 +02:00
Jens Axboe ab4bd22d3c cfq-iosched: fix locking around ioc->ioc_data assignment
Since we are modifying this RCU pointer, we need to hold
the lock protecting it around it.

This fixes a potential reuse and double free of a cfq
io_context structure. The bug has been in CFQ for a long
time, it hit very few people but those it did hit seemed
to see it a lot.

Tracked in RH bugzilla here:

https://bugzilla.redhat.com/show_bug.cgi?id=577968

Credit goes to Paul Bolle for figuring out that the issue
was around the one-hit ioc->ioc_data cache. Thanks to his
hard work the issue is now fixed.

Cc: stable@kernel.org
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-06-06 05:56:49 +02:00
Jeff Moyer 796d5116c4 iosched: prevent aliased requests from starving other I/O
Hi, Jens,

If you recall, I posted an RFC patch for this back in July of last year:
http://lkml.org/lkml/2010/7/13/279

The basic problem is that a process can issue a never-ending stream of
async direct I/Os to the same sector on a device, thus starving out
other I/O in the system (due to the way the alias handling works in both
cfq and deadline).  The solution I proposed back then was to start
dispatching from the fifo after a certain number of aliases had been
dispatched.  Vivek asked why we had to treat aliases differently at all,
and I never had a good answer.  So, I put together a simple patch which
allows aliases to be added to the rb tree (it adds them to the right,
though that doesn't matter as the order isn't guaranteed anyway).  I
think this is the preferred solution, as it doesn't break up time slices
in CFQ or batches in deadline.  I've tested it, and it does solve the
starvation issue.  Let me know what you think.

Cheers,
Jeff

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-06-02 21:19:05 +02:00
Paul Bolle e2bd9678fc block: Use hlist_entry() for io_context.cic_list.first
list_entry() and hlist_entry() are both simply aliases for
container_of(), but since io_context.cic_list.first is an hlist_node one
should at least use the correct alias.

Signed-off-by: Paul Bolle <pebolle@tiscali.nl>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-06-02 13:05:02 +02:00
Paul Bolle 28304f485c cfq-iosched: Remove bogus check in queue_fail path
queue_fail can only be reached if cic is NULL, so its check for cic must
be bogus.

Signed-off-by: Paul Bolle <pebolle@tiscali.nl>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-06-02 13:05:02 +02:00
Kyungmin Park 4495a7d41d CFQ: Fix typo and remove unnecessary semicolon
Fix comment typo and remove unnecessary semicolon at macro

Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-05-31 19:49:44 +02:00
Linus Torvalds bdf7cf1c83 Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-block
* 'for-linus' of git://git.kernel.dk/linux-2.6-block:
  loop: export module parameters
  block: export blk_{get,put}_queue()
  block: remove unused variable in bio_attempt_front_merge()
  block: always allocate genhd->ev if check_events is implemented
  brd: export module parameters
  brd: fix comment on initial device creation
  brd: handle on-demand devices correctly
  brd: limit 'max_part' module param to DISK_MAX_PARTS
  brd: get rid of unused members from struct brd_device
  block: fix oops on !disk->queue and sysfs discard alignment display
2011-05-27 10:24:40 -07:00
Jens Axboe d86e0e83b3 block: export blk_{get,put}_queue()
We need them in SCSI to fix a bug, but currently they are not
exported to modules. Export them.

Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-05-27 07:45:45 +02:00
Ben Blum f780bdb7c1 cgroups: add per-thread subsystem callbacks
Add cgroup subsystem callbacks for per-thread attachment in atomic contexts

Add can_attach_task(), pre_attach(), and attach_task() as new callbacks
for cgroups's subsystem interface.  Unlike can_attach and attach, these
are for per-thread operations, to be called potentially many times when
attaching an entire threadgroup.

Also, the old "bool threadgroup" interface is removed, as replaced by
this.  All subsystems are modified for the new interface - of note is
cpuset, which requires from/to nodemasks for attach to be globally scoped
(though per-cpuset would work too) to persist from its pre_attach to
attach_task and attach.

This is a pre-patch for cgroup-procs-writable.patch.

Signed-off-by: Ben Blum <bblum@andrew.cmu.edu>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Matt Helsley <matthltc@us.ibm.com>
Reviewed-by: Paul Menage <menage@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-26 17:12:34 -07:00
Luca Tettamanti 700c4f3325 block: remove unused variable in bio_attempt_front_merge()
sector is never read inside the function.

Signed-off-by: Luca Tettamanti <kronos.it@gmail.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-05-26 21:07:26 +02:00
Tejun Heo 75e3f3ee3c block: always allocate genhd->ev if check_events is implemented
9fd097b149 (block: unexport DISK_EVENT_MEDIA_CHANGE for legacy/fringe
drivers) removed DISK_EVENT_MEDIA_CHANGE from legacy/fringe block
drivers which have inadequate ->check_events().  Combined with earlier
change 7c88a168da (block: don't propagate unlisted DISK_EVENTs to
userland), this enables using ->check_events() for internal processing
while avoiding enabling in-kernel block event polling which can lead
to infinite event loop.

Unfortunately, this made many drivers including floppy without any bit
set in disk->events and ->async_events in which case disk_add_events()
simply skipped allocation of disk->ev, which disables whole event
handling.  As ->check_events() is still used during open processing
for revalidation, this can lead to open failure.

This patch always allocates disk->ev if ->check_events is implemented.
In the long term, it would make sense to simply include the event
structure inline into genhd as it's now used by virtually all block
devices.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Ondrej Zary <linux@rainbow-software.org>
Reported-by: Alex Villacis Lasso <avillaci@ceibo.fiec.espol.edu.ec>
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-05-26 21:06:50 +02:00
Namhyung Kim 1547010e6e cfq-iosched: free cic_index if cfqd allocation fails
When struct cfq_data allocation fails, cic_index need to be freed.

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-05-24 10:23:22 +02:00
Namhyung Kim 20359f27e8 cfq-iosched: remove unused 'group_changed' in cfq_service_tree_add()
The 'group_changed' variable is initialized to 0 and never changed, so
checking the variable is meaningless.

It is a leftover from 0bbfeb8320 ("cfq-iosched: Always provide group
iosolation."). Let's get rid of it.

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Cc: Justin TerAvest <teravest@google.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-05-24 10:23:22 +02:00
Namhyung Kim 229836bd63 cfq-iosched: reduce bit operations in cfq_choose_req()
Reduce the number of bit operations in cfq_choose_req() on average
(and worst) cases.

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-05-24 10:23:21 +02:00
Namhyung Kim b9f8ce0599 cfq-iosched: algebraic simplification in cfq_prio_to_maxrq()
Simplify the calculation in cfq_prio_to_maxrq(), plus replace CFQ_PRIO_LISTS to
IOPRIO_BE_NR since they are the same and IOPRIO_BE_NR looks more reasonable in
this context IMHO.

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-05-24 10:23:21 +02:00
Vivek Goyal 4cbadbd16e blk-cgroup: Initialize ioc->cgroup_changed at ioc creation time
If we don't explicitly initialize it to zero, CFQ might think that
cgroup of ioc has changed and it generates lots of unnecessary calls
to call_for_each_cic(changed_cgroup). Fix it.

cfq_get_io_context()
  cfq_ioc_set_cgroup()
     call_for_each_cic(ioc, changed_cgroup)

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-05-23 19:35:04 +02:00
Vivek Goyal 95cf3dd9db block: call elv_bio_merged() when merged
Commit 73c1010119 ("block: initial patch for on-stack per-task plugging")
removed calls to elv_bio_merged() when @bio merged with @req. Re-add them.

This in turn will update merged stats in associated group. That
should be safe as long as request has got reference to the blkio_group.

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Cc: Divyesh Shah <dpshah@google.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-05-23 10:02:19 +02:00
Vivek Goyal 317389a773 cfq-iosched: Make IO merge related stats per cpu
Make BLKIO_STAT_MERGED per cpu hence gettring rid of need of taking
blkg->stats_lock.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-05-23 10:02:19 +02:00
Vivek Goyal 2abae55f5a cfq-iosched: Fix a memory leak of per cpu stats for root group
We allocated per cpu stats struct for root group but did not free it.
Fix it.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-05-23 10:02:19 +02:00
Jens Axboe 771949d03b block: get rid of on-stack plugging debug checks
We don't need them anymore, so kill:

- REQ_ON_PLUG checks in various places
- !rq_mergeable() check in plug merging

Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-05-20 20:52:16 +02:00
Jens Axboe 0eb8e88572 Merge branch 'for-linus' into for-2.6.40/core
This patch merges in a fix that missed 2.6.39 final.

Conflicts:
	block/blk.h
2011-05-20 20:36:16 +02:00
Vivek Goyal af75cd3c67 blk-throttle: Make no throttling rule group processing lockless
Currently we take a queue lock on each bio to check if there are any
throttling rules associated with the group and also update the stats.
Now access the group under rcu and update the stats without taking
the queue lock. Queue lock is taken only if there are throttling rules
associated with the group.

So the common case of root group when there are no rules, save
unnecessary pounding of request queue lock.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-05-20 20:34:53 +02:00
Vivek Goyal f0bdc8cdd9 blk-cgroup: Make cgroup stat reset path blkg->lock free for dispatch stats
Now dispatch stats update is lock free. But reset of these stats still
takes blkg->stats_lock and is dependent on that. As stats are per cpu,
we should be able to just reset the stats on each cpu without any locks.
(Atleast for 64bit arch).

On 32bit arch there is a small race where 64bit updates are not atomic.
The result of this race can be that in the presence of other writers,
one might not get 0 value after reset of a stat and might see something
intermediate

One can write more complicated code to cover this race like sending IPI
to other cpus to reset stats and for offline cpus, reset these directly.

Right not I am not taking that path because reset_update is more of a
debug feature and it can happen only on 32bit arch and possibility of
it happening is small. Will fix it if it becomes a real problem. For
the time being going for code simplicity.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-05-20 20:34:53 +02:00
Vivek Goyal 575969a0dd blk-cgroup: Make 64bit per cpu stats safe on 32bit arch
Some of the stats are 64bit and updation will be non atomic on 32bit
architecture. Use sequence counters on 32bit arch to make reading
of stats safe.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-05-20 20:34:53 +02:00
Vivek Goyal 5624a4e445 blk-throttle: Make dispatch stats per cpu
Currently we take blkg_stat lock for even updating the stats. So even if
a group has no throttling rules (common case for root group), we end
up taking blkg_lock, for updating the stats.

Make dispatch stats per cpu so that these can be updated without taking
blkg lock.

If cpu goes offline, these stats simply disappear. No protection has
been provided for that yet. Do we really need anything for that?

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-05-20 20:34:52 +02:00
Vivek Goyal 4843c69d49 blk-throttle: Free up a group only after one rcu grace period
Soon we will allow accessing a throtl_grp under rcu_read_lock(). Hence
start freeing up throtl_grp after one rcu grace period.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-05-20 20:34:52 +02:00
Vivek Goyal 5617cbef77 blk-throttle: Use helper function to add root throtl group to lists
Use same helper function for root group as we use with dynamically
allocated groups to add it to various lists.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-05-20 20:34:52 +02:00
Vivek Goyal 269f541555 blk-throttle: Introduce a helper function to fill in device details
A helper function for the code which is used at 2-3 places. Makes reading
code little easier.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-05-20 20:34:52 +02:00
Vivek Goyal 29b125892f blk-throttle: Dynamically allocate root group
Currently, we allocate root throtl_grp statically. But as we will be
introducing per cpu stat pointers and that will be allocated
dynamically even for root group, we might as well make whole root
throtl_grp allocation dynamic and treat it in same manner as other
groups.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-05-20 20:34:52 +02:00
Vivek Goyal f469a7b4d5 blk-cgroup: Allow sleeping while dynamically allocating a group
Currently, all the cfq_group or throtl_group allocations happen while
we are holding ->queue_lock and sleeping is not allowed.

Soon, we will move to per cpu stats and also need to allocate the
per group stats. As one can not call alloc_percpu() from atomic
context as it can sleep, we need to drop ->queue_lock, allocate the
group, retake the lock and continue processing.

In throttling code, I check the queue DEAD flag again to make sure
that driver did not call blk_cleanup_queue() in the mean time.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-05-20 20:34:52 +02:00
Vivek Goyal 56edf7d75d cfq-iosched: Fix a possible race with cfq cgroup removal code
blkg->key = cfqd is an rcu protected pointer and hence we used to do
call_rcu(cfqd->rcu_head) to free up cfqd after one rcu grace period.

The problem here is that even though cfqd is around, there are no
gurantees that associated request queue (td->queue) or q->queue_lock
is still around. A driver might have called blk_cleanup_queue() and
release the lock.

It might happen that after freeing up the lock we call
blkg->key->queue->queue_ock and crash. This is possible in following
path.

blkiocg_destroy()
 blkio_unlink_group_fn()
  cfq_unlink_blkio_group()

Hence, wait for an rcu peirod if there are groups which have not
been unlinked from blkcg->blkg_list. That way, if there are any groups
which are taking cfq_unlink_blkio_group() path, can safely take queue
lock.

This is how we have taken care of race in throttling logic also.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-05-20 20:34:52 +02:00
Vivek Goyal 3e59cf9d66 cfq-iosched: Get rid of redundant function parameter "create"
Nobody seems to be using cfq_find_alloc_cfqg() function parameter "create".
Get rid of that.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-05-20 20:34:52 +02:00
Vivek Goyal a23e686955 blk-cgroup: move some fields of unaccounted_time file under right config option
cgroup unaccounted_time file is created only if CONFIG_DEBUG_BLK_CGROUP=y.
there are some fields which are out side this config option. Fix that.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-05-20 20:34:52 +02:00
Vivek Goyal a29a171e7c blk-throttle: Do the new group initialization with the help of a function
Group initialization code seems to be at two places. root group
initialization in blk_throtl_init() and dynamically allocated group
in throtl_find_alloc_tg(). Create a common function and use at both
the places.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-05-20 20:34:51 +02:00
Jens Axboe 698567f3fa Merge commit 'v2.6.39' into for-2.6.40/core
Since for-2.6.40/core was forked off the 2.6.39 devel tree, we've
had churn in the core area that makes it difficult to handle
patches for eg cfq or blk-throttle. Instead of requiring that they
be based in older versions with bugs that have been fixed later
in the rc cycle, merge in 2.6.39 final.

Also fixes up conflicts in the below files.

Conflicts:
	drivers/block/paride/pcd.c
	drivers/cdrom/viocd.c
	drivers/ide/ide-cd.c

Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-05-20 20:33:15 +02:00
James Bottomley 0a58e077eb block: add proper state guards to __elv_next_request
blk_cleanup_queue() calls elevator_exit() and after this, we can't
touch the elevator without oopsing.  __elv_next_request() must check
for this state because in the refcounted queue model, we can still
call it after blk_cleanup_queue() has been called.

This was reported as causing an oops attributable to scsi.

Signed-off-by: James Bottomley <James.Bottomley@suse.de>
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-05-18 19:30:32 +02:00
Shaohua Li 3ec717b7ca block: don't delay blk_run_queue_async
Let's check a scenario:
1. blk_delay_queue(q, SCSI_QUEUE_DELAY);
2. blk_run_queue_async();
the second one will became a noop, because q->delay_work already has
WORK_STRUCT_PENDING_BIT set, so the delayed work will still run after
SCSI_QUEUE_DELAY. But blk_run_queue_async actually hopes the delayed
work runs immediately.

Fix this by doing a cancel on potentially pending delayed work
before queuing an immediate run of the workqueue.

Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-05-18 12:24:03 +02:00
Martin K. Petersen a934a00a69 block: Fix discard topology stacking and reporting
In some cases we would end up stacking discard_zeroes_data incorrectly.
Fix this by enabling the feature by default for stacking drivers and
clearing it for low-level drivers. Incorporating a device that does not
support dzd will then cause the feature to be disabled in the stacking
driver.

Also ensure that the maximum discard value does not overflow when
exported in sysfs and return 0 in the alignment and dzd fields for
devices that don't support discard.

Reported-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-05-18 10:37:35 +02:00
Vivek Goyal 70087dc38c blk-throttle: Use task_subsys_state() to determine a task's blkio_cgroup
Currentlly we first map the task to cgroup and then cgroup to
blkio_cgroup. There is a more direct way to get to blkio_cgroup
from task using task_subsys_state(). Use that.

The real reason for the fix is that it also avoids a race in generic
cgroup code. During remount/umount rebind_subsystems() is called and
it can do following with and rcu protection.

cgrp->subsys[i] = NULL;

That means if somebody got hold of cgroup under rcu and then it tried
to do cgroup->subsys[] to get to blkio_cgroup, it would get NULL which
is wrong. I was running into this race condition with ltp running on a
upstream derived kernel and that lead to crash.

So ideally we should also fix cgroup generic code to wait for rcu
grace period before setting pointer to NULL. Li Zefan is not very keen
on introducing synchronize_wait() as he thinks it will slow
down moun/remount/umount operations.

So for the time being atleast fix the kernel crash by taking a more
direct route to blkio_cgroup.

One tester had reported a crash while running LTP on a derived kernel
and with this fix crash is no more seen while the test has been
running for over 6 days.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-05-16 15:24:08 +02:00
Lukas Czerner 8af1954d17 blkdev: Do not return -EOPNOTSUPP if discard is supported
Currently we return -EOPNOTSUPP in blkdev_issue_discard() if any of the
bio fails due to underlying device not supporting discard request.
However, if the device is for example dm device composed of devices
which some of them support discard and some of them does not, it is ok
for some bios to fail with EOPNOTSUPP, but it does not mean that discard
is not supported at all.

This commit removes the check for bios failed with EOPNOTSUPP and change
blkdev_issue_discard() to return operation not supported if and only if
the device does not actually supports it, not just part of the device as
some bios might indicate.

This change also fixes problem with BLKDISCARD ioctl() which now works
correctly on such dm devices.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
CC: Jens Axboe <jaxboe@fusionio.com>
CC: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-05-06 19:30:01 -06:00
Lukas Czerner 5baebe5c86 blkdev: Simple cleanup in blkdev_issue_zeroout()
In blkdev_issue_zeroout() we are submitting regular WRITE bios, so we do
not need to check for -EOPNOTSUPP specifically in case of error. Also
there is no need to have label submit: because there is no way to jump
out from the while cycle without an error and we really want to exit,
rather than try again. And also remove the check for (sz == 0) since at
that point sz can never be zero.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
CC: Dmitry Monakhov <dmonakhov@openvz.org>
CC: Jens Axboe <jaxboe@fusionio.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-05-06 19:26:28 -06:00
Lukas Czerner 5dba3089ed blkdev: Submit discard bio in batches in blkdev_issue_discard()
Currently we are waiting for every submitted REQ_DISCARD bio separately,
but it can have unwanted consequences of repeatedly flushing the queue,
so we rather submit bios in batches and wait for the entire batch, hence
narrowing the window of other ios going in.

Use bio_batch_end_io() and struct bio_batch for that purpose, the same
is used by blkdev_issue_zeroout(). Also change bio_batch_end_io() so we
always set !BIO_UPTODATE in the case of error and remove the check for
bb, since we are the only user of this function and we always set this.

Remove bio_get()/bio_put() from the blkdev_issue_discard() since
bio_alloc() and bio_batch_end_io() is doing the same thing, hence it is
not needed anymore.

I have done simple dd testing with surprising results. The script I have
used is:

for i in $(seq 10); do
        echo $i
        dd if=/dev/sdb1 of=/dev/sdc1 bs=4k &
        sleep 5
done
/usr/bin/time -f %e ./blkdiscard /dev/sdc1

Running time of BLKDISCARD on the whole device:
with patch              without patch
0.95                    15.58

So we can see that in this artificial test the kernel with the patch
applied is approx 16x faster in discarding the device.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
CC: Dmitry Monakhov <dmonakhov@openvz.org>
CC: Jens Axboe <jaxboe@fusionio.com>
CC: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-05-06 19:26:27 -06:00
shaohua.li@intel.com 3ac0cc4508 block: hold queue if flush is running for non-queueable flush drive
In some drives, flush requests are non-queueable. When flush request is
running, normal read/write requests can't run. If block layer dispatches
such request, driver can't handle it and requeue it.  Tejun suggested we
can hold the queue when flush is running. This can avoid unnecessary
requeue.  Also this can improve performance. For example, we have
request flush1, write1, flush 2. flush1 is dispatched, then queue is
hold, write1 isn't inserted to queue. After flush1 is finished, flush2
will be dispatched. Since disk cache is already clean, flush2 will be
finished very soon, so looks like flush2 is folded to flush1.

In my test, the queue holding completely solves a regression introduced by
commit 53d63e6b0dfb95882ec0219ba6bbd50cde423794:

    block: make the flush insertion use the tail of the dispatch list

    It's not a preempt type request, in fact we have to insert it
    behind requests that do specify INSERT_FRONT.

which causes about 20% regression running a sysbench fileio
workload.

Stable: 2.6.39 only

Cc: stable@kernel.org
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-05-06 11:36:25 -06:00
shaohua.li@intel.com f387693095 block: add a non-queueable flush flag
flush request isn't queueable in some drives. Add a flag to let driver
notify block layer about this. We can optimize flush performance with the
knowledge.

Stable: 2.6.39 only

Cc: stable@kernel.org
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-05-06 11:36:25 -06:00
Kees Cook 490b94be02 iosched: remove redundant sprintf
After the anticipatory scheduler was dropped, there was no need to
special-case the request_module string. As such, drop the redundant
sprintf and stack variable.

Signed-off-by: Kees Cook <kees.cook@canonical.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-05-05 18:02:12 -06:00
Tao Ma addd0a09fc block: Remove 'plug/unplug' comment in blk_execute_rq_nowait
unplug is replaced with blk_run_queue now in blk_execute_rq_nowait,
so change the comment accordingly.

Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-05-05 15:10:05 -06:00
Tejun Heo 7c88a168da block: don't propagate unlisted DISK_EVENTs to userland
DISK_EVENT_MEDIA_CHANGE is used for both userland visible event and
internal event for revalidation of removeable devices.  Some legacy
drivers don't implement proper event detection and continuously
generate events under certain circumstances.  For example, ide-cd
generates media changed continuously if there's no media in the drive,
which can lead to infinite loop of events jumping back and forth
between the driver and userland event handler.

This patch updates disk event infrastructure such that it never
propagates events not listed in disk->events to userland.  Those
events are processed the same for internal purposes but uevent
generation is suppressed.

This also ensures that userland only gets events which are advertised
in the @events sysfs node lowering risk of confusion.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-04-21 19:43:58 +02:00
Jens Axboe 3aa72873ff elevator: check for ELEVATOR_INSERT_SORT_MERGE in !elvpriv case too
The sort insert is the one that goes to the IO scheduler. With
the SORT_MERGE addition, we could bypass IO scheduler setup
but still ask the IO scheduler to insert the request. This would
cause an oops on switching IO schedulers through the sysfs
interface, unless the disk just happened to be idle while it
occured.

Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-04-21 19:28:35 +02:00
Tao Ma 60735b6362 block: Remove the extra check in queue_requests_store
In queue_requests_store, the code looks like
	if (rl->count[BLK_RW_SYNC] >= q->nr_requests) {
		blk_set_queue_full(q, BLK_RW_SYNC);
	} else if (rl->count[BLK_RW_SYNC]+1 <= q->nr_requests) {
		blk_clear_queue_full(q, BLK_RW_SYNC);
		wake_up(&rl->wait[BLK_RW_SYNC]);
	}
If we don't satify the situation of "if", we can get that
rl->count[BLK_RW_SYNC} < q->nr_quests. It is the same as
rl->count[BLK_RW_SYNC]+1 <= q->nr_requests.
All the "else" should satisfy the "else if" check so it isn't
needed actually.

Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-04-19 13:51:53 +02:00
Liu Yuan ed5302d3c2 block, blk-sysfs: Fix an err return path in blk_register_queue()
We do not call blk_trace_remove_sysfs() in err return path
if kobject_add() fails. This path fixes it.

Cc: stable@kernel.org
Signed-off-by: Liu Yuan <tailai.ly@taobao.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-04-19 13:51:53 +02:00
Jens Axboe d350e6b6e8 block: remove stale kerneldoc member from __blk_run_queue()
We don't pass in a 'force_kblockd' anymore, get rid of the
stsale comment.

Reported-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-04-19 13:34:14 +02:00
Jens Axboe c21e6beba8 block: get rid of QUEUE_FLAG_REENTER
We are currently using this flag to check whether it's safe
to call into ->request_fn(). If it is set, we punt to kblockd.
But we get a lot of false positives and excessive punts to
kblockd, which hurts performance.

The only real abuser of this infrastructure is SCSI. So export
the async queue run and convert SCSI over to use that. There's
room for improvement in that SCSI need not always use the async
call, but this fixes our performance issue and they can fix that
up in due time.

Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-04-19 13:32:46 +02:00
Jens Axboe 5f45c69589 cfq-iosched: read_lock() does not always imply rcu_read_lock()
For some configurations of CONFIG_PREEMPT that is not true. So
get rid of __call_for_each_cic() and always uses the explicitly
rcu_read_lock() protected call_for_each_cic() instead.

This fixes a potential bug related to IO scheduler removal or
online switching.

Thanks to Paul McKenney for clarifying this.

Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-04-19 09:10:35 +02:00
Jens Axboe bd900d4580 block: kill blk_flush_plug_list() export
With all drivers and file systems converted, we only have
in-core use of this function. So remove the export.

Reporteed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-04-18 22:06:57 +02:00
Christoph Hellwig 24ecfbe27f block: add blk_run_queue_async
Instead of overloading __blk_run_queue to force an offload to kblockd
add a new blk_run_queue_async helper to do it explicitly.  I've kept
the blk_queue_stopped check for now, but I suspect it's not needed
as the check we do when the workqueue items runs should be enough.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-04-18 11:41:33 +02:00
Jens Axboe 4521cc4ed5 block: blk_delay_queue() should use kblockd workqueue
Reported-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-04-18 11:36:39 +02:00
Jens Axboe 99e22598e9 block: drop queue lock before calling __blk_run_queue() for kblockd punt
If we know we are going to punt to kblockd, we can drop the queue
lock before calling into __blk_run_queue() since it only does a
safe bit test and a workqueue call. Since kblockd needs to grab
this very lock as one of the first things it does, it's a good
optimization to drop the lock before waking kblockd.

Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-04-18 09:59:55 +02:00
Jens Axboe b4cb290e0a Revert "block: add callback function for unplug notification"
MD can't use this since it really requires us to be able to
keep more than a single piece of state for the unplug. Commit
048c9374 added the required support for MD, so get rid of this
now unused code.

This reverts commit f75664570d.

Conflicts:

	block/blk-core.c

Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-04-18 09:54:05 +02:00
NeilBrown 048c9374a7 block: Enhance new plugging support to support general callbacks
md/raid requires an unplug callback, but as it does not uses
requests the current code cannot provide one.

So allow arbitrary callbacks to be attached to the blk_plug.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-04-18 09:52:22 +02:00
Jens Axboe 49cac01e1f block: make unplug timer trace event correspond to the schedule() unplug
It's a pretty close match to what we had before - the timer triggering
would mean that nobody unplugged the plug in due time, in the new
scheme this matches very closely what the schedule() unplug now is.
It's essentially the difference between an explicit unplug (IO unplug)
or an implicit unplug (timer unplug, we scheduled with pending IO
queued).

Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-04-16 13:51:05 +02:00
Jens Axboe f6603783f9 block: only force kblockd unplugging from the schedule() path
For the explicit unplugging, we'd prefer to kick things off
immediately and not pay the penalty of the latency to switch
to kblockd. So let blk_finish_plug() do the run inline, while
the implicit-on-schedule-out unplug will punt to kblockd.

Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-04-15 15:49:07 +02:00
Christoph Hellwig 88b996cd06 block: cleanup the block plug helper functions
It's a bit of a mess currently. task->plug is being cleared
and reset in __blk_finish_plug(), and blk_finish_plug() is
testing for a NULL plug which cannot happen even from schedule()
anymore since it uses blk_needs_flush_plug() to determine
whether to call into this function at all.

So get rid of some of the cruft.

Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-04-15 15:20:10 +02:00
Liu Yuan 80656b67b3 block, blk-sysfs: Use the variable directly instead of a function call
In the function blk_register_queue(), var _dev_ is already assigned by
disk_to_dev().So use it directly instead of calling disk_to_dev() again.

Signed-off-by: Liu Yuan <tailai.ly@taobao.com>

Modified by me to delete an empty line in the same function while
in there anyway.

Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-04-13 22:14:54 +02:00
Jens Axboe f4af3c3d07 block: move queue run on unplug to kblockd
There are worries that we are now consuming a lot more stack in
some cases, since we potentially call into IO dispatch from
schedule() or io_schedule(). We can reduce this problem by moving
the running of the queue to kblockd, like the old plugging scheme
did as well.

This may or may not be a good idea from a performance perspective,
depending on how many tasks have queue plugs running at the same
time. For even the slightly contended case, doing just a single
queue run from kblockd instead of multiple runs directly from the
unpluggers will be faster.

Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-04-12 14:58:51 +02:00
Jens Axboe cf82c79839 block: kill queue_sync_plugs()
The original use for this dates back to when we had to track write
requests for serializing around barriers. That's not needed anymore,
so kill it.

Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-04-12 10:30:53 +02:00
Jens Axboe dc6d36c971 block: readd plug trace event
This was removed with the queue plug state. But we can easily readd
by checking if this is the first request going to this queue. It's
good information to have when tracing to see how effective the
plugging is.

Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-04-12 10:28:28 +02:00
Jens Axboe f75664570d block: add callback function for unplug notification
MD would like to know when a queue is unplugged, so it can flush
it's bitmap writes. Add such a callback.

Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-04-12 10:17:31 +02:00
Jens Axboe 188112722c block: add comment on why we save and disable interrupts in flush_plug_list()
It's done at the top to avoid doing it for every queue we unplug.

Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-04-12 10:12:29 +02:00
Jens Axboe 94b5eb28b4 block: fixup block IO unplug trace call
It was removed with the on-stack plugging, readd it and track the
depth of requests added when flushing the plug.

Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-04-12 10:12:19 +02:00
NeilBrown 109b81296c block: splice plug list to local context
If the request_fn ends up blocking, we could be re-entering
the plug flush. Since the list is protected by explicitly
not allowing schedule events, this isn't a terribly good idea.

Additionally, it can cause us to recurse. As request_fn called by
__blk_run_queue is allowed to 'schedule()' (after dropping the queue
lock of course), it is possible to get a recursive call:

 schedule -> blk_flush_plug -> __blk_finish_plug -> flush_plug_list
      -> __blk_run_queue -> request_fn -> schedule

We must make sure that the second schedule does not call into
blk_flush_plug again.  So instead of leaving the list of requests on
blk_plug->list, move them to a separate list leaving blk_plug->list
empty.

Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-04-11 14:13:10 +02:00