Commit Graph

368 Commits

Author SHA1 Message Date
Jens Axboe a993800655 cfq-iosched: fix sequential write regression
We have a 10-15% performance regression for sequential writes on TCQ/NCQ
enabled drives in 2.6.21-rcX after the CFQ update went in.  It has been
reported by Valerie Clement <valerie.clement@bull.net> and the Intel
testing folks.  The regression is because of CFQ's now more aggressive
queue control, limiting the depth available to the device.

This patches fixes that regression by allowing a greater depth when only
one queue is busy.  It has been tested to not impact sync-vs-async
workloads too much - we still do a lot better than 2.6.20.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-04-20 22:56:29 -07:00
Alan Stern 44ec95425c [SCSI] sg: cap reserved_size values at max_sectors
This patch (as857) modifies the SG_GET_RESERVED_SIZE and
SG_SET_RESERVED_SIZE ioctls in the sg driver, capping the values at
the device's request_queue's max_sectors value.  This will permit
cdrecord to obtain a legal value for the maximum transfer length,
fixing Bugzilla #7026.

The patch also caps the initial reserved_size value.  There's no
reason to have a reserved buffer larger than max_sectors, since it
would be impossible to use the extra space.

The corresponding ioctls in the block layer are modified similarly,
and the initial value for the reserved_size is set as large as
possible.  This will effectively make it default to max_sectors.
Note that the actual value is meaningless anyway, since block devices
don't have a reserved buffer.

Finally, the BLKSECTGET ioctl is added to sg, so that there will be a
uniform way for users to determine the actual max_sectors value for
any raw SCSI transport.

Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
Acked-by: Jens Axboe <jens.axboe@oracle.com>
Acked-by: Douglas Gilbert <dougg@torque.net>
Signed-off-by: James Bottomley <James.Bottomley@SteelEye.com>
2007-04-17 18:09:56 -04:00
Andrew Morton 2363cc0264 [PATCH] remove protection of LANANA-reserved majors
Revert all this.  It can cause device-mapper to receive a different major from
earlier kernels and it turns out that the Amanda backup program (via GNU tar,
apparently) checks major numbers on files when performing incremental backups.

Which is a bit broken of Amanda (or tar), but this feature isn't important
enough to justify the churn.

Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-04-04 21:12:47 -07:00
Thibaut VARENE 1ffb96c587 make elv_register() output atomic
Booting 2.6.21-rc3-g45592145 I noticed the following on one of my
machines in the bootlog:

io scheduler noop registered<6>Time: jiffies clocksource has been installed.

io scheduler deadline registered (default)

Looking at block/elevator.c, it appears that elv_register() uses two
consecutive printks in a non-atomic way, leading to the above glitch. The
attached trivial patch fixes this issue, by using a single printk.

Signed-off-by: Thibaut VARENE <varenet@parisc-linux.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-03-27 08:53:04 +02:00
Vasily Tarasov f772b3d9ca block: blk_max_pfn is somtimes wrong
There is a small problem in handling page bounce.

At the moment blk_max_pfn equals max_pfn, which is in fact not maximum
possible _number_ of a page frame, but the _amount_ of page frames.  For
example for the 32bit x86 node with 4Gb RAM, max_pfn = 0x100000, but not
0xFFFF.

request_queue structure has a member q->bounce_pfn and queue needs bounce
pages for the pages _above_ this limit.  This routine is handled by
blk_queue_bounce(), where the following check is produced:

	if (q->bounce_pfn >= blk_max_pfn)
		return;

Assume, that a driver has set q->bounce_pfn to 0xFFFF, but blk_max_pfn
equals 0x10000.  In such situation the check above fails and for each bio
we always fall down for iterating over pages tied to the bio.

I want to notice, that for quite a big range of device drivers (ide, md,
...) such problem doesn't happen because they use BLK_BOUNCE_ANY for
bounce_pfn.  BLK_BOUNCE_ANY is defined as blk_max_pfn << PAGE_SHIFT, and
then the check above doesn't fail.  But for other drivers, which obtain
reuired value from drivers, it fails.  For example sata_nv uses
ATA_DMA_MASK or dev->dma_mask.

I propose to use (max_pfn - 1) for blk_max_pfn.  And the same for
blk_max_low_pfn.  The patch also cleanses some checks related with
bounce_pfn.

Signed-off-by: Vasily Tarasov <vtaras@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-03-27 08:52:47 +02:00
Peter Zijlstra 6d740cd5b1 [PATCH] lockdep: annotate BLKPG_DEL_PARTITION
>=============================================
>[ INFO: possible recursive locking detected ]
>2.6.19-1.2909.fc7 #1
>---------------------------------------------
>anaconda/587 is trying to acquire lock:
> (&bdev->bd_mutex){--..}, at: [<c05fb380>] mutex_lock+0x21/0x24
>
>but task is already holding lock:
> (&bdev->bd_mutex){--..}, at: [<c05fb380>] mutex_lock+0x21/0x24
>
>other info that might help us debug this:
>1 lock held by anaconda/587:
> #0:  (&bdev->bd_mutex){--..}, at: [<c05fb380>] mutex_lock+0x21/0x24
>
>stack backtrace:
> [<c0405812>] show_trace_log_lvl+0x1a/0x2f
> [<c0405db2>] show_trace+0x12/0x14
> [<c0405e36>] dump_stack+0x16/0x18
> [<c043bd84>] __lock_acquire+0x116/0xa09
> [<c043c960>] lock_acquire+0x56/0x6f
> [<c05fb1fa>] __mutex_lock_slowpath+0xe5/0x24a
> [<c05fb380>] mutex_lock+0x21/0x24
> [<c04d82fb>] blkdev_ioctl+0x600/0x76d
> [<c04946b1>] block_ioctl+0x1b/0x1f
> [<c047ed5a>] do_ioctl+0x22/0x68
> [<c047eff2>] vfs_ioctl+0x252/0x265
> [<c047f04e>] sys_ioctl+0x49/0x63
> [<c0404070>] syscall_call+0x7/0xb

Annotate BLKPG_DEL_PARTITION's bd_mutex locking and add a little comment
clarifying the bd_mutex locking, because I confused myself and initially
thought the lock order was wrong too.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-20 17:10:16 -08:00
Andrew Morton b446b60e4e [PATCH] rework reserved major handling
Several people have reported failures in dynamic major device number handling
due to the recent changes in there to avoid handing out the local/experimental
majors.

Rolf reports that this is due to a gcc-4.1.0 bug.

The patch refactors that code a lot in an attempt to provoke the compiler into
behaving.

Cc: Rolf Eike Beer <eike-kernel@sf-tec.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-20 17:10:13 -08:00
Jesper Juhl a8e14b950c update I/O sched Kconfig help texts - CFQ is now default, not AS.
Change I/O scheduler description to correctly show CFQ as being the default
scheduler and not the anticipatory scheduler that previously was default.

Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
2007-02-17 20:08:22 +01:00
Arjan van de Ven 2b8693c061 [PATCH] mark struct file_operations const 3
Many struct file_operations in the kernel can be "const".  Marking them const
moves these to the .rodata section, which avoids false sharing with potential
dirty data.  In addition it'll catch accidental writes at compile time to
these shared resources.

Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-12 09:48:45 -08:00
Andrew Morton fdf892be32 [PATCH] register_blkdev(): don't hand out the LOCAL/EXPERIMENTAL majors
As pointed out in http://bugzilla.kernel.org/show_bug.cgi?id=7922, dynamic
blockdev major allocation can hand out majors which LANANA has defined as
being for local/experimental use.

Cc: Torben Mathiasen <device@lanana.org>
Cc: Greg KH <greg@kroah.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Tomas Klas <tomas.klas@mepatek.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-12 09:48:27 -08:00
Jens Axboe 9ede209e83 cfq-iosched: improve continue or break logic in cfq_dispatch
This improves performance considerably for sync requests when you
have command queuing enabled.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-02-11 23:14:45 +01:00
Jens Axboe 28f95cbc3e cfq-iosched: remove the implicit queue kicking in slice expire
We only really need it for a process going away, so move it to
those locations.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-02-11 23:14:45 +01:00
Jens Axboe 3c6bd2f879 cfq-iosched: check whether a queue timed out in accounting
Makes it more fair for the residual slice count.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-02-11 23:14:45 +01:00
Jens Axboe cb8874119e cfq-iosched: tweak the FIFO checking
We currently check the FIFO once per slice. Optimize that a bit and
only do it as the first thing for a new slice, so we don't end up
doing a single request and then seek to the FIFO requests.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-02-11 23:14:45 +01:00
Jens Axboe 1792669cc1 cfq-iosched: don't pass in queue for cfq_arm_slice_timer()
It must always be the active queue, otherwise it's a bug. So just
use the active_queue, don't pass it in explicitly.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-02-11 23:14:45 +01:00
Jens Axboe c5b680f3b7 cfq-iosched: account for slice over/under time
If a slice uses less than it is entitled to (or perhaps more), include
that in the decision on how much time to give it the next time it
gets serviced.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-02-11 23:14:45 +01:00
Jens Axboe 44f7c16065 cfq-iosched: defer slice activation to first request being active
This better matches what time the queue is actually spending doing
IO.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-02-11 23:14:45 +01:00
Jens Axboe 99f9628aba [PATCH] cfq-iosched: use last service point as the fairness criteria
Right now we use slice_start, which gives async queues an unfair
advantage. Chance that to service_last, and base the resorter
on that.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-02-11 23:14:45 +01:00
Jens Axboe b0b8d74941 cfq-iosched: document the cfqq flags
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-02-11 23:14:44 +01:00
Jens Axboe 98e41c7dfc [PATCH] cfq-iosched: move on_rr check into cfq_resort_rr_list()
Move the on_rr check into cfq_resort_rr_list(), every call site
needs to check it anyway.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-02-11 23:14:44 +01:00
Jens Axboe aaf1228ddf cfq-iosched: remove cfq_io_context last_queue
It hasn't been used for a while, kill it off and remove the old
if 0 code chunk.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-02-11 23:14:44 +01:00
Jens Axboe 783660b2f6 elevator: don't sort reads between writes
Don't allow elv_dispatch_sort() to mix reads and writes together,
it's rarely a good idea.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-02-11 23:14:44 +01:00
Jens Axboe cad9751642 elevator: abstract out the activate and deactivate functions
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-02-11 23:14:44 +01:00
Linus Torvalds c827ba4cb4 Merge master.kernel.org:/pub/scm/linux/kernel/git/davem/sparc-2.6
* master.kernel.org:/pub/scm/linux/kernel/git/davem/sparc-2.6:
  [SPARC64]: Update defconfig.
  [SPARC64]: Add PCI MSI support on Niagara.
  [SPARC64] IRQ: Use irq_desc->chip_data instead of irq_desc->handler_data
  [SPARC64]: Add obppath sysfs attribute for SBUS and PCI devices.
  [PARTITION]: Add whole_disk attribute.
2007-02-11 11:37:45 -08:00
Mathieu Desnoyers 23c887522e [PATCH] Relay: add CPU hotplug support
Mathieu originally needed to add this for tracing Xen, but it's something
that's needed for any application that can be tracing while cpus are added.

unplug isn't supported by this patch.  The thought was that at minumum a new
buffer needs to be added when a cpu comes up, but it wasn't worth the effort
to remove buffers on cpu down since they'd be freed soon anyway when the
channel was closed.

[zanussi@us.ibm.com: avoid lock_cpu_hotplug deadlock]
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Cc: Tom Zanussi <zanussi@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11 10:51:28 -08:00
Fabio Massimo Di Nitto d18d7682c1 [PARTITION]: Add whole_disk attribute.
Some partitioning systems create special partitions that
span the entire disk.  One example are Sun partitions, and
this whole-disk partition exists to tell the firmware the
extent of the entire device so it can load the boot block
and do other things.

Such partitions should not be treated as normal partitions,
because all the other partitions overlap this whole-disk one.
So we'd see multiple instances of the same UUID etc. which
we do not want.  udev and friends can thus search for this
'whole_disk' attribute and use it to decide to ignore the
partition.

Signed-off-by: Fabio Massimo Di Nitto <fabbione@ubuntu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-02-10 23:50:00 -08:00
Neil Brown 387bb17374 [PATCH] md: fix various bugs with aligned reads in RAID5
It is possible for raid5 to be sent a bio that is too big for an underlying
device.  So if it is a READ that we pass stright down to a device, it will
fail and confuse RAID5.

So in 'chunk_aligned_read' we check that the bio fits within the parameters
for the target device and if it doesn't fit, fall back on reading through
the stripe cache and making lots of one-page requests.

Note that this is the earliest time we can check against the device because
earlier we don't have a lock on the device, so it could change underneath
us.

Also, the code for handling a retry through the cache when a read fails has
not been tested and was badly broken.  This patch fixes that code.

Signed-off-by: Neil Brown <neilb@suse.de>
Cc: "Kai" <epimetreus@fastmail.fm>
Cc: <stable@suse.de>
Cc: <org@suse.de>
Cc: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-09 09:25:46 -08:00
Mike Christie c0d4d573fe [PATCH] Fix SG_IO timeout jiffy conversion
Commit 85e04e371b cleaned up the timeout
conversion, but did it exactly the wrong way.  We get msecs from user
space, and should convert them into jiffies. Not the other way around.

Here is a fix with the overflow check sg.c has added in.  This fixes DVD
burnign with Nero.

Signed-off-by: Mike Christie <michaelc@cs.wisc.edu>
[ "you'll be wanting a comma there" - Andrew ]
Cc: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-01-29 20:32:03 -08:00
Linas Vepstas 95543179f1 [PATCH] elevator: move clearing of unplug flag earlier
A flag was recently added to the elevator code to avoid
performing an unplug when reuests are being re-queued.
The goal of this flag was to avoid a deep recursion that
can occur when re-queueing requests after a SCSI device/host
reset.  See http://lkml.org/lkml/2006/5/17/254

However, that fix added the flag near the bottom of a case
statement, where an earlier break (in an if statement) could
transport one out of the case, without setting the flag.
This patch sets the flag earlier in the case statement.

I re-discovered the deep recursion recently during testing;
I was told that it was a known problem, and the fix to it was
in the kernel I was testing. Indeed it was ... but it didn't
fix the bug. With the patch below, I no longer see the bug.

Signed-off by: Linas Vepstas <linas@austin.ibm.com>
Signed-off-by: Jens Axboe <axboe@suse.de>
Cc: Chris Wright <chrisw@sous-sol.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-01-23 11:01:17 -08:00
Jens Axboe ec8acb6904 [PATCH] cfq-iosched: merging problem
Two issues:

- The final return 1 should be a return 0, otherwise comparing cfqq is
  a noop.

- bio_sync() only checks the sync flag, while rq_is_sync() checks both
  for READ and sync. The latter is what we want. Expand the bio check
  to include reads, and relax the restriction to allow merging of async
  io into sync requests.

In the future we want to clean up the SYNC logic, right now it means
both sync request (such as READ and O_DIRECT WRITE) and unplug-on-issue.
Leave that for later.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2007-01-02 09:46:16 -08:00
Jens Axboe 719d34027e [PATCH] cfq-iosched: tighten allow merge criteria
The logic in cfq_allow_merge() wasn't clear enough - basically allow
merging for the same queues only.  Do a fast check for 'rq and bio both
sync/async' before doing the cfqq hash lookup.

This is verified to work with the fixed elv_try_merge() from commit
bb4067e341.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-22 14:13:08 -08:00
Randy Dunlap af9997e426 [PATCH] fix kernel-doc warnings in 2.6.20-rc1
Fix kernel-doc warnings in 2.6.20-rc1.

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-22 08:55:47 -08:00
Jens Axboe bb4067e341 [PATCH] elevator: fixup typo in merge logic
The recent io scheduler allow_merge commit left the block layer with
no merging, oops. This patch fixes that up.

That means the CFQ change needs to be verified again, it might not fix
the original bug now.  But that's a seperate thing, I'll double check
that tomorrow.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-21 22:01:04 -08:00
Jens Axboe da77526502 [PATCH] cfq-iosched: don't allow sync merges across queues
Currently we allow any merge, even if the io originates from different
processes. This can cause really bad starvation and unfairness, if those
ios happen to be synchronous (reads or direct writes).

So add a allow_merge hook to the io scheduler ops, so an io scheduler can
help decide whether a bio/process combination may be merged with an
existing request.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2006-12-20 11:04:12 +01:00
Jens Axboe 8e5cfc45e7 [PATCH] Fixup blk_rq_unmap_user() API
The blk_rq_unmap_user() API is not very nice. It expects the caller to
know that rq->bio has to be reset to the original bio, and it will
silently do nothing if that is not done. Instead make it explicit that
we need to pass in the first bio, by expecting a bio argument.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2006-12-19 11:12:46 +01:00
Jens Axboe 48785bb9fa [PATCH] __blk_rq_unmap_user() fails to return error
If the bio is user copied, the copy back could return -EFAULT. Make
sure we return any error seen during unmapping.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2006-12-19 11:07:59 +01:00
Jens Axboe 9c9381f942 [PATCH] __blk_rq_map_user() doesn't need to grab the queue_lock
It was for driver private back_merge_fn hooks, but they don't exist
anymore.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2006-12-19 08:34:17 +01:00
Jens Axboe 1aa4f24fe9 [PATCH] Remove queue merging hooks
We have full flexibility of merging parameters now, so we can remove the
hooks that define back/front/request merge strategies. Nobody is using
them anymore.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2006-12-19 08:33:11 +01:00
Jens Axboe 2985259b0e [PATCH] ->nr_sectors and ->hard_nr_sectors are not used for BLOCK_PC requests
It's a file system thing, for block requests the only size used in the
io paths is ->data_len as it is in bytes, not sectors.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2006-12-19 08:27:31 +01:00
Jens Axboe c65fb61b3c [PATCH] Allow as-iosched to be unloaded
We implemented the missing bits to allow this some time ago, and
they are integrated in AS. So remove the __module_get() to allow
the module to be unloaded.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2006-12-13 13:25:18 +01:00
Jens Axboe 7749a8d423 [PATCH] Propagate down request sync flag
We need to do this, otherwise the io schedulers don't get access to the
sync flag. Then they cannot tell the difference between a regular write
and an O_DIRECT write, which can cause a performance loss.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2006-12-13 13:02:26 +01:00
FUJITA Tomonori 335302618f [PATCH] remove unnecessary blk_queue_bounce in SG_IO
When I converted the original patch, I left unnecessary blk_queue_bounce in
SG_IO.

Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: Mike Christie <michaelc@cs.wisc.edu>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2006-12-12 10:26:55 +01:00
FUJITA Tomonori 77d172ce27 [PATCH] fix SG_IO bio leak
This patch fixes bio leaks in SG_IO. rq->bio can be changed after io
completion, so we need to reset rq->bio before calling blk_rq_unmap_user()

http://marc.theaimsgroup.com/?l=linux-kernel&m=116570666807983&w=2

Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2006-12-12 10:22:23 +01:00
Boaz Harrosh 2b02a17920 [PATCH] remove blk_queue_activity_fn
While working on bidi support at struct request level
I have found that blk_queue_activity_fn is actually never used.
The only user is in ide-probe.c with this code:

	/* enable led activity for disk drives only */
	if (drive->media == ide_disk && hwif->led_act)
		blk_queue_activity_fn(q, hwif->led_act, drive);

And led_act is never initialized anywhere.
(Looking back at older kernels it was used in the PPC arch, but was removed around 2.6.18)
Unless it is all for future use off course.
(this patch is against linux-2.6-block.git as off 2006/12/4)

Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2006-12-12 10:22:23 +01:00
Andrew Morton faccbd4b26 [PATCH] io-accounting: read accounting
Wire up read accounting for block devices, within submit_bio().

Cc: Jay Lan <jlan@sgi.com>
Cc: Shailabh Nagar <nagar@watson.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Chris Sturtivant <csturtiv@sgi.com>
Cc: Tony Ernst <tee@sgi.com>
Cc: Guillaume Thouvenin <guillaume.thouvenin@bull.net>
Cc: David Wright <daw@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 09:55:41 -08:00
Akinobu Mita c17bb49517 [PATCH] fault-injection capability for disk IO
This patch provides fault-injection capability for disk IO.

Boot option:

fail_make_request=<probability>,<interval>,<space>,<times>

	<interval> -- specifies the interval of failures.

	<probability> -- specifies how often it should fail in percent.

	<space> -- specifies the size of free space where disk IO can be issued
		   safely in bytes.

	<times> -- specifies how many times failures may happen at most.

Debugfs:

/debug/fail_make_request/interval
/debug/fail_make_request/probability
/debug/fail_make_request/specifies
/debug/fail_make_request/times

Example:

	fail_make_request=10,100,0,-1
	echo 1 > /sys/blocks/hda/hda1/make-it-fail

generic_make_request() on /dev/hda1 fails once per 10 times.

Cc: Jens Axboe <axboe@suse.de>
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-08 08:29:02 -08:00
Josef Sipek c5a20b6c26 [PATCH] struct path: convert block
Signed-off-by: Josef Sipek <jsipek@fsl.cs.sunysb.edu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-08 08:28:44 -08:00
Peter Zijlstra 2e7b651df1 [PATCH] remove the old bd_mutex lockdep annotation
Remove the old complex and crufty bd_mutex annotation.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Neil Brown <neilb@cse.unsw.edu.au>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Jason Baron <jbaron@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-08 08:28:38 -08:00
Ingo Molnar 0231606785 [PATCH] hotplug CPU: clean up hotcpu_notifier() use
There was lots of #ifdef noise in the kernel due to hotcpu_notifier(fn,
prio) not correctly marking 'fn' as used in the !HOTPLUG_CPU case, and thus
generating compiler warnings of unused symbols, hence forcing people to add
#ifdefs.

the compiler can skip truly unused functions just fine:

    text    data     bss     dec     hex filename
 1624412  728710 3674856 6027978  5bfaca vmlinux.before
 1624412  728710 3674856 6027978  5bfaca vmlinux.after

[akpm@osdl.org: topology.c fix]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:39 -08:00
Christoph Lameter e18b890bb0 [PATCH] slab: remove kmem_cache_t
Replace all uses of kmem_cache_t with struct kmem_cache.

The patch was generated using the following script:

	#!/bin/sh
	#
	# Replace one string by another in all the kernel sources.
	#

	set -e

	for file in `find * -name "*.c" -o -name "*.h"|xargs grep -l $1`; do
		quilt add $file
		sed -e "1,\$s/$1/$2/g" $file >/tmp/$$
		mv /tmp/$$ $file
		quilt refresh
	done

The script was run like this

	sh replace kmem_cache_t "struct kmem_cache"

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:25 -08:00
Alan Stern a120586873 [PATCH] Allow NULL pointers in percpu_free
The patch (as824b) makes percpu_free() ignore NULL arguments, as one would
expect for a deallocation routine.  (Note that free_percpu is #defined as
percpu_free in include/linux/percpu.h.) A few callers are updated to remove
now-unneeded tests for NULL.  A few other callers already seem to assume
that passing a NULL pointer to percpu_free() is okay!

The patch also removes an unnecessary NULL check in percpu_depopulate().

Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:22 -08:00
David Howells 4796b71fbb Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6
Conflicts:

	drivers/pcmcia/ds.c

Fix up merge failures with Linus's head and fix new compile failures.

Signed-Off-By: David Howells <dhowells@redhat.com>
2006-12-06 15:01:18 +00:00
Linus Torvalds ec0bf39a47 Merge master.kernel.org:/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6
* master.kernel.org:/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6: (73 commits)
  [SCSI] aic79xx: Add ASC-29320LPE ids to driver
  [SCSI] stex: version update
  [SCSI] stex: change wait loop code
  [SCSI] stex: add new device type support
  [SCSI] stex: update device id info
  [SCSI] stex: adjust default queue length
  [SCSI] stex: add value check in hard reset routine
  [SCSI] stex: fix controller_info command handling
  [SCSI] stex: fix biosparam calculation
  [SCSI] megaraid: fix MMIO casts
  [SCSI] tgt: fix undefined flush_dcache_page() problem
  [SCSI] libsas: better error handling in sas_expander.c
  [SCSI] lpfc 8.1.11 : Change version number to 8.1.11
  [SCSI] lpfc 8.1.11 : Misc Fixes
  [SCSI] lpfc 8.1.11 : Add soft_wwnn sysfs attribute, rename soft_wwn_enable
  [SCSI] lpfc 8.1.11 : Removed decoding of PCI Subsystem Id
  [SCSI] lpfc 8.1.11 : Add MSI (Message Signalled Interrupts) support
  [SCSI] lpfc 8.1.11 : Adjust LOG_FCP logging
  [SCSI] lpfc 8.1.11 : Fix Memory leaks
  [SCSI] lpfc 8.1.11 : Fix lpfc_multi_ring_support
  ...
2006-12-05 16:09:46 -08:00
David Howells 9db7372445 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6
Conflicts:

	drivers/ata/libata-scsi.c
	include/linux/libata.h

Futher merge of Linus's head and compilation fixups.

Signed-Off-By: David Howells <dhowells@redhat.com>
2006-12-05 17:01:28 +00:00
David Howells 4c1ac1b491 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6
Conflicts:

	drivers/infiniband/core/iwcm.c
	drivers/net/chelsio/cxgb2.c
	drivers/net/wireless/bcm43xx/bcm43xx_main.c
	drivers/net/wireless/prism54/islpci_eth.c
	drivers/usb/core/hub.h
	drivers/usb/input/hid-core.c
	net/core/netpoll.c

Fix up merge failures with Linus's head and fix new compilation failures.

Signed-Off-By: David Howells <dhowells@redhat.com>
2006-12-05 14:37:56 +00:00
Matthew Wilcox e62438630c [PATCH] Centralise definitions of sector_t and blkcnt_t
CONFIG_LBD and CONFIG_LSF are spread into asm/types.h for no particularly
good reason.

Centralising the definition in linux/types.h means that arch maintainers
don't need to bother adding it, as well as fixing the problem with
x86-64 users being asked to make a decision that has absolutely no
effect.

The H8/300 porters seem particularly confused since I'm not aware of any
microcontrollers that need to support 2TB filesystems.

Signed-off-by: Matthew Wilcox <matthew@wil.cx>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-04 19:41:15 -08:00
Jens Axboe a863055b10 [PATCH] blktrace: don't return blktrace_seq from trace_note()
Only the process notifier needs it, and it can set it manually.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2006-12-04 09:30:58 +01:00
Jens Axboe d3d9d2a5ea [PATCH] blktrace: uninline trace_note()
It's too large to inline. Additionally clean it up, by fast pathing
the likely path.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2006-12-04 09:27:41 +01:00
Jens Axboe bb37b94c68 [BLOCK] Cleanup unused variable passing
- ->init_queue() does not need the elevator passed in
- ->put_request() is a hot path and need not have the queue passed in
- cfq_update_io_seektime() does not need cfqd passed in

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2006-12-01 10:42:33 +01:00
Mike Christie 0e75f9063f [PATCH] block: support larger block pc requests
This patch modifies blk_rq_map/unmap_user() and the cdrom and scsi_ioctl.c
users so that it supports requests larger than bio by chaining them together.

Signed-off-by: Mike Christie <michaelc@cs.wisc.edu>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2006-12-01 10:40:55 +01:00
Olaf Kirch be1c63411a [PATCH] blktrace: add timestamp message
This adds a new timestamp message to blktrace, giving the timeofday when
we starting tracing. This helps user space correlate block trace events
with eg an application strace.

This requires a (compatible) update to blkparse. The changed blkparse
is still able to process traces generated by older kernels, and older
versions of blkparse should silently ignore the new records (because
they have a pid of 0).

Signed-off-by: Olaf Kirch <okir@suse.de>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2006-12-01 10:39:12 +01:00
James Bottomley 0bd2af4683 Merge ../scsi-rc-fixes-2.6 2006-11-22 12:06:44 -06:00
David Howells 65f27f3844 WorkStruct: Pass the work_struct pointer instead of context data
Pass the work_struct pointer to the work function rather than context data.
The work function can use container_of() to work out the data.

For the cases where the container of the work_struct may go away the moment the
pending bit is cleared, it is made possible to defer the release of the
structure by deferring the clearing of the pending bit.

To make this work, an extra flag is introduced into the management side of the
work_struct.  This governs auto-release of the structure upon execution.

Ordinarily, the work queue executor would release the work_struct for further
scheduling or deallocation by clearing the pending bit prior to jumping to the
work function.  This means that, unless the driver makes some guarantee itself
that the work_struct won't go away, the work function may not access anything
else in the work_struct or its container lest they be deallocated..  This is a
problem if the auxiliary data is taken away (as done by the last patch).

However, if the pending bit is *not* cleared before jumping to the work
function, then the work function *may* access the work_struct and its container
with no problems.  But then the work function must itself release the
work_struct by calling work_release().

In most cases, automatic release is fine, so this is the default.  Special
initiators exist for the non-auto-release case (ending in _NAR).


Signed-Off-By: David Howells <dhowells@redhat.com>
2006-11-22 14:55:48 +00:00
Tejun Heo 097b8457da [PATCH] scsi: clear garbage after CDBs on SG_IO
ATAPI devices transfer fixed number of bytes for CDBs (12 or 16).  Some
ATAPI devices choke when shorter CDB is used and the left bytes contain
garbage.  Block SG_IO cleared left bytes but SCSI SG_IO didn't.  This patch
makes SCSI SG_IO clear it and simplify CDB clearing in block SG_IO.

Signed-off-by: Tejun Heo <htejun@gmail.com>
Cc: Mathieu Fluhr <mfluhr@nero.com>
Cc: James Bottomley <James.Bottomley@steeleye.com>
Cc: Douglas Gilbert <dougg@torque.net>
Acked-by: Jens Axboe <jens.axboe@oracle.com>
Cc: <stable@kernel.org>
Acked-by: Jeff Garzik <jgarzik@pobox.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-11-16 11:43:38 -08:00
Hannes Reinecke 85e04e371b [SCSI] block: convert jiffies to msecs in scsi_ioctl()
Use the proper conversion function for convert jiffies to msecs in
sg_io().

Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: James Bottomley <James.Bottomley@SteelEye.com>
2006-11-15 17:57:33 -06:00
Jens Axboe 616e8a091a [PATCH] Fix bad data direction in SG_IO
Contrary to what the name misleads you to believe, SG_DXFER_TO_FROM_DEV
is really just a normal read seen from the device side.

This patch fixes http://lkml.org/lkml/2006/10/13/100

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-11-13 09:47:00 -08:00
Andrew Morton df66b8552b [PATCH] tidy "md: check bio address after mapping through partitions"
Neil's xterms are too wide.

Cc: Neil Brown <neilb@cse.unsw.edu.au>
Cc: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-11-03 12:27:55 -08:00
Jens Axboe 5fccbf61be [PATCH] CFQ: request <-> request merging rr_list fixup
In very rare circumstances would we be pruning a merged request and at
the same time delete the implicated cfqq from the rr_list, and not readd
it when the merged request got added. This could cause io stalls until
that process issued io again.

Fix it up by putting the rr_list add handling into cfq_add_rq_rb(),
identical to how pruning is handled in cfq_del_rq_rb(). This fixes a
hang reproducible with fsx-linux.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-31 08:12:45 -08:00
NeilBrown 5ddfe9691c [PATCH] md: check bio address after mapping through partitions.
Partitions are not limited to live within a device.  So we should range
check after partition mapping.

Note that 'maxsector' was being used for two different things.  I have
split off the second usage into 'old_sector' so that maxsector can be still
be used for it's primary usage later in the function.

Cc: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Neil Brown <neilb@suse.de>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-31 08:07:01 -08:00
Jens Axboe c1b707d253 [PATCH] CFQ: bad locking in changed_ioprio()
When the ioprio code recently got juggled a bit, a bug was introduced.
changed_ioprio() is no longer called with interrupts disabled, so using
plain spin_lock() on the queue_lock is a bug.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-30 11:01:50 -08:00
Jens Axboe 0261d6886e [PATCH] CFQ: use irq safe locking in cfq_cic_link()
If cfq_set_request() is called for a new process AND a non-fs io
request (so that __GFP_WAIT may not be set), cfq_cic_link() may
use spin_lock_irq() and spin_unlock_irq() with interrupts already
disabled.

Fix is to always use irq safe locking in cfq_cic_link()

Acked-By: Arjan van de Ven <arjan@linux.intel.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-30 10:21:58 -08:00
Andrew Morton 3fcfab16c5 [PATCH] separate bdi congestion functions from queue congestion functions
Separate out the concept of "queue congestion" from "backing-dev congestion".
Congestion is a backing-dev concept, not a queue concept.

The blk_* congestion functions are retained, as wrappers around the core
backing-dev congestion functions.

This proper layering is needed so that NFS can cleanly use the congestion
functions, and so that CONFIG_BLOCK=n actually links.

Cc: "Thomas Maier" <balagi@justmail.de>
Cc: "Jens Axboe" <jens.axboe@oracle.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: David Howells <dhowells@redhat.com>
Cc: Peter Osterlund <petero2@telia.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-20 10:26:35 -07:00
Thomas Maier 79e2de4bc5 [PATCH] export clear_queue_congested and set_queue_congested
Export the clear_queue_congested() and set_queue_congested() functions
located in ll_rw_blk.c

The functions are renamed to blk_clear_queue_congested() and
blk_set_queue_congested().

(needed in the pktcdvd driver's bio write congestion control)

Signed-off-by: Thomas Maier <balagi@justmail.de>
Cc: Peter Osterlund <petero2@telia.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-20 10:26:35 -07:00
Vasily Tarasov c584164224 [PATCH] block layer: elv_iosched_show should get elv_list_lock
elv_iosched_show function iterates other elv_list, hence
elv_list_lock should be got.

Signed-off-by: Vasily Tarasov <vtaras@openvz.org>
Signed-off-by: Vasily Tarasov <jens.axboe@oracle.com>
2006-10-12 15:08:51 +02:00
Vasily Tarasov a22b169df1 [PATCH] block layer: elevator_find function cleanup
We can easily produce search through the elevator list
without introducing additional elevator_type variable.

Signed-off-by: Vasily Tarasov <vtaras@openvz.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2006-10-12 15:08:51 +02:00
David C Somayajulu f583f4924d [PATCH] helper function for retrieving scsi_cmd given host based block layer tag
This was necessitated by the need for a function to get back
to a scsi_cmnd, when an hba the posts its (corresponding) completion
interrupt with a block layer tag as its reference.

Signed-off-by: Mike Christie <michaelc@cs.wisc.edu>
Signed-off-by: David Somayajulu <david.somayajulu@qlogic.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2006-10-04 19:32:09 +02:00
Alasdair G Kergon 7006f6eca8 [PATCH] dm: export blkdev_driver_ioctl
Export blkdev_driver_ioctl for device-mapper.

If we get as far as the device-mapper ioctl handler, we know the ioctl is not
a standard block layer BLK* one, so we don't need to check for them a second
time and can call blkdev_driver_ioctl() directly.

Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-03 08:04:13 -07:00
Peter Zijlstra 6e9a4738c9 [PATCH] completions: lockdep annotate on stack completions
All on stack DECLARE_COMPLETIONs should be replaced by:
DECLARE_COMPLETION_ONSTACK

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-01 00:39:24 -07:00
Jens Axboe 51d7513a8a [PATCH] Only enable CONFIG_BLOCK option for embedded
It's too easy for people to shoot themselves in the foot, and it
only makes sense for embedded folks anyway.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2006-09-30 21:14:05 +02:00
Jens Axboe 059af497c2 [PATCH] blk_queue_start_tag() shared map race fix
If we share the tag map between two or more queues, then we cannot
use __set_bit() to set the bit. In fact we need to make sure we
atomically acquire this tag, so loop using test_and_set_bit() to
protect from that.

Noticed by Mike Christie <michaelc@cs.wisc.edu>

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2006-09-30 20:52:34 +02:00
Jens Axboe 0fe2347957 [PATCH] Update axboe@suse.de email address
As people often look for the copyright in files to see who to mail,
update the link to a neutral one.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2006-09-30 20:52:34 +02:00
David Howells 9361401eb7 [PATCH] BLOCK: Make it possible to disable the block layer [try #6]
Make it possible to disable the block layer.  Not all embedded devices require
it, some can make do with just JFFS2, NFS, ramfs, etc - none of which require
the block layer to be present.

This patch does the following:

 (*) Introduces CONFIG_BLOCK to disable the block layer, buffering and blockdev
     support.

 (*) Adds dependencies on CONFIG_BLOCK to any configuration item that controls
     an item that uses the block layer.  This includes:

     (*) Block I/O tracing.

     (*) Disk partition code.

     (*) All filesystems that are block based, eg: Ext3, ReiserFS, ISOFS.

     (*) The SCSI layer.  As far as I can tell, even SCSI chardevs use the
     	 block layer to do scheduling.  Some drivers that use SCSI facilities -
     	 such as USB storage - end up disabled indirectly from this.

     (*) Various block-based device drivers, such as IDE and the old CDROM
     	 drivers.

     (*) MTD blockdev handling and FTL.

     (*) JFFS - which uses set_bdev_super(), something it could avoid doing by
     	 taking a leaf out of JFFS2's book.

 (*) Makes most of the contents of linux/blkdev.h, linux/buffer_head.h and
     linux/elevator.h contingent on CONFIG_BLOCK being set.  sector_div() is,
     however, still used in places, and so is still available.

 (*) Also made contingent are the contents of linux/mpage.h, linux/genhd.h and
     parts of linux/fs.h.

 (*) Makes a number of files in fs/ contingent on CONFIG_BLOCK.

 (*) Makes mm/bounce.c (bounce buffering) contingent on CONFIG_BLOCK.

 (*) set_page_dirty() doesn't call __set_page_dirty_buffers() if CONFIG_BLOCK
     is not enabled.

 (*) fs/no-block.c is created to hold out-of-line stubs and things that are
     required when CONFIG_BLOCK is not set:

     (*) Default blockdev file operations (to give error ENODEV on opening).

 (*) Makes some /proc changes:

     (*) /proc/devices does not list any blockdevs.

     (*) /proc/diskstats and /proc/partitions are contingent on CONFIG_BLOCK.

 (*) Makes some compat ioctl handling contingent on CONFIG_BLOCK.

 (*) If CONFIG_BLOCK is not defined, makes sys_quotactl() return -ENODEV if
     given command other than Q_SYNC or if a special device is specified.

 (*) In init/do_mounts.c, no reference is made to the blockdev routines if
     CONFIG_BLOCK is not defined.  This does not prohibit NFS roots or JFFS2.

 (*) The bdflush, ioprio_set and ioprio_get syscalls can now be absent (return
     error ENOSYS by way of cond_syscall if so).

 (*) The seclvl_bd_claim() and seclvl_bd_release() security calls do nothing if
     CONFIG_BLOCK is not set, since they can't then happen.

Signed-Off-By: David Howells <dhowells@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2006-09-30 20:52:31 +02:00
Martin Peschke 4090959aee [PATCH] blktrace: cleanup using on_each_cpu
This patch kills a few lines of code in blktrace by making use of
on_each_cpu().

Signed-off-by: Martin Peschke <mp3@de.ibm.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2006-09-30 20:31:19 +02:00
Oleg Nesterov 25034d7a83 [PATCH] exit_io_context: don't disable irqs
We don't need to disable irqs to clear current->io_context, it is protected
by ->alloc_lock. Even IF it was possible to submit I/O from IRQ on behalf of
current this irq_disable() can't help: current_io_context() will re-instantiate
->io_context after irq_enable().

We don't need task_lock() or local_irq_disable() to clear ioc->task. This can't
prevent other CPUs from playing with our io_context anyway.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2006-09-30 20:31:18 +02:00
Jens Axboe 7457e6e2d7 [PATCH] blktrace: support for logging metadata reads
Signed-off-by: Jens Axboe <axboe@suse.de>
2006-09-30 20:29:43 +02:00
Jens Axboe 374f84ac39 [PATCH] cfq-iosched: use metadata read flag
Give meta data reads preference over regular reads, as the process
often needs to get that out of the way to do the io it was actually
interested in.

Signed-off-by: Jens Axboe <axboe@suse.de>
2006-09-30 20:29:43 +02:00
Jens Axboe 5404bc7a87 [PATCH] Allow file systems to differentiate between data and meta reads
We can use this information for making more intelligent priority
decisions, and it will also be useful for blktrace.

Signed-off-by: Jens Axboe <axboe@suse.de>
2006-09-30 20:29:42 +02:00
Jens Axboe da20a20f3b [PATCH] ll_rw_blk: allow more flexibility for read_ahead_kb store
It can make sense to set read-ahead larger than a single request.
We should not be enforcing such policy on the user. Additionally,
using the BLKRASET ioctl doesn't impose such a restriction. So
additionally we now expose identical behaviour through the two.

Issue also reported by Anton <cbou@mail.ru>

Signed-off-by: Jens Axboe <axboe@suse.de>
2006-09-30 20:29:41 +02:00
Jens Axboe bf57225670 [PATCH] cfq-iosched: improve queue preemption
Don't touch the current queues, just make sure that the wanted queue
is selected next. Simplifies the logic.

Signed-off-by: Jens Axboe <axboe@suse.de>
2006-09-30 20:29:41 +02:00
Jens Axboe dc72ef4ae3 [PATCH] Add blk_start_queueing() helper
CFQ implements this on its own now, but it's really block layer
knowledge. Tells a device queue to start dispatching requests to
the driver, taking care to unplug if needed. Also fixes the issue
where as/cfq will invoke a stopped queue, which we really don't
want.

Signed-off-by: Jens Axboe <axboe@suse.de>
2006-09-30 20:29:40 +02:00
Jens Axboe 981a79730d [PATCH] cfq-iosched: kill the empty_list
No point in having a place holder list just for empty queues, so remove
it. It's not used for anything other than to keep ->cfq_list busy.

Signed-off-by: Jens Axboe <axboe@suse.de>
2006-09-30 20:29:40 +02:00
Jens Axboe 53b03744e5 [PATCH] cfq-iosched: Kill O(N) runtime of cfq_resort_rr_list()
Currently it scales with number of processes in that priority group,
which is potentially not very nice as it's called quite often.
Basically we always need to do tail inserts, except for the case of a
new process. So just mark/detect a queue as such.

Signed-off-by: Jens Axboe <axboe@suse.de>
2006-09-30 20:29:39 +02:00
Jens Axboe b5deef9012 [PATCH] Make sure all block/io scheduler setups are node aware
Some were kmalloc_node(), some were still kmalloc(). Change them all to
kmalloc_node().

Signed-off-by: Jens Axboe <axboe@suse.de>
2006-09-30 20:29:39 +02:00
Jens Axboe 1ea25ecb72 [PATCH] Audit block layer inlines
Kill a few inlines that bring in too much code to more than one location
Shrinks kernel text by about 300 bytes on 32-bit x86.

Signed-off-by: Jens Axboe <axboe@suse.de>
2006-09-30 20:29:38 +02:00
Jens Axboe 4050cf1674 [PATCH] cfq-iosched: use new io context counting mechanism
It's ok if the read path is a lot more costly, as long as inc/dec is
really cheap. The inc/dec will happen for each created/freed io context,
while the reading only happens when a disk queue exits.

Signed-off-by: Jens Axboe <axboe@suse.de>
2006-09-30 20:29:37 +02:00
Jens Axboe e4313dd423 [PATCH] as-iosched: use new io context counting mechanism
It's ok if the read path is a lot more costly, as long as inc/dec is
really cheap. The inc/dec will happen for each created/freed io context,
while the reading only happens when a disk queue exits.

Signed-off-by: Jens Axboe <axboe@suse.de>
2006-09-30 20:29:37 +02:00
Jens Axboe fc46379daf [PATCH] cfq-iosched: kill cfq_exit_lock
cfq_exit_lock is protecting two things now:

- The per-ioc rbtree of cfq_io_contexts

- The per-cfqd linked list of cfq_io_contexts

The per-cfqd linked list can be protected by the queue lock, as it is (by
definition) per cfqd as the queue lock is.

The per-ioc rbtree is mainly used and updated by the process itself only.
The only outside use is the io priority changing. If we move the
priority changing to not browsing the rbtree, we can remove any locking
from the rbtree updates and lookup completely. Let the sys_ioprio syscall
just mark processes as having the iopriority changed and lazily update
the private cfq io contexts the next time io is queued, and we can
remove this locking as well.

Signed-off-by: Jens Axboe <axboe@suse.de>
2006-09-30 20:29:36 +02:00
Jens Axboe 89850f7ee9 [PATCH] cfq-iosched: cleanups, fixes, dead code removal
A collection of little fixes and cleanups:

- We don't use the 'queued' sysfs exported attribute, since the
  may_queue() logic was rewritten. So kill it.

- Remove dead defines.

- cfq_set_active_queue() can be rewritten cleaner with else if conditions.

- Several places had cfq_exit_cfqq() like logic, abstract that out and
  use that.

- Annotate the cfqq kmem_cache_alloc() so the allocator knows that this
  is a repeat allocation if it fails with __GFP_WAIT set. Allows the
  allocator to start freeing some memory, if needed. CFQ already loops for
  this condition, so might as well pass the hint down.

- Remove cfqd->rq_starved logic. It's not needed anymore after we dropped
  the crq allocation in cfq_set_request().

- Remove uneeded parameter passing.

Signed-off-by: Jens Axboe <axboe@suse.de>
2006-09-30 20:29:35 +02:00
Jens Axboe 51da90fcb6 [PATCH] ll_rw_blk: cleanup __make_request()
- Don't assign variables that are only used once.

- Kill spin_lock() prefetching, it's opportunistic at best.

Signed-off-by: Jens Axboe <axboe@suse.de>
2006-09-30 20:29:34 +02:00
Jens Axboe cb78b285c8 [PATCH] Drop useless bio passing in may_queue/set_request API
It's not needed for anything, so kill the bio passing.

Signed-off-by: Jens Axboe <axboe@suse.de>
2006-09-30 20:29:23 +02:00
Jens Axboe cdd6026217 [PATCH] Remove ->rq_status from struct request
After Christophs SCSI change, the only usage left is RQ_ACTIVE
and RQ_INACTIVE. The block layer sets RQ_INACTIVE right before freeing
the request, so any check for RQ_INACTIVE in a driver is a bug and
indicates use-after-free.

So kill/clean the remaining users, straight forward.

Signed-off-by: Jens Axboe <axboe@suse.de>
2006-09-30 20:29:23 +02:00
Jens Axboe 49171e5c6f [PATCH] Remove struct request_list from struct request
It is always identical to &q->rq, and we only use it for detecting
whether this request came out of our mempool or not. So replace it
with an additional ->flags bit flag.

Signed-off-by: Jens Axboe <axboe@suse.de>
2006-09-30 20:29:22 +02:00
Jens Axboe c00895ab2f [PATCH] Remove ->waiting member from struct request
As the comments indicates in blkdev.h, we can fold it into ->end_io_data
usage as that is really what ->waiting is. Fixup the users of
blk_end_sync_rq().

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2006-09-30 20:29:12 +02:00
Jens Axboe 8a8e674cb1 [PATCH] as-iosched: kill arq
Get rid of the as_rq request type. With the added elevator_private2, we
have enough room in struct request to get rid of any arq allocation/free
for each request.

Signed-off-by: Jens Axboe <axboe@suse.de>
Signed-off-by: Nick Piggin <npiggin@suse.de>
2006-09-30 20:27:02 +02:00
Jens Axboe 5e70537479 [PATCH] cfq-iosched: kill crq
Get rid of the cfq_rq request type. With the added elevator_private2, we
have enough room in struct request to get rid of any crq allocation/free
for each request.

Signed-off-by: Jens Axboe <axboe@suse.de>
2006-09-30 20:27:02 +02:00
Jens Axboe 5380a101d3 [PATCH] cfq-iosched: remove the crq flag functions/variable
There's just one flag currently (SYNC), and that one can be grabbed from
the request.

Signed-off-by: Jens Axboe <axboe@suse.de>
2006-09-30 20:27:01 +02:00
Jens Axboe 8840faa1ee [PATCH] deadline-iosched: remove elevator private drq request type
A big win, we now save an allocation/free on each request! With the
previous rb/hash abstractions, we can just reuse queuelist/donelist
for the FIFO data and be done with it.

Signed-off-by: Jens Axboe <axboe@suse.de>
2006-09-30 20:27:00 +02:00
Jens Axboe 9e2585a8a2 [PATCH] as-iosched: remove arq->is_sync member
We can track this in struct request.

Signed-off-by: Jens Axboe <axboe@suse.de>
Signed-off-by: Nick Piggin <npiggin@suse.de>
2006-09-30 20:27:00 +02:00
Jens Axboe d4f2f4629e [PATCH] as-iosched: reuse rq for fifo
Saves some space in arq.

Signed-off-by: Jens Axboe <axboe@suse.de>
Signed-off-by: Nick Piggin <npiggin@suse.de>
2006-09-30 20:27:00 +02:00
Jens Axboe 95e8810b28 [PATCH] cfq-iosched: convert to using the FIFO elevator defines
Signed-off-by: Jens Axboe <axboe@suse.de>
2006-09-30 20:26:59 +02:00
Jens Axboe b8aca35af5 [PATCH] deadline-iosched: migrate to using the elevator rb functions
This removes the rbtree handling from deadline.

Signed-off-by: Jens Axboe <axboe@suse.de>
2006-09-30 20:26:58 +02:00
Jens Axboe 21183b07ee [PATCH] cfq-iosched: migrate to using the elevator rb functions
This removes the rbtree handling from CFQ.

Signed-off-by: Jens Axboe <axboe@suse.de>
2006-09-30 20:26:58 +02:00
Jens Axboe e37f346e34 [PATCH] as-iosched: migrate to using the elevator rb functions
This removes the rbtree handling from AS.

Signed-off-by: Jens Axboe <axboe@suse.de>
Signed-off-by: Nick Piggin <npiggin@suse.de>
2006-09-30 20:26:57 +02:00
Jens Axboe 2e662b65f0 [PATCH] elevator: abstract out the rbtree sort handling
The rbtree sort/lookup/reposition logic is mostly duplicated in
cfq/deadline/as, so move it to the elevator core. The io schedulers
still provide the actual rb root, as we don't want to impose any sort
of specific handling on the schedulers.

Introduce the helpers and rb_node in struct request to help migrate the
IO schedulers.

Signed-off-by: Jens Axboe <axboe@suse.de>
2006-09-30 20:26:57 +02:00
Jens Axboe 10fd48f237 [PATCH] rbtree: fixed reversed RB_EMPTY_NODE and rb_next/prev
The conditions got reserved. Also make rb_next() and rb_prev() check
for the empty condition.

Signed-off-by: Jens Axboe <axboe@suse.de>
2006-09-30 20:26:56 +02:00
Jens Axboe 9817064b68 [PATCH] elevator: move the backmerging logic into the elevator core
Right now, every IO scheduler implements its own backmerging (except for
noop, which does no merging). That results in duplicated code for
essentially the same operation, which is never a good thing. This patch
moves the backmerging out of the io schedulers and into the elevator
core. We save 1.6kb of text and as a bonus get backmerging for noop as
well. Win-win!

Signed-off-by: Jens Axboe <axboe@suse.de>
2006-09-30 20:26:56 +02:00
Jens Axboe 4aff5e2333 [PATCH] Split struct request ->flags into two parts
Right now ->flags is a bit of a mess: some are request types, and
others are just modifiers. Clean this up by splitting it into
->cmd_type and ->cmd_flags. This allows introduction of generic
Linux block message types, useful for sending generic Linux commands
to block devices.

Signed-off-by: Jens Axboe <axboe@suse.de>
2006-09-30 20:23:37 +02:00
Alexey Dobriyan 6c5c934153 [PATCH] ifdef blktrace debugging fields
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Jens Axboe <axboe@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:09 -07:00
Randy Dunlap 87a5726110 [PATCH] block: handle subsystem_register() init errors
Check and handle init errors.

Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:05 -07:00
Theodore Ts'o 8e18e2941c [PATCH] inode_diet: Replace inode.u.generic_ip with inode.i_private
The following patches reduce the size of the VFS inode structure by 28 bytes
on a UP x86.  (It would be more on an x86_64 system).  This is a 10% reduction
in the inode size on a UP kernel that is configured in a production mode
(i.e., with no spinlock or other debugging functions enabled; if you want to
save memory taken up by in-core inodes, the first thing you should do is
disable the debugging options; they are responsible for a huge amount of bloat
in the VFS inode structure).

This patch:

The filesystem or device-specific pointer in the inode is inside a union,
which is pretty pointless given that all 30+ users of this field have been
using the void pointer.  Get rid of the union and rename it to i_private, with
a comment to explain who is allowed to use the void pointer.  This is just a
cleanup, but it allows us to reuse the union 'u' for something something where
the union will actually be used.

[judith@osdl.org: powerpc build fix]
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Judith Lebzelter <judith@osdl.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-27 08:26:17 -07:00
James Bottomley 1aedf2ccc6 Merge mulgrave-w:git/linux-2.6
Conflicts:

	include/linux/blkdev.h

Trivial merge to incorporate tag prototypes.
2006-09-23 21:03:52 -05:00
Trond Myklebust 275a082fe9 Add a real API for dealing with blk_congestion_wait()
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2006-09-22 23:24:54 -04:00
James Bottomley 492dfb4896 [SCSI] block: add support for shared tag maps
The current block queue implementation already contains most of the
machinery for shared tag maps.  The only remaining pieces are a way to
allocate and destroy a tag map independently of the queues (so that
the maps can be managed on the life cycle of the overseeing entity)

Acked-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: James Bottomley <James.Bottomley@SteelEye.com>
2006-08-31 11:17:18 -04:00
Oleg Nesterov 2d8f613160 elv_unregister: fix possible crash on module unload
An exiting task or process which didn't do I/O yet have no io context,
elv_unregister() should check it is not NULL.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: Jens Axboe <axboe@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2006-08-22 12:52:23 -07:00
Oleg Nesterov be33c3a67b [PATCH] cfq_cic_link: fix usage of wrong cfq_io_context
Obviously, cfq_cic_link() shouldn't free a just allocated cfq_io_context?
The dead key is from __cic, so drop that.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Jens Axboe <axboe@suse.de>
2006-08-21 10:02:54 +02:00
Oleg Nesterov 9f83e45eb5 [PATCH] Fix current_io_context() vs set_task_ioprio() race
I know nothing about io scheduler, but I suspect set_task_ioprio() is not safe.

current_io_context() initializes "struct io_context", then sets ->io_context.
set_task_ioprio() running on another cpu may see the changes out of order, so
->set_ioprio(ioc) may use io_context which was not initialized properly.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Jens Axboe <axboe@suse.de>
2006-08-21 08:34:15 +02:00
Jens Axboe 44eb123126 [PATCH] cfq-iosched: don't use a hard jiffies value, translate from msecs
The CIC_SEEKY() test really wants to use the minimum of either:

- 2 msecs (not jiffies)

- or, the pending slice time

So code it like that.

Signed-off-by: Jens Axboe <axboe@suse.de>
2006-07-25 15:05:21 +02:00
Milton Miller ad01b1ca79 [PATCH] blktrace: fix read-ahead bit
It should be toggling the same bit on and off, fix it up.

Signed-off-by: Jens Axboe <axboe@suse.de>
2006-07-25 15:04:13 +02:00
Arjan van de Ven ddca60c590 [PATCH] lockdep: annotate the BLKPG_DEL_PARTITION ioctl
The delete partition IOCTL takes the bd_mutex for both the disk and the
partition; these have an obvious hierarchical relationship and this patch
annotates this relationship for lockdep.

Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-14 21:53:53 -07:00
Jens Axboe 1959d21232 [PATCH] Only the first two bits in bio->bi_rw and rq->flags match
Not three, as assumed. This causes the barrier bit to be needlessly set
for some IO.

Signed-off-by: Jens Axboe <axboe@suse.de>
2006-07-06 10:18:05 +02:00
Nathan Scott 40359ccb83 [PATCH] blktrace: readahead support
Provide the needed kernel support for distinguishing readahead
from regular read requests when tracing block devices.

Signed-off-by: Nathan Scott <nathans@sgi.com>
Signed-off-by: Jens Axboe <axboe@suse.de>
2006-07-06 10:03:28 +02:00
Ingo Molnar 60be6b9a41 [PATCH] lockdep: annotate on-stack completions
lockdep needs to have the waitqueue lock initialized for on-stack waitqueues
implicitly initialized by DECLARE_COMPLETION().  Annotate on-stack completions
accordingly.

Has no effect on non-lockdep kernels.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 15:27:09 -07:00
Linus Torvalds 22a3e233ca Merge git://git.kernel.org/pub/scm/linux/kernel/git/bunk/trivial
* git://git.kernel.org/pub/scm/linux/kernel/git/bunk/trivial:
  Remove obsolete #include <linux/config.h>
  remove obsolete swsusp_encrypt
  arch/arm26/Kconfig typos
  Documentation/IPMI typos
  Kconfig: Typos in net/sched/Kconfig
  v9fs: do not include linux/version.h
  Documentation/DocBook/mtdnand.tmpl: typo fixes
  typo fixes: specfic -> specific
  typo fixes in Documentation/networking/pktgen.txt
  typo fixes: occuring -> occurring
  typo fixes: infomation -> information
  typo fixes: disadvantadge -> disadvantage
  typo fixes: aquire -> acquire
  typo fixes: mecanism -> mechanism
  typo fixes: bandwith -> bandwidth
  fix a typo in the RTC_CLASS help text
  smb is no longer maintained

Manually merged trivial conflict in arch/um/kernel/vmlinux.lds.S
2006-06-30 15:39:30 -07:00
Christoph Lameter f8891e5e1f [PATCH] Light weight event counters
The remaining counters in page_state after the zoned VM counter patches
have been applied are all just for show in /proc/vmstat.  They have no
essential function for the VM.

We use a simple increment of per cpu variables.  In order to avoid the most
severe races we disable preempt.  Preempt does not prevent the race between
an increment and an interrupt handler incrementing the same statistics
counter.  However, that race is exceedingly rare, we may only loose one
increment or so and there is no requirement (at least not in kernel) that
the vm event counters have to be accurate.

In the non preempt case this results in a simple increment for each
counter.  For many architectures this will be reduced by the compiler to a
single instruction.  This single instruction is atomic for i386 and x86_64.
 And therefore even the rare race condition in an interrupt is avoided for
both architectures in most cases.

The patchset also adds an off switch for embedded systems that allows a
building of linux kernels without these counters.

The implementation of these counters is through inline code that hopefully
results in only a single instruction increment instruction being emitted
(i386, x86_64) or in the increment being hidden though instruction
concurrency (EPIC architectures such as ia64 can get that done).

Benefits:
- VM event counter operations usually reduce to a single inline instruction
  on i386 and x86_64.
- No interrupt disable, only preempt disable for the preempt case.
  Preempt disable can also be avoided by moving the counter into a spinlock.
- Handling is similar to zoned VM counters.
- Simple and easily extendable.
- Can be omitted to reduce memory use for embedded use.

References:

RFC http://marc.theaimsgroup.com/?l=linux-kernel&m=113512330605497&w=2
RFC http://marc.theaimsgroup.com/?l=linux-kernel&m=114988082814934&w=2
local_t http://marc.theaimsgroup.com/?l=linux-kernel&m=114991748606690&w=2
V2 http://marc.theaimsgroup.com/?t=115014808400007&r=1&w=2
V3 http://marc.theaimsgroup.com/?l=linux-kernel&m=115024767022346&w=2
V4 http://marc.theaimsgroup.com/?l=linux-kernel&m=115047968808926&w=2

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-30 11:25:36 -07:00
Jörn Engel 6ab3d5624e Remove obsolete #include <linux/config.h>
Signed-off-by: Jörn Engel <joern@wohnheim.fh-wedel.de>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
2006-06-30 19:25:36 +02:00
Chandra Seetharaman 5a67e4c5b6 [PATCH] cpu hotplug: use hotplug version of cpu notifier in appropriate places
Make use the of newly defined hotplug version of cpu_notifier functionality
wherever appropriate.

Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Cc: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:41 -07:00
Chandra Seetharaman 054cc8a2d8 [PATCH] cpu hotplug: revert initdata patch submitted for 2.6.17
This patch reverts notifier_block changes made in 2.6.17

Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Cc: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:41 -07:00
Andreas Mohr d6e05edc59 spelling fixes
acquired (aquired)
contiguous (contigious)
successful (succesful, succesfull)
surprise (suprise)
whether (weather)
some other misspellings

Signed-off-by: Andreas Mohr <andi@lisas.de>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
2006-06-26 18:35:02 +02:00
Andi Kleen 8269730b38 [BLOCK] Fix bounce limit address check
Do a safer check for when to enable DMA. Currently we enable ISA DMA
for cases that do not need it, resulting in OOM conditions when ZONE_DMA
runs out of space.

Signed-off-by: Jens Axboe <axboe@suse.de>
2006-06-23 17:10:39 +02:00
Jens Axboe dd67d05152 [PATCH] rbtree: support functions used by the io schedulers
They all duplicate macros to check for empty root and/or node, and
clearing a node. So put those in rbtree.h.

Signed-off-by: Jens Axboe <axboe@suse.de>
2006-06-23 17:10:39 +02:00
Jens Axboe fd61af0384 [PATCH] cfq-iosched: rq update fixes
- Remember to set ->last_sector so that the cfq_choose_req() logic
  works correctly.

- Remove redundant call to cfq_choose_req()

Signed-off-by: Jens Axboe <axboe@suse.de>
2006-06-23 17:10:39 +02:00
Jens Axboe caaa5f9f0a [PATCH] cfq-iosched: many performance fixes
This is a collection of patches that greatly improve CFQ performance
in some circumstances.

- Change the idling logic to only kick in after a request is done and we
  are deciding what to do. Before the idling included the request service
  time, so it was hard to adjust. Now it's true think/idle time.

- Take advantage of TCQ/NCQ/queueing for seeky sync workloads, but keep
  it in control for sync and sequential (or close to) workloads.

- Expire queues immediately and move on to other busy queues, if we are
  not going to idle after the current one finishes.

- Don't rearm idle timer if there are no busy queues. Just leave the
  system idle.

Signed-off-by: Jens Axboe <axboe@suse.de>
2006-06-23 17:10:39 +02:00
Jens Axboe 35e6077cb1 [PATCH] cfq-iosched: correctly set ioprio on both targets
Patch originally from Vasily Tarasov <vtaras@sw.ru>

If you set io-priority of process 1 using sys_ioprio_set system call by
another process 2 (like ionice do), then cfq_init_prio_data() function
sets priority of process 2 (current) on queue of process 1 and clears
the flag, that designates change of ioprio.  So the process  1 will work
like with priority of process 2.

I propose not to call cfq_init_prio_data() on io-priority change, but
only mark queue as queue with changed prority.  Every time when new
request comes cfq-scheduler checks for this flag and atomaticaly changes
priority of queue to new value.

Signed-off-by: Jens Axboe <axboe@suse.de>
2006-06-23 17:10:39 +02:00
Jens Axboe b17fd9bceb [PATCH] Make CFQ the default IO scheduler
Signed-off-by: Jens Axboe <axboe@suse.de>
2006-06-23 17:10:39 +02:00
Jens Axboe b31dc66a54 [PATCH] Kill PF_SYNCWRITE flag
A process flag to indicate whether we are doing sync io is incredibly
ugly. It also causes performance problems when one does a lot of async
io and then proceeds to sync it. Part of the io will go out as async,
and the other part as sync. This causes a disconnect between the
previously submitted io and the synced io. For io schedulers such as CFQ,
this will cause us lost merges and suboptimal behaviour in scheduling.

Remove PF_SYNCWRITE completely from the fsync/msync paths, and let
the O_DIRECT path just directly indicate that the writes are sync
by using WRITE_SYNC instead.

Signed-off-by: Jens Axboe <axboe@suse.de>
2006-06-23 17:10:39 +02:00
Jens Axboe 271f18f102 [PATCH] cfq-iosched: Don't set the queue batching limits
We cannot update them if the user changes nr_requests, so don't
set it in the first place. The gains are pretty questionable as
well. The batching loss has been shown to decrease throughput.

Signed-off-by: Jens Axboe <axboe@suse.de>
2006-06-23 17:10:38 +02:00
Dave Jones acf4217555 [PATCH] remove dead code from elevator switching
We already drop the refcount in elevator_exit(), and as
we're setting 'e' to NULL, we'll never take that branch anyway.
Finally, as 'e' is a local var that isn't referenced afterwards,
setting it to NULL is pointless.

Signed-off-by: Dave Jones <davej@redhat.com>
Signed-off-by: Jens Axboe <axboe@suse.de>
2006-06-23 17:10:38 +02:00
Paolo 'Blaisorblade' Giarrusso a038e25364 [PATCH] blk_start_queue() must be called with irq disabled - add warning
The queue lock can be taken from interrupts so it must always be taken with
irq disabling primitives.  Some primitives already verify this.
blk_start_queue() is called under this lock, so interrupts must be
disabled.

Also document this requirement clearly in blk_init_queue(), where the queue
spinlock is set.

Signed-off-by: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Jens Axboe <axboe@suse.de>
2006-06-23 17:10:38 +02:00
Akinobu Mita bae386f788 [PATCH] iosched: use hlist for request hashtable
Use hlist instead of list_head for request hashtable in deadline-iosched
and as-iosched. It also can remove the flag to know hashed or unhashed.

Signed-off-by: Akinobu Mita <mita@miraclelinux.com>
Signed-off-by: Jens Axboe <axboe@suse.de>

 block/as-iosched.c       |   45 +++++++++++++++++++--------------------------
 block/deadline-iosched.c |   39 ++++++++++++++++-----------------------
 2 files changed, 35 insertions(+), 49 deletions(-)
2006-06-23 17:10:38 +02:00
Oleg Nesterov 626ab0e69d [PATCH] list: use list_replace_init() instead of list_splice_init()
list_splice_init(list, head) does unneeded job if it is known that
list_empty(head) == 1.  We can use list_replace_init() instead.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 07:43:07 -07:00