The aoe driver already had some congestion handling, but it was limited in
its ability to cope with the kind of congestion that can arise on more
complex networks such as those involving paths through multiple ethernet
switches.
Some of the lessons from TCP's history of development can be applied to
improving the congestion control and avoidance on AoE storage networks.
These changes use familar concepts from Van Jacobson's "Congestion
Avoidance and Control" paper from '88, without adding significant
overhead.
This patch depends on an upcoming patch that covers the failover case when
AoE commands being retransmitted are transferred from one retransmit queue
to another. Another upcoming patch increases the timing accuracy.
Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Make the aoe driver follow expected behavior when the user uses ioctl to
get the ATA device identify information, allowing access to model, serial
number, etc.
Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The ATA over Ethernet config query response contains a "buffer count"
field reflecting the AoE target's capacity to buffer incoming AoE
commands.
By taking the current value of this field into accound, we increase
performance throughput or avoid network congestion, when the value
has increased or decreased, respectively.
Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In general, specific is better when it comes to messages about AoE usage
problems. Also, explicit checks for the AoE broadcast addresses are
added.
Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The ATA over Ethernet protocol uses a major (shelf) and minor (slot)
address to identify a particular storage target. These changes remove an
artificial limitation the aoe driver imposes on the use of AoE addresses.
For example, without these changes, the slot address has a maximum of 15,
but users commonly use slot numbers much greater than that.
The AoE shelf and slot address space is often used sparsely. Instead of
using a static mapping between AoE addresses and the block device minor
number, the block device minor numbers are now allocated on demand.
Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The internal version number of the aoe driver appears in a console message
when the driver loads and is usually obtained by the user with the
userland aoe-version tool, part of the aoetools.[1]
Although this patchset includes bugfixes backported from higher-numbered
versions published on the coraid.com website, it is a form of version 49.
1. http://aoetools.sourceforge.net/
Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This change removes some unused code and attempts to increase code
consistency.
Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In the driver code, "target" and aoetgt refer to a particular remote
interface on the AoE storage target. The latter is identified by its AoE
major and minor addresses. Commands that are being sent to an AoE storage
target {major, minor} can be sent or retransmitted to any of the remote
MAC addresses associated with the AoE storage target.
That is, frames are naturally associated with not an aoetgt (AoE major,
AoE minor, remote MAC address) but an aoedev (AoE major, AoE minor).
Making the code reflect that reality simplifies the driver, especially
when the path to a remote MAC address becomes unusable.
Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The aoe_deadsecs module parameter allows the user to specify a hard limit
on the number of seconds an AoE command can be retransmitted before the
AoE block device is considered to have failed.
Using aoe_deadsecs to determine the time we try using a different remote
interface helps to ensure that the hard limit is not reached before we've
tried to recover by sending to a different remote port.
As a data storage target, the AoE target is unambiguously identified by
its {major, minor} AoE address tuple, and an AoE target can have multiple
MAC addresses. However, note that "target" in the driver code and
comments means a {major, minor, MAC address} tuple, as in "somewhere to
send packets".
Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Users with several network interfaces dedicated to AoE generally do not
configure them to support different-sized AoE data payloads on purpose.
For a given AoE target, there will be a set of local network interfaces
that can reach it. Using only the payload that will fit in the
smallest-sized MTU of all those local interfaces greatly simplifies the
driver, especially in failure scenarios.
Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The dev_queue_xmit function needs to have interrupts enabled, so the most
simple way to get the locking right but still fulfill that requirement is
to use a process that can call dev_queue_xmit serially over queued
transmissions.
Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
To allow users to choose an elevator algorithm for their particular
workloads, change from a make_request-style driver to an
I/O-request-queue-handler-style driver.
We have to do a couple of things that might be surprising. We manipulate
the page _count directly on the assumption that we still have no guarantee
that users of the block layer are prohibited from submitting bios
containing pages with zero reference counts.[1] If such a prohibition now
exists, I can get rid of the _count manipulation.
Just as before this patch, we still keep track of the sk_buffs that the
network layer still hasn't finished yet and cap the resources we use with
a "pool" of skbs.[2]
Now that the block layer maintains the disk stats, the aoe driver's
diskstats function can go away.
1. https://lkml.org/lkml/2007/3/1/374
2. https://lkml.org/lkml/2007/7/6/241
Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Make the frames the aoe driver uses to track the relationship between bios
and packets more flexible and detached, so that they can be passed to an
"aoe_ktio" thread for completion of I/O.
The frames are handled much like skbs, with a capped amount of
preallocation so that real-world use cases are likely to run smoothly and
degenerate gracefully even under memory pressure.
Decoupling I/O completion from the receive path and serializing it in a
process makes it easier to think about the correctness of the locking in
the driver, especially in the case of a remote MAC address becoming
unusable.
[dan.carpenter@oracle.com: cleanup an allocation a bit]
Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
tAdd adds the ability to work with large packets composed of a number of
segments, using the scatter gather feature of the block layer (biovecs)
and the network layer (skb frag array). The motivation is the performance
gained by using a packet data payload greater than a page size and by
using the network card's scatter gather feature.
Users of the out-of-tree aoe driver already had these changes, but since
early 2011, they have complained of increased memory utilization and
higher CPU utilization during heavy writes.[1] The commit below appears
related, as it disables scatter gather on non-IP protocols inside the
harmonize_features function, even when the NIC supports sg.
commit f01a5236bd
Author: Jesse Gross <jesse@nicira.com>
Date: Sun Jan 9 06:23:31 2011 +0000
net offloading: Generalize netif_get_vlan_features().
With that regression in place, transmits always linearize sg AoE packets,
but in-kernel users did not have this patch. Before 2.6.38, though, these
changes were working to allow sg to increase performance.
1. http://www.spinics.net/lists/linux-mm/msg15184.html
Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andy Whitcroft reported an oops in aoe triggered by use of an
incorrectly initialised request_queue object:
[ 2645.959090] kobject '<NULL>' (ffff880059ca22c0): tried to add
an uninitialized object, something is seriously wrong.
[ 2645.959104] Pid: 6, comm: events/0 Not tainted 2.6.31-5-generic #24-Ubuntu
[ 2645.959107] Call Trace:
[ 2645.959139] [<ffffffff8126ca2f>] kobject_add+0x5f/0x70
[ 2645.959151] [<ffffffff8125b4ab>] blk_register_queue+0x8b/0xf0
[ 2645.959155] [<ffffffff8126043f>] add_disk+0x8f/0x160
[ 2645.959161] [<ffffffffa01673c4>] aoeblk_gdalloc+0x164/0x1c0 [aoe]
The request queue of an aoe device is not used but can be allocated in
code that does not sleep.
Bruno bisected this regression down to
cd43e26f07
block: Expose stacked device queues in sysfs
"This seems to generate /sys/block/$device/queue and its contents for
everyone who is using queues, not just for those queues that have a
non-NULL queue->request_fn."
Addresses http://bugs.launchpad.net/bugs/410198
Addresses http://bugzilla.kernel.org/show_bug.cgi?id=13942
Note that embedding a queue inside another object has always been
an illegal construct, since the queues are reference counted and
must persist until the last reference is dropped. So aoe was
always buggy in this respect (Jens).
Signed-off-by: Ed Cashin <ecashin@coraid.com>
Cc: Andy Whitcroft <apw@canonical.com>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Bruno Premont <bonbons@linux-vserver.org>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
The Welland ME-747K-SI AoE target generates unsolicited AoE responses that
are marked as vendor extensions. Instead of ignoring these packets, the
aoe driver was generating kernel messages for each unrecognized response
received. This patch corrects the behavior.
Signed-off-by: Ed Cashin <ecashin@coraid.com>
Reported-by: <karaluh@karaluh.pl>
Tested-by: <karaluh@karaluh.pl>
Cc: <stable@kernel.org>
Cc: Alex Buell <alex.buell@munted.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add %pm to omit the colons when printing a mac address.
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove the no longer used aoedev_isbusy().
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Cc: "Ed L. Cashin" <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Update the year in the copyright notices.
Signed-off-by: Ed L. Cashin <ecashin@coraid.com>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
What this Patch Does
Even before this recent series of 12 patches to 2.6.22-rc4, the aoe
driver was reusing a small set of skbs that were allocated once and
were only used for outbound AoE commands.
The network layer cannot be allowed to put_page on the data that is
still associated with a bio we haven't returned to the block layer,
so the aoe driver (even before the patch under discussion) is still
the owner of skbs that have been handed to the network layer for
transmission. We need to keep track of these skbs so that we can
free them, but by tracking them, we can also easily re-use them.
The new patch was a response to the behavior of certain network
drivers. We cannot reuse an skb that the network driver still has
in its transmit ring. Network drivers can defer transmit ring
cleanup and then use the state in the skb to determine how many data
segments to clean up in its transmit ring. The tg3 driver is one
driver that behaves in this way.
When the network driver defers cleanup of its transmit ring, the aoe
driver can find itself in a situation where it would like to send an
AoE command, and the AoE target is ready for more work, but the
network driver still has all of the pre-allocated skbs. In that
case, the new patch just calls alloc_skb, as you'd expect.
We don't want to get carried away, though. We try not to do
excessive allocation in the write path, so we cap the number of skbs
we dynamically allocate.
Probably calling it a "dynamic pool" is misleading. We were already
trying to use a small fixed-size set of pre-allocated skbs before
this patch, and this patch just provides a little headroom (with a
ceiling, though) to accomodate network drivers that hang onto skbs,
by allocating when needed. The d->skbpool_hd list of allocated skbs
is necessary so that we can free them later.
We didn't notice the need for this headroom until AoE targets got
fast enough.
Alternatives
If the network layer never did a put_page on the pages in the bio's
we get from the block layer, then it would be possible for us to
hand skbs to the network layer and forget about them, allowing the
network layer to free skbs itself (and thereby calling our own
skb->destructor callback function if we needed that). In that case
we could get rid of the pre-allocated skbs and also the
d->skbpool_hd, instead just calling alloc_skb every time we wanted
to transmit a packet. The slab allocator would effectively maintain
the list of skbs.
Besides a loss of CPU cache locality, the main concern with that
approach the danger that it would increase the likelihood of
deadlock when VM is trying to free pages by writing dirty data from
the page cache through the aoe driver out to persistent storage on
an AoE device. Right now we have a situation where we have
pre-allocation that corresponds to how much we use, which seems
ideal.
Of course, there's still the separate issue of receiving the packets
that tell us that a write has successfully completed on the AoE
target. When memory is low and VM is using AoE to flush dirty data
to free up pages, it would be perfect if there were a way for us to
register a fast callback that could recognize write command
completion responses. But I don't think the current problems with
the receive side of the situation are a justification for
exacerbating the problem on the transmit side.
Signed-off-by: Ed L. Cashin <ecashin@coraid.com>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When an AoE device is detected, the kernel is informed, and a new block device
is created. If the device is unused, the block device corresponding to remote
device that is no longer available may be removed from the system by telling
the aoe driver to "flush" its list of devices.
Without this patch, software like GPFS and LVM may attempt to read from AoE
devices that were discovered earlier but are no longer present, blocking until
the I/O attempt times out.
Signed-off-by: Ed L. Cashin <ecashin@coraid.com>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
By returning unsigned long long, mac_addr does not generate compiler warnings
on 64-bit architectures.
Signed-off-by: Ed L. Cashin <ecashin@coraid.com>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A remote AoE device is something can process ATA commands and is identified by
an AoE shelf number and an AoE slot number. Such a device might have more
than one network interface, and it might be reachable by more than one local
network interface. This patch tracks the available network paths available to
each AoE device, allowing them to be used more efficiently.
Andrew Morton asked about the call to msleep_interruptible in the revalidate
function. Yes, if a signal is pending, then msleep_interruptible will not
return 0. That means we will not loop but will call aoenet_xmit with a NULL
skb, which is a noop. If the system is too low on memory or the aoe driver is
too low on frames, then the user can hit control-C to interrupt the attempt to
do a revalidate. I have added a comment to the code summarizing that.
Andrew Morton asked whether the allocation performed inside addtgt could use a
more relaxed allocation like GFP_KERNEL, but addtgt is called when the aoedev
lock has been locked with spin_lock_irqsave. It would be nice to allocate the
memory under fewer restrictions, but targets are only added when the device is
being discovered, and if the target can't be added right now, we can try again
in a minute when then next AoE config query broadcast goes out.
Andrew Morton pointed out that the "too many targets" message could be printed
for failing GFP_ATOMIC allocations. The last patch in this series makes the
messages more specific.
Signed-off-by: Ed L. Cashin <ecashin@coraid.com>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We can just use skb_mac_header now, and we don't need a wrapper function to
perform the cast. Instead of requiring the reader to check aoe.h to look
up what an aoe_hdr function does, I'd rather do without it.
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Some of the code has been gradually transitioned to using the proper
struct request_queue, but there's lots left. So do a full sweet of
the kernel and get rid of this typedef and replace its uses with
the proper type.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
For the places where we need a pointer to the mac header, it is still legal to
touch skb->mac.raw directly if just adding to, subtracting from or setting it
to another layer header.
This one also converts some more cases to skb_reset_mac_header() that my
regex missed as it had no spaces before nor after '=', ugh.
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
For consistency with other skb->mac.raw users.
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch addresses the concern that the aoe driver should
not introduce unecessary conventions that must be learned by
the reader. It reverts patch 6.
Signed-off-by: "Ed L. Cashin" <ecashin@coraid.com>
Acked-by: Alan Cox <alan@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Update aoe driver version number to 32.
Signed-off-by: "Ed L. Cashin" <ecashin@coraid.com>
Acked-by: Alan Cox <alan@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Use simple macros to clean up the printks.
(This patch is reverted by the 14th patch to follow.)
Signed-off-by: "Ed L. Cashin" <ecashin@coraid.com>
Acked-by: Alan Cox <alan@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Add support for jumbo ethernet frames.
(This patch depends on patch 7 to follow.)
Signed-off-by: "Ed L. Cashin" <ecashin@coraid.com>
Acked-by: Alan Cox <alan@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Avoid memory copy on writes.
(This patch depends on fixes in patch 9 to follow.)
Although skb->len should not be set when working with linear skbuffs,
the skb->tail pointer maintained by skb_put/skb_trim is not relevant
to what happens when the skb_fill_page_desc function is called. This
issue was raised without comment in linux-kernel and netdev earlier
this month:
http://thread.gmane.org/gmane.linux.kernel/446474/http://thread.gmane.org/gmane.linux.network/45444/
So until there is something analogous to skb_put that works for
zero-copy write skbuffs, we will do what the other callers of
skb_fill_page_desc are doing.
Signed-off-by: "Ed L. Cashin" <ecashin@coraid.com>
Acked-by: Alan Cox <alan@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Update the copyright year to 2006.
Signed-off-by: "Ed L. Cashin" <ecashin@coraid.com>
Acked-by: Alan Cox <alan@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Allow the driver to recognize AoE devices that have changed size.
Devices not in use are updated automatically, and devices that are in
use are updated at user request.
Signed-off-by: "Ed L. Cashin" <ecashin@coraid.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Change the number of supported AoE slot addresses per AoE shelf
address to 16.
Signed-off-by: Ed L. Cashin <ecashin@coraid.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>