194 lines
8.6 KiB
Plaintext
194 lines
8.6 KiB
Plaintext
CFQ (Complete Fairness Queueing)
|
|
===============================
|
|
|
|
The main aim of CFQ scheduler is to provide a fair allocation of the disk
|
|
I/O bandwidth for all the processes which requests an I/O operation.
|
|
|
|
CFQ maintains the per process queue for the processes which request I/O
|
|
operation(syncronous requests). In case of asynchronous requests, all the
|
|
requests from all the processes are batched together according to their
|
|
process's I/O priority.
|
|
|
|
CFQ ioscheduler tunables
|
|
========================
|
|
|
|
slice_idle
|
|
----------
|
|
This specifies how long CFQ should idle for next request on certain cfq queues
|
|
(for sequential workloads) and service trees (for random workloads) before
|
|
queue is expired and CFQ selects next queue to dispatch from.
|
|
|
|
By default slice_idle is a non-zero value. That means by default we idle on
|
|
queues/service trees. This can be very helpful on highly seeky media like
|
|
single spindle SATA/SAS disks where we can cut down on overall number of
|
|
seeks and see improved throughput.
|
|
|
|
Setting slice_idle to 0 will remove all the idling on queues/service tree
|
|
level and one should see an overall improved throughput on faster storage
|
|
devices like multiple SATA/SAS disks in hardware RAID configuration. The down
|
|
side is that isolation provided from WRITES also goes down and notion of
|
|
IO priority becomes weaker.
|
|
|
|
So depending on storage and workload, it might be useful to set slice_idle=0.
|
|
In general I think for SATA/SAS disks and software RAID of SATA/SAS disks
|
|
keeping slice_idle enabled should be useful. For any configurations where
|
|
there are multiple spindles behind single LUN (Host based hardware RAID
|
|
controller or for storage arrays), setting slice_idle=0 might end up in better
|
|
throughput and acceptable latencies.
|
|
|
|
back_seek_max
|
|
-------------
|
|
This specifies, given in Kbytes, the maximum "distance" for backward seeking.
|
|
The distance is the amount of space from the current head location to the
|
|
sectors that are backward in terms of distance.
|
|
|
|
This parameter allows the scheduler to anticipate requests in the "backward"
|
|
direction and consider them as being the "next" if they are within this
|
|
distance from the current head location.
|
|
|
|
back_seek_penalty
|
|
-----------------
|
|
This parameter is used to compute the cost of backward seeking. If the
|
|
backward distance of request is just 1/back_seek_penalty from a "front"
|
|
request, then the seeking cost of two requests is considered equivalent.
|
|
|
|
So scheduler will not bias toward one or the other request (otherwise scheduler
|
|
will bias toward front request). Default value of back_seek_penalty is 2.
|
|
|
|
fifo_expire_async
|
|
-----------------
|
|
This parameter is used to set the timeout of asynchronous requests. Default
|
|
value of this is 248ms.
|
|
|
|
fifo_expire_sync
|
|
----------------
|
|
This parameter is used to set the timeout of synchronous requests. Default
|
|
value of this is 124ms. In case to favor synchronous requests over asynchronous
|
|
one, this value should be decreased relative to fifo_expire_async.
|
|
|
|
slice_async
|
|
-----------
|
|
This parameter is same as of slice_sync but for asynchronous queue. The
|
|
default value is 40ms.
|
|
|
|
slice_async_rq
|
|
--------------
|
|
This parameter is used to limit the dispatching of asynchronous request to
|
|
device request queue in queue's slice time. The maximum number of request that
|
|
are allowed to be dispatched also depends upon the io priority. Default value
|
|
for this is 2.
|
|
|
|
slice_sync
|
|
----------
|
|
When a queue is selected for execution, the queues IO requests are only
|
|
executed for a certain amount of time(time_slice) before switching to another
|
|
queue. This parameter is used to calculate the time slice of synchronous
|
|
queue.
|
|
|
|
time_slice is computed using the below equation:-
|
|
time_slice = slice_sync + (slice_sync/5 * (4 - prio)). To increase the
|
|
time_slice of synchronous queue, increase the value of slice_sync. Default
|
|
value is 100ms.
|
|
|
|
quantum
|
|
-------
|
|
This specifies the number of request dispatched to the device queue. In a
|
|
queue's time slice, a request will not be dispatched if the number of request
|
|
in the device exceeds this parameter. This parameter is used for synchronous
|
|
request.
|
|
|
|
In case of storage with several disk, this setting can limit the parallel
|
|
processing of request. Therefore, increasing the value can imporve the
|
|
performace although this can cause the latency of some I/O to increase due
|
|
to more number of requests.
|
|
|
|
CFQ IOPS Mode for group scheduling
|
|
===================================
|
|
Basic CFQ design is to provide priority based time slices. Higher priority
|
|
process gets bigger time slice and lower priority process gets smaller time
|
|
slice. Measuring time becomes harder if storage is fast and supports NCQ and
|
|
it would be better to dispatch multiple requests from multiple cfq queues in
|
|
request queue at a time. In such scenario, it is not possible to measure time
|
|
consumed by single queue accurately.
|
|
|
|
What is possible though is to measure number of requests dispatched from a
|
|
single queue and also allow dispatch from multiple cfq queue at the same time.
|
|
This effectively becomes the fairness in terms of IOPS (IO operations per
|
|
second).
|
|
|
|
If one sets slice_idle=0 and if storage supports NCQ, CFQ internally switches
|
|
to IOPS mode and starts providing fairness in terms of number of requests
|
|
dispatched. Note that this mode switching takes effect only for group
|
|
scheduling. For non-cgroup users nothing should change.
|
|
|
|
CFQ IO scheduler Idling Theory
|
|
===============================
|
|
Idling on a queue is primarily about waiting for the next request to come
|
|
on same queue after completion of a request. In this process CFQ will not
|
|
dispatch requests from other cfq queues even if requests are pending there.
|
|
|
|
The rationale behind idling is that it can cut down on number of seeks
|
|
on rotational media. For example, if a process is doing dependent
|
|
sequential reads (next read will come on only after completion of previous
|
|
one), then not dispatching request from other queue should help as we
|
|
did not move the disk head and kept on dispatching sequential IO from
|
|
one queue.
|
|
|
|
CFQ has following service trees and various queues are put on these trees.
|
|
|
|
sync-idle sync-noidle async
|
|
|
|
All cfq queues doing synchronous sequential IO go on to sync-idle tree.
|
|
On this tree we idle on each queue individually.
|
|
|
|
All synchronous non-sequential queues go on sync-noidle tree. Also any
|
|
request which are marked with REQ_NOIDLE go on this service tree. On this
|
|
tree we do not idle on individual queues instead idle on the whole group
|
|
of queues or the tree. So if there are 4 queues waiting for IO to dispatch
|
|
we will idle only once last queue has dispatched the IO and there is
|
|
no more IO on this service tree.
|
|
|
|
All async writes go on async service tree. There is no idling on async
|
|
queues.
|
|
|
|
CFQ has some optimizations for SSDs and if it detects a non-rotational
|
|
media which can support higher queue depth (multiple requests at in
|
|
flight at a time), then it cuts down on idling of individual queues and
|
|
all the queues move to sync-noidle tree and only tree idle remains. This
|
|
tree idling provides isolation with buffered write queues on async tree.
|
|
|
|
FAQ
|
|
===
|
|
Q1. Why to idle at all on queues marked with REQ_NOIDLE.
|
|
|
|
A1. We only do tree idle (all queues on sync-noidle tree) on queues marked
|
|
with REQ_NOIDLE. This helps in providing isolation with all the sync-idle
|
|
queues. Otherwise in presence of many sequential readers, other
|
|
synchronous IO might not get fair share of disk.
|
|
|
|
For example, if there are 10 sequential readers doing IO and they get
|
|
100ms each. If a REQ_NOIDLE request comes in, it will be scheduled
|
|
roughly after 1 second. If after completion of REQ_NOIDLE request we
|
|
do not idle, and after a couple of milli seconds a another REQ_NOIDLE
|
|
request comes in, again it will be scheduled after 1second. Repeat it
|
|
and notice how a workload can lose its disk share and suffer due to
|
|
multiple sequential readers.
|
|
|
|
fsync can generate dependent IO where bunch of data is written in the
|
|
context of fsync, and later some journaling data is written. Journaling
|
|
data comes in only after fsync has finished its IO (atleast for ext4
|
|
that seemed to be the case). Now if one decides not to idle on fsync
|
|
thread due to REQ_NOIDLE, then next journaling write will not get
|
|
scheduled for another second. A process doing small fsync, will suffer
|
|
badly in presence of multiple sequential readers.
|
|
|
|
Hence doing tree idling on threads using REQ_NOIDLE flag on requests
|
|
provides isolation from multiple sequential readers and at the same
|
|
time we do not idle on individual threads.
|
|
|
|
Q2. When to specify REQ_NOIDLE
|
|
A2. I would think whenever one is doing synchronous write and not expecting
|
|
more writes to be dispatched from same context soon, should be able
|
|
to specify REQ_NOIDLE on writes and that probably should work well for
|
|
most of the cases.
|