Deadline Scheduler

Deadline Scheduler is an I/O scheduler, or disk scheduler, for the Linux kernel. It was written in 2002 by Jens Axboe.

Overview
The main purpose of Deadline Scheduler is to guarantee a start service time for a request. The scheduler imposes a deadline on all I/O operations to prevent starvation of requests. It maintains two deadline queues (one for reads and one for writes) in addition to two sorted queues (again, one for reads and one for writes). Deadline queues are ordered by expiration time, while the sorted queues are ordered by sector number.

Before serving the next request, Deadline Scheduler decides which queue to use. Read queues are given a higher priority, because processes usually block on read operations. Next, Deadline Scheduler checks whether the first request in the deadline queue has expired. If it has, that request is served first; otherwise, the scheduler continues with requests from the sorted queue. In both cases, the scheduler also serves a batch of requests following the chosen request in the sorted queue.

By default, read requests have an expiration time of 500 ms, and write requests expire in 5 seconds.
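The selection logic described above can be sketched as a toy model. This is a minimal illustration, not the kernel implementation: the class and method names are hypothetical, batching and the write starvation counter are left out, and the non-expired case simply picks the lowest pending sector instead of continuing from the last dispatched position.

```python
from collections import deque
from dataclasses import dataclass

READ, WRITE = "read", "write"
EXPIRE_MS = {READ: 500, WRITE: 5000}  # default expiration times quoted in the text


@dataclass(eq=False)  # identity comparison, so equal sectors stay distinct requests
class Request:
    sector: int
    direction: str
    deadline_ms: int


class DeadlineModel:
    """Toy model: one deadline (FIFO) queue and one sorted queue per direction."""

    def __init__(self):
        self.fifo = {READ: deque(), WRITE: deque()}
        self.sorted = {READ: [], WRITE: []}

    def add(self, sector, direction, now_ms):
        req = Request(sector, direction, now_ms + EXPIRE_MS[direction])
        self.fifo[direction].append(req)  # arrival order equals deadline order
        self.sorted[direction].append(req)
        self.sorted[direction].sort(key=lambda r: r.sector)  # ordered by sector

    def pick(self, now_ms):
        """Return the request that would start the next batch, or None."""
        # Reads are preferred whenever any read is pending.
        direction = READ if self.fifo[READ] else WRITE
        if not self.fifo[direction]:
            return None
        oldest = self.fifo[direction][0]
        if oldest.deadline_ms <= now_ms:
            chosen = oldest  # expired: serve it first
        else:
            chosen = self.sorted[direction][0]  # simplification: lowest sector
        self.fifo[direction].remove(chosen)
        self.sorted[direction].remove(chosen)
        return chosen
```

In this model a read at sector 900 that has waited 500 ms is picked ahead of a newer read at sector 100, even though sector order alone would favor the latter.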

A rough version of the scheduler was published on the Linux Kernel Mailing List by Axboe in January 2002.

Measurements have shown that the deadline I/O scheduler outperforms the CFQ I/O scheduler for certain multi-threaded workloads.

fifo_batch (integer)
Deadline executes I/O operations (IOPs) through the concept of "batches", which are sets of operations ordered by increasing sector number. This tunable determines how big a batch must be before its requests are queued to the disk (barring expiration of a batch that is still being built). Smaller batches can reduce latency by ensuring new requests are executed sooner (rather than possibly waiting for more requests to come in), but may degrade overall throughput by increasing the overall movement of drive heads (since sequencing happens within a batch and not between batches). Additionally, if the number of IOPs is high enough, the batches will be executed in a timely fashion anyway.
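The throughput cost of small batches can be illustrated with a rough calculation. The sketch below is an assumption-laden simplification: the function name is hypothetical, batches are formed by cutting the arrival stream into fixed-size chunks, and head travel is measured as the absolute difference between sector numbers.

```python
def head_movement(sectors, batch_size, start=0):
    """Total head travel when requests are sorted only within each batch."""
    position, travelled = start, 0
    for i in range(0, len(sectors), batch_size):
        # Sequencing happens inside a batch, not across batches.
        for sector in sorted(sectors[i:i + batch_size]):
            travelled += abs(sector - position)
            position = sector
    return travelled


arrivals = [80, 10, 70, 20, 60, 30, 50, 40]
print(head_movement(arrivals, 2))  # 290
print(head_movement(arrivals, 4))  # 160
print(head_movement(arrivals, 8))  # 80
```

For this arrival pattern, larger batches reduce total head travel, while with a batch size of 2 the first requests are issued without waiting for six further arrivals.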

read_expire (integer)
'read_expire' is the time in milliseconds after which a read request is considered 'expired'. A read request should ideally be served before it expires, but Deadline Scheduler does not guarantee that all I/O is issued before its expiration time. Once an I/O is past its expiration, however, it is prioritized.

The read expiration queue is only checked when Deadline Scheduler re-evaluates the read queues. For read requests, this happens each time a sorted read request is dispatched, except in the case of streaming I/O: while the scheduler is streaming I/O from the read queue, the read expired queue is not evaluated. If there are expired reads, the first one is pulled from the FIFO. This expired read then becomes the new starting point for read sort ordering. The cached next pointer will be set to point to the next I/O from the sort queue after this expired one…. Note that the algorithm does not simply execute all expired I/O once it is past its expiration time. This allows reasonable performance to be maintained by batching up 'writes_starved' sorted reads together before the expired read queue is checked again.

The maximum number of I/Os that can be performed between expired read I/Os is 2 * 'fifo_batch' * 'writes_starved': one set of 'fifo_batch' streaming reads after the first expired read I/O and, if this stream happens to cause the write starved condition, possibly another 'fifo_batch' streaming writes. This is the worst case, after which the read expired queue is re-evaluated. At best, the expired read queue is evaluated 'writes_starved' times in a row before being skipped because the write queue is used instead.
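As a worked example of the bound stated above, the following sketch evaluates the formula for one pair of settings. The function name is hypothetical, and the values 16 and 2 are assumed here as the commonly documented defaults for 'fifo_batch' and 'writes_starved'; check the actual values on the system in question.

```python
def max_io_between_expired_reads(fifo_batch, writes_starved):
    """Worst-case bound from the text: 2 * fifo_batch * writes_starved."""
    return 2 * fifo_batch * writes_starved


# Assumed default values: fifo_batch=16, writes_starved=2.
print(max_io_between_expired_reads(16, 2))  # 64
```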

write_expire (integer)
'write_expire' has the same function as 'read_expire', but applies to write operations (which are grouped into separate batches from read requests).
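The current expiration settings can be inspected programmatically. The sketch below assumes the tunables are exposed as one small text file each under a per-device directory such as /sys/block/<device>/queue/iosched/, which is the usual location when this scheduler is active; the function name and the device name in the usage note are hypothetical.

```python
from pathlib import Path


def read_tunables(iosched_dir, names=("read_expire", "write_expire")):
    """Return {name: int} for each named tunable file found in iosched_dir."""
    values = {}
    for name in names:
        path = Path(iosched_dir) / name
        if path.is_file():  # skip tunables this scheduler does not expose
            values[name] = int(path.read_text().strip())
    return values
```

A call such as read_tunables("/sys/block/sda/queue/iosched") would then return the two expiration times in milliseconds, assuming a device named sda exists and uses this scheduler.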

writes_starved (integer)
Deadline Scheduler prioritizes read requests over write requests, which can lead to situations where the operations executed are almost entirely read requests. 'writes_starved' limits how many times reads may be preferred over pending writes. It becomes a more important tunable as 'write_expire' is lengthened or as overall bandwidth approaches saturation. Decreasing it gives more bandwidth to writes (relatively speaking) at the expense of read operations. If the application workload is read-heavy (for example, most HTTP or directory servers) with only an occasional write, lower average latency of IOPs may be achieved by increasing it (so that more reads must be performed before a write batch is queued to disk).
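The effect of this tunable on the interleaving of read and write batches can be sketched as follows. This is a simplified model under stated assumptions: the function name is hypothetical, work is counted in whole batches, expiration is ignored, and the starvation counter is incremented only when a read batch is chosen while writes are waiting.

```python
def dispatch_order(read_batches, write_batches, writes_starved):
    """Return the order of batch directions as a string of 'R' and 'W'."""
    order, starved = [], 0
    while read_batches or write_batches:
        # Prefer reads unless pending writes have already been passed over
        # writes_starved times, or no reads remain.
        if read_batches and (not write_batches or starved < writes_starved):
            order.append("R")
            read_batches -= 1
            if write_batches:
                starved += 1
        else:
            order.append("W")
            write_batches -= 1
            starved = 0
    return "".join(order)


print(dispatch_order(5, 2, writes_starved=1))  # RWRWRRR
print(dispatch_order(5, 2, writes_starved=2))  # RRWRRWR
print(dispatch_order(5, 2, writes_starved=4))  # RRRRWRW
```

Lower values let write batches through sooner; higher values keep reads flowing longer before a write batch is dispatched.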

front_merges (bool integer)
A "front merge" is an operation where the I/O scheduler, seeking to condense (or "merge") smaller requests into fewer (larger) operations, takes a new operation, examines the queued operations, and attempts to locate one whose beginning sector immediately follows the new operation's ending sector, so that the new operation can be attached to its front. A "back merge" is the opposite: the queued operations are searched for one whose ending sector is immediately followed by the new operation's beginning sector, so that the new operation can be attached to its back. Merging lets a new operation join an operation that is already queued, decreasing "fairness" in order to increase throughput.
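The distinction between the two merge directions can be made concrete with a small check. This sketch assumes the conventional definitions (a back merge appends a new request to the end of a queued one, a front merge prepends it) and represents each request as a hypothetical (start_sector, sector_count) tuple; it is not the kernel's merging code.

```python
def merge_type(queued, new):
    """Classify how `new` could merge with `queued`; both are (start, count)."""
    q_start, q_len = queued
    n_start, n_len = new
    if n_start == q_start + q_len:
        return "back"   # new request begins where the queued one ends
    if n_start + n_len == q_start:
        return "front"  # new request ends where the queued one begins
    return None         # not contiguous, no merge possible


print(merge_type((100, 8), (108, 8)))  # back
print(merge_type((100, 8), (92, 8)))   # front
print(merge_type((100, 8), (120, 8)))  # None
```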

Due to the way files are typically laid out, back merges are much more common than front merges. For some workloads, time may be wasted while attempting to front merge requests. Setting 'front_merges' to 0 disables this functionality. Front merges may still occur due to the cached 'last_merge' hint, but since that check comes at basically zero cost, it is still performed. This boolean simply disables the front sector lookup when the I/O scheduler merging function is called. Disk merge totals are recorded per block device in '/proc/diskstats'.
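The merge totals mentioned above can be extracted with a few lines of parsing. The sketch assumes the usual '/proc/diskstats' layout, in which each line starts with the major number, minor number and device name, followed by the statistics fields, with "reads merged" as the second statistic and "writes merged" as the sixth. The function name and the sample line are hypothetical.

```python
def merge_counts(diskstats_text):
    """Map device name -> (reads_merged, writes_merged)."""
    counts = {}
    for line in diskstats_text.splitlines():
        fields = line.split()
        if len(fields) < 9:
            continue  # skip blank or truncated lines
        # fields[0:3] are major, minor, name; statistics start at fields[3].
        counts[fields[2]] = (int(fields[4]), int(fields[8]))
    return counts


sample = "   8       0 sda 1200 34 56000 800 900 78 43000 1500 0 1000 2300"
print(merge_counts(sample))  # {'sda': (34, 78)}
```

On a live system the input could be obtained with open("/proc/diskstats").read(). The counters include both front and back merges, so comparing totals with 'front_merges' set to 1 and to 0 under the same workload is one rough way to judge whether front merging is worthwhile.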

Other I/O schedulers

 * CFQ scheduler
 * Anticipatory scheduler
 * Noop scheduler