Monday, January 5, 2015

Linux 2.6 IO Schedulers

The 2.6 kernel has four IO schedulers to choose from:

Deadline IO Scheduler - In addition to the standard IO queue sorted by block number, the scheduler maintains two additional queues: a read queue and a write queue.  A new request is inserted into the sorted queue and appended to the end of the read or write queue.  Each item in the read or write queue has an expiry time, and when it expires, the scheduler services the item at the head of that queue (which is the oldest, as insertion is in submission order).  In other words, the scheduler imposes a soft limit on the service time for each IO while minimizing seeks by using the sorted queue most of the time.
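
The two-level queueing described above can be sketched as a toy model in Python. This is only an illustration, not the kernel's code: the class name DeadlineSketch, its methods, and the expiry values (in seconds) are assumptions made for the example, and the sorted queue is simplified to always dispatch the lowest block number.

```python
import bisect
from collections import deque


class DeadlineSketch:
    """Toy model: one block-sorted queue plus read and write FIFO queues."""

    def __init__(self, read_expire=0.5, write_expire=5.0):
        # Illustrative expiry times in seconds (assumed values).
        self.expire = {"read": read_expire, "write": write_expire}
        self.sorted_queue = []  # (block, req_id), kept sorted by block number
        self.fifo = {"read": deque(), "write": deque()}  # (deadline, block, req_id)
        self.next_id = 0

    def submit(self, op, block, now):
        """Insert into the sorted queue and append to the matching FIFO."""
        req_id = self.next_id
        self.next_id += 1
        bisect.insort(self.sorted_queue, (block, req_id))
        self.fifo[op].append((now + self.expire[op], block, req_id))
        return req_id

    def dispatch(self, now):
        """Return (op, block) for the next request, or None if idle."""
        # An expired request at the head of a FIFO takes priority.
        for op in ("read", "write"):
            queue = self.fifo[op]
            if queue and queue[0][0] <= now:
                _, block, req_id = queue.popleft()
                self.sorted_queue.remove((block, req_id))
                return (op, block)
        if not self.sorted_queue:
            return None
        # Otherwise serve in block order to keep seeks short.
        block, req_id = self.sorted_queue.pop(0)
        for op, queue in self.fifo.items():
            for item in queue:
                if item[2] == req_id:
                    queue.remove(item)
                    return (op, block)
        return None


sched = DeadlineSketch()
sched.submit("read", 900, now=0.0)
sched.submit("read", 100, now=0.4)
# At t=0.6 the read for block 900 has expired, so it is served first
# even though block 100 sorts lower.
first = sched.dispatch(now=0.6)
second = sched.dispatch(now=0.6)
```

Here first is the expired read for block 900 and second is the read for block 100, which shows the deadline overriding the sorted order only when a request has waited too long.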

Anticipatory IO Scheduler - It is common program behaviour to issue successive read calls (writes are not relevant here, because they are usually asynchronous, whereas a process typically blocks on each read until it completes).  Therefore, after the deadline scheduler has serviced a read from the read queue and gone back to the sorted queue, another read may arrive shortly afterwards.  The result is that the disk arm constantly shifts back and forth between the sorted requests and the periodic read requests, and IO throughput suffers from the extra seeks.  The anticipatory scheduler starts off operating like the deadline scheduler.  However, it waits up to 6ms after a read IO to see if another read is coming in.  If so, it services that read next.  If not, it returns to the deadline scheduling routine and continues.
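
The waiting decision can be sketched roughly as follows. The function choose_next and its parameters are hypothetical names for this example; the 6ms window is the figure quoted above, and times are whole milliseconds to keep the arithmetic exact. A real scheduler cannot see future arrivals, so the list of incoming reads here simply stands in for "what shows up while we wait".

```python
def choose_next(read_done_ms, incoming_reads, sorted_queue, window_ms=6):
    """Decide what to service after a read completes at read_done_ms.

    incoming_reads: list of (arrival_ms, block) for reads that arrive later.
    sorted_queue:   list of pending block numbers in sorted order.
    """
    for arrival_ms, block in incoming_reads:
        # A read that arrives inside the anticipation window is served next.
        if read_done_ms <= arrival_ms <= read_done_ms + window_ms:
            return ("anticipated_read", block)
    # Nothing arrived in time: fall back to the sorted queue.
    if sorted_queue:
        return ("sorted_queue", sorted_queue[0])
    return None


hit = choose_next(1000, [(1004, 120)], [9000])   # read arrives after 4ms
miss = choose_next(1000, [(1020, 120)], [9000])  # read arrives after 20ms
```

In the first call the follow-up read lands inside the window and is served without moving away to block 9000; in the second the wait times out and the sorted queue resumes.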

CFQ Scheduler - A queue is set up for each process in the system.  The scheduler services each queue in turn for a timeslice.  When the timeslice ends, the scheduler moves on to the next process, which makes it "fair" for all processes in the system.  If a queue is empty and its timeslice has not ended, the scheduler waits up to 10ms in anticipation of another read coming in.  The CFQ scheduler also favours read requests over writes to avoid read starvation.  The CFQ scheduler is a good choice for most workloads.
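
A minimal sketch of the per-process round-robin idea, under a big simplification: the timeslice is counted in requests rather than time, and the idle wait and read preference are left out. The function cfq_order and the slice_len parameter are names invented for this example.

```python
from collections import deque


def cfq_order(requests, slice_len=2):
    """Return the dispatch order for (pid, block) requests.

    Each process gets its own queue; queues are visited in turn and each
    may dispatch at most slice_len requests per visit.
    """
    queues = {}  # pid -> deque of blocks, in first-seen order
    for pid, block in requests:
        queues.setdefault(pid, deque()).append(block)

    order = []
    while queues:
        for pid in list(queues):
            queue = queues[pid]
            for _ in range(min(slice_len, len(queue))):
                order.append((pid, queue.popleft()))
            if not queue:
                del queues[pid]  # nothing left for this process
    return order


order = cfq_order([(1, 10), (1, 11), (1, 12), (2, 50), (3, 90), (3, 91)])
# order == [(1, 10), (1, 11), (2, 50), (3, 90), (3, 91), (1, 12)]
```

Process 1 submitted three requests up front, but its third one has to wait until processes 2 and 3 have had their turn.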

Noop IO Scheduler - No sorting is performed, only merging of requests.  This is for devices that do not require requests to be sorted (such as SSDs, which have no seek penalty).
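
To illustrate "merging without sorting", here is a small sketch that keeps requests in arrival order and only joins a new request to a queued one when the two are contiguous on disk. The function noop_merge and the (start, length) request shape are assumptions for the example, not the kernel's data structures.

```python
def noop_merge(requests):
    """Merge contiguous (start_block, length) requests, keeping FIFO order."""
    queue = []
    for start, length in requests:
        for i, (q_start, q_len) in enumerate(queue):
            if q_start + q_len == start:
                # New request begins where a queued one ends: back merge.
                queue[i] = (q_start, q_len + length)
                break
            if start + length == q_start:
                # New request ends where a queued one begins: front merge.
                queue[i] = (start, q_len + length)
                break
        else:
            queue.append((start, length))  # no neighbour found, just append
    return queue


merged = noop_merge([(100, 8), (500, 8), (108, 8), (492, 8)])
# merged == [(100, 16), (492, 16)]
```

The four requests collapse into two larger ones, but the queue is never reordered by block number.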

The default scheduler is chosen using the boot option elevator (for example, elevator=deadline).  The scheduler can also be selected at runtime for each block device by writing the scheduler name to the file /sys/block/<block device name, e.g. hda>/queue/scheduler
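
Reading that file lists the available schedulers, with the active one shown in square brackets. A small sketch for parsing it is below; the helper names parse_schedulers and read_schedulers are made up for this example, and the sample string is an assumed listing from a 2.6 system.

```python
from pathlib import Path


def parse_schedulers(text):
    """Return (active, available) from the contents of queue/scheduler."""
    names = text.split()
    # The active scheduler is the entry wrapped in square brackets.
    active = next(
        (n[1:-1] for n in names if n.startswith("[") and n.endswith("]")),
        None,
    )
    available = [n.strip("[]") for n in names]
    return active, available


def read_schedulers(device):
    """Read the scheduler list for a block device such as 'hda'."""
    path = Path(f"/sys/block/{device}/queue/scheduler")
    return parse_schedulers(path.read_text())


# Parsing an assumed sample listing, no sysfs access needed:
active, available = parse_schedulers("noop anticipatory deadline [cfq]\n")
```

Switching scheduler is the reverse operation: as root, write one of the available names into the same file.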
