Saturday, April 6, 2013

Read and Write System Calls

A call to read(fd, buf, len) can have one of several outcomes:

(1) The call returns a value equal to len, and all len bytes are stored in buf.
(2) The call returns a value less than len but greater than 0.  This happens when the read is interrupted by a signal midway, an error occurs partway through, fewer than len bytes are available, or EOF is reached before len bytes have been read.  Issuing read() again for the remaining bytes either completes the request or reveals the cause of the short read.
(3) The call returns 0, which indicates EOF.
(4) The call blocks because no data is currently available to read.
(5) The call returns -1 with errno set to EINTR, meaning a signal interrupted the call before any byte was read; issue the call again.  If errno is EAGAIN instead, the descriptor is in non-blocking mode and no data is available right now; issue the call again later.
(6) The call returns -1 with any other errno value, which indicates a more serious error.
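
The cases above can be folded into one retry loop.  This is only a minimal sketch: the helper name read_full is made up for illustration, and it assumes a blocking descriptor, so EAGAIN is simply handed back to the caller like any other error.

```c
#include <errno.h>
#include <unistd.h>

/* Hypothetical helper: read up to len bytes from fd into buf.
 * Returns the number of bytes read (less than len only if EOF was hit),
 * or -1 on an error other than EINTR. */
ssize_t read_full(int fd, void *buf, size_t len)
{
    char *p = buf;
    size_t left = len;

    while (left > 0) {
        ssize_t n = read(fd, p, left);
        if (n == 0)
            break;              /* case 3: EOF */
        if (n == -1) {
            if (errno == EINTR)
                continue;       /* case 5: interrupted before any byte was read */
            return -1;          /* EAGAIN or case 6: let the caller decide */
        }
        p += n;                 /* cases 1 and 2: keep what arrived, ask for the rest */
        left -= (size_t)n;
    }
    return (ssize_t)(len - left);
}
```

A caller can compare the return value with len to tell a complete read from one cut short by EOF.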

write() is less likely than read() to return a partial result.  For regular files, write() is guaranteed to perform the entire requested write unless an error occurs.  For other file types (e.g. sockets), a partial write is possible, and the call can be reissued for the bytes that remain.
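
Reissuing a partial write might look like the loop below.  It is a sketch rather than a definitive implementation: the name write_all is invented here, and a blocking descriptor is assumed.

```c
#include <errno.h>
#include <unistd.h>

/* Hypothetical helper: keep calling write() until all len bytes are written.
 * Returns len on success, or -1 on an error other than EINTR. */
ssize_t write_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;
    size_t left = len;

    while (left > 0) {
        ssize_t n = write(fd, p, left);
        if (n == -1) {
            if (errno == EINTR)
                continue;       /* interrupted before anything was written */
            return -1;          /* real error: report it to the caller */
        }
        p += n;                 /* partial write: advance and retry the rest */
        left -= (size_t)n;
    }
    return (ssize_t)len;
}
```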

Opening a file with O_APPEND prevents the corruption that can occur when two processes race to write to the same file.  If the file is not opened with O_APPEND, each process writes at its own file position, so one process can overwrite data written by the other.  With O_APPEND, every write occurs at the current end of the file.  This mode is useful for log files but makes less sense for other types of file.
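
As an illustration of the log file case, the sketch below opens a file in append mode and writes one line.  The function name append_log_line and the 0644 permission bits are assumptions chosen for the example.

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: append one line to the log file at path.
 * Returns 0 on success, -1 on error. */
int append_log_line(const char *path, const char *line)
{
    /* O_APPEND makes the kernel position every write at end of file. */
    int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd == -1)
        return -1;

    size_t len = strlen(line);
    int rc = (write(fd, line, len) == (ssize_t)len) ? 0 : -1;

    if (close(fd) == -1)
        rc = -1;
    return rc;
}
```
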
EPIPE indicates that the reading end of a pipe has been closed.  The process also receives a SIGPIPE signal, whose default action is to terminate the process.  A process that wants to see this errno must therefore ignore, block, or handle the signal.
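
One way to observe EPIPE instead of being terminated is to ignore SIGPIPE before writing.  The two helper names below are hypothetical, and ignoring the signal process-wide is only one of the options mentioned above.

```c
#include <errno.h>
#include <signal.h>
#include <unistd.h>

/* Hypothetical helper: ignore SIGPIPE for the whole process.
 * Returns 0 on success, -1 on failure. */
int ignore_sigpipe(void)
{
    return signal(SIGPIPE, SIG_IGN) == SIG_ERR ? -1 : 0;
}

/* Hypothetical helper: attempt one write() to fd.
 * Returns 0 if the write succeeded, 1 if the reader has gone away (EPIPE),
 * and -1 on any other error. */
int write_or_detect_closed(int fd, const void *buf, size_t len)
{
    if (write(fd, buf, len) == -1)
        return errno == EPIPE ? 1 : -1;
    return 0;
}
```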

When write() returns, the kernel has copied the data from the supplied buffer into a kernel buffer.  There is no guarantee that the data has reached the disk.  The kernel batches dirty buffers and writes them to disk later.  This delayed write behaviour also means that write order is not preserved.  Another problem is that a write error may not be reported immediately, because the actual write happens later and asynchronously with respect to the system call.  To limit the risk of deferred writes, the kernel enforces a maximum buffer age and writes out dirty pages once they reach it.  This is configured via /proc/sys/vm/dirty_expire_centisecs.

fsync() ensures that all dirty data associated with a file (identified by fd) is written to disk.  The call writes back both data and metadata (e.g. timestamps and other attributes stored in the inode).  It returns only once the disk has acknowledged that the data has been written out.

fdatasync() writes back the data only, along with any metadata needed to read that data back later (such as the file size).  Neither call guarantees that updated directory entries containing the file are synchronized to disk.  To ensure this, fsync() must also be called on an fd opened on the directory itself.
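
The file-then-directory sequence could be sketched as follows.  The function name write_durably and its parameters are hypothetical, and the example assumes a platform such as Linux where a directory can be opened with O_RDONLY and passed to fsync().

```c
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: write buf to path, then flush both the file and the
 * directory that contains it.  Returns 0 on success, -1 on error. */
int write_durably(const char *path, const char *dir_path,
                  const void *buf, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd == -1)
        return -1;

    /* Flush the file's data and metadata before closing it. */
    if (write(fd, buf, len) != (ssize_t)len || fsync(fd) == -1) {
        close(fd);
        return -1;
    }
    if (close(fd) == -1)
        return -1;

    /* Flush the directory entry by syncing the directory itself. */
    int dfd = open(dir_path, O_RDONLY);
    if (dfd == -1)
        return -1;
    int rc = fsync(dfd);
    if (close(dfd) == -1)
        rc = -1;
    return rc;
}
```

Replacing the first fsync() with fdatasync() would skip metadata that is not needed to read the data back, at the cost of leaving attributes such as timestamps unflushed.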

sync() writes out all buffers to disk, both data and metadata.  The standard allows sync() to return before all buffers are written out, since it is only required to initiate the action.  For that reason, processes sometimes invoke the call multiple times to make sure all buffers are committed to disk.  On Linux, however, sync() returns only after all buffers have been written out.  sync() may take some time on a busy system.
