Saturday, November 22, 2014

Linux epoll

Both POSIX poll and select accepts and walks the list of fd passed over and if the list is long, this affect efficiency and performance. Linex epoll() is created to address this by separate the registration from the monitoring using different calls.

epoll works by firstly creating a polling context using epoll_create1() call.    The call returns a handle which must be freed via close() call.

fd to be monitored are then add to the context using epoll_ctl().  The events to be monitored are also specified.

Program then goes into wait for the events via the epoll_wait() call.

Scatter/Gather IO

One single system call to read or write data into a string of non-contiguous buffers.  Its advantage is performance due to reduction of overhead comparing to a serial of individual read or write system calls.  In addition, the call is atomic and it will not interleave with calls from other threads.

readv() and writev() accept 3 parameters - a fd, an array of iovec structure and a count of the number of buffers.

struct iovec {
    void *iov_base  /* ptr to the start of buffer */
    size_t iov_len
}

The buffers are processed sequentially.

Linux Process Priority

Each process has a static priority in the task_struct which is its nice value (between -20 to 19 which lower number signifies higher priority).  It is static because it does not change throughout the lifetime of the process.  Schedule pick the next task to run based on the dynamic priority.

The dynamic priority is a function of the static priority and also the interactivity of processes.  Dynamic priority uses effective_prio() which add a bonus or penalty value (in range of -5 to +5) based on the interactivity of the task.  This bonus value is added to the static priority to form the dynamic priority.

Kernel tracks how active (interactive) a process via the sleep_avg field in the task_struct.  When a task is created, it receives a high sleep_avg value.  The time a task spent sleeping will be add to sleep_avg (up to MAX_SLEEP_AVG = 10 msec).  Each time a task runs, the corresponding time will be subtract from sleep_avg.  A task with high sleep_avg is I/O bound and a zero sleep_avg value means the task is processor-bound.

The size of timeslice is proportional to the task's priority.  Higher priority task receive longer timeslice.  When a task spawn a child, the child share the timeslice with the parent.  This is to prevent task keep spawning child to get unlimited supply of processor time.

Double Buffering in Standard IO

Standard IO improved performance by maintaining buffers in user space thus avoid making successive system calls.  On the other hand, the down side is that data is required to transfer between the standard IO buffer to the user supplied buffer.

Call like getc triggers a read() call which causes data to be read from the disk to the kernel buffer, copy to the standard IO buffer and finally copy to the buffer passed into the standard IO call.

putc copies the data from the supplied buffer to the standard IO buffer.  The data will then be copied out to kernel buffer via write().

Standard IO File Locking

Standard IO is thread-safe.  When there is concurrent calls made to a stream, the call will be serially applied to the stream.  If an application needs to ensure several calls are performed together without interleaved with calls from other program, it need to lock the file using flockfile().  After the series of calls are issue, unlock the file using funlockfile().

Controlling Buffering in Standard IO

There are 3 types of buffering

Unbuffer - no user buffering is performed.  Data is written to kernel.  This option is rarely used except for stderr

Line-buffered - the buffer is submit to kernel upon every newline character in the stream.  This is default for stdout (screen).

Block-buffered (Full buffering) - data is buffered in form of block.  This is the default for file.

The type of buffering is specified via the setvbuf() call.  The 3 buffering types are specified as _IONBF, _IOLBF and _IOFBF respectively.  The call must be issued after opening a stream and before performing the first IO.  The caller must also supply the buffer to be managed by Standard IO

Typically application seldom need to manipulate the buffering mode or default buffer size.

C Standard IO Library

stdio provides a platform independent user buffering solution.  Files are referred by file pointer instead of the system level fd.  The file point type FILE is in cap because stdio as originally written as MACRO and thus follow the convention.

  • fopen - open a file and return a pointer to FILE.  The file opened is called a stream.
  • fdopen - open a file using fd
  • fclose - close file
  • fcloseall - close all streams
  • fgetc - read a char from a stream
  • fgets - reading multiple characters until a NEWLINE char or EOF is reached.  The \n will be stored as part of line read.  A NULL char will be added to the end of the string
  • fungetc - put a character (casted as unsigned int) into the stream.  You can issue more than one calls and the char are pushed back like a stack - so the next fread will return the last pushed char.  The standard defines only one push back is allowed.  Linux allows multiple push back as long as memory is available.  If fungetc is followed by a seek, the pushed back char will be lost because the buffer will be reused by a new block.
  • fread - reading binary data (in form of a structure, record) from the stream.  Caller pass the size of the structure and the number of structure to be read.  The function returns the number of structure read.  If the number is less than specified, it could be due to EOF reached or error is encountered during read.  Use ferror() or feof() to identify the condition.  Note that the program needs to assume the file is created (could be by another problem in another system) with same variable sizes, alignment, padding abd byte order.
  • fputc - write a byte
  • fputs - write a string
  • fwrite - writ binary data
  • fseek - advance the position in a stream.  Uses whence parameter to indicate if the offset provided represents the absolute position, relative to the current file position or to the end of file
  • fsetpos - like fseek with whence = SEEK_SET (absolute position).  This API is provided for non-Linux system that uses complex type to store file position.
  • rewind - set the position to the start of the stream
  • fgetpos - fseek does not return the current position like lseek.  This API acheive the requirement.
  • fflush - write out standard io buffer to the kernel buffer.
  • fileno - return the file descriptor of a stream.   Caution - do not intermix standard io with system IO call



Linux Run Queue

There is one runqueue per processor.  In a run queue, there are 2 priority arrays - one active and one expired.  the array contains a bitmap with each bit reflect if there is any task in the corresponding priority list head.  Task with its timeslice used up will be transferred to the expired array.  Scheduler will schedule task in the active array.  When all task exhausted its timeslice, scheduler will switch the active and expire array.  When a task is transfer to the expire queue, its timeslice is also calculated.  This approach avoids a long delay during priority queue switching time.

Friday, November 7, 2014

Linux Process Termination

exit(EXIT_SUCCESS or EXIT_FAILURE) terminates a process.  Before that, C lib will perform the following shutdown steps:

- call functions that registered with atexit() or on_exit() calles in reverse order of registration
- flush io streams
- remove temporary files created by the tmpfile() calls

atexit() is a POSIX call that maintain a list of clean up function registered by caller  for process normal termrination.  The clean up functions accepts no parameters and do not return any return code.  If the process is terminated via signal (e.g. kill), the clean up functions will not be called.  The list is also not inherited by child process.

on_exit() is a function equivalent to atexit() in the older version of SUNOS.  Latter version or SUNOS uses atexit().

exit() will finally call the system call _exit() that cleans up kernel resources allocated for the process.  In the kernel level, do_exit() performs the following work:

- set the PF_EXITING flag in task_struct
- remove kernel timer set using del_timer_sync()
- if BSD process accounting is enabled, calls acct_process() to write out the accounting info.
- release and remove resources, including
- __exit_mm() to release mm_struct
- sem_exit() to release IPC semaphore
- __exit_files()/fs()/namespaces()/sighand()to release and remove resources
   - set the exit code
- exit_notify() to send SIGCHLD to parent, reparent the task's child to their threadgroup or to init process.  Also set the task state to TASK_ZOMBIE
- call schedule()

What's left at this point of time is the task structure and the kernel stack and the thread_info in it.

Once the parent or init calls wait(), the above is released via the put_task_struct() function.  The user's process count is decremented.  The task struct is unlinked from the task list.

Sunday, November 2, 2014

Virtual File System (or Switch)

VFS is an abstraction by providing a common file model.  Using function pointers, VFS framework provides hooks to support reading, creating link, synchronization etc.  Each file system supplies functions to handle the relevant operations.

VFS talks in term of inode, superblock and directory entries.  A filesystem not align to UNIX will have to provide equivalent abstraction support.

poll() and select() Comparison

(1) poll does not require the user to calculate and pass in the highest fd + 1 as one of the parameter.
(2) select() need to manage the fds bitmask which has a length dependent on the fd number (e.g. 1000) and the bit mask may be sparse if only a few fd to watch.  so poll is more efficient as the list of fd depend on the number of fd to watch and not on fd number
(3) select() modify the fdslist upon return.  poll() use a separate event lists
(4) select() is more portable and some systems do not support poll()
(5) select() timeout has higher resolution (microsecond vs millisecond) than poll()



Portable Sleep using select()

Put NULL to the fds lists and set the timeval parameter - this allows the process to sleep for a specific time period.  As select() is implemented in most systems, it become a portable way for sleep.

Multiplexed IO

Programs typically services a few file descriptors concurrently (e.g. terminal, file, pipe to communicate with other programs).  Without threading, each process can only wait on 1 fd at a time.  One solution is to use non-blocking IO but this is not efficient.

Multiplexed IO allows the process to block on multiple fd.  The process will sleep and be notified when any one of them could be read or write without blocking.

select()
Introduced by 4.2BSD, it uses 3 fd lists - readfds, writefds and exceptfds.  fd in the readfds list are those that are waiting for data to read without blocking. Likewise for writefds.  Exceptfds is to monitor for exception of out-of-band data (applicable to socket only) is available.  Upon return, each set is modified to contain only the fd that is ready for the specific IO type.  The fds list is really a bit mask.  Thus select() has a parameter to indicate the highest fd in the list (so it can calculate how long the bit mask should be).

select() can pass in a timeout structure which specifies the time for select to return even if there is no ready fd by then.  Setting the time zero causes select() to return immediately with any fd that are ready.

fd lists are build using macro:

fd_set writefds
FD_ZERO(&writefds) - remove all fds
FD_SET(fd, &writefds) - add a fd, e.g. FD_SET(STDIN_FILENO, &readfds)
FD_CLR(fd, &writefds) - remove a fd
FD_ISSET(fd, &writefds) - test if a fd is set, e.g. if (FD_ISSET(STDIN_FILENO, &readfds)) {...}

POSIX defines its won equivalent to select(), called pselect().  The call signature is slightly different but it uses the same macro to build and test the fd lists.

Difference between pselect() and select()
(1) pselect() uses a different timeout structure - timespec.  Timespec uses second and nanosecond while timeval (used in select) uses second and microsec.  However, neither call is accurate even on microsecond level.

(2) When select returns, the value in timeval is undefined.  So it must be reinitialized before the next select call.  pselect does not modify timespec and so no need to reinitialize in successive calls.

(3) pselect has an additional parameter called sigmask, which uses to address a race condition between waiting on the file descriptor and signal.  for example, a process checks for a flag to be set by the signal handle before issue the select() call.  If the signal arrives (which wakes up the handler to se the flag) after the check and before the select() call, the process may be blocked indefinitely and never response to the set flag.  segmask resolve this issue by providing a list of signal for pselect to block.  When the signal is block, it will not be handled by the handler until it is unblocked at end of the pselect().  In other words, the signmask serialize the pselect() and the signal handler.

poll()
System V introduced the poll() which addresses some deficiencies of select().  Instead of 3 fd list, pool uses an array of pollfd structure.

struct pollfd {
    int fd; /* fd */
    short events; /* events to watch */
    short revents; /* return the events observed */
};

Events to watch are
POLLIN - data to read
POLLOUT - write will not block

POLLRDNORM - normal data to read = POLLIN
POLLRDBAND - priority data to read (socket?)
POLLPRI - urgent data to read

POLLWRNORM - writing normal data will not block = POLLOUT
POLLWRBAND
POLLWR

POLLMSG - a SIGPOLL message is available

Return events include the above and the following too:
POLLER - fd has encountered error
POLLHUP - fd has hang up
POLLINVAL - fd is invalid

In comparison,

POLLIN | POLLPRI = select() read event
POLLOUT | POLLWRBAND = select() write event


poll() uses a timeout parameter up to milli-second precision.  A zero value makes poll() return immediately.  A negative value makes poll() to wait indefinitely until an event is observered.  An example to use poll

struct pollfd fds[2];
fds[0].fd = STDIN_FILENO;
fds[1].fd = STDOUT_FILENO;

fds[0].events = POLLIN;
fds[1].events = POLLOUT;

ret = poll(fds, 2, 5*1000);
if (fds[0].revents & POLLIN) {...};

Linux also offer ppoll() that is similar to pselect() with a timespec and sigmask. ppoll is not POSIX but a Linux specific call.

Truncating Files

ftruncate() and truncate() trim the file to the length specified.  ftruncate() works on file descriptor and truncate() works on a pathname.

If the length given is larger than the current file size, the file is extended in a fashion similar to lseek+write.  The extended length will be filled with zeros.

Combine READ/WRITE with SEEK

pread() and pwrite() is like the normal read/write call with addition parameter on position.  The p calls ignore the current file position and perform IO at the position passed.  The call also does not update the current file position.  Thus, mixing read/write with pread/pwrite may cause data corruption.

The advantage p call is that it elimnates the race condition in a mutithread environment that shared the file table.  lseek and the following read/write call is not atomic.  Therefore, after a lseek and before the next read/write call made by one thread, the current file position can be altered by another thread.  Using pread/pwrite avoids this racy situation.

Linux lseek()

This system API advances the file position pointer for the next I/O.  Position can be specified as follow:

SEEK_CUR = current position + offset
SEEK_END = end of file + offset
SEEK_SET = beginning of file + offset

Note that offset can be 0, positive or negative.

When seeking beyond the end of file, READ will return EOF.  Write at that position will create a hole between the previous EOF point to the new data.  The hole will be filled with zeros but they will not occupy any space on disk.  The file becomes a sparse file.

The max file poisiton that lseek can go is defined as off_t type.  This is typically implemented as C long type (i.e. word size = size of the general purpose register).  In kernel, the position is kept as long long.

File Closing in Linux

In Linux, data will not automatically flushed to disk. When the last user of the file closes the file, the inode will be released.  If the file has marked for deletion, the last close will cause the file to be finally removed from disk.

Linux Direct IO

The O_DIRECT flag to open() makes IO bypasses the system file cache.  Data is transfered directly from the user space buffer to the device.  All IO is synchronous.