Friday, January 23, 2015

MLGPO

Multiple Local GPO has 3 layers:

Layer 1 computer node policy is applied to all computers.  Layer 1 user node settings can be overridden by Layer 2.

Whether Layer 2 policy applies depends on whether the user is an administrator or a normal user.

Layer 3 policy applies to a specific user.

The resultant policy is the sum of all three layers, except for lower layer settings that are overridden by an upper layer.

Group Policy Nodes

Group policy has 2 branches or nodes - computer and user.  The computer node contains settings that are applied to the computer (e.g. startup/shutdown scripts, firewall, etc.).  The user node contains settings that are applied to users (e.g. profile, logon script, etc.).

Monday, January 5, 2015

Uses of spin locks

Shared memory is typically protected by a spinlock because the expected wait is small compared to the cost of a sleeping lock, which results in a context switch.

Linux fork()

Pending signals are cleared and file locks are released for the child process.  In older versions of Linux, the page table was duplicated and pages were copied one by one from the parent's address space to the child's.
Nowadays the copying is replaced by copy-on-write, which shortens process creation time and avoids wasting effort copying pages the child never touches.

vfork() was introduced in the 3.0 BSD release.  vfork() stipulates that the caller must immediately call exec() or _exit() after vfork().  vfork() also suspends the parent, so the child can share the parent's memory without incurring the memory copy.  Consequently the child must not modify any memory.

Linux exec()

The exec family of calls is built upon a single system call - execve().  execl() accepts a string containing the path of the executable and one or more arguments to be passed to the program.  There is at least one argument, which by convention is the name of the program.  execl() is variadic, which means it accepts a variable number of arguments, terminated by a NULL pointer.

e.g. ret = execl("/bin/vi", "vi", "abc.txt", NULL);

In general, execl() does not return, as the program image is replaced by the new one specified in the call.  A successful execl() call also has the following effects:

- pending signals are lost (the original program is no longer there to handle them)
- signal handlers are reverted back to their defaults
- memory locks are freed and memory mappings are dropped
- thread attributes are reset to defaults
- process statistics are reset
- atexit() handlers are cleared

Some attributes are retained across exec:
- pid
- process priority
- owning user and group
- open files, which means the new program can access all files of the old image if it knows the fd numbers.  More commonly, files are closed before the exec call.

The other members of the exec family differ in how the program and arguments are specified:
- execlp - takes only the file name and uses PATH to resolve the full path
- execle - like execl, but also passes an environment variable array
- execv - passes arguments as a vector (array)
- execvp - takes the file name (resolved via PATH) and an argument vector
- execve - takes the path, an argument vector, and an environment array; this is the underlying system call

e.g. const char *args[] = {"vi", "/abc.txt", NULL};
ret = execvp("vi",args);

Resolving the program via PATH carries a security risk: if an attacker manipulates the PATH variable, the application can be tricked into executing a rogue replacement of a program placed earlier in the PATH search order.

Process Group

Each process is owned by a user and a user group, defined in /etc/passwd and /etc/group respectively.  Each process also belongs to one process group.  A child process belongs to the same process group as its parent, and all commands in a pipeline belong to the same process group.

The process group is just a construct to make it easier to send a signal to, or get information from, a group of related processes.  From the user's perspective, a process group corresponds to a job.

Linux 2.6 IO Schedulers

The 2.6 kernel has 4 IO schedulers to choose from:

Deadline IO Scheduler - in addition to the standard IO queue sorted by block number, the scheduler maintains two additional queues - a read queue and a write queue.  A new request is inserted into the standard sorted queue and appended to the end of the read or write queue.  Each item in the read and write queues carries an expiry time; when it expires, the scheduler services the item at the head of that queue (the oldest, since insertion is in submission order).  In other words, the scheduler imposes a soft limit on the service time of each IO, while minimizing seeks via the sorted queue most of the time.

Anticipatory IO Scheduler - it is common program behaviour to issue successive read calls (writes are less of a concern because applications do not usually block on writes the way they block on reads).  Therefore, after the scheduler services a read from the read queue and goes back to the sorted queue, another read is likely to arrive shortly.  The result is constant shuttling of the disk arm between the sorted requests and the periodic reads, which throttles IO throughput.  The anticipatory scheduler starts off operating like the deadline scheduler, but after servicing a read it waits up to 6 ms for another read to arrive.  If one does, it services that read; if not, it returns to the deadline scheduling routine and continues.

CFQ Scheduler - a queue is set up for each process in the system.  The scheduler processes each queue in turn for a timeslice.  When the timeslice ends, the scheduler moves on to the next process, making it "fair" to all processes in the system.  If a queue empties before its timeslice ends, the scheduler waits 10 ms in anticipation of another read.  The CFQ scheduler also favours read requests over writes to avoid the starvation problem.  It is a good choice for most workloads.

Noop IO Scheduler - performs merging only, with no sorting.  This is for devices that do not benefit from sorting of requests (such as SSDs, which have no seek penalty).

The default scheduler is chosen at boot using the elevator= kernel parameter.  The scheduler can also be selected at runtime for each block device by modifying the file /sys/block/<device, e.g. hda>/queue/scheduler.
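For example, the runtime switch looks like this ("sda" is just an example device name; writing the file requires root):

```shell
# Show the schedulers available for the device; the active one is bracketed.
cat /sys/block/sda/queue/scheduler

# Switch the device to the noop scheduler at runtime.
echo noop > /sys/block/sda/queue/scheduler
```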

Read Starvation

As the IO scheduler inserts requests in ascending block order, a read targeting a distant block may be kept waiting in the queue for a long time.  The problem is aggravated by writes, which batch up several requests and flush to the disk periodically in bursts.  This is called the writes-starving-reads problem.

IO schedulers were further refined to address this issue.  In the Linux 2.4 kernel, the Linus Elevator scheduler stops sort-inserting new requests when the queue contains sufficiently old requests, appending them to the tail instead.  The algorithm is simple, and it helps in some situations.

Synchronous and Synchronized Operations

A synchronous operation does not return control to the caller until the operation has completed (at least as far as the kernel buffer).  A synchronized operation ensures the data is up to date on disk.

When a synchronous write is synchronized, the write does not return until the data has been flushed to disk.  This is the effect of opening the file with the O_SYNC flag.

When a synchronous write is not synchronized, the write returns once the data has been copied into the kernel buffer, so the disk copy may differ from the in-memory copy.  This is the default behaviour of the Linux write operation.

When an asynchronous write is synchronized, control returns to the caller as soon as the request is queued.  When the write is later executed, the data is guaranteed to be written to disk.

When an asynchronous write is not synchronized, control returns immediately after the request has been queued.  The data is guaranteed to reach only the kernel buffer, so the disk copy may differ from the in-memory copy.

A read is always synchronized (the data returned is always up to date); whether the read call is synchronous or asynchronous only determines when control is returned.

IO Scheduler

Modern processor speed is much higher than disk access speed.  A disk seek can take 8 ms, equivalent to roughly 25 million processor cycles.  Therefore, the function of the IO scheduler is to minimize seeks.

An IO scheduler has 2 basic tasks - sorting and merging.  Sorting arranges pending IO in block order to minimize head movement.  Merging coalesces several IO requests into one, thus reducing the number of requests.  For example, one IO reading block 6 and another reading block 7 can be merged into a single IO reading blocks 6 through 7.

Linux File Advisory

Similar to madvise, the posix_fadvise() call hints the kernel to optimize file access. The options available are similar to madvise - POSIX_FADV_NORMAL/RANDOM/SEQUENTIAL/WILLNEED.

POSIX_FADV_NOREUSE indicates that the data are only used once.

POSIX_FADV_DONTNEED evicts pages in the range from the file cache.

In general, file advice can benefit application performance.  Before reading, an application can give FADV_WILLNEED and the kernel will read the data blocks in asynchronously; when the application reaches the point of reading the data, blocking can be avoided.  The FADV_DONTNEED hint releases cached pages so the memory can be reused for other purposes.  An application that intends to read in a whole file can use the FADV_SEQUENTIAL hint to speed up processing.

Linux File Mapping Management Functions

mprotect() can be used to change the protection settings (NONE, READ, WRITE, EXEC) of a specific region of the mapping.  The address passed in must be page-aligned.

The function of msync() is equivalent to fsync() in normal IO: the data in the file, or the specified region of the file associated with a mapping, is written back to disk.  The mode of flushing is controlled by a flag - MS_SYNC, MS_ASYNC or MS_INVALIDATE.  The last option invalidates all other copies of the file mapping so that future accesses see the current file content on disk.

madvise() allows the caller to hint to the kernel the intended usage of the file memory mapping so that the kernel can intelligently optimize its handling:

MADV_NORMAL - perform a moderate amount of read-ahead
MADV_RANDOM - disable read-ahead and read a minimal amount of data on each physical read
MADV_SEQUENTIAL - perform read-ahead aggressively
MADV_WILLNEED - initiate read-ahead
MADV_DONTNEED - free the pages and discard any dirty ones.  Subsequent access will cause the pages to be read in from the backing store or zero-filled (anonymous mappings) again.
MADV_DONTFORK - do not copy the pages across fork().  Used mainly for managing DMA pages.

For example, madvise(addr, len, MADV_SEQUENTIAL) informs the kernel that the program intends to access the memory region sequentially.

Linux mmap()

The call maps the contents of a file, starting from an offset supplied by the caller, into a memory location.  The unit of mapping is the page, and the call rounds the size up to a page boundary.  The caller can specify whether the mapping is PRIVATE or SHARED.

The advantage of mmap() over normal read/write calls is that it avoids the copy between the kernel buffer and a user-space buffer - the cached pages are mapped directly into the process's address space.  Once the file is mapped, no system calls are required to manipulate the data, apart from potential page-fault and context-switch overhead.  The mapped file can be shared by multiple processes.

mmap() is typically used for large files to avoid wasting memory.  The mapping is terminated using the munmap() call.