Record: 2013

Sunday, December 29, 2013

Web-Safe Colors

In the past, when computer displays were limited to 256 colors, only 216 of those colors were rendered the same on both Mac and DOS/Windows PCs. These 216 colors are known as web-safe because the viewer sees a similar color on either system.

Each RGB component of a web-safe color is a multiple of 51 (0x33), including 0. The six possible values are 0 (#00), 51 (#33), 102 (#66), 153 (#99), 204 (#CC) and 255 (#FF), which gives 6 x 6 x 6 = 216 colors.
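The multiple-of-51 rule is easy to check in code. The following is a minimal sketch in C; the function names are made up for this illustration and are not part of any standard.

```c
#include <assert.h>
#include <stdbool.h>

/* Returns true when one 8-bit channel is a multiple of 51 (0x33). */
bool is_web_safe_channel(unsigned channel)
{
    assert(channel <= 255);            /* channels are 8-bit values */
    return channel % 51 == 0;
}

/* A color is web-safe when all three of its channels are. */
bool is_web_safe(unsigned r, unsigned g, unsigned b)
{
    return is_web_safe_channel(r) &&
           is_web_safe_channel(g) &&
           is_web_safe_channel(b);
}

/* Rounds a channel to the nearest of 0, 51, 102, 153, 204, 255. */
unsigned nearest_web_safe_channel(unsigned channel)
{
    assert(channel <= 255);
    return ((channel + 25) / 51) * 51;
}
```

For example, is_web_safe(0x33, 0x66, 0x99) is true, while a channel value of 100 is not web-safe and rounds to 102.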

HTML Color Names

HTML 3.2 and 4.0 defined a set of 16 standard colors which can be referenced by their names (e.g. black, white, silver, yellow, blue etc). These color names continue to be included in the CSS standards.

Example of usage:

which is same as

Saturday, December 14, 2013

NTFS Boot Sectors and MFT

The first 16 sectors of an NTFS volume are allocated to the boot code. Only half of them contain code; the other half contain null bytes. Windows will refuse to mount the volume if any of these null bytes holds a non-null value.

After the boot sectors comes the Master File Table (MFT). The MFT contains metadata on files and consists of a series of records. Each file and directory has at least 1 record in the MFT. An MFT record is 1 KB in size.

The first 16 records in the MFT describe special system files created together with the NTFS volume. They are hidden files. These files implement the file system and its metadata.

Rec 0 - $Mft - The MFT itself
Rec 1 - $MftMirr - Partial mirror of the MFT's first 4 records
Rec 2 - $LogFile - transaction log
Rec 3 - $Volume - volume metadata such as label, creation time
Rec 4 - $AttrDef - metadata on NTFS attributes
Rec 5 - . - root directory folder
Rec 6 - $Bitmap - allocation status of cluster (adjacent sectors)
Rec 7 - $Boot - code and data used to bootstrap the system
Rec 8 - $BadClus - bad clusters
Rec 9 - $Secure - contains security descriptor for all files
Rec 10 - $Upcase - contains the uppercase table used to convert lowercase Unicode characters to uppercase
Rec 11 - $Extend - used for NTFS extension such as quota and object ID
Rec 12-15 - reserved

Friday, December 13, 2013

__declspec(naked)

This is a storage-class attribute that tells the compiler not to generate prolog or epilog code for the function.

Call Gate

A call gate is a type of GDT descriptor. It is 8 bytes long. A call gate allows code running at a lower privilege level to invoke a routine running at a higher privilege level.

Thursday, December 12, 2013

Portable Executable (PE) and IAT

The first 64 bytes (0x40) contain the MS-DOS header defined by the IMAGE_DOS_HEADER structure. Following the header is a stub program which displays the "This program cannot be run in DOS mode" message. The MS-DOS header contains the magic number "MZ" in the first 2 bytes. MZ stands for Mark Zbikowski, who developed the DOS executable format. The last field of the header (e_lfanew) contains the RVA (relative virtual address) of the PE file header.

RVA signifies the offset from the base address of the PE module, as returned by GetModuleHandle(). The PE header is defined by the IMAGE_NT_HEADERS structure:

typedef struct _IMAGE_NT_HEADERS {
DWORD Signature; // magic number "PE\0\0"
IMAGE_FILE_HEADER FileHeader;
IMAGE_OPTIONAL_HEADER32 OptionalHeader;
} IMAGE_NT_HEADERS32, *PIMAGE_NT_HEADERS32;

IMAGE_FILE_HEADER stores a number of file attributes such as the number of sections, the date/time stamp and the Characteristics field, whose 14th bit (mask 0x2000) indicates whether the file is a DLL (1) or an EXE (0).

IMAGE_OPTIONAL_HEADER32 contains an array of 16 IMAGE_DATA_DIRECTORY structures. The 16 entries can be referenced individually using an integer macro:

IMAGE_DIRECTORY_ENTRY_EXPORT = 0
IMAGE_DIRECTORY_ENTRY_IMPORT = 1 which describes the imports and leads to the IAT
IMAGE_DIRECTORY_ENTRY_RESOURCE = 2

typedef struct _IMAGE_DATA_DIRECTORY {
DWORD VirtualAddress; //RVA of data
DWORD Size; // size in bytes
} IMAGE_DATA_DIRECTORY, *PIMAGE_DATA_DIRECTORY;

For the IMPORT DIRECTORY, the RVA points to the start of an array of IMAGE_IMPORT_DESCRIPTOR, one for each DLL imported by the module.

typedef struct _IMAGE_IMPORT_DESCRIPTOR {
union {
DWORD Characteristics; //0 for the last descriptor
DWORD OriginalFirstThunk; //RVA of the IMPORT Lookup Table (ILT)
};
DWORD TimeDateStamp;
DWORD ForwarderChain; // -1 if no forwarders
DWORD Name; //RVA of the DLL name terminated by \0
DWORD FirstThunk; //RVA to IAT
} IMAGE_IMPORT_DESCRIPTOR;

Both FirstThunk and OriginalFirstThunk point to an array of IMAGE_THUNK_DATA structures:

typedef struct _IMAGE_THUNK_DATA {
union {
PBYTE ForwarderString;
PDWORD Function; //address of the imported routine stored in IAT
DWORD Ordinal;
PIMAGE_IMPORT_BY_NAME AddressOfData; //size and string name of the imported routine stored in ILT
} u1;
} IMAGE_THUNK_DATA32;

The Ordinal field indicates whether the function is imported by name or by its ordinal number.

In summary, the structures are linked:

IMAGE_DOS_HEADER -> IMAGE_NT_HEADERS {IMAGE_OPTIONAL_HEADER32} -> IMAGE_DATA_DIRECTORY -> IMAGE_IMPORT_DESCRIPTOR -> ILT and IAT
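The first hop of that chain can be sketched in portable C without windows.h. The sketch assumes the file has already been read into a byte buffer, that the DOS header is 64 bytes with the PE header offset stored as a little-endian DWORD at offset 0x3C, and uses helper names (read_le32, find_pe_header) invented for this note.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DOS_HEADER_SIZE 64u     /* assumed size of IMAGE_DOS_HEADER */
#define E_LFANEW_OFFSET 0x3Cu   /* last field of the DOS header */

/* Reads a little-endian 32-bit value from a byte buffer. */
uint32_t read_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/*
 * Returns the offset of the "PE\0\0" signature inside image,
 * or -1 when the buffer does not look like a PE file.
 */
int64_t find_pe_header(const uint8_t *image, size_t size)
{
    uint32_t pe_offset;

    assert(image != NULL);
    if (size < DOS_HEADER_SIZE || image[0] != 'M' || image[1] != 'Z')
        return -1;                          /* no MS-DOS header */

    pe_offset = read_le32(image + E_LFANEW_OFFSET);
    if (pe_offset > size || size - pe_offset < 4)
        return -1;                          /* header falls outside the buffer */

    if (memcmp(image + pe_offset, "PE\0\0", 4) != 0)
        return -1;                          /* signature mismatch */
    return (int64_t)pe_offset;
}
```

From the returned offset, the file header and optional header follow directly, and the data directory array sits at the end of the optional header.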

Wednesday, December 11, 2013

SetWindowsHookEx()

This API allows one to execute a DLL routine when specific events are triggered. The list of events is documented in winuser.h. Some examples are

WH_KEYBOARD = 2
WH_MOUSE = 7
WH_SHELL = 10

The API accepts 4 parameters

int hooktype - event to be hooked
HOOKPROC procPtr - exported DLL routine to call
HINSTANCE dllHandle - handle to DLL containing the hook routine
DWORD dwThreadId - the specific thread whose events trigger the hook, or 0 for all threads

It returns a handle (HHOOK) to the hook procedure, or NULL if the call fails.

To release the hooked event, use UnhookWindowsHookEx().

The calling program first calls LoadLibrary() to load the DLL. Then it uses GetProcAddress() to get the address of the specific routine to be used in the hook. Finally, it calls SetWindowsHookEx() to hook the event.

The hook routine should call CallNextHookEx() to propagate the event to the next hook, passing along the parameters.

AppInit_DLLs

AppInit_DLLs is a REG_SZ value that stores a space-delimited list of DLLs with fully qualified paths. This registry entry is stored under

HKLM\Software\Microsoft\Windows NT\CurrentVersion\Windows

This feature is enabled by setting LoadAppInit_DLLs (REG_DWORD) to 0x00000001.

When user32.dll is loaded by a new process (DLL_PROCESS_ATTACH event), it calls LoadLibrary() to load every DLL listed in AppInit_DLLs. user32.dll is loaded by most applications.

Import Address Table (IAT)

An IAT is a call table (an array of routine addresses) used by user-mode modules. Most executables have one or more IATs that store the addresses of the library routines the module imports from DLLs.

When the module is built with load-time dynamic linking, the linker reserves an IAT for each DLL, with one slot per imported routine. When the application is loaded, the system maps each DLL into the address space, fills the IAT slots with the addresses of the exported routines and calls the DLL entry point (DllMain with the DLL_PROCESS_ATTACH argument).

Run-time dynamic linking does not rely on IAT. The module will specify the DLL and routine name at run time using LoadLibrary() and GetProcAddress() calls. One advantage of run-time dynamic linking is that the module can recover in case the DLL is not found.

Deferred Procedure Call (DPC)

An ISR needs to finish as quickly as possible because normal processing of the system is suspended while it runs. To expedite processing, an ISR may defer work that is not time sensitive. This is done by scheduling a DPC. DPCs are executed at the DISPATCH_LEVEL IRQL. A DPC can be scheduled to run on a specific CPU.

Interrupt Request Level (IRQL)

Each interrupt is mapped to an IRQL representing its priority relative to other interrupts. When an interrupt happens, the system looks up the ISR via the IDT and assigns it to a processor. If the IRQL of the CPU is lower than the IRQL of the interrupt, the running thread is pre-empted, the IRQL of the CPU is raised to that of the ISR and the ISR is executed. When the ISR completes, the IRQL of the CPU is lowered to its previous value and the pre-empted code resumes.

If the IRQL of the CPU is the same as the ISR's, the ISR must wait until the current ISR completes. Similarly, if the IRQL of the CPU is higher than the ISR's, the ISR will wait too.

Each IRQL is assigned a number. PASSIVE_LEVEL is 0 which is the lowest. All user mode programs run in PASSIVE_LEVEL as do common Kernel Mode Driver routines such as DriverEntry(), Unload, and IRP dispatch routines.

APC_LEVEL is 1. DISPATCH_LEVEL, at which the scheduler runs, is 2. A thread running at or above DISPATCH_LEVEL will not be pre-empted because the scheduler will not run. This means the code and data pages used by such a thread must be pinned in memory and cannot be paged out.

The PROFILE_LEVEL is used by the timer used for profiling and is set to 27. Between 2 and 27 are the hardware device IRQL known as DIRQL.

Sunday, December 1, 2013

Mobile Network

1G - Uses analog signals based on the Advanced Mobile Phone System (AMPS) standard, operating in the range of 824MHz to 894MHz (also dubbed the 800MHz band)

2G - Converted to digital signals to cram more calls into the available frequency. There are 2 competing standards - CDMA (Code Division Multiple Access), which operates in the same 800MHz band, and GSM (Global System for Mobile Communications), which operates at 1900MHz. These 2 standards are not interoperable. 2G allows transmission of data in the form of SMS (Short Message Service) and MMS (Multimedia Messaging Service). Transfer speed is around 144Kbps.

3G - For smartphones, with transfer speeds of around 2Mbps. Standards include CDMA2000, which evolved from CDMA, and UMTS (Universal Mobile Telecommunications System), which evolved from GSM.

4G - Transmission speed of up to 1Gbps. Competing standards are:
(1) LTE (Long Term Evolution) with download rate up to 300Mbps
(2) HSPA+ (Evolved High Speed Packet Access) with download rate up to 168Mbps. Current rate is around 42Mbps
(3) WiMAX (Worldwide Interoperability for Microwave Access) with download rate of 128Mbps

Saturday, November 23, 2013

C Pointers

A pointer is an integer which designates the location in memory.

int* pA ; // declares a pointer to an integer in a pointer variable.

The position of "*" is flexible. You can also declare a pointer like this:

int * pA or int *pA

The form int* reads more naturally because it makes it easy to see that the type of pA is int*.

To initialize a pointer to NULL,

int *pA = 0 or int *pA = NULL

Note that *pA, when used outside of a declaration, means the thing that pointer pA points to. This lets you access the item at the far end of the pointer; applying "*" is called dereferencing the pointer.

An object is implemented using a pointer. However, you never dereference an object when using it. You just use the name of the object.

A pointer of type void* is a generic pointer. It can point to anything. Effectively, a pointer to void bypasses type checking. Arrays and strings in C are closely tied to pointers: in most expressions an array name converts to a pointer to its first element.

To obtain a pointer from a variable, use the address (&) operator. For example

int result = 0;
pA = &result;

int** ptr; // declare a pointer to a pointer

To create a pointer to a function, just use the function name without the parameters. For example

int square(int a, int b);
&square is the pointer to the function square
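The three ideas above (dereferencing, pointer to pointer and pointer to function) can be put together in a short sketch. The helper names triple, store_triple, apply and retarget are made up for this example.

```c
#include <assert.h>
#include <stddef.h>

int triple(int a)
{
    return 3 * a;
}

/* Writes through the pointer, so the caller's variable is modified. */
void store_triple(int *out, int value)
{
    assert(out != NULL);
    *out = triple(value);      /* dereference to reach the pointed-to int */
}

/* Calls whatever function fn points to. */
int apply(int (*fn)(int), int value)
{
    return fn(value);
}

/* Pointer to pointer: redirects the caller's pointer to target. */
void retarget(int **pp, int *target)
{
    assert(pp != NULL);
    *pp = target;
}
```

With int result = 0; int *pA = &result; a call to store_triple(pA, 4) leaves 12 in result. Both apply(&triple, 5) and apply(triple, 5) are accepted, because a function name converts to a pointer to the function.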

Address Conversion

(1) Printable to Numeric: int inet_pton(int addressFamily, const char *src, void *dst)

(2) Numeric to Printable: const char* inet_ntop(int addressFamily, const void *src, char *dst, socklen_t dstBytes)

The dst argument is a pointer to a block of memory allocated by the caller to hold the result. The size of the block is determined by the address family.

For example,

struct sockaddr_in servAddr;
int result = inet_pton(AF_INET, servIP, &servAddr.sin_addr.s_addr);

struct sockaddr_in clientAddr;
char clientName[INET_ADDRSTRLEN]; // INET6_ADDRSTRLEN for IPV6
const char *paddr = inet_ntop(AF_INET, &clientAddr.sin_addr.s_addr, clientName, sizeof(clientName));
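The two conversions can be combined into a round trip with the return values checked. This is a sketch for a POSIX system; ipv4_round_trip is a name invented for this note.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

/*
 * Converts a dotted-quad string to binary and back to text.
 * Returns 0 on success, -1 if text is not a valid IPv4 address
 * or the output buffer is too small.
 */
int ipv4_round_trip(const char *text, char *out, socklen_t outLen)
{
    struct sockaddr_in addr;

    assert(text != NULL && out != NULL);
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;

    /* inet_pton returns 1 on success, 0 for an unparsable string */
    if (inet_pton(AF_INET, text, &addr.sin_addr) != 1)
        return -1;

    /* inet_ntop returns NULL on failure */
    if (inet_ntop(AF_INET, &addr.sin_addr, out, outLen) == NULL)
        return -1;
    return 0;
}
```

A buffer of INET_ADDRSTRLEN bytes is large enough for any IPv4 result.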

sockaddr structure

The socket API specifies a generic data type called sockaddr for use by API calls.

struct sockaddr {
sa_family_t sa_family; // address family e.g. AF_INET or AF_INET6
char sa_data[14]; // address info - A blob of bits to handle diff OS and network
};

Note that this sockaddr structure is not large enough to hold an IPV6 address, which is 16 bytes long. The actual data structures used in socket calls are sockaddr_in (for IPV4) and sockaddr_in6 (for IPV6). They are simply more detailed layouts of sockaddr.

struct in_addr { uint32_t s_addr; }; // 4-byte IPV4 address

struct sockaddr_in {
sa_family_t sin_family; //address family AF_INET
in_port_t sin_port; //16-bit port
struct in_addr sin_addr;
char sin_zero[8]; //padding
};

struct in6_addr { uint8_t s6_addr[16]; }; //128-bit address

struct sockaddr_in6 {
sa_family_t sin6_family; //address family AF_INET6
in_port_t sin6_port; //16-bit port
uint32_t sin6_flowinfo; //flow info
struct in6_addr sin6_addr;
uint32_t sin6_scope_id; //scope ID
};

The structure is cast to (struct sockaddr *) when used. For example,

result = bind(servSock, (struct sockaddr*) &servAddr, sizeof(servAddr));

As sockaddr_in is not big enough to hold an IPV6 address, programs that must handle both families allocate space using a sockaddr_storage structure

struct sockaddr_storage { sa_family_t ss_family; .... } ; //ss_family is used to determine the actual address type.

struct sockaddr_storage sockAddr;
:
:
switch (sockAddr.ss_family) {
case AF_INET: ...
case AF_INET6: ...
default: ...
}
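A fuller version of that switch might look like the sketch below. It relies on the POSIX name of the family field, ss_family, and on a helper name (format_address) chosen for this note; the output format "address port" is just an illustration.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>

/*
 * Formats the address held in a sockaddr_storage as "address port".
 * Returns 0 on success, -1 for an unsupported family or a short buffer.
 */
int format_address(const struct sockaddr_storage *ss, char *out, size_t outLen)
{
    char text[INET6_ADDRSTRLEN];
    const void *rawAddr;
    in_port_t port;
    int n;

    assert(ss != NULL && out != NULL);
    switch (ss->ss_family) {
    case AF_INET: {
        const struct sockaddr_in *in4 = (const struct sockaddr_in *)ss;
        rawAddr = &in4->sin_addr;
        port = in4->sin_port;
        break;
    }
    case AF_INET6: {
        const struct sockaddr_in6 *in6 = (const struct sockaddr_in6 *)ss;
        rawAddr = &in6->sin6_addr;
        port = in6->sin6_port;
        break;
    }
    default:
        return -1;                 /* unknown address family */
    }

    if (inet_ntop(ss->ss_family, rawAddr, text, sizeof(text)) == NULL)
        return -1;

    /* the port is stored in network byte order */
    n = snprintf(out, outLen, "%s %u", text, (unsigned)ntohs(port));
    return (n < 0 || (size_t)n >= outLen) ? -1 : 0;
}
```

Because sockaddr_storage is large enough for either family, the same variable can be passed to accept() or recvfrom() and then handed to a routine like this one.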

Sunday, November 17, 2013

Socket

It is a general abstraction through which programs send and receive data. Different types of socket correspond to different underlying protocol suites and different stacks of protocol within the suite.

The main types of TCP/IP socket are stream sockets and datagram sockets. A stream socket represents one end of a TCP connection. It consists of an IP address, a port number and the end-to-end protocol (TCP).

A socket is created by a socket call which returns a handle to the socket:

int socket(int domain, int type, int protocol)

"Domain" refers to the communication domain, recall that socket API is a generic interface for a large number of communication domains (e.g. AF_INET for IPV4 and AF_INET6 for IPV6).

HSocket = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP)

"Type" determines the semantics of the data transmission with the socket. For example, if the transmission is reliable or message boundary is preserved etc. Valid values are SOCK_STREAM or SOCK_DGRAM.

"Protocol" refers to the end to end protocol to be used. Valid values are IPPROTO_TCP or IPPROTO_UDP. A value of 0 means to use the default protocol for the "Type".

The close() call closes the socket.

Special Network Addresses

(1) Loopback address

It is assigned to a loopback interface which is a virtual device that echoes transmitted packets back to the sender. For IPV4, it is 127.0.0.1 and for IPV6, it is ::1.

(2) Private addresses

This group of addresses is for use by sites which connect to the internet via NAT. These addresses cannot be reached from the global internet. For IPV4, they start with 10 or 192.168 or 172.16-31. There is no corresponding class for IPV6.

(3) Link Local or Autoconfiguration addresses

These addresses can only be used to communicate with hosts on the same network. Routers will not forward packets sent to these addresses. For IPV4, they start with 169.254. For IPV6, they start with FE80, FE90, FEA0 or FEB0.

(4) Multicast addresses

For IPV4, they start with 224 to 239. For IPV6, they start with FF.
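The IPV4 prefixes listed above translate directly into a small classifier. This is a sketch only: the enum and function names are invented for this note, addresses are taken in host byte order, and the whole 127.x.x.x block is treated as loopback rather than just 127.0.0.1.

```c
#include <assert.h>
#include <stdint.h>

typedef enum {
    ADDR_GLOBAL,
    ADDR_LOOPBACK,
    ADDR_PRIVATE,
    ADDR_LINK_LOCAL,
    ADDR_MULTICAST
} AddrClass;

/* Builds a host-byte-order IPv4 address from its four octets. */
uint32_t ipv4(unsigned a, unsigned b, unsigned c, unsigned d)
{
    assert(a <= 255 && b <= 255 && c <= 255 && d <= 255);
    return ((uint32_t)a << 24) | ((uint32_t)b << 16) |
           ((uint32_t)c << 8) | (uint32_t)d;
}

/* Classifies an address by its leading octets. */
AddrClass classify_ipv4(uint32_t addr)
{
    unsigned a = (addr >> 24) & 0xFFu;   /* first octet */
    unsigned b = (addr >> 16) & 0xFFu;   /* second octet */

    if (a == 127)                        return ADDR_LOOPBACK;
    if (a == 10)                         return ADDR_PRIVATE;
    if (a == 192 && b == 168)            return ADDR_PRIVATE;
    if (a == 172 && b >= 16 && b <= 31)  return ADDR_PRIVATE;
    if (a == 169 && b == 254)            return ADDR_LINK_LOCAL;
    if (a >= 224 && a <= 239)            return ADDR_MULTICAST;
    return ADDR_GLOBAL;
}
```

An address obtained from a sockaddr_in would need ntohl() applied to sin_addr.s_addr before being passed in.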

JVM

It has a stack-based architecture without registers. This allows the JVM to run the same code regardless of the underlying hardware, since real machines differ in the number and size of their registers and in how they relate to memory. The only register-like structure is the program counter. The result of a method call is returned on the stack.

Mutex

Mutexes are referred to as mutants in the kernel. They are global objects for synchronizing execution. Mutex names are usually hard-coded because the name must be consistent when it is used by 2 processes or threads. Only one thread can own a mutex at any one time. A thread gains access to a mutex using WaitForSingleObject and releases it after use with ReleaseMutex. The CreateMutex function creates a mutex. Another thread uses OpenMutex to obtain a handle to the mutex before using it.

First and Second Chance Exceptions

Debuggers are given 2 chances to handle an exception raised by the program being debugged. When an exception occurs, execution of the program stops and the debugger is given a first chance to handle the exception. The debugger can handle it or choose to pass it on to the program. In the latter case, the program's registered exception handler is given control.

If the program does not handle the exception, the debugger is given a second chance to handle the exception. If there is no debugger attached, the program will usually crash at this point. The debugger must resolve the exception to enable the program to continue to run.

Breakpoint

Software breakpoints are implemented by overwriting the instruction at the break location with 0xCC, which is the INT 3 instruction. This passes control to the debugger when execution reaches that point. The debugger shows the original instruction, but if one inspects the memory, the value has changed to INT 3.

Software breakpoints may not work when code is self-modifying (e.g. malware). In this case, the patch may be overwritten and the breakpoint will not be effective.
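The patch-and-restore bookkeeping behind a software breakpoint can be sketched on an ordinary byte buffer. This is only a model: a real debugger patches the target process's memory through OS debugging APIs, and the Breakpoint type and function names here are invented for this note.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define INT3_OPCODE 0xCCu

typedef struct {
    size_t  offset;     /* where the patch was applied */
    uint8_t original;   /* byte that INT 3 replaced */
    int     active;     /* non-zero while the patch is in place */
} Breakpoint;

/* Saves the original byte and overwrites it with INT 3. */
void breakpoint_set(Breakpoint *bp, uint8_t *code, size_t size, size_t offset)
{
    assert(bp != NULL && code != NULL && offset < size);
    bp->offset = offset;
    bp->original = code[offset];
    bp->active = 1;
    code[offset] = INT3_OPCODE;
}

/* Restores the original byte. */
void breakpoint_clear(Breakpoint *bp, uint8_t *code)
{
    assert(bp != NULL && code != NULL);
    if (bp->active) {
        code[bp->offset] = bp->original;
        bp->active = 0;
    }
}

/* What the debugger displays: the saved byte instead of the patch. */
uint8_t breakpoint_view(const Breakpoint *bp, const uint8_t *code, size_t offset)
{
    if (bp->active && offset == bp->offset)
        return bp->original;
    return code[offset];
}
```

If self-modifying code rewrites code[offset] while the breakpoint is active, the 0xCC byte is lost, which is the failure mode described above.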

Hardware breakpoints are assisted by hardware. For each instruction being executed, the hardware compares the address with the debug registers to determine if a breakpoint is reached. One major drawback is that x86 has only 4 debug address registers. DR0 to DR3 store the addresses of breakpoints. DR7 is the control register which indicates whether each of DR0-3 is active and whether its address represents a read, write or execute breakpoint. Read/write breakpoints allow the program to break when an address is referenced.

To protect the DRs from being modified by malware, set the General Detect flag in DR7. It causes a break prior to any mov instruction that modifies the debug registers.

A conditional breakpoint breaks only when a certain predefined condition is met. For example, break when the second parameter of a function has a particular value. This makes it possible to stop at a frequently executed point only under the condition of interest. Conditional breakpoints are implemented as software breakpoints.

Stack Layout

ESP points to the top of the stack. EBP is usually not changed during the call, providing a reference point for accessing local variables by offset.

(1) Arguments are pushed onto the stack first

(2) Next, the return address is pushed automatically by the CALL instruction

(3) The old EBP is pushed next

(4) Lastly, the local variables are allocated

pusha and pushad push the set of 16-bit and 32-bit general registers respectively onto the stack. For pushad the order is EAX, ECX, EDX, EBX, ESP, EBP, ESI and EDI.

ESP always points to the top element in the stack.

NOP (Intel)

NOP is actually an XCHG EAX,EAX instruction. Its opcode is 0x90. NOPs are commonly seen in buffer overflow exploits when the exact code address can only be approximated. Padding the payload with a series of NOPs lets a jump that lands anywhere in the series slide through to the code.

Windows Thread

Threads share the address space of the process. Each thread has its own stack and registers. When the OS switches threads, the CPU context is stored in a structure called the thread context.

The CreateThread function creates a new thread. The call specifies the start address of the code to be executed. If the start address is that of LoadLibrary, DllMain will be executed after the DLL is loaded.

Windows Network API

Berkeley-compatible sockets function similarly to those on UNIX. They are implemented in the Winsock libraries, primarily in ws2_32.dll. Common socket functions:

socket - create a socket

bind - attach a socket to a port

listen - mark a socket as listening for incoming connections on its port

accept - accept an incoming connection from a remote socket

connect - open a connection to a remote socket which is waiting for a connection

recv - receive data

send - send data

Prior to using these functions, the WSAStartup function must be called to load the network library and allocate resources.

WinINet is a higher level API which implements the HTTP and FTP protocols. It is implemented in Wininet.dll.

InternetOpen - initialize a connection to the Internet

InternetOpenUrl - open a connection to an HTTP or FTP site

InternetReadFile - retrieve a file from the site

reg File

A file with the .reg suffix is a readable text file. When the user double-clicks the reg file, its content is automatically merged into the registry. For example, the following adds a program to run automatically when Windows starts:

Windows Registry Editor Version x.xx

[HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows\CurrentVersion\Run]

"abcvalue"="C:\\abc.exe"

Alternate Data Stream (ADS)

It is a feature that allows additional data to be attached to an existing file in NTFS, essentially adding 1 file to another. The extra data does not show up in a DIR command listing and is not visible when the file is browsed or edited. A program can access the stream via the name file.txt:Stream:$DATA

Long Pointer (LP)

Strings are usually named with an lp prefix (e.g. lpStr1) as they really point to the memory location where the string starts. LP is 32-bit. P (pointer) is the same as LP on 32-bit systems. They only differ on 16-bit systems.

Windows Handles

Like pointers, handles refer to an object or memory location. However, handles cannot be used in arithmetic operations and they do not always represent memory addresses. They can only be used in function calls to refer to the same object.

Friday, September 27, 2013

Intel Assembly Addressing Mode

Global variable is defined in a .DATA section. DB, DW and DD declare variables of 1, 2 and 4 bytes in length.

.DATA
var1 DB 64 ; initialize variable with value 64
var2 DB ? ; uninitialized variable
var3 DD 1, 2, 3 ; declare 3 doubleword variables initialized to 1, 2 and 3

arr1 DD 100 DUP (0) ; declare an array of 100 entries initialized to 0
str1 DB 'hello',0 ; declare a null-terminated string 6 bytes long

mov eax, [ebx] ; load the 4 bytes at the address stored in ebx into eax
mov [eax], ebx ; move the ebx content to the address stored in eax

Sometimes the size of the data being manipulated is ambiguous, e.g. when an immediate value is used

mov BYTE PTR [ebx], 2 ; move 2 into a single byte at address stored in ebx
mov WORD PTR [ebx], 2
mov DWORD PTR [ebx], 2

Windows System Call Flow

(1) user mode program calls BOOL WINAPI WriteFile()
(2) control transfers to the WriteFile() routine implemented by kernel32.dll
(3) kernel32.dll calls ZwWriteFile() in ntdll.dll (user mode)
(4) ZwWriteFile() calls KiFastSystemCall() in ntdll.dll, which executes the SYSENTER instruction to transition to kernel mode
(5) SYSENTER transfers control to KiFastCallEntry() in ntoskrnl.exe (Executive) via the MSR_CS and MSR_EIP settings
(6) KiFastCallEntry() calls KiSystemService() in ntoskrnl.exe
(7) KiSystemService() dispatches service 0x163, which is NtWriteFile() in ntoskrnl.exe

Windows API

With the exception of NtGetTickCount() and NtCurrentTeb(), each Nt* function has a matching Zw* function. For a user mode program, calling an Nt* function eventually ends up calling the Zw* function. In kernel mode, calling a Zw* function follows the formal transition path via the KiSystemService() routine; calling an Nt* function does not.

Windows user mode components

An environmental subsystem provides the API that a specific class of applications needs in order to run. NT4 supports 5 environmental subsystems:

Win32 or later Windows subsystem
Windows on Windows (WOW) for 16-bit Windows applications e.g. Win 3.1
NT Virtual DOS machine (NTVDM) for DOS applications
OS/2
POSIX and later Services for UNIX (SFU) or Subsystem for UNIX based application (SUA)

Windows subsystem consists of 3 basic components:
(1) csrss.exe - Client Server Runtime Subsystem (user mode). It plays a role in managing processes and threads. It supports the command line interface.
(2) win32k.sys - Kernel mode device driver
(3) User mode DLLs that implement the subsystem's API, e.g. kernel32.dll, gdi32.dll, shell32.dll, rpcrt4.dll, advapi32.dll, user32.dll etc.

When a Windows API needs to access services in the executive, it goes through ntdll.dll, which reroutes the call to ntoskrnl.exe

Service Control Manager (SCM) is implemented by services.exe in the system32 directory. The SCM launches and manages user mode services, which are just user-mode applications that run in the background.

Windows kernel mode components

The core is implemented in ntoskrnl.exe. This executable implements its functionality in 2 layers - executive and kernel.

The executive implements the system call interface and major OS components (such as the I/O manager, memory manager, and process and thread manager). Kernel mode device drivers sit in a layer between the executive's I/O manager and the HAL. The kernel implements low level routines (e.g. synchronization, thread scheduling, interrupt handling) that the executive uses to provide high level services.

There are several versions of the kernel executable

ntoskrnl.exe - uniprocessor without PAE
ntkrnlpa.exe - uniprocessor with PAE
ntkrnlmp.exe - multiprocessor without PAE
ntkrpamp.exe - multiprocessor with PAE

win32k.sys is a kernel mode driver that implements both user and graphics device interface (GDI) services. GDI was moved into kernel mode for speed.

User to Kernel Mode Switching

In real mode, MS-DOS uses the Interrupt Vector Table (IVT) to expose system services that run in supervisor mode. Applications call INT 0x21 with a function code placed in AH.

Windows uses the IDT (Interrupt Descriptor Table). In a multiprocessor environment, each processor has its own IDTR register. Windows checks the processor it is running on during start up to determine its system call invocation mechanism.

For Pentium II, the INT 0x2E instruction and the IDT are used to implement the system call mechanism. For later IA32 processors, Windows uses the SYSENTER instruction to jump to kernel space, and the IDT is only used to handle hardware exceptions. The IDT contains up to 256 8-byte descriptors. To dump the descriptor registers' content, use the rM 0x100 command in the debugger. idtr shows the base address and idtl shows the limit (length). To format the IDT content, use the debugger command !idt -a

In Windows, most of the entries point to KiUnexpectedInterrupt routines, which in turn jump to the nt!KiEndUnexpectedRange routine. Even though later processors use SYSENTER, the IDT entry at 0x2E still implements the functionality by pointing to nt!KiSystemService (System Service Dispatcher). It uses information passed on from the application to invoke the native API routine. Nowadays, switching from user to kernel mode is done via the SYSENTER instruction. Three 64-bit machine specific registers (MSRs) are used to identify the target to jump to and the location of the kernel-level stack (in case the user mode stack needs to be copied over).

IA32_SYSENTER_CS (0x174 register address) - kernel mode code and stack segment
IA32_SYSENTER_ESP (0x175) - stack pointer in the stack segment
IA32_SYSENTER_EIP (0x176) - first instruction to execute

These registers are manipulated using the RDMSR and WRMSR instructions.

SYSENTER_CS usually points to a Ring 0 code segment that spans the entire address range. Thus SYSENTER_EIP is a full 32-bit linear address in a kernel module called KiFastCallEntry. The module will eventually jump to KiSystemService.

Like INT 0x2E, the service number needs to be stowed in EAX before calling SYSENTER. KiFastCallEntry invokes KiSystemService to dispatch the target Nt function. The dispatch uses the service number to index a lookup table. The system service number is 32-bit. Bits 0 to 11 represent the service number to be invoked. Bits 12-13 specify 1 of 4 possible service descriptor tables. In fact, only 2 of the service tables are used. If the table number is 0x00, KeServiceDescriptorTable is used. If the table number is 0x01, KeServiceDescriptorTableShadow is used. KeServiceDescriptorTable is exported by ntoskrnl.exe, while KeServiceDescriptorTableShadow is not exported and is used internally by the executive.
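The bit layout of the service number can be sketched with two masks. The ServiceId type and the function names below are made up for this note; only the bit positions and table names come from the description above.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    unsigned table;   /* bits 12-13: which service descriptor table */
    unsigned index;   /* bits 0-11: index into that table's SSDT */
} ServiceId;

/* Splits a system service number into table selector and index. */
ServiceId decode_service_number(uint32_t number)
{
    ServiceId id;
    id.index = number & 0xFFFu;
    id.table = (number >> 12) & 0x3u;
    return id;
}

/* Maps the 2-bit selector to the table described in the text. */
const char *service_table_name(unsigned table)
{
    assert(table < 4);               /* the selector is only 2 bits wide */
    switch (table) {
    case 0:  return "KeServiceDescriptorTable";
    case 1:  return "KeServiceDescriptorTableShadow";
    default: return "unused";
    }
}
```

For instance, service number 0x163 decodes to table 0, index 0x163, while a number such as 0x1005 would select table 1, index 5.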

The 2 descriptor tables contain a structure called System Service Table (SST):

serviceTable points to an array of linear addresses which are the entry points of routines. The array is called the SSDT (System Service Dispatch Table) and contains 391 elements. The SSDT is similar to the IVT.
nEntries specifies the number of elements in the SSDT
argumentTable is a pointer to an array of bytes called the SSPT (System Service Parameter Table). Each byte represents the number of bytes allocated for function arguments for the corresponding SSDT routine.

KeServiceDescriptorTable contains one SST. KeServiceDescriptorTableShadow contains 2 SSTs. The first one is the same as the one contained in KeServiceDescriptorTable. The second one points to the SSDT for the GDI routines implemented by win32k.sys and contains 772 entries.

HAL and bootvid

Hardware Abstraction Layer (HAL) insulates the OS from hardware by wrapping machine-specific details with an API that is implemented by HAL.DLL. Kernel mode device drivers invoke HAL routines rather than interface to hardware directly.

The HAL implementation depends on the hardware on which Windows runs. The HAL is located in the system32 directory:

hal.dll - standard PC
halacpi.dll - hardware with advanced configuration and power interface (ACPI)
halmacpi.dll - hardware that uses multiple processors

Sitting alongside the HAL, bootvid.dll offers primitive VGA graphics support during the boot phase. It can be controlled via the /noguiboot option in boot.ini.

ASLR (Address Space Layout Randomization)

The Memory Manager in early versions of Windows tried to load binaries at the same location in the linear address space each time they were loaded. The /BASE linker option allows the developer to specify a preferred address for a DLL or executable. The preferred address is stored in the header of the binary. If a preferred address is not specified, the default load address is 0x400000 for an executable and 0x10000000 for a DLL. If the address is in use, the system relocates the binary to another region. The /FIXED linker option prevents relocation and causes an error message to be issued instead.

ASLR allows the binaries to be loaded in random addresses. It is enabled with the /DYNAMICBASE linker option. Common DLL will still be shared by multiple address spaces that use them.

I/O Techniques

(1) Programmed I/O

When a processor encounters an I/O instruction, it issues a command to the appropriate I/O module. The I/O module sets the appropriate bits in the I/O status register but does not alert the processor. The processor needs to check for I/O completion periodically after the I/O instruction is executed. The processor is also responsible for transferring the data from the hardware buffer to memory. The processor has various I/O instructions to control the device (e.g. rewind a tape drive), test status and transfer data.

(2) Interrupt driven I/O

The I/O module interrupts the processor when the I/O completes. However, the processor is still required to transfer the data to memory. There are 2 drawbacks: the I/O transfer rate is limited by the speed at which the processor can test and service a device, and the processor is tied up in managing the I/O transfer.

(3) Direct Memory Access

Interrupt driven I/O is not efficient when a large amount of data is to be transferred. DMA is performed by a separate module on the system bus. The processor issues a command to the DMA module with information such as the operation required (READ/WRITE), the address of the I/O device, the starting address of the memory location and the number of words to be moved. The DMA module transfers the data directly, one word at a time. When the transfer is completed, the DMA module alerts the processor. As the DMA module needs to take control of the bus to transfer data, it may contend with the processor for its use. The processor waits for one bus cycle when DMA is using the bus. However, no context switch is incurred. Overall, DMA is more efficient for multi-word I/O transfers.

Thread Models

In a User-Level Thread (ULT) environment, all thread management is done by the application. The kernel is unaware of the existence of threads. The application uses a thread library for thread management (creating and destroying threads, passing data, scheduling and storing thread context). The application begins in a process with a single thread and spawns further threads in the same process. The context of a thread consists of registers, program counter and stack pointer. The kernel schedules execution at the level of the process.

Advantages of user-level threads are:
(1) Thread switching is done completely in user mode, with no switch to kernel mode
(2) Different applications can use different scheduling algorithms
(3) The threading model can run on any OS as there is no need for kernel support

Disadvantages of user-level threads are:
(1) When a ULT executes a blocking system call, the process and all its threads are blocked. A technique called jacketing converts the blocking call into a non-blocking call. The jacket routine checks if the device is busy. If it is, the routine blocks the thread and passes control to another thread.
(2) As the kernel assigns a process to only 1 processor at a time, ULTs cannot take advantage of a multiprocessor environment.

In a Kernel-Level Thread (KLT) environment, thread management is done by the kernel. The application creates threads using a kernel API. The disadvantage is that transferring control from one thread to another requires switching from user to kernel mode. In a benchmark, kernel-level thread switching can be 30 times slower than user-level thread switching.

Windows Boot Process

(1) The machine starts in POST (Power On Self Test), which detects the amount of memory and enumerates the attached storage devices.

(2) The BIOS searches the bootable devices for a boot sector. If the bootable device is a hard disk, the boot sector is an MBR (Master Boot Record) written by Windows setup. The MBR contains code and a partition table used to identify the active partition. The active partition is also called the bootable partition or the system volume.

(3) The MBR loads the partition boot sector (called the VBR or volume boot record) into memory.

(4) If the boot device is not a hard disk (e.g. DVD or floppy), the BIOS loads the device's VBR into memory.

(5) The VBR boot code reads the partition's file system just well enough to locate and load the 16-bit boot manager program. The boot manager is actually 2 executables concatenated together. The first module is 16-bit and executes in real mode. It sets up the necessary data structures, switches to protected mode and loads the protected mode boot manager (32-bit or 64-bit) into memory.

(6) On an EFI (Extensible Firmware Interface) machine, the boot code is in the firmware and there is no need for an MBR or VBR. The boot manager path is provided to EFI via a variable setting. The EFI firmware switches to protected mode in a flat memory model with paging disabled and runs bootmgr.efi (the boot manager).

(7) Both BIOS and EFI load the boot manager. The boot manager uses configuration data stored in the registry (the BCD or boot configuration data). The BCD has 2 kinds of element. A Windows boot manager object controls the character-based boot menu (locale, default timeout etc). The boot loader objects represent different boot configurations (e.g. normal, debugging etc). If there is only 1 boot loader object, the boot manager does not display the character UI.

(8) The boot manager loads the Windows boot loader (winload.exe), whose location is specified in the boot loader object.

(9) winload.exe is the successor to NTLDR. winload starts by loading the SYSTEM registry hive (from c:\Windows\system32\config). The SYSTEM hive is mounted under HKLM\SYSTEM.

(10) winload loads nt5.cat, which contains the digital signature catalog, and performs an integrity test of its own memory image. If the signature does not match, winload halts.

(11) winload then loads ntoskrnl.exe and hal.dll. If a debugger is attached, winload also loads the kernel mode driver for the debugger (kdcom.dll for null modem, kd1394.dll for FireWire and kdusb.dll for a USB debug cable). winload checks the integrity of each loaded module against nt5.cat.

(12) winload then continues to load the DLLs imported by ntoskrnl.exe and checks their images against nt5.cat. The modules loaded are pshed.dll, bootvid.dll, clfs.sys and ci.dll.

(13) winload scans HKLM\SYSTEM\CurrentControlSet\Services for device drivers that belong to the boot class category (i.e. a Start parameter with value equal to 0x00000000 or SERVICE_BOOT_START). If integrity checking is enabled, winload checks the signatures of these drivers against nt5.cat. Again, winload halts if the integrity check fails.

(14) winload enables paging, saves the boot log and transfers control to ntoskrnl.exe via its exported function KiSystemStartup().

(15) ntoskrnl builds its data structures (e.g. page tables) and loads ntdll.dll. The executive searches HKLM\SYSTEM\CurrentControlSet\Services for system class drivers and services (subkeys with a Start value equal to 0x00000001). If integrity checking is enabled, the executive verifies them using ci.dll. Any driver that fails the test is not loaded.

(16) The executive starts the session manager (smss.exe). smss starts the Windows subsystem that supports the Windows API. Since that subsystem is not yet running when smss starts, smss itself uses only the native API.

(17) The Windows subsystem consists of 2 parts: win32k.sys is the kernel mode driver and csrss.exe is the user mode component. smss locates the kernel mode driver in the registry under HKLM\SYSTEM\CurrentControlSet\Control\Session Manager\SubSystems\Kmode. win32k.sys switches from the default boot VGA mode to the target display mode.

(18) smss also loads the user mode component specified in HKLM\SYSTEM\CurrentControlSet\Control\Session Manager\SubSystems\Required. The entry points to two other entries - Debug and Windows. Normally Debug is empty and Windows points to csrss.exe.

(19) csrss.exe enables sessions to support user-mode applications that make calls to the Windows API.

(20) smss.exe continues by loading the "known" DLLs specified under HKLM\SYSTEM\CurrentControlSet\Control\Session Manager\KnownDLLs\

(21) smss.exe creates 2 sessions (0 and 1). Session 0 hosts the init process. Session 1 hosts the logon process.

(22) Session 0 version of smss.exe launches wininit.exe

(23) Session 1 version of smss.exe launches winlogon.exe

(24) The original smss.exe then waits in a loop and listens for LPC requests to spawn other subsystems, create new sessions or shut down the system.

(25) wininit creates 3 child processes. The Local Security Authority Subsystem (lsass.exe) sits in a loop listening for security-related LPC requests. The Service Control Manager (services.exe) loads drivers and services marked as SERVICE_AUTO_START (0x00000002) in the registry. The Local Session Manager (lsm.exe) handles connections to the machine made via terminal services.

(26) winlogon.exe handles user logons. It runs logonui.exe to display the logon prompt. logonui.exe passes the credentials to lsass.exe. If the logon is successful, winlogon.exe launches the applications specified by the UserInit and Shell values under HKLM\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Winlogon. By default, UserInit specifies userinit.exe and Shell specifies explorer.exe.

(27) userinit.exe processes the group policy objects. It also cycles through several registry subkeys and directories to launch startup programs and scripts:

HKLM\SOFTWARE\Microsoft\Windows\CurrentVersion\RunOnce\
HKLM\SOFTWARE\Microsoft\Windows\CurrentVersion\Run\
HKCU\Software\Microsoft\Windows\CurrentVersion\RunOnce\
HKCU\Software\Microsoft\Windows\CurrentVersion\Run\
C:\ProgramData\Microsoft\Windows\Start Menu\Programs\Startup\
C:\Users\%USERNAME%\AppData\Roaming\Microsoft\Windows\Start Menu\
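
The Start values mentioned in steps (13), (15) and (25) decide which component loads a driver or service. The lookup below is only an illustrative Python sketch, not Windows code: values 0 to 2 come from the steps above, values 3 and 4 are the standard SERVICE_DEMAND_START and SERVICE_DISABLED constants, and the names LOADED_BY and loader_for() are made up for this example.

```python
# Start values stored under HKLM\SYSTEM\CurrentControlSet\Services\<name>
SERVICE_BOOT_START = 0x0
SERVICE_SYSTEM_START = 0x1
SERVICE_AUTO_START = 0x2
SERVICE_DEMAND_START = 0x3
SERVICE_DISABLED = 0x4

# Which boot-time component acts on each Start value (hypothetical table).
LOADED_BY = {
    SERVICE_BOOT_START: "winload.exe",
    SERVICE_SYSTEM_START: "ntoskrnl.exe (executive)",
    SERVICE_AUTO_START: "services.exe (SCM)",
    SERVICE_DEMAND_START: "on demand",
    SERVICE_DISABLED: "never loaded",
}


def loader_for(start_value):
    """Return the component that loads a driver/service with this Start value."""
    return LOADED_BY.get(start_value, "unknown")


if __name__ == "__main__":
    for value in range(5):
        print(hex(value), "->", loader_for(value))
```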

Kernel Mode Driver

A KMD (kernel mode driver) sits between the I/O manager (Io* routines) and hal.dll. The KMD uses the API exposed by hal.dll to interact with the hardware.

A KMD processes IRPs (I/O Request Packets) handed down from the I/O manager on behalf of user applications. Microsoft introduced driver frameworks to ease the development of KMDs. WDM (Windows Driver Model) was released to support Win98 and W2K. WDF (Windows Driver Framework) encapsulates WDM with another layer of abstraction.

The DriverEntry() routine is executed when the KMD is first loaded into kernel space. DriverEntry() returns its status as an NTSTATUS value. DriverEntry() takes 2 parameters. The first IN parameter is a pointer to a DRIVER_OBJECT, which contains information about the driver, including a list of function pointers:

DriverInit - by default, the I/O manager sets this to the address of DriverEntry()

DriverUnload - set by the KMD to the routine to execute when the KMD is unloaded

DriverDispatch - the MajorFunction array, which defines the routines to be executed in response to the major function codes (e.g. IRP_MJ_READ, IRP_MJ_WRITE, IRP_MJ_DEVICE_CONTROL etc) in the IRP passed down

Dispatch routines carry the following signature:

NTSTATUS DispatchRoutine(IN PDEVICE_OBJECT DeviceObject, IN PIRP Irp);

For device control, the IRP contains a 32-bit field, IoControlCode, which provides further information on the IRP. IoControlCode comprises four sub-fields:

(1) DeviceType - Microsoft reserves type values 0x0000 to 0x7FFF, e.g. FILE_DEVICE_DISK, FILE_DEVICE_KEYBOARD. Users can define their own types using 0x8000 to 0xFFFF (32K values)

(2) Function - a program-specific integer value that defines the action to be performed. The field is 12 bits wide: Microsoft reserves 0x000 to 0x7FF and user-defined functions span 0x800 to 0xFFF

(3) Method - defines how data is passed between user and kernel mode code, e.g. METHOD_BUFFERED means the OS creates a non-paged system buffer

(4) Access - READ or WRITE access to be declared before opening the file object representing the device.
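
The four sub-fields can be packed and unpacked with a few shifts. The sketch below mirrors the bit layout of the CTL_CODE macro from the Windows driver headers (DeviceType in bits 16-31, Access in bits 14-15, Function in bits 2-13, Method in bits 0-1). It is plain Python for illustration only; the helper names ctl_code() and split_ctl_code() and the sample device type and function values are assumptions, not part of any driver.

```python
# Transfer methods (bits 0-1)
METHOD_BUFFERED = 0
METHOD_IN_DIRECT = 1
METHOD_OUT_DIRECT = 2
METHOD_NEITHER = 3

# Required access (bits 14-15)
FILE_ANY_ACCESS = 0
FILE_READ_ACCESS = 1
FILE_WRITE_ACCESS = 2


def ctl_code(device_type, function, method, access):
    """Pack the four sub-fields into a 32-bit IoControlCode."""
    return (device_type << 16) | (access << 14) | (function << 2) | method


def split_ctl_code(code):
    """Unpack a 32-bit IoControlCode into its four sub-fields."""
    return {
        "device_type": (code >> 16) & 0xFFFF,
        "access": (code >> 14) & 0x3,
        "function": (code >> 2) & 0xFFF,
        "method": code & 0x3,
    }


if __name__ == "__main__":
    # Hypothetical user-defined device type and function number.
    code = ctl_code(0x8000, 0x800, METHOD_BUFFERED,
                    FILE_READ_ACCESS | FILE_WRITE_ACCESS)
    print(hex(code), split_ctl_code(code))
```

Note that the Function field is only 12 bits wide in this layout, so a function number above 0xFFF would spill into the Access bits.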

To use the KMD, it must first be registered with the OS via RegisterDriverDeviceName(). A call to RegisterDriverDeviceLink() then creates a symbolic link through which user mode programs communicate with the KMD. A user mode program first uses CreateFile() to open the device. It can then use the Windows API DeviceIoControl() to communicate with the KMD.

Kernel Patch Protection (KPP) or PatchGuard

Originally deployed in 2005, PatchGuard has had 2 later upgrades (v2 and v3) to counter bypass techniques. PatchGuard periodically (every 5 to 10 min) checks several vital system components (SSDT, IDT, GDT, MSR, ntoskrnl.exe, hal.dll and ndis.sys) against known signatures. It issues a bug check with stop code 0x00000109 (CRITICAL_STRUCTURE_CORRUPTION) when it detects a change in any component.

Kernel Mode Code Signing (KMCS)

KMDs are required to be digitally signed in order to be loaded. Boot drivers are loaded early by winload.exe. Any boot driver that fails the integrity check will prevent Windows from starting up. ntoskrnl.exe uses routines exported from ci.dll to check the rest of the drivers.

Service Control Manager

SCM is used to load and start services and drivers (KMDs, which run in kernel space).

sc.exe is a utility to define, start, stop and delete services and drivers.

The corresponding programmatic calls are:

OpenSCManager() - opens and obtains a handle to the SCM database. The handle is required for subsequent SCM calls
CreateService() - defines the service using the supplied information, including name, binary path, START type etc. The handle returned is required for the StartService() call
StartService() - loads the driver or service
ControlService() - used to stop the service
DeleteService() - removes the driver information from the SCM database

Thursday, September 19, 2013

DEP (Data Execution Prevention)

A Windows feature that prohibits execution in designated pages in order to protect data, stack or heap pages. Hardware-enforced DEP is applicable to both the OS and applications. Software-enforced DEP is applicable to applications. Enabling DEP will also enable PAE. DEP is enabled using the /NXCOMPAT linker option.

PAE and AWE

The amount of RAM that can be accessed by Windows depends on the OS version and the underlying hardware.
For IA-32 hardware, Windows can access more than 4G of RAM using Intel PAE (Physical Address Extension, available since the Pentium Pro). PAE is an extension to the system-level bookkeeping that allows a machine (via the paging mechanism) to increase the number of address lines from 32 to 36. PAE is enabled in Windows via the /PAE boot option.

AWE (Address Windowing Extensions) is a Microsoft-specific feature that allows an application to access RAM beyond the 4G linear address space limit. AWE is an API (declared in winbase.h). AWE uses a set of fixed-size regions (windows) in an application's linear address space and maps them to a larger set of fixed-size windows in physical memory.
AWE can operate with or without PAE. An application needs the "Lock Pages in Memory" privilege to use AWE. The AWE-related calls are:

VirtualAlloc() or VirtualAllocEx() - reserve a region in linear address space

AllocateUserPhysicalPages() - allocate pages of physical memory to be mapped to linear memory

MapUserPhysicalPages() or MapUserPhysicalPagesScatter() - map allocated pages of physical memory to linear memory

FreeUserPhysicalPages() - release the allocated pages

Tuesday, September 17, 2013

Debugger

A machine debugger (e.g. the debug command in DOS) views a program as a stream of bytes. It can examine the contents of registers and memory locations. It has no concept of variables or routines.

A symbolic debugger is a source-level debugger. To debug at source level, it uses the target program's debug symbol table. The table contains a collection of variable-length records generated by the compiler. The records contain information about variables (name, type, address) and functions (name, start address, end address, and the start and end address range of each statement).

This information allows the debugger to step through the source code by running the machine instructions within the defined ranges.

All operating systems provide hooks for debuggers. Under DOS, a debugger is driven by 2 interrupts:

INT 0x3 - signals a breakpoint

INT 0x1 - allows single stepping

When the TF (Trace Flag) is set in FLAGS/EFLAGS, the processor executes a single instruction and then automatically raises INT 0x1. This causes the ISR for 0x1 to execute. The processor clears TF automatically whenever it invokes an ISR, so that the ISR itself (the debugger) does not run in single-step mode.

Friday, August 2, 2013

Protection via Paging

When both segment and page level protection are enabled, the segment check is done first, followed by the page level check. The page level check occurs in parallel with the address resolution process, so no performance overhead is incurred. Segment-based violations generate a general protection fault (#GP). Page-based violations generate a page fault exception (#PF). Segment protection cannot be overridden by page protection settings. For example, setting a page writable will not make a code segment writable.

Paging is optional. Even when paging is enabled, its protection can be nullified by clearing the WP (write protection) flag in CR0 and setting the R/W and U/S flags in every PDE and PTE. This makes all memory pages writable and assigns all of them to the user privilege level.

There are 2 different types of check in paging mode:
- User/Supervisor mode check (facilitated by the U/S flag, bit 2)
- Page type check (facilitated by the R/W flag, bit 1)

When the CPL of a program is 0, 1 or 2, it runs in supervisor mode (corresponding to U/S clear). When the CPL is 3, it runs in user mode.

Code executing in supervisor mode can access every page of memory (with the exception of writing to read-only user-level pages when WP is set in CR0).

Code executing in user mode can only read user-level pages whose R/W flag is clear (i.e. RO). It can both read and write user-level pages whose R/W flag is set. It cannot access supervisor-level pages at all.
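
The rules above can be condensed into one small decision function. This is a simplified Python sketch under stated assumptions: it takes the effective U/S and R/W flags for a page (real hardware combines the PDE and PTE flags), it lets WP block supervisor writes to any read-only page, and it ignores later additions such as the NX bit. The function name page_access_allowed() is invented for this example.

```python
def page_access_allowed(cpl, write, us, rw, wp):
    """Decide whether a page access passes the page-level checks.

    cpl   - current privilege level (0-3)
    write - True for a write access, False for a read
    us    - U/S flag of the page (1 = user page, 0 = supervisor page)
    rw    - R/W flag of the page (1 = read/write, 0 = read-only)
    wp    - WP flag in CR0 (1 = write protection applies to supervisor)
    """
    supervisor = cpl < 3
    if supervisor:
        if not write:
            return True  # supervisor can read every page
        # Supervisor writes fail only on read-only pages when WP is set.
        return rw == 1 or wp == 0
    # User mode: supervisor pages are off limits.
    if us == 0:
        return False
    # User pages: reads always pass, writes need R/W set.
    return (not write) or rw == 1


if __name__ == "__main__":
    print(page_access_allowed(cpl=3, write=True, us=1, rw=0, wp=1))
    print(page_access_allowed(cpl=0, write=True, us=1, rw=0, wp=0))
```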

Though segmentation is mandatory, its effect can be nullified by implementing a flat memory model. The GDT contains 5 entries: 1 null descriptor and 2 sets of code/data segments, one set with a DPL of 0 and the other with a DPL of 3. Each segment covers the entire linear address space (4G).

Saturday, July 20, 2013

Interrupt

An interrupt service routine (ISR), or interrupt handler, is triggered to handle an interrupt. In real mode, the first 1K of memory (addresses 0x000 to 0x3FF) contains the IVT (interrupt vector table). In protected mode, the structure is called the IDT (interrupt descriptor table). Both the IVT and the IDT map interrupts to their ISRs.

In real mode, the IVT stores the logical address of each ISR sequentially. Each entry is 4 bytes: 2 for the segment selector and 2 for the effective address. The IVT contains 256 entries.
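
Since every entry is 4 bytes, the entry for vector n sits at address n x 4. The Python sketch below looks up an entry in a fake 1K memory image. It assumes the usual x86 layout in which the 16-bit offset is stored first and the 16-bit segment second (both little-endian); the function names and the sample handler address are made up for illustration.

```python
import struct

IVT_ENTRY_SIZE = 4  # 2 bytes offset + 2 bytes segment


def ivt_entry_address(vector):
    """Address of the IVT entry for an interrupt vector (0-255)."""
    if not 0 <= vector <= 255:
        raise ValueError("vector must be 0-255")
    return vector * IVT_ENTRY_SIZE


def read_ivt_entry(memory, vector):
    """Return (segment, offset) of the ISR for a vector.

    memory is a bytes-like object holding at least the first 1K of RAM.
    """
    offset, segment = struct.unpack_from("<HH", memory, ivt_entry_address(vector))
    return segment, offset


if __name__ == "__main__":
    memory = bytearray(1024)
    # Pretend the handler for INT 0x21 lives at F000:1234.
    struct.pack_into("<HH", memory, ivt_entry_address(0x21), 0x1234, 0xF000)
    segment, offset = read_ivt_entry(memory, 0x21)
    print(f"INT 21h -> {segment:04X}:{offset:04X}")
```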

Under MS-DOS, the BIOS handles interrupts 0-31. DOS system calls map to interrupts 32-63. The remaining interrupts, 64-255, are user defined.

There are 3 types of interrupts:
(1) Hardware interrupts (external interrupts) are generated by external devices. They are either maskable or non-maskable. Maskable interrupts can be disabled by clearing the IF flag using the CLI instruction. A non-maskable interrupt cannot be ignored and is always handled by the processor.

(2) Software interrupts (internal interrupts) are raised by programs using the INT instruction. INT takes an integer operand which represents the interrupt vector to invoke. INT pushes FLAGS, CS and IP onto the stack (saving the state and return address), clears TF (Trace Flag) and IF (no tracing and no maskable interrupts while the handler executes) and jumps to the ISR, which runs until IRET.

(3) Exceptions are generated when the processor detects an error while executing an instruction. There are 3 types of exception (faults, traps and aborts), which differ in how the error is reported and whether the instruction can be restarted. When a fault occurs, the processor reports the exception at the boundary preceding the offending instruction. In other words, the state is reset to allow the instruction to restart. Interrupt 0 (divide by zero) is an example of a fault. When a trap occurs, no instruction restart is possible. The processor reports it at the boundary preceding the next instruction. Examples of traps are 3 (breakpoint) and 4 (overflow). When an abort occurs, the program cannot be restarted.

In protected mode, the IDT stores an array of 64-bit gate descriptors. A gate descriptor can be an interrupt gate, a trap gate or a task gate.

Unlike the IVT, the IDT can reside at any location in the linear address space. The 32-bit base address of the IDT is stored in the 48-bit IDTR register (bits 16 to 47). The size limit of the IDT (in bytes) is stored in bits 0 to 15. IDTR can be manipulated with the LIDT and SIDT instructions. A reference beyond the IDT limit generates a general-protection exception. As in real mode, the maximum number of IDT entries is 256. Entries 0 to 31 are reserved by the IA-32 processor for various interrupts and exceptions.
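
As a rough illustration of the IDTR layout, the Python sketch below splits a 48-bit IDTR value into base and limit and computes where the gate descriptor for a vector lives. It assumes the limit field holds the offset of the last valid byte (table size minus 1), so a full 256-entry table has a limit of 0x7FF. The function names and the sample base address are hypothetical.

```python
GATE_SIZE = 8  # each gate descriptor is 64 bits


def split_idtr(idtr):
    """Split a 48-bit IDTR value into (base, limit)."""
    limit = idtr & 0xFFFF              # bits 0-15
    base = (idtr >> 16) & 0xFFFFFFFF   # bits 16-47
    return base, limit


def gate_address(idtr, vector):
    """Linear address of the gate descriptor for a vector.

    Raises ValueError where the processor would raise #GP.
    """
    base, limit = split_idtr(idtr)
    offset = vector * GATE_SIZE
    if offset + GATE_SIZE - 1 > limit:
        raise ValueError("vector beyond IDT limit (#GP)")
    return base + offset


if __name__ == "__main__":
    # Made-up IDTR: base 0x80B95400, limit 0x7FF (256 entries).
    idtr = (0x80B95400 << 16) | 0x07FF
    print(hex(gate_address(idtr, 3)))
```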

Gate descriptors allow programs to access code segments with different privilege levels. Gate descriptors are system descriptors (with the S-flag cleared). The type of gate descriptor is encoded in the TYPE field. Gates can be 16-bit or 32-bit, which allows the system to determine whether the stack push is the 16- or 32-bit variant.

Call gate descriptors live in the GDT. Instead of storing a 32-bit base linear address like a code or data segment descriptor, a call gate stores a 16-bit segment selector and a 32-bit offset address. The segment selector references a code segment in the GDT. The offset address points to the entry point of the procedure within that segment. In effect, a descriptor in the GDT points (via the selector) to another descriptor in the GDT, which points to a code segment (to which the offset address is then applied).

Jumping to a new segment through a call gate requires 2 conditions (privilege levels are compared numerically, with 0 the most privileged):
(1) CPL of the program and RPL of the selector for the call gate <= DPL of the call gate descriptor
(2) CPL of the program >= DPL of the destination code segment
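
The two conditions translate directly into a predicate. This is just a Python sketch of the checks as listed above (privilege levels are plain integers 0-3); the function name call_gate_allowed() is an invention for this example.

```python
def call_gate_allowed(cpl, rpl, gate_dpl, target_dpl):
    """Check the two call gate conditions.

    cpl        - current privilege level of the calling program
    rpl        - requested privilege level of the call gate selector
    gate_dpl   - DPL of the call gate descriptor
    target_dpl - DPL of the destination code segment
    """
    # (1) The caller must be privileged enough to use the gate.
    if max(cpl, rpl) > gate_dpl:
        return False
    # (2) The destination must be at least as privileged as the caller.
    return cpl >= target_dpl


if __name__ == "__main__":
    # Ring 3 code calling a ring 0 routine through a gate with DPL 3.
    print(call_gate_allowed(cpl=3, rpl=3, gate_dpl=3, target_dpl=0))
```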

Interrupt gate and trap gate descriptors behave like call gates, except that they reside in the Interrupt Descriptor Table (IDT). The segment selector specifies a code segment in the GDT. The effective address points to the entry point of the service routine in the segment. So both kinds of descriptor end up referencing the GDT. The only difference between an interrupt gate and a trap gate is that the processor clears IF in EFLAGS when the handler is entered through an interrupt gate. For a trap gate, the IF value is left unchanged.

For the security check, the CPL of the program invoking the handler must be less than or equal to the DPL of the gate. This condition is checked only when the handling routine is invoked by software (e.g. INT). The DPL of the code segment that the segment selector points to must be less than or equal to the CPL.

Sunday, July 14, 2013

Protection through Segmentation

Checks are performed during logical to linear address translation when segmentation is enabled.

(1) Limit check uses the 20-bit limit field to ensure that a program does not access memory beyond the segment. The processor also checks the limit field in GDTR to ensure that segment selectors do not reference entries beyond the GDT.

(2) Type check uses the S-flag and Type field to ensure that the proper type of segment is used. For example, CS can only be loaded with a code segment. Access through the null descriptor generates a general protection exception.

(3) Privilege check uses privilege levels. The Current Privilege Level (CPL) is the RPL in the CS or SS register of the executing program. CPL can be changed via a far call or jump instruction. The privilege check happens when the segment selector associated with a segment descriptor is loaded. This happens when a program accesses data in another segment or passes control to another code segment. A privilege violation generates a general protection exception.

To access data in another data segment, the selector must be loaded into SS or one of the data segment registers (DS, ES, FS, GS). A selector can only be loaded into CS via instructions like JMP, CALL, RET, IRET, SYSENTER and SYSEXIT.

To access data in another segment, the DPL of the target segment must be numerically the same as or higher than both the CPL and the RPL.

To load the stack segment register, both the DPL of the stack segment and the corresponding RPL must be the same as the CPL.

When transferring control to nonconforming code, the CPL must be equal to the DPL of the destination segment. In other words, the privilege level must be equal on both sides of the fence. In addition, the RPL of the selector for the destination segment must be less than or equal to the CPL. Nonconforming code cannot be accessed by a program with less privilege.

When transferring control to conforming code, the calling code's CPL must be greater than or equal to the DPL of the destination code. RPL is not checked in this case.

(4) Restricted instruction check verifies that the program does not use privileged instructions such as LGDT, LIDT, MOV to a control register, HLT to halt the processor, WRMSR to write a model specific register etc.
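
The privilege rules under check (3) can be written as three small predicates. This is a Python sketch of the comparisons exactly as described above, with privilege levels as integers 0-3 (0 = most privileged); the function names are made up for illustration and real hardware performs further checks not modelled here.

```python
def data_access_allowed(cpl, rpl, dpl):
    """Loading a data segment: DPL must be >= both CPL and RPL."""
    return dpl >= max(cpl, rpl)


def stack_load_allowed(cpl, rpl, dpl):
    """Loading SS: CPL, RPL and DPL must all be equal."""
    return cpl == rpl == dpl


def code_transfer_allowed(cpl, rpl, dpl, conforming):
    """Direct transfer of control to another code segment."""
    if conforming:
        # Conforming code: callable from the same or less privileged code.
        return cpl >= dpl
    # Nonconforming code: privilege must match, and RPL <= CPL.
    return cpl == dpl and rpl <= cpl


if __name__ == "__main__":
    print(data_access_allowed(cpl=3, rpl=3, dpl=0))   # user code, kernel data
    print(code_transfer_allowed(cpl=3, rpl=3, dpl=0, conforming=True))
```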

Write Protection

Bit 16 of CR0 is the WP bit. When WP is set, supervisor mode code cannot write to read-only user pages. This mechanism is used to implement the copy-on-write used by UNIX when creating a process.

PDE and PTE

Both PDE and PTE are 32 bits in length.

The high-order 20 bits (12 to 31) contain the base address of the page table (in a PDE) or of the page (in a PTE). The address is expanded to 32 bits implicitly by appending 12 trailing zeros.

Avail field (bits 9 to 11) is ignored by the processor and is available for the OS to use.

Global (G) flag (bit 8) is ignored in a PDE. In a PTE, it helps to keep frequently accessed pages from being flushed out of the TLB.

Bit 7 in a PDE represents the page size. When clear, 4KB pages are used. In a PTE, the bit is the Page Attribute Table (PAT) flag.

Bit 6 is clear in a PDE. In a PTE, it indicates whether the page is dirty (has been written to).

Accessed (bit 5) indicates whether the page has been accessed recently (either read or written).

PCD (bit 4) is the page cache disabled flag. When set, the page or page table will not be cached.

PWT (bit 3) is the page write through flag. When set, page write through is enabled for this page or page table

U/S (bit 2) indicates if the page has user or supervisor privilege

R/W (bit 1) specifies the protection for this page. Set means R/W and clear means R/O for the page the entry points to

P (bit 0) is the Present bit which indicates if the page or page table is loaded in memory currently (set)
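
The bit positions listed above can be pulled out of a 32-bit PTE with simple masks. The Python sketch below is illustrative only: the table PTE_FLAGS, the function decode_pte() and the sample entry value are assumptions made for this example.

```python
# Flag name -> bit position in a 32-bit PTE (4KB pages).
PTE_FLAGS = {
    "P": 0,      # present
    "R/W": 1,    # read/write
    "U/S": 2,    # user/supervisor
    "PWT": 3,    # page write through
    "PCD": 4,    # page cache disabled
    "A": 5,      # accessed
    "D": 6,      # dirty
    "PAT": 7,    # page attribute table
    "G": 8,      # global
}


def decode_pte(pte):
    """Split a 32-bit PTE into frame address, avail bits and flags."""
    flags = {name: (pte >> bit) & 1 for name, bit in PTE_FLAGS.items()}
    return {
        "frame": pte & 0xFFFFF000,   # bits 12-31, low 12 bits zeroed
        "avail": (pte >> 9) & 0x7,   # bits 9-11
        "flags": flags,
    }


if __name__ == "__main__":
    entry = decode_pte(0x12345067)  # made-up PTE value
    print(hex(entry["frame"]), entry["flags"])
```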

Saturday, July 13, 2013

Protected Mode Paging

Without paging enabled, the linear address (formed by translating a logical address used in a program via the segmentation process) is the physical address.

With paging, the linear address goes through another round of translation to form the final physical address.

The high-order bits (22 to 31) of the linear address index into a page directory. The physical address of the page directory is stored in CR3 (known as the PDBR or page directory base register). As there are 10 bits, the number of entries in the page directory is 1024. Each page directory entry (PDE) contains the physical address of a page table.

Bits 12 to 21 specify a particular entry in the page table. Again, as the field is 10 bits in length, the number of PTEs in each page table is 1024. Each PTE stores the physical address of a page in memory. The remaining bits (0 to 11) are the offset within the 4K page. The total size of the memory space represented by paging using 4K pages is 4G = 1024 x 1024 x 4K.
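
The two-level walk can be sketched in a few lines of Python. This is a toy model under stated assumptions: 4K pages only, no PAE, physical memory represented by a caller-supplied read32() function, and only the Present bit checked. The names split_linear_address(), translate() and the sample table contents are invented for this example.

```python
def split_linear_address(addr):
    """Return (directory index, table index, page offset)."""
    return (addr >> 22) & 0x3FF, (addr >> 12) & 0x3FF, addr & 0xFFF


def translate(linear, cr3, read32):
    """Walk the page directory and page table for a linear address.

    read32(physical_address) must return the 32-bit value stored there.
    """
    dir_index, table_index, offset = split_linear_address(linear)
    pde = read32((cr3 & 0xFFFFF000) + dir_index * 4)
    if not pde & 1:
        raise LookupError("page fault: page table not present")
    pte = read32((pde & 0xFFFFF000) + table_index * 4)
    if not pte & 1:
        raise LookupError("page fault: page not present")
    return (pte & 0xFFFFF000) | offset


if __name__ == "__main__":
    # Fake physical memory: directory at 0x1000, one page table at 0x2000.
    memory = {
        0x1000 + 1 * 4: 0x00002003,  # PDE 1 -> page table at 0x2000
        0x2000 + 2 * 4: 0x00005003,  # PTE 2 -> page at 0x5000
    }
    print(hex(translate(0x00402345, 0x1000, memory.__getitem__)))
```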

If Physical Address Extension (PAE) is enabled, the physical address space expands to 64GB. PAE adds another data structure to the address translation process. PAE was introduced in the Pentium Pro and increased the number of address lines of the processor from 32 to 36.

Protected Mode Segmentation

In real mode, a segment register contains a segment selector from which the base address of the segment is derived (the 16-bit value shifted left by 4 bits).

In protected mode, the segment selector points to a specific entry in a table.

There are 2 types of descriptor table - GDT and LDT. There is only one GDT, shared by all tasks in the entire system. An LDT can be used by one task or a group of tasks. The GDT is located using a special register, GDTR, which is manipulated by privileged instructions that only the OS can execute.

Each segment register is paired with an invisible part called the segment cache register, which holds the contents of the corresponding 8-byte table entry (called a descriptor) in the GDT or LDT.

The selector is 16 bits in length. The highest 13 bits specify an entry in the descriptor table. In other words, there can be up to 8K entries in a descriptor table. The next bit indicates whether the table is the GDT (Global Descriptor Table) or an LDT. The last 2 bits indicate the privilege level of the selector (RPL).
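
A selector can be taken apart with two masks and a shift. In the Python sketch below, the table indicator bit follows the usual x86 convention (0 = GDT, 1 = LDT); the function name split_selector() and the sample values are chosen only for illustration.

```python
def split_selector(selector):
    """Split a 16-bit segment selector into index, table and RPL."""
    selector &= 0xFFFF
    return {
        "index": selector >> 3,                       # bits 3-15
        "table": "LDT" if selector & 0x4 else "GDT",  # bit 2
        "rpl": selector & 0x3,                        # bits 0-1
    }


if __name__ == "__main__":
    # 0x1B = index 3 in the GDT with RPL 3.
    print(split_selector(0x1B))
```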

The descriptor is 64-bits in length and contains
1. base address of the segment (32-bits)
2. size of the segment (20 bits). The G-flag (1 bit) is used to interpret the size (clear means the size is in bytes, from 1 byte to 1M; set means the size is in 4K increments, from 4K to 4G)
3. S-flag indicates if it is a system segment (clear) or an application segment (set). System segment descriptors are used to jump to segments that have higher privilege than the currently executing task
4. Type (4 bits) used with the S-flag to further define the segment. If the descriptor is for an application segment, bit 11 defines if it is code (set) or data (clear). For a data segment, bits 10/9/8 represent the direction of growth (clear = up and set = down), RO/RW and whether it was recently accessed, respectively. For a code segment, the last 3 bits represent whether the code is conforming (set) or not, Execute-only or Execute/Read, and whether it was recently accessed. A non-conforming code segment cannot be accessed by a program that is executing with less privilege (higher CPL value). In other words, RPL <= CPL and CPL = DPL
5. DPL (descriptor privilege level)
6. P-flag defines if the segment is currently in memory (set)
7. AVL defines if the segment is available for OS use
8. L-flag defines if the segment contains 64-bit code. Most IA-32 processors clear this bit.
9. D/B flag has a different meaning depending on whether the segment is code, data or stack.
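
The fields above are scattered across the 8 bytes of a descriptor, so a decoder has to stitch the base and limit back together. The Python sketch below treats the descriptor as a 64-bit integer and uses the standard IA-32 bit positions (type in bits 40-43, S in bit 44, DPL in bits 45-46, P in bit 47, G in bit 55). The function name decode_descriptor() is made up, and the sample value is the commonly seen flat ring 0 code descriptor.

```python
def decode_descriptor(desc):
    """Decode a 64-bit segment descriptor given as an integer."""
    limit = (desc & 0xFFFF) | (((desc >> 48) & 0xF) << 16)
    base = ((desc >> 16) & 0xFFFFFF) | (((desc >> 56) & 0xFF) << 24)
    g = (desc >> 55) & 1
    return {
        "base": base,
        "limit": limit,
        "type": (desc >> 40) & 0xF,
        "s": (desc >> 44) & 1,
        "dpl": (desc >> 45) & 0x3,
        "p": (desc >> 47) & 1,
        "avl": (desc >> 52) & 1,
        "l": (desc >> 53) & 1,
        "d/b": (desc >> 54) & 1,
        "g": g,
        # G set: limit counts 4K units; G clear: limit counts bytes.
        "size_bytes": (limit + 1) * (4096 if g else 1),
    }


if __name__ == "__main__":
    flat_code = decode_descriptor(0x00CF9A000000FFFF)
    print(flat_code)
```

With the G-flag set and a limit of 0xFFFFF, the decoded size comes out as 4G, which is the flat memory model case.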

The first descriptor entry is always empty and is called the null segment descriptor. A selector pointing to this entry is called a null selector.

There are other types of descriptor in GDT:
1. Task State Segment (TSS)
2. Local Descriptor Table (LDT)
3. Code, data or stack memory segments to be accessible by multiple tasks
4. Procedure call gates used to control access to privileged programs (e.g. IO routines) by less privileged ones (user)
5. Task gates used to switch to other tasks

The LDT (Local Descriptor Table) is accessed via the GDT (see 2).

The 16-bit segment selector for the LDT is stored in the TSS so that it can be loaded at task switching. The TSS is a memory area that keeps the context of a task when it is switched out. It contains the general registers, the segment registers, the LDT selector field, EFLAGS, EIP, ESP, CR3 (Page Directory Address) etc. When a user program (privilege level 3) calls into a more privileged program (level 0 to 2), the processor automatically switches to a new stack. Therefore, the TSS also keeps 3 additional stack pointers (ESP) to record the stack top for each of these levels.

Thursday, July 11, 2013

Real Mode Segmentation

The real mode environment is based on the 8086/88 processors. There are 6 segment registers, 4 general purpose registers, 3 pointer registers, 2 index registers and a flag register. All registers are 16-bit.

The first 4 segment registers (CS, DS, SS and ES) store segment selectors, which form the first half of a logical address. FS and GS came after the 8086/88.

CS stores the base address of the current executing code segment
DS stores the base address of segment storing global data
SS stores the base address of the stack segment
ES stores the base address of segment for string data
FS and GS store the base addresses of 2 more segments for global data
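
A segment register only supplies half of a logical address; the other half is a 16-bit offset. On the 8086 the physical address is the segment value shifted left by 4 bits plus the offset, wrapped to 20 bits. The Python sketch below shows that arithmetic; the function name real_mode_physical() is invented for this example, and the wrap-around models the 8086 only (later processors can carry into the 21st address line).

```python
def real_mode_physical(segment, offset):
    """Translate a real mode segment:offset pair to a physical address."""
    linear = ((segment & 0xFFFF) << 4) + (offset & 0xFFFF)
    return linear & 0xFFFFF  # the 8086 has 20 address lines


if __name__ == "__main__":
    # B800:0000 is the start of colour text mode video memory.
    print(hex(real_mode_physical(0xB800, 0x0000)))
    print(hex(real_mode_physical(0x1234, 0x5678)))
```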

The 3 pointer registers are IP (instruction pointer), SP (stack pointer) and BP (base pointer, used to build stack frames for function calls)

The 4 GPR are

AX = accumulator used for arithmetic functions
BX = base register used as index to address memory indirectly
CX = counter often used in loop
DX = data register used with AX

The 2 index registers are

SI = points to address of source in string operation
DI = points to address of destination in string operation

Real mode uses segmentation to manage memory. A jump operation needs to differentiate whether the jump is within a segment (NEAR) or across segments (FAR). Several instructions result in a jump. SHORT and NEAR jumps are relocatable, which means that they do not depend on a specific address in the binary encoding.

INT and IRET are intrinsically far jumps as both of them involve segment selectors.

JMP and CALL can be near or far depending on how they are invoked.

JMP SHORT label is a SHORT jump
JMP FAR PTR label is a FAR direct jump
JMP DX is a NEAR indirect jump
JMP DS:[label] is a FAR direct jump
JMP DWORD PTR [BX] is a FAR indirect jump

CALL label is a NEAR jump
CALL BX is a NEAR indirect jump
CALL DS:[label] is a FAR direct jump
CALL DWORD PTR [BX] is a FAR indirect jump
RET is a NEAR return
RETF is a FAR return

Saturday, June 15, 2013

Physical Memory

A physical address is used to access a byte in RAM. The address is placed onto the address lines of the processor. The number of address lines defines the physical address space.

8086 and 8088 sport 20 address lines (1MB)
286 has 24 address lines (16MB)
386 DX, 486 DX and Pentium have 32 address lines (4GB)
Pentium Pro has 4 more address lines (36 in total), a facility known as Physical Address Extension (PAE), and can address up to 64GB
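
The sizes in the list follow from the rule that n address lines can select 2^n bytes. A quick Python check of that arithmetic is shown below; the helper names address_space_bytes() and format_size() are made up for this example.

```python
def address_space_bytes(address_lines):
    """Number of bytes addressable with the given number of address lines."""
    return 1 << address_lines


def format_size(n):
    """Render a byte count using binary units (1 KB = 1024 bytes)."""
    for unit in ("B", "KB", "MB", "GB", "TB"):
        if n < 1024 or unit == "TB":
            return f"{n} {unit}"
        n //= 1024


if __name__ == "__main__":
    for cpu, lines in (("8086/8088", 20), ("286", 24),
                       ("386 DX to Pentium", 32), ("Pentium Pro", 36)):
        print(cpu, lines, format_size(address_space_bytes(lines)))
```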

Saturday, May 4, 2013

Mixer

There are generally 2 types of mixer - for sound reinforcement and for recording. A recording mixer has an additional tape return stage for the signal to pass through before going to the monitor.

The console is basically a collection of channels. Each channel has a certain arrangement of processes that lead down to the fader or volume knob:

(1) Gain (a.k.a trim or attenuation) adjusts the incoming signal to the console

(2) Auxiliary is used to route signal to an external device (to modify or add to the signal) and route it back to the desk to continue down the channel.

Effects:
Delay - a repeat of a single source signal. The time between the original sound and the delay can vary from a minimum of 10-20 msec to a maximum of 240-250 msec. Delay can improve a vocal line to create an illusion of a very large room.

Echo - what you hear in very large spaces with reflective walls. It is a decaying repeat of an entire sound signal.

Reverb - a series of echoes that overlap. The overlapping of signals creates a continuous sound that eventually decays to zero. Reverb is often used to sweeten up a dialogue track or to create the impression that the sound is in an acoustically reverberant room. Concert halls usually have a decay of about 1.5 to 2.5 seconds. Smaller halls feature reverb times of 1 to 1.5 seconds.

Chorus - takes the original signal and slightly delays it. All of this is heard at the same time, giving a very solid thick sound. This is like simulating a group of performer singing together, slightly out of sync.

Pitch shifting/bending - increase or decrease the frequency

(3) Equalization is basically divided into low, medium and high frequencies, covering the audible range of human hearing.

(4) Fader is the last stage. Once the signal has completed its path to the bottom of the channel, it reaches the fader or volume knob, where it is adjusted once again. The signal is then usually bussed to the submix section of the console. This is another group of faders where the signals from all active channels are collected. These are usually called busses.

Monday, April 29, 2013

Microphone Characteristics

Dynamic range is the range of sound intensity a mic can provide to the recording device. A small dynamic range means a limited range of amplitude levels relative to the noise floor. For an empty concert hall, the noise floor is around 50 dB SPL. The noise floor is the point at which the softest sound can be registered as a usable signal. Any sound below the noise floor cannot be heard.

Frequency response measures how the mic translates SPL (Sound Pressure Level) into an audio signal at different frequencies. An ideal frequency response is flat, meaning the mic captures sounds of different frequencies at equal amplitude levels. Some mics are designed to favor certain frequencies, depending on their intended use.

An omnidirectional mic responds to sound pressure from all angles. Condenser mics are typically omnidirectional. A directional mic responds to sound pressure from a particular angle. Cardioid is the most common response pattern, named for its heart shape. Both dynamic and condenser mics exhibit this pattern. Hypercardioid is more directional and is also called mini-shotgun. It is used when the mic needs to be kept at a distance from the source. Supercardioid, or shotgun, is highly directional.

Sunday, April 28, 2013

Microphones

Any device that converts one form of energy into another is called a transducer. A microphone is a transducer, and so is a loudspeaker.

According to the theory of electromagnetic induction, a metal conductor moving in the flux field of a magnet has a current of a certain direction and magnitude induced in it.

The most commonly used microphone is the dynamic microphone. Dynamic mics are extremely durable and less expensive, and are commonly used in live performances and concerts. The mic is built around a diaphragm connected to a coil of metal floating in the flux field of a magnet. When the diaphragm vibrates with the sound pressure, the coil moves, which sends an electrical current through the coil to the output line.

A dynamic mic contains rather heavy magnets, which makes it durable. However, the weight of the components also limits its frequency response. High frequencies require the diaphragm to move very quickly, but the heavy components respond more slowly, thus attenuating the higher frequencies.

A condenser mic, on the other hand, is not based on magnetism, yet it can still generate a voltage. The voltage, however, has no power behind it. The design is based on the movement of electrons and the open-air capacitor. Behind the diaphragm there is a conductive back plate separated from it by a small pocket of air. This forms a capacitor. A current is sent through the plate. When the diaphragm vibrates, closing and opening the gap between the diaphragm and the plate, it varies the amount of current flowing, thus generating a signal. A condenser mic requires an external power source, known as phantom power (48 volts). The power can also be supplied by a battery. A condenser mic is delicate and can be damaged if dropped.

Ribbon mics are the least used, except in radio broadcast. They use the same principle as the dynamic mic: a thin ribbon of corrugated aluminium sits between 2 strong magnets. The ribbon generates a current, but typically not a strong enough one. Instead of using phantom power, a ribbon mic contains a built-in transformer to boost the level. It is like a mic with a built-in signal booster (pre-amp). Ribbon mics are famous for a warm sound, which suits voice well. The microphone is fragile and heavy. It is also very expensive.

Wave

Sound moves in the form of a longitudinal wave. Analysis of this waveform is complex. A simpler visualization is to use a transverse wave. Throwing a rock into water creates ripples. Looked at from above, the ripples propagate outwards in the form of a longitudinal wave. Looked at from the side (like a cross section), we see the transverse waveform. The highest point of a transverse wave represents the greatest compression, while the lowest point represents the rarefaction. The midpoint, where the molecule is not vibrating, is called the standard reference level.

A sinusoidal wave represents simple harmonic motion (SHM). A sine waveform is the simplest and most economical form of vibration of a mass because it contains only a single frequency and has no harmonic content. A repeating wave that is not a pure sine is called a complex periodic waveform. A waveform without a pattern is called a random waveform.

Frequency = speed of sound/wave length

The range of frequencies humans can hear is 20Hz to 20kHz. Sound below the lower limit of hearing is called subsonic, whereas sound above the upper limit is called ultrasonic. Cats can hear between 45Hz and 85kHz. Bats and dolphins can hear up to 120kHz.

Music occupies about 1/4 of the range of hearing. The fundamental tone in music is that which you hear most prominently when an instrument is played. It occupies about 50% of the total sound heard. Some examples of the frequency ranges of musical instruments:

violin = 200Hz to 3.5kHz
viola = 124Hz to 1kHz
Cello = 63Hz to 630Hz
Double Bass = 40Hz to 200Hz
Guitar = 80Hz to 630Hz
Piano = 28Hz to 4.1kHz

When hearing a periodic wave, we are actually hearing a complex averaging of the waveform's peak-to-peak values. The root mean square (RMS) is the average level of a waveform over time. For a sine wave, RMS = 0.707 * peak value.
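
The 0.707 factor follows from the definition of RMS. A short derivation for a sine wave, where the symbols A (peak value) and T (period) are introduced here only for illustration:

```latex
x_{\mathrm{RMS}}
  = \sqrt{\frac{1}{T}\int_0^T A^2 \sin^2\!\left(\frac{2\pi t}{T}\right)\,dt}
  = \sqrt{\frac{A^2}{2}}
  = \frac{A}{\sqrt{2}} \approx 0.707\,A
```

The middle step uses the fact that sin^2 averages to 1/2 over a full period.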

Unlike frequency, amplitude cannot be measured without a reference value. Decibel is a logarithmic unit representing a ratio. Intensity level of a sound is measured as the energy transmitted per unit time and area of a sound wave. The greater the amplitude of a vibration, the greater the energy transmitted.

I = P/S, where P = power (energy per unit time) and S = area covered

The loudest sound one can hear is about 1 W/m^2, which carries a trillion times more energy than the softest sound (1 x 10^-12 W/m^2). These values are very awkward to use, so decibels are used instead. Another reason is that we hear sound intensity logarithmically.

A decibel is one tenth of a Bel (named after Alexander Graham Bell). A Bel is a ratio of 10 to 1 between 2 quantities, so the energy at 2 Bel is 10 times the energy at 1 Bel. The standard reference for hearing is 0 dB SPL (Sound Pressure Level). 10 dB SPL carries 10 times the intensity of 0 dB SPL, and 20 dB carries 100 times the intensity of 0 dB.
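
As a rough worked example, the intensity level in decibels is ten times the base-10 logarithm of the ratio to a reference intensity. Assuming the softest audible intensity given earlier is taken as the reference I_0 (the symbols L, I and I_0 are introduced here for illustration), the loudest sound works out to 120 dB:

```latex
L = 10 \log_{10}\!\left(\frac{I}{I_0}\right)\ \mathrm{dB},
\qquad I_0 = 10^{-12}\ \mathrm{W/m^2}

L_{\text{loudest}} = 10 \log_{10}\!\left(\frac{1}{10^{-12}}\right)
                   = 10 \times 12 = 120\ \mathrm{dB}
```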

When we walk away from a sound, the loudness decreases following the inverse square law.

When 2 sounds of different frequencies (e.g. 100 Hz and 105 Hz) are produced at the same time, they produce a pulsation effect called beats. The number of beats per second = f1 - f2. When the difference between the 2 frequencies is greater than about 30 Hz to 40 Hz, the beat phenomenon ceases to exist. In its place we hear the 2 frequencies sounding simultaneously, known as an interval in music.

Sound

Sound is an aural perception of vibration. There are 2 types of sound. Noise is sound that is not organized or harmonized. Music is organized and intentional.

A sound is produced when an object is set in motion and mechanical energy is converted into acoustic energy. The acoustic energy takes the form of pressure waves in the medium (e.g. surrounding air). The disturbances in the air are known as compressions and rarefactions. These compressions and rarefactions occur around the source and move away in all directions. As a result, the wave propagates outwards. The air molecules do not move with the wave, they are just displaced around their current locations. This form of acoustical energy transmission is represented by a longitudinal waveform.

The scientific study of sound perception is called psychoacoustics. It is not concerned with how sounds produce a particular emotional or cognitive response, which is in the area of psychology. Psychological perception of sound falls into 2 categories - pitch and loudness. These correspond to 2 properties of sound - frequency and amplitude. Frequency measures the rate of repetition and amplitude measures the strength of the air pressure produced.

The psychological measurement of the magnitude of a sound includes its frequency, pressure, harmonics, duration and the surface properties within the sound space.

Saturday, April 13, 2013

Signals

SIGABRT - sent by abort() to its calling process. The process terminates and generates a core dump. assert() calls abort() when the condition fails.

SIGALRM - sent to the calling process by alarm() and setitimer() when the period has elapsed.

SIGBUS - raised by the kernel when the process incurs a hardware fault other than memory protection, usually an irrecoverable error such as unaligned memory access.

SIGCHLD - sent to the parent process when a child process ends. The parent process typically responds by issuing a wait().

SIGCONT - sent to a process when it is resumed after being stopped. Usually caught by terminals or editors that need to refresh the screen.

SIGFPE - covers not just floating point exceptions but all arithmetic exceptions

SIGHUP - sent by the kernel to the session leader when the terminal disconnects. The kernel also sends it to all foreground processes when the session leader terminates. The default action is to terminate. This signal means the user has logged out. Daemons overload this signal as an instruction to reload their configuration. As a daemon has no controlling terminal, it should never receive this signal from other sources.

SIGILL - sent when a process executes an illegal instruction. A process can catch this signal but the behaviour afterwards is undefined.

SIGINT - sent to all foreground processes when the user presses the interrupt key (Ctrl-C). This allows the processes to clean up before terminating.

SIGIO - BSD style asynchronous I/O event

SIGKILL - sent by the kill() system call. It cannot be caught or ignored.

SIGPIPE - raised by the kernel if a process writes to a pipe whose reader has terminated.

SIGPROF - raised by setitimer() with the ITIMER_PROF flag when the profile timer expires.

SIGPWR - system dependent. A UPS monitoring process sends this signal to init when the battery level is low to allow the system to shut down in an orderly way.

SIGQUIT - sent to all foreground processes when the user presses the quit key (Ctrl-\)

SIGSEGV - sent when a process accesses an invalid memory address (segmentation violation)

SIGSTOP - sent by kill() system call. This cannot be caught or ignored. The process is unconditionally stopped.

SIGSYS - process executes an illegal system call. For example, code compiled with newer version of OS runs on an older version.

SIGTERM - sent by kill(). A process can catch it to initiate an orderly termination.

SIGTRAP - sent when a process crosses a breakpoint. Generally caught by debuggers and ignored by most other processes.

SIGTSTP - sent by the kernel to the foreground processes when the user presses the suspend key (Ctrl-Z)

SIGTTIN/SIGTTOU - sent to a background process when it attempts to read from/write to its controlling terminal.

SIGURG - sent by the kernel to a process when out-of-band data arrives at a socket

SIGUSR1/2 - used solely by user processes. A common use is to instruct a daemon to change behaviour

SIGVTALRM - raised by setitimer() when timer created with ITIMER_VIRTUAL flag expires

SIGWINCH - sent by kernel to all foreground processes when the terminal window size changes

SIGXCPU/SIGXFSZ - raised by the kernel when the CPU time limit or the file size limit is exceeded.

Signal Handling

Ignore

No action is taken. Two signals cannot be ignored, SIGKILL and SIGSTOP, so that the system administrator (SA) can always kill or stop any process. Otherwise, there could be processes that are unstoppable.

Catch and handle

The kernel suspends execution of the process's current code path and jumps to the registered signal handler. Execution continues from where it left off once the handler returns. SIGINT and SIGTERM are 2 commonly caught signals. SIGINT allows a shell process to return to the prompt. SIGTERM allows a process to clean up for an orderly termination.

Perform default action

Taking the default action usually means terminating the process

Anonymous Memory Mapping

Large memory allocation requests are not satisfied from the heap. The kernel allocates an anonymous memory mapping for this type of request. An anonymous memory mapping is like a file-based memory mapping except that it is not backed by any file, thus the name. It is just a large zero-filled memory area (a multiple of the page size) ready for use.

Anonymous memory mapping uses the mmap call with the special flag MAP_ANONYMOUS. The fd parameter is ignored. On BSD systems without the flag, anonymous memory mapping is implemented by mapping /dev/zero with copy-on-write pages.

brk

Older UNIX systems had the stack and heap in the same data segment. The heap grew upward from the bottom of the segment and the stack grew downward from the top. The line demarcating the two was called the break or break point. In modern UNIX, where the data segment is its own memory mapping, the end address of the mapping continues to be called the break.

A call to brk sets the end address of the data segment. sbrk increments the end of the data segment by an amount which can be positive or negative.

Device Node

Device nodes are special files that allow interaction with device drivers. The kernel hands the I/O calls (e.g. read) over to the driver instead of to a file. The driver handles the request and returns the result to the caller. This abstraction allows the user to interact with drivers through familiar I/O calls.

Each device node is assigned a major number and a minor number. The major and minor numbers identify the device driver loaded in memory. If the numbers cannot be matched to a driver, the system returns ENODEV as the device cannot be found.

Special device nodes are:
/dev/null (1,3) - read returns EOF, writes are discarded
/dev/zero (1,5) - read returns \0 bytes, writes are discarded
/dev/full (1,7) - read returns \0 bytes, write returns ENOSPC indicating the device is full
/dev/random (1,8) - random number generator. An entropy pool is built by hashing noise collected from drivers and other sources. Reads return data from the entropy pool. The result is suitable for seeding processes like key generation as it is cryptographically strong. The kernel monitors the amount of entropy in the pool. If it reaches zero, read will block. This scenario could happen on a diskless station which has little or no I/O activity.
/dev/urandom (1,9) - a lower grade version of /dev/random. Read will succeed even if the entropy pool is depleted.
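
Because device nodes are driven through the ordinary file calls, the behaviour listed above can be observed with plain open/read/write. A small sketch, assuming a Linux system where these nodes exist (the helper names read_from_node and write_to_node are invented here):

```c
#include <fcntl.h>
#include <unistd.h>

/* Read up to len bytes from the device node at path.
   Returns what read() returned, or -1 if the node cannot be opened. */
ssize_t read_from_node(const char *path, void *buf, size_t len) {
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    ssize_t n = read(fd, buf, len);
    close(fd);
    return n;
}

/* Write len bytes to the device node at path.
   Returns what write() returned, or -1 if the node cannot be opened. */
ssize_t write_to_node(const char *path, const void *buf, size_t len) {
    int fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;
    ssize_t n = write(fd, buf, len);
    close(fd);
    return n;
}
```

Reading /dev/null this way returns 0 (EOF) at once, while reading /dev/zero fills the buffer with zero bytes.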

Normal I/O call cannot represent all functions of device e.g. set baud rate. ioctl (I/O control) is used for such out of band communication with the device.

int ioctl (int fd, int request, ...)

The request is a code known to kernel representing the command to the driver.

Saturday, April 6, 2013

Standard I/O Locking

stdio is inherently thread-safe. Each open stream is associated with a lock, a lock count and an owning thread. A thread must acquire the lock and become the owning thread before issuing any I/O call.

Still, a thread may need to lock the file so that multiple I/O calls complete as one unit. flockfile() waits until the stream is no longer locked, then acquires the lock, increases the lock count and makes the caller the owning thread.

funlockfile() releases the lock after the I/O calls are finished.

ftrylockfile() is a non-blocking version of flockfile().

Using these calls, the programmer controls the locking and can then work with a set of standard library I/O calls which do not check for locks and thus perform better (e.g. fgetc_unlocked, fgets_unlocked, fwrite_unlocked).
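
A sketch of the pattern: take the stream lock once, issue several unlocked calls, then release it. The function name put_pair and the key=value format are made up for illustration. putc_unlocked is used here because it is the POSIX member of the unlocked family.

```c
#include <stdio.h>

/* Write "key=value\n" as one unit. No other thread's output can land
   between the parts because the stream lock is held throughout. */
void put_pair(FILE *fp, const char *key, const char *value) {
    flockfile(fp);                       /* become the owning thread */
    for (const char *p = key; *p; p++)
        putc_unlocked(*p, fp);           /* no per-call locking */
    putc_unlocked('=', fp);
    for (const char *p = value; *p; p++)
        putc_unlocked(*p, fp);
    putc_unlocked('\n', fp);
    funlockfile(fp);                     /* release for other threads */
}
```

Calling put_pair(stdout, "mode", "fast") from several threads should never produce interleaved fragments of different records.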

Standard I/O Buffering

Standard I/O implements 3 types of user buffering for different situations. They are set by the setvbuf call:

(1) unbuffered (_IONBF) - no user buffering. Data is submitted directly to the kernel. This option is seldom used. An example is stderr.

(2) line-buffered (_IOLBF) - buffering is performed on a per-line basis. Data is submitted to the kernel when \n is reached. This type is suitable for line-oriented streams like the terminal (stdout)

(3) block-buffered (_IOFBF) - ideal for file. Standard I/O uses the term full buffering.
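
The effect of user buffering can be observed by comparing what has been written through stdio with what the kernel has actually received. A small sketch using a temporary file (the helper names kernel_size and full_buffering_defers_write are invented here):

```c
#include <stdio.h>
#include <sys/stat.h>

/* Size of the underlying file as the kernel sees it, or -1 on error. */
static long kernel_size(FILE *fp) {
    struct stat st;
    if (fstat(fileno(fp), &st) != 0)
        return -1;
    return (long)st.st_size;
}

/* Returns 1 if five bytes written to a fully buffered stream stay in the
   user buffer until fflush(), 0 if not, -1 on setup failure. */
int full_buffering_defers_write(void) {
    static char buf[BUFSIZ];             /* must outlive the stream */
    FILE *fp = tmpfile();
    if (!fp)
        return -1;
    /* setvbuf must be called before any other operation on the stream */
    if (setvbuf(fp, buf, _IOFBF, sizeof buf) != 0) {
        fclose(fp);
        return -1;
    }
    fputs("hello", fp);
    long before = kernel_size(fp);       /* data still in user space */
    fflush(fp);
    long after = kernel_size(fp);        /* data handed to the kernel */
    fclose(fp);
    return before == 0 && after == 5;
}
```

With _IONBF in place of _IOFBF the first measurement would already show the five bytes, since nothing is held back in user space.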

C Standard I/O Library

This refers to buffering in user space performed by the application or the standard library. The C language does not provide any advanced I/O function. Instead, the standard C library (stdio) provides a platform-independent user buffering solution. As buffering is maintained in user space rather than kernel space, there is a performance improvement. Standard I/O calls are not system calls.

The standard I/O routines use a file pointer instead of a file descriptor. Inside the C library, the file pointer is mapped to a file descriptor. A file pointer points to the FILE typedef.

e.g. FILE * fopen(const char *path, const char *mode)

Mode includes
r = read
w = write
a = append
r+ = read and write, position at the start of file
w+ = read and write, truncate the file to size 0, position at start of file
a+ = read and write, create file if does not exist, position at end of file

Other stdio routines include

fdopen - open using fd

fgetc/fputc - read/write a character from stream

ungetc - put a read character back onto the stream. If multiple characters are pushed back, they are read in reverse order. In other words, the last character pushed back will be returned first. POSIX guarantees only 1 push back. If a seek is performed before the read, all pushed back characters are lost.

fgets/fputs - read/write a string. For read, a \0 character is placed at the end of the buffer. Reading stops when EOF or a newline character is reached. The newline \n is stored in the provided buffer

fread/fwrite - read/write specified number of elements (structures) from file. This is reading the file as binary data.

fseek - seek to a particular position in the file
fsetpos - similar to fseek. This function is provided mainly for non-UNIX platforms which have a complex type representing the stream position.

rewind - reset the stream position to start of file

ftell - return the current stream position
fgetpos - pair with fsetpos above

fflush - write data from the buffer to kernel space. There is no guarantee that the data are flushed to disk. Issue fsync() after the flush to ensure the data are written to disk.

fileno - obtain fd of a stream
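
As one example of these routines working together, fgetc and ungetc can be combined to look at the next character without consuming it. A minimal sketch (the name peek_char is invented here):

```c
#include <stdio.h>

/* Return the next character of the stream without consuming it,
   or EOF if nothing is left. */
int peek_char(FILE *fp) {
    int c = fgetc(fp);
    if (c != EOF)
        ungetc(c, fp);   /* a single push-back is always guaranteed */
    return c;
}
```

A parser might call peek_char(fp) to decide which routine should read the next token, leaving the stream position unchanged for that routine.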

Page Cache

The page cache exploits temporal locality, meaning that something accessed recently is highly likely to be accessed again. When free pages run out, the least used pages are pruned from the cache. Sometimes it is more effective to swap out a chunk of seldom used data instead of pruning the cache. The heuristic that balances swapping against cache pruning is controlled via /proc/sys/vm/swappiness.

Write back of dirty pages to disk is carried out by a group of kernel threads called pdflush. They are woken up when the number of free pages falls below a threshold or the age of dirty pages reaches a threshold. Multiple pdflush threads are instantiated concurrently to take advantage of multiple processors and also for congestion avoidance, which prevents writes from being backed up while waiting on a single device.

Multiplex I/O

Multiplex I/O allows application to block on multiple file descriptors and be notified when any one of them is ready for read or write.

The select() system call implements synchronous multiplex I/O. The call is passed 3 sets of watched file descriptors: one for read, one for write and one for exceptions. A set is ignored if NULL is passed for it.

On return, the sets are modified to contain only the fds which are ready for I/O. For example, if fds x and y are placed in readfds when calling select() and only x is still in readfds on return, then x is ready for reading without blocking and y is not.

select() accepts a parameter giving the amount of time to block before returning even if no fd is ready for I/O.

The watched fd sets are manipulated by the macros FD_SET and FD_CLR. FD_ISSET tests whether a particular fd is in a set and is used after the select() call.

Because select() has historically been more widely available across UNIX systems than other mechanisms for subsecond-resolution sleeping, it is used as a portable way to sleep by providing a non-NULL timeout but NULL for all 3 watched sets.
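
A sketch of both uses of select(): watching one fd for readability with a timeout, and sleeping with no fds at all. The function names (wait_readable, sleep_ms, select_demo) and the 50 ms timeout are assumptions made for illustration.

```c
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

/* Wait up to timeout_ms for fd to become readable.
   Returns 1 if readable, 0 on timeout, -1 on error. */
int wait_readable(int fd, int timeout_ms) {
    fd_set readfds;
    struct timeval tv;

    FD_ZERO(&readfds);
    FD_SET(fd, &readfds);
    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;

    /* first argument is the highest watched fd plus one */
    int ready = select(fd + 1, &readfds, NULL, NULL, &tv);
    if (ready < 0)
        return -1;
    return FD_ISSET(fd, &readfds) ? 1 : 0;
}

/* Portable sub-second sleep: no fds watched, only a timeout. */
int sleep_ms(int ms) {
    struct timeval tv;
    tv.tv_sec = ms / 1000;
    tv.tv_usec = (ms % 1000) * 1000;
    return select(0, NULL, NULL, NULL, &tv);
}

/* Usage sketch: a fresh pipe is not readable until something is written.
   Returns 1 if select() behaved as described, 0 otherwise. */
int select_demo(void) {
    int p[2];
    if (pipe(p) != 0)
        return 0;
    int before = wait_readable(p[0], 50);       /* nothing to read yet */
    int wrote = (write(p[1], "x", 1) == 1);
    int after = wait_readable(p[0], 50);        /* one byte is waiting */
    close(p[0]);
    close(p[1]);
    return wrote && before == 0 && after == 1;
}
```

The fd set and the timeval are rebuilt on every call because select() may modify both.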

The pselect() system call is POSIX's own counterpart to select(), which was first introduced in 4.2BSD. There are 3 differences between pselect and select:

(1) pselect() uses the timespec structure instead of the timeval structure for its timeout parameter. timespec uses seconds and nanoseconds rather than seconds and microseconds. In practice, neither call provides reliable nanosecond resolution.

(2) A call to pselect() does not modify the timeout parameter, so it does not need to be reinitialized before each call the way timeval does.

(3) pselect() has an additional parameter, sigmask. pselect() with NULL as sigmask behaves like select(). This parameter is intended to solve a race condition between waiting on fds and waiting for signals.

poll() is the System V solution, which fixes a few deficiencies of select(). The call takes an array of pollfd structures, each describing a file descriptor and a bitmask of events to watch for. The revents field in the structure returns the events that have fired (e.g. POLLIN for data to read, POLLPRI for urgent data to read, POLLOUT for writing, POLLWRBAND for writing priority data, POLLMSG when a SIGPOLL message is available)

POLLIN | POLLPRI is equivalent to the select() read event. POLLOUT | POLLWRBAND is equivalent to the select() write event. POLLIN is equivalent to POLLRDNORM | POLLRDBAND. Linux provides a ppoll interface similar to pselect, but ppoll is not a POSIX standard.

Comparing poll to select:
(1) poll does not need the programmer to calculate and pass in the highest-numbered fd plus one
(2) poll is more efficient when monitoring a long list of fd because it passes in individual fd structure rather than a possibly sparse bitmask.
(3) select is more portable as it is available in most systems
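
For comparison, the same readability check written with poll(). As before, the function names (poll_readable, poll_demo) and the 50 ms timeout are invented for illustration.

```c
#include <poll.h>
#include <unistd.h>

/* Wait up to timeout_ms for fd to become readable.
   Returns 1 if readable, 0 on timeout, -1 on error. */
int poll_readable(int fd, int timeout_ms) {
    struct pollfd pfd;
    pfd.fd = fd;
    pfd.events = POLLIN;      /* events we ask about */
    pfd.revents = 0;          /* filled in by the kernel */

    int ready = poll(&pfd, 1, timeout_ms);
    if (ready < 0)
        return -1;
    return (pfd.revents & POLLIN) ? 1 : 0;
}

/* Usage sketch: a fresh pipe is not readable until something is written.
   Returns 1 if poll() behaved as described, 0 otherwise. */
int poll_demo(void) {
    int p[2];
    if (pipe(p) != 0)
        return 0;
    int before = poll_readable(p[0], 50);       /* nothing to read yet */
    int wrote = (write(p[1], "x", 1) == 1);
    int after = poll_readable(p[0], 50);        /* one byte is waiting */
    close(p[0]);
    close(p[1]);
    return wrote && before == 0 && after == 1;
}
```

Requested events and returned events live in separate fields, so the same pollfd array can be reused across calls without being rebuilt.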

Seeking in File

lseek() is used to set the filepos of a file. Seeking past the end of the file is allowed. A read there returns EOF, and a write causes data to be written at that position. The range between the old EOF and the filepos is filled with zeros logically but not physically. In other words, the space the file actually occupies on disk is smaller than its recorded size. Performance is enhanced as the hole does not initiate any real I/O. Such a file is called a sparse file.
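
A sketch of creating a hole: seek past the end of an empty temporary file, write a single byte, and look at the recorded size. The function name sparse_file_size is invented here. Whether the hole really stays unallocated depends on the filesystem.

```c
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Seek `hole` bytes into an empty temporary file, write one byte, and
   return the logical file size (or -1 on error). */
off_t sparse_file_size(off_t hole) {
    FILE *fp = tmpfile();
    if (!fp)
        return -1;
    int fd = fileno(fp);
    struct stat st;
    off_t size = -1;

    if (lseek(fd, hole, SEEK_SET) == hole &&
        write(fd, "x", 1) == 1 &&
        fstat(fd, &st) == 0)
        size = st.st_size;    /* hole + 1; st.st_blocks tells what is really allocated */

    fclose(fp);
    return size;
}
```

Comparing st_size with st_blocks * 512 on the same stat result shows how much of the file is backed by real disk blocks.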

In lieu of lseek, Linux also provides the pread and pwrite system calls. p stands for positional. Semantically, a p-call is similar to an lseek followed by a read/write. The differences are that (1) they do not change the file pointer upon completion and (2) they avoid the potential race condition of using lseek, as threads share the same file pointer.
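
The first difference can be checked directly: read a byte with pread() and compare the file offset before and after. A small sketch on a temporary file (the function name pread_keeps_offset is invented here):

```c
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Returns 1 if pread() fetched the byte at offset 3 and left the file
   offset where it was, 0 otherwise. */
int pread_keeps_offset(void) {
    FILE *fp = tmpfile();
    if (!fp)
        return 0;
    int fd = fileno(fp);
    char c = 0;
    int ok = 0;

    if (write(fd, "abcdef", 6) == 6) {
        off_t before = lseek(fd, 0, SEEK_CUR);   /* 6 after the write */
        ssize_t n = pread(fd, &c, 1, 3);         /* 'd' sits at offset 3 */
        off_t after = lseek(fd, 0, SEEK_CUR);    /* unchanged by pread */
        ok = (n == 1 && c == 'd' && before == 6 && after == before);
    }
    fclose(fp);
    return ok;
}
```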

Closing Files

close() unmaps the file descriptor from the associated file. Closing a file does not mean the data will be flushed to disk. Always check the return value and errno after close() because errors that occur in deferred operations are reported when close() is called.

Direct I/O

UNIX implements a layer of buffering or caching between the application and the device. A high-performance application (e.g. database) may want to bypass this layer of complexity and manage its own I/O. O_DIRECT specifies that I/O is done directly between user space buffers and the device, bypassing the page cache. All I/O is synchronous. The request length, buffer alignment and file offsets must all be integer multiples of the underlying device's sector size.

Read and Write System Calls

A call to read() can end in a few possible ways:

(1) the call returns a value equal to len and the data is stored in buf
(2) the call returns a value less than len but greater than 0. This happens because the read was interrupted by a signal midway, an error occurred during the read, less than len bytes of data were available, or EOF was reached. Issuing read() again with the remaining len can complete the call or reveal the cause of the problem.
(3) the call returns 0 (EOF)
(4) the call blocks because there is no data available to read
(5) the call returns -1 with errno equal to EINTR, meaning the call was interrupted before any byte was read. If errno equals EAGAIN, there is no data to read and the read() call is operating in non-blocking mode. In both cases, issue the call again.
(6) the call returns -1 with another errno value, indicating that a more severe problem has happened
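
These cases are usually handled together in a loop. A minimal sketch for a blocking fd, with invented names read_full and read_full_demo. EAGAIN is treated as an error here because the sketch assumes blocking mode.

```c
#include <errno.h>
#include <unistd.h>

/* Read until len bytes have arrived, EOF is hit, or a real error occurs.
   Returns the number of bytes read, or -1 on error. */
ssize_t read_full(int fd, void *buf, size_t len) {
    char *p = buf;
    size_t left = len;

    while (left > 0) {
        ssize_t n = read(fd, p, left);
        if (n == 0)
            break;                 /* case 3: EOF */
        if (n < 0) {
            if (errno == EINTR)
                continue;          /* case 5: interrupted, retry */
            return -1;             /* case 6: a more serious error */
        }
        p += n;                    /* case 2: partial read, ask for the rest */
        left -= (size_t)n;
    }
    return (ssize_t)(len - left);
}

/* Usage sketch: ask for 10 bytes from a pipe that only ever holds 5.
   Returns 1 if read_full() stopped at EOF with 5 bytes, 0 otherwise. */
int read_full_demo(void) {
    int p[2];
    char buf[10];
    if (pipe(p) != 0)
        return 0;
    int wrote = (write(p[1], "hello", 5) == 5);
    close(p[1]);                   /* reader sees EOF after 5 bytes */
    ssize_t n = read_full(p[0], buf, sizeof buf);
    close(p[0]);
    return wrote && n == 5 && buf[0] == 'h' && buf[4] == 'o';
}
```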

write() is less likely than read() to return a partial result. For regular files, write() is guaranteed to perform the entire requested write unless an error occurs. For other types (e.g. sockets), a partial write is possible and the call can be re-issued for the remaining data.

Using O_APPEND mode ensures the file is not corrupted by 2 racing processes competing to write. If the file is not opened with O_APPEND, each process writes at its own filepos. O_APPEND ensures the write always occurs at the end of the file. This mode is useful for log files but less sensible for other types.

EPIPE indicates that the reading end of a pipe has closed. The process will also receive a SIGPIPE signal, whose default action is to terminate the process. A process that intends to handle this errno must ignore, block or handle the signal.

When write() returns, the kernel has copied the data from the supplied buffer into a kernel buffer. There is no guarantee that the data has been sent to the disk. The kernel batches the dirty buffers and writes them to disk later. This delayed write behaviour also means the write order is not preserved. Another problem is that a write error may not be reported immediately, as the actual write occurs later and asynchronously to the system call. To mitigate the risk of deferred writes, the kernel institutes a maximum buffer age and writes out all dirty pages that reach it. This is configured via /proc/sys/vm/dirty_expire_centisecs.

fsync() ensures all dirty data associated with a file (mapped by fd) is written to disk. The call writes back both data and metadata (e.g. creation timestamp and other attributes in the inode). It returns when the disk has acknowledged that the data has been written out.

fdatasync() writes data only. Neither call guarantees that any updated directory entries containing the file are synchronized to disk. To ensure this, fsync() must also be called on an fd representing the directory itself.

sync() writes out all buffers to disk. Both data and metadata are written out. The standard only requires sync() to initiate the write, so it may return before all buffers are written out, and processes may invoke the call multiple times to make sure all buffers are committed to disk. On Linux, sync() returns after all buffers are written out. sync() may take some time on a busy system.