Record: 2011

Monday, December 12, 2011

Stopped Process

Terminal defined a SUSP character (Ctl-Z) to stop a process from running in the foreground. Pressing the keys resulted in Linux sending a SIGTSTP signal to the process. The process leaves the stop state when it receives a SIGCONT sigbnal. Otherwise, the only other way to get out of the stop state is to terminate.

Foreground process is the only process that receives terminal input. All other processes are considered to be background process. When background process tries to read from terminal, the terminal stops the process by sending it a SIGTTIN signal. The process stops until it is brought to foreground via the fg command. If the process receives a SIGNCONT, it resumes but will be blocked again when trying to read from terminal.

To stop background process writing to console and thus clobbering the display, use the tostop flag of the terminal. While this setting is disabled (default), background processes can write to terminal. The setting is enabled using this command

stty tostop

When a backgroud process tries to write to terminal, the terminal sends a SIGTTOU signal to the process and puts it to sleep. Sending SIGCONT to the stopped process will wake it up but it will be blocked again as soon as it tries the write. Again, the process will allow to write when it is brought to foreground using fg. To disable the setting, use this command

stty -tostop

Wednesday, November 23, 2011

Thread

A thread is a basic unit of CPU utilization. Thread comprises a thread ID, a program counter, a register set and a stack. It shares with other threads belonging to the same process its code section, data section and other operating resoruces such as open files and signals. A traditional process contains a single thread of control.

Support for thread may be provided either at user level (for user thread) or kernel level (kernel thread). User threads are supported above the kernel and are managed without kernel support, whereas kernel threads are supported and managed directly by the OS. Ultimately, a relationship must exist between user threads and kernel threads:

(1) Many-to-one model maps many user-level threads to one kernel thread. Thread management is done by the thread libraray in user soace and so it is efficient. However, the entire process will be suspended if it make a blocking system call. Also only 1 thread can access the kernel at a time and so this model is unable to run in parallel on a multicore system. The GREEN threads (avaliable in Solaris) and GNU portable threads use this model.

(2) One-to-one model maps each user thread to a kernel thread. It provides higher concurrency by allow other threads to run when one gets blocked by system call. The disadvantage of this is the overhead of creating a kernel thread for every user thread and affects performance. Most implementaion of this model restrict the number of threads supported by the system. Linux and Windows implements this model.

(3) Many-to-many model multiplexes many user level threads to a smaller or equal number of kernel threads. This model provdes higher concurrency comparing to the many-to-one model at the same time limit the number of threads as compaaring to the one-to-one model.

(4) Two-level model is a variation of the many-to-many model by also allowing a user-level thread to be bound to a kernel thread (one-to-one). This model is supported in IRIX, HPUX and Tru64 UNIX. Solaris supports this model in pre-Solaris 9 version. From Solaris 9 onwards, it uses the one-to-one model.

Many systems implementing the many-to-many or two-level model place an intermediate data structure between the user and kernel threads. This structure, typically known as lightwight process (LWP), appears to the user-thread libraray as virtual processor on which the application can schedule a user thread to run. Each LWP is attached to a kernel thread of which the OS scheduler will dispatch on physical processors. When a kernel thread is blocked, it blocks the LWP and in turn blocks the user-thread.

Sometime the kernel needs to inform the application about certain event. It uses upcall procedure. At the beginning, the kernel allocates a number of LWP for the application. One upcall event is when an application thread is about to block. In this case, the kernel makes an upcall to the application identifying the specific thread. The upcall is hanlded by the thread library. The upcall handler is allocated with a new LWP to run. It saves the state of the blocking thread and relinquishes the virtual processor. The upcall handler then schedules another thread to the new virutal processor. When the blocking thread is eligible to run, another upcall will be performed.

Tuesday, November 22, 2011

Operating System Structure

(1) Monolithic
Frequently such system started from a small, simple and limited systems and grew beyond the original scope. For example, MSDOS and traditional UNIX.

(2) Layered
With proper hardware supoport, OS can be broken into modules. The main advantage of the layered approach is simplicity of construction and debugging. Once a layer is debugged, the upper layer can be tested easily. The major difficulty with the layered approach involves appropriately defining the various layers. sometimes it is not so obvious. For example, the backing store (paging/swapping) driver would normally be above the CPU schedululer because the driver may need to wait for I/O and CPU can be rescheduled during that time. however, on a large system, the CPU scheduler may have mor einformation about all the active processes than can fit in memory. Therefore, this information may need to be swapped in an dout of memory, requiring the back-store driver routine to be below the CPU scheduler. Another problem with layered approach is that it tends to be less efficient than other types. Each layer adds overhead to the system call. This resulted in the recent trend in OS design to have less number of layers and each with more functionality.

(3) Microkernel
In mid-1980s, researchers in Carnegie Mellon University developed an OS called Mach using microkernel approach. The method removes all non-essential components from the kernel and implementing them as user or kernel processes. This results in a smaller kernel. There is little consensus on what should be included or excluded. Typically, micokernel provides minimal process and memory management, in addition to a communication facility. The main purpose of microkernel is to provide a communication facility between the client program and various services that are also running in user space. Communication is provided via message passing through the microkernel.

One benefit is ease of extending the OS. All new services are added as user services and do not require modifying the kernel. This make the OS easier to port to different hardware platform. The microkernel also provides better security and reliability. If a services fails, the rest of the OS remains intact. Several contemporary OS using this approach. Tur64 UNIX provides a UNIX interface to user but was built with a Mach kernel. The Mach kernel maps UNIX system calls into messages to the appropriate user-level services. The Mac OS X (Darwin) was also based on Mach microkernel.

Unfortuntely, microkernel can suffer from performance decreases due to increased system function overhead. For example, the first release of Windows NT was based on a layered microkernel organization. Windows NT 4.0 partially address the performance problem by moving layers from user space to jernel space and integrating them more closely. by the time windows XP was designed, its architecture was more monolithic than microkernel.

(4) Modules
The best current methodology for operating system design involves using object-oriented programming techniques to create a modular kernel. The jernel has a set of core components and links in additional services either during boot time of run time. It uses dynamically loadable modules. For example, Solaris is organized around a core kernel with seven types of loadable kernel modules - scheduling classes, file systems, loadable system calls, executable formats, STREAMS modules, Miscellaneous, device and bus drivers. such design allows the kernel to provide core services yet also allow certain features to be implemented dynamically. For example, device and bus drive for specific hardware can be added.

The result resemble a layered approach in which each section has a defined, protected interface. But it is more flexible than a layered approach as the sections can called one another. The modular approach is also like the microkernel approach in that the primary module has only core functions and knowledge to load and communicate with other modules. but it is more efficient because module can communicate directly with one another rahter than message passing.

Sunday, November 13, 2011

Procedure and Process

Procedure is a series of steps whereas process is far richer in governance and oversight. Process has feedback mechanism for improvement and is more adaptable to change conditions. A process also interface to other processes.

Tuesday, November 8, 2011

AT&T Mnemonic

When gcc compiles a c program, it actually generate an assembly soruce code in AT&T mnemonics. The output .s file is then assembled using the gas assembler into object module.

General rules for AT&T mnemonics:
(1) All mnemonics and registers are in lower case.
(2) Register names are prefixed by "%"
(3) Every AT&T instruction mnemonics that has operands has a single character suffix indicating the length of the operands. The suffix are b, w, l(long 32-bit). For example, movw %ax,%bx.
(4) Source and destination operands are placed in opposite order from Intel syntax.
(5) Immediate value are prefix by "$". For example, pushl $32
(6) gcc does not generate all AT&T mnemonics.
(7) Displacements in memory references are signed quantities placed outside parantheses containing the base, index and scale values. For example, movl -4(%ebx,%edi) is equivalent to Intel mov byte easx,[ebx+edi-4]

Linking to the Standard C Library

C runtime library dictates a structure to any programs that link to it. This include a block of code that runs before your program begins and another block of code after your program ends. Your program is called by the startup code as though it is a procedure with the CALL instruction. Your program returns to C library using the RET instruction.

The C calling convention is as follow:
(1) A procedure must preserve the values of the EBX, ESP, EBP, ESI and EDI. The other GPR can be altered.
(2) A procedure's return value in EAX for 32-bit or less. 64-bit return value is stored in EDX and EAX.
(3) Parameters passed to procedures are pushed onto the stack in reverse order. In other words, given MyFunc(p1, p2, p3), p3 is pushed onto stack first, then followed by p2 and p1.
(4) Parameters are removed by the caller after the procedure returns.
(5) The label name of the the assembly program (or the one generated by gcc) must be called "main:" and must be declared as GLOBAL.

Monday, November 7, 2011

x86 Jump

There are 2 categories of conditional jump. The problem arises because NASM can generate a binary opcode in different way. A jump target that lies within 127 bytes of the conditional jump instruction is called short jump. A jump that is farther than 127 bytes but within the code segment is called near jump. A near jump can be as far as 2GB away. A far jump go out of a segment which mean specifying both segment and offset. This is quite uncommon.

A short jump generates a short and compact opcode which is always 2 bytes in size. Near jump generates a opcode of either 4 bytes or 6 bytes. NASM generates short jump by default as this is most commonly use in programs. To generate near jump, one must explicitly specify

jne loop ;short jump
jne near loop ;near jump

The above does not apply to unconditional jump.

Linker

There is no reason why assembler cannot generate an executable program as object code file but it is almost never done. The object module cannot be run by itself but requires another step called linking. This model allows breaking big program into smaller parts and each gets written on its own separately.

To process each modules into an executable, linker first builds an index called symbol table, with an entry for every named item in every object module. Once the symbol table is complete, the linker builds an image of how the executable will be arranged in memory when the OS loads it. The image is written to the disk as the executable.

Object modules are allowed to refer to symbols in other object modules using the EXTERN keyword. In the declaring module, the label must be defined using the GLOBAL keyword.

During assembly, these external references are left as holes in the calling module to be filled later. As the linker builds the executable image, it learns where these symbols are in the final image and update them in the reference holes.

Debugging information is a step backward: portions of the soruce code, which was all stripped out early in the assembly process, are put back into the object module by the assembler. These portions of the source code are mostly names of data items and procedures.

Linker also makes the executable relocatable (all address are referenced relative to the start of the executable. This enable the executable to be loaded in any physcial address.

x86 Memory Models

Flat Memory Model
The memory is abstracted as a list of consecutive bytes accessed via a linear address. This resemble to the physical address but not necessary when the linear address is used in the translation process under a memory protection model

Segmented Memory Model
Memory is organized in term of distinct region called segment. A byte in segment is referenced by a logical address (also called a far pointer). A logical address comprises 2 parts - a segment selector and an effective address. The logical address can resemble the physical address or may be not depending on the translation process.

IA-32 operates in 3 modes - real mode, protected mode and system management mode (SMM).

SMM
It is used to execute code in firmware (e.g. emergency shutdown, power management)

Real Mode
Real Mode implements a 16-bit execution environment of the old 8086/88. It is an instance of the segmented memory model. The logical address contains 16-bit segment selector and 16-bit effective address.

The selector stores the base address of a 64KB segment. To form a 20-bit address (used by 8086/88), the segment selector is appended with a zero nibble before adding on with the effective address. For example, the logical address 02001010 is translated to 0200[0] + [0]1010 = 03010. The segment address is always at paragraph boundary (because of the appended zero) and the segment is at most 64KB (because of the 16-bit effective address). There is no memory protection under Real Mode.

There are 5 combination of segment registers and GPR:

SS:SP

SS:BP

ES:DI

DS:SI

CS:IP

MSDOS is a real mode OS. The first 640KB is called Conventional Memory. The remaining memory area up to 1MB is called Upper Memory Area, which is reserved for used by ROM and peripheral. Within UMA, there are slot of free memory not used by hardware and accessible by DOS are called UMB (Upper Memory Block). Memory above the real mode 1MB boundary (when 386 was released) are called extended memory.

In a Real Mode Flat Model, program starts at 100h which is a holdover from CP/M-80 (PSP).

Protected Mode
Similar to real mode, protected mode is an instance of segmented memory. The difference is that operating system is now must collaborate with the CPU to resolve addresses. The segment registers are owned by OS and program cannot read or write them.

x86 general purpose registers

There are 8 GPR. Four of them can be referenced either in full 32-bit or 16-bit.

The 16-bit names are SI, DI, BP and SP. The 32-bit names are prefixed by "E" - ESI. EDI, EBP and ESP.

The other four GPR can be referenced either in 8-bit, 16-bit or 32-bit

32-bit: EAX, EBX, ECX amd EDX
16-bit: AX, BX, CX and DX
8-bit: AH, BH, CH and DH (for the upper 8-bits) and AL, BL, CL and DL (for the lower 8bits).

Sunday, October 30, 2011

x86 Segment Registers

8080 is a 8-bit CPU with 16-bit address lines. 16 bits define an address space of 64K. At that time, the prevalent OS is CP/M-80. CP/M loads itself to the top of the installed memory. (This allows CP/M to reside in ROM.) More importantly, it allows the transient (user) program to start at a fixed location in low memory, at address 0100h (256 bytes from the start of memory).

The first 256 bytes are called PSP (program segment prefix) which contains various odd bit of information and buffer for disk I/O. Program will be loaded at 0100h, follow by a section of unused memory. At the end is CP/M. Above that are addresses that are not supported by any installed RAM.

8086 has 20 address lines which gave an address space of 1MB (16x64K). To allow programs written for CP/M to run in (or port to) 8086, segment register was used to point to the starting address for a 64K memory area within the 1M address space. While the CPU can see the full 1MB of memory, the segment register restrict the access to 64K at a time. CP/M program runs in the 64K segment similar to 8080.

Segmenation model was a big plus in allowing more memory to be addressed than a linear model in a 16-bit architecture of 8080. But when the CPU architecture moved on to 32-bit and 64-bit, the segmentation model is no longer required.

A segment starts on paragraph boundary (16-bytes). To form a 20-bit address, 2 registers are used. One of them is segment register which form the "base". There are 6 segment registers:

CS - code segment
DS - data segment
SS - stack segment
ES - extra segment (a spare segement)
FS and GS are clones for ES

While the segmentation scheme makes porting of CP/M program easy, this has severely complicated the programming model beyond CP/M.

Friday, October 21, 2011

CPU Architecture

Architecture is "what the CPU does". It comprises like instruction set and registers. x86 CPU archtiecture evolves but attained high degree of backward compatibility. New instructions were added and width of the CPU expanded from 8 to 16 to 32 to 64 bits. Intel's 32-bits architecture is called IA-32. The new 64-bit architecture is called x86-64 which was invented by AMD. Inte;'s 64-bit architecture is called IA-64 Itanium which was not backward compatible to IA-32.

A CPU's microarchitecture is an implementation of an architecture. Different techniques are used to improve the throughput and speed of the execution (E.g. prefetch, cache, micro-op fusion etc.).

Monday, August 29, 2011

Broadband

Most LAN technology uses baseband signalling which mean only one signal is transmitted through the medium at a time. Broadband signaling on the other hand allows multiple signals on the medium simultaneously.

Sunday, August 28, 2011

Layer 1 Devices

Repeaters are the most basic form of forwarding device. They associate with Layer 1 as they do not have the ability to inspect the content of the headers. They simply regenerate the exact copy of the electrical signal. They are primarily used to extend the distance of cable run. They are seldonly used now due to advance of fibre optics that spans great distance.

Hub, or concentrator are simply multiport repeaters. The concept is the same. A signal delivered to any port on the hub is regenerated and forwarded to all ports. Again, there is no examination of frame. Hub is generally deployed in small LAN. When the number of nodes increases, perform degrades significantly as bandwidth are contended.

Ethernet is a baseband medium, meaning only 1 signal will be transmitted at any one time. When 2 nodes attempt to transmit at the same time, collision results. The nodes will reattempt to transmit at a later time. Therefore, hubs and repeaters in a network form a collison domain.

As the size of network grow, the collision domain becomes less efficent. Solution needs to partition the collision domain.

Monday, August 8, 2011

Memory Access

First level cache often has access time between 1 to 3 processor cycles. Access to second level cache requires 20 to 30 cycles and access to memory will needs >100 cycles.

In a multi-core system, each core has memory directly attached to it. Access to the local memory is faster than access to the memory connected to another core.

One approach is to interleave memory, commonly at cache line boundary so that application would access local and remote memory alternatively. On average, application will experience memory latency which is average of the local and remote memory access. This architecture is called uniform memory architecture (UMA) where all processor see uniform memory latency

Another approach is to accept the difference memory region will have different access frequency and make the OS aware of this. This architecture is called cache coherent nonuniform memory architecture (ccNUMA). OS will attempted to assign process to run on the same processor to maximize local memory access.

Cache

Direct mapped cache - each cache line in memory maps directly to a fixed cache line position. If the cache is 4KB in size, every memory location in 4KB apart maps to the same cache line.

N-way associative cache - each cache line in memory maps to a set of N possible location in cache. The location chosen will be determined by some replacement policies (e.g. random or LRU etc).

Fully associative cache - each cache line in memory maps to any position in cache. This is rarely implemented because of its complexity

First level cache access time is 1 to 3 cycles. Second level cache access time is 20 to 30 cycles. Memory access takes more than 100 cycles. The reason for having multiple level of cache because the larger the cach, the longer it takes to find out if an item is stored there.

Saturday, August 6, 2011

Initial RAM Disk in Linux

In the final boot step, the Linux kernel will execute a series of run_init_process calls:

run_init_process("/sbin/init");
run_init_process("/etc/init");
run_init_process("/bin/init");
run_init_process("/bin/sh");
panic("no init found. try passing init= option tp kernel.")

The run_init call execite program in a filesystem already mounted, else the kernel will panic. Linux kernel contains 2 mechanisms to mount an early root file system to perform certain start up functions.
(1) initial RAM disk (initrd) is the legacy method. Support for this must be compiled into the kernel. It is a small, self-contained root file system that usually contains directives to load specific device driveers before the end of boot cycle. For example, Red Hat and Ubbuntu uses initrd to load the EXT3 file system driver before mounting the real root file system.
The bootloader pass the initrd image to the kernel. A common scenario is that the bootloader loads the kernel image into memory and loads the initrd image to another address ranges. Bootloader than passes the load address of initrd to kernel. In other scenario (ARM architecture), the initrd and kernel image are simply concatenated into a single composite image. The address and size of the intird is passed to kernel using the kernel command line. One note is that the initrd image is universally compressed.
When Linux boots, it detects the presence of initrd image. It copies the compressed binary file from the physical location in RAM to proper kernel ramdisk and mounts it as the root file system. It then looks for a special file in the initial ramdisk called linuxrc. This file contains directives (commands)required before mounting the real root file system. After the kernel copies the ramdisk from physcial address into the kernel ramdisk, it releases the physical memory. This is similar to transfer the initrd image from real memory to virtual memory of the kernel.

As part of the Linux boot process, the kernel must locate andmount a root file system. The kernel decides what and where to mount in a funciton called prepare_namespace() in /init/do_mount.c.

If initrd support is enabled in the kernel and the kernel command line is so configured, kernel decompresses the initrd image and copy the content to ramdisk under /dev/ram. At this point, we have a proper file system on a kernel ramdisk. The kernel effectively mounted the ramdisk device as its root file system. Finally, kernel spawn a kernel thread ti execute the linuxrc file. When linuxrc is done, kernel unmounts the initrd and proceed to the final stage of system boot. If the real, root device has a directory called /initrd, Linux will mount initrd file system on this path. Otherwise, the initrd will be discarded.

If the kernel command line contains a root= paramter specifying a ramdisk (e.g. root=/dev/ram0), the previous described behaviour will change. First, the processing of the linuexrc file is skipped. Second, no attempt is made to mount another file system as root. It means the initrd is the only root file system. This is useful if minimal system configuration is intended.

(2) initiramfs is the preferred mechanism for executing early user space programs. It is coneptually simliar to initrd and its purpose is also similar: to load drivers that might be required to mount the real file system.
Initramfs is much easier to use by developer as it is a cpio archive whereas initrd is a gzipped file system image.

init process in Linux

The File System Hierarchy Standard (FHS) establishes a minimum baseline of compatibility among Linux distributions and application programs.

AT the end of kernel initialization, it calls run_init_process() which refer to programs expected to residing in the root file system. If the programs cannot be found, the system will halt at the panic().

It is not sufficient to simply include an executable call init in the system and expect it to boot. You must also satisfy its dependency (unresolved reference and configuration data in external file).

The init process supplied with Linux is very flexible. It implements what is commonly called System V init, from UNIX System V using this schema.

Init spawn other processes under the direction of configuration stored in /etc/inittab. Init execute differently according to runlevel (0-6, S). Runlevel 0 halt the system. Runlevel 6 reboot the system. Associate with each runlevel is a set of startup and shutdown scripts (kept in /etc/rc.d/init.d

When init starts, it reads /etc/inittab. This file contains directives for each runlevel. The sysinit tag indicates the script to be run first. The initdefault tag indicate the default run level. Init then executes the scripts denoted by the runlevel tag l#.

Linux Boot Up

The bootloader takes control once power is applied to the computer. Bootloader is a set of routines designed to do low-level initialization, OS image loading and system diagnosis (dump, test etc). It loads and passes control to the OS.

Using XScale platform as an example, the Bootloader passes control to head.o module label Start, which is a bootstrap loder. The bootstrap loader appended to the kernel image has a primary purpose: to create an environment to decompress and relocate the kernel and pass control to it.

Control is passed to the kernel proper, to a module called head.o (a different module than before) and label Start. In other words, head.o is the kernel entry point. The head.o module performs architecture and often CPU-specific initialization in preparation for the main body of kernel. Still, CPU-specific initialization tasks are kept as generic as possible across the CPU family. Machine-specific tasks are performed elsewhere.

The head.o module checks for valid processor and architecture, creates initial page table entries, enables processor's MMU (Memory Management Unit), establishes limited error detection and reporting.

When control is first passed to head.o, the processor is in real mode (the program counter represents physical address). After the processor's registers and kernel data structures are initialized to enable address translation, the processor's MMU is turned on. Suddenly the address space as seen by the processor is yanked from beneath it and replaced by an arbitrary virtual addressing scheme. This is why debugger cannot single step through this portion of code.

The final action of head.o is to jump to the kernel proper's own main.c, a startup file written in c.

The first line of output from the kernel is the kernel version string. Upon entering start_kernel call (in main.c), printk() displays Linux_banner which contains information such as version, username and machine which the kernel is compiled, build and version number etc.

start_kernel is by far the biggest function in main.c. Most of the kernel initialization takes place in this routine.

One of the first few things start_kernel called is setup_arch(&kernel_command_line). This module calls setup_processor() which verifies the CPU ID and revision, calls CPU-specific initialization functions, and display CPU information in the console. Finally, setup_arch calls the machine-dependent initialization routine.

Next to architecture set up, main.c performs generic early kernel initialization and then displays the kernel command line which is a list of parameters.

After start_kernel() and calling some early initialization functions explicitly by name (e.g. init_timers, console_init), the first kernel thread, called init (PID=1), is spawned. At this point of boot sequence, 2 distinct threads are running: that represented by start_kernel(), and now init(). The former becomes the idle process after completing its work. The latter becomes the init process.

start_kernel() calls rest_init() and this allows start_kernel() to return most of its memory to the address space. The kernel's init process is then spawned in rest_init() by the call to kernel_thread(). init continues with the rest of the initialization while start_kernel() executes the function cpu_idle() that loops forever.

After further initialization, the final step of boot up is to try to run a series of run_init_process call.

run_init_process("/sbin/init");
run_init_process("/etc/init");
run_init_process("/bin/init");
run_init_process("/bin/sh");

One way of another, these run_init_rpocess() commands must proceed without error, or the system will fall down to a panic() call to halt the system.

The run_init_process() function does not return on successful invocation because it call execve() system call which replaces the process executable with a new one.

Friday, August 5, 2011

Linux File System

Linux file system is stored in partition, which is a logical disk. There are 3 type of partitions in Linux - Linux, FAT32 and Linux Swap.

ext2 (Second Extended File System) is a conventional FS using inode etc. ext3 built on ext2 is a journal file system. Change to the file system (e.g. date, size, blocks used etc) are stored in a special file so recovery is possible from known journal point. It eliminate the weakness of long check time for ext2 after abnormal shutdown. ext4 is also a journal file system with larger fs and file size support.

ReiserFS is also a journal fs like ext3. Reiser4 introduced a API for system programmer to guarantee the atomicity of file system transaction. A jfs guarantee the metadata changes have been stored in the journal file so that the kernal can at least establish a consistent state of the file system. That is, if file A was reported to have 16KB before the system failure, it will be reported as having 16KB afterward in the inode. This does not mean that the file data was properly written to the file.

Linux Device Node

A device node is a special file type represents a device. It is kept under /dev. mknod create a device node. For example,

mknod /dev/xyz1 c 234 0

This create a file xyz1 under /dev. It is a character device (crw-r--r--) with a mojor number of 234 and minor number of 0.

By itself, the device node is just another file in /dev. But we can use it to bind to an installed device driver. When an open() call is excuted with this device node as the path parameter, kernel search for device driver that registered with a major number of 234. This is how kernel associate the driver with our device node.

Minor number is a mechanism for handling multiple devices or subdevices with a single device driver. It is not used by the OS but simply passed over to the driver. What to do with the minor number is entirely up to the driver. For example, the minor number can denote one of the ports on the card managed by the driver.

Device node is usually created by udev rather than manually using the command.

Thursday, July 21, 2011

Linux Bootable Image

After the vmlinux kernel ELF has been built, it is further processed to strip off the redundant sections (comments and notes). The output is an object file called Image which is then compressed using gzip

cat Image | gzip -f -o > piggy.gz

Next the bootstrap loader is built. Bootstrap loader is the 2nd stage loader which prepare the context for the Linux kernel to run in. It is different from the bootloader (1st stage loader) of which control will be passed over to once the hardware is power on. Bootloader performs low level initialization and diagnosis utilities.

Bootstrap loader perform the following functions:
(1) head.o and head-"arch".o - low level assembly language processor initialization, which include enabling the processor's internal instruction and data cache, disabling interrupts and setting up a C runtime.
(2) misc.o - decompress and relocate linux kernel
(3) other initialization

Bootstrap loader contains an assembly program called piggy.s. The program contains a include (inclib) for piggy.gz. In other words, when piggy.s is assembled, the compressed kernel is piggyback into the bootstrap loader. The bootstrap also include other object code like head.o to form the boatable kernel image called zImage.

In summary, the bootable kernel image contains the following codes:
(1) piggy.o - asm wrapper around piggy.gz which is a compressed vmlinux without notes and comments sections
(2) head.o
(3) head-"arch".o - architecture specific intialization
(4) misc.o
(5) others

Wednesday, July 20, 2011

Semaphore

It was invented by Dijkstra in 1965 as a generalization of critical region. A semaphore was assigned an initial count. As long as the count is not zero, thread can continue to decrement the count without waiting. When thread leaves, it increases the count. Dijkstra invented 2 operations - P based on a fictitious word prolaag which means "try to take" and V which is a Dutch work "verhoog" meaning to increase.

Semaphore of value 1 is called binary semaphore. Semaphore of value more than 1 is called counting semaphore.

Dual-mode Hosted VM (e.g. VMWare GSX)

There are 3 components:

(1) VMM-n (native) - This component runs in native mode. It is the component that intercepts traps or patch critical instruction in VM. this may provide drivers for performance purpose or those not available in host OS.

(2) VMM-u (user) - This component runs in user mode, similar to an applicaiton to the host OS. This component issues resource requests to the host OS, in particular memory and I/O request on behalf of VMM-n. VMM-u issue requests using the system library fucntion provided by the host OS.

(3) VMM-d (driver) - This component provides a means for communication between VMM-n and VMM-u. This makes VMM-n appears as a special device to VMM-u.

The advantage of a hosted VMM is easy installation and the fact that the actual device drivers do not have to be incorporated into VMM. The disadvanatages are

(1) VMM-n operates in privileged mode alongside with the host OS. There could be compatibilty problem which may corrupt each other's memory.

(2) The allocaiton of resources is completely under the control of host OS. The effects of allocation prolicies in the VMM are less predictable because the VMM has insifficient knowledge and control over the use of resources.

Performance disadvantage compared to a native VM because the need to go back and forth between VMM-n and host Os. The perofrmnace degratdation is more significant in an I/O intensive workload than in a user-mode CPU-intensive workload.

VM/370

The first VM environment was the IBM System/360 Model 40 VM. VM became mainstream until later with a model of System/370. The VMM in VM/370 was called CP (Control Program). The CP design team also developed a single-user OS called CMS (Conversational Management System), mainly to demonstrate the advanatage of modularization for system evolution. The CP/CMS design separates the function of resouce management from the function to provide services to the user. CP and CMS can exist without each other. In fact, CMS was developed on a bare machine before CP exists.

Resoruce Virtualization

(1) Processor
The key aspect of virtualization a processor lies in the execution of the guest instruction, including both system-level and user-level instructions. There are 2 ways. One is via emulation (interpretation or binary translation). Emulation is the only processor virutalization mechanism available when the ISA of the guest is different from the host. Emulation may also be required even if host and guest ISA are the same, such as for instructions that interact with hardware resources need to operate differently on virtualized processor than on a real one.

The second method use direct native execution on the host machine. This is possible only when the host and guest ISA are the same. To minimize performance degratdation, one basic design goal for system VM is to execute significant fraction of the instructions directly on native hardware. the overhead for emulating the remaining instructions depends on the number of such instructions, the complexity to discover this instructions, and the data structure and algorithm used for emulation.

(2) Memory
In a system VM, each guest VM has its own set of virutal memory tables. A guest's real address must undergo a further mapping to derive the physical address. Memory resource virtualization is done differently depending on if the page table or TLB is architected. If page table is architected, its structure is defined by the ISA, and the OS and hardware cooperate in maintaining and using it. TLB is maintained by hardware not visible to OS. When TLB misses, hardware walk the page table to find the new entry to load. If the mapping is not found, a page fault result and OS take over. If TLB is architected, its structure and special instruction used to manipulate it is defined in the ISA. Hardware is unaware of the page table. When there is a TLB miss, a trap is sent to OS to be handle instead.

Most older ISA uses architected page table. Some of the more recent RISC ISA use architected TLB.

Vitualizing architected page tables - Each guest OS maintains its owne page table. Virtual-to-physical address mapping is maintained by VMM in shadow page tables., one for each VM. these tables are actually used by hardware to do the address translation and to keep TLB up-to-date. the shadow page tables eliminate the addition translation required from real to physical address. VMM control the real page table pointer. When VMM activates a guest, it loads the page table pointer to use the correct shadow page table for this guest. Guest OS attempts to read/write to the page table pointer will result in a trap to VMM. The trap is generated either because these instructions are privilege or via the wrapper code around these instructions. For read instruction, VMM return the value of the guest's virutal page table pointer (in memory block). For write instruction, VMM updates both virtual and real pointer. Where there is a page fault, the page may or may not have been mapped in the guest's virutal page tables. If it is mapped, the fault is handled entirely by VMM. The guest is not notified of it. If it is not mapped, VMM transfer control to the page faulter handler of the guest OS. However, many of the instruction issued by the page fault handler must be intercepted and handled by VMM because these instructions are privilege or the memory where the virtual page tables reside are write protected by VMM. Doing I/O to real address is also tricky as the physcial address may not be contiguous as the real address is. In this case, the I/O instruction may need to be broken up to multiple I/O instructions to perform I/O on discontinuous blocks of memory. These operation can degrade the performance substantially.

virtualizing TLB - When ISA provides a software managed TLB, the TLB must be virtualized. VMM must mainains a copy of TLB for each guest and intercept instruction so that VMM can keep these TLB up-to-date. One approach is for VMM to load the TLB whenevem a guest VM is activated. This TLB rewrite incur quite high overhead especially for large TLB. An alternate approach is to leverage the address space ID (ASID) that are part of the architect TLB. This allows the TLB to contains entries for various guest simultaneously. A special regiester containing the current ASID is use to indicate the entries in TLB to be used for translation.

(3) I/O
This is one of the more difficult parts of implementing a system VM because there is a large number of device types and a large number of devices in each types.

dedicated devices - e.g. display, keyboard, mouse and speaker. The device itself does not necessarily have to be virtualized. Requests to and from the device could theoretically bypass the VMM and go directly to the guest. However, this is not the case as guest usually runs in user mode. VMM will capture the interrupt and pass to the guest VM when it is activated.

partitioned devices - the VMM translate the parameters into corresponding parameters for the underlying physical device, using a map and reissue the request. Similarly, the result is translated on its way back.

shared devices - e.g. network adaptor. Each guest has its virtual stage of the device e.g. network address, port. VMM translate the parameters and result similar to partitioned devices.

spooled devices - e.g. printer. Printer is solely under control of one program until the printing is complete. This is true even if the program is swapped out. VMM maintain a spool table which consolidate the spool entries in each guest. The virtualization of printer of this kind was more appropriate for older line printers which were expensive and attached to the machine. Nowaday, network printers usually have its own buffer and spool jobs from machines across network.

Generally, VMM can intercept a guest's I/O action and convert it from a vertial device action toa real device action at 3 interfaces

virtualizing at I/O operation level - Memory mapped I/O, commonly in many RISC platform, perform I/O by reading/writing to a special memory location. These locations are protected by OS which make them inaccessible by user mode programs. System like S/360 and IA-32 performed I/O using special command (e.g. SIO). These privilege nature make them easy to be trapped by VMM. However, a I/O request from application typically results in a few such I/O request. Reverse engineering these requets to deduce the I/O action is extremely difficult in practice.

virtualizing at Device Driver level - this is straightforward and allow virtualization at a natural points. However, it requires the VMM developer have knowledge of the guest OS and its device interface. Special virtual device drivers can be developed for each guest. These drivers are bundled and installed as part of VMM. This can be simplified by "borrowing" the drivers (and interface) from an existing OS. In a hosted VM environment, the drivers in the host OS are used.
virtualizing at system call level - this is done by intecepting the initail I/O request at the OS inerface, the ABI. Then the entire I/O the be done by VMM. To do this, VMM must shadow (emulate) the ABI routines. This is a daunting task.

System VM

Real resoruces of the host Platform are shared among the guest system VMs. VMM (VM Monitor) manage the allocaiton, access to the hardware resources.

native VM - VMM runs in privilege mode, guest system (OS) runs in user mode. The privilege level of the guest IS is emulated by VMM.

Hosted VM - VMM run in user mode above a hosted OS. VMM uses services provided by the host OS to control and manmage resources desired by each of the VMs.

Dual-mode BVM - For efficiency reasons, it is desirable to have at least part of the VMM work in privilege mode. It is done via extending the host OS using standard interface such as kernel extension of device driver.

Garbage Collection

To begin, it is first necessary to identify the root pointers. The root set must contain references somewhere on the stack, including both local storage and the operand stack or in the constant pool. The root set must also include references contained in static objects.

(1) Mark and Sweep starts with the root references and traces through all the reachable objects, marking each one as it is reached. Marking may coinsist of setting a flag bit in the object or in a separate bitmap. Garbage objects found can be combined into a linked list of free objects. Overall, this is a relatively fast way of identifying and collecting garbage. However, the free objects are of varying size and are scattered in the heap space. This leads to memory fragmentation which requires compaction. Inefficiencies can be reduced by using segregated free lists, or in other words, by dividing the heap into a set of fixed size chunks having arrange of sizes.

(2) Compacting Collectors essentially slides the live objects to the bottom or top of the heap so all live objects are adjacent. What is left is a contiguous region of free space. Although conceptually simple, the compacting collector is relatively slow as it makes multiple passes through the heap. One pass does the marking and then subsequent passes compute the new locations for the live objects, move the objects and update all reference poniters to the new location. To reduce the number of reference update, some system uses handle pool (one objects referenced by 2 other objects are pointed to through the handle pool. The 2 referencing object point to the handle pool which in turn points to the referenced object. This introduced another level of indirection.

(3) Copying collectors divids the heap into 2 halves. At any one time, only 1 half is used. The collectors copies the objecs from one half to the other half. The memory requirement for copying collector is high compares to other collector.

(4) Generational Collectors
Objects manifests a bimodel behaviour. Most objects tend to be short-lived due to good object oriented programming practice. Objects that are not short-lived tends to have very long life-time. The heap is divided into 2 subheaps. The nursery is for newly created objects. It is garbage-collected more frequently. Any objects survives a number of collection in the nursery is moved to the tenured heap which has less frequent collection.

(5) Incremental and concurrent collectors
All the above collectors stop program execution while they perform collection. Collection time may be spread out if done incrementally. Also, in a multiprocessor envrionment, it is advantageous to collect using one thread while other threads are used to run programs. As a partially collected heap is in a state of flux, some synchronization between the collector and program is needed. For example, once the collection marks a object to be alive, it may be dereferenced by the program. One of the common solutions is to provide write barriers for references to objects that are marked.

Overall, mark and sweep provide good collection time.

Process VM - Code Cache

Code Cache differs from a conventional cache memory in at least 3 ways:
(1) The cache blocks do not have a fixed size. The size depends on the size of translated target block.
(2) The presence and locations of the blocks are dependent on one another because of chaining. If a block is removed, the links must be updated
(3) there is no backing store. When a block is remove, it must be regenerated.

Replacement strategies include:
(1) LRU - Because of temporal locality, this is often a good strategy for conventional caches. Unfortunately, because of the specific properties of code cache, it is relatively difficult to implement. Firstly, there is overhead to implement structure to track access. Secondly, back points are need to perform delink. Thirdly, blocks are of different size and this results in fragmentation. Because of these, LRU is not typically used for code cache.
(2 Flush when full - most basic algorithm is simply to let the code cache fill and then to flush it completely and start over with an empty cache. There are advantages in this approach. Overtime, the frequently followed paths may change and flishing provides an opportunity to elimnate control paths that have become stale and no longer refelct the common paths through the code. Secondly, flushing may remove orphans and reclaim space. The disadvantage is block must be retranslated from scratch, leading to a high performance overhead immediately after the flush.
(3) Preemptive flush - Many programs operate in phases. A phase change is usually associated with an working set change. When phase change, a new region of source code is being entered and a larger percentage of time is spent in block translations. The code cache can be preemptively flished tomake room for the translation.
(4) Fine grained FIFO - a nonfragmenting algorithm that exploits temporal locality and is not a brite force approach. The code cache is managed as a circular buffers. The oldest n-blocks are replaced to make room large enough for the new block. This scheme overcomes a number of disadvantages of LRU (albiet at a slightly reduced hit rate). It still needs to keep track of chaining via back pointers.
(5) Coarse grain FIFO - partition the cache into large blocks (e.g. a block may be 1/8 of cahce size). with this, backpointer problem can be simplified or eliminated by only maintianing backpointer in block level.

Process VM - OS Emulation

When guest and host OS are the same, the problem of OS call emulation is primarily one of matching the OS interface syntax. The OS functions required by the guest are available in the host, but it may be necessary to move and format arguments and return values, possibly forming some data covnersion in the process. For example, OS running on platform with few register may pass parameter in stack while OS with many registers may pass arguments in registers. In this case, the call must be set up by copying arguments from stack to registes when emulating a system call. This is call wrapper.

Some call may be handled by the runtime code instead of transalation. For example, a call to establish signal. In this case, the runtime code will own all signal and so the request from the guest will be marked in a side table instead. Runtime will pass on the signal if it matches the side table. Another example is memory management call.

Pracitically, if the guest and host OS are different, there is relatively little likelihood that a complete compatible OS emulation can be performed. Some compromise and approximations will be required. A compromise often used is to restrict the application supported (thus limiting the system calls).

Process VM

This VM provide a virtual environment at the program or process level. This allow program compiled for one system to run in a different system. Program are compiled, distributed and stored as executable binaries that conforms to a specific ABI which include hardware instruction set and OS interface.

Process VM contains the following components:
(1) Loader writes the guest code and data into a region of memory and loads the runtime code.
(2) Loader passes control to the initialization block which allocates memory space for code cache and other tables used during emulation. It also involves host OS to establish signal handlers for all trap conditions.
(3) Emulation engine uses interpretation or binary translation to emulate guest instructions.
(4) Code cache manager to maintain the cache
(5) Profile database contains dynamically collected program information that is used to guide optimization during translation.
(6) When guest program performs a system call, the OS call emulator translates the OS call into approapriate calls to the host OS and handles any associated information returned as a result.
(7) The runtime must also handle traps that may occur and interrupt that is directed at the guest process using exception emulator

Process VM state mapping

It refers to the mapping of register and memory of guest process to the host address space. In the host addres space, guest registers could be mapped to the host regisers or in register conext block in memory (runtime data) of the host address space. The guest code and data will be map into memory, together with the emulatior (runtime code).

Memory mapping from guest to host address space conceptually uses a mapping table. The mechanism is similar tothe virtual-to-real address translation (base address translation then forming the address by adding offset). This emulation using software has high overhead. This approach is most flexible as consecutive memory blocks in guest can be dispersed in non-consecutive blocks in guest address space. To simplify the translation, we can cosnider address space mapping methods that rely more on the inderlying hardware than VM. Both cases assuming the host address space is larget than the guest:

(1) the guest address space is map continuously in the host address space above the runtime code. In this case, the host address = guest address + (length of runtime)
(2) the guest address space is map continuously in the host address space starting at the same offset. in this case, host address = guest address. Runtime is relocated to a location above the guest address space.

It is apparent that the relative size of guest and host address space has significant implicaiton to the choice of mapping method. Whether the run time can be placed in arbitrary area outside the confine of guest address apce is also another improtant factor.

Emulator needs to deal with memory model as it mimic the OS the guest process thought run in. For example, guest process may allocate a memory block with some protection setting and emulator needs to mimic the memory model to be compatible. In general, user application sees 3 main features:
(1) overall structure of the address space e.g. segment or flat
(2) access privilege (R, W, E)
(3) protection and allocation granularity - smallest unit that the OS can allocate and protect for the application.

The complexity of mapping of a page in guest to host depends on the relative page size and protection types available in both platform. If the host page size is larger and protection types is more comprehensive than the guest, it is possible tomap the guest page to host page directly thus letting the underlying hardware to enforce the allocation and protection. Otherwise, some software mapping and interfence from EM is required which is more complex and slow.

Incremental Predecoding and Translation

To tackle the code discovery problem, a general solution is to translate the binary while the program is operating on actual input data, i.e. dynamically, and to predecode or tranate new sections of code incrementally as the program reaches them.

There are 4 components:
(1) The emulation manager (EM) controls the overall flow.
(2) The interpretor translates the source code to intermediate code
(3) The translator translates to the target binary code
(4) The map table associates the SPC, for a block of source code with TPC for the corresponding block of translated code. The map table is typically implemented using hash table.

The system translates one block of code at a time. The unit for translation is called dynamic basic block. This is different from the static basic block which determined by the static structure of the program. A static baisc block starts a a branch label and end at a branch instruction. For example,

- - - - - - - - - start of block 1
add...
load....
store...
- - - - - - - - - end of block 1/start of block 2
loop: load...
add...
store
brcond skip
- - - - - - - - - end of block 2/start of block 3
load...
sub...
- - - - - - - - - end of block 3/start of block 4
skip: add...
store
brcond loop
- - - - - - - - - end of blcock 4/start of block 5
add...
load...
store...
jump indirect
- - - - - - - - - end of block 5

A dynamic basic block is determined b the acutal flow of a programas it is executed. It begins at the instruction executed immediatelhy after a branch or jump, follow the sequential stream, and end with the next branch or jump. For example,

- - - - - - - - - start of block 1
add...
load....
store...
loop: load...
add...
store
brcond skip
- - - - - - - - - end of block 1/start of block 2
load...
sub...
skip: add...
store
brcond loop
- - - - - - - - - end of blcock 2/start of block 3
loop: load...
add...
store
brcond skip
- - - - - - - - - end of block 3/start of block 4
skip: add...
store
brcond loop
- - - - - - - - - end of block 4

Incremental (staged) translation works as follow. After the source code binary is loaded into memory, the EM begins interpreting the binary using a simple decode-and-dispatch loop or indirect threaded method. When the EM reaches a bracnh or jump, the SPc-TPC mapping table is consulted.

If it is a miss, the profile table is checked to see if the next block is hot (i.e. it has been used a few times above a predefined threshold). If it is not, update the profile table and control is passed back to the interpreter. If it is hot,the next block is translated and placed into the code cache, and SPC-TPC mapping table is updated.

If it is a hit, the EM transfer to the code cache. Exection start in the code block and follows the linked block until the forward link end and control passes back to EM.

When the interpreter or translated code hit an OS call, control is passed back to the EM. The instruction is processed by the OS emulator. When the Os emulator returns, the control passes back to EM which do a SPC-TPC look up and continue. Similarly, when the code genreates an exception, control is passed to the exception emulator.

EM follows the path of the soruce program and either directly executes the next block or begin translating the next dynamic basic block. Incrementally, more of the program is discovered and translated, until eventually only translated code is executed.

Further optimization includes:
(1) Translation chaining - similar to threading, instead of branching back to EM at end of the block, the jump is target directly to the next block
(2) Software Indirect Jump Prediction - implementing indirect jumps by map table lookup is expensive in execution time. Instead, use a series of compare statement for the source address value to determine the jump address reduce the overhead. For example, Rx is the register holding the indirect jump target PC value,

if (Rx == addr1) go to target1
else if (Rx == addr2) go to target2
else table_lookup(Rx)

(3) Shadow Stack - when a translated code block contains a procedure call via an indirect jump to a target binary routine, the SPC value must be saved by emulation code as part of the source architected state, either in register or in memory stack. when the called procedure completes, it can restore this SPC value, access the map table and jump to the translated address. As map table resolution is expensive, the overhead can be avoided if the target return PC value can be made available directly. In this case, the return value of the target code is pushed onto a shadow stack maintained by the emulation manager. In case the return SPC may have changed during the call, the SPC is pushed onto the shadow stack too. upon return, the SPC on the stack is checked against the SPC for the shadow stack before the link address is used.

Binary Translation

The source architected register values are stored in a register conetxt block held in memory. If feasible, source register can map directly to target register for performance optimization. Static binary translation is not always possible because of the following code discovery problems:

(1) indirect jump of which the target is unknown at translation time
(2) padding bytes or data inserted by compiler to align instruction makes it difficult to determine where the next instruction starts after the last one. This is especially hard for ISA with variable instruction length

Another type of problem is code-location. In an indirect jump of which the destination address in the register is a source code address. The address must be mapped to a address int the translated code.

Threaded Interpretation

The basic interpretator consists of a central loop which drive the interpretation as follow:

While PC->instruction
decode instruction
dispatch to emulation routine
PC = PC + 1
end

Instead of using the central decode-dispatch loop, append the decode and dispatch logic at then end of the emulation routine to speed up by reducing the number of branches in the basic interpretator. The emulation routines are threaded together indirectly through a table and thus is called indirect threaded interpretation.

Predecoding can achieve further efficiency. Specificially, predecoding involves parsing an instruction and putting it in a form that simplifies interpretation. For example the following is a list of instruction in predeconded form

struct instruction {
unsigned long op
unsigned char dest;
unsigned char src1
unsigned int src2
} code [CODE_SIZE]

If same source instructions are interpreted, the intermediate form can be reused. Because the intermediate code is separate from the source code, a Target PC (TPC) is used to track the execution of the intermediate code. The SPC and TPC may have no direct coorelation. Thus both values must be tracked.

The op field can be further optimized to hold the actual address of the emulation routine. Thus saving an indirect look up and jump operation. This is called direct threading.

Codesigned VM

conventional VM focus on portability and functionality (to emulate differet platform), never on performance as emulating often added overhead and make execution on VM less efficient than those running on the native hardware. Codesigned VM are designed to enable innovative ISA and to improve performance or power efficency or both. In a codesigned VM,there is no native application. The VM is part of the hardware implemenation. The software portion of the codesigned VM uses a region of memory that is not visible to any system or application. This concealed memory is carved out of real memory at boot time and conventional guest software is never informed of its existence. VMM code resides in the concealed memory can take control of hardware practically any time. In the most general form, the VM software perform binary translation and cache the code in the concealed memory for execution. Hence th guest never executes directly on the native hardware. IBM AS400 uses many codesigned VM techniques. The primary objective is support for an object-oriented instruction set that redefines the hardware/softare interface in a novel fashion. The current AS400 implementation are based on an extended PowerPC ISA.

System Level VM

System Level VM were first developed during 1960s for mainframe which the hardware is expensive and differnt group of users wanted different OS. A single host hardware platform can support multiple guest OS simultaneously. At present, themost important features of System Level VM is to partition major software system securely.

Platform replication is the major feature provided by a VMM. The central problem is that of dividing a single set of hardware resources among multiple operating system envrionments. The VMM has access to and manages all hardware resources. When guest OS performs a privileged instruction, it is intercepted by VMM, checked for correctness and exectured on behalf by the VMM. All being done transparently.

One way to build system level VM is to have VMM sits on the hardware and the guest OS sits on top of the VMM. One disadvantage is that the original OS must be wiped out to install the VMM. Another disadvantage is that the VMM must contain device drivers becuase it interacts directly with the underlying hardware. An alternative is to build the VMM on the top of an existing OS. The installation is similar to an application and the VMM can use services from the existing OS. However, the performance will be hit as there are more layer. An example of hosted VM is VMware 2000)

Process Level VM

Process Level VM provides application witha virtual ABI environment. Process VM exhibits in various form

(1) Replication - e.g. multiprogramming - most operating system can support mulitple user processes. In other words, the operating system provides a replicated process-level VM for each of the concurrently executing applicaitons. The host OS can be same of different. The most straightforward emulatio is interpretation. The interpretor emulates the source ISA. The process can be relatively slow as each source instruction may require tens of native target instructions.

(2) Emulation - For better performance, binary translation is used. Block of source instructions are converted to target instruction set, which can be cached and reused. Interpretation has relatively low start up overhead but high execution time. On the other hands, dynamics binary translatir has high initial overhead but is fast for repeated execution. some VM use a staged emulation strategy combined with profiling (i.e. collect statistics regarding the program's behaviour). Initially, a block of source instruction is interpreted and profile track how frequently the block is executed. Binary translation is used for block with repeated execution.

(3) Optimization - In addition to emulation, the target code can be optimized. This leads naturrally for VM where soruce and target instruction set are the same and optimization is the the primary purpose of the VM.

(4) Platform independence - Emulation translates from a specific source ISA to a specific target ISA. A virutalized ISA can be used for ultimate protability. The VM environment does not dorectly corrrespond to any real platform. Rather, it is designed for ease of protability and to match the features of a high level language. The HLL VM focuses on minimizing hardware-specific and OS-specific features which would compromise protability. Examples of HLL VM are Java VM and MS CLI (.Net). These VM uses bytecodes (each instruction is encoded as a sequence of bytes) which are stack based (to eliminate register requirement). Memory size is conceptually unbounded with garbage collection as an assumed part of the implementation.

Instruction Set Architecture (ISA)

It marks the division betweein hardwar eand software. The concept of ISA was first clearly articulated when IBM 360 familiy in early 1960. The importance of software compatibility was fully recognized. IBM 360 has a number of model incorporated with a wide range of hardware resources but all of them could run the same software. There are 2 parts of ISA. User ISA is visible to application program, System ISA is visible to supervisor such as operating systems.

The application binary interface (ABI) provides a program with acc ess to hardware resources and services available in the system. ABI contains all user instruction. It also contains system call interface which allows application to invoke operating system to perform works on behalf of itself.

The application programming interface (API) is usually defined with respect to a high level language. The API can include system call provided by the operating system (wrapper). API enable applications written to be ported easily (via recompilation) to any sytem that supports the sampe API.

Trust Computing Base (TCB)

This is part of the computing system that absolutely must be trustworthy if we are going to get anything done. This usage of trusted may seem conterintuitive: We do not trust the TCB because it's worthy of trust but rather because we have no choice. consequently, it's important both to know what the TCB is and to keep it as small as possible. This way, we have more assuarance that what's trusted is also trustworthy. The TCB is defined also indirectly as security perimeter that separate it from the rest of the system. The reference monitor is somtimes called security kernel. The user had better be sure whether he is communicating to the TCB. Trusted path denotes a mechanism through which the user can do this. A secure attention key is the mechanism used by user to establish the channel.

Sunday, July 10, 2011

Booting linux

Bootloader in an embedded system is the first code to run when the system is powered on. Bootloader typically is stored in BIOS or flash memory. It performs low level hardware initiailization and then pass control to the linux kernel.

Some architecture and bootloader (e.g. Power with U-Boot) can boot the vmlinuex directly (after converting ELF to binary form). In this case, the image is called uImage (a compressed vmlinux in U-Boot header). In other architecture, an intermediate step is required to set the right context for vmlinux before control is handed over.

vmlinux

IT is the Linux monolithic kernel in ELF format. This is binary and contains no unresolved references.

/arch/arm/kernel/head.o is the an architecture specific (ARM in this case) that perform low level kernel intialization. This is executed first when the kernel is loaed and passed control to by the bootloader.

init_task.o set up the initial thread and task structures that the kernel requires.

The largest object module making up kernels are filesystem code, network code, built-in drivers code and the kernel (which contains scheduler, process and thread management, timer and other core functions).

The /arch/ARM/kernel contains specific architecture functionalities such as low-level context switching, hardware level interrupt, processor exception handling etc.

Flash Memory

Flash memory can be written to and erased under software control. Speed is considerably slower than hard disk. Flash memory is divided into relatively large erasable units (blocks). In a NOR flash memory chip, data can be changed from a binary 1 to 0directly to the cell address, one bit or word at a time. However, to change from 0 to 1, an entire erase block must be erased using a sequence of control instructions to the flash chip.

Flash memory erase block can be uniform in size or variable. The smaller block can store the bootloader and the kernel or data are kept in larger block. This is commonly called boot block or boot section chip.

To modify data stored in Flash memory array, the block in which the modified data resides must be completely erased. As the block size for Flash is much larger than typical hard disk (512 or 1K bytes), the wrtie time of Flash can be many times of hard disk.

Another limitation for Flash is there is write lifetime. Write may fails after the lifetime (100K) is exceeded.

NAND Flash is newer technology. It has smaller block. While NOR flash uses parallel address and data lines. NAND flash use proprietary serial interface. Lifetime for NAND fash is also significantly higher.

Sunday, April 10, 2011

Drawing 3D

Perspective, or angles between the lines lends the illusion of depth. With perspective, 3D object can be depicted in 2D screen.

When one covers one eye and looks, the world still appears as in 3D because perspective alone is enough to create the 3D effect. Another obvious clue is nearby objects appear larger than the distant ones. This perspective effect is called foreshortening.

To enhance 3D effect in drawing on 2D, another technique is to remove hidden lines and surfaces from the object on screen. Applying different color to different side of object simulates the effects of light. Adding shadow further enhances the illusion.

Texture mapping takes an image, such as a photo and applies it on the surface of an object to give it a realistic looks. For example, applying wood texture onto a cube to make it looks like a box. The process of stretching and compressing texels (texture elements) is called filtering.

Fog adds haziness to objects in a scene depending on the distance where objects are.

Blending is the combination of colors or objects on the screen, similar to superimposing 2 images. By varying the amount of each object with the scene, one can make the effect of transparency. Blending can create the illusion of reflection.

Antialiasing blends the lines with the background color to eliminate the jagged edges.

Saturday, April 2, 2011

CISC and RISC

If RISC is faster, why was CISC predominant in the past? Back then, computer had very little storage. An instruction that could roll up several steps of complex operation, such as a do-loop, into a single instruction was an advantage, because memory was precious. In addition, instruction fetch was piecemail and sequential. Reducing the number of instruction saved time.

As it turned out, assembly langauage progreammers used the complicated machine instructions but compilers generally did not. It was difficult enough for compiler to recognize when a complicated instruction could be used, but the real problem was code optimizations. Verbatim translation of source constructs isn't very efficient. An optimizing compiler works by simplifying and eliminating redundant computations, After optimization, opportunities to use the complicated instructions tended to disappear.

On the other hand, there were several pushes for the development of RISC.
(1) The number of transistors that could fit on a single chip was increasing.
(2) Techiques like pipelining was difficult to be implemented for variable length instruction common in CISC design.
(3) As compilers improved, the resulting code generated for RISC out-performed equivalent complicated multi-cycle instructions in CISC.

DRAM and SRAM

DRAM are charge-based devices, where each bit is represented by a electrical charge stored in a very small capacitor. The charge can be leaked away in a short amount of time, so the system has to continuously refreshed to prevent the data from being lost. The act oreading also discharges te bit, requiring refresh. It is not possible to read the memory bit in DRAM while refresh takes place.

SRAM is based on gates and each bit stored in four to six connected transistors. SRAM retains the data as long as there is power, without the need to refresh.

Access time is the amount of time takes to read or write a memory location. Memory cycle time describe how often you can repeat reference to the same memory chip.

Fast page mode DRAM saves time by allowing a mode in which the entire address doesn't have to be re-clocked into the chip for each memory operation. Instead, there is an assumption that the memory will be accessed sequentially and only the low-order buts of the address are clocked in for successive reads or writes.

EDO RAM is a modification to output buffering on page mode RAM that allows it to operate roughly twice as fast for operation other than refresh.

Synchronous DRAM is sychnorized using an external clock that allows the cache and DRAM to coordinate their operations. Also SDRAM can pipeline the retrieval of multiple memory bits to improve overall throughput.

Cached DRAM contains a SRAM cache on the same chip as the DRAM. This improves performance to close to SRAM.

Saturday, March 12, 2011

Form

Divide music into parts and convential muscial theory gives alphabet labels to the miscial parts within a composition.

1-part form: A
e.g. Happy Birthday - is the most primitive song structure sometimes refered as the air or ballad form. The melody is repeated with slight variation.

2-part form: AB
Binary form contain 2 contrasting sections functioning as statement and counterstatement. e.g. Greensleeves in AABB

Song form: ABA
3-part or tertiary form. One of the simpliest way to write in song form is to vary and repeat the melody e.g. Twinkle twinkle little star. Another variation of song form is AABA with B bridging the stretched A.

Arch form: ABCBA

Classical form
- sonata - common in period from mid 1700 to early 20 century. Sonata are based on song form (ABA). The first part is the exposition which presents the main theme of the song as well as 2 or 3 minor themes. The second part (or B part) is called development often sounds belonging to another musical piece, usually in different ey and even different time Signature. The 3rdpart is to return to the theme and called recapitulation.

Sonata has its roots in language - in this case, the sonnet. In sonnet (like those Shakespeare wrote), the first quatrain (4-line verse) consists of 2 set of rhyming pairs (A part), whereas the second verse consists of 2 comptelely dofferent set of rhyming pairs (B part). The 3rd verse retruns to the rhyming A section.

- rondo (ABACA) or ABACADA. The A seciton is called refrain ties the pieces together. B, C and D are called episodes can be in any key or time signature. Rondo also has its roots in poetry. In 13th century France, the rondeau was a popular form of street poetry which often set to music

- Concerto - contrasts between having a large ensemble playing a section of music and a solo/smaller group playing the same or similar section.

- Symphony - performed by an orchestra. It combines several different muscial forms into one piece of music harmoniously. There are traditionally 4 movements in a symphony
(a) Sonata Allegro
(b) Slower movement
(c) Minuet or scherzo
(d) combination of sonata and rondo, a thematic repeat of the first movement.

- Fugue - first form that fully utiliuze the left hand of the pianist. The treble clef and the bass lef take turns carrying the melody.

- Divertimento - a light and short form of instrumental chamber music has several short movement chiefly for entertainment value.

- Minimalism - extremem simplication of rhythms, patten and harmonies. Prolonged chordal or melodic repetitions and often in trancelike effect - AAAA....

- Through composed - no repetitious of themes

Popular forms
- blues - AABA patternof I, IV and V

- rock - 32 bar compounded blue. One plays the first 32 bars, go to the second bridge and repeat the first 32 bars again e.g. every breath you take

- verse-chorus form: Intro ABAVBCB
Intro us usually instrumental and set the mood, or short spoken piece
A (verse) being the story of the song
B (chorus) is the hook of the osng, both lyrically and musically, should be the most memorable, anthemic part of the song. Often the title too
A (verse) is the part 2 of the story
B (chorus) reinforce the hook to make it memorable
C (bridge) can be instrumental or lyrical and is differnt sounding than the verse or chorus
B (end, chorus) repeat chorus to fade or just stop at the I chord (candence) after one time through

Cadences

Music has commas, full stops and paragraph used in the same way as storytelling. The term for phrase (sentence) ending is called cadence.

Cadence means return to the I/i chord from iv or V chord. The stopping position of music that does not make sense to listener will not be fulfilling. Audiences expect a certain closure.

Authentic (Perfect) Cadences - V/v to I/i

Plagal Cadences IV/iv to I/i. It is used to end a phase rather than the song because they do not sound decisive as Authentic Cadences.

Deceptive or interrupted cadences is to resolve the V/v to any chord other than I/i. The most common ones are V/v to VI/vi. This is one of the weakest cadence as it makes one feels incomplete. Other forms are V to IV/ii/V/VIIl.

Half (Imperfect) Cadences stop at IV, iv, V or v and not resolved to I (thus half of the cadence).

Chord Progression

Certain sequences of chords sound nicer than the others. Over time, a consensus devleoped to become the rules of chord progression. The rules arre based on a concept called chord leading - certain chords naturally lead to other chords.

Major chord leading
I can appear anywhere in a progression (any chord can go to I)
ii to I, IV, V, or viio (minor seventh dim)
iii to I, ii, IV or vi
IV to I, ii, iii, V or viio
V to I or vi
vi to I, ii, iii IV or V
viio to I or iii

Minor chord leading
i can appear anywhere in a progression
iio or ii to i, iii, V, v, viio, VII
III or III+ to i, iv,IV, VI, #vio, viio, VI
iv or IV to i, V, v, viio or VII
V or v to i, VI or #vio
VI or #vio to i, III, III+ iv, IV, V, v, viio or VII
viio or VII to i

Common chord progressions
(1) I, IV, V and V7 (C, F, G, G7)
e.g. I-IV, I-V, I-IV-V or V7, I-IV-I-V of V7, I-IV-V-IV

(2) mixing other chords with combination of IV and V
e.g. I-ii-IV-V, I-ii-IV, I-V-vi-IV, I-vi-ii-V or V7, I-vi-IV-V, I-vi-ii-V7-ii

(3) starting with non-tonic chord
e.g. ii7-V7-I, IV-I-IV-V

(4) 12-baror blues progression
I-IV-I-V7-IV-I (C, F, C, G7, F, C)

(5) circles of fifths - each chord is a diatonic fifth above the following chords which makes each chord functionally a dominant for the next
I-IV-viio-iii-vi-ii-V-I (C, F, Bdim, Em,Am, Dm, G, C)

Chord Composition

A major triad is built of 2 intervals - a major third and a further minor third on top it. Another way to look at this is a major third and a perfect fifth

A minor triad composed of a minor third and a further major third on the top. In other words, a minor third and a perfect fifth

An augmented triad composed of a major third plus a major third, or in other words, a major third and an augmented fifth.

A diminished triad composed of a minor third and another minor third on top, or in other words a minor third and a diminished fifth.

C- Major chords are happy, simple, honest and bold composed of 1,3,5
Cm - Minor chords are sad and serious composed of 1,3b,5
Cmaj7 - Major seventh are pretty, delicate, sensitive and thoughtful composed of major+7
Cm7 - Minor seventh are pensive, moddy and introspective composed of minor+7b
C7 - Dominant 7 are sassy, outgoing and strong composed of major+7b
C6 - Major sixth are playful composed of major+6
Cm6 - Minor sixth are dark, sensuous, troubled composed of minor+6
Csus4 - Suspended 4 are regal and material composed of C without third (sus part)+4 (i.e. 1,4,5). This chord creates a harmonic suspension and could be resolved by returning to the original triad
C9 - Ninth are energetic and lively composed of major+9 (or 2 raised to an octave)
Cm9 - Minor ninth are sad, tender and complex composed of minor+9
Cdim - Diminished are dark, strained and complex composed on a Minor with its fifth flatted (i.e. 1,3b,5b)
C+ - Augmented are anticipatory and full of movement composed on Major with its fifth sharped
Cm7/b5 - Minor 7, flat 5 or half diminished are disparaging, sorrowful, difficult and deep composed of Cm7 with 5b

In summary,
Cmaj7 = C + 7
C7 = C + 7b
C6 = C + 6
C9 = C + 9 = C + 2
C+ = C with 5#

Cm = C with 3b
Cm7 = Cm + 7b
Cm6 = Cm + 6
Cm9 = Cm + 9
Cdim = Cm with 5b

Csus4 = C - 3 + 4

Wednesday, January 26, 2011

Major and Minor Interval

Interval is the distance between two notes. In a scale, interval is measured from the root note. The root note is counted as 1.

For example, in C Major:
C-C
C-D = Major Second
C-E = Major Third
C-F = Perfect Forth
C-G = Perfect Fifth
C-A = Major Sixth
C-B = Major Seventh
C-C = Perfect Octave

For C Minor:
C-D = Major Second
C-Eb = Minor Third
C-F = Perfect Forth
C-G = Perfect Fifth
C-Ab = Minor Sixth
C-Bb = Minor Seventh
C-C = Perfect Octave

Major interval can only be used for the 2/3/6/7 notes in Major scale

Minor interval can only be used for the 3/6/7 notes in Minor scale. Minor interval is always half step shorter than the major interval of the same name.

Pythagoras, born circa 580 BC, initiated the idea that if a vibrating string is divided in half, it produces the octave above the open vibrating string. In other words, the ratio of the fundamental string to to the divided string is 1:2 will create a octave interval.

Perfect intervals (unison, forth, fifth and octave) is perfect because they are produced from perfect or pure frequency ratio. For fifth, the ration is 2:3 and for forth is 3:4

Identifying Key

The key signature does not differentiate Major or Minor key. 99.99999% of time, Minor key seldom appears in its natural form. Usually the melodic or harmonic minor keys are used in music. And both forms contain a leading note (raised 7th). Major key does not need to do so because the 7th note is already raised. So if you found one note is sharped frequently and that is the leading note of the related minor key, the piece is likely in the minor key.

For example, if to know if the piece is A minor, look for appearance of G# in the piece.

Monday, January 24, 2011

Scale

(1) Pentatonic scale comprises 5 notes (1, 2, 3, 5 6) which simple fractional relationship with each other in term of frequency.

1, 1+1/8, 1+1/4, 1+1/2, 1+2/3, 2 or
1, 1+1/8, 1+2/8, 1+4/8, 1+6/9, 2

There is no semitones within the pentatonic scale that make the notes sound good together. The scale has the following interval pattern:

1 1+1/8 1+1/4 1+1/2 1+2/3 2

The interval between the third and forth notes is more than 1 tone. Likewise between the fifth and the next octave note.

(2) Major scale is built from the pentatonic scale plus two notes. Two note was added in the interval higher than 1 tone - sub-dominant and the leading note. The pentatonic notes have the follow relative frequency to the tonic:

The two additional notes are 1+1/3 (which is 1 semitone above 1+1/4) and 1+7/8 which is 1 tone above 1+2/3 and 1 semitone below 2 (octave).

1, 1+1/8, 1+1/4, 1+1/3, 1+1/2, 1+2/3, 1+7/8. 2

Major scale has the following interval pattern:

WWHWWWH

A drawback of the major scale is the tendency towards definite, complete statements.

(3) Minor scale has less definite full stops. Minor scale takes the same set of note as a major scale but use a different tone as the root. For example, C manor and A minor uses the same set of note but the root (tonic) is C and A respectively.

There are 3 different forms of minor scale.

Natural minor scale has the following interval pattern:
WHWWHWW

Comparing to the major scale, 3 notes (including the leading note) move down 1 semitone.
Comparing C Major to C Minor, the 3, 6 and 7 notes (E A B) are flat in C Minor.

C, D, E, F, G, A, B, C
C, D, E♭, F, G, A♭, B♭, C

The relative frequency is:

1, 1+1/8, 1+1/5, 1+1/3, 1+1/2, 1+3/5, 1+3/4, 2

Natural minor lacks the leading note. So it does not sound good when the melody going up as the seventh and the octave note is too far apart. It sounds good for melody travelling down in pitch (descending melodic minor).

For the ascending melodic minor, 2 of the notes are re-instate to their major positions. The leading note is restored and its neigbouring note also to smoothing out the big gap. The sixth note also bumped up a semitone so that it does not sound too far apart from the leading note too.

Ascending melodic minor scale has the following interval pattern:

WHWWWWH

C, D, E, F, G, A, B, C
C, D, E♭, F, G, A, B, C

The relative frequency of ascending melodic minor are

1, 1+1/8, 1+1/5, 1+1/3, 1+1/2, 1+2/3, 1+7/8, 2

For composing melody, we use the natural minor for the part going down and melodic for the part that going up. However, the two minor scales pose a problem for creating chord which sounds good for both directions. The harmonic minor is a compromise by only restoring the leading note from the natural minor.

WHWWHW#H

The 6 and 7 note is now 1.5 notes apart. This creates a melodic awkward.

Classical musician practice minor scales by playing melodic minor as the scale ascend and natural minor in descend. Harmonic minor scale is rarely practised. This is more a harmonic tool and not a melodic one.

Wednesday, January 12, 2011

Key Signature

The sharp and flat have fixed position and sequence in key signatures.

#: Father Charles Goes Down And Ends Battle
b: Battle Ends And Down Goes Charles's Father

Sharp Key Trick - For keys that have sharp, find the last sharp and go one note higher. If the note is sharped in the key signature, the key is sharped.

For example, for A Major, there are 3 sharps. The last one is G. So the key is A Major. For F# major with 6 sharp, the last sharp is on E. since note F is sharped, the key is F# Major.

Flat Key Trick - Find the second last flat (from the right). The name of the flat note is the name of the key. Similarly, if the note is flatted in the key signature, the key is flatted.

For example, Ab Major with 4 flat, the last flat is D and the second last flat is A. As note A is flatted in the key signature, the key is Ab major