Thursday, July 21, 2011

Linux Bootable Image

After the vmlinux kernel ELF has been built, it is further processed (via objcopy) to strip off the redundant sections (comments and notes). The output is a binary file called Image, which is then compressed using gzip:

cat Image | gzip -f -9 > piggy.gz

Next the bootstrap loader is built. The bootstrap loader is the second-stage loader, which prepares the context for the Linux kernel to run in. It is different from the bootloader (first-stage loader), to which control is passed when the hardware is powered on. The bootloader performs low-level initialization and provides diagnostic utilities.

The bootstrap loader performs the following functions:
(1) head.o and head-"arch".o - low-level assembly-language processor initialization, including enabling the processor's internal instruction and data caches, disabling interrupts, and setting up a C runtime environment
(2) misc.o - decompresses and relocates the Linux kernel
(3) other initialization

The bootstrap loader contains an assembly program called piggy.S. The program uses an .incbin directive to include the binary file piggy.gz. In other words, when piggy.S is assembled, the compressed kernel is piggybacked into the bootstrap loader. The bootstrap loader also includes other object code, such as head.o, to form the bootable kernel image called zImage.

In summary, the bootable kernel image contains the following code:
(1) piggy.o - an assembly wrapper around piggy.gz, which is a compressed vmlinux without the notes and comments sections
(2) head.o - low-level processor initialization
(3) head-"arch".o - architecture-specific initialization
(4) misc.o - kernel decompression and relocation
(5) others

Wednesday, July 20, 2011

Semaphore

It was invented by Dijkstra in 1965 as a generalization of the critical region. A semaphore is assigned an initial count. As long as the count is not zero, a thread can continue to decrement the count without waiting. When a thread leaves, it increments the count. Dijkstra named the two operations P, from the fictitious Dutch word "prolaag" meaning "try to lower", and V, from the Dutch word "verhoog" meaning "to increase".

A semaphore with a value of 1 is called a binary semaphore. A semaphore whose value can exceed 1 is called a counting semaphore.
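
A minimal sketch in C using the POSIX semaphore API (this example is mine, not from the notes): sem_wait is the P operation and sem_post is the V operation.

#include <semaphore.h>
#include <stdio.h>

int main(void) {
    sem_t sem;
    sem_init(&sem, 0, 1);      /* initial count 1: a binary semaphore */

    sem_wait(&sem);            /* P ("prolaag"): decrement, or block if count is 0 */
    printf("inside the critical region\n");
    sem_post(&sem);            /* V ("verhoog"): increment, possibly waking a waiter */

    sem_destroy(&sem);
    return 0;
}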

Dual-mode Hosted VM (e.g. VMware GSX)

There are 3 components:

(1) VMM-n (native) - This component runs in native (privileged) mode. It is the component that intercepts traps and patches critical instructions in the VM. It may also provide drivers, either for performance or for devices not supported by the host OS.

(2) VMM-u (user) - This component runs in user mode, appearing to the host OS as an ordinary application. It issues resource requests to the host OS, in particular memory and I/O requests, on behalf of VMM-n. VMM-u issues these requests using the system library functions provided by the host OS.

(3) VMM-d (driver) - This component provides a means of communication between VMM-n and VMM-u. It makes VMM-n appear as a special device to VMM-u.

The advantages of a hosted VMM are easy installation and the fact that the actual device drivers do not have to be incorporated into the VMM. The disadvantages are:

(1) VMM-n operates in privileged mode alongside the host OS. There can be compatibility problems, and the two may corrupt each other's memory.

(2) The allocation of resources is completely under the control of the host OS. The effects of allocation policies in the VMM are less predictable, because the VMM has insufficient knowledge of, and control over, the use of resources.

(3) There is a performance disadvantage compared to a native VM, because of the need to go back and forth between VMM-n and the host OS. The performance degradation is more significant for an I/O-intensive workload than for a user-mode CPU-intensive workload.

VM/370

The first VM environment was the IBM System/360 Model 40 VM. VM did not become mainstream until later, with a model of the System/370. The VMM in VM/370 was called CP (Control Program). The CP design team also developed a single-user OS called CMS (Conversational Monitor System), mainly to demonstrate the advantages of modularization for system evolution. The CP/CMS design separates the function of resource management from the function of providing services to the user. CP and CMS can exist without each other; in fact, CMS was developed on a bare machine before CP existed.

Resource Virtualization

(1) Processor
The key aspect of virtualizing a processor lies in the execution of the guest instructions, including both system-level and user-level instructions. There are two ways. The first is emulation (interpretation or binary translation). Emulation is the only processor virtualization mechanism available when the ISA of the guest is different from that of the host. Emulation may also be required even when the host and guest ISAs are the same, for instructions that interact with hardware resources and must operate differently on a virtualized processor than on a real one.

The second method uses direct native execution on the host machine. This is possible only when the host and guest ISAs are the same. To minimize performance degradation, one basic design goal for a system VM is to execute a significant fraction of the instructions directly on the native hardware. The overhead of emulating the remaining instructions depends on the number of such instructions, the complexity of discovering them, and the data structures and algorithms used for emulation.

(2) Memory
In a system VM, each guest VM has its own set of virtual memory tables. A guest's real address must undergo a further mapping to derive the physical address. Memory resource virtualization is done differently depending on whether the page table or the TLB is architected. If the page table is architected, its structure is defined by the ISA, and the OS and hardware cooperate in maintaining and using it; the TLB is maintained by hardware and is not visible to the OS. On a TLB miss, the hardware walks the page table to find the new entry to load. If the mapping is not found, a page fault results and the OS takes over. If the TLB is architected, its structure and the special instructions used to manipulate it are defined in the ISA, and the hardware is unaware of the page table. On a TLB miss, a trap is delivered to the OS, which handles the miss itself.

Most older ISAs use an architected page table. Some of the more recent RISC ISAs use an architected TLB.

Virtualizing architected page tables - Each guest OS maintains its own page tables. The virtual-to-physical address mapping is maintained by the VMM in shadow page tables, one for each VM. These tables are actually used by the hardware to do the address translation and to keep the TLB up to date; they eliminate the additional translation required from real to physical addresses. The VMM controls the real page table pointer: when the VMM activates a guest, it loads the page table pointer with the correct shadow page table for that guest. A guest OS attempt to read or write the page table pointer results in a trap to the VMM. The trap is generated either because these instructions are privileged or via wrapper code around them. For a read, the VMM returns the value of the guest's virtual page table pointer (kept in a memory block). For a write, the VMM updates both the virtual and the real pointer. When a page fault occurs, the page may or may not have been mapped in the guest's virtual page tables. If it is mapped, the fault is handled entirely by the VMM and the guest is not notified. If it is not mapped, the VMM transfers control to the page fault handler of the guest OS. However, many of the instructions issued by the page fault handler must be intercepted and handled by the VMM, because these instructions are privileged or because the memory where the virtual page tables reside is write-protected by the VMM. Doing I/O to a real address is also tricky, as the physical addresses may not be contiguous the way the real addresses are; the I/O operation may need to be broken up into multiple operations on discontiguous blocks of memory. These operations can degrade performance substantially.
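
A rough sketch of the idea in C (the toy page tables and names here are invented for illustration): the shadow entry is the composition of the guest's virtual-to-real mapping with the VMM's real-to-physical mapping, so the hardware can translate guest-virtual addresses in a single step.

#include <stdio.h>

#define NPAGES 16
#define INVALID -1

int guest_page_table[NPAGES];   /* guest virtual page -> guest real page (guest OS) */
int vmm_real_map[NPAGES];       /* guest real page -> host physical page (VMM) */
int shadow_page_table[NPAGES];  /* guest virtual page -> host physical page (hardware) */

/* On a shadow-table miss, the VMM composes the two mappings into one entry. */
int shadow_fill(int virt_page) {
    int real_page = guest_page_table[virt_page];
    if (real_page == INVALID)
        return INVALID;                         /* unmapped: reflect the fault to the guest OS */
    int phys_page = vmm_real_map[real_page];
    shadow_page_table[virt_page] = phys_page;   /* invisible to the guest */
    return phys_page;
}

int main(void) {
    for (int i = 0; i < NPAGES; i++) {
        guest_page_table[i] = INVALID;
        shadow_page_table[i] = INVALID;
        vmm_real_map[i] = (i * 7) % NPAGES;     /* arbitrary real-to-physical scatter */
    }
    guest_page_table[3] = 5;                    /* guest maps virtual page 3 to real page 5 */
    printf("virtual page 3 -> physical page %d\n", shadow_fill(3));
    return 0;
}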

Virtualizing architected TLBs - When the ISA provides a software-managed TLB, the TLB must be virtualized. The VMM must maintain a copy of the TLB for each guest and intercept the TLB-manipulating instructions so that it can keep these TLBs up to date. One approach is for the VMM to load the guest's TLB contents whenever a guest VM is activated. This TLB rewrite incurs a high overhead, especially for a large TLB. An alternative approach is to leverage the address space identifiers (ASIDs) that are part of the architected TLB. This allows the TLB to contain entries for several guests simultaneously. A special register containing the current ASID indicates which entries in the TLB are to be used for translation.
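
In the same toy style, a sketch of an ASID-tagged TLB lookup (entry layout invented): an entry participates in translation only when its ASID matches the current-ASID register that the VMM sets when it activates a guest.

#include <stdio.h>

#define TLB_ENTRIES 8

struct tlb_entry {
    int valid;
    int asid;        /* which guest (address space) owns this entry */
    int virt_page;
    int phys_page;
};

struct tlb_entry tlb[TLB_ENTRIES];
int current_asid;    /* loaded by the VMM on guest activation */

/* Returns the physical page, or -1 for a TLB miss (trap to software). */
int tlb_lookup(int virt_page) {
    for (int i = 0; i < TLB_ENTRIES; i++)
        if (tlb[i].valid && tlb[i].asid == current_asid &&
            tlb[i].virt_page == virt_page)
            return tlb[i].phys_page;
    return -1;
}

int main(void) {
    tlb[0] = (struct tlb_entry){1, 2, 3, 9};   /* guest with ASID 2: virtual 3 -> physical 9 */
    current_asid = 2;
    printf("ASID 2, virtual 3 -> %d\n", tlb_lookup(3));   /* hit */
    current_asid = 1;
    printf("ASID 1, virtual 3 -> %d\n", tlb_lookup(3));   /* miss: entry belongs to another guest */
    return 0;
}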

(3) I/O
This is one of the more difficult parts of implementing a system VM, because there are a large number of device types and a large number of devices of each type.

Dedicated devices - e.g. display, keyboard, mouse, and speaker. The device itself does not necessarily have to be virtualized. Requests to and from the device could theoretically bypass the VMM and go directly to the guest. In practice this is not the case, as the guest usually runs in user mode; the VMM captures the interrupt and passes it to the guest VM when that guest is activated.

Partitioned devices - e.g. disk. The VMM translates the parameters into corresponding parameters for the underlying physical device, using a map, and reissues the request. Similarly, the result is translated on its way back.

Shared devices - e.g. network adapter. Each guest has its own virtual state for the device, e.g. network address and ports. The VMM translates parameters and results as for partitioned devices.

Spooled devices - e.g. printer. A printer is solely under the control of one program until the printing is complete, even if that program is swapped out. The VMM maintains a spool table that consolidates the spool entries of each guest. This kind of printer virtualization was more appropriate for older line printers, which were expensive and directly attached to the machine. Nowadays, network printers usually have their own buffers and spool jobs from machines across the network.

Generally, the VMM can intercept a guest's I/O action and convert it from a virtual device action to a real device action at three interfaces:

Virtualizing at the I/O operation level - Memory-mapped I/O, common on many RISC platforms, performs I/O by reading and writing special memory locations, which the OS protects so that they are inaccessible to user-mode programs. Systems like the S/360 and IA-32 perform I/O using special instructions (e.g. SIO). This privileged nature makes such operations easy for the VMM to trap. However, a single I/O request from an application typically results in several of these low-level I/O operations, and reverse-engineering them to deduce the overall I/O action is extremely difficult in practice.

Virtualizing at the device driver level - This is straightforward and allows virtualization at a natural point. However, it requires that the VMM developer have knowledge of the guest OS and its device interface. Special virtual device drivers can be developed for each guest; these drivers are bundled and installed as part of the VMM. The work can be simplified by "borrowing" the drivers (and their interface) from an existing OS; in a hosted VM environment, the drivers of the host OS are used.

Virtualizing at the system call level - This is done by intercepting the initial I/O request at the OS interface, the ABI; the entire I/O operation is then done by the VMM. To do this, the VMM must shadow (emulate) the ABI routines, which is a daunting task.

System VM

Real resources of the host platform are shared among the guest system VMs. The VMM (Virtual Machine Monitor) manages the allocation of, and access to, the hardware resources.

Native VM - the VMM runs in privileged mode; the guest system (OS) runs in user mode. The privileged level of the guest ISA is emulated by the VMM.

Hosted VM - the VMM runs in user mode on top of a host OS. The VMM uses services provided by the host OS to control and manage the resources desired by each of the VMs.

Dual-mode VM - for efficiency, it is desirable to have at least part of the VMM run in privileged mode. This is done by extending the host OS through a standard interface, such as a kernel extension or a device driver.

Garbage Collection

To begin, it is first necessary to identify the root pointers. The root set must contain the references found on the stack, including both local storage and the operand stack, as well as references in the constant pool. The root set must also include references contained in static objects.

(1) Mark and sweep starts with the root references and traces through all the reachable objects, marking each one as it is reached. Marking may consist of setting a flag bit in the object or in a separate bitmap. The garbage objects found are then combined into a linked list of free objects. Overall, this is a relatively fast way of identifying and collecting garbage. However, the free objects are of varying sizes and are scattered across the heap, which leads to memory fragmentation and requires compaction. The inefficiency can be reduced by using segregated free lists, or in other words, by dividing the heap into sets of fixed-size chunks covering a range of sizes. (A sketch of the mark phase appears below.)

(2) Compacting collectors essentially slide the live objects to the bottom or top of the heap so that all live objects are adjacent; what is left is a contiguous region of free space. Although conceptually simple, a compacting collector is relatively slow, as it makes multiple passes through the heap: one pass does the marking, and subsequent passes compute the new locations of the live objects, move the objects, and update all reference pointers to the new locations. To reduce the number of reference updates, some systems use a handle pool: an object referenced by two other objects is pointed to through the handle pool, i.e. the two referencing objects point to the handle pool entry, which in turn points to the referenced object. This introduces another level of indirection.

(3) Copying collectors divide the heap into two halves, only one of which is in use at any one time. The collector copies the live objects from one half to the other half. The memory requirement of a copying collector is high compared with the other collectors.

(4) Generational collectors
Objects manifest a bimodal behaviour: most objects tend to be short-lived (a consequence of good object-oriented programming practice), while objects that are not short-lived tend to have very long lifetimes. The heap is divided into two subheaps. The nursery is for newly created objects and is garbage-collected more frequently. Any object that survives a number of collections in the nursery is moved to the tenured heap, which is collected less frequently.

(5) Incremental and concurrent collectors
All the above collectors stop program execution while they perform collection. The collection time can be spread out if collection is done incrementally. Also, in a multiprocessor environment, it is advantageous to collect using one thread while the other threads run the program. As a partially collected heap is in a state of flux, some synchronization between the collector and the program is needed; for example, once the collector has marked an object as live, the program may subsequently drop its references to it. One common solution is to provide write barriers on references to objects that are already marked.

Overall, mark and sweep provides good collection time.
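
A minimal sketch of the mark phase from (1), with a hypothetical object layout: a depth-first trace from the root set that sets a mark flag in every reachable object. The sweep would then walk the whole heap, linking unmarked objects onto a free list and clearing the mark bits.

#include <stdio.h>
#include <stddef.h>

#define MAX_REFS 4

/* Hypothetical object layout: a mark flag plus outgoing references. */
struct object {
    int marked;
    struct object *refs[MAX_REFS];   /* outgoing references, NULL if unused */
};

/* Depth-first trace: mark every object reachable from obj. */
void mark(struct object *obj) {
    if (obj == NULL || obj->marked)
        return;                      /* already visited: also handles cycles */
    obj->marked = 1;
    for (int i = 0; i < MAX_REFS; i++)
        mark(obj->refs[i]);
}

void mark_from_roots(struct object **roots, size_t nroots) {
    for (size_t i = 0; i < nroots; i++)
        mark(roots[i]);
}

int main(void) {
    struct object a = {0}, b = {0}, c = {0};
    a.refs[0] = &b;                  /* a -> b; c is unreachable garbage */
    struct object *roots[] = {&a};
    mark_from_roots(roots, 1);
    printf("a=%d b=%d c=%d\n", a.marked, b.marked, c.marked);   /* 1 1 0 */
    return 0;
}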

Process VM - Code Cache

A code cache differs from a conventional cache memory in at least 3 ways:
(1) The cache blocks do not have a fixed size; the size depends on the size of the translated target block.
(2) The presence and locations of the blocks are dependent on one another because of chaining. If a block is removed, the links to it must be updated.
(3) There is no backing store: when a block is removed, it must be regenerated from scratch.

Replacement strategies include:
(1) LRU - Because of temporal locality, this is often a good strategy for conventional caches. Unfortunately, because of the specific properties of a code cache, it is relatively difficult to implement. Firstly, there is the overhead of the structures needed to track accesses. Secondly, back pointers are needed to delink a removed block from the blocks that chain to it. Thirdly, blocks are of different sizes, which results in fragmentation. For these reasons, LRU is not typically used for code caches.
(2) Flush when full - The most basic algorithm is simply to let the code cache fill and then to flush it completely and start over with an empty cache. There are advantages to this approach. Over time, the frequently followed paths may change, and flushing provides an opportunity to eliminate control paths that have become stale and no longer reflect the common paths through the code. Flushing also removes orphans and reclaims their space. The disadvantage is that blocks must be retranslated from scratch, leading to a high performance overhead immediately after the flush.
(3) Preemptive flush - Many programs operate in phases, and a phase change is usually associated with a working-set change. When the phase changes, a new region of source code is entered and a larger percentage of time is spent translating blocks. The code cache can be preemptively flushed at that point to make room for the new translations.
(4) Fine-grained FIFO - A non-fragmenting algorithm that exploits temporal locality and is not a brute-force approach. The code cache is managed as a circular buffer: the oldest n blocks are replaced to make room for the new block. This scheme overcomes a number of the disadvantages of LRU (albeit at a slightly reduced hit rate), but it still needs to track chaining via back pointers. (A sketch appears after this list.)
(5) Coarse-grained FIFO - Partition the cache into large blocks (e.g. a block may be 1/8 of the cache size). With this, the back-pointer problem can be simplified or eliminated by maintaining back pointers only at the block level.
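
A rough sketch of the fine-grained FIFO policy from (4), with invented sizes and the chain delinking elided: the cache is treated as a circular buffer, and as many of the oldest blocks as necessary are evicted to make room for an incoming translation.

#include <stdio.h>

#define CACHE_SIZE 4096
#define MAX_BLOCKS 64

/* A translated block occupying [start, start + size) in the cache. */
struct cache_block { int start, size, valid; };

static struct cache_block blocks[MAX_BLOCKS];  /* kept in insertion (FIFO) order */
static int nblocks = 0;      /* blocks inserted so far (toy: no slot reuse) */
static int alloc_ptr = 0;    /* next free byte in the circular buffer */

/* Insert a translated block of 'size' bytes, evicting the oldest
   blocks that overlap the region being reused. */
static int cache_insert(int size) {
    if (alloc_ptr + size > CACHE_SIZE)
        alloc_ptr = 0;                       /* wrap around: reuse from the bottom */
    for (int i = 0; i < nblocks; i++)        /* evict overlapping (oldest) blocks */
        if (blocks[i].valid &&
            blocks[i].start < alloc_ptr + size &&
            blocks[i].start + blocks[i].size > alloc_ptr)
            blocks[i].valid = 0;             /* a real cache must also delink chains */
    blocks[nblocks] = (struct cache_block){alloc_ptr, size, 1};
    alloc_ptr += size;
    return blocks[nblocks++].start;
}

int main(void) {
    printf("block at %d\n", cache_insert(1500));   /* [0, 1500) */
    printf("block at %d\n", cache_insert(2000));   /* [1500, 3500) */
    printf("block at %d\n", cache_insert(1000));   /* wraps, evicting the first block */
    return 0;
}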

Process VM - OS Emulation

When the guest and host OS are the same, the problem of OS call emulation is primarily one of matching the OS interface syntax. The OS functions required by the guest are available in the host, but it may be necessary to move and reformat arguments and return values, possibly performing some data conversion in the process. For example, an OS running on a platform with few registers may pass arguments on the stack, while an OS on a platform with many registers may pass arguments in registers. In this case, the call must be set up by copying the arguments from the stack to registers when emulating a system call. Such code is called a wrapper.
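
A sketch of such a wrapper with invented conventions (the call and its argument layout are hypothetical): the guest passes system call arguments on its emulated stack, and the wrapper copies them out and reissues the call in the host's register-style convention.

#include <stdio.h>

typedef long word_t;

/* Hypothetical host system call that takes its arguments "in registers"
   (ordinary C parameters stand in for them here). */
long host_write(word_t fd, word_t buf, word_t len) {
    printf("host write(fd=%ld, buf=0x%lx, len=%ld)\n", fd, (unsigned long)buf, len);
    return len;
}

/* Wrapper: the guest pushed fd, buf, and len onto its emulated stack.
   Copy them out and call in the host's convention. */
long emulate_write(const word_t *guest_sp) {
    word_t fd  = guest_sp[0];
    word_t buf = guest_sp[1];
    word_t len = guest_sp[2];
    return host_write(fd, buf, len);   /* the return value goes back to the guest */
}

int main(void) {
    word_t guest_stack[3] = {1, 0x1000, 42};   /* fd, buffer address, length */
    emulate_write(guest_stack);
    return 0;
}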

Some calls may be handled by the runtime itself instead of being translated. One example is a call to establish a signal handler: the runtime owns all signals, so the guest's request is recorded in a side table instead, and the runtime passes a signal on to the guest only if it matches the side table. Another example is memory management calls.

Practically speaking, if the guest and host OS are different, there is relatively little likelihood that a completely compatible OS emulation can be performed. Some compromises and approximations will be required. A compromise often used is to restrict the applications supported (thus limiting the system calls).

Process VM

This VM provides a virtual environment at the program, or process, level. It allows a program compiled for one system to run on a different system. Programs are compiled, distributed, and stored as executable binaries that conform to a specific ABI, which includes the hardware instruction set and the OS interface.

A process VM contains the following components:
(1) The loader writes the guest code and data into a region of memory and loads the runtime code.
(2) The loader passes control to the initialization block, which allocates memory space for the code cache and the other tables used during emulation. It also invokes the host OS to establish signal handlers for all trap conditions.
(3) The emulation engine uses interpretation or binary translation to emulate the guest instructions.
(4) The code cache manager maintains the code cache.
(5) The profile database contains dynamically collected program information that is used to guide optimization during translation.
(6) When the guest program performs a system call, the OS call emulator translates it into appropriate calls to the host OS and handles any associated information returned as a result.
(7) The runtime must also handle traps that may occur, and interrupts directed at the guest process, using the exception emulator.

Process VM state mapping

This refers to the mapping of the registers and memory of the guest process into the host address space. In the host address space, the guest registers can be mapped either to host registers or to a register context block in memory (runtime data). The guest code and data are mapped into memory, together with the emulator (runtime code).

Memory mapping from the guest to the host address space conceptually uses a mapping table. The mechanism is similar to virtual-to-real address translation (look up the base address, then form the address by adding the offset). Emulating this in software has high overhead, but the approach is the most flexible, as consecutive memory blocks in the guest can be dispersed across non-consecutive blocks in the host address space. To simplify the translation, we can consider address-space mapping methods that rely more on the underlying hardware than on the VM software. Both of the following cases assume the host address space is larger than the guest's:

(1) The guest address space is mapped contiguously into the host address space above the runtime. In this case, host address = guest address + (length of runtime).
(2) The guest address space is mapped contiguously into the host address space starting at the same offset, so that host address = guest address. The runtime is relocated to a location above the guest address space.
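
A minimal sketch of method (1), with an invented runtime size: the guest-to-host translation collapses to a single addition, cheap enough to inline into every emulated memory access.

#include <stdio.h>

#define RUNTIME_SIZE 0x100000UL   /* hypothetical: the runtime occupies the bottom 1 MB */

/* Method (1): the guest space sits directly above the runtime. */
unsigned long host_addr(unsigned long guest_addr) {
    return guest_addr + RUNTIME_SIZE;
}

int main(void) {
    printf("guest 0x4000 -> host 0x%lx\n", host_addr(0x4000UL));
    return 0;
}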

It is apparent that the relative sizes of the guest and host address spaces have significant implications for the choice of mapping method. Whether the runtime can be placed in an arbitrary area outside the confines of the guest address space is another important factor.

The emulator also needs to deal with the memory model, as it mimics the OS the guest process thinks it is running on. For example, a guest process may allocate a memory block with particular protection settings, and the emulator needs to mimic the memory model to remain compatible. In general, a user application sees 3 main features:
(1) the overall structure of the address space, e.g. segmented or flat
(2) the access privileges (R, W, E)
(3) the protection and allocation granularity - the smallest unit that the OS can allocate and protect for the application

The complexity of mapping a guest page to a host page depends on the relative page sizes and the protection types available on the two platforms. If the host page size is smaller than the guest's and the host protection types are more comprehensive, guest pages can be mapped to host pages directly, letting the underlying hardware enforce allocation and protection. Otherwise, some software mapping and intervention by the emulator is required, which is more complex and slow.

Incremental Predecoding and Translation

To tackle the code discovery problem, a general solution is to translate the binary while the program is operating on actual input data, i.e. dynamically, and to predecode or translate new sections of code incrementally as the program reaches them.

There are 4 components:
(1) The emulation manager (EM) controls the overall flow.
(2) The interpreter translates the source code into intermediate code.
(3) The translator translates the source code into target binary code.
(4) The map table associates the SPC (source program counter) of a block of source code with the TPC (target program counter) of the corresponding block of translated code. The map table is typically implemented as a hash table.

The system translates one block of code at a time. The unit of translation is called a dynamic basic block, which differs from the static basic block determined by the static structure of the program. A static basic block starts at a branch label and ends at a branch instruction. For example,

- - - - - - - - - start of block 1
add...
load....
store...
- - - - - - - - - end of block 1/start of block 2
loop: load...
add...
store
brcond skip
- - - - - - - - - end of block 2/start of block 3
load...
sub...
- - - - - - - - - end of block 3/start of block 4
skip: add...
store
brcond loop
- - - - - - - - - end of block 4/start of block 5
add...
load...
store...
jump indirect
- - - - - - - - - end of block 5


A dynamic basic block is determined by the actual flow of a program as it is executed. It begins at the instruction executed immediately after a branch or jump, follows the sequential stream, and ends with the next branch or jump. For example,

- - - - - - - - - start of block 1
add...
load....
store...
loop: load...
add...
store
brcond skip
- - - - - - - - - end of block 1/start of block 2
load...
sub...
skip: add...
store
brcond loop
- - - - - - - - - end of block 2/start of block 3
loop: load...
add...
store
brcond skip
- - - - - - - - - end of block 3/start of block 4
skip: add...
store
brcond loop
- - - - - - - - - end of block 4

Incremental (staged) translation works as follows. After the source binary is loaded into memory, the EM begins interpreting the binary using a simple decode-and-dispatch loop or the indirect threaded method. When the EM reaches a branch or jump, the SPC-TPC map table is consulted.

If it is a miss, the profile table is checked to see whether the next block is hot (i.e. it has been executed more times than a predefined threshold). If it is not hot, the profile table is updated and control passes back to the interpreter. If it is hot, the next block is translated and placed into the code cache, and the SPC-TPC map table is updated.

If it is a hit, the EM transfers control to the code cache. Execution starts at that code block and follows the chained blocks until the forward links end and control passes back to the EM.

When the interpreter or the translated code hits an OS call, control passes back to the EM and the instruction is processed by the OS emulator. When the OS emulator returns, control passes back to the EM, which does an SPC-TPC lookup and continues. Similarly, when the code generates an exception, control passes to the exception emulator.

The EM follows the path of the source program and either directly executes the next block or begins translating the next dynamic basic block. Incrementally, more of the program is discovered and translated, until eventually only translated code is executed.
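
A toy sketch of the EM's decision loop (the block layout, hot threshold, and stand-in functions are all invented): the map table, profile table, and code cache are reduced to arrays, and "execution" is just a print statement.

#include <stdio.h>
#include <stdint.h>

#define NUM_BLOCKS 3      /* toy program: three blocks executed in a cycle */
#define HOT_THRESHOLD 5

typedef uint32_t spc_t;   /* source PC */
typedef int tpc_t;        /* target PC (index into the code cache); -1 = not translated */

static tpc_t map_table[NUM_BLOCKS] = {-1, -1, -1};  /* SPC -> TPC */
static int profile[NUM_BLOCKS];                     /* per-block use counts */
static spc_t tpc_src[NUM_BLOCKS];                   /* TPC -> originating SPC */
static int next_tpc = 0;

static tpc_t translate_block(spc_t spc) {           /* pretend binary translation */
    map_table[spc] = next_tpc;
    tpc_src[next_tpc] = spc;
    return next_tpc++;
}

static spc_t execute_translated(tpc_t tpc) {        /* pretend native execution */
    printf("executing translated block %d\n", tpc);
    return (tpc_src[tpc] + 1) % NUM_BLOCKS;         /* toy control flow: next block */
}

static spc_t interpret_block(spc_t spc) {           /* pretend interpretation */
    printf("interpreting block %u\n", spc);
    return (spc + 1) % NUM_BLOCKS;
}

int main(void) {
    spc_t spc = 0;
    for (int step = 0; step < 20; step++) {
        tpc_t tpc = map_table[spc];
        if (tpc >= 0)
            spc = execute_translated(tpc);          /* map table hit: run cached code */
        else if (++profile[spc] > HOT_THRESHOLD)    /* miss: is the block hot yet? */
            spc = execute_translated(translate_block(spc));
        else
            spc = interpret_block(spc);             /* cold: keep interpreting */
    }
    return 0;
}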

Further optimizations include:
(1) Translation chaining - similar to threading: instead of branching back to the EM at the end of a block, the jump targets the next block directly.
(2) Software indirect jump prediction - implementing indirect jumps with a map table lookup is expensive in execution time. Instead, a series of compare statements on the source address value can determine the jump target with less overhead. For example, if Rx is the register holding the indirect jump's target PC value,

if (Rx == addr1) go to target1
else if (Rx == addr2) go to target2
else table_lookup(Rx)

(3) Shadow stack - when a translated code block contains a procedure call via an indirect jump to a target binary routine, the SPC value must be saved by emulation code as part of the source architected state, either in a register or on the memory stack. When the called procedure completes, it can restore this SPC value, access the map table, and jump to the translated address. As map table resolution is expensive, this overhead can be avoided if the target return PC value is made available directly: the target-code return address is pushed onto a shadow stack maintained by the emulation manager. Because the return SPC may have changed during the call, the SPC is pushed onto the shadow stack too; upon return, the SPC being returned to is checked against the SPC on the shadow stack before the cached link address is used. A sketch appears below.
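
A sketch of the shadow stack check (types and names invented): each call pushes the pair (return SPC, return TPC); on return, the cached TPC is used only if the saved SPC still matches, otherwise the code falls back to the map table.

#include <stdio.h>
#include <stdint.h>

typedef uint32_t spc_t;   /* source PC */
typedef int tpc_t;        /* target PC in the code cache */

struct shadow_entry { spc_t ret_spc; tpc_t ret_tpc; };

#define SHADOW_MAX 256
static struct shadow_entry shadow[SHADOW_MAX];
static int shadow_top = 0;

static tpc_t map_table_lookup(spc_t spc) {    /* toy stand-in for the slow hash lookup */
    printf("slow map table lookup for SPC 0x%x\n", spc);
    return (tpc_t)spc;
}

/* At a call site: push the return SPC and its known translated address. */
static void shadow_push(spc_t ret_spc, tpc_t ret_tpc) {
    shadow[shadow_top++] = (struct shadow_entry){ret_spc, ret_tpc};
}

/* At a return: use the cached TPC only if the SPC still matches. */
static tpc_t shadow_return(spc_t actual_ret_spc) {
    struct shadow_entry e = shadow[--shadow_top];
    if (e.ret_spc == actual_ret_spc)
        return e.ret_tpc;                      /* fast path: no map table access */
    return map_table_lookup(actual_ret_spc);   /* return SPC changed during the call */
}

int main(void) {
    shadow_push(0x400, 7);
    printf("return -> TPC %d\n", shadow_return(0x400));   /* fast path: 7 */
    shadow_push(0x500, 9);
    printf("return -> TPC %d\n", shadow_return(0x999));   /* slow path */
    return 0;
}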

Binary Translation

The source architected register values are stored in a register context block held in memory. If feasible, source registers can be mapped directly to target registers for performance. Static binary translation is not always possible, because of the following code discovery problems:

(1) indirect jumps whose targets are unknown at translation time
(2) padding bytes or data inserted by the compiler to align instructions, which make it difficult to determine where the next instruction starts after the current one; this is especially hard for an ISA with variable-length instructions

Another type of problem is code location: the destination address held in the register of an indirect jump is a source code address, and it must be mapped to an address in the translated code.

Threaded Interpretation

The basic interpreter consists of a central loop that drives the interpretation as follows:

while PC -> instruction
    decode instruction
    dispatch to emulation routine
    PC = PC + 1
end

Instead of using the central decode-dispatch loop, appending the decode and dispatch logic to the end of each emulation routine speeds up interpretation by reducing the number of branches in the basic interpreter. The emulation routines are threaded together indirectly through a table, and so this is called indirect threaded interpretation.
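
A toy sketch of indirect threading with an invented two-instruction ISA: each emulation routine ends by decoding the next opcode and jumping through the dispatch table itself, instead of returning to a central loop.

#include <stdio.h>

enum { OP_INC, OP_HALT };

static int program[] = {OP_INC, OP_INC, OP_HALT};   /* toy source binary */
static int pc, acc;

typedef void (*handler_t)(void);
static handler_t dispatch[2];            /* opcode -> emulation routine */

/* Each routine ends with the decode-and-dispatch step (the "threading"). */
static void do_inc(void) {
    acc++;
    pc++;
    dispatch[program[pc]]();             /* jump straight to the next routine */
}
static void do_halt(void) {
    printf("halted, acc = %d\n", acc);
}

int main(void) {
    dispatch[OP_INC]  = do_inc;
    dispatch[OP_HALT] = do_halt;
    dispatch[program[pc]]();             /* enter the threaded code */
    return 0;
}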

Predecoding can achieve further efficiency. Specifically, predecoding involves parsing an instruction and putting it in a form that simplifies interpretation. For example, the following is a list of instructions in predecoded form:

struct instruction {
    unsigned long op;
    unsigned char dest;
    unsigned char src1;
    unsigned int src2;
} code[CODE_SIZE];

If the same source instructions are interpreted again, the intermediate form can be reused. Because the intermediate code is separate from the source code, a target PC (TPC) is used to track the execution of the intermediate code. The SPC and TPC may have no direct correlation, so both values must be tracked.

The op field can be further optimized to hold the actual address of the emulation routine, saving an indirect lookup and jump. This is called direct threading.
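
Continuing the same toy ISA, a direct-threaded sketch: predecoding replaces each opcode with the address of its emulation routine, so dispatch becomes a single indirect call with no table lookup.

#include <stdio.h>

enum { OP_INC, OP_HALT };

typedef void (*handler_t)(void);

/* Predecoded instruction: op holds the routine's address directly. */
struct instruction { handler_t op; };

static struct instruction code[3];
static int tpc, acc;                     /* the TPC tracks the predecoded stream */

static void do_inc(void)  { acc++; tpc++; code[tpc].op(); }
static void do_halt(void) { printf("halted, acc = %d\n", acc); }

int main(void) {
    int source[] = {OP_INC, OP_INC, OP_HALT};   /* toy source binary */
    for (int i = 0; i < 3; i++)                 /* predecode: opcode -> routine address */
        code[i].op = (source[i] == OP_INC) ? do_inc : do_halt;
    code[tpc].op();                             /* direct-threaded execution */
    return 0;
}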

Codesigned VM

Conventional VMs focus on portability and functionality (emulating a different platform), never on performance, as emulation often adds overhead and makes execution on the VM less efficient than running on the native hardware. Codesigned VMs are designed to enable innovative ISAs and to improve performance or power efficiency, or both. In a codesigned VM, there are no native applications: the VM is part of the hardware implementation. The software portion of the codesigned VM uses a region of memory that is not visible to any system or application software. This concealed memory is carved out of real memory at boot time, and conventional guest software is never informed of its existence. The VMM code residing in the concealed memory can take control of the hardware at practically any time. In the most general form, the VM software performs binary translation and caches the translated code in the concealed memory for execution, so the guest never executes directly on the native hardware.

The IBM AS/400 uses many codesigned VM techniques. Its primary objective is support for an object-oriented instruction set that redefines the hardware/software interface in a novel fashion. The current AS/400 implementations are based on an extended PowerPC ISA.

System Level VM

System-level VMs were first developed during the 1960s for mainframes, when hardware was expensive and different groups of users wanted different operating systems. A single host hardware platform can support multiple guest OSes simultaneously. At present, the most important feature of a system-level VM is that it partitions major software systems securely.

Platform replication is the major feature provided by a VMM. The central problem is dividing a single set of hardware resources among multiple operating system environments. The VMM has access to, and manages, all the hardware resources. When a guest OS performs a privileged instruction, it is intercepted by the VMM, checked for correctness, and executed on the guest's behalf by the VMM, all done transparently.

One way to build a system-level VM is to have the VMM sit directly on the hardware with the guest OS on top of the VMM. One disadvantage is that the original OS must be wiped out to install the VMM. Another disadvantage is that the VMM must contain device drivers, because it interacts directly with the underlying hardware. An alternative is to build the VMM on top of an existing OS. The installation is then similar to installing an application, and the VMM can use services of the existing OS. However, performance suffers because there are more layers. An example of a hosted VM is VMware (2000).

Process Level VM

A process-level VM provides an application with a virtual ABI environment. Process VMs come in various forms:

(1) Replication - e.g. multiprogramming. Most operating systems can support multiple user processes; in other words, the operating system provides a replicated process-level VM for each of the concurrently executing applications.

(2) Emulation - The guest and host ISAs can be the same or different. The most straightforward emulation is interpretation: the interpreter emulates the source ISA, which can be relatively slow, as each source instruction may require tens of native target instructions. For better performance, binary translation is used: blocks of source instructions are converted to the target instruction set and can be cached and reused. Interpretation has a relatively low start-up overhead but a high execution time; dynamic binary translation, on the other hand, has a high initial overhead but is fast for repeated execution. Some VMs use a staged emulation strategy combined with profiling (i.e. collecting statistics about the program's behaviour): initially, a block of source instructions is interpreted while a profile tracks how frequently the block is executed, and binary translation is applied to blocks with repeated execution.

(3) Optimization - In addition to emulation, the target code can be optimized. This leads naturally to VMs where the source and target instruction sets are the same and optimization is the primary purpose of the VM.

(4) Platform independence - Emulation translates from a specific source ISA to a specific target ISA. A virtualized ISA can be used for ultimate portability. The VM environment then does not directly correspond to any real platform; rather, it is designed for ease of portability and to match the features of a high-level language (HLL). The HLL VM focuses on minimizing the hardware-specific and OS-specific features that would compromise portability. Examples of HLL VMs are the Java VM and Microsoft CLI (.NET). These VMs use bytecodes (each instruction is encoded as a sequence of bytes) that are stack-oriented (to eliminate register requirements). Memory size is conceptually unbounded, with garbage collection as an assumed part of the implementation.

Instruction Set Architecture (ISA)

It marks the division between hardware and software. The concept of an ISA was first clearly articulated with the IBM 360 family in the early 1960s, when the importance of software compatibility was fully recognized: the IBM 360 was a family of models incorporating a wide range of hardware resources, but all of them could run the same software. An ISA has two parts: the user ISA, visible to application programs, and the system ISA, visible to supervisor software such as the operating system.

The application binary interface (ABI) provides a program with access to the hardware resources and services available in a system. The ABI contains all user instructions. It also contains the system call interface, which allows an application to invoke the operating system to perform work on its behalf.

The application programming interface (API) is usually defined with respect to a high-level language. An API can include the system calls provided by the operating system (via wrappers). APIs enable applications to be ported easily (via recompilation) to any system that supports the same API.

Trusted Computing Base (TCB)

This is the part of the computing system that absolutely must be trustworthy if we are going to get anything done. This usage of "trusted" may seem counterintuitive: we do not trust the TCB because it is worthy of trust, but rather because we have no choice. Consequently, it is important both to know what the TCB is and to keep it as small as possible; this way, we have more assurance that what is trusted is also trustworthy. The TCB is also defined indirectly by the security perimeter that separates it from the rest of the system. The reference monitor is sometimes called the security kernel. The user had better be sure he is communicating with the TCB: a trusted path denotes a mechanism through which the user can do this, and a secure attention key is the mechanism used by the user to establish that channel.

Sunday, July 10, 2011

Booting linux

The bootloader in an embedded system is the first code to run when the system is powered on. It is typically stored in BIOS or flash memory. It performs low-level hardware initialization and then passes control to the Linux kernel.

Some architecture and bootloader combinations (e.g. Power Architecture with U-Boot) can boot the vmlinux directly (after converting the ELF to binary form). In this case the image is called uImage (a compressed vmlinux wrapped in a U-Boot header). On other architectures, an intermediate step is required to set up the right context for vmlinux before control is handed over.

vmlinux

It is the Linux monolithic kernel in ELF format. It is a binary that contains no unresolved references.

/arch/arm/kernel/head.o is an architecture-specific object (ARM in this case) that performs low-level kernel initialization. It is executed first when the kernel is loaded and passed control by the bootloader.

init_task.o sets up the initial thread and task structures that the kernel requires.

The largest object modules making up the kernel are the filesystem code, the network code, the built-in driver code, and the core kernel (which contains the scheduler, process and thread management, the timer, and other core functions).

The /arch/arm/kernel directory contains architecture-specific functionality such as low-level context switching, hardware-level interrupt handling, and processor exception handling.

Flash Memory

Flash memory can be written to and erased under software control, though at speeds considerably slower than a hard disk. Flash memory is divided into relatively large erasable units (blocks). In a NOR flash memory chip, data can be changed from a binary 1 to 0 directly at the cell address, one bit or word at a time. However, to change a 0 back to a 1, an entire erase block must be erased using a sequence of control instructions to the flash chip.
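
A small C illustration of this asymmetry (block size invented): programming can only clear bits, i.e. AND new data into a cell, so turning any 0 back into a 1 forces an erase of the whole block.

#include <stdio.h>

#define BLOCK_SIZE 64   /* hypothetical tiny erase block */

static unsigned char block[BLOCK_SIZE];

/* Erase: the only way to set bits back to 1, and only block-at-a-time. */
void flash_erase_block(void) {
    for (int i = 0; i < BLOCK_SIZE; i++)
        block[i] = 0xFF;
}

/* Program: can only clear bits (1 -> 0); a cleared bit cannot be set here. */
void flash_program(int offset, unsigned char value) {
    block[offset] &= value;
}

int main(void) {
    flash_erase_block();
    flash_program(0, 0xF0);       /* 0xFF -> 0xF0: clears the low bits, fine */
    flash_program(0, 0x0F);       /* 0xF0 & 0x0F = 0x00, NOT 0x0F: bits cannot be set */
    printf("cell 0 = 0x%02X\n", block[0]);
    return 0;
}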

Flash memory erase blocks can be uniform or variable in size. The smaller blocks can store the bootloader, while the kernel and data are kept in the larger blocks. A chip organized this way is commonly called a boot block or boot sector chip.

To modify data stored in a flash memory array, the block in which the data resides must be completely erased and rewritten. As the block size of flash is much larger than a typical hard disk sector (512 or 1K bytes), a write to flash can take many times longer than a write to a hard disk.

Another limitation of flash is its finite write lifetime: writes may fail after the lifetime (around 100K erase/write cycles) is exceeded.

NAND flash is a newer technology with smaller blocks. While NOR flash uses parallel address and data lines, NAND flash uses a proprietary serial interface. The lifetime of NAND flash is also significantly higher.