Monday, December 7, 2009

Interrupt

A maskable interrupt can be generated by hardware or software by asserting the INTR line. It is maskable because the programmer can stop the processor from recognizing the INTR signal, or stop the interrupt controller from accepting interrupt requests from selected devices.

A non-maskable interrupt (NMI) is generated by the chipset when a serious hardware problem is detected on the system board. The processor's NMI input is asserted.

Software exception

A software exception refers to a problem in executing an instruction or its operands. The processor attempts to recover gracefully by invoking a special exception handler.

A fault is an exception reported at the start of the instruction that caused the exception. The instruction can be restarted after the handler fixes the problem (e.g. page fault). A trap is an exception reported after the offending instruction has been executed. An abort does not always reliably supply the instruction that caused the problem. This makes it impossible for the exception handler to fix the problem and resume program execution.

Demand Paging in 386

Segmentation complicates programming. Paging can be used to present a flat 32-bit (4GB) address space and yet provide protection among tasks. There is no way to switch off segmentation in the processor. However, if all segments defined in the GDT are set to R/W, start at 00000000h and are 4GB in length, segmentation is effectively eliminated.

Paging is enabled by setting the PG bit in CR0 to 1. The paging unit intercepts all 32-bit linear memory addresses generated by the segment unit and performs a redirection mapping using a 2-level page table structure.

The top 10 bits of the linear address are used to index into the page directory to yield the base address of the page table. The next 10 bits are then used to index into the page table to yield the start address of the memory page. The last 12 bits are then used as the offset of the location within the page.
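
The 10/10/12 split described above can be sketched in a few lines of Python. This is only an illustration; the function name is made up for this note.

```python
def split_linear_address(linear):
    """Split a 32-bit linear address into (directory index, table index, offset)."""
    if not 0 <= linear <= 0xFFFFFFFF:
        raise ValueError("linear address must fit in 32 bits")
    directory_index = (linear >> 22) & 0x3FF  # top 10 bits
    table_index = (linear >> 12) & 0x3FF      # next 10 bits
    offset = linear & 0xFFF                   # low 12 bits
    return directory_index, table_index, offset

print(split_linear_address(0x00403025))  # (1, 3, 37)
```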

If the page table is not in memory, a page fault is triggered. The linear address is latched into CR2 so that it can be accessed by the OS's page fault handler. A similar action is performed when the target page is not in memory.

The TLB (Translation Lookaside Buffer) is used to short-circuit the lookup. The segment unit sends the linear address to both the TLB and the paging unit. The top 20 bits of the linear address are compared against the cached entries in the TLB. If a match is found, the TLB disables the paging unit and sends the corresponding 20-bit physical address mapping onto the FSB for retrieval.
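
A rough software model of the lookup may help. It is a sketch under simplifying assumptions: the page directory and page tables are plain dictionaries, the TLB is an unbounded dictionary (a real TLB has a small fixed number of entries), and the names PageFault and translate are invented for this note.

```python
class PageFault(Exception):
    """Raised when a page table or page is not present."""
    def __init__(self, linear):
        super().__init__(hex(linear))
        self.cr2 = linear  # the hardware latches the faulting address into CR2

def translate(linear, page_directory, tlb):
    """Translate a linear address, consulting the TLB before walking the tables.

    page_directory: dict mapping directory index -> page table (a dict)
    page table:     dict mapping table index -> 20-bit physical frame number
    tlb:            dict mapping top 20 bits of the linear address -> frame number
    """
    page_number = linear >> 12
    offset = linear & 0xFFF
    if page_number in tlb:                 # TLB hit: skip the table walk
        return (tlb[page_number] << 12) | offset
    page_table = page_directory.get(page_number >> 10)
    if page_table is None:
        raise PageFault(linear)            # page table not in memory
    frame = page_table.get(page_number & 0x3FF)
    if frame is None:
        raise PageFault(linear)            # page not in memory
    tlb[page_number] = frame               # cache the mapping for next time
    return (frame << 12) | offset
```

After one successful walk, a second access to the same page is answered from the tlb dictionary without touching page_directory at all.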

Segmentation in IA32 Real Mode

The start address of a segment must be within the first 1MB of memory. The length of a segment is fixed at 64KB (offsets are 16 bits wide). There is also no protection of segments among tasks.

The 64K of memory just above the 1MB line is called the HMA (High Memory Area); it is the first part of extended memory. In real mode, one can access the HMA by setting the segment register to FFFFh. For example,

mov ax,ffff
mov ds,ax
mov al,[0010]

The physical address is FFFF0h+0010h=100000h.

This method does not work on the 8086/8088, which does not have an A20 pin. Instead, the address wraps around to 00000h (segment wrapping).
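
The address arithmetic, with and without the A20 line, can be checked with a small Python sketch. The function name and the a20_enabled parameter are illustrative only.

```python
def real_mode_address(segment, offset, a20_enabled=True):
    """Physical address for segment:offset in real mode.

    With A20 disabled (or absent, as on the 8086/8088) only 20 address
    bits exist, so the sum wraps around at 1 MB.
    """
    address = ((segment & 0xFFFF) << 4) + (offset & 0xFFFF)
    if not a20_enabled:
        address &= 0xFFFFF
    return address

print(hex(real_mode_address(0xFFFF, 0x0010)))         # 0x100000
print(hex(real_mode_address(0xFFFF, 0x0010, False)))  # 0x0
```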

Monday, November 30, 2009

Memory Alignment for 386

When the processor initiates a transaction on the FSB, the logic external to the processor treats the 2 least significant bits of the address as always zero. In other words, the processor can only address memory locations at Dword boundaries. The processor instead implements 4 output pins (BE0# to BE3#) to address the individual bytes in the Dword. (Each BE pin is used to select a separate memory bank DIMM?) For example, for location zero of the Dword, the processor asserts the BE0# pin and the target byte is transferred over data path 0 (D[7:0]). For location three of the Dword, the processor asserts the BE3# pin and the data is transferred over data path 3 (D[31:24]).

To execute the instruction mov eax,[0101], the processor needs to access the Dwords at 0100 and 0104 (Dword boundaries). It then extracts the last 3 bytes from 0100 and the first byte from 0104 to form the final Dword to be loaded into eax. This degrades performance. Moreover, it could trigger two cache misses or page misses. Therefore, Dword alignment of data is important. Starting from the 486, IA32 processors can flag this condition (Alignment Check exception) when alignment checking is enabled.
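
To see which Dwords and byte enables an access touches, here is a small Python sketch. It models only the address arithmetic; the function name is invented, and the order in which a real processor issues the two cycles is not modelled.

```python
def bus_cycles(address, size):
    """List (dword_address, byte_enables) for an access of `size` bytes.

    byte_enables is a 4-character string for BE3#..BE0#, where '1' means
    the byte lane is selected in that bus cycle.
    """
    cycles = []
    first = address & ~3                  # Dword holding the first byte
    last = (address + size - 1) & ~3      # Dword holding the last byte
    for dword in range(first, last + 4, 4):
        lanes = ["1" if address <= dword + lane < address + size else "0"
                 for lane in (3, 2, 1, 0)]
        cycles.append((dword, "".join(lanes)))
    return cycles

print(bus_cycles(0x0101, 4))  # [(256, '1110'), (260, '0001')]
print(bus_cycles(0x0100, 4))  # [(256, '1111')]
```

The misaligned read needs two cycles while the aligned one needs a single cycle, which is the performance cost described above.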

Saturday, October 31, 2009

POP, IMAP and SMTP

Post Office Protocol (POP) is used to transfer mail from a central server to the user agent (reader). POP only handles inbound mail. IMAP (Internet Message Access Protocol) is similar to POP, but has an additional online mode in which messages can remain on the server.

Outbound mail is handled by SMTP (Simple Mail Transfer Protocol). An MTA (Mail Transfer Agent) uses SMTP to deliver mail across the Internet.

Monday, September 21, 2009

Useless mov edi,edi in the Prologue

The seemingly useless statement is used to enable hot patching (patching without stopping the component). The 2-byte instruction can be changed to a short jmp (within a range of about 127 bytes in either direction). To extend the reach of the jmp, NOP statements are generated before the function label so that a long jmp can be patched in there:

xor eax,eax
jmp xyz
nop
nop
nop
nop
nop
func-abc:
mov edi,edi
push ebp
mov ebp,esp
:
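
The patch itself can be modelled on a byte string. This is a sketch, not a real patching tool: the function name and parameters are invented, and the opcode bytes (8B FF for mov edi,edi, EB rel8 for a short jmp, E9 rel32 for a long jmp) are the standard x86 encodings rather than something stated in these notes. A real patcher would also write the long jmp first and then swap the 2 bytes at the entry in one atomic write.

```python
import struct

MOV_EDI_EDI = bytes([0x8B, 0xFF])     # 2-byte no-op at the function entry
SHORT_JMP_BACK = bytes([0xEB, 0xF9])  # jmp short to 5 bytes before the entry

def hot_patch(code, entry, target_address, base_address=0):
    """Return a patched copy of `code` that redirects the function at `entry`.

    code:           bytes holding 5 bytes of padding followed by the function
    entry:          offset of the function's first byte within `code`
    target_address: address of the replacement function
    base_address:   address at which code[0] is loaded
    """
    if entry < 5:
        raise ValueError("need 5 bytes of padding before the function")
    if code[entry:entry + 2] != MOV_EDI_EDI:
        raise ValueError("function does not start with mov edi,edi")
    patched = bytearray(code)
    # Long jmp (E9 rel32) goes into the padding. rel32 is relative to the
    # next instruction, which is the function entry itself.
    rel32 = target_address - (base_address + entry)
    patched[entry - 5:entry] = b"\xE9" + struct.pack("<i", rel32)
    # Short jmp replaces mov edi,edi: from entry+2 back to entry-5 is -7 (0xF9).
    patched[entry:entry + 2] = SHORT_JMP_BACK
    return bytes(patched)
```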

Frame Pointer Omission

FPO is an optimization technique. ebp is used as a general purpose register rather than the stack frame base pointer. Execution is sped up by the availability of this additional register.

Call Convention

Stdcall pushes the arguments onto the stack from right to left. The called function is responsible for removing the parameters by incrementing esp by the length of the parameters. The Cdecl calling convention differs from Stdcall in that the calling function removes the arguments from the stack. Stdcall is preferred because the clean up is done in one place (no matter how many times the function is called), which is simpler. Cdecl is used for C/C++ because they support a variable number of parameters in a function call. As the called function does not know the number of parameters beforehand, the clean up has to be performed by the calling function instead.

The compiler generates a special name for each calling convention. For Stdcall, the function name is prefixed by "_" and followed by "@" and the number of bytes of stack space required. For Cdecl, the function name is prefixed by "_".

Fastcall uses ecx and edx to pass the first 2 arguments. Clean up is done by the called function, as in Stdcall. The function name is prefixed by "@" and followed by "@" and the number of bytes of stack space required.
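
The naming rules for Cdecl, Stdcall and Fastcall are easy to express as a small Python function. The function and the sample name "add" are purely illustrative.

```python
def decorate(name, convention, arg_bytes=0):
    """Decorated name of a C function under the given calling convention."""
    if convention == "cdecl":
        return "_" + name
    if convention == "stdcall":
        return "_%s@%d" % (name, arg_bytes)
    if convention == "fastcall":
        return "@%s@%d" % (name, arg_bytes)
    raise ValueError("unknown convention: " + convention)

# A function taking two 4-byte ints needs 8 bytes of stack space.
print(decorate("add", "cdecl"))        # _add
print(decorate("add", "stdcall", 8))   # _add@8
print(decorate("add", "fastcall", 8))  # @add@8
```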

Thiscall passes the this pointer via ecx and the rest of the arguments on the stack. Clean up is done by the called function.

Stack Frame

Before calling a function, the caller first reserves space on the stack for the parameters. For example, assuming the parameters occupy 20 bytes:

sub esp,14h

Following this is a series of mov statements that move the parameters onto the stack using offsets from ebp. For example,

mov dword ptr [ebp-14h],3
:
:

Then the call instruction is used to jump to the function. Call pushes eip (the return address) onto the stack (esp advances as a result).

At the beginning of the function, the compiler generates a stack frame using the frame base pointer register ebp. The function prologue saves the current ebp onto the stack before setting up a new stack frame:

mov edi, edi
push ebp
mov ebp, esp

As a result, the ebp of the new frame points to the old ebp value (the last frame base). The ebp is then used to access the parameter (positive offset) and local variables (negative offset).

At this stage, the call stack contains the following (growing downwards):

parm1
parm2
:
return address
saved ebp of caller
local variable1
local variable2
:

When the function finishes, the epilogue restores the previous stack frame:

mov esp, ebp ; discard the local variables
pop ebp ; restore the caller's frame base
ret 14h ; return and remove the 14h bytes of parameters, assuming this is stdcall

Saturday, August 22, 2009

SLIP and PPP

SLIP (Serial Line IP) is a link-level protocol for carrying IP over a serial line (e.g. modem, RS-232 interface). Each IP datagram is terminated by an END character. An END character is escaped if it is present in the datagram. There is no checksum in SLIP; error detection and handling are assumed to be done in the upper protocol layers.
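
The framing can be sketched in Python as follows. The byte values for END, ESC and the two escaped forms are taken from RFC 1055 rather than from these notes, and the function names are made up.

```python
END, ESC = 0xC0, 0xDB           # frame terminator and escape byte (RFC 1055)
ESC_END, ESC_ESC = 0xDC, 0xDD   # what follows ESC for an escaped END / ESC

def slip_encode(datagram):
    """Escape END/ESC bytes inside the datagram and terminate it with END."""
    out = bytearray()
    for byte in datagram:
        if byte == END:
            out += bytes([ESC, ESC_END])
        elif byte == ESC:
            out += bytes([ESC, ESC_ESC])
        else:
            out.append(byte)
    out.append(END)
    return bytes(out)

def slip_decode(frame):
    """Decode one END-terminated frame back into the original datagram."""
    out = bytearray()
    escaped = False
    for byte in frame:
        if escaped:
            out.append(END if byte == ESC_END else ESC if byte == ESC_ESC else byte)
            escaped = False
        elif byte == ESC:
            escaped = True
        elif byte == END:
            break
        else:
            out.append(byte)
    return bytes(out)
```

Note that the escape byte itself must also be escaped, otherwise the receiver could not tell a literal ESC from the start of an escape sequence.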

As serial interfaces are typically slow, CSLIP is used to optimize throughput by reducing the 20+20 byte TCP+IP header to 3 or 5 bytes. This is done by maintaining state information (fields that rarely change) in the CSLIP protocol, removing the need for such information to be present in every TCP/IP header. CSLIP can maintain up to 16 connections. The smaller header size improves response time for interactive sessions on a serial line.

PPP (Point-to-Point Protocol) improves on SLIP by adding LCP (Link Control Protocol) and NCP (Network Control Protocol) capabilities. LCP allows both ends to negotiate link options. NCP allows PPP to support more than 1 network protocol (i.e. not just IP) on one serial line and to negotiate protocol-specific options (e.g. IP addresses for both ends). Finally, PPP includes a checksum.

Sunday, August 9, 2009

Challenges to Pipelining and Superscalar

A data hazard refers to the use of related data in 2 instructions that prevents them from executing simultaneously. For example, the output of one instruction is used as an input to the next instruction. Pipelined processors use "forwarding" to resolve this issue: the output port of the ALU is fed directly into its input port, bypassing the register-file write stage. Superscalar processors use "register renaming" to decouple instructions that use the same register in their calculations. For example, the following 2 instructions can be executed simultaneously using register renaming.

Add A, B, C; add a and b and store result in c
Add D, B, A; add d and b and store result in a
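
A toy model of register renaming, applied to the two instructions above, might look like this in Python. It is a simplification with an unlimited pool of physical registers (named P0, P1, ... here); real hardware has a fixed pool and must recycle registers.

```python
def rename_registers(instructions):
    """Give every result a fresh physical register.

    instructions: list of (source1, source2, destination) architectural names
    Returns the same instructions expressed in physical register names.
    """
    mapping = {}   # architectural name -> current physical register
    counter = 0

    def fresh():
        nonlocal counter
        name = "P%d" % counter
        counter += 1
        return name

    def lookup(reg):
        if reg not in mapping:
            mapping[reg] = fresh()
        return mapping[reg]

    renamed = []
    for src1, src2, dest in instructions:
        p1, p2 = lookup(src1), lookup(src2)  # read the current mappings first
        mapping[dest] = fresh()              # then give the result a new home
        renamed.append((p1, p2, mapping[dest]))
    return renamed

program = [("A", "B", "C"),   # Add A, B, C
           ("D", "B", "A")]   # Add D, B, A
print(rename_registers(program))
# [('P0', 'P1', 'P2'), ('P3', 'P1', 'P4')]
```

The second instruction now writes P4 instead of the register the first instruction reads as A (P0), so the two no longer conflict.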

A structural hazard refers to a shortage of resources for executing multiple instructions simultaneously. In a superscalar design, it takes a large number of wires to connect each ALU to the registers. Hence, CPU registers are grouped into a special unit called the register file. A register file is like a memory array, consisting of a data bus and 2 kinds of port, read and write. For example, the ALU accesses the register file's read port and requests the data to be placed on the bus. A single read port allows the ALU to access a single register at a time. Therefore, a 3-operand instruction like the ones above requires 2 read ports and 1 write port. Modern CPUs also use separate register files to store integer, floating-point and vector numbers, as each of them uses separate execution units. Another reason for this separation is to keep each register file small. The larger the register file, the slower the access.

Control (Branch) Hazard arises when the processor arrives at a conditional branch instruction. Branch prediction is used to get around this type of stall. Instruction cache is used to improve the performance for loading the next instruction from a branch.

ISA

In 1964, the IBM S/360 introduced the concept of the ISA as a layer of abstraction over the underlying CPU hardware microarchitecture. Programs written for an ISA are guaranteed to run on any CPU that implements the ISA. The ISA provides a standardized way to expose the features of a system's hardware, which allows manufacturers to enhance the implementation without breaking programs. The ISA is implemented using a microcode engine, which consists of some storage, a microcode ROM that holds the microcode programs, and an execution unit that translates the standard instructions into the ones specific to the hardware implementation.

The drawback of a microcode engine is that it is slower than direct decoding. (Modern microcode engines have approached 99% of the speed.) However, the benefit of abstraction is so significant that it outweighs this slight penalty.

Instruction Flow

Execution of an instruction takes multiple stages. Generally, there are 4 basic stages - (1) fetch (from memory), (2) decode, (3) execute and (4) write (back result). Contemporary CPUs break these stages down further and enhance them to improve performance. Discrete logic is used to implement each stage, and the stages form a pipeline. This allows multiple instructions to be processed simultaneously.

Superscalar

As the number of transistors increased, chip designers could afford to put more than 1 ALU on a single chip. As such a design can do more than one scalar operation at a time, it is called superscalar. The IBM RS6000 was the first superscalar CPU released in 1990. The first superscalar CPU from Intel was the Pentium, released in 1993.

Saturday, June 27, 2009

RAID

RAID2 uses error correction codes to fix incorrect data when it is read from disk. Data is striped in bit-sized chunks. A dedicated disk contains the error correction codes.

RAID3 and RAID4 use a dedicated parity disk and require at least 3 disks to function. The difference is that RAID3 uses byte-sized chunks and RAID4 uses block-sized chunks. A common way to calculate parity is to use XOR.

RAID5 removes the bottleneck of the dedicated parity disk. Parity is stored across all disks in a round-robin fashion. It also requires at least 3 disks to function.
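
XOR parity is what lets a RAID3/4/5 array survive the loss of one disk. A minimal Python sketch, with made-up block contents and an invented helper name:

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

data = [b"\x0f\xf0", b"\x33\x55", b"\xaa\x01"]   # one block per data disk
parity = xor_blocks(data)                          # stored on the parity disk

# Suppose the disk holding data[1] fails: XOR the survivors with the parity.
rebuilt = xor_blocks([data[0], data[2], parity])
print(rebuilt == data[1])  # True
```

This works because XOR-ing a value with itself cancels out, so everything except the missing block drops out of the sum.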

Master Boot Record

The MBR resides in the first 512-byte sector of the device. The first 446 bytes contain the boot code, followed by 4 x 16-byte partition table entries. There are 2 types of partition in the MBR:

(1) A primary file system partition contains file system.
(2) A primary extended partition contains additional partitions.
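
Reading the 4 table entries out of a raw sector can be sketched in Python. The field layout inside each 16-byte entry (boot flag, CHS start, type, CHS end, start LBA, sector count) and the type codes 05h/0Fh for extended partitions are the commonly documented ones, not something stated in these notes; the function name is invented.

```python
import struct

EXTENDED_TYPES = {0x05, 0x0F}   # assumed type codes for extended partitions

def parse_mbr(sector):
    """Return the non-empty partition table entries of a 512-byte MBR sector."""
    if len(sector) != 512:
        raise ValueError("an MBR sector is 512 bytes")
    entries = []
    for i in range(4):
        raw = sector[446 + 16 * i: 446 + 16 * (i + 1)]
        # flag, CHS start (3 bytes, skipped), type, CHS end (3 bytes, skipped),
        # start LBA, sector count; all little-endian
        flag, ptype, start_lba, count = struct.unpack("<B3xB3xII", raw)
        if ptype == 0:          # type 0 marks an unused entry
            continue
        entries.append({"index": i, "bootable": flag == 0x80, "type": ptype,
                        "extended": ptype in EXTENDED_TYPES,
                        "start_lba": start_lba, "sectors": count})
    return entries
```

An entry with "extended" set is a primary extended partition; its first sector holds another small table of the same shape.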

An extended partition contains 1 secondary file system partition (called logical partition in Windows) and 1 secondary extended partition. The secondary extended partition contains more partitions in a recursive fashion.

To illustrate, a disk with 6 partitions may have the following partition structure:
- 3 x primary file system partitions (C:, D:, E:)
- 1 x primary extended partition
  - 1 x secondary file system partition (F:)
  - 1 x secondary extended partition
    - 1 x secondary file system partition (G:)
    - 1 x secondary extended partition
      - 1 x secondary file system partition (H:)

Volume

A volume is a collection of addressable sectors recognized by the OS. A volume can assemble multiple storage devices into one logical unit. A volume can be sliced into smaller partitions, and partitions can in turn be combined to build a volume.

Saturday, June 20, 2009

Shader

Previously, programmers used the Fixed Function Pipeline (FFP) to send instructions directly to the GPU. Shaders are programs written in HLSL (MS) or Cg (C for Graphics) that are executed by the GPU. A vertex shader is run once for every vertex in the viewing frustum to set the position of the vertex based on the world and camera settings. Once the position data is created, rasterization translates the triangles into a set of pixels. Pixel shaders then run for each pixel to determine its colour. The data is then output to the screen.

Viewing Frustum

It defines the region of the 3D world that is visible to the camera. The frustum is bounded by a near clipping plane and a far clipping plane. Only objects in the frustum are drawn. When a player moves and a far object (e.g. a building or mountain) suddenly pops up, it is because the object has come within the far clipping plane.

SCSI

Small Computer System Interface versions differ in the number of bits per transfer, speed and type of signal.

(a) number of bits
SCSI (normal) - 8 bits, 5MB/s
SCSI (wide) - 16-bits, 10MB/s

(b) speed (frequency)
Fast SCSI, Ultra, Ultra-2, Ultra-3 (Ultra-160) and Ultra-320

(c) Signal
Single-ended (SE) uses a strong voltage for 1 and no voltage for 0. This method is not stable at higher speeds and with longer cables.
Differential voltage uses 2 wires. No voltage on both wires represents 0. Opposite voltages are applied to the wires to represent 1. HVD (High Voltage Differential) was the initial standard. LVD (Low Voltage Differential) is now the more common type. Some LVD devices can operate in SE mode, but at a slower speed.

DCO

Device Configuration Overlay was introduced in ATA-6 to allow disk capabilities to be hidden. DCO allows a user who purchases drives from different vendors, with different amounts of storage space, to make them all appear to offer the same number of sectors.

DCO causes the IDENTIFY_DEVICE command to show a subset of features and a smaller disk size. DCO uses the DEVICE_CONFIGURATION_SET command to reserve the specified storage at the end of the disk. The usable disk space can be further reduced by using HPA.

Like HPA, DCO is managed via ATA commands.

HPA

Host Protection Area was introduced in ATA-4 to allow computer vendors to save data that would not be erased when a user formats the hard disk. It is intended for manufacturers to stow diagnostic tools and backup images so that they do not need to ship an install CD.

The HPA is an area set aside at the end of the disk using the SET_MAX_ADDRESS command. The command features a "volatile" bit which allows the HPA to take effect at the next reboot. This allows the user to read/write the HPA in the current session and lock the content after reboot.

HPA is invisible to both BIOS and OS.  It is managed by low level ATA commands.


Disk Password

There are 2 passwords in a hard disk - user and master. The master password allows an administrator to access the disk if the user password is lost.

There are 2 operating modes. In high security mode, both user and master password can unlock the disk. In maximum security mode, the master password can unlock the disk only after the disk has been wiped.

Security is implemented using the ATA SECURITY_UNLOCK command which must be executed before any READ/WRITE ATA command. Some ATA commands are still usable without SECURITY_UNLOCK command and thus the disk may be recognizable by the system.

IDE/ATA

IDE (Integrated Drive Electronics) refers to a disk with a built-in logic board, as compared to older disks. ATA (AT Attachment) refers to the interface of an IDE disk. ATA disks require a controller, which is built into the motherboard. ATA cables have a maximum length of 18 inches and use 40 pins. Later cables contain an extra 40 wires, which do not connect to the pins, to reduce interference between the signal wires.

Early ATA disks were addressed by CHS (Cylinder/Head/Sector). The disparity between the ATA standard and older BIOSes in the number of bits used for each CHS field limited the size of IDE disks to 504MB. Newer BIOSes translate the CHS address to the ATA standard, which extended the addressability to 8.1GB. To overcome this limitation, CHS was replaced by LBA (Logical Block Address) in later ATA standards.
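
The 504MB figure can be reproduced with a little arithmetic. The field limits used here (1024 cylinders and 63 sectors per track from the older BIOS, 16 heads from the ATA standard) are the commonly quoted ones and are an assumption of this sketch, as is the usual CHS-to-LBA formula; the function names are invented.

```python
SECTOR_BYTES = 512

def chs_capacity(cylinders, heads, sectors):
    """Capacity in bytes of a disk with the given CHS geometry."""
    return cylinders * heads * sectors * SECTOR_BYTES

def chs_to_lba(c, h, s, heads_per_cylinder, sectors_per_track):
    """Convert a CHS address (sectors count from 1) to a logical block address."""
    return (c * heads_per_cylinder + h) * sectors_per_track + (s - 1)

# The smaller limit of each field wins when BIOS and ATA disagree.
print(chs_capacity(1024, 16, 63) // 2**20)   # 504 (MB)
print(chs_to_lba(0, 1, 1, 16, 63))           # 63
```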

ATA-1 (1994) - supported CHS and 28-bit LBA
ATA-3 (1997) - added Self-Monitoring, Analysis and Reporting Technology (SMART), which allows monitoring of several parts of the disk. Another feature was password protection.
ATA/ATAPI-4 - added the ATA Packet Interface for removable media. ATAPI used the same cable and controller but required special drivers. The 80-wire cable was introduced. Added HPA (Host Protection Area) for vendors to keep data that would not be erased by formatting the disk.
ATA/ATAPI-6 (2002) - added 48-bit LBA and removed support for CHS. Added DCO (Device Configuration Overlay).
ATA/ATAPI-7 - Included serial ATA

Unicode

A Unicode code point fits in 4 bytes. There are 3 ways to store a Unicode character. UTF-32 uses 4 bytes for each character. UTF-16 stores the most frequently used characters in 2-byte values and less frequently used ones in 4-byte values. UTF-8 uses 1- to 4-byte values. UTF-8 and UTF-16 require more processing than UTF-32.
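
The size differences are easy to see with Python's built-in codecs. The sample characters are arbitrary picks from different ranges, and the helper name is invented.

```python
def encoded_sizes(character):
    """Bytes needed for one character in UTF-8, UTF-16 and UTF-32."""
    return {
        "utf-8": len(character.encode("utf-8")),
        "utf-16": len(character.encode("utf-16-le")),  # -le: no byte order mark
        "utf-32": len(character.encode("utf-32-le")),
    }

# U+0041 (A), U+00E9 (e with acute), U+20AC (euro sign), U+1F600 (an emoji)
for ch in ["A", "\u00e9", "\u20ac", "\U0001F600"]:
    print("U+%04X" % ord(ch), encoded_sizes(ch))
```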

Sunday, February 22, 2009

ttymon login

Besides getty login, SVR4 also supports another login via ttymon. Typically getty login is used for the console and ttymon login is used for terminals. In ttymon login, init fork-execs the program sac (service access controller). sac fork-execs ttymon when the system enters multiuser state. ttymon monitors all terminal ports listed in the configuration file. When a user logs in, ttymon fork-execs the program login. Note that the parent of login is now ttymon, while the parent of getty is init.

getty login

A login from a terminal or modem (remote) connected via RS232 comes through the terminal driver in the kernel. Terminal device configurations are defined in /etc/ttys by the administrator. When the system boots, it creates the first process, init (PID=1). init reads /etc/ttys and, for each terminal device that allows login, fork-execs the program getty.

getty runs with superuser privilege (UID=0). It opens the terminal in RW mode and sets file descriptors 0, 1 and 2 to the terminal device. It then outputs the login prompt. Once the user enters a userid, getty execs the program login to handle the input.

login prompts the user for the password and uses crypt to encrypt the password entered for checking. If the password is invalid, login exits with code 1 after a few tries. Control passes back to init, which fork-execs getty again to restart the process. If the password check is successful, login sets up the user environment and changes to the user's ID before it invokes the shell using the execl call - ("/bin/sh", "-sh", (char *) 0). The minus sign is a flag that tells all shells that they are being invoked as a login shell. Finally, the login shell reads the start up files (e.g. .profile) before displaying the first prompt for the user.

Sunday, February 15, 2009

Unbuffered I/O and effect of buffer size

Most UNIX I/O can be accomplished with 5 system calls - open, lseek, read, write and close. They are unbuffered I/O calls. The programmer needs to allocate a buffer and pass it as one of the arguments to the read and write functions. If the buffer size picked is small, more read/write calls have to be made to transfer the same amount of data in and out of the program. A small buffer size results in higher CPU utilization (system and user) as well as a longer total run time. Using a larger buffer size decreases CPU time. This is an example of trading memory usage for CPU time. The CPU improvement tapers off at a certain buffer size, beyond which further decrease in CPU utilization is not significant because the number of I/O calls has already become quite small.
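
The relationship between buffer size and the number of read calls can be observed with Python's thin wrappers around the same system calls (os.open, os.read, os.close). This sketch only counts calls rather than measuring CPU time; the file size, buffer sizes and function names are arbitrary choices for illustration.

```python
import os
import tempfile

def count_reads(path, buffer_size):
    """Read a file with os.read and return how many calls returned data."""
    calls = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        while True:
            chunk = os.read(fd, buffer_size)
            if not chunk:          # an empty result means end of file
                break
            calls += 1
    finally:
        os.close(fd)
    return calls

def demo(total_bytes=65536):
    """Count reads of a temporary file for a few buffer sizes."""
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(b"x" * total_bytes)
        path = f.name
    try:
        return {size: count_reads(path, size) for size in (64, 512, 4096, 65536)}
    finally:
        os.remove(path)

print(demo())
```

Each step up in buffer size cuts the number of calls proportionally, until a single call moves the whole file and a larger buffer gains nothing more.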

The UNIX standard I/O package provides a buffered I/O call interface for programs. When using these calls, the programmer no longer needs to be concerned with buffer size, as it is taken care of by the standard I/O calls.

Tuesday, January 6, 2009

AC/DC

An inverter converts low voltage DC to high voltage AC. It consists of a chopper device which opens and closes a switching transistor successively. This produces a pulsating DC. A transformer then steps this up to a higher voltage. The output is closer to a square wave than a good sine wave. A more expensive chopper produces a waveform closer to a sine wave.

To convert AC to DC, one uses a rectifier. This device uses one or more diodes to force the current to flow in one direction. The result is a pulsating DC current. The ripples can be smoothed out to approach pure DC by using a filter, which works by trying to maintain the DC voltage at its peak value.