Equi-Join is when the join columns are related using equality, e.g. T1.C1 = T2.C1
Theta-Join is when the join columns are related with inequality operators such as > or <, e.g. T1.C1 > T2.C1
Certain join methods only work with equi-joins, while others work with any type of join.
When joining 2 tables, if there are rows that do not have a matching join value, an inner join will not return these rows. An outer join, on the other hand, will return such rows in the result set. Example of such a condition:
T1.CustID = T2.CustID - there are CustID values in T1 not found in T2 and vice versa.
Non-matching rows returned from an outer join will have some of the fields (those from the table without the join value) empty (NULL).
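The difference can be sketched with two hypothetical tables in Python (illustrative only, not tied to any particular database; table and column names are made up):

```python
# Hypothetical customer/order tables keyed on CustID.
t1 = [{"CustID": 1, "Name": "Ann"}, {"CustID": 2, "Name": "Bob"}]
t2 = [{"CustID": 1, "Total": 50}]   # CustID 2 has no match in t2

def inner_join(left, right, key):
    # Only rows whose join value appears on both sides are returned.
    return [{**l, **r} for l in left for r in right if l[key] == r[key]]

def left_outer_join(left, right, key, right_cols):
    # Non-matching left rows are kept, with the right-side fields left empty (None).
    out = []
    for l in left:
        matches = [r for r in right if r[key] == l[key]]
        if matches:
            out.extend({**l, **r} for r in matches)
        else:
            out.append({**l, **{c: None for c in right_cols}})
    return out
```

inner_join(t1, t2, "CustID") returns only Ann's row; left_outer_join(t1, t2, "CustID", ["Total"]) also returns Bob's row with Total set to None.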
Join Techniques
(1) Nested Loop
Used when the number of rows returned is expected to be small and indexes are available to pick up those rows efficiently. Filter the matching rows in the driving table, look up each matching entry in the driven table via an index, then use the ROWID to access the data.
Nested loop is efficient only if the select condition is highly restrictive. To retrieve each row, I/O is incurred on both the indexes and the tables. If the result set is large, the I/O cost could be higher than scanning through the table once to find all matching rows.
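A minimal sketch of the technique (illustrative Python: a dict plays the role of the index, list positions stand in for ROWIDs, and all table names are made up):

```python
# Driving table (customers) and driven table (orders).
customers = [{"id": 1, "city": "Oslo"}, {"id": 2, "city": "Rome"}]
orders = [{"cust_id": 1, "amt": 10}, {"cust_id": 1, "amt": 20}, {"cust_id": 2, "amt": 5}]

# Index on the driven table's join column: join value -> list of row ids.
orders_index = {}
for rowid, row in enumerate(orders):
    orders_index.setdefault(row["cust_id"], []).append(rowid)

def nested_loop_join(driving, index, driven, pred):
    result = []
    for d in driving:
        if not pred(d):                       # restrictive filter on the driving table
            continue
        for rowid in index.get(d["id"], []):  # index lookup...
            result.append({**d, **driven[rowid]})   # ...then ROWID-style access
    return result
```

nested_loop_join(customers, orders_index, orders, lambda r: r["city"] == "Oslo") touches only the indexed rows for the single matching customer, which is why the technique pays off when the filter is highly restrictive.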
(2) Sort-Merge
There is no concept of driving or driven table; both tables are equal. Rows from both tables are first sorted in join column order. The sorted results are then merged to find the matching rows. This join technique uses full table scans. If the sort work area is not large enough to accommodate the full table, additional I/O will be incurred to swap rows in and out of memory, increasing retrieval time. In some cases the optimizer piggybacks on the sort required by ORDER BY or GROUP BY processing.
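The two phases can be sketched as follows (illustrative Python, single-column equi-join assumed):

```python
def sort_merge_join(left, right, key):
    # Phase 1: sort both inputs on the join column.
    left = sorted(left, key=lambda r: r[key])
    right = sorted(right, key=lambda r: r[key])
    out, i, j = [], 0, 0
    # Phase 2: merge, advancing whichever side holds the smaller join value.
    while i < len(left) and j < len(right):
        if left[i][key] < right[j][key]:
            i += 1
        elif left[i][key] > right[j][key]:
            j += 1
        else:
            # Equal join values: emit matches for this left row against the
            # whole group of equal-valued right rows, then move on.
            j2 = j
            while j2 < len(right) and right[j2][key] == left[i][key]:
                out.append({**left[i], **right[j2]})
                j2 += 1
            i += 1
    return out
```

Each input is read once after sorting, which is why the dominant cost is the sort itself (and the extra I/O when the sort work area overflows).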
(3) Hash Join
It uses two-phase processing as well. In the BUILD phase, it picks the smaller (driving) table and places the rows into bins by hashing the join value. In the PROBE phase, it scans through the driven table, hashes the join value to compute the slot, then compares with the join values already in the slot for a match.
One pitfall of hash join is when the join column has a skewed distribution, such that some slots contain no entries while others contain a very large number of entries. This makes the comparisons in the PROBE phase sub-optimal. Performance is also affected by the size of the hash work area that holds the driving table.
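A sketch of the two phases (illustrative Python; a small fixed bin count stands in for the hash work area):

```python
from collections import defaultdict

NUM_BINS = 8   # stand-in for the size of the hash work area

def hash_join(driving, driven, key):
    # BUILD phase: hash the smaller (driving) table's join values into bins.
    bins = defaultdict(list)
    for row in driving:
        bins[hash(row[key]) % NUM_BINS].append(row)
    # PROBE phase: hash each driven row's join value to find its bin, then
    # compare against the rows already there (hashes can collide).
    out = []
    for row in driven:
        for cand in bins[hash(row[key]) % NUM_BINS]:
            if cand[key] == row[key]:
                out.append({**cand, **row})
    return out
```

With a skewed join column most rows land in the same bin, and the inner comparison loop of the PROBE phase degenerates toward a scan of the whole driving table, which is the pitfall noted above.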
(4) Cluster Join
It is a rare type of join. The idea is to store the data chunks most likely to be needed at the same time in the same data block, thus speeding up retrieval. The processing is similar to nested loop: for each row in the driving table, use the cluster index to find the matching rows in the driven table. These rows will probably be in the same data block.
Cluster join can be used when the join condition corresponds to a clustering key in the driven table.
Sunday, December 23, 2012
Saturday, October 13, 2012
Loops per jiffy
The system timer interrupts the CPU at a pre-set frequency specified by the variable HZ in Linux. There is a trade-off in choosing a value for HZ. Too short an interval results in excessive switching, which means higher overhead and power consumption. Too long an interval reduces the granularity of the scheduler's resolution. The value of HZ is hardware-architecture dependent. On x86, it used to be 100 by default in 2.4 kernels. It was changed to 1000 in 2.6 kernels but lowered to 250 in 2.6.13.
Jiffies holds the number of times the system timer has popped since system boot. Jiffies is incremented HZ times by the system every second.
During boot, the Linux kernel calculates the number of times the CPU can execute an internal delay loop in one jiffy (the interval between consecutive ticks of the system timer). The result depends on the speed of the CPU and is stored in a variable called loops_per_jiffy. This value is used to delay execution (wait) for small durations (microseconds).
In the kernel, delays on the order of jiffies are considered long waits. To sleep-wait for such long durations, use the schedule_timeout(), wait_event_timeout() and msleep() functions. These functions cannot be used in interrupt context, as an interrupt handler cannot call schedule() or sleep. Busy-waiting for a short period is possible in interrupt context.
Delays of sub-jiffy duration are considered short delays. Such delays are common in both process and interrupt contexts. Because it is not possible to use jiffy-based methods to implement sub-jiffy delays, the only way is to busy-wait. The kernel APIs for short delays are mdelay(), udelay() and ndelay() (for milli-, micro- and nano-second durations). These APIs use loops_per_jiffy to calculate the number of loop iterations required to achieve the desired delay duration.
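The jiffy arithmetic and the calibration idea can be sketched in user space (Python used purely as illustration; HZ = 250 is an assumed configuration, and the real kernel does this in C at boot):

```python
import time

HZ = 250   # assumed timer frequency; kernel-configuration dependent

def ms_to_jiffies(ms):
    # One jiffy lasts 1000/HZ ms; round up so a timeout is never too short.
    return -(-ms * HZ // 1000)

def calibrate_loops_per_jiffy():
    # Count how many iterations of a trivial loop fit into one jiffy,
    # analogous to the kernel's boot-time loops_per_jiffy computation.
    jiffy = 1.0 / HZ
    loops, start = 0, time.perf_counter()
    while time.perf_counter() - start < jiffy:
        loops += 1
    return loops
```

With HZ = 250 a jiffy is 4 ms, so a 10 ms timeout rounds up to 3 jiffies; busy-waiting a proportional fraction of the calibrated loop count is the idea behind mdelay()/udelay()/ndelay().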
Sunday, August 26, 2012
Windows Debugging
User mode debugging is supported by the OS using a debug port object. Notification messages generated by the target process are passed to the debug port, and the debugger process processes them.
In order for a user mode debugger to break into another process, it must have sufficient privilege. That is why one can only debug one's own processes unless one attains full administrator privilege (SeDebugPrivilege).
When an exception occurs, the OS exception dispatching code in ntdll.dll notifies the debugger before any user exception handler. This is called the first chance exception notification. The debugger can choose to handle or ignore the exception. If it chooses to ignore it, control passes to the user exception handler. If the SEH exception is not handled by a thread in the target process, the debugger is sent another debug event, called the second chance exception notification, to inform it that the exception was not handled.
The default action for a first chance notification is to log it in the debugger window. The default action for a second chance notification is always to stop the target process, because an unhandled exception is a common error.
The debugger can freeze the target process at any time, which is referred to as breaking into the debugger. This is achieved by using the DebugBreakProcess API, which internally injects a remote thread into the target process. This "break-in" thread executes a debug break CPU interrupt instruction (int 3), and the OS responds by raising an SEH exception and sending a first chance notification to the debugger. The debug break exception (0x80000003 or STATUS_BREAKPOINT) suspends all threads in the target process. The current thread context in the user mode debugger after a break-in operation will be this special thread (ntdll!DbgUiRemoteBreakin). Once in the debugger, one can switch to the actual thread of interest to inspect its state.
Code breakpoints are also implemented using the int 3 instruction, by directly overwriting the target code in memory (kernel32!WriteProcessMemory). The debugger keeps track of the initial instruction for each code breakpoint so that it can substitute it in place of the debug break instruction when the breakpoint is reached, before the user inspects the target process in the debugger. This makes the break transparent to the user. To reinsert the breakpoint after the user issues g (go) to resume execution, the debugger uses the TF (Trap Flag) in the EFLAGS register, which forces the target thread to execute one instruction at a time. This single-step flag causes the CPU to issue an interrupt (int 1) after each instruction. The target process executes the original instruction, and then the debugger gets the chance to handle the SEH exception. The debugger restores the breakpoint instruction and resets the TF flag to disable single-step mode.
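The save/patch/restore dance can be modelled with a toy bookkeeping table (pure simulation over a bytearray; no real process memory is touched, and a real debugger does this via WriteProcessMemory):

```python
INT3 = 0xCC   # encoding of the x86 "int 3" debug break instruction

class BreakpointTable:
    """Toy model of a debugger's code-breakpoint bookkeeping."""
    def __init__(self, code):
        self.code = bytearray(code)   # stands in for the target's code bytes
        self.saved = {}               # address -> original byte

    def set(self, addr):
        # Save the original byte, then overwrite it with int 3.
        self.saved[addr] = self.code[addr]
        self.code[addr] = INT3

    def on_hit(self, addr):
        # Put the original instruction back so the user sees untouched code.
        self.code[addr] = self.saved[addr]

    def rearm(self, addr):
        # After single-stepping over the original instruction, re-insert the break.
        self.code[addr] = INT3
```

The set/on_hit/rearm cycle mirrors the insert, restore-on-hit, single-step and re-arm sequence described above.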
In kernel mode debugging, the communication channel is built lower down in the architectural stack using HAL extensions. When kernel debugging is enabled via msconfig.exe and the target machine is rebooted, the OS boot loader loads the appropriate debug transport extension module in the early part of the boot sequence (around the time the HAL is being loaded):
kdcom.dll for serial COM cable
kd1394.dll for firewire cable
kdusb.dll for USB 2.0 debug cable
The target machine periodically polls for break-in messages from the host debugger (nt!KdCheckForDebugBreak). When a message is received, the OS suspends all processing and enters the break-in state.
Because these modules sit low in the architectural stack, they cannot depend on higher level OS kernel components, which may not have fully loaded or may themselves be the targets of debugging. The OS kernel periodically asks the transport layer (as part of the clock ISR) to check for break-in packets from the host debugger. While the target computer is suspended, the break-in loop checks for more commands (e.g. inspect registers) sent by the host kernel debugger. Another way to enter the break-in loop is when the target machine hits an exception.
As in the user mode debugger, breakpoints are inserted by overwriting the target code with int 3. Once the breakpoint is reached, the target machine enters the break-in send/receive loop to allow the debugger to put the initial byte back in the breakpoint location before entering the break-in state.
If the code has been paged out, the target machine registers the code breakpoint as "owed". When the code page is loaded into memory later, the page fault handler (nt!MmAccessFault) in the kernel memory manager intervenes and inserts the breakpoint into the global code page at that time.
Managed code, MSIL (.NET), is translated into machine code lazily, when a method is actually invoked. Therefore, when the debugger wants to insert a breakpoint, it needs to wait. The native debug events generated by the OS are not sufficient to support MSIL debugging, as only the CLR knows when a method is compiled or how a class is represented in memory. As such, the CLR provides a dedicated helper thread, known as the debugger runtime controller thread (clr!DebuggerRCThread), to interact with the debugger. Even in the break-in state, the managed target process is not frozen, because the helper thread still runs to service commands from the debugger.
A set of COM objects provided via mscordbi.dll provides the interface to the helper thread. A debugger like the Visual Studio debugger accepts user commands via its front-end UI (vsdebug.dll) and forwards them to the back end (cpde.dll), which in turn uses mscordbi.dll to communicate with the helper thread in the managed target process. These COM objects take care of the private interprocess communication channel between the debugger and target processes. A drawback of this architecture is that it cannot debug a crash dump, as there is no active helper thread running.
The Active Scripting specification defines a language processing engine, with the Active Scripting host using that engine when a script needs to be interpreted. Examples of script engines are vbscript.dll and jscript.dll (the MS implementation of JavaScript). Examples of Active Scripting hosts include IIS (server-side scripts embedded in ASP or ASP.NET pages), IE (client-side script hosting in web pages) and the Windows scripting hosts (cscript.exe and wscript.exe). The Active Scripting specification also defines a contract (a set of COM interfaces) for debuggers. Hosts that support debugging (i.e. implement the COM interfaces) are called smart hosts. A Process Debug Manager (PDM) component (pdm.dll) ships with the Visual Studio debugger to insulate script engines from the intricacies of script debugging. The PDM serves the same purpose as the debugger runtime controller thread in the CLR. The debugging services in a script engine are usually not exposed by default and must be turned on (e.g. via an IE option).
Sunday, August 19, 2012
Chord Variations
Notes added to a basic triad are called extensions. Extended chords can be major, minor or dominant depending on the triad on which they are based.
Major 7 = I-III-V+VII (C, E, G, B)
Minor 7 = I-iii-V+vii (C, Eb, G, Bb)
Dominant 7 = I-III-V+vii (C, E, G, Bb)
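The three formulas above can be expressed as semitone offsets from the root (illustrative Python; flat-only note spelling is assumed for simplicity):

```python
NOTES = ["C", "Db", "D", "Eb", "E", "F", "Gb", "G", "Ab", "A", "Bb", "B"]

# Semitone offsets for the three seventh-chord qualities listed above.
CHORDS = {
    "maj7": [0, 4, 7, 11],   # I  III V  VII
    "min7": [0, 3, 7, 10],   # I  iii V  vii
    "dom7": [0, 4, 7, 10],   # I  III V  vii
}

def build_chord(root, quality):
    start = NOTES.index(root)
    return [NOTES[(start + step) % 12] for step in CHORDS[quality]]
```

build_chord("C", "maj7") gives ["C", "E", "G", "B"], matching the spellings listed above.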
Inversion refers to the order in which the notes of a chord are stacked. The standard order is called root position because the tonic is at the bottom (1,3,5). First inversion has the 3rd at the bottom, and so on. The notation to indicate inversion uses a slash. For example, CMaj7/G has G as the bass of the CMaj7 chord. The notation can also be used to indicate a bass note that is not part of the chord. This is called an altered bass chord, for example Am/D.
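Inversions are just rotations of the note stack, which makes them easy to sketch:

```python
CMAJ7 = ["C", "E", "G", "B"]   # root position: tonic at the bottom

def invert(chord, n):
    # n = 0: root position; n = 1: first inversion (3rd at the bottom); etc.
    n %= len(chord)
    return chord[n:] + chord[:n]
```

invert(CMAJ7, 2) gives ["G", "B", "C", "E"], i.e. the CMaj7/G voicing with G in the bass.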
Compound chords are 2 chords played at the same time. The notation used is similar to a fraction, with one chord written above the other, e.g. A over D.
Chord substitution is to replace a chord with a related chord. The chords involved should have things in common: for example, several common notes, or similar voice leading (e.g. replacing a dominant chord with another one that also leads back to the tonic).
Diatonic substitution is to replace a chord with one a diatonic third above or below the original. For example, a C major chord (C-E-G) can be replaced by Am (A-C-E).
Saturday, July 21, 2012
Calling User code from Windows Kernel Mode
Code running in kernel mode in theory has unrestricted access to the whole address space, and so it could invoke code running in user mode. However, doing so requires first picking a thread to run the code in, transitioning the CPU back to user mode, and setting up the user-mode context of the thread to reflect the call parameters. Fortunately, calling user mode code is typically required only by the OS itself and, for drivers, only in the context of a device IOCTL initiated by a user-mode thread.
A standard way for the system to execute code in the context of a given user-mode thread is to send an asynchronous procedure call (APC) to that thread. This is how thread suspension works in Windows: the kernel simply sends an APC to the target thread asking it to execute a function that waits on its internal thread semaphore object, causing it to become suspended. APCs are also used in many other scenarios, such as I/O completion and thread pool callback routines.
Calling Windows Kernel Mode
The most basic way to call kernel mode code from a user mode application is via a system call. This mechanism uses native support in the CPU to implement the transition. One drawback of this mechanism is that it relies on a hard-coded table of well-known executive service routines to dispatch requests from client code to the target kernel routine. This does not extend well to extensions like drivers.
For those cases, another mechanism, I/O control commands (IOCTL) issued via the generic kernel32!DeviceIoControl API, is used. The API takes a user-defined IOCTL identifier as one of its parameters, along with a handle to the device object to which the request is dispatched. The transition to kernel mode is still performed in the NTDLL layer (ntdll!NtDeviceIoControlFile) and internally also uses the system call mechanism, so IOCTL is a higher-level communication protocol built on top of system calls.
I/O control commands are processed by the I/O manager of the executive, which builds an I/O request packet (IRP) that it then routes to the device object requested by the user mode caller. The device has an associated device stack that handles these requests. The IRP filters down the stack, giving each driver a chance to either process or ignore the request. In fact, IRPs are also used by drivers to send requests to other drivers.
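The routing idea can be sketched as a chain of handlers (a toy Python model; the IOCTL code and driver names here are made up, and real IRPs carry far more state):

```python
class Irp:
    """Toy I/O request packet: an IOCTL code plus completion state."""
    def __init__(self, ioctl):
        self.ioctl, self.data, self.completed = ioctl, None, False

def dispatch(irp, device_stack):
    # The I/O manager hands the IRP to the top of the stack; each driver
    # may complete it or let it filter down to the next driver below.
    for driver in device_stack:
        driver(irp)
        if irp.completed:
            break
    return irp

def filter_driver(irp):
    pass   # not interested in this request; it passes down the stack

def function_driver(irp):
    if irp.ioctl == 0x1234:   # hypothetical user-defined IOCTL identifier
        irp.data, irp.completed = "handled", True
```

dispatch(Irp(0x1234), [filter_driver, function_driver]) returns a completed IRP, while an unknown code filters through the whole stack uncompleted.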
Windows Kernel Layering
Windows is a monolithic system. Drivers share the same address space as the Windows kernel and have the same unrestricted access to memory and hardware. Several parts of the OS, such as the NT file system and the TCP/IP stack, are implemented as drivers rather than included in the kernel binary.
Above the HAL layer is the Windows kernel (ntoskrnl). The kernel implements core low-level OS services such as thread scheduling, multiprocessor synchronization and interrupt/exception dispatching. It also contains a set of routines that are used by the executive layer to expose higher-level semantics to user mode applications.
The executive is also hosted in the same kernel module (ntoskrnl). It performs core services such as process/thread management and I/O dispatching. Other functions include the security reference monitor, the plug and play manager, power management, and the cache and memory managers. The executive exposes callable functions, called system services, to other components such as drivers and to user mode. The typical entry point from user mode is via ntdll.dll (which contains the instructions to perform the switch to kernel mode). The executive allows processes to access kernel objects such as processes, threads and events via handles, which are tracked in a per-process handle table.
The Win32 UI and graphics services are implemented by an extension to the kernel (win32k.sys), which exposes system services to applications via entry points in user32.dll.
Several core facilities of the Windows OS are primarily implemented in user mode instead of kernel mode.
User sessions in Windows represent resource and security boundaries and offer a virtualized view of the keyboard, mouse and display to support multiuser logon on the same OS. The state that backs these sessions is tracked in a kernel-mode virtual address space called session space. In user mode, the session manager subsystem process (smss.exe) is used to start and manage these user sessions.
During Windows boot up, a leader smss.exe instance that is not associated with any session gets created. This leader instance creates a copy of itself for each new session, which then starts winlogon.exe and csrss.exe for that session. The use of multiple smss.exe instances can provide faster logon of multiple users on Windows servers acting as terminal servers.
Winlogon.exe is responsible for managing user logon and logoff. In particular, this process starts the graphical UI process that displays the logon screen when the user presses CTRL-ALT-DEL, and displays the desktop after successful logon. Each session has its own instance of the winlogon.exe process.
Csrss.exe, or the client/server runtime subsystem process, is the user mode part of win32k.sys (the UI subsystem). Prior to Win7, it ran the UI message loop of console applications. Each session has its own instance of csrss.exe.
The local security authority subsystem (lsass.exe) is used by winlogon.exe to authenticate user accounts during logon. It generates a security token object to represent the user's security rights, which is then used to create the explorer process for the user session. New child processes created from the explorer shell inherit their access tokens from the initial explorer process security token. There is only one instance of lsass.exe, and it runs in the noninteractive session (known as session 0).
Services.exe, or the NT service control manager (SCM), runs in session 0. It is responsible for starting a class of special user mode processes called Windows services. These processes are typically used to carry out background tasks that do not require terminal interaction. They can choose to run with the highest privileges in Windows (the LocalSystem account), so they are often used to perform privileged tasks on behalf of user mode applications.
Processes that run under the LocalSystem account (the highest privileged account in Windows) are part of the trusted computing base (TCB) and are able to bypass any check by the security subsystem in the OS.
Above the HAL layer is the Window kernel (ntoskrnl). Kernel implements core low-level OS services such as thread scheduling, multiprocessor synchronization and interrupt/exception dispatching. It also contains a list of routine that are used by the executuve layer to expose higher level semantics to user mode applicaiton.
The executive is also hosted in the same kernel module (ntoskrnl). It performs core services such as proces/thread management and I/O dispatching. Other functions include security reference montior, plug and play manager, power management, cache and memory manager. Executive exposes callable function to other components such as driver and to user mode, called system services. The typical entry point in user mode is via ntdll.dll (which contain instruction to perform context switching to kernel mode). Executive allows process in kernel mode to access kernel objects such as process, thread, event etc via an object called handle, kept tracked in a handle table for each process.
The Win32 UI and graphics services are implemented by an extension to the kernel (win32k.sys) and exposes system services for application via entry point in user32.dll.
Several core facilities of the Windows OS are primarily implemented in user mode instead of kernel mode.
User sessions in Windows represent resource and security boundaries and offer a virtualized view of the keyboard, mouse and display to support multiuser logon on the same OS. The state that back these sessions is tracked in a kernel-mode virtual address space called session space. In user mode, the session manager subsystem process (smss.exe) is used to start and manage these user sessions.
During Windows boot upm a leader smss.exe instance that is not assocaiated wth any session gets created. this leader instance creates a copy of itself for each new session, which then starts the winlogon.exe and csrss.exe for each sesison. Use of multple smss.exe instance can provide faster logon of multiple users on Windows server acting as terminal servers.
Winlogon.exe is responsible for manageing user logon and logoff. In particular, this process starts the graphic UI process that display the logon screen when the user presses the CTL-ALT-DEL keys and display the desktop after successful login. Each session has its own instance of winlogon.exe process.
Csrss.exe or the client serverrintime subsystem process is the user model part of win32.sys (UI subsystem). It runs the UI message loop of console application prior to Win7. Each session has its own instance of csrss.exe.
The local security authority subsystem (lsass.exe) is used by winlogon.exe to authenticate user accounts during logon. It generates a security token object to represent the user's security rights, which is then used to create the explorer process for the user session. New child processes created from the explorer shell inherit their access tokens from the initial explorer process security token. There is only one instance of lsass, which runs in the noninteractive session (known as session 0).
Services.exe, the NT service control manager (SCM), runs in session 0. It is responsible for starting a class of special user mode processes called Windows services. These processes are typically used to carry out background tasks that do not require terminal interaction. They can choose to run under the highest-privileged account in Windows (the LocalSystem account), so they are often used to perform privileged tasks on behalf of user mode applications.
Processes that run under the LocalSystem account (the highest-privileged account in Windows) are part of the trusted computing base (TCB) and are able to bypass any check by the security subsystem in the OS.
Sunday, June 17, 2012
Library
Reasons for using libraries to structure a program:
(1) share common functionality
(2) convenient upgrade rather than replacing multiple executables
(3) separation of interface and implementation
(4) improved start-up time and memory footprint by loading only what is needed
(5) dynamically replaceable
(6) different versions for different platforms
Calls into a library have higher overhead than normal calls because:
(1) a library call may be implemented using a table of jump addresses
(2) the library and its data are usually placed into new TLB entries, so calls into the library could trigger TLB misses
(3) lazy loading adds overhead at run time
(4) position-independent code (address references relative to the current executing location) allows sharing but also increases code length.
Process and Thread
An application comprises instructions and data. Before a program runs, it consists of instruction and initialized data sections on the disk.
A process contains instructions, data and state. State is the set of values held in the registers, the address of the current instruction, the values held in memory, and flags. State is private to the process itself unless it is explicitly set up to be shared. In memory, a process contains the text, initialized data, uninitialized data, heap, stack, libraries and library data.
A thread has some state, but basically just the registers and the stack. Each thread has its own stack.
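The split between per-thread state (each thread's own stack and locals) and shared process state (globals, heap) can be sketched in Python. The names here are illustrative; the point is that shared memory needs explicit synchronization while each thread's loop variable lives on its own stack.

```python
import threading

counter = 0                      # shared state: visible to every thread
lock = threading.Lock()

def worker(n):
    global counter
    for _ in range(n):           # the loop variable lives on this thread's own stack
        with lock:               # shared memory needs explicit synchronization
            counter += 1

threads = [threading.Thread(target=worker, args=(1000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)                   # 4000: all threads updated the same variable
```

Without the lock, the read-modify-write on `counter` could interleave between threads and lose updates.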
Sunday, June 3, 2012
Type of classical music
The most common types of classical music played by an orchestra are the symphony and the concerto. A symphony involves around a hundred players, but not all of them play at the same time. The tune is passed around from group to group and occasionally played by the full orchestra together. A symphony consists of 3 or 4 pieces of tune with short silences in between.
A concerto includes a soloist playing with the orchestra. Bach and composers of his time usually grouped 6 short pieces of music into a suite. A suite typically starts off with a prelude (before-play) followed by dances, which merely means the pieces have the distinctive rhythms of certain dances.
Sonatas are pieces written for 1 or 2 instruments and contain 3 or 4 movements. A piano sonata is written for piano. A violin or cello sonata has piano accompaniment.
A string quartet is written for 2 violins, 1 viola and 1 cello. A string trio takes away the 2nd violin. A string quintet is a string quartet with an additional cello or viola. A piano quintet adds a piano to the string quartet.
A fugue is a piece with lots of counterpoint.
A cantata is a choir with orchestra and sometimes a solo singer. Chamber music is written for about 10 musicians performing in a room. Lieder (German for songs) are for a singer with piano accompaniment.
Classical music is identified by a number preceded by K or BWV, which represent different catalogues. Opus numbers order the published works of a composer in chronological sequence. A piece of music usually had to be of high quality to get published.
Modes
Sometime before 300 B.C., the Greeks used different systems of tones and semitones to divide up the octave and named them after peoples or neighbouring areas. The Christian Church developed a method of singing called Gregorian chant from around 750 onward, which uses 7 scale patterns with loosely assigned Greek names.
There are 7 modes:
Ionian - WWHWWWH - the major key
Dorian - WHWWWHW - mostly used in Celtic music and American folk songs originating from Irish melodies. It sounds melancholy and soulful, as the last note does not sound resolved, like a question left unanswered
Phrygian - HWWWHWW - a Middle Eastern-sounding tune. Brighter and less melancholy than a minor key. Its interval pattern is similar to Aeolian.
Lydian - WWWHWWH - most of the intervals are similar to the Ionian mode except at the third and fourth positions. Bright like a major key, but with an uncommon interval
Mixolydian - WWHWWHW - similarly close to the Ionian mode. Good as a counterpoint to an Ionian melody
Aeolian - WHWWHWW - the natural minor. If Dorian is melancholy, Aeolian means despair.
Locrian - HWWHWWW - uncommon
The basic feel of Dorian is very close to minor, which is not surprising as the difference between D Dorian and D minor is only 1 note (B vs Bb).
Lydian and Mixolydian sound close to a major key, as there is only 1 note's difference of a semitone. They sound slightly less definite, more ambiguous.
Dorian and Phrygian sound close to a minor key. Locrian is the odd one out and not commonly used.
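The W/H interval patterns above can be turned into actual note names with a short sketch. The twelve-note table uses sharp spellings only, and the function name is ours:

```python
NOTES = ['C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#', 'A', 'A#', 'B']
STEP = {'W': 2, 'H': 1}          # whole tone = 2 semitones, half tone = 1

def mode_scale(root, pattern):
    """Walk the W/H pattern from the root; the last step just returns to the octave."""
    i = NOTES.index(root)
    scale = [root]
    for s in pattern[:-1]:
        i = (i + STEP[s]) % 12
        scale.append(NOTES[i])
    return scale

print(mode_scale('C', 'WWHWWWH'))   # Ionian from C: the C major scale
print(mode_scale('D', 'WHWWWHW'))   # Dorian from D: all white keys, D to C
```

Building Dorian from D this way shows why it feels so close to minor: the notes are the same white keys as C major, just centred on D.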
Identifying keys
C major and A minor use the same set of notes, so how do the ears know which key is playing? The answer is in the emphasis. If C, E and G are favoured, particularly with C used at the ends of phrases, the music will have a major-key feel.
Alternatively, if A, C and E are favoured and phrases frequently end on A, the music will have a minor feel.
Modulation
The most common type of modulation is to change to a key which differs by 1 note. For example, from C major to G major:
C, D, E, F, G, A, B to G, A, B, C, D, E, F#
We subconsciously sense the change of the team leader (tonic) from C to G. Alternatively we could change from C major to F major, which contains all the notes of C major except that B falls to Bb. The emotional impact usually lasts for a short duration (tens of seconds).
Composers sometimes modulate a phrase to a key with many different notes to inject more interest, for example from C major to E major. Changing the key a semitone above or below brightens up the music; this is known as the truck driver's modulation.
Changing from a major key to a minor key adds further interest to the mood. Modulating too frequently may confuse the audience, while too little makes the music bland and predictable.
Sharp and Flat in Keys
Even though the same note can be called sharp or flat (e.g. Eb = D#), the convention when naming the notes of a key is to avoid using the same letter twice. For example, the following uses D and A twice:
A major: A, B, Db, D, E, Gb, Ab, A
The practice is to use each letter only once and to use either # or b for all notes in the key:
E major: E, F#, G#, A, B, C#, D#, E
Bb major: Bb, C, D, Eb, F, G, A, Bb
B major: B, C#, D#, E, F#, G#, A#, B
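The one-letter-per-degree rule can be automated. This is a sketch under the assumption that the key needs at most single sharps or flats (it does not handle double accidentals); the pitch table and W-W-H-W-W-W-H step pattern are standard, the function name is ours:

```python
LETTERS = 'CDEFGAB'
NAT = {'C': 0, 'D': 2, 'E': 4, 'F': 5, 'G': 7, 'A': 9, 'B': 11}   # semitone of each natural
ACC = {'': 0, '#': 1, 'b': -1}
MAJOR_STEPS = [2, 2, 1, 2, 2, 2, 1]   # W W H W W W H

def spell_major(tonic):
    """Spell a major scale using each letter exactly once (plus the octave)."""
    letter = tonic[0]
    pitch = (NAT[letter] + ACC[tonic[1:]]) % 12
    idx = LETTERS.index(letter)
    scale = [tonic]
    for step in MAJOR_STEPS:
        pitch = (pitch + step) % 12
        idx = (idx + 1) % 7                   # always advance to the next letter
        letter = LETTERS[idx]
        diff = (pitch - NAT[letter]) % 12     # accidental needed to hit the pitch
        scale.append(letter + {0: '', 1: '#', 11: 'b'}[diff])
    return scale

print(spell_major('E'))    # each letter appears once, sharps only
print(spell_major('Bb'))   # each letter appears once, flats only
```

Because the letter index always advances by one, the same letter can never be used twice, which is exactly the convention described above.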
Saturday, June 2, 2012
Notes in chord
When a note is produced from a musical instrument (e.g. by plucking a string), it is a combination of the base frequency and its multiples. For example, the note A will contain the base frequency 110Hz and its multiples like 220Hz, 330Hz etc. The combination depends on the position at which the string is plucked. If the string is plucked in the middle, the frequency combination will be 110, 220, 440 (i.e. multiplying factors of 1, 2, 4, 8... etc). If the string is plucked at the one-third position, the frequency combination will be 110, 330, 660 etc.
A chord contains a minimum of 3 notes. The dominant (5th note) is 1.5 times the frequency of the tonic. The third note is 1.25 times the frequency of the tonic. When these 3 notes are played together, there are many common frequencies and thus the output sounds very harmonic. For example, A has a frequency of 110Hz. E, the dominant note, has a frequency of 165Hz. They have a common frequency of 330Hz and thus they sound harmonic together.
For a minor chord, the third note is changed to have a relative frequency of 1.2, which links less strongly to the other 2 notes.
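The shared-overtone argument is easy to check numerically. A minimal sketch, taking A = 110 Hz and its dominant E = 165 Hz (1.5 x 110) and intersecting their first few overtones:

```python
def overtones(base, n=8):
    """First n integer multiples of the base frequency, in Hz."""
    return {base * k for k in range(1, n + 1)}

a = overtones(110.0)            # the note A
e = overtones(165.0)            # E, the dominant (1.5 x 110 Hz)
print(sorted(a & e))            # frequencies the two notes share
```

The intersection contains 330 Hz (the 3rd overtone of A and 2nd of E) and 660 Hz, which is why the two notes sound harmonic together.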
Tuesday, May 29, 2012
Chord and Harmony
A chord is a sound made by 3 or more notes playing at the same time. Harmony is a succession of chords. Harmony usually forms a backdrop to the melody. Some chords sound pleasant and some do not. A sequence of anxious chords builds up tension before it is relieved by a harmonious combination.
A rhythm guitar plays chords to accompany a melody. Simple chords don't involve notes right next to each other, as such combinations are harsh sounding. It is common, though, to build a harmonious chord and add a crashing note.
Playing a chord one note at a time is called an arpeggio. The most complex type of harmony is called counterpoint, in which a melody is accompanied by another melody. Playing the same melody with a delay is called a canon. A slightly cleverer form of canon starts the second run on a higher or lower note.
A fugue is a piece which relies on the interplay between counterpoints as its main content.
Sunday, May 27, 2012
Join Strategies
Join strategies depend on indexes, table sizes and selectivity (a measure of the number of distinct values in an index).
Nested-loop Join
for each row in outer table {
    for each row in the inner table {
        if outer table join column matches inner table join column then add to temp table
        else fail
    }
}
The smaller table should be the inner table so that it can be cached, reducing the number of page reads. If the cache is too small, the DBMS needs an index to avoid scanning every row in the inner table; thus the table with the good index should be the inner table. Having good indexes on both tables is not necessarily useful, as the outer table is only scanned once. If there are multiple join columns, a compound index (c1, c2) is needed.
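A minimal Python sketch of the nested-loop algorithm above; the tables and column names are made up for illustration:

```python
def nested_loop_join(outer, inner, key):
    """Nested-loop equi-join: rescan the inner table for every outer row."""
    result = []
    for o in outer:              # outer table is scanned once
        for i in inner:          # inner table is rescanned per outer row
            if o[key] == i[key]:
                result.append({**o, **i})
    return result

t1 = [{'c1': 1, 'c2': 'abc'}, {'c1': 2, 'c2': 'xyz'}]
t2 = [{'c1': 1, 'c3': 'p'}, {'c1': 3, 'c3': 'q'}]
print(nested_loop_join(t1, t2, 'c1'))
```

The inner scan here is what a real DBMS replaces with an index lookup; without one, the cost is proportional to len(outer) x len(inner).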
If there is a restrictive expression in the SQL, for example
select * from table1, table2
where table1.c1 = table2.c1
and table1.c2 = "abc"
the table with the restrictive expression should be the outer table. If table1 is the inner table, the restrictive expression will need to be applied every time through the inner loop.
To speed up the inner loop, the IBM hybrid join sorts data so that outer table accesses are in the order of the join column on the inner table. Scan the inner table using the join column and build a temporary table by appending the inner table RID to each outer table row. Sort the temporary table into RID order (in other words, the outer table is now in the inner table's RID order). Retrieve the inner rows based on the RIDs in the temporary table to form the result set. Because retrieval is in RID order, performance improves with prefetch. Hybrid join is essentially a nested-loop join that requires the outer table to be in sequence.
Sort-merge Join
/*sort phase*/
sort table1
sort table2
get first row of table1
get first row of table2
/*merge phase*/
for (;; until no more rows in tables) {
if (join column in table1 < join column in table2) then get next row in table1
else if (join column in table1 > join column in table2) then get next row in table2
else if (join column in table1 = join column in table2) then {
pass
get next row in table1
get next row in table2
}
}
This method always moves forward. Compared to a nested loop, sort-merge scans each row only once. The downside is the cost of sorting. If both tables are large, sort-merge join is generally faster. However, DBMSs prefer nested loop because it requires less RAM and has no start-up overhead. Sort-merge join is especially favourable if both tables are already in the right sequence (clustered index sequence).
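The sort and merge phases above can be sketched in Python. Like the pseudocode, this simplified version advances both tables on a match, so it assumes the join values are unique; real merge joins also handle duplicate join values.

```python
def sort_merge_join(t1, t2, key):
    t1 = sorted(t1, key=lambda r: r[key])    # sort phase
    t2 = sorted(t2, key=lambda r: r[key])
    i = j = 0
    result = []
    while i < len(t1) and j < len(t2):       # merge phase: always moving forward
        if t1[i][key] < t2[j][key]:
            i += 1
        elif t1[i][key] > t2[j][key]:
            j += 1
        else:                                # match: emit and advance both
            result.append({**t1[i], **t2[j]})
            i += 1
            j += 1
    return result

t1 = [{'c1': 2, 'c2': 'xyz'}, {'c1': 1, 'c2': 'abc'}]
t2 = [{'c1': 1, 'c3': 'p'}, {'c1': 3, 'c3': 'q'}]
print(sort_merge_join(t1, t2, 'c1'))
```

If both inputs are already in join-column order, the two `sorted` calls become unnecessary and only the single forward merge pass remains.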
Hash Join
for each row in inner, compute a hash
for each row in outer, compute a hash
if hash value is in inner table hash bucket then pass
else fail
RAM is required to hold the temporary hash table; therefore enough RAM is a prerequisite for this join. The join must be an equi-join. Also, the inner table join column must be unique and uniquely hashable. Hash join is useful when the outer table is very large, which means the inner table would otherwise be searched many times.
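The build and probe phases can be sketched in Python; the dictionary stands in for the in-memory hash table, and table contents are illustrative:

```python
def hash_join(inner, outer, key):
    buckets = {}                              # BUILD: hash the smaller (inner) table
    for row in inner:
        buckets.setdefault(row[key], []).append(row)
    result = []                               # PROBE: scan the outer table once
    for row in outer:
        for match in buckets.get(row[key], []):
            result.append({**match, **row})
    return result

t1 = [{'c1': 1, 'c2': 'abc'}]                 # smaller (build) table
t2 = [{'c1': 1, 'c3': 'p'}, {'c1': 2, 'c3': 'q'}]
print(hash_join(t1, t2, 'c1'))
```

Each probe is an O(1) dictionary lookup, which is why the technique pays off when the outer table is very large: the inner table is scanned once to build and never rescanned.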
When more than 2 tables are joined, each join is processed separately. There is no such thing as a nested-loop join within a nested loop. One reason is that separate joins can be processed in parallel. Another reason is that a nested loop within a nested loop could keep the inner table out of cache.
Sunday, May 20, 2012
Sorting Characters
A binary sort sorts by the codes used to store the characters. Binary string comparisons are twice as fast as dictionary-order string comparisons on average, but the order of the rows may not be what is desired.
Dictionary sorts require a conversion step before comparison, as no code page stores characters in the desired order. Dictionary sorts are reasonably right (in the desired order) with some compromise for speed.
Dictionary sorts with tie breakers return the desired order even when the data contains accented characters and is case sensitive. There are up to 3 passes:
Primary sort orders by letters as mapped to some canonical form (i.e. A comes before B comes before C etc)
Secondary sort looks for accent distinctions if primary sort values are equal
Tertiary sort looks for case distinctions if primary and secondary sort values are equal (e.g. Smith is followed by smith, followed by Soho)
Dictionary sorts without tie breakers are the same as primary sorts.
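The three-pass comparison can be approximated in Python with a composite sort key: a primary form with accents and case stripped, a secondary form keeping accents, and the original string as the tertiary tie breaker. This is a rough sketch, not a full collation implementation (a real one would follow the Unicode Collation Algorithm):

```python
import unicodedata

def dictionary_key(s):
    folded = unicodedata.normalize('NFD', s.casefold())
    primary = ''.join(c for c in folded if not unicodedata.combining(c))
    return (primary,   # primary: letters only, accent- and case-insensitive
            folded,    # secondary: accents break primary ties
            s)         # tertiary: case breaks the remaining ties

words = ['smith', 'Soho', 'Smith', 'résumé', 'resume']
print(sorted(words, key=dictionary_key))
```

This reproduces the ordering described above: Smith is followed by smith, followed by Soho, and the unaccented resume sorts just before résumé.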
Database SORT
Three factors affecting sort speed are
The number of rows selected
The number of columns in the ORDER clause
The defined length of the columns (not the actual length in the case of VARCHAR)
If the physical length of the record is long, the DBMS may perform a tag sort in dynamic memory using only the necessary part of the record.
The position of rows when sorting a column containing NULL varies from DBMS to DBMS.
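A tag sort can be sketched as sorting small (key, row-position) pairs and fetching the long records only afterwards; the rows here are made up for illustration:

```python
rows = [
    {'id': 1, 'name': 'Carol', 'payload': 'x' * 1000},
    {'id': 2, 'name': 'Alice', 'payload': 'y' * 1000},
    {'id': 3, 'name': 'Bob',   'payload': 'z' * 1000},
]
# sort only compact (sort key, tag) pairs instead of moving the long records
tags = sorted((row['name'], pos) for pos, row in enumerate(rows))
ordered = [rows[pos] for _, pos in tags]     # fetch the full records by tag
print([row['name'] for row in ordered])
```

Only the short key and position move through the sort area; the 1000-character payloads stay put until the final fetch, which is the whole point of the technique.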
Saturday, May 19, 2012
Sargable
The ideal SQL search condition has a column name on the left side and an easy-to-look-up value on the right side. IBM called these kinds of conditions "sargable predicates". SARG is short for "Searchable ARGument". Microsoft and Sybase later redefined the term as "can be looked up via an index".
Each component of a search condition has a point count. The higher the count, the faster the component (i.e. it involves fewer rows or is easier to compare). For operators, the relative sequence (fastest first) is:
- =
- >, >=, <, <=
- LIKE
- <>
For operands, the relative sequence (fastest first) is:
- literal alone
- column alone, parameter alone
- multi-operand expression
- exact numeric data type
- other numeric data type, temporal data type
- character data type, NULL
Optimization techniques include:
(1) Put multiple conditions in order of their relative point counts
(2) Constant propagation (transitive rule): replace a column name with a literal
e.g. WHERE column1 < column2 and column1 = 5
to WHERE 5 < column2 and column1 = 5
(3) Evaluate expressions and replace them with literals
(4) Avoid functions on columns. If unavoidable, use LOWER instead of UPPER, as UPPER loses information in some languages.
(5) For AND conditions, put the least complex condition first so that the DBMS can skip the rest when it evaluates to false. For OR, put the most likely condition first.
(6) (Distributive rule) A AND (B OR C) is faster than (A AND B) OR (A AND C) as there are fewer evaluation steps.
(7) Avoid NOT
e.g. NOT (column1 > 5)
to column1 <= 5
(8) Use IN over OR
e.g. column1 IN (5,6) instead of column1 = 5 or column1 = 6
(9) The DBMS will use an index if a LIKE pattern starts with a literal character, but will avoid the index if it starts with a wildcard.
Note that LIKE 'ABC' is not the same as = 'ABC': LIKE takes trailing blanks into account while = does not.
(10) Transform UNION to OR
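Several of the rewrites above are pure logical equivalences, which a quick Python check over a made-up column illustrates:

```python
column1 = [3, 5, 6, 7, 9]

# (7) Avoid NOT: NOT (c > 5) selects the same rows as c <= 5
assert [x for x in column1 if not (x > 5)] == [x for x in column1 if x <= 5]

# (8) IN over OR: the two predicates select the same rows
assert ([x for x in column1 if x in (5, 6)]
        == [x for x in column1 if x == 5 or x == 6])

print('rewrites select identical rows')
```

The rewrites change only the form the optimizer sees, never the result set, which is what makes them safe to apply.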
Bottom Halves
The purpose of bottom halves is to perform any interrupt-related work not performed by the interrupt handler. The reason to limit the amount of work in the interrupt handler is that the handler runs with the current interrupt line disabled on all processors. Interrupt handlers that register with IRQF_DISABLED run with all interrupt lines disabled on the local processor. Minimizing the time spent with interrupt lines disabled improves system performance and responsiveness. Another reason is that interrupt processing is triggered asynchronously; it is desirable to defer as much work as possible and return quickly to the work that was interrupted.
Deferring work until "later" means "not now": defer the work to any point in the future when the system is less busy. What matters is that this work runs with the interrupt lines enabled.
The original infrastructure provided in Linux for implementing bottom halves is called BH. The BH interface was simple. It provided a statically created list of 32 bottom halves for the entire system. The top half selects the specific bottom half to run by setting a bit in a 32-bit integer. Each BH was globally synchronized, which means no two BHs could run at the same time, even on different processors. This simple approach was inflexible and also a bottleneck.
Later on, task queues were introduced as a replacement for BH. The kernel defined a family of queues; each queue contained a linked list of functions to call. The functions in a queue were run at a specific time depending on which queue they were in. Drivers registered their bottom halves in the appropriate queue. Task queues were still too inflexible to replace BH entirely, and also not a lightweight enough mechanism for performance-critical subsystems such as networking.
During Linux 2.3 kernel development, softirqs and tasklets were introduced to replace the BH interface. Converting BHs to softirqs and tasklets is non-trivial because BHs are globally synchronized. Softirqs are a list of statically defined bottom halves that can run simultaneously on any processor, even two of the same type concurrently. Tasklets are dynamically created bottom halves built on top of softirqs; however, two tasklets of the same type cannot run concurrently, even on different processors. Tasklets are therefore a good trade-off between flexibility and performance.
Softirqs are used when performance is critical. However, care must be taken as two of the same softirq can run concurrently. Softirqs must also be statically registered at compile time. Tasklets, on the other hand, can be registered dynamically.
In the 2.5 kernel, work queues were introduced. A work queue is a simple method of queuing work to be performed later in process context. Currently 3 methods exist for deferring work: softirqs, tasklets and work queues. The BH and task queue interfaces have been removed.
UNIX History
UNIX grew out of Multics, a failed multiuser operating system project at Bell Labs. In 1969, Bell Labs programmers sketched out a filesystem design that evolved to become UNIX; Thompson implemented the system on an idle PDP-7. It was ported to the PDP-11 in 1971. In 1973, UNIX was rewritten in C, which paved the way for portability to other platforms. The first UNIX widely used outside Bell Labs was the Sixth Edition (V6). Other companies ported UNIX to new machines. In 1977, Bell Labs released a combination of these variants as a single system, UNIX System III. In 1982, AT&T released System V (System IV was an internal development version).
One of the main contributors to UNIX development was the University of California, Berkeley. The Berkeley variants of UNIX were called BSD (Berkeley Software Distributions). 1BSD, released in 1977, was a collection of patches and additional software on top of Bell Labs UNIX. 2BSD in 1978 continued this trend and included csh and vi. The first standalone BSD release was 3BSD in 1979, which added virtual memory (VM). A series of 4BSD releases (4.0 through 4.3) added job control, demand paging and TCP/IP. The final release, 4.4BSD in 1994, featured a rewritten VM subsystem. BSD development continues with the Darwin, FreeBSD, NetBSD and OpenBSD systems.
In the 1980s and early 1990s, various companies introduced their own commercial versions of UNIX. These variants were based on either the AT&T or the BSD releases and added high-end features developed for their hardware architectures. Examples include Digital's Tru64, HP's HP-UX, IBM's AIX, Sequent's DYNIX/ptx, SGI's IRIX, and Sun's Solaris and SunOS.
UNIX is relatively simple, with hundreds of system calls where other operating systems have thousands. Almost everything in UNIX is abstracted as a file (with some exceptions, such as sockets), which allows manipulation with a handful of system calls (open, close, read, write and lseek). UNIX also features fast process creation via the unique fork() system call.
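The fork() model can be illustrated from Python, whose os module wraps the same POSIX calls (a minimal Unix-only sketch; `run_in_child` is a made-up helper):

```python
import os

def run_in_child(fn):
    """fork() a child, run fn there, and return its exit code to the parent."""
    pid = os.fork()
    if pid == 0:
        # Child: the same program image continues here, with fork() returning 0.
        os._exit(fn())
    # Parent: fork() returns the child's PID; wait for the child to finish.
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status)

exit_code = run_in_child(lambda: 7)
```

The defining quirk of fork() is visible even in this sketch: one call, two returns, distinguished only by the return value.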
Oracle tablespace space management
Prior to Oracle 8.1.5, there were only dictionary-managed tablespaces: space in a tablespace is tracked in the data dictionary. Allocating space causes a row to be deleted from one data dictionary table and added to another; de-allocation does the reverse. SQL executed by background processes to manage space this way is called recursive SQL, and an INSERT can trigger further recursive SQL. Recursive SQL is expensive if it happens frequently, and updates to the data dictionary must be serialized.
Oracle 8.1.5 introduced locally managed tablespaces. Instead of the data dictionary, a bitmap stored in the data file itself tracks extents. This speeds things up because flipping bits has far lower overhead than recursive SQL.
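The bitmap idea can be sketched as follows: one bit per extent, set on allocation and cleared on de-allocation, with no dictionary rows touched (a deliberately toy model, not Oracle's actual structure):

```python
class BitmapTablespace:
    """Toy model of a locally managed tablespace: 1 bit per extent (1 = used)."""
    def __init__(self, total_extents):
        self.bits = [0] * total_extents

    def allocate(self):
        for i, b in enumerate(self.bits):
            if b == 0:
                self.bits[i] = 1      # flip a bit instead of running recursive SQL
                return i
        raise RuntimeError("tablespace full")

    def free(self, i):
        self.bits[i] = 0              # de-allocation is just clearing the bit

ts = BitmapTablespace(8)
first = ts.allocate()
second = ts.allocate()
ts.free(first)
third = ts.allocate()                 # reuses the freed extent slot
```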
Oracle Storage Hierarchy
Segments are database objects that consume storage (table, index, rollback, temporary and cluster segments, etc.). A segment contains one or more extents; some segments start with one extent, others with two (e.g. rollback segments). An extent is a logically contiguous allocation of space in a file. Extents contain blocks, the smallest unit of allocation in Oracle. The block is also the unit of I/O in Oracle.
A database may use more than one block size, though this is mainly useful for transporting data between databases. Each tablespace has a single, consistent block size; different tablespaces in the same database may differ.
A block begins with a header indicating the block type (table, index, etc.), transaction information about active and past transactions on the block, and the block's address on disk. Next comes the table directory, with information about the tables that store rows in this block, followed by the row directory, an array of pointers describing and locating the actual rows in the block. The rest of the block holds data and free space.
A database comprises tablespaces. A tablespace is a container that holds segments; segments never cross tablespace boundaries. Tablespaces comprise one or more files, which can be raw devices or files in a cooked file system. An extent within a segment is contained entirely within one file, but a segment can have extents in different files.
In summary:
- Database contains tablespaces
- Tablespace contains segements
- Tablespace contains files
- Segment contains extents
- File contains extents
- Extent contains blocks
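The containment rules above can be sketched as a toy model (illustrative names, not Oracle's actual structures): an extent belongs to exactly one file, while a segment may own extents from several files.

```python
# Toy model of the hierarchy: database > tablespace > segment > extent > block.
class Extent:
    def __init__(self, file_name, start_block, n_blocks):
        self.file_name = file_name     # an extent lives entirely within one file
        self.blocks = list(range(start_block, start_block + n_blocks))

class Segment:
    def __init__(self, name):
        self.name = name
        self.extents = []              # extents may come from different files

seg = Segment("EMP_TABLE")
seg.extents.append(Extent("users01.dbf", 0, 8))
seg.extents.append(Extent("users02.dbf", 0, 8))   # second extent, different file

files_used = {e.file_name for e in seg.extents}
```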
Sunday, April 29, 2012
x86-64
Intel created IA-64, a new 64-bit architecture, while AMD created AMD64. AMD64 is backward compatible with x86 and can run 32-bit programs. Intel subsequently created its own compatible version called EM64T/IA-32e. The differences between the two are minimal.
AMD64 has 2 modes. Legacy mode makes the CPU behave like a 32-bit CPU, with all 64-bit enhancements turned off. Long mode is the native 64-bit mode; 32-bit programs can still run in its compatibility sub-mode, which the CPU can switch into and out of quickly. Mac OS X used this mechanism to run a 32-bit kernel with 64-bit applications.
AMD64 enhancements include:
(1) The 32-bit registers are extended to 64 bits (e.g. EAX -> RAX)
(2) 8 new 64-bit registers called R8 to R15
(3) A no-execute (NX) bit is available by default to mark pages as non-executable. (The NX bit was already available in some x86 processors where PAE was enabled.)
(4) Since a full 64-bit virtual address space would require a huge amount of memory for page tables, only a 48-bit subset is used. The remaining 16 bits must be copies of bit 47 of the address (sign extension), making the address "canonical".
(5) Pages can be 4KB, 2MB or 1GB in size.
(6) Segmentation has been crippled. The GDT still exists, but most of its entries are ignored.
(7) The procedure calling convention has changed. IA-32 passes parameters on the stack; x86-64 passes the majority of parameters in registers.
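Point (4) can be checked with a few lines of code: an address is canonical exactly when bits 48-63 equal bit 47. A small sketch:

```python
def is_canonical(addr):
    """x86-64 canonical form: bits 63..48 are copies of bit 47."""
    top = addr >> 47                          # bit 47 plus all bits above it
    return top == 0 or top == (1 << 17) - 1   # all zeros or all ones

def canonicalize(addr48):
    """Sign-extend a 48-bit value into a canonical 64-bit address."""
    if addr48 & (1 << 47):
        return addr48 | (0xFFFF << 48)
    return addr48
```

This is why the user/kernel split falls out naturally on x86-64: canonical addresses form two disjoint halves, a "low" half with zeros on top and a "high" half with ones.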
TLB
Virtual memory translation is an expensive operation. The Translation Lookaside Buffer (TLB) improves performance by caching recently translated addresses, so the MMU (Memory Management Unit) does not have to walk the page tables stored in RAM. The TLB exploits both temporal and spatial locality.
On a context switch, the TLB must be flushed to avoid stale translations. All CPUs provide a means to flush the entire TLB or specific entries. There is also a mechanism to keep entries that do not change across context switches, marking them as global entries. Flushing the TLB is a performance penalty.
From the MMU's perspective, the operating system accesses its own page table like any user process. Therefore, if the kernel lived in its own address space, the TLB would be flushed not only on every context switch but also at every entry to and exit from the kernel. Moreover, the kernel needs to access user memory (e.g. to bring in the arguments of a call or return results to user space). On architectures such as x86/x86-64, which provide no hardware support for accessing the context of another process, this would mean TLB flushes at each kernel entry/exit plus a page-table walk each time another context is referenced. This has a huge performance impact.
To avoid this cost, operating systems implement a combined user/kernel address space scheme (dividing the single 4GB address space into a kernel portion and a user portion). Translated entries in the kernel area are marked as accessible by kernel code only, and these entries are not flushed.
Other architectures, such as SPARC v9, provide support for accessing one context from inside another and for tagging TLB entries with a specific context. There it is possible to separate the kernel address space from user address spaces.
Saturday, April 28, 2012
Heap and Stacks
Each process has at least two stacks: a user stack and a kernel stack. The kernel stack is used when the program switches into kernel mode (e.g. when making a system call). The kernel stack operates the same way as the user stack (growth direction, stack pointer usage, local variable usage), with slight differences: the kernel stack is usually limited in size (e.g. 4KB or 8KB on x86), so kernel code uses as few local variables as possible. Also, the kernel stacks of all processes reside in the same kernel address space, at different virtual addresses.
The basic unit of memory the kernel manages is the physical page frame, which is never smaller than 4KB. Using the physical page allocator directly is inefficient for allocating small objects and buffers; in addition, such small objects usually have short lifetimes, so page-granularity allocation would hurt system performance. Modern operating systems therefore use a separate kernel-level memory allocator that sits on top of the physical page allocator and is optimized for fast, frequent allocation and de-allocation of small objects. This allocator is a consumer of the physical page allocator: it requests pages from it and returns them. Each page is divided into a number of fixed-length chunks called slabs (after the slab allocator in SunOS). Pages containing objects of the same size are grouped together into a cache.
The slab allocator must also keep track of the state of the objects in each cache so that space can be reclaimed; reclamation is done by dedicated functions.
Object allocators usually contain a mechanism to detect overflow corruption, called redzoning. An arbitrary value is written at the end of each chunk and checked at release time. Since this degrades performance, it is usually turned off by default.
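A toy object cache with redzoning can be sketched like this (illustrative only, not the real slab allocator): each chunk gets a guard byte after the object, checked on release.

```python
REDZONE = 0xAB   # arbitrary sentinel byte written just past each object

class ObjectCache:
    """Toy slab-style cache: fixed-size chunks from one buffer, each redzoned."""
    def __init__(self, obj_size, n_objs):
        self.obj_size = obj_size
        self.chunk = obj_size + 1                   # object + 1 guard byte
        self.mem = bytearray(self.chunk * n_objs)
        self.free_list = list(range(n_objs))
        for i in range(n_objs):                     # plant the guard bytes
            self.mem[i * self.chunk + obj_size] = REDZONE

    def alloc(self):
        return self.free_list.pop()                 # index of a free chunk

    def write(self, idx, data):
        off = idx * self.chunk
        self.mem[off:off + len(data)] = data        # no bounds check: can overflow

    def release(self, idx):
        guard = self.mem[idx * self.chunk + self.obj_size]
        if guard != REDZONE:
            raise MemoryError("redzone smashed: overflow detected")
        self.free_list.append(idx)

cache = ObjectCache(obj_size=4, n_objs=2)
a = cache.alloc()
cache.write(a, b"1234")          # exactly obj_size bytes: fine
cache.release(a)

b_idx = cache.alloc()
cache.write(b_idx, b"12345")     # one byte too many: smashes the guard
```

The overflow goes unnoticed at write time and is only caught when the chunk is released, which is exactly the behaviour (and the limitation) of redzoning.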
Data Model of System
A data model is usually expressed in terms of integer, long and pointer (ILP) sizes. For example, ILP32 means int, long and pointer are all 32 bits. LP64 means long and pointer are 64 bits while int (not listed) stays at 32 bits. LLP64 means int and long are 32 bits but long long and pointer are 64 bits. Note that char and short are 8 and 16 bits in all models.
Most compilers use ILP32 for 32-bit code and LP64 for 64-bit code. Most major UNIX systems, including Mac OS X, use the LP64 model; Windows uses LLP64. A C program that runs well on a 32-bit system (ILP32) may not work under LP64.
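The models can be tabulated (sizes in bits). The table below is the standard definition; the derived check highlights the classic porting hazard, C code that assumes a pointer fits in a long:

```python
# Type widths, in bits, under the common data models.
DATA_MODELS = {
    "ILP32": dict(char=8, short=16, int=32, long=32, longlong=64, ptr=32),
    "LP64":  dict(char=8, short=16, int=32, long=64, longlong=64, ptr=64),
    "LLP64": dict(char=8, short=16, int=32, long=32, longlong=64, ptr=64),
}

# The LP64/LLP64 difference that breaks "pointer fits in a long" C code:
long_matches_ptr = {m: t["long"] == t["ptr"] for m, t in DATA_MODELS.items()}
```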
Sunday, April 15, 2012
Connecting to Oracle
A user makes a request using Oracle client software, such as SQL*Plus, to connect to the database:
sqlplus user/password@oracle10g.localdomain
The user supplies a username and password. The target database is represented by a TNS (Transparent Network Substrate, the Oracle networking layer that handles remote connections) service name. The TNS name is translated to a hostname and port number using a mapping file such as tnsnames.ora, or via Oracle Internet Directory (OID, an LDAP directory).
A TNS listener on the database server listens on that port for incoming connection requests. For a dedicated server connection, the listener fork()s and exec()s a new UNIX process to communicate with the user. This dedicated server process executes SQL requests, reading data from files or finding it in the SGA cache.
For a shared server (previously called MTS, Multi-Threaded Server) connection, Oracle uses a pool of shared processes to handle user requests. The user does not talk directly to a shared process as in a dedicated server connection; instead, the user talks to a process called a dispatcher. The dispatcher puts the user request into a request queue in the SGA. The first shared process that is not busy picks up the request from the queue, executes it and puts the result into a response queue in the SGA. The dispatcher then picks up the result and sends it back to the user over the network. When a user connects to the database, the listener chooses a dispatcher from the dispatcher pool and redirects the client to it by sending back the port the dispatcher listens on; the client software then talks to that dispatcher directly.
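The shared-server flow can be sketched with two in-memory queues and a worker thread standing in for a shared server process (all names here are illustrative, not Oracle internals):

```python
import queue
import threading

request_q = queue.Queue()    # stands in for the SGA request queue
response_q = queue.Queue()   # stands in for the SGA response queue

def shared_server():
    """First idle shared process: take a request, execute it, post the result."""
    while True:
        client, sql = request_q.get()
        if sql is None:                  # shutdown signal for this sketch
            break
        response_q.put((client, f"rows for: {sql}"))

worker = threading.Thread(target=shared_server)
worker.start()

# Dispatcher side: enqueue a client request, then pick up the response.
request_q.put(("client-1", "SELECT * FROM emp"))
client, result = response_q.get()

request_q.put((None, None))              # stop the worker
worker.join()
```

The decoupling shown here is the whole point of the shared-server design: clients and server processes never pair up; they only meet through the two queues.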
Oracle Instance
When Oracle starts up on UNIX, processes with many different names appear; these are referred to as the Oracle background processes. They are persistent, existing from the time Oracle is started until it is shut down. They are all started from a single Oracle executable but pick up different personalities depending on the function they perform. On Windows, these processes become individual threads within a single process.
The Oracle SGA is a shared memory segment that the background processes attach to. On UNIX, the processes use the shmget() and shmat() calls. Under Windows, the threads use the C call malloc() to allocate the memory and share it within their one address space.
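The attach-and-share idea can be sketched with Python's multiprocessing.shared_memory (which wraps POSIX shared memory rather than the SysV shmget()/shmat() calls named above, but illustrates the same create-then-attach-by-name pattern):

```python
from multiprocessing import shared_memory

# Create a small segment (the "SGA"), as the first background process would.
sga = shared_memory.SharedMemory(create=True, size=16)
sga.buf[0:5] = b"cache"

# A second process would attach by name, much as shmat() does after shmget():
peer = shared_memory.SharedMemory(name=sga.name)
seen = bytes(peer.buf[0:5])

peer.close()
sga.close()
sga.unlink()     # remove the segment once every attacher is done
```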
Sunday, April 1, 2012
ACL
A user's SID is the account identifier that Windows assigns to the user during login. The access token that holds the SID also contains structures identifying the groups the user belongs to and the privileges the user has. Each group entry also has a SID, which points to structures describing the group's rights.
The privileges section of the access token begins with a count of the privileges the user has, followed by an array of privilege entries. Each privilege entry contains a Locally Unique ID (LUID), essentially a pointer to the entry's object, and an attribute mask telling what rights the user has to the object. Group SID entries are essentially the same: a privilege count and an array of privilege entries.
Object rights flow down to the lowest possible node unless overridden by another SID. For example, if a user has read and write privileges on \temp, those rights apply to all its sub-folders. The same applies to containers such as a Word document, which may contain other files.
NT Security
The protection mechanism has been around since Windows NT, so it is called NT security. It uses a lock-and-key concept: the lock is a kind of access control and the keys are rights. Most resources carry multiple levels of permission to provide fine-grained control.
A user has personal rights assigned by the administrator. A user can also belong to groups, whose members all share the same rights. The user's access is the combination of the group and individual rights the administrator assigned.
Rights can be assigned by the administrator. Likewise, a developer can write code that sets Windows security for particular objects, calls and portions of an application. Changes by the administrator or developer affect the rights required to perform specific tasks on resources such as a file; the right to write to a file is separate from the right to read from it.
User-level access depends on the Security Identifier (SID). When a user logs in, Windows builds an access token and places the user's SID (stored in the domain controller) in it. The access token contains a DACL (Discretionary Access Control List) and a SACL (System Access Control List). The combination of SID and ACLs in the access token grants the user access to certain resources. Because the access token is session based, the user must log off and log in again to gain additional rights assigned by the administrator during the session.
The lock on a resource is called a security descriptor. It tells what rights a user needs to access the resource; if the token's ACL meets or exceeds the rights in the security descriptor, the lock opens.
Saturday, March 10, 2012
RAID
0 = Disk striping for performance
1 = Disk mirroring with slight performance improvement
2 = Hamming encoding. Encode data on disk with error correcting code. Data on a single disk is correctable when error occurs
3 = Virtual disk blocks. Data is striped across multiple disks with a single disk for parity information. Data can be reconstructed based on parity.
4 = Dedicated parity disk. Parity of all data is stored on a separate disk. Data can be reconstructed based on parity
5 = Striped parity disk. Parity is striped across all disks.
Measurement of Fault Tolerance
Coverage is the conditional probability that the system will recover automatically within the required time interval given that an error has occurred. Coverage can be computed from the probability associated with detection and recovery.
Coverage = Prob(successful error detection) x Prob(successful error recovery)
Obtaining the probabilities used to compute coverage is difficult and requires extensive stability testing and fault-insertion testing.
MTTF (Mean Time to Fail) is the average time from the start of operation until the time when the first failure happens.
MTTR (Mean Time to Repair) is the time required to restore a failing component to operation. This includes travel time to the site to perform the repair, system initialization and any recovery actions to be executed.
MTBF (Mean Time Between Failures) is the time from the start of operation until the component has failed and been repaired. In other words, MTBF = MTTF + MTTR.
MTBF is used when the component is repairable and MTTF is when the component is not repairable.
Reliability is the probability that a system can perform according to specification for a specific period of time; in other words, that there are no failures within that time.
Reliability(t) = e ^ -(t/MTTF), assuming a constant failure rate
Failure rate = 1/MTTF
FIT (Failure in Time) = number of failures in 1x10^9 hours (1 billion hours)
Availability is the percentage of time that a system is able to perform its designed function.
Availability = MTTF/(MTTF+MTTR) = MTTF/MTBF.
Note that availability is a percentage and reliability is a probability.
Dependability is a measure of a system's trustworthiness to be relied upon to perform the desired function. The attributes of dependability are reliability, availability, safety and security. Safety refers to the non-occurrence of catastrophic failures, whose consequences outweigh the benefits of the service.
The capacity of a system represents a tradeoff between the system's cost and its dependability under load. When the load is higher than the capacity, the system can suffer a total or partial failure, or run with degraded performance.
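These definitions translate directly into code; a quick sketch with made-up example numbers (the MTTF/MTTR/probability values are purely illustrative):

```python
import math

def availability(mttf, mttr):
    """Fraction of time the system is up: MTTF / (MTTF + MTTR) = MTTF / MTBF."""
    return mttf / (mttf + mttr)

def reliability(t, mttf):
    """Probability of surviving to time t with constant failure rate 1/MTTF."""
    return math.exp(-t / mttf)

def coverage(p_detect, p_recover):
    """P(successful detection) x P(successful recovery)."""
    return p_detect * p_recover

# Hypothetical component: fails on average every 999 hours, 1 hour to repair.
mtbf = 999 + 1                            # MTBF = MTTF + MTTR
a = availability(999, 1)                  # 0.999, i.e. "three nines"
r = reliability(100, 999)                 # chance of no failure in 100 hours
c = coverage(0.95, 0.90)
```

Note how availability (a percentage) and reliability (a probability over a stated interval) come out of different formulas, matching the distinction drawn above.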
Saturday, March 3, 2012
Failure Perception
A fail-silent failure is one in which the failing unit either produces correct results or no results at all. Fail-silent failures are the easiest type of failure to tolerate, because the observed failure is simply that the failing unit has stopped working. The reason for the failure may be unknown, but the failing element is identified and the failure is contained, not spreading to other parts of the system.
A crash-failure is one where the unit stops after the first fail-silent failure.
A fail-stop failure is a crash-failure that is visible to the rest of the system.
Consistent failures are seen as the same kind of failure by all observers in the system. Inconsistent failures appear different to different observers; they are also called two-faced failures, malicious failures or Byzantine failures, and they are the most difficult to isolate and correct. An example of a consistent failure is a system that answers 1 to all questions. An example of an inconsistent failure is a system that answers 1 to user 1 and 2 to all other users, or routes all traffic to one network address and none to another.
A common design principle in fault-tolerant systems is to assume only one failure at any one time. However, many failures have occurred when this assumption was invalid.
Tolerating n fail-silent failures requires n+1 units. Tolerating n consistent failures requires 2n+1 units, and n malicious failures requires 3n+1 units. The computer system in the Space Shuttle, designed to tolerate 2 simultaneous failures that must be consistent but need not be fail-silent, therefore requires 5 computers.
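The replication requirements can be captured in a few lines (n is the number of simultaneous failures to tolerate; the function name is illustrative):

```python
def units_needed(n, failure_mode):
    """Minimum replicas to tolerate n simultaneous failures of a given kind."""
    return {
        "fail-silent": n + 1,      # a stopped unit is easy to vote out
        "consistent":  2 * n + 1,  # need a majority of correct units
        "malicious":   3 * n + 1,  # Byzantine agreement bound
    }[failure_mode]

# Space Shuttle case: tolerate 2 consistent (but not fail-silent) failures.
shuttle = units_needed(2, "consistent")
```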
Fault, Error and Failure
Failure = delivered service no longer complies with the specification (the agreed description of the system's expected function and service)
Error = part of the system state that is liable to lead to subsequent failure
Fault = adjudged or hypothesized cause of an error
Failures are detected by observers or users of the system. Failures depend on the definition of agreed-upon correct operation of the system. If there is no specification of what a system should do, there cannot be a failure. The same failure can result from different errors.
An error is the incorrect system behavior from which a failure may occur. Errors can be categorized into 2 types. An error that manifests as a value error might be an incorrect discrete value or an incorrect system state. A timing error can include total non-performance or a race condition. Errors can be detected before they become failures. An error is a manifestation of a fault; it is the way we can look into the system to discover whether a fault is present.
A fault is the defect in the system that can cause an error. Neither the software nor the observer is aware of the presence of a fault until an error occurs. A latent fault is one that is lying dormant and has not caused any error. A latent fault becomes an active fault when the circumstances arise.
Monday, January 23, 2012
C pointer
Pointers are declared using the indirection operator (*). The syntax is
type *variable
For example, int *p is a pointer to an integer
If A is the memory location of the pointer p itself and B is the memory location of the integer p points to,
*p = 123 => stores the value in location B
p = 123 => stores the value in location A; in other words, it changes p to point to another memory location (in C this particular assignment requires a cast, e.g. p = (int *)123)
The & operator returns the address of the variable. For example,
p = &n makes p point to n
Tuesday, January 10, 2012
Radius
Dial-up networks usually have a local modem pool to provide cheap access. However, the ISP does not want to keep a copy of the user database at each point of presence. The user is the supplicant, the POP is the authenticator, and the central database is the authorizer. The protocol used between the POP and the database is called RADIUS (Remote Authentication Dial-In User Service).
PPP
Point-to-Point Protocol converts the unstructured modem link into a packet-based environment suitable for transporting IP packets. PPP has 2 authentication methods: PAP sends the username and password in the clear; CHAP uses a challenge-response mechanism.
WEP keys
WEP keys have the following characteristics:
(1) Fixed length - usually 40 bits or 104 bits (in the latter case, vendors touted the solution as 128-bit because the protocol adds a 24-bit Initialization Vector to form the full key)
(2) Static - no changes except by reconfiguration
(3) Shared - AP and station have the same key(s)
(4) Symmetric - same key used to encrypt and decrypt data
There are 2 approaches to using WEP keys - Default Keys and Key Mapping Keys.
Default Keys
All stations and the APs use a single set of keys. The standard specifies that there should be 4 default keys for each device. Only one default key is needed for security to work; the additional keys are used to ease progressive key changes in an environment with many devices. When 2 default keys are defined, all transmissions from the AP are encrypted using a single selected key, called the active key. However, received frames (from stations) can be decrypted using either of the 2 keys as appropriate. The AP can thus decrypt data both from stations that have changed to the new key and from those that have not yet changed.
2 sets of default keys further allow the AP and stations to use different key sets for transmitting. This is called directional key use. The key used by the transmitter is indicated by the KeyID bits in the encrypted frame, so the receiver knows which key (0 to 3) was used.
Key Mapping Keys
Each station has its own key value. Not all vendors support this because of the complexity of key configuration and maintenance. Using a different key for each station complicates broadcast. In this case, all multicast traffic is encrypted using a default key that is shared by all stations; only unicast traffic is sent using the key mapping key.
The challenge for the AP is that it has to keep track of multiple keys. Another difficulty is ensuring that all APs have the same copy of the key table.
An AP can operate with default keys and key mapping keys simultaneously. When the AP sends or receives a frame, it looks in the key table to see whether there is a matching MAC entry for the station. If there is, it uses the key mapping key; otherwise, it uses the default key (hence the name).
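The lookup the AP performs can be sketched in C. The table layout, sizes, and names here are illustrative assumptions, not actual 802.11 driver structures:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical key-table entry: one key-mapping key per station MAC. */
struct key_entry {
    uint8_t mac[6];
    uint8_t key[13];          /* 104-bit key; a 24-bit IV is added on the wire */
};

/* Shared default keys: the standard allows 4, selected by the KeyID bits (0-3). */
static uint8_t default_keys[4][13];
static struct key_entry key_table[8];
static int key_table_len = 0;

/* Return the key-mapping key for this MAC if one exists,
 * otherwise fall back to the active default key. */
const uint8_t *select_tx_key(const uint8_t mac[6], int active_key_id) {
    for (int i = 0; i < key_table_len; i++)
        if (memcmp(key_table[i].mac, mac, 6) == 0)
            return key_table[i].key;         /* unicast: per-station key */
    return default_keys[active_key_id];      /* no match: shared default key */
}
```

Multicast traffic always takes the default-key path in this scheme, since a broadcast destination never matches a single station's MAC entry.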
802.11 Radio
There are 2 frequency bands for sending IEEE 802.11 data - 2.4GHz and 5GHz. The part of the radio that turns bits into an analog signal is called the modem (or baseband section). The second part, the very high frequency electronics that drive the antenna, is usually called the radio frequency (RF) section.
Improvements in modem techniques have resulted in successive versions of IEEE 802.11. The initial 1997 standard provided 2Mbps in the 2.4GHz band. 802.11a increased this to 54Mbps in the 5GHz band thanks to a better modem and a cleaner medium. 802.11b increased the speed to 11Mbps in the 2.4GHz band. 802.11g increased speeds again in 2.4GHz with a more sophisticated modem technique.
Stream Cipher and Block Cipher
A stream cipher takes a sequence of bytes and produces a stream of ciphertext. A block cipher handles a single block of data at a time. A stream cipher is like a sausage machine, while a block cipher is like a bakery. A distinction between the 2 ciphers is that the internal state of a stream cipher is continuously updated as data is processed; by contrast, the state of a block cipher is reset for each block prior to processing.
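The continuously-updated state of a stream cipher can be illustrated with a toy XOR keystream generator in C. This is a deliberately insecure sketch for illustration only (the constants are those of a common linear congruential generator, not a real cipher):

```c
#include <stddef.h>
#include <stdint.h>

/* Toy keystream generator: the state evolves with every byte processed,
 * which is the defining stream-cipher property described above. */
struct toy_stream {
    uint32_t state;
};

static uint8_t next_key_byte(struct toy_stream *s) {
    s->state = s->state * 1103515245u + 12345u;  /* state update per byte */
    return (uint8_t)(s->state >> 24);
}

/* XOR the buffer with the keystream: encryption and decryption
 * are the same operation, provided both sides start in the same state. */
static void toy_stream_xor(struct toy_stream *s, uint8_t *buf, size_t len) {
    for (size_t i = 0; i < len; i++)
        buf[i] ^= next_key_byte(s);
}
```

Encrypting with one instance and then applying a second instance that starts in the identical state recovers the plaintext, because both produce the same keystream.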
Monday, January 9, 2012
Infrastructure mode operation
The AP (access point) advertises its presence by transmitting a short wireless message at regular intervals, typically 10 times per second. This short message is called a beacon, and it allows the STA (station) to discover the identity of the AP.
A STA will start to search for an AP after its start-up initialization phase. There are a number of radio frequencies (called channels) that could be used, so the STA must tune to each channel in turn to discover the beacons. This process is called scanning. Alternatively, the STA may send a probe request message for in-range APs to respond to. Probing is also used for roaming.
When the STA is ready to connect to the AP, it first sends an authenticate request message to the AP. The AP replies with an authenticate response indicating acceptance. This gives the STA permission to connect.
The STA then issues an association request to connect. The AP responds with an association response to complete the connection.
In a roaming scenario, the STA chooses to move from one AP to another. The STA first sends the old AP a disassociation message, then reconnects to the new AP using a reassociation message. The reassociation message carries information about the old AP, allowing the new AP to talk to the old AP to confirm that the roam has taken place.
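The connection sequence above can be modeled as a small state machine. This is a simplified sketch of the described message flow, not the full 802.11 state machine; the enum and function names are illustrative:

```c
/* Station states in the simplified model: scanning for an AP,
 * authenticated (permission granted), or fully associated. */
typedef enum { SCANNING, AUTHENTICATED, ASSOCIATED } sta_state;

/* Messages received from the AP that drive the transitions. */
typedef enum { AUTH_RESPONSE, ASSOC_RESPONSE, DISASSOCIATION } ap_msg;

/* Advance the station's state on each AP message; requests the STA itself
 * sends (authenticate request, association request) are left implicit. */
sta_state sta_step(sta_state s, ap_msg m) {
    switch (s) {
    case SCANNING:      return (m == AUTH_RESPONSE)  ? AUTHENTICATED : s;
    case AUTHENTICATED: return (m == ASSOC_RESPONSE) ? ASSOCIATED    : s;
    case ASSOCIATED:    return (m == DISASSOCIATION) ? SCANNING      : s;
    }
    return s;
}
```

Roaming is the ASSOCIATED → SCANNING edge: after disassociating from the old AP, the station runs the same sequence again against the new AP (via reassociation).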
Wireless LAN modes
An access point is similar to a hub or switch in a wired LAN. When IEEE 802.11 systems work through an access point, they are said to be operating in infrastructure mode, because the access point coordinates the WiFi LAN from a fixed point and often provides a connection to a wired network.
Wireless devices can also transmit directly to one another. This is intended for groups of people who want to share information anywhere, anytime, on an ad hoc basis. This setup is thus called ad-hoc mode.
Monday, January 2, 2012
Simultaneous Multithreading (SMT) or Hyperthreading
A processor with hyperthreading looks like multiple CPUs to the OS. It is not a true multicore machine, as the logical SMT CPUs share most on-chip resources and contend with each other. Each logical processor has only its own register set and instruction pipeline; it shares the on-chip cache, MMU, TLB, and all other execution units. Therefore, an SMT CPU cannot process instructions twice as fast as a multicore machine.
SMT is an opportunity for parallelism particularly suited to CISC processors. Because the pipeline is deeper in CISC designs, the effect of a stall is more significant, and providing multiple pipelines helps reduce stalling.
Hyperthreading gets its name because it is most effective at accelerating multithreaded software. Threads, unlike processes, share memory and page-table entries, which makes them optimal to distribute across logical CPUs. Because there is only one MMU on an SMT CPU, threads experience more of a performance boost than processes do.
Database Driver Architecture
(1) Bridge
app <-> bridge <-> DB driver <-> database
Used to bridge between an existing database connectivity standard and a new one. For example, a JDBC/ODBC bridge driver presents a Java call interface to the program but translates the calls to the ODBC standard. These are also called JDBC type 1 drivers, and they are often undesirable because they are constrained by the existing driver (ODBC) and thus cannot implement the new standard (JDBC) fully. These drivers also pose security risks and performance constraints under high load.
(2) Database Client-Based
app <-> DB driver <-> client software <-> database
The driver communicates through the database client software (e.g. Oracle NETx, Sybase OpenClient). Many ODBC drivers and ADO.NET data providers use this architecture. In JDBC, this is classified as a type 2 driver and is generally not desirable. The disadvantages are:
(a) Management overhead of maintaining client software on all client platforms
(b) Functional and quality constraints imposed on the driver by the client software
(c) For ADO.NET, the need to transition to unmanaged code in the client software
(3) Database Wire-Protocol
app <-> DB driver <-> database
JDBC drivers using this architecture are called type 4 drivers. The driver generates the required wire-protocol calls (e.g. DRDA for DB2) to communicate directly with the database. Performance is optimal. For Java and .NET, because the driver does not need to call client software, the DB calls execute in fully managed code.
(4) Independent Protocol
app <-> DB driver client <-> DB Driver Server <-> database
These drivers translate the standards-based API calls into a database-independent protocol, which is then translated to the database wire protocol by a server. In JDBC, this is known as type 3. This architecture can provide advanced features (e.g. security or encryption), connect to a varied set of data sources (e.g. VSAM), and offer central management and monitoring capability.