Sunday, February 23, 2014

Socket Address Structure

The socket API specifies a generic data type called sockaddr for use by API calls.

struct sockaddr {
    sa_family_t sa_family;  // address family e.g. AF_INET or AF_INET6
    char sa_data[14];    // address info - a blob of bytes whose layout varies by OS and network type
};

Note that this sockaddr structure is not large enough to hold an IPv6 address, which is 16 bytes long.

The actual data structures used in socket calls are sockaddr_in (for IPv4) and sockaddr_in6 (for IPv6).  A pointer to the structure is cast to (struct sockaddr *) when it is passed to an API call.

struct in_addr { uint32_t s_addr; }; // 4-byte IPv4 address
struct sockaddr_in {
    sa_family_t sin_family;  // address family (AF_INET for IPv4)
    in_port_t sin_port;    // 16-bit port
    struct in_addr sin_addr;
    char sin_zero[8];    // padding
};
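
As a minimal sketch of how these structures are typically filled in and cast, assuming a POSIX system, the following populates a sockaddr_in and exposes it through the generic sockaddr type. The helper names make_ipv4_addr and as_generic are hypothetical and chosen only for illustration.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

/* Fill a sockaddr_in from a dotted-quad string and a host-order port.
 * Returns 0 on success, -1 if the string is not a valid IPv4 address.
 * make_ipv4_addr is a hypothetical helper, not part of the socket API. */
int make_ipv4_addr(const char *ip, uint16_t port, struct sockaddr_in *out)
{
    memset(out, 0, sizeof *out);      /* also clears the sin_zero padding */
    out->sin_family = AF_INET;
    out->sin_port = htons(port);      /* the port is stored in network byte order */
    if (inet_pton(AF_INET, ip, &out->sin_addr) != 1)
        return -1;
    return 0;
}

/* API calls take the generic type, so the specific structure is cast. */
const struct sockaddr *as_generic(const struct sockaddr_in *in)
{
    return (const struct sockaddr *)in;
}
```

Both views share the same leading family field, which is how a call receiving the generic pointer can tell which concrete structure it was given.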

Socket

A socket is a general abstraction through which programs send and receive data.  Different types of socket correspond to different underlying protocol suites and to different stacks of protocols within a suite.

The main types of TCP/IP socket are the stream socket and the datagram socket.  A stream socket represents one end of a TCP connection.  It is identified by an IP address, a port number and the end-to-end protocol (TCP).

A socket is created by a socket call which returns a handle to the socket:

int socket(int domain, int type, int protocol);

"Domain" refers to the communication domain.  Recall that the socket API is a generic interface for a large number of communication domains (e.g. AF_INET for IPv4 and AF_INET6 for IPv6).

int hSocket = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);

"Type" determines the semantics of data transmission over the socket, for example whether transmission is reliable and whether message boundaries are preserved.  Common values are SOCK_STREAM and SOCK_DGRAM.

"Protocol" refers to the end-to-end protocol to be used.  Common values are IPPROTO_TCP and IPPROTO_UDP.  A value of 0 selects the default protocol for the given "Type".

The close() call closes the socket.
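
The calls above can be sketched as follows, assuming a POSIX system (on Windows the Winsock equivalent of close() is closesocket()). The helper names open_tcp_socket, open_udp_socket and socket_type are hypothetical.

```c
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

/* Create an IPv4 stream socket with the protocol named explicitly.
 * Returns the descriptor (the "handle"), or -1 on failure. */
int open_tcp_socket(void)
{
    return socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
}

/* Create an IPv4 datagram socket; protocol 0 picks the default for the type. */
int open_udp_socket(void)
{
    return socket(AF_INET, SOCK_DGRAM, 0);
}

/* Ask the kernel which type a socket has; returns -1 on error. */
int socket_type(int fd)
{
    int type = 0;
    socklen_t len = sizeof type;
    if (getsockopt(fd, SOL_SOCKET, SO_TYPE, &type, &len) != 0)
        return -1;
    return type;
}
```

A caller would check the returned descriptor for -1 before use and pass it to close() when finished.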

Special Network Addresses

(1) The loopback address is assigned to a loopback interface, a virtual device that echoes transmitted packets back to the sender.  For IPv4, it is 127.0.0.1 and for IPv6, it is ::1.

(2) Private addresses
This group of addresses is for use by sites that connect to the Internet via NAT.  These addresses cannot be reached from the global Internet.  For IPv4, they start with 10, 192.168 or 172.16 through 172.31.  IPv6 has no NAT-oriented equivalent, although unique local addresses (fc00::/7) play a similar private-use role.

(3) Link Local or Autoconfiguration addresses
These addresses can only be used to communicate with hosts on the same network.  Routers will not forward packets sent to these addresses.  For IPv4, they start with 169.254.  For IPv6, they fall in fe80::/10, so they start with FE8, FE9, FEA or FEB.

(4) Multicast addresses
For IPv4, they start with 224 through 239.  For IPv6, they start with FF.
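
The IPv4 ranges listed in this section can be checked with a few comparisons on the first two octets. This is a rough sketch: the enum and function names are hypothetical, and it treats the whole 127.x.x.x block as loopback, which goes slightly beyond the single address mentioned above.

```c
#include <stdint.h>

typedef enum {
    ADDR_ORDINARY,
    ADDR_LOOPBACK,
    ADDR_PRIVATE,
    ADDR_LINK_LOCAL,
    ADDR_MULTICAST
} addr_class;

/* Classify an IPv4 address given in host byte order,
 * e.g. 0xC0A80001 for 192.168.0.1. */
addr_class classify_ipv4(uint32_t addr)
{
    uint8_t a = (uint8_t)(addr >> 24);   /* first octet */
    uint8_t b = (uint8_t)(addr >> 16);   /* second octet */

    if (a == 127) return ADDR_LOOPBACK;               /* assumption: all of 127/8 */
    if (a == 10) return ADDR_PRIVATE;
    if (a == 172 && b >= 16 && b <= 31) return ADDR_PRIVATE;
    if (a == 192 && b == 168) return ADDR_PRIVATE;
    if (a == 169 && b == 254) return ADDR_LINK_LOCAL;
    if (a >= 224 && a <= 239) return ADDR_MULTICAST;
    return ADDR_ORDINARY;
}
```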

JVM Architecture

The JVM has a stack-based architecture without general-purpose registers.  This allows the JVM to run the same code regardless of the underlying hardware, since real machines differ in the number and size of their registers and in how those registers relate to memory.  The only register-like structure is the program counter.  The result of a method call is returned on the stack.

Mutex

Mutexes are referred to as mutants in the Windows kernel.  Mutexes are global objects for synchronizing execution.  Mutex names are usually hard-coded because the name must be consistent when the mutex is shared by two processes or threads.  Only one thread can own a mutex at any one time.  The CreateMutex function creates a mutex.  A thread gains ownership of the mutex with WaitForSingleObject and releases it after use with ReleaseMutex.  Another process or thread uses OpenMutex to obtain a handle to the mutex before using it.

First and Second Chance Exceptions

Debuggers are given two chances to handle an exception in the program being debugged.  When an exception occurs, execution of the program stops and the debugger is given a first chance to handle the exception.  The debugger can handle it or choose to pass it on to the program.  In the latter case, the program's registered exception handler is given control.

If the program does not handle the exception, the debugger is given a second chance to handle it.  If no debugger is attached, the program will usually crash at this point.  The debugger must resolve the exception to enable the program to continue running.

Break Points

Software breakpoints are implemented by overwriting the first byte of the instruction at the break location with 0xCC, the opcode for INT 3.  This passes control to the debugger when execution reaches that point.  The debugger still displays the original instruction, but inspecting the memory directly shows that the byte has changed to 0xCC.

Software breakpoints may not work when code is self-modifying (e.g. malware).  In that case, the patch may be overwritten and the breakpoint will not be effective.
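
The patching scheme can be illustrated on a plain byte array standing in for code memory. This is only a sketch of the bookkeeping, not a working debugger, and the soft_bp type and bp_* function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define INT3_OPCODE 0xCC

typedef struct {
    size_t  offset;     /* where the breakpoint was planted */
    uint8_t original;   /* byte that 0xCC replaced */
} soft_bp;

/* Plant a breakpoint: remember the original byte, then overwrite it with INT 3. */
soft_bp bp_set(uint8_t *code, size_t offset)
{
    soft_bp bp = { offset, code[offset] };
    code[offset] = INT3_OPCODE;
    return bp;
}

/* What a debugger would display: the saved byte at the breakpoint,
 * the real memory content everywhere else. */
uint8_t bp_display_byte(const uint8_t *code, const soft_bp *bp, size_t offset)
{
    return offset == bp->offset ? bp->original : code[offset];
}

/* The breakpoint is only effective while the 0xCC patch is still in place. */
int bp_is_intact(const uint8_t *code, const soft_bp *bp)
{
    return code[bp->offset] == INT3_OPCODE;
}

/* Remove the breakpoint by restoring the original byte. */
void bp_clear(uint8_t *code, const soft_bp *bp)
{
    code[bp->offset] = bp->original;
}
```

Self-modifying code corresponds to something other than the debugger writing over code[offset], after which bp_is_intact reports 0.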

Hardware breakpoints are assisted by hardware.  For each instruction being executed, the processor compares the addresses involved with special debug registers to determine whether a breakpoint has been reached.  One major drawback is that x86 has only 4 breakpoint address registers.  DR0 to DR3 store the addresses of breakpoints.  DR7 is the control register, which indicates whether each of DR0-3 is active and whether its address represents a read, write or execute breakpoint.  Read/write breakpoints allow the program to break when a memory address is accessed.

To protect the debug registers from being modified by malware, set the General Detect flag in DR7.  The processor will then break prior to any mov instruction that accesses a debug register.
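
As a rough illustration of how DR7 encodes this, the sketch below computes a DR7 value for a given breakpoint slot. The bit positions (enable bits in bits 0-7, General Detect in bit 13, a 4-bit type/length field per slot starting at bit 16) come from the Intel architecture manuals rather than from these notes, and the function names are hypothetical. Actually loading the value into the register requires kernel or debugger support, which is not shown.

```c
#include <stdint.h>

/* Breakpoint condition (R/W field) and length (LEN field) encodings. */
enum { BP_EXECUTE = 0x0, BP_WRITE = 0x1, BP_READ_WRITE = 0x3 };
enum { BP_LEN_1 = 0x0, BP_LEN_2 = 0x1, BP_LEN_4 = 0x3 };

#define DR7_GENERAL_DETECT (1u << 13)

/* Return dr7 with breakpoint slot 0..3 (DR0..DR3) locally enabled
 * for the given condition and length. */
uint32_t dr7_enable(uint32_t dr7, unsigned slot, unsigned rw, unsigned len)
{
    unsigned shift = 16 + slot * 4;

    dr7 &= ~(0xFu << shift);                        /* clear old R/W and LEN */
    dr7 |= (uint32_t)(rw & 0x3u) << shift;          /* R/W field */
    dr7 |= (uint32_t)(len & 0x3u) << (shift + 2);   /* LEN field */
    dr7 |= 1u << (slot * 2);                        /* local enable bit */
    return dr7;
}

/* A slot is active if its local or global enable bit is set. */
int dr7_slot_active(uint32_t dr7, unsigned slot)
{
    return ((dr7 >> (slot * 2)) & 0x3u) != 0;
}
```

For example, dr7_enable(0, 1, BP_WRITE, BP_LEN_4) | DR7_GENERAL_DETECT describes a 4-byte write breakpoint on the address in DR1 with General Detect turned on.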

A conditional breakpoint breaks only when a predefined condition is met, for example when the second parameter of a function has a particular value.  This makes it practical to stop at a frequently executed location only when the condition of interest holds.  Conditional breakpoints are implemented as software breakpoints.

Stack Layout

ESP points to the top of the stack.  EBP usually does not change during the body of a function, so it provides a fixed reference point for accessing local variables by offset.

(1) The arguments are pushed onto the stack first
(2) Next, the return address is pushed automatically by the CALL instruction
(3) The old EBP is pushed next
(4) Lastly, space for the local variables is allocated

pusha and pushad push the 16-bit and 32-bit general-purpose registers onto the stack, respectively.  pushad pushes EAX, ECX, EDX, EBX, the original ESP, EBP, ESI and EDI, in that order.

ESP always points to the top element in the stack.

NOP (Intel)

NOP is actually an XCHG EAX,EAX instruction.  Its opcode is 0x90.  NOPs are commonly seen in buffer overflow exploits when the exact address of the injected code can only be approximated.  Padding the code with a series of NOPs (a NOP sled) lets a jump that lands anywhere in the series slide through to the code.

Windows Thread

Threads share the address space of the process.  Each thread has its own stack and registers.  When the OS switches threads, the CPU context is saved in a structure called the thread context.

The CreateThread function creates a new thread.  The call specifies the start address of the code to be executed.  If the start address is that of LoadLibrary, the DllMain of the DLL will be executed after the DLL is loaded.

Windows Network API

Berkeley-compatible sockets work much like their UNIX counterparts.  They are implemented in the Winsock libraries, primarily in ws2_32.dll.  Common socket functions:

  • socket - create a socket
  • bind - attach a socket to a local address and port
  • listen - mark a socket as listening for incoming connections on its port
  • accept - accept an incoming connection from a remote socket
  • connect - open a connection to a remote socket which is waiting for a connection
  • recv - receive data
  • send - send data

Before any of these functions is used, the WSAStartup function must be called to load the network library and allocate resources.

WinINet is a higher-level API which implements the HTTP and FTP protocols.  It is implemented in Wininet.dll.
  • InternetOpen - initialize a connection to the Internet
  • InternetOpenUrl - open a connection to an HTTP or FTP site
  • InternetReadFile - read data from a file opened on the site

reg File

A file with the .reg suffix is a readable text file.  When the user double-clicks the file, its content is merged into the registry.  For example, the following adds a program that runs automatically when Windows starts:

Windows Registry Editor Version x.xx

[HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows\CurrentVersion\Run]
"abcvalue"="C:\\abc.exe"

Alternate Data Stream (ADS)

ADS is a feature that allows additional data to be attached to an existing file in NTFS, essentially adding one file to another.  The extra data does not show up in a DIR command listing, and it is not visible when the file is browsed or edited.  A program can access the stream via the name file.txt:Stream:$DATA.

Long Pointer (LP)

String parameters are usually given names prefixed with lp (e.g. lpStr1) because they point to the memory location where the string starts.  An LP is 32 bits wide.  P (pointer) is the same as LP in 32-bit systems.  They differ only in 16-bit systems.

Windows Handles

Like pointers, handles refer to an object or memory location.  However, handles cannot be used in arithmetic operations and they do not always represent memory addresses.  They can only be passed to function calls to refer to the same object.

Oracle Network Architecture

The application layer is implemented by OCI (Oracle Call Interface) on the client side and OPI (Oracle Program Interface) on the server side.

The presentation layer protocol is called Two-Task Common (TTC) and is responsible for character set and data type conversion between client and server.

The session and network layers are implemented by Net8/Net9, and by SQL*Net before that.  The Net8 protocol has 2 components - Net Foundation and Protocol Support.  Protocol Support further breaks down into 2 layers - Routing/Naming/Auth and TNS (Transparent Network Substrate).

The role of TNS is to select the Oracle Protocol Adapter, which wraps one of the supported transport protocols – TCP/IP, named pipes and SDP (Sockets Direct Protocol) for InfiniBand networks.

TNS Data Packet Structure

Bytes 0-7 are the header.

Bytes 8-9 are the data flags.  0x0040 indicates a disconnect packet.  0x0000 indicates normal data.

Byte 10 determines what is in the data packet:

Type   Description
0x01   Protocol negotiation.  The client sends the server the protocol versions it accepts (e.g. 6, 5, 4, 3, 2, 1, 0).  The server responds with the common version and other information such as character set, version string and server flags.
0x02   Data type representation exchange
0x03   Two-Task Interface (TTI) function call:
         0x02 Open
         0x03 Query
         0x04 Execute
         0x05 Fetch
         0x08 Close
         0x09 Disconnect/logoff
         0x0C Autocommit ON
         0x0D Autocommit OFF
         0x0E Commit
         0x0F Rollback
         0x14 Cancel
         0x2B Describe
         0x30 Start up
         0x31 Shutdown
         0x3B Version (will be called before authentication)
         0x43 K2 Transactions
         0x47 Query
         0x4A OSQL7
         0x51 Logon (present password)
         0x52 Logon (present userid)
         0x5C OKOD
         0x5E Query
         0x60 LOB Operations
         0x62 ODNY
         0x67 Transaction end
         0x68 Transaction begin
         0x69 OCCA
         0x6D Start up
         0x73 Logon (present password – send AUTH_PASSWORD)
         0x76 Logon (present username – request AUTH_SESSKEY)
         0x77 Describe
         0x7F OOTCM
         0x8B OKPPC
0x08   Indicates OK – sent from the server
0x11   Extended TTI functions:
         0x6B Switch or detach session
         0x78 Close
         0x87 OSCID
         0x9A OKEYVAL
0x20   Used when calling external procedures with service registration
0x44   Same as 0x20
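
A minimal sketch of reading the data flags and the type byte from a raw packet, following the byte offsets above. It assumes the data flags are stored big-endian like the length field in the header, which these notes do not state explicitly. The tns_data_info type and the function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define TNS_HEADER_LEN       8
#define TNS_FLAG_DISCONNECT  0x0040

typedef struct {
    uint16_t data_flags;   /* bytes 8-9 */
    int      has_type;     /* 1 if byte 10 is present */
    uint8_t  msg_type;     /* byte 10, e.g. 0x01 for protocol negotiation */
} tns_data_info;

/* Parse the start of a Data packet.  Returns 0 on success,
 * -1 if the buffer is too short to hold the data flags. */
int tns_parse_data(const uint8_t *pkt, size_t len, tns_data_info *out)
{
    if (len < TNS_HEADER_LEN + 2)
        return -1;
    /* assumption: big-endian, like the header length field */
    out->data_flags = (uint16_t)((pkt[8] << 8) | pkt[9]);
    out->has_type = len > TNS_HEADER_LEN + 2;
    out->msg_type = out->has_type ? pkt[10] : 0;
    return 0;
}

/* 0x0040 in the data flags marks a disconnect packet. */
int tns_is_disconnect(const tns_data_info *d)
{
    return d->data_flags == TNS_FLAG_DISCONNECT;
}
```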

TNS Headers

Every TNS packet has an 8-byte header. 

Byte 0-1 (WORD) is the packet length inclusive of the header in big-endian format.

Byte 2-3 (WORD) is the packet checksum, if applicable.  By default it is not used and is filled with 0x0000.

Byte 4 contains the packet type

Type   Description
1      Connect
2      Accept
3      Ack
4      Refuse
5      Redirect
6      Data
7      NULL
9      Abort
11     Resend
12     Marker
13     Attention
14     Control

When a client connects to the Oracle server (Listener), it sends a Type-1 Connect packet specifying the service name it wishes to access.

The listener will send a Type-2 Accept packet if the service is known.  The listener may also send a Type-5 Redirect packet to the client to redirect it to another port.  The client, upon receiving the Redirect packet, will send a Connect packet to the alternate port.

Once the connection is accepted, the client will proceed to authenticate.  All authentication packets are Type-6 Data.

If the service is unknown, the Listener will send a Type-4 Refuse packet.

All queries and results are sent in Type-6 Data packets.  Once in a while, there will be a Type-12 (0x0C) Marker packet, which serves as an interrupt.  For example, the server may send a Marker packet to tell the client to stop sending data.

Byte 5 is the header flag, which is generally unused.  A 10g client may set it to 0x04.

Byte 6-7 (WORD) contains the header checksum, which is also unused by default and set to 0x0000.
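
The 8-byte header layout can be decoded with a few shifts. The sketch below follows the offsets described in this section. It assumes the two checksum words use the same big-endian order as the length field, and the tns_header type and function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint16_t length;           /* bytes 0-1, big-endian, includes the header */
    uint16_t packet_checksum;  /* bytes 2-3, 0x0000 when unused */
    uint8_t  type;             /* byte 4 */
    uint8_t  flags;            /* byte 5 */
    uint16_t header_checksum;  /* bytes 6-7, 0x0000 when unused */
} tns_header;

/* Decode a TNS header.  Returns 0 on success, -1 if the buffer is shorter
 * than 8 bytes or the length field is smaller than the header itself. */
int tns_parse_header(const uint8_t *buf, size_t len, tns_header *out)
{
    if (len < 8)
        return -1;
    out->length          = (uint16_t)((buf[0] << 8) | buf[1]);
    out->packet_checksum = (uint16_t)((buf[2] << 8) | buf[3]);  /* assumed big-endian */
    out->type            = buf[4];
    out->flags           = buf[5];
    out->header_checksum = (uint16_t)((buf[6] << 8) | buf[7]);  /* assumed big-endian */
    return out->length >= 8 ? 0 : -1;
}

/* Map a packet type to its name from the table above; NULL if not listed. */
const char *tns_type_name(uint8_t type)
{
    switch (type) {
    case 1:  return "Connect";
    case 2:  return "Accept";
    case 3:  return "Ack";
    case 4:  return "Refuse";
    case 5:  return "Redirect";
    case 6:  return "Data";
    case 7:  return "NULL";
    case 9:  return "Abort";
    case 11: return "Resend";
    case 12: return "Marker";
    case 13: return "Attention";
    case 14: return "Control";
    default: return NULL;
    }
}
```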

The Refuse packet is also sent when the login fails.  Byte 54 shows whether this is due to an invalid ID (0x02) or password (0x03).

CLASSPATH and Class Loading

CLASSPATH is used by the JVM to locate Java class files.  Locations in CLASSPATH are separated by colons (UNIX) or semicolons (Windows).  Class files are grouped into JARs, which are zip files.

When the JVM starts, it loads the bootstrap classes into memory.  Classes in the CLASSPATH are loaded when they are first referenced.  The -Xbootclasspath option allows specific bootstrap classes to be loaded before the Java core classes.

Within one class loader, the CLASSPATH is searched in the order declared.  For classes with the same name in different JARs, only the first one found is loaded and the others are ignored.  Class loaders are the exception: the JVM allows classes with the same name to be loaded by different class loaders.  By default, a JVM has a class loader for the bootstrap classes and one for the classes in the CLASSPATH.  Custom class loaders can be used to load classes with the same name.  The application must be designed for this from the outset and implemented with the Java API.  Java EE application servers are an example: they provide an isolated environment for each application.

Each JAR has no awareness of its dependencies on other JARs.  A JAR cannot be upgraded unless the JVM is restarted.

java.lang.IncompatibleClassChangeError is thrown when there is a class conflict.  This type of error can be common in enterprise applications with multiple integrated parts.

OSGi

OSGi was originally devised for the embedded device market.  Hot-pluggability, meaning a system that can alter its capabilities without a reboot, was one of the features envisaged.  In the embedded device arena, hot-pluggability was not common in 1999.  As these devices have limited processing power and memory, another feature envisaged was auto-discovery and reuse of software.  OSGi defines itself as the dynamic module system for Java.

APPC Application Suite

APPC Application Suite is a set of applications that demonstrates the distributed processing capabilities of APPN networks, and can be helpful in configuration verification and problem determination. APPC Application Suite can be used to provide support for operations such as file transfers, which are frequently performed across a network.

APPC Application Suite contains the following applications:

  1. ACOPY (APPC COPY)
  2. AFTP (APPC File Transfer Protocol)
  3. ANAME (APPC Name Server)
  4. APING (APPC Ping)
  5. AREXEC (APPC Remote EXECution)
  6. ATELL (APPC TELL)


These applications can be accessed from a server or from an AIX or Windows client.

CS/AIX

Enterprise Extender in CS/AIX is implemented simply as a new communications link. To connect two SNA applications over IP, you define an Enterprise Extender link in the same way as for any other link type such as SDLC or Ethernet. This new DLC allows you to take advantage of APPN/HPR functions in the IP environment.

In HPR the begin- and end-nodes are the only nodes responsible for this checking, flow control and, if necessary, reassembly and segmentation. This means that the CPU and memory needs of intermediate nodes are lower, but those of the end-nodes are higher. In a 1-hop (2 node) HPR environment, you get none of the advantages but you pay all the costs of HPR, which means throughput is likely to be lower than without HPR.

Automatic Network Routing (ANR) is a low-level routing mechanism that minimizes cycles and storage requirements for routing packets through intermediate nodes. ANR represents significant increases in routing speed over basic APPN. ANR provides point-to-point transport between any two endpoints in the network. Intermediate nodes are not aware of SNA sessions or RTP connections passing through the node. ANR is designed for high-performance switching, since no intermediate node storage for routing tables is required and no precommitted buffers are necessary.

To define an EE link
- define a node, including CP info and node ID (XID)
- define a port in the node - this adds the DLC and port sections to /etc/sna/sna_node.cfg
- define a link station in the port

VTAM EE

The implementation of EE in z/OS involves data transfer between the VTAM and the TCP/IP address spaces. A special connection type called IUTSAMEH is used to move data from VTAM to TCP/IP and vice versa.

Defining an EE link in VTAM requires adding information in 4 places
- TCPIP profile
- ATCSTRxx
- VTAM switched major node
- VTAM XCA major node

TCPIP profile
IPCONFIG SOURCEVIPA - makes TCP/IP add the source VIPA to all outbound datagrams.  Source VIPA is required for EE support.

A series of DEVICE, LINK and START statements is required to define a multipath channel point-to-point device (MPCPTP) for VTAM-to-TCP/IP communication.

DEVICE must use IUTSAMEH, which is a reserved TRLE (Transport Resource List Entry) for the EE connection on the host.

SNI

Two subarea networks can be interconnected through SNA network interconnection (SNI). SNI is an SNA-defined architecture that enables independent subarea networks to be interconnected through a gateway.

APPC and APPN

A reasonable comparison between APPC and APPN is the difference between a person using the telephone and the services the telephone company offers.

APPC
For example, when you want to call someone, you look up the telephone number and then enter it. Both parties identify themselves and the exchange of information begins. When the conversation is finished, both parties say good bye and hang up. This protocol, although informal, is generally accepted and makes it much easier to communicate.

APPC provides the same functions and rules, only between application programs instead of people. An application program tells APPC with whom it needs a conversation. APPC starts a conversation between the programs so they can exchange data. When all the data has been exchanged, APPC provides a way for the programs to end the conversation.

APPN
APPN provides networking functions similar to those provided by the telephone companies. After dialing a telephone number, the telephone network routes the call through trunks, switches, branches, and so on. To make the connection, the network takes into consideration what it knows about available routes and current problems. This happens without the caller understanding the details of the network. A person is able to talk on the telephone to another person no matter where they are or no matter how the call was routed.

APPN provides these functions for APPC applications and their data. It computes routes for APPC communication through the network, dynamically calculating which route is best. Like the telephone company, APPN's routing is done transparently. APPC applications cannot tell whether the communications partner in the APPN network is located in the same computer, one office away, or in another country. Similarly, if someone moves within the same city and takes their phone number, the phone network handles the change with no other user impact.

VTAM definition in zOS

The definitions are held in 2 datasets named in the VTAM JCL:
- VTAMLST defines the SNA network.  It contains node definitions, routes and hardware such as CTC connectors or OSA cards.
  - ATCSTR00 member defines the VTAM start-up parameters:
      - SSCPID and SSCPNAME define a unique ID for this VTAM.  SSCPNAME is used in the CDRM definition
      - NETID defines the network ID
  - ATCCON00 member specifies the resources to be activated.  Resources include PATH, major and minor nodes, tables and dynamic reconfiguration files.
- VTAMLIB contains binary load modules of compiled VTAM macros such as LOGMODE tables and CoS tables.

Logical Units

End users and applications access SNA networks through logical units (LUs). Because SNA is a connection-oriented protocol, prior to transferring data the respective logical units must be connected in a session.

In SNA hierarchical networks, logical units require assistance from system services control points (SSCPs), which exist in type 5 nodes, to activate a session with another logical unit.

The control point assists in establishing the session between the two LUs and does not take part in the data transfer between the two LUs.

LU types identify sets of SNA functions that support end-user communication. LU-LU sessions can exist only between logical units of the same LU type. For example, an LU type 2 can communicate only with another LU type 2; it cannot communicate with an LU type 3.

LU Type 1 - An example of the use of LU type 1 is an application program running under IMS and communicating with a 3270 printer.

LU Type 2 - This is for applications and devices that use the SNA 3270 data stream. An example of the use of LU type 2 is an application program running under IMS and communicating with an IBM 3270 display station.

LU Type 3 - This is for application programs and printers using the SNA 3270 data stream. An example of the use of LU type 3 is an application program running under CICS/VS and sending data to a 3270 printer.

LU Type 6.2 - This is for transaction programs communicating in a client/server data processing environment. The type 6.2 LU supports multiple concurrent sessions. LU 6.2 can be used for communication between two type 5 nodes, a type 5 node and a type 2.1 node, or two type 2.1 nodes.

LU-LU session initiation generally begins when the session manager in an LU (secondary LU) submits a session-initiation request to the appropriate control point. Using the specified set of session parameters (defined in a mode table), the control point builds a BIND image. The control point transmits the BIND image in a control initiate request (CINIT request) to the primary logical unit (typically the application LU). The primary logical unit (PLU) is the LU responsible for activating the session. The PLU activates the session by sending a bind session request (BIND request, also called a session-activation request) to the secondary logical unit (SLU). The SLU then returns a BIND response to the PLU. A response unit flows between the session partners and the session is started.  Note that the "server" (PLU) is the one that activates the session with the "client" (SLU).  In contrast, with TCP/IP the client always initiates the session and the server listens for such requests.

Physical Units

Physical units are components that manage and monitor resources such as attached links and adjacent link stations associated with a node. SSCPs indirectly manage these resources through physical units.

Physical units (PUs) exist in subarea and type 2.0 nodes. (In type 2.1 peripheral nodes, the control point performs the functions of a PU.)

A physical unit provides the following functions:
- Receives and acts upon requests from the system services control point (SSCP), such as activating and deactivating links to adjacent nodes
- Manages links and link stations, while accounting for the unique aspects of different link types

SSCP and Cross Domain Manager

Every z/OS system with VTAM that implements SNA is referred to as a domain, which is an area of control. Within a subarea network, a domain is that portion of the network managed by the SSCP in a T5 subarea node.

A subarea network that contains only one T5 node is a single-domain subarea network. When there are multiple T5 nodes in the network, each T5 node may control a portion of the network resources. A subarea network that contains more than one T5 node is a multiple domain subarea network.

The SSCP can also set up and take down sessions with other domains through the cross-domain resource manager (CDRM). Before applications in one domain can have cross-domain sessions with resources in another domain, a CDRM session must be established between the SSCPs of the two domains.

For a session between SSCPs to exist, VTAM must know about all cross-domain resource managers with which it can communicate. You must define to VTAM its own cross-domain resource manager and all other cross-domain resource managers in the network.

The cross-domain resource manager that represents the SSCP in your domain is called the host cross-domain resource manager. The cross-domain resource managers that represent the SSCPs in other domains are called external cross-domain resource managers.

SNA NETID

The SNA network is assigned a network identifier referred to as a NETID. All the resources in the same subarea network carry the same NETID name. In the same NETID subarea network, you can have more than one z/OS system that implements the SNA protocol.

SNA node

When a T2.0 or T2.1 node is connected directly to a T4 node, the T4 node performs a boundary function. When interconnecting nodes in different subarea networks, the T4 node performs a gateway function.

In a subarea network, every T5 and T4 node is assigned a subarea number. The subarea number has to be unique in the SNA network.

Every T5 node in a subarea network contains a control point, which in general manages the network resources. Management activities include resource activation, deactivation, and status monitoring. APPN nodes also implement a control point.

Design of SNA and TCPIP

The goals of the protocols (TCP/IP and SNA) were different. TCP/IP was developed to provide collaboration between computers and data sharing. SNA was developed for central control.

In the 1980s, TCP/IP was used extensively by scientists who wanted to share research papers and ideas stored on their campus computers with academic staff around the world. IBM designed SNA for business data processing applications.  The hierarchical topology of SNA matches the organizational structure of businesses and enterprises. The most common example is a bank where the tellers in the branch require access to the bank's central database. The same paradigm is also true for the insurance and retail industry. Also, businesses that have regional offices connected to a corporate site can implement the hierarchical network model.

Subarea SNA and APPN

Hierarchical systems are organized in the shape of pyramid, with each row of objects linked directly to objects beneath it. SNA subarea, besides implementing the model of a hierarchical system, is centrally managed from the top of the pyramid.

Network resources in SNA are managed (that is, known and operated) from a central point of control that is aware of all the activity in the network, whether a resource is operational, and the connectivity status of the resource. The resources can send reports on their status to the control point. Based on networking and organizational requirements, a hierarchical network can be divided into sub-networks, where every sub-network has a control point with its controlled resources.

We can use an airport control tower as an example to explain the centrally-managed approach. All airplanes in the control tower sphere of control (a sub-network) are controlled and report to the control tower. The control tower also "operates" the resources (airplanes and runways) by granting landing and takeoff authorization.

In a peer network, every resource is self-contained and controls its own resources. Most of the time a networking resource in a peer network is not aware of its network peers, and learns about their existence when it starts to communicate with the peer resources. We can use a Windows workstation as an example. We define only the local network of the workstation.

A national real estate franchise is a good illustration of a peer network. Every local real estate office maintains the listings in its area and is not aware of the information stored in other offices. If a customer who plans to relocate asks for service from the local office, the office will call (connect to) the office in the city the customer plans to move to and get the listings from the remote location. If the customer had not made this request, the local office would not be aware of the remote office, and would learn about the remote office only when there was a need to access data that was stored remotely.

High Performance Routing

Neither subarea networking nor APPN resolved a weakness related to the loss of an SNA session when a resource along the session route fails. Besides improving routing performance, HPR provides non-disruptive re-routing of the SNA session to an available alternate route. HPR also enables the integration of SNA into IP-based backbones.

HPR supports 2 sub-functions.  RTP (Rapid Transport Protocol) supports sophisticated functions such as non-disruptive automatic path failover, end-to-end error recovery, packet resequencing and flow control.  RTP is used for high-speed networks.

ANR (Automatic Network Routing) is a source-routing protocol.  It is designed to have low CPU and storage overhead.  It may be used for low-speed links.

Link Station

At each end of the phone line or LAN, the computer needs to keep track of a few items. What is the number that should be assigned to the next I-frame transmitted, what was the number of the last I-frame received, and has it been acknowledged yet? There are time limits to detect lost messages, and a counter to manage the window of unacknowledged frames. The 802.2 standard calls this a "connection component." In SNA, it is a Link Station.

The Link Station controls the flow of data between two network nodes. Successive I-frames may belong to the same session, or they may belong to different programs or terminals. When an I-frame is acknowledged on the LAN or SDLC line, this does not mean that the data in it is correct or has been processed. The Link Station dumps incoming data into buffers and queues them up for later processing.

An SNA link station is the hardware or software within a node that enables the node to attach to, and provide control over, a link connection.  It exchanges information and control signals with its partner link station in the adjacent node.  Link stations use data link control protocols to transmit data over a link connection.  A link connection is the physical medium over which data is transmitted.  Examples of transmission media include telephone wires, microwave beams, fiber-optic cables, and satellite circuits.  Multiple links between the same two nodes are referred to as parallel links.

A transmission group (TG) consists of 1 or more links between 2 nodes.  A multilink TG is a TG that consists of 2 or more parallel links.  One advantage of a multilink TG is that it preserves a session when a link fails, which parallel TGs do not.

Sunday, February 16, 2014

Relativity

Special relativity explains how to interpret motion between different inertial frames of reference, meaning 2 frames that move at a constant velocity relative to each other in a straight line.  No acceleration and no curves.

Special relativity says that the laws of physics are the same in each of the frames.  The speed of light is constant for each observer in the 2 frames regardless of their motion relative to the light source.  For example, suppose A is traveling in one direction, bounces a light beam off a mirror on the ceiling and catches it with a detector on the floor.  If B is traveling in the opposite direction to A, he will see the light travel diagonally to the ceiling and diagonally again to the floor.  The path traveled by the light is longer as seen by B.  As the speed of light is the same for A and B, the time taken must be longer for B (time dilation).
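
The light-clock argument above can be made quantitative with a short derivation. The symbols are introduced here for illustration and do not appear in the notes: h is the height of the ceiling above the floor, v is the relative speed between A and B, and t_A and t_B are the round-trip times each of them measures.

```latex
\begin{align*}
  % A sees the light go straight up and straight down
  t_A &= \frac{2h}{c} \\
  % B sees two diagonal legs; each leg spans half of the sideways drift v t_B
  c\,t_B &= 2\sqrt{h^2 + \left(\frac{v\,t_B}{2}\right)^2} \\
  % square both sides and solve for t_B
  t_B &= \frac{2h}{\sqrt{c^2 - v^2}} = \frac{t_A}{\sqrt{1 - v^2/c^2}}
\end{align*}
```

Since the square root is less than 1 for any nonzero v, t_B is always longer than t_A, which is the time dilation described above.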

Special relativity only works in a special case - inertial frames of reference.  General relativity explains what happens when the frames are not inertial.  It builds on what Einstein called the principle of equivalence, which states that an accelerating system is physically equivalent to a system inside a gravitational field.  If you are on a plane during take off, you feel a force pressing you into the chair.  This shows that the effect of acceleration is "equivalent" to gravity.

For example, if one drops a ball in an accelerating spaceship, he will observe the ball fall as if it were in a gravitational field.  The apparent gravitational pull matches the degree of acceleration.  If light enters through a hole in an accelerating spaceship and hits the opposite wall, the light path will appear bent because the ship moves on while the light crosses it.  This is equivalent to a gravitational force acting on the path of light (though the bend would be very small).

Saturday, February 15, 2014

Field

A field describes something that does not have a particular position but exists at every point in space.  For example, the temperature in a room.  A field theory is a set of rules describing how a field will behave.

Pierre-Simon Laplace explained gravity using the concept of a gravitational field.  The presence of an object changes the gravitational field values around it (vibration) and creates a potential difference.  Objects are attracted as they move across areas of potential difference.

Photons (light) are vibrations of the electromagnetic field.

Sunday, February 2, 2014

Solaris Caches

Three caches are system generic - old buffer cache, page cache and DNLC.  The others are file system specific.

The old buffer cache is used by the block device interface to cache disk blocks.  This cache is of fixed size.  With the introduction of the page cache, Solaris merged the buffer cache into the page cache to create the unified buffer cache.  Solaris now uses the buffer cache to store UFS inodes and metadata only.

The page cache was introduced during a virtual memory rewrite for SunOS 4 in 1985 and added to SVR4.  It caches virtual memory pages as well as memory mapped file pages.  It is more efficient than the buffer cache, which requires translating file offset to disk offset for each lookup.  The cache size is dynamic and pages are released when applications need more memory.  The page cache is used by many file systems e.g. UFS and NFS.  ZFS does not use the page cache.  Dirty pages are written by the fsflush daemon, which periodically scans the whole cache.  The page scanner (pageout daemon) also writes dirty pages to disk to free up pages.  Both are kernel threads and not processes, but they appear in process listings as PID 3 (fsflush) and PID 2 (pageout).

There are 2 kernel drivers for the page cache.  segvn maps files into process address space.  segmap caches file system read and write pages.

DNLC (Directory Name Lookup Cache) maps directory entries to vnodes.  It was developed in the early 1980s.  It improves the performance of open calls.

VFS

The VFS (Virtual File System) interface provides a common interface for all types of file system.  VFS has two interfaces - VFS which includes file system level operations (e.g. mount), and vnode which offers calls for file operations (e.g. open, close, read, write etc.)

VFS can be used as a common location to measure performance of various types of file system.

UNIX I/O

Raw I/O is issued directly to the disk driver without going through the file system.  One drawback is that the data cannot be managed by file system tools.  The benefit is that double buffering is avoided.  Applications that maintain their own cache (e.g. databases) use this technique.

Direct I/O allows an application to use a file system but bypass the cache.  It is almost as direct as Raw I/O, except that the file system is still required to map the file offset to disk offset, and I/O must be resized to match the physical on-disk layout.  Depending on the file system, not only read and write buffering but also pre-fetch may be disabled.  Direct I/O is often used by backup programs to avoid polluting the cache with data that is only used once (not reused).

Non-blocking I/O is enabled by opening a file with the O_NONBLOCK or O_NDELAY flag.  Instead of blocking, the system returns an EAGAIN error to tell the application to try again.
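A minimal sketch of the EAGAIN behaviour, written in Python because its os module maps closely onto the underlying system calls.  A pipe is used as the descriptor because O_NONBLOCK usually has no visible effect on regular disk files; the helper name read_nonblocking is made up for this example.

```python
import errno
import os

def read_nonblocking(fd, size=4096):
    """Return the bytes read from fd, or None if the read would block."""
    try:
        return os.read(fd, size)
    except BlockingIOError as exc:
        # Python reports EAGAIN / EWOULDBLOCK as BlockingIOError.
        if exc.errno in (errno.EAGAIN, errno.EWOULDBLOCK):
            return None
        raise

# The pipe stands in for a slow device.
read_fd, write_fd = os.pipe()
os.set_blocking(read_fd, False)      # sets O_NONBLOCK on the read end

first = read_nonblocking(read_fd)    # nothing written yet: would block
os.write(write_fd, b"hello")
second = read_nonblocking(read_fd)   # data is now available

os.close(read_fd)
os.close(write_fd)
```

Here first is None (the kernel answered with EAGAIN instead of blocking) and second holds the bytes written.  A real program would normally wait with select or poll rather than retry in a tight loop.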

A memory mapped file is created using the mmap() call.  The file is mapped into memory and accessed by offset instead of through READ and WRITE system calls.  This can avoid double buffering.  Use of mmap() cannot solve performance problems caused by I/O latency.  The saving in syscall overhead is insignificant compared to long I/O times.
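The following sketch shows access by offset through a mapping, using Python's mmap module as a thin wrapper over mmap().  The scratch file and its content are illustrative only.

```python
import mmap
import os
import tempfile

# Create a small scratch file to map (content is illustrative).
fd, path = tempfile.mkstemp()
os.write(fd, b"0123456789")

# Length 0 means "map the whole file"; the default mapping is shared
# and writable, so stores go back to the file.
with mmap.mmap(fd, 0) as mapped:
    middle = bytes(mapped[3:6])   # read by offset, no read() call
    mapped[0:2] = b"AB"           # write by offset, no write() call
    mapped.flush()                # ask the kernel to write dirty pages back

# Confirm through the ordinary read path that the file changed.
os.lseek(fd, 0, os.SEEK_SET)
on_disk = os.read(fd, 10)
os.close(fd)
os.remove(path)
```

The slice reads and assignments touch the mapped pages directly; the only system calls on the data path are the page faults that bring pages in.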

UNIX WRITE

A write back cache improves performance because a WRITE call returns once the data is transferred from user buffers to kernel buffers.  The kernel schedules the write to disk at a later time, asynchronously.  The downside of a write back cache is data corruption when the system crashes.  To balance performance and reliability, systems often offer a write back cache by default, plus a synchronous write option that lets an application bypass it.

Write I/O is synchronous when a file is opened with the O_SYNC option.  Some file systems have a mount option to force all WRITEs to be synchronous.  Instead of writing each I/O synchronously, an application can opt to commit prior WRITEs with a single fsync() call to improve performance.
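The two approaches can be compared side by side.  This is a sketch only: the function names, file names and records are invented for the example, and it assumes a UNIX-like system where os.O_SYNC is available.

```python
import os
import tempfile

def write_sync_each(path, records):
    """O_SYNC: every write() returns only after the data reaches stable storage."""
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC | os.O_SYNC
    fd = os.open(path, flags, 0o600)
    try:
        for record in records:
            os.write(fd, record)
    finally:
        os.close(fd)

def write_then_fsync(path, records):
    """Write back cache for each write(), then one fsync() to commit them all."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        for record in records:
            os.write(fd, record)   # returns once data is in kernel buffers
        os.fsync(fd)               # blocks until prior writes are on disk
    finally:
        os.close(fd)

records = [b"a\n", b"b\n", b"c\n"]
with tempfile.TemporaryDirectory() as tmp:
    path_sync = os.path.join(tmp, "sync_each.log")
    path_fsync = os.path.join(tmp, "fsync_once.log")
    write_sync_each(path_sync, records)
    write_then_fsync(path_fsync, records)
    with open(path_sync, "rb") as f:
        data_sync = f.read()
    with open(path_fsync, "rb") as f:
        data_fsync = f.read()
```

Both files end up with the same content.  The difference is when the application waits for the disk: once per write in the first function, once per batch in the second.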

UMA and NUMA

UMA (Uniform Memory Access) - memory is connected to the CPUs via a shared system bus.  For example, Intel Northbridge.

NUMA (Non-Uniform Memory Access) - memory is attached to each CPU via a memory bus.  CPUs are connected via an interconnect.  CPU access to its attached memory is faster than access to memory attached to another CPU.

UNIX Paging

Paging was first introduced by the Atlas Computer in 1962.  Paging with virtual memory was introduced in UNIX via BSD.  There are 2 types of paging:

File System Paging is caused by applications reading and writing pages in memory-mapped files, or in file systems that use the page cache.  This is considered good paging.

Anonymous Paging involves data private to a process - heap and stack.  It is called anonymous because there is no named file backing these pages in the file system.  Anonymous pages are moved in and out of swap devices.  Anonymous paging is considered bad paging as it hurts performance.  When an application accesses pages that have been paged out, it is blocked for I/O.  Page-in reads are always handled synchronously by the kernel.  Page-outs, on the other hand, are handled asynchronously by the kernel.  Performance is best without anonymous paging.

Demand paging maps virtual pages to physical pages in memory on demand.  Pages are allocated first, and the mapping is deferred until the page is accessed by the application.  If the access can be satisfied by a page already in memory, it is called a minor fault.  Otherwise, the needed page must be read into memory, which is called a major fault.  A page can be in one of these four states at any one time:

(1) unallocated
(2) allocated but unmapped
(3) allocated and mapped to RAM
(4) allocated and mapped to swap devices

Transition from (2) to (3) is a page fault (minor or major).  Resident Set Size = (3).  Virtual Memory Size = (2) + (3) + (4)
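The two formulas above can be checked with a toy model.  This is a sketch only: the PageState names, the 4096-byte page size and the sample page counts are assumptions made for illustration.

```python
from collections import Counter
from enum import Enum

class PageState(Enum):
    UNALLOCATED = 1         # state (1)
    ALLOCATED_UNMAPPED = 2  # state (2)
    MAPPED_RAM = 3          # state (3)
    MAPPED_SWAP = 4         # state (4)

PAGE_SIZE = 4096  # assumed page size in bytes

def memory_sizes(pages, page_size=PAGE_SIZE):
    """Return (RSS, VSZ) in bytes for an iterable of page states."""
    counts = Counter(pages)
    rss_pages = counts[PageState.MAPPED_RAM]
    vsz_pages = (counts[PageState.ALLOCATED_UNMAPPED]
                 + counts[PageState.MAPPED_RAM]
                 + counts[PageState.MAPPED_SWAP])
    return rss_pages * page_size, vsz_pages * page_size

# Sample address space: 2 unallocated, 5 unmapped, 3 in RAM, 1 on swap.
pages = ([PageState.UNALLOCATED] * 2
         + [PageState.ALLOCATED_UNMAPPED] * 5
         + [PageState.MAPPED_RAM] * 3
         + [PageState.MAPPED_SWAP] * 1)
rss, vsz = memory_sizes(pages)
```

With these counts, RSS covers the 3 pages in RAM and the virtual memory size covers 9 pages; unallocated pages count toward neither.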