Sunday, October 26, 2014

AIX maxclient

This tunable determines the amount of RAM used to cache non-computational client (JFS2 and NFS) pages. The value should not exceed maxperm, as client pages are a subset of all cached permanent pages.  When numclient (reported by vmstat -v) reaches maxclient, lrud steals only client pages.

AIX VMM Page Classification

Working storage pages are pages that are not preserved across a system reboot.  Examples are process data, heap, stack and shared memory.  They are also called anonymous pages.  VMM releases the working storage when a process ends.

Permanent pages are pages that have a backing store (a file) and so persist across a system reboot. There are 2 subtypes:


  • Non-client pages are those in a JFS file system.  They are also called persistent pages.
  • Client pages are pages from other file systems like NFS and JFS2.


Computational pages are working storage pages plus program text pages.  Pages from other types of files are designated as non-computational pages.  Non-computational pages are retained in the file cache for performance reasons in case they are required at a later time.

A file can start off as non-computational, but when a page fault is triggered for the file by an instruction fetch (i.e. the system is trying to fetch an instruction from the file), the file is then marked as computational.  All the pages in RAM that belong to the file are marked as computational.

The tunables minperm and maxperm control the amount of permanent storage pages cached in the system.  When the number of permanent pages (numperm) is above maxperm, lrud will steal only from the permanent pages store (file cache) to replenish the free page list.  When numperm is lower than minperm, lrud will steal from both computational and non-computational pages.

When numperm is between minperm and maxperm, AIX consults the lru_file_repage tunable.  When this tunable is active (set to 1), AIX maintains an LRU repage table to track pages (computational and non-computational) that have been paged out but then paged in again shortly afterwards.  This indicates that these pages are needed frequently and should not have been paged out.  If the computational pages have a higher repaging rate than the non-computational pages, AIX steals from non-computational pages.  Otherwise, AIX steals from both computational and non-computational pages.  When the tunable is set to 0, AIX always steals from non-computational pages, which is generally the better choice.
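
The decision rules described above can be summarised in a small toy function. This is only a sketch of the notes, not AIX code: the function name, the argument names and the "file"/"both" return values are made up for illustration, and the repage rates are assumed to be supplied by the caller.

```python
def lrud_steal_target(numperm, minperm, maxperm, lru_file_repage=0,
                      comp_repage_rate=0.0, file_repage_rate=0.0):
    """Return which class of pages lrud would steal from (toy model).

    "file" = non-computational (file cache) pages only
    "both" = computational and non-computational pages
    All names here are hypothetical; they only mirror the tunables.
    """
    if numperm > maxperm:
        return "file"              # above maxperm: file cache only
    if numperm < minperm:
        return "both"              # below minperm: steal from both
    # Between minperm and maxperm: consult lru_file_repage.
    if lru_file_repage == 0:
        return "file"
    if comp_repage_rate > file_repage_rate:
        return "file"              # computational pages repage more often
    return "both"
```

For example, with minperm=20 and maxperm=90, a numperm of 50 with lru_file_repage=0 always yields "file", which matches the behaviour the note calls the better choice.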

Saturday, October 11, 2014

Synchronous I/O in Linux

int fsync(int fd) flushes the file's data to disk synchronously.  The call flushes both the data and the metadata, such as timestamps and other attributes in the inode.

int fdatasync(int fd) flushes the data and only the metadata (e.g. file size) that is required to access the file in the future.

void sync(void) flushes all dirty buffers in the system, not just those of one fd.

Alternatively, passing O_SYNC to the open() call makes every write() on the file synchronous.  It is like forcing an fsync() call after each write, but Linux implements this more efficiently.

Specifying O_SYNC increases the CPU cost of each write and the elapsed time of the process, as the I/O wait time is included. Using fsync() and fdatasync() has comparatively less overhead, as the program can make these calls at specific logical points rather than after every I/O.
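
As a rough illustration of syncing at logical points rather than after every write, here is a minimal Python sketch using os.fdatasync and os.fsync, which wrap the system calls of the same names. The function name append_records, the sync_every parameter and the file name are invented for the example.

```python
import os
import tempfile

def append_records(path, records, sync_every=2):
    """Append records, forcing them to disk every `sync_every` writes."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        for i, rec in enumerate(records, start=1):
            os.write(fd, rec)
            if i % sync_every == 0:
                os.fdatasync(fd)   # data plus the metadata needed to read it back
        os.fsync(fd)               # final flush: data and all metadata
    finally:
        os.close(fd)

# Hypothetical journal file in a scratch directory.
path = os.path.join(tempfile.mkdtemp(), "journal.log")
append_records(path, [b"a\n", b"b\n", b"c\n"])
```

Opening the file with os.O_SYNC instead would make each os.write wait for the disk, at the cost described above.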

POSIX also defines the O_DSYNC and O_RSYNC flags for open(). Older Linux kernels defined both flags as O_SYNC; since kernel 2.6.33, O_DSYNC is implemented separately while O_RSYNC is still defined as O_SYNC.  By definition in POSIX, O_DSYNC behaves like an fdatasync() after each write.

O_RSYNC means that both read and write I/O are synchronized.  A read is already synchronous in the sense that it will not return until some data is available for the caller.  O_RSYNC also stipulates that the metadata (file access time) associated with the read must be written to disk before read() returns.  Although this behaviour does not match O_SYNC, Linux defines O_RSYNC as O_SYNC.

Linux Delayed Writeback

Note that write() returns as soon as the kernel has copied the data to a kernel buffer.  The data may not yet have been written to the disk.  Dirty buffers are batched and written out at a later time (writeback).

Delayed writeback does not affect a subsequent read(), which returns the updated data from the dirty buffers instead of the stale copy on disk.

If the system crashes, data in dirty buffers is lost.  Another problem with delayed writes is that the I/O ordering is not enforced.  For a database, this can cause data integrity problems.

Also, if an I/O error (e.g. disk failure) is encountered later when the data is written out to the disk, it may not be possible to report the error back to the originating process, which could have terminated already.  In fact, a dirty buffer may contain updated data from multiple processes.

To minimize the risk, the kernel writes out dirty buffers once they reach a maximum age, which is specified via /proc/sys/vm/dirty_expire_centisecs (in hundredths of a second).

Page writebacks are carried out by a set of kernel threads called flusher threads.  Multiple flushers work on different devices.  This fixes a deficiency of the older Linux mechanisms (pdflush and bdflush), which worked on one device at a time and spent much of their time waiting, causing a build-up of dirty pages in high-volume environments.

Linux write()

Unlike read(), a write() to a regular file writes out the whole buffer before returning unless it encounters an error.  Therefore, there is no need to code a write loop for regular files.  For other file types such as sockets, a partial write is possible and a loop is still needed.

EBADF, EFAULT, EINVAL and EIO have meanings similar to those for read().

EFBIG - the file size limit is exceeded
ENOSPC - the filesystem has run out of space
EPIPE - the reading end of the pipe that the fd is associated with has been closed.  Normally a SIGPIPE is sent to the process that attempts to write in such a situation.  If the signal is not handled, the process is terminated.  If the signal is handled, the write() system call returns this errno.
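
The EPIPE case can be reproduced with a short Python sketch. It relies on the fact that the Python interpreter ignores SIGPIPE by default, so the process survives and the write fails with EPIPE instead. The function name is made up for the example.

```python
import errno
import os

def write_to_closed_pipe():
    """Write to a pipe whose read end is closed; return the resulting errno."""
    r, w = os.pipe()
    os.close(r)                      # no reader is left on this pipe
    try:
        os.write(w, b"data")
    except OSError as e:             # raised as BrokenPipeError (errno EPIPE)
        return e.errno
    finally:
        os.close(w)
    return 0                         # not expected: the write should fail
```

A C program that leaves SIGPIPE at its default disposition would be terminated at the write instead of seeing the error code.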

Linux read()

A read call can return several ways:

If len is returned, the read is successful as expected
If a value greater than 0 and less than len is returned, the read may have been interrupted or EOF reached.  Reissue the read for the remaining bytes.
If 0 is returned, EOF is reached
If -1 is returned and errno is EINTR, the call was interrupted by a signal before any data was read.  Reissue the read
If -1 is returned and errno is EAGAIN, this is a non-blocking read and no data is currently available.  Reissue the call at a later time
If -1 is returned with errno set to a value other than EINTR and EAGAIN, an error has happened and reissuing the call will probably not succeed.

EBADF - bad fd passed
EFAULT - the buffer to hold the data is outside the process address space
EINVAL - the fd does not allow reading
EIO - IO error has occurred

Note that read can return a partial result (when the value returned is less than the len passed).  Therefore, read should be done in a loop that reissues the call under some of the conditions above.
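
A read loop along these lines might look like the following Python sketch. os.read wraps read(); the helper name read_full is invented for the example. Note that Python (3.5 and later) already retries a call interrupted by EINTR, so that case needs no explicit handling here, whereas C code has to reissue the read itself.

```python
import os

def read_full(fd, length):
    """Read up to `length` bytes, looping over partial reads until EOF."""
    chunks = []
    remaining = length
    while remaining > 0:
        try:
            chunk = os.read(fd, remaining)
        except BlockingIOError:
            # EAGAIN on a non-blocking fd: no data right now.  A real program
            # would wait and retry later; this sketch simply stops.
            break
        if not chunk:              # 0 bytes returned: EOF
            break
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)

r, w = os.pipe()
os.write(w, b"hello")
os.write(w, b"world")
os.close(w)                        # closing the writer lets the reader see EOF
data = read_full(r, 64)            # asks for 64 bytes; only 10 arrive before EOF
os.close(r)
```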

Linux creat()

Opening a file with O_WRONLY, O_CREAT and O_TRUNC is very common, and this system call does exactly that.

Sunday, October 5, 2014

Linux file ownership

The owner of a new file is the effective uid of the process.

The owning group is more complicated to determine.  With System V behaviour (the default for Linux), the owning group is the effective gid of the process.  With BSD behaviour, the owning group is the gid of the parent directory.  BSD behaviour can be enabled by a mount-time option.

Linux also uses the BSD behaviour by default if the set group ID (setgid) bit is set on the parent directory.

Fortunately, the owning group is usually not important.

Flags in Linux OPEN call

O_APPEND
Before each write, the file position is updated to point to the end of the file.  This happens even if a second program has written to the same file since the last write by the first program.  The next write by the first program will start from the new end-of-file position.

O_ASYNC
A signal (SIGIO) is generated when the file becomes readable or writable.  This flag is used for pipes, sockets or terminals and not for regular files.

O_CLOEXEC (close on exec)
Upon executing a new program (exec), the file is automatically closed.  This saves a call to fcntl() and eliminates a possible race condition.

O_CREAT
Create the file if it does not exist.  If the file exists, this flag has no effect unless O_EXCL is also used.

O_EXCL
When used with O_CREAT, the open call fails if the file already exists.  This prevents a race condition during file creation. This flag has no meaning if not used with O_CREAT.

O_DIRECT
Open the file for direct I/O (i.e. no system buffering)

O_DIRECTORY
If the file is not a directory, open will fail.  This flag is used internally by the opendir() call

O_LARGEFILE
Use 64-bit offsets for the file.  This breaks the 2 GB file size barrier imposed by 32-bit offsets.

O_NOATIME
The file's last access time is not updated when the file is read.  This is mainly a performance optimization for backup and indexing programs that need to open and inspect a large number of files constantly.

O_NOCTTY
Rarely used.  If the file refers to a terminal device, this flag indicates that the terminal will not become the process's controlling terminal, even if the process currently has none.

O_NOFOLLOW
If the file is a symbolic link, fail the open call.  For example, when opening /a/b/c, the leading path components (a and b) may be symbolic links.  Only the last component (c) must not be a symbolic link.

O_NONBLOCK
Non-blocking open call.  This flag is used only for FIFOs.

O_SYNC
Synchronous I/O - write() returns only after the data has been written to disk.  As a read is always synchronous, this flag has no effect on a file opened read-only.

O_TRUNC
If the file exists, is a regular file and is opened for writing, the file length is reset to zero.  This flag is ignored for a FIFO or terminal.  Use on other file types is undefined.  Use of this flag on a file opened with O_RDONLY is also undefined.
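
Two of the flags above, O_EXCL and O_APPEND, are easy to try out from Python, whose os.open passes the flags straight through to open(). The sketch below uses made-up helper names and a scratch file; it is an illustration, not a complete locking scheme.

```python
import os
import tempfile

def create_exclusive(path):
    """Create `path` only if it does not exist; return True on success."""
    try:
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    except FileExistsError:        # errno EEXIST: somebody created it first
        return False
    os.close(fd)
    return True

def append_line(path, line):
    """With O_APPEND, each write lands at the current end of the file."""
    fd = os.open(path, os.O_WRONLY | os.O_APPEND)
    try:
        os.write(fd, line)
    finally:
        os.close(fd)

lock = os.path.join(tempfile.mkdtemp(), "demo.lock")   # hypothetical file name
first = create_exclusive(lock)     # creates the file
second = create_exclusive(lock)    # fails: the file already exists
append_line(lock, b"one\n")
append_line(lock, b"two\n")
```

Because the existence check and the creation happen inside one open() call, two processes racing on the same path cannot both succeed.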

File Table

The Linux kernel maintains a per-process list of open files in a table called the file table.  The index into the table is the file descriptor.  Each file table entry contains information about the file, such as a pointer to the in-memory copy of the inode and file metadata such as the access mode and file position.

A child process receives a verbatim copy of its parent's file table.  Subsequent changes to the child's file table (e.g. closing a file) do not affect the parent.

errno

A return code (commonly -1) from a C library or system call indicates that the operation has failed.  The specific failure condition is reported via the variable errno

extern int errno;

perror prints the textual description of the error indicated by errno.

void perror (const char *str);


The string str is printed first, followed by a colon and the error description message.

Another function provided by C is strerror.  This function returns a pointer to the description message.  The function is not thread safe, as the message buffer returned could be modified by a subsequent strerror or perror call.  strerror_r is a thread-safe version that accepts an externally allocated buffer as an argument in which to place the error description string.

errno is only meaningful right after a call has reported failure.  For functions whose return value cannot signal an error unambiguously, errno must be set to 0 before making the call and checked afterwards.
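
The same errno values and messages are visible from Python, which reports a failed system call as an OSError carrying the errno. The sketch below uses os.strerror, a thin wrapper over strerror; the function name and the path are invented for the example.

```python
import errno
import os

def describe_open_failure(path):
    """Try to open `path`; on failure return (errno, symbolic name, message)."""
    try:
        fd = os.open(path, os.O_RDONLY)
    except OSError as e:
        return e.errno, errno.errorcode[e.errno], os.strerror(e.errno)
    os.close(fd)
    return 0, "", ""

# A path that is assumed not to exist on the system running the example.
code, name, message = describe_open_failure("/nonexistent/dir/file")
```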

Process Reparenting

The process tree is rooted at the init process.  Every process in the hierarchy has a parent except the init process.  When a parent ends before its child, the kernel reparents the child to the init process.

The init process routinely waits on its child processes to eliminate zombies.

UNIX Domain Socket

It is a form of socket used for communication within the local system.  It uses a special file defined in a filesystem.

Named Pipes

It is also called a FIFO and is a special file for interprocess communication.  A regular pipe, used to transfer data from one application to another, exists purely in memory and not on disk.  Named pipes are like regular pipes but are accessed via a file.
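
A named pipe can be created and used within a single process for demonstration purposes, as in this Python sketch (the file name is made up). In practice the two ends would be opened by different processes.

```python
import os
import stat
import tempfile

fifo = os.path.join(tempfile.mkdtemp(), "demo.fifo")   # hypothetical name
os.mkfifo(fifo, 0o600)                      # creates the special file
is_fifo = stat.S_ISFIFO(os.stat(fifo).st_mode)

# Opening the read end with O_NONBLOCK returns at once; a plain blocking
# open would wait until a writer appears.
rfd = os.open(fifo, os.O_RDONLY | os.O_NONBLOCK)
wfd = os.open(fifo, os.O_WRONLY)            # succeeds: a reader already exists
os.write(wfd, b"ping")
received = os.read(rfd, 4)                  # data comes from the kernel pipe buffer
os.close(wfd)
os.close(rfd)
os.unlink(fifo)                             # remove the name from the filesystem
```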

Linux files and directories

A file (or regular file) is a stream of bytes.  Unlike some other operating systems, Linux imposes no structure on a file.  The maximum length of a file is bound by the C type used to store the file position (or offset).  The length of a file can be changed by truncating it.  Truncation can cut the file short or make it longer.  In the latter case, zero bytes are filled in from the original EOF to the new end point.
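
Both directions of truncation can be seen with os.truncate, Python's wrapper for truncate(). This is a small sketch on a scratch file whose name is made up.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "trunc.bin")   # hypothetical file
with open(path, "wb") as f:
    f.write(b"abcdef")

os.truncate(path, 3)            # shorten: keeps only b"abc"
with open(path, "rb") as f:
    shortened = f.read()

os.truncate(path, 8)            # lengthen: the 5 new bytes read back as zeros
with open(path, "rb") as f:
    extended = f.read()
```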

A file can be opened multiple times by different processes or even by the same process.  Linux does not regulate concurrent access by different processes; it is up to the processes to coordinate among themselves.

A file is referenced by its inode (information node).  An inode is identified by a number that is unique only within a file system.  An inode contains metadata about the file such as timestamps, owner, type, length and location on the disk.  The filename is not stored in the inode.  Inodes are stored physically on disk.

Directories provide the names that users use to access files.  A directory maps file names to inode numbers.  A name-inode pair is called a link.  Links are implemented physically on disk as a table.  Conceptually, a directory is also a file.
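
The name-to-inode mapping can be observed by giving one file a second name with a hard link. The following Python sketch uses made-up file names in a scratch directory.

```python
import os
import tempfile

d = tempfile.mkdtemp()
first = os.path.join(d, "first.txt")
second = os.path.join(d, "second.txt")

with open(first, "w") as f:
    f.write("same inode")

os.link(first, second)          # add a second name pointing at the same inode

st1, st2 = os.stat(first), os.stat(second)
same_inode = st1.st_ino == st2.st_ino   # both names resolve to one inode
link_count = st1.st_nlink               # the inode now has two links
names = sorted(os.listdir(d))           # the directory holds both names
```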

When a file is referenced by pathname, the kernel walks the path one component at a time, finding the inode of each directory entry (dentry) in turn.  The kernel caches the dentries in the dentry cache to speed up future lookups.

Although a directory is conceptually a file, the Linux kernel does not allow it to be manipulated by the usual set of file operations (e.g. read, write etc.).  Directories are processed by their own set of system calls.

CICS External Authentication Module (EAM)


To enable EAM

  • Set the EAMLoad attribute to yes in the /var/cics_regions/region_name/RD/RD.stanza file. 
  • Set the EAMModule attribute to the name of the compiled EAM module, along with its path, in the /var/cics_regions/region_name/RD/RD.stanza file.


To enable the LDAP connection through EAM, set the following values in the CICS® region's environment file:

  • CICS_LDAP_HOST is used to specify the name of the host where the LDAP server is configured and running, for example: CICS_LDAP_HOST=myldap.aetna.com
  • CICS_LDAP_PORT is used to specify the port where the LDAP server is listening for client connections, for example: CICS_LDAP_PORT=4000.  If the CICS_LDAP_PORT environment variable is not specified in the region's environment file, the EAM uses 389 as the default port.


This EAM module is called whenever:

  • A user ID and password combination needs authentication 
  • A password needs changing in the external user ID and password repository 
  • A user definition that is in UD.stanza is not present for the user who is trying to log on 
  • After a successful password validation of an EAM user, EAM is called to install the user definition at CICS runtime. 


By default, CICS uses internal authentication that uses UD stanza. To use an External Authentication Manager instead of CICS, you must:


  • Install the EAM module 
  • Change the Region Definitions (RD) EAMLoad attribute to yes 
  • Use the RD EAMModule attribute to specify the EAM program path and name

When the CICS region comes up, the EAMModule that the CICS Administrator specified is loaded into each cicsas process. When a CICS user tries to log in with a user ID and password, CICS checks whether the EAM is loaded. If the EAM is loaded, it passes that user ID and password to the EAM program for authentication.

Entanglement


When 2 particles interact and become correlated (mathematically, their wavefunctions become intertwined into a single wavefunction in superposition), any change to one of the particles (e.g. collapse of the wavefunction by taking a measurement, or one of the particles encountering a double slit) has an instantaneous non-local effect on the other, no matter how far apart the 2 particles are.

Eclipse Equinox


The original plug-in architecture in Eclipse was not dynamic.  Once loaded, a plug-in stayed in memory.  The OSGi framework enables dynamic behavior.  The merger of these 2 technologies created Equinox.

The Eclipse runtime that underpins WAS is now implemented as OSGi services.  WAS also implements its components as OSGi services.  Doing so enables WAS to add and change features dynamically.

WAS as an Eclipse Application


WAS is packaged as an Eclipse plug-in (which is equivalent to an OSGi bundle).  WAS extends the extension point org.eclipse.core.runtime.applications in the Eclipse plugin.xml.  Eclipse provides startup.jar to start any Eclipse application.

For WAS startup, IBM repackages startup.jar as its own code.  The startup Java program is com.ibm.wsspi.bootstrap.WSPreLauncher in the bootstrap.jar file.

This executes the Eclipse framework and passes it the name of the Eclipse application, com.ibm.ws.bootstrap.WSLauncher (similar to org.eclipse.core.launcher.Main).  The launcher reads the plugin.xml file and finds the extension for org.eclipse.core.runtime.applications, which in this case is com.ibm.ws.runtime.eclipse.WSStartServer; this starts WAS.

- - - - -

WebSphere Garbage Collection


The mark and sweep algorithm is suited to optimizing application throughput.  The application pauses each time the GC runs.  Generational GC is good for applications that create a large number of objects, use them and destroy them within a short interval.  The young objects are kept in the nursery, where a minor GC takes place regularly.  Older objects are migrated to the old generation space, on which a mark and sweep GC is performed.  This method improves performance and reduces fragmentation.

The mark and sweep method needs to acquire exclusive access to the JVM, which means all thread activities are stopped (STW = stop the world).

In the mark phase, all live objects are marked.  All unreachable objects are considered garbage.  The process of marking all reachable objects is called tracing.  Tracing starts from the roots: thread stacks, static objects, and local and global JNI references.
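
The idea of tracing from roots can be shown with a toy object graph. This Python sketch is not how the JVM represents its heap; the dictionary-based heap, the object names and the function name are all made up for illustration.

```python
def mark(roots, references):
    """Return the set of objects reachable from `roots` (the live set).

    `references` maps each object name to the names it points to.
    """
    marked = set()
    stack = list(roots)
    while stack:
        obj = stack.pop()
        if obj in marked:
            continue               # already traced, e.g. reached via a cycle
        marked.add(obj)
        stack.extend(references.get(obj, ()))
    return marked

# Toy heap: A -> B -> C is reachable from the root A; D and E only
# reference each other, so nothing reachable points at them.
heap = {"A": ["B"], "B": ["C"], "C": [], "D": ["E"], "E": ["D"]}
live = mark(["A"], heap)
garbage = set(heap) - live
```

The D/E pair shows why tracing handles cycles: the two objects reference each other but are still unreachable from any root, so they end up as garbage.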

Parallel mark uses N-1 helper threads to trace in parallel, where N is the number of processors.  One application thread is used as the master coordinating agent.  Parallel marking is turned on by default and is controlled by the -Xgcthreads parameter.  To turn it off, set -Xgcthreads1.

Concurrent mark performs the tracing concurrently with application activity.  It asks each application thread to scan its own stack.  Tracing is done by a low-priority background thread and by an application thread whenever it does a heap lock allocation (i.e. an allocation that needs to acquire an exclusive lock on the heap to serialize access).  Concurrent mark reduces the GC pause and makes the pause time more consistent by spreading the tracing work across normal application activity.  As the application threads need to perform some of the tracing, the application runs slightly longer and throughput is slightly reduced.  Concurrent mark is controlled by the -Xgcpolicy parameter: "optthruput" disables it and "optavgpause" enables it.

When the mark phase completes, the mark bit vector identifies the location of all live objects in the heap.  One bit in the mark bit vector represents 8 bytes of the heap.  To avoid filling the free pool with many small chunks, only chunks of 512 bytes or more are reclaimed.  The minimum chunk size on 64-bit platforms is 768 bytes.  The chunks that are not reclaimed are called "dark matter"; they are recovered together with adjacent object blocks when those are freed.

Parallel bitwise sweep speeds up the sweep using the same set of helper threads used for parallel mark.  Each helper thread sweeps an area of 256KB at a time.

Concurrent sweep, like concurrent mark, reduces the average pause time.  It shares the same mark map with concurrent mark, so these 2 activities are mutually exclusive.

Oracle Shared Pool Free Space



Comparing the shared pool with the buffer cache, the buffer cache uses a single chunk size, whereas requests to the shared pool vary in size.  Therefore, the management of the buffer cache is relatively simpler.  To satisfy a request, the buffer cache just supplies the first item on the free list.  For the shared pool, the objective is to find a chunk of the appropriate size quickly.

Oracle reserves about 5% of the space from each granule (the unit of allocation that makes up each pool in the SGA).  This is the reserved pool for large objects (larger than about 4KB).  Separating the large objects from the smaller ones reduces the degree of fragmentation.  Flanking each reserved pool chunk are two 24-byte chunks called reserved stoppers.  The stoppers help to ensure that free reserved pool space is not merged with an adjacent free chunk.

In the extent dump, chunks labelled recreate and chunks labelled freeable can both be freed.  The heap manager (which manages the shared pool) issues a call to destroy a recreatable chunk when it needs space.  The call is issued to the specific SGA pool manager (e.g. the library cache manager), which actually carries out the destroy request.  Freeable chunks are linked to a recreatable chunk.  When the recreatable chunk is freed, the associated freeable chunks are freed at the same time.  Note that there is no direct call to the library cache manager to destroy a freeable chunk; only a call to destroy the recreatable chunk is available.

There are a large number of free lists for the shared pool because the sizes of space requests vary.  The first 176 lists hold chunks whose sizes increase in increments of 4 bytes, the next few in increments of 12 bytes, then the next few in increments of 64 bytes, and so on.  If Oracle needs to find space of a certain size and the best-fit list has no free chunk, Oracle looks at the list with the next bigger size.  When a free chunk is used and it is larger than the request (because there is no free chunk with an exact size match), the remaining free space may be left as part of the allocated chunk or returned to a free list of a smaller size, depending on the size difference.

The LRU list in the shared pool contains recreatable chunks only.  The LRU list is divided into 2 sub-lists: one called recurrent and the other called transient.  The recurrent list contains chunks that have been used repeatedly (hot) and the transient list contains chunks that have not been reused (cold).  When a chunk is inserted, it is placed at the head of the transient list.  When the chunk is reused, it is transferred to the head of the recurrent list.

When the free lists do not contain a chunk large enough to satisfy a request, Oracle turns to the LRU list.  It frees some chunks that are not pinned from the transient list, transfers these chunks to the free lists and checks again whether there is enough space for the request. If not, it repeats this process a few times.  If no contiguous free space of the required size is available after these attempts, Oracle raises the ORA-04031 error.
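
The two-list LRU behaviour can be modelled with a small toy class. This is only a sketch of the description above and not Oracle's implementation: the class and method names are invented, and it assumes that eviction starts from the cold (tail) end of the transient list, which the notes do not state.

```python
from collections import deque

class ChunkLRU:
    """Toy model of the recurrent/transient LRU lists (hypothetical API)."""

    def __init__(self):
        self.transient = deque()   # index 0 is the head (most recent insert)
        self.recurrent = deque()
        self.pinned = set()        # chunks that must not be freed

    def insert(self, chunk):
        """A new chunk goes to the head of the transient list."""
        self.transient.appendleft(chunk)

    def reuse(self, chunk):
        """A reused chunk moves to the head of the recurrent list."""
        for lst in (self.transient, self.recurrent):
            if chunk in lst:
                lst.remove(chunk)
                break
        self.recurrent.appendleft(chunk)

    def evict(self):
        """Free the coldest unpinned transient chunk; None if there is none."""
        for chunk in reversed(self.transient):   # assumption: start at the tail
            if chunk not in self.pinned:
                self.transient.remove(chunk)
                return chunk
        return None

lru = ChunkLRU()
for c in ("c1", "c2", "c3"):
    lru.insert(c)
lru.reuse("c1")            # c1 becomes hot and moves to the recurrent list
lru.pinned.add("c2")       # c2 is in use and cannot be freed
victim = lru.evict()       # c2 is skipped, so c3 is freed
```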

Oracle Cursors Sharing


Sys-recursive SQL statements are generated by Oracle to query the data dictionary (system catalog) for information on the objects and relations needed to interpret a SQL statement.  Oracle keeps some bootstrap objects in the shared pool (marked as fixed objects) to allow it to start query processing.

When a SQL statement is passed to Oracle, it hashes the statement text and checks whether there is an entry in the library cache with the same hash value.  If there is a match, Oracle then compares the statement with the cached one to make sure they are indeed the same.  This is called cursor authentication.

Session cursor caching keeps frequently used cursors in session memory so that there is no need to search for them in the library cache.  Session cursor caching happens after cursor authentication.

call 1 - Oracle optimizes the SQL
call 2 - cursor authentication; the cursor is picked up from the library cache
call 3 - the cursor is cached in session memory after the call completes
call 4 - the cursor in session memory is reused

If someone else is already using a particular query, it will already be optimized in the library cache.  When you use the same query, you go straight to the call 2 scenario and the cursor will be cached in your session memory after call 2.

Connection State in Socket Calls


Client
- when the socket is created, the connection is in the CLOSED state
- when connect() is called, TCP initiates the 3-way handshake and the status changes to CONNECTING
- when the 3-way handshake completes, the connection status changes to ESTABLISHED
- 2 possible errors:
  - ETIMEDOUT = TCP did not receive a response from the server for the handshake, even after retransmission
  - ECONNREFUSED = the server sent a RESET packet, which could mean that no server socket is listening on the port

Server
- when the socket is created, the connection is in the CLOSED state
- when bind() is called, the local IP address and port are filled in the socket structure.  The status is still CLOSED
- when listen() is called, the status changes to LISTENING
- when a client request comes in, a new socket structure is allocated with the local IP address (remember the request can come in on any of the server's interfaces) and the remote IP address and port.  The status changes to CONNECTING
- when the 3-way handshake completes, the status changes to ESTABLISHED
- when accept() is called, the new socket descriptor is returned to the caller
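
The sequence of calls on both sides can be walked through on the loopback interface with Python's socket module, which wraps the same socket calls. This is a minimal single-process sketch; the state names in the comments follow the notes above, and the use of port 0 (let the kernel pick a free port) is a choice made for the example.

```python
import socket

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)   # CLOSED
server.bind(("127.0.0.1", 0))        # local IP/port filled in; still CLOSED
server.listen(5)                     # LISTENING
port = server.getsockname()[1]       # the port the kernel picked

client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)   # CLOSED
client.connect(("127.0.0.1", port))  # 3-way handshake; returns once ESTABLISHED

# The kernel completed the handshake before accept() was called;
# accept() only hands back the already established socket.
conn, peer = server.accept()
client_addr = client.getsockname()   # should match the peer seen by the server

conn.close()
client.close()
server.close()
```

Note that connect() succeeds here even though accept() has not yet been called, which matches the server-side sequence in the list above.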