Thomas Thiel, 27/03/95

Operating System Architecture

For MEMSY we choose Unix as basis for MEMSOS, the operating system of MEMSY. Unix supplies a good development environment and many useful tools. The multitasking/multiuser feature of Unix is included with no additional effort.

On each processor node we use the UNIX SYSTEM V/88 Release 3 of MOTOROLA, which is adapted to the multiprocessor architecture of the processor board. The operating system has a peer processor architecture, meaning that there is no special designated processor, e.g. master processor. Every processor is able to execute user code and can handle all I/O requests by itself. The kernel is divided into two areas. One area contains code that can be accessed in parallel by all processors, because there is either no shared data involved or the mutual exclusion is achieved by using fine grain locks. The second area contains all the other code that can not be accessed in parallel. This area is secured with a single semaphore. For example, all device drivers can be found here. In SYSTEM V/88 the usual multiprocessor concepts are implemented, such as message-passing, shared-memory, interprocessor communication and global semaphores.

Extensions

For the implementations of the above mentioned multiprocessor concepts the assumption has been made that all processors share a global main memory. But the operating system is not able to deal with distributed memory such as our communication memory. Therefore certain extensions and additions have been made to the operating system. Only little changes have been made to the kernel itself. Standard Unix applications are runnable on MEMSY because the system-call interface stayed intact.

Integration of the additional hardware, particularly the communication memories and the distributed interrupt-system, was one of the first steps. One of the next steps made was the implementation of basic mechanisms for all sorts of communication and coordination, which depend on the shared-memory. On top of these mechanisms most of our other extensions are built. The Unix system-call interface was extended by additional system calls.

Support of Distributed User Programs

Application concept

Various demands on high-performance multiprocessor systems are made by the users:

the system should be highly available and easy to use
there should be as little interference with other user programs as possible
the computing power should always be the maximum available
the functionality provided should be the usual or even an enlaged one
short start-up times for programs
interactive user programs should be supported

In MEMSOS most of the users' needs are supported by the implementation of our application concept. We define the set of all processes belonging to one single user program as an application. An process, belonging to an application can be identified by:

A globally unique application id which all processes of an application inherit.
A task id which is asigned to each process (called task or job) of an application and which is locally unique for this application and processor node.
A globally unique node id which identifies any node in the system.

This simple application concept makes it possible to easily control and monitor distributed user programs. Because single applications can be distinguished from one another, more than one application can be allowed to run in parallel on MEMSY.

Process/processor binding

The underlying operating system allows to alter the binding of processes to runqueues and the binding of runqueues to processors. In the original preliminary code all processors of one board share a single runqueue. To support applications more efficiently the processor binding and the number of run queues was altered. In our implementation there exists one system run queue for all system processes, which is bound to one of four processors. For each application running on a node an application run queue, local to that node, will be created. These application run queues are handled by the remaining processors. The binding is not static and can be changed dynamically.

Global scheduling mechanism

Additionally a new concept called gang scheduling has been implemented. The purpose of this is to assure the concurrent execution of tasks of one application on all nodes. This is achieved by creating a runqueue for each application on each node. Additionally applications are devided into classes expressing their need for concurrent scheduling. A global component, called applserv, retrieves information about the nodes and the applications started and running and establishes a ranking of applications which are to be scheduled.

Interrupt mechanism

As shown in the hardware section the access time to the communication memory is higher than, for example, to the local memory. The use of polling mechanisms on the communication memory is out of question. They must be avoided as far as possible. Therefore an interrupt connection between neighbouring nodes was implemented. We use a special hardware (interrupt subsystem), supported by software, to generate these inter-node interrupts.

With every interrupt triggered a word is provided at a defined memory location in the communication memory owned by the triggering node. The interrupt hardware recognizes the port on which the interrupt occurred and the software can locate the corresponding node number and communication memory, from which the supplied word can be read.

The interrupt word consists of two parts. The first part (currently 8 bit) represents the interrupt type. The second part form the data (curently 24 bit) which is to be interpreted depending on the interrupt type.

An interface is provided by the interrupt module, so that it can be used by other kernel modules. Kernel modules, which want to use the interrupt facility have to reserve system-wide identical interrupt types as needed and register a callback function for every reserved type. A single interrupt-send routine is provided to initiate an interrupt.

Message-passing mechanism

The message-passing mechanism was implemented as one of those kernel modules using the interrupt mechanism.

A message consists of the message header and the message body, which can hold six words (24 bytes). For the messages a static buffer pool is allocated in the communication memory modules which the node owns. A special buffer management is keeping track of each buffer sent to maintain consistency of the buffer pool.

The interface of the message-passing module is constructed in the same way as the interface of the interrupt module. One has to reserve message types and register a callback function for each type reserved. To actually send a message a single message-send function is provided.

If the message-send routine is called, the message-passing mechanism allocates a buffer for the message, fills in the message header and copies the message body. The message-passing mechanism then calls the interrupt-send function with parameters destination node, type and index of the allocated message buffer.

Message buffers sent to an neighbouring node are not sent back immediately, but are gathered at the receiver to reduce the interrupt rate. There are three occasions when accumulated buffers are sent back:

A certain amount is exceeded. The limit is a tunable parameter.
A message is sent in the opposite direction. The accumulated buffers belonging to the receiver are simply added to the message.
A neighbour requests the return of the used buffers.

A simple protocol guarantees that a message is received by the destination node. Additional protocols assure that a received message is accepted by the destination kernel module.

Shared-memory mechanism

The shared-memory mechanism consists of two parts:

the communication memory manager and
the shared-memory manager

Communication Memory Manager

The communication memory manager provides the linkage to the physical shared-memory. To allocate or free pages from the communication memory, the shared-memory mechanism uses calls to the communication memory manager. The pages available for shared-memory are numbered consecutively and linked by using a table containing one entry for each page. The entries determine the number of the following page, which need not be the one physically following. What distinguishes this memory manager from others is the lack of automatic memory mapping or unmapping. Therefore a call to the allocate function does not return the start address of the allocated memory, but a pointer to a table containing the page numbers. The information about the pages is essential, because the address space mapping may not be the same on all nodes. All tables are situated in the communication memory itself so that they are accessible by neighbouring nodes. Additional calls exist for calculating addresses out of page numbers and for mapping and unmapping the allocated memory into and out of the kernel address space.

Shared-Memory Manager

The shared-memory manager provides the functionality used for communicating with neighbouring nodes and keeps track of allocated pages to allow their re-integration in case of faults. On allocation of a shared-memory segment, the manager validates provided parameters and chooses that communication memory which the destination node has access to. If an neighbouring node wants to share an allocated memory segment, the manager provides upon request the offset to the corresponding page table situated in the communication memory. For the inter-node communication the message-passing mechanism is used. On the destination node the shared-memory manager is able to locate the page table and to map the shared segment into the kernel address space.

On top of the shared-memory manager a system-call interface is established. This interface allows an efficient use of the shared-memory mechanism by the application programmer.

Thomas Thiel (thiel@informatik.uni-erlangen.de)