Thomas Thiel, 27/03/95
Operating System Architecture
For MEMSY we choose Unix as basis for MEMSOS,
the operating system of MEMSY. Unix supplies a good development environment and many
useful tools. The multitasking/multiuser feature of Unix is included with no
additional effort.
On each processor node we use the UNIX SYSTEM V/88 Release 3 of MOTOROLA, which is
adapted to the multiprocessor architecture of the processor board. The operating system
has a peer processor architecture, meaning that there is no special designated processor,
e.g. master processor. Every processor is able to execute user code and can handle all
I/O requests by itself. The kernel is divided into two areas. One area contains code that
can be accessed in parallel by all processors, because there is either no shared data
involved or the mutual exclusion is achieved by using fine grain locks. The second area
contains all the other code that can not be accessed in parallel. This area is secured
with a single semaphore. For example, all device drivers can be found here. In SYSTEM
V/88 the usual multiprocessor concepts are implemented, such as message-passing,
shared-memory, interprocessor communication and global semaphores.
Extensions
For the implementations of the above mentioned multiprocessor concepts the assumption
has been made that all processors share a global main memory. But the operating system
is not able to deal with distributed memory such as our communication memory. Therefore
certain extensions and additions have been made to the operating system. Only little
changes have been made to the kernel itself. Standard Unix applications are runnable
on MEMSY because the system-call interface stayed intact.
Integration of the additional hardware, particularly the communication memories and
the distributed interrupt-system, was one of the first steps. One of the next steps
made was the implementation of basic mechanisms for all sorts of communication and
coordination, which depend on the shared-memory. On top of these mechanisms most of
our other extensions are built. The Unix system-call interface was extended by
additional system calls.
Support of Distributed User Programs
Application concept
Various demands on high-performance multiprocessor systems are made by the users:
- the system should be highly available and easy to use
- there should be as little interference with other user programs as possible
- the computing power should always be the maximum available
- the functionality provided should be the usual or even an enlaged one
- short start-up times for programs
- interactive user programs should be supported
In MEMSOS most of the users' needs are supported by the implementation of our
application concept. We define the set of all processes belonging to one
single user program as an application. An process, belonging to an application can
be identified by:
- A globally unique application id which all processes of an
application inherit.
- A task id which is asigned to each process (called task or
job) of an application and which is locally unique for this application and
processor node.
- A globally unique node id which identifies any node in the system.
This simple application concept makes it possible to easily control and monitor
distributed user programs. Because single applications can be distinguished from
one another, more than one application can be allowed to run in parallel on MEMSY.
Process/processor binding
The underlying operating system allows to alter the binding of processes to runqueues
and the binding of runqueues to processors. In the original preliminary code all
processors of one board share a single runqueue.
To support applications more efficiently the processor binding and the number of run
queues was altered. In our implementation there exists one system run queue for all
system processes, which is bound to one of four processors. For each application running
on a node an application run queue, local to that node, will be created. These application
run queues are handled by the remaining processors. The binding is not static and can be
changed dynamically.
Global scheduling mechanism
Additionally a new concept called gang scheduling has been implemented.
The purpose of this is to assure the concurrent execution of tasks of one application on
all nodes. This is achieved by creating a runqueue for each application on each node.
Additionally applications are devided into classes expressing their need for concurrent
scheduling. A global component, called applserv, retrieves information about
the nodes and the applications started and running and establishes a ranking of
applications which are to be scheduled.
Interrupt mechanism
As shown in the hardware section
the access time to the communication memory is higher than, for example, to the
local memory. The use of polling mechanisms on the communication memory is out of
question. They must be avoided as far as possible. Therefore an interrupt connection
between neighbouring nodes was
implemented. We use a special hardware
(interrupt subsystem),
supported by software, to generate these inter-node interrupts.
With every interrupt triggered a word is provided at a defined memory location in
the communication memory owned by the triggering node. The interrupt hardware
recognizes the port on which the interrupt occurred and the software can locate the
corresponding node number and communication memory, from which the supplied word can
be read.
The interrupt word consists of two parts. The first part (currently 8 bit) represents
the interrupt type. The second part form the data (curently 24 bit) which is to be
interpreted depending on the interrupt type.
An interface is provided by the interrupt module, so that it can be used by other
kernel modules. Kernel modules, which want to use the interrupt facility have to
reserve system-wide identical interrupt types as needed and register a callback
function for every reserved type. A single interrupt-send routine is provided
to initiate an interrupt.
Message-passing mechanism
The message-passing mechanism was implemented as one of those kernel modules using
the interrupt mechanism.
A message consists of the message header and the message body, which can hold six
words (24 bytes). For the messages a static buffer pool is allocated in the
communication memory modules which the node owns. A special buffer management is
keeping track of each buffer sent to maintain consistency of the buffer pool.
The interface of the message-passing module is constructed in the same way as the
interface of the interrupt module. One has to reserve message types and register a
callback function for each type reserved. To actually send a message a single
message-send function is provided.
If the message-send routine is called, the message-passing mechanism allocates a
buffer for the message, fills in the message header and copies the message body.
The message-passing mechanism then calls the interrupt-send function with parameters
destination node, type and index of the allocated message buffer.
Message buffers sent to an neighbouring node are not sent back immediately, but are
gathered at the receiver to reduce the interrupt rate. There are three occasions
when accumulated buffers are sent back:
- A certain amount is exceeded. The limit is a tunable parameter.
- A message is sent in the opposite direction. The accumulated buffers belonging
to the receiver are simply added to the message.
- A neighbour requests the return of the used buffers.
A simple protocol guarantees that a message is received by the destination node.
Additional protocols assure that a received message is accepted by the destination
kernel module.
Shared-memory mechanism
The shared-memory mechanism consists of two parts:
- the communication memory manager and
- the shared-memory manager
Communication Memory Manager
The communication memory manager provides the linkage to the physical shared-memory.
To allocate or free pages from the communication memory, the shared-memory mechanism
uses calls to the communication memory manager. The pages available for shared-memory
are numbered consecutively and linked by using a table containing one entry for each
page. The entries determine the number of the following page, which need not be the
one physically following. What distinguishes this memory manager from others is the
lack of automatic memory mapping or unmapping. Therefore a call to the allocate
function does not return the start address of the allocated memory, but a pointer to
a table containing the page numbers. The information about the pages is essential,
because the address space mapping may not be the same on all nodes. All tables are
situated in the communication memory itself so that they are accessible by
neighbouring nodes. Additional calls exist for calculating addresses out of page
numbers and for mapping and unmapping the allocated memory into and out of the kernel
address space.
Shared-Memory Manager
The shared-memory manager provides the functionality used for communicating with
neighbouring nodes and keeps track of allocated pages to allow their re-integration
in case of faults. On allocation of a shared-memory segment, the manager validates
provided parameters and chooses that communication memory which the destination node
has access to. If an neighbouring node wants to share an allocated memory segment,
the manager provides upon request the offset to the corresponding page table situated
in the communication memory. For the inter-node communication the message-passing
mechanism is used. On the destination node the shared-memory manager is able to locate
the page table and to map the shared segment into the kernel address space.
On top of the shared-memory manager a system-call interface is established. This
interface allows an efficient use of the shared-memory mechanism by the application
programmer.
Thomas Thiel (thiel@informatik.uni-erlangen.de)