
SIKS of Unified Computer Science TR Index

Keywords: distributed shared memory

Entries matching at least 3 keywords found: 82


Matching 3 keywords:

Boston U CS 95-006(1K)

Title: Using Speculation to Reduce Server Load and Service Time on the WWW
Author: Azer Bestavros
Date: February 21, 1995

Abstract: Speculative service implies that a client's request for a document is serviced by sending, in addition to the document requested, a number of other documents that the server speculates will be requested by the client in the near future. This speculation is based on statistical information that the server maintains for each document it serves. The notion of speculative service is analogous to prefetching, which is used to improve cache performance in distributed/parallel shared memory systems, with the exception that servers (not clients) control when and what to prefetch. Using trace simulations based on the logs of our departmental HTTP server http://cs-www.bu.edu, we show that both server load and service time could be reduced considerably if speculative service is used. This is above and beyond what is currently achievable using client-side caching and server-side dissemination. We identify a number of parameters that could be used to fine-tune the level of speculation performed by the server based on the level of lookahead, the state of the network, the tradeoffs between bulk and individual transmission of documents, and the relative popularity of documents, among other factors.
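
The mechanism is easy to sketch. The following Python fragment (a toy illustration with invented names, not Bestavros's implementation) keeps the per-document statistics the abstract describes and bundles the most likely follow-up documents with each reply; the lookahead parameter corresponds to the level of speculation discussed above.

    from collections import Counter, defaultdict

    class SpeculativeServer:
        # Illustrative sketch only: per-document statistics drive the
        # speculation; class and method names are invented.
        def __init__(self, lookahead=2):
            self.lookahead = lookahead          # extra documents per request
            self.follow = defaultdict(Counter)  # doc -> frequency of next requests
            self.last = {}                      # client -> previous request

        def serve(self, client, doc):
            prev = self.last.get(client)
            if prev is not None:
                self.follow[prev][doc] += 1     # update stats for the previous doc
            self.last[client] = doc
            # Speculative service: bundle the documents most often requested next.
            extras = [d for d, _ in self.follow[doc].most_common(self.lookahead)]
            return [doc] + extras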

U of Warwick, CS 254(dir)
Alexander-Craig, I.D. ``A New Interpretation of the Blackboard Architecture.'' Technical Report, UWARWICK//CS-RR-254, University of Warwick, Department of Computer Science. October 1993. 19 pages.
ABSTRACT
This paper describes a new interpretation of Newell's blackboard metaphor of group problem-solving activity. The new interpretation considers the experts of the metaphor to be independent, concurrently active agents that communicate by posting information on a shared memory (the blackboard). The blackboard in our new interpretation is expanded in functionality and is implemented as a concurrently executing process. The blackboard is able to perform tasks such as agent creation, message direction and forwarding, and message censorship. We outline all of these functions and compare them with the conventional (HEARSAY-II derived) interpretation of the metaphor. We end by indicating some of the various ways in which the new interpretation can be used to construct systems that are distributed as well as parallel; as part of this, we show how the blackboard structures of the new interpretation can be recursively composed (embedded) to produce complex systems built from simple components.
BIB-VERSION: CS-TR-v2.0
ENTRY: January 20, 1995
LANGUAGE: English
RETRIEVAL: Postscript (all.ps.Z) by anonymous FTP from ftp://ftp.dcs.warwick.ac.uk/pub/reports/rr/254/

U of California Berkeley, CS csd-89-534(dir)
Boothe, Robert Francis. ``Multiprocessor Strategies for Ray-Tracing.'' UCB//CSD-89-534, 66 pages.
ABSTRACT
Ray-tracing is often suggested as a problem which is well suited for execution on multiprocessors. It is characterized by having abundant parallelism, a very small sequential part, and aggravatingly long run-times. For simple, well-behaved scenes, linear speedup is easily achievable. However, for realistic scenes, which are typically both complex and non-uniformly distributed, parallel ray-tracing is a challenging problem. This thesis evaluates and compares implementations of a sophisticated ray-tracer on both shared memory and distributed memory machines.
ENTRY: August 2, 1993
RETRIEVAL: ocr (in all.ocr); tiff (in {001-066}.tif)

U of California Berkeley, CS csd-89-525(dir)
Lucco, Steven E.; Anderson, David P. ``Tarmac: A Language System Substrate Based on Mobile Memory.'' UCB//CSD-89-525, November 1, 1989. 18 pages.
ABSTRACT
Tarmac is a language system substrate on which systems for distributed parallel programming can be built. Tarmac provides a model of shared global state called mobile memory. The basic unit of state in this model can be viewed as both 1) a block of memory that can be directly accessed by machine instructions, and 2) a logical entity with a globally unique name that may be efficiently located, copied and moved. To support higher-level synchronization models, the movement of a memory unit may optionally enable computations.

Mobile memory is more flexible than models such as distributed virtual memory, shared tuple space, or distributed objects. It avoids the limitations of fixed page size, fixed data placement policy, and type-system or language dependence. This flexibility allows Tarmac to support a wide range of parallel programming models efficiently.

ENTRY: August 19, 1993
RETRIEVAL: ocr (in all.ocr); tiff (in {001-018}.tif)
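
As a rough illustration of the mobile memory model (an invented API; the actual Tarmac substrate is considerably richer), the sketch below pairs a raw byte block with a globally unique name that a directory can locate, copy, and move, with movement optionally enabling a waiting computation:

    import uuid

    class MobileMemoryDirectory:
        def __init__(self):
            self.units = {}       # name -> (node, bytearray)
            self.on_arrival = {}  # name -> callback run when the unit moves

        def allocate(self, node, size):
            name = uuid.uuid4().hex              # globally unique name
            self.units[name] = (node, bytearray(size))
            return name

        def locate(self, name):
            return self.units[name][0]           # which node holds the unit

        def copy(self, name, node):
            _, data = self.units[name]
            return node, bytearray(data)         # the reader gets its own copy

        def move(self, name, node):
            _, data = self.units[name]
            self.units[name] = (node, data)
            if name in self.on_arrival:
                self.on_arrival[name](node)      # movement may enable computation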

U of California Berkeley, CS csd-88-463(dir)
Anderson, David P.; Tzou, Shin-Yuan. ``The DASH Local Kernel Structure.'' UCB//CSD-88-463, November 7, 1988. 35 pages.
ABSTRACT
The DASH project has designed the network communication architecture for a large, high performance distributed system, and is now building a portable operating system kernel to run on the nodes of this system. The DASH kernel supports the communication architecture by providing efficient local communication, support for user-level services, naming support, and transparent remote service access. It is designed to provide increased performance through parallelism on shared-memory multiprocessors.

This report describes some of the basic components of the DASH kernel: process scheduling, synchronization mechanisms, timers and message-passing. It also describes the ways in which these facilities are made available to user processes. The other components of the kernel, such as the virtual memory and network communication systems, are described in separate documents.

ENTRY: May 28, 1993
RETRIEVAL: ocr (in all.ocr); tiff (in {001-035}.tif)

U of California Berkeley, CS csd-88-461(dir)
Anderson, David P.; Tzou, Shin-Yuan; Graham, G. Scott. ``The DASH Virtual Memory System.'' UCB//CSD-88-461, November 8, 1988. 36 pages.
ABSTRACT
The DASH project has defined the network communication architecture for a large, high-performance distributed system. We are now designing a portable operating system kernel for the nodes of this system. The kernel is designed to run on shared-memory multiprocessors, and to exploit the performance potential of such machines.

This report describes the DASH kernel's virtual memory (VM) system. The following are key features of the VM system:

* A virtual address space is partitioned into three regions, each providing a specific function: 1) private memory, 2) read-only shared memory, and 3) interprocess communication (IPC) buffers.

* The IPC region uses VM remapping to provide data movement between virtual address spaces. Software copying is minimized.

* Tasks such as page zeroing and pageout are done by processes that can execute concurrently with other activities.

* Most of the VM system implementation is machine-independent. The interface of the machine-dependent part is designed to allow efficient implementation on a range of architectures.

ENTRY: August 4, 1993
RETRIEVAL: ocr (in all.ocr); tiff (in {001-036}.tif)

U of California Berkeley, CS csd-83-133(dir)
Powell, Michael L.; Presotto, David L. ``A Reliable Broadcast Communication Mechanism.'' UCB//CSD-83-133, 10 pages.
ABSTRACT
Publishing is a model and mechanism for crash recovery in a distributed computing environment. Published communication works for systems connected via a broadcast medium by recording messages transmitted over the network. The recovery mechanism can be completely transparent to the failed process and all processes interacting with it. Although published communication is intended for a broadcast network such as a bus, a ring, or an Ethernet, it can be used in other environments.

A recorder reliably stores all messages that are transmitted, as well as checkpoint and recovery information. When it detects a failure, the recorder may restart affected processes from checkpoints. The recorder subsequently resends to each process all messages which were sent to it since the time its checkpoint was taken, while ignoring duplicate messages sent by it.

Message-based systems without shared memory can use published communications to recover groups of processes. Simulations show that at least 5 multi-user minicomputers can be supported on a standard Ethernet using a single recorder. The prototype version implemented in DEMOS/MP demonstrates that error recovery can be transparent to user processes and can be centralized in the network.

ENTRY: July 9, 1993
RETRIEVAL: ocr (in all.ocr); tiff (in {001-010}.tif)
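
A minimal sketch of the publishing idea (invented names; nothing here reflects the DEMOS/MP code): the recorder logs every message on the broadcast medium, notes each checkpoint, and on recovery replays to a process the messages addressed to it since its checkpoint, while recognizing re-sent duplicates.

    class Recorder:
        def __init__(self):
            self.log = []      # every (sender, dest, msg) seen on the network
            self.marks = {}    # process -> log position at its last checkpoint

        def observe(self, sender, dest, msg):
            self.log.append((sender, dest, msg))

        def checkpoint(self, process):
            self.marks[process] = len(self.log)

        def replay(self, process):
            # Messages the restarted process must receive again.
            start = self.marks.get(process, 0)
            return [m for s, d, m in self.log[start:] if d == process]

        def is_duplicate(self, sender, dest, msg):
            # A recovering sender re-emits messages already in the log.
            return (sender, dest, msg) in self.log[self.marks.get(sender, 0):]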

U of California Berkeley, CS csd-87-385(dir)
Tzou, Shin-Yuan; Anderson, David P.; Graham, G. Scott. ``Efficient Local Data Movement in Shared-Memory Multiprocessor Systems.'' UCB//CSD-87-385, 30 pages.
ABSTRACT
The DASH research project is addressing the general problem of achieving high-performance network communication in large-scale distributed systems. The efficiency of moving a large amount of data between virtual address spaces (both user and kernel) on a single machine is a major component of this problem. Virtual memory (VM) remapping, as opposed to memory copying, is an attractive approach to moving data. However, remapping in shared-memory multiprocessors can be costly due to the problem of translation lookaside buffer (TLB) inconsistency.

This paper describes the design of the DASH mechanism for moving data between virtual address spaces. This design integrates interprocess communication (IPC), virtual memory, and process scheduling mechanisms. By adopting a particular choice of IPC semantics based on a protected shared memory model, we are able to eliminate many of the overheads that would otherwise arise from VM remapping in shared-memory multiprocessors. Put simply, we reduce the need for synchronous remapping and, when it is necessary, we do it efficiently.

ENTRY: June 15, 1993
RETRIEVAL: ocr (in all.ocr); tiff (in {001-030}.tif)

U of Rochester CS 94.tr547.Lazy_release_consistency_for_hardware_coherent_multiproc.ps.gz(208K)
Leonidas I. Kontothanassis, Michael L. Scott, Ricardo Bianchini. ``Lazy Release Consistency for Hardware-Coherent Multiprocessors.'' TR 547, URCSD, December 1994.

Keywords: cache coherence; lazy release consistency; scalable shared memory; programmable protocol processors; false sharing

Release consistency is a widely accepted memory model for distributed shared memory systems. It provides significant opportunities for a coherence protocol to improve performance by delaying and buffering coherence operations. Different protocol implementations exploit these opportunities to different extents. Eager release consistency represents the state of the art for hardware-coherent multiprocessors, while lazy release consistency has been shown to provide better performance for software distributed shared memory (DSM). Several of the optimizations performed by lazy protocols have the potential to improve the performance of hardware-coherent multiprocessors, but their complexity has precluded a hardware implementation. With the advent of programmable protocol processors, it may become possible to employ these optimizations after all.

We present and evaluate a lazy release-consistent protocol suitable for machines with dedicated protocol processors. This protocol admits multiple concurrent writers, sends write notices concurrently with computation, and delays invalidations until acquire operations. We also consider a lazier protocol that delays sending write notices until release operations. Our results indicate that the first protocol outperforms eager release consistency by as much as 20% across a variety of applications. The lazier protocol, on the other hand, is unable to recoup its high synchronization overhead. This represents a qualitative shift from the DSM world, where lazier protocols always yield performance improvements. We also study protocol performance under a variety of architectural settings and show that the performance gap between lazy and eager implementations of release consistency will increase on future machines. Based on our results, we conclude that machines with flexible hardware support for coherence should use protocols based on lazy release consistency, but in a less ``aggressively lazy'' form than is appropriate for DSM.
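
The behavioral difference between eager and lazy protocols can be caricatured in a few lines of Python (a schematic model with invented data structures and a write-through simplification, not the TR's hardware design): writes only queue write notices, and a processor invalidates its stale copies when it performs an acquire.

    from collections import defaultdict

    class LazyRCModel:
        def __init__(self, nprocs):
            self.memory = {}                               # backing store
            self.caches = [dict() for _ in range(nprocs)]  # per-processor caches
            self.pending = defaultdict(list)               # proc -> write notices
            self.nprocs = nprocs

        def write(self, proc, addr, value):
            self.caches[proc][addr] = value   # multiple concurrent writers allowed
            self.memory[addr] = value         # simplification: write-through
            for p in range(self.nprocs):
                if p != proc:
                    self.pending[p].append(addr)  # a notice, not an invalidation

        def acquire(self, proc):
            # The lazy part: invalidations are delayed until acquire.
            for addr in self.pending[proc]:
                self.caches[proc].pop(addr, None)
            self.pending[proc].clear()

        def read(self, proc, addr):
            if addr not in self.caches[proc]:
                self.caches[proc][addr] = self.memory.get(addr)  # cache fill
            return self.caches[proc][addr]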

U of Rochester CS 94.OS_provided_coherence_can_work_almost_as_well_as_hardware.ps.gz(51K)
Michael L. Scott, Leonidas I. Kontothanassis, Michael W. Marchetti, Alexandros Poulos. ``OS-Provided Coherence Can Work Almost as Well as Hardware.'' November 1994.

submitted for conference publication

We argue that OS-provided data coherence on non-cache-coherent NUMA multiprocessors (machines with a single, global physical address space) can perform substantially better than distributed shared memory emulations on message-passing hardware, and almost as well as fully cache-coherent multiprocessors.

U of Texas, San Antonio, High Perf Comp & Software Lab TR-94-01-03.ps.Z(119K)

X. Zhang, Y. Yan, and R. Castaneda, ``Comparative performance analysis and evaluation of hot spots on network-based shared-memory architectures.''

A preliminary version was published in Proceedings of the 1993 IEEE Symposium on Parallel and Distributed Processing, December 1993.

Revised June and December 1994.

The final revised version will appear in IEEE Transactions on Parallel and Distributed Systems.

Abstract
--------

Hot spot contention on a network-based shared-memory architecture occurs when a large number of processors try to access a globally shared variable across the network. While Multistage Interconnection Network (MIN) and Hierarchical Ring (HR) structures are two important bases on which to build large scale shared-memory multiprocessors, the different interconnection networks and cache/memory systems of the two architectures respond very differently to network bottleneck situations. In this paper, we present a comparative performance evaluation of hot spot effects on the MIN-based and HR-based shared-memory architectures. Both non-blocking MIN-based and HR-based architectures are classified, and analytical models are described for understanding network differences and for evaluating hot spot performance on both architectures. The analytical comparisons indicate that HR-based architectures have the potential to handle various contentions caused by hot spots more efficiently than MIN-based architectures. Intensive performance measurements on hot spots have been conducted on the BBN TC2000 (MIN-based) and the KSR1 (HR-based) machines. Performance experiments were also conducted on practical hot spot behavior with respect to synchronization lock algorithms. The experimental results support the analytical models, and present practical observations and an evaluation of hot spots on the two types of architectures.

INRIA (Natl Inst Res Comp & Ctl Sci, France) RR-2399.ps.gz(79K)
RR-2399.ps.gz SHAPIRO (Marc), FERREIRA (Paulo): Larchant-RDOSS: a distributed shared persistent memory and its garbage collector

U of Rochester CS tr542.Unifying_data_and_control_transformations.ps.Z(87K)
Michal Cierniak, Wei Li. ``Unifying Data and Control Transformations for Distributed Shared Memory Machines.'' TR 542, URCSD, November 1994.

Keywords: unified data and control transformation; locality optimization; distributed shared memory machines; data mapping; mapping vector; locality model; stride vector

We present a unified approach to locality optimization that employs both data and control transformations. Data transformations include changing the array layout in memory. Control transformations involve changing the execution order of programs. We have developed new techniques for compiler optimizations for distributed shared-memory machines, although the same techniques can be used for sequential machines with a memory hierarchy.

Our compiler optimizations are based on an algebraic representation of data mappings and a new data locality model. We present a pure data transformation algorithm and an algorithm unifying data and control transformations. While there has been much work on control transformations, the opportunities for data transformations have been largely neglected. In fact, data transformations have the advantage of being applicable to programs that cannot be optimized with control transformations. The unified algorithm, which performs data and control transformations simultaneously, offers improvement over optimizations obtained by applying data and control transformations separately.

The experimental results using a set of applications on a parallel machine show that the new optimizations improve performance significantly. These results are further analyzed using locality metrics with instrumentation and simulation.
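
A toy example of the two transformation families (a generic illustration, not the TR's algorithms): column sums over a row-major matrix become unit-stride either through a control transformation (loop interchange) or through a data transformation (storing the transposed layout).

    n, m = 3, 4
    A = [[float(i * m + j) for j in range(m)] for i in range(n)]  # row-major

    # Naive order: the inner loop strides down a column, across rows.
    col_sum = [0.0] * m
    for j in range(m):
        for i in range(n):
            col_sum[j] += A[i][j]

    # Control transformation (loop interchange): the inner loop now walks
    # a row, giving unit-stride access in the row-major layout.
    col_sum2 = [0.0] * m
    for i in range(n):
        for j in range(m):
            col_sum2[j] += A[i][j]

    # Data transformation: change the layout to the transpose so the
    # original loop order becomes unit-stride; useful when the loop
    # order cannot legally be changed.
    AT = [[A[i][j] for i in range(n)] for j in range(m)]
    col_sum3 = [sum(AT[j]) for j in range(m)]
    assert col_sum == col_sum2 == col_sum3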

INRIA (Natl Inst Res Comp & Ctl Sci, France) RR-2361.ps.gz(106K)
RR-2361.ps.gz HAHAD (Mounir), PRIOL (Thierry), ERHEL (Jocelyne): Irregular Loop Patterns Compilation on Distributed Shared Memory Multiprocessors

Carnegie-Mellon CS CMU-CS-91-170.ps(145K)
CMU-CS-91-170.ps Bershad, Brian N.; Zekauskas, Matthew J. ``Midway: Shared Memory Parallel Programming with Entry Consistency for Distributed Memory Multiprocessors.'' CMU-CS-91-170, September 1991.

Ohio State U CIS TR54.ps.gz(46K)
Ramachandran, Mahendra & Singhal, Mukesh. "On the Synchronization in Distributed Shared Memory Systems," 22 pp. (OSU-CISRC-10/94-TR54) Electronic report under 1994/TR54.ps.gz

U of Rochester CS 94.High_performance_software_coherence.ps.Z(320K)
Leonidas I. Kontothanassis, Michael L. Scott. ``High Performance Software Coherence for Current and Future Architectures.'' September 1994.

Shared memory provides an attractive and intuitive programming model for large-scale parallel computing, but requires a coherence mechanism to allow caching for performance while ensuring that processors do not use stale data in their computation. Implementation options range from distributed shared memory emulations on networks of workstations to tightly-coupled fully cache-coherent distributed shared memory multiprocessors. Previous work indicates that performance varies dramatically from one end of this spectrum to the other. Hardware cache coherence is fast, but also costly and time-consuming to design and implement, while DSM systems provide acceptable performance on only a limited class of applications. We claim that an intermediate hardware option---memory-mapped network interfaces that support a global physical address space, without cache coherence---can provide most of the performance benefits of fully cache-coherent hardware, at a fraction of the cost. To support this claim we present a software coherence protocol that runs on this class of machines, and use simulation to conduct a performance study. We look at both programming and architectural issues in the context of software and hardware coherence protocols. Our results suggest that software coherence on NCC-NUMA machines is a more cost-effective approach to large-scale shared-memory multiprocessing than either pure distributed shared memory or hardware cache coherence.

U of Rochester CS 94.tr535.Using_simple_page_placement_policies.ps.Z(72K)
Michael Marchetti, Leonidas Kontothanassis, Ricardo Bianchini, Michael L. Scott. ``Using Simple Page Placement Policies to Reduce the Cost of Cache Fills in Coherent Shared-Memory Systems.'' TR 535, URCSD, September 1994.

Keywords: locality; migration; replication; placement

The cost of a cache miss depends heavily on the location of the main memory that backs the missing line. For certain applications, this cost is a major factor in overall performance. We report on the utility of OS-based page placement as a mechanism to increase the frequency with which cache fills access local memory in a distributed shared memory multiprocessor. Even with the very simple policy of first-use placement, we find significant improvements over round-robin placement for many applications on both hardware and software-coherent systems. For most of our applications, dynamic placement allows 35 to 75 percent of cache fills to be performed locally, resulting in performance improvements of 20 to 40 percent.

We have also investigated the performance impact of more sophisticated policies including hardware support for page placement, dynamic page migration, and page replication. We were surprised to find no performance advantage for the more sophisticated policies; in fact, in most cases the performance of our applications suffered.
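
The first-use policy is almost trivial, which is part of the paper's point. A sketch (a hypothetical toy simulator, not the authors' experimental setup) comparing first-use against round-robin placement over a trace of (node, page) cache-fill events:

    def local_fills(accesses, nnodes, first_use=True):
        # accesses: iterable of (node, page) cache-fill events.
        home, local, rr = {}, 0, 0
        for node, page in accesses:
            if page not in home:
                home[page] = node if first_use else rr % nnodes
                rr += 1                      # round-robin ignores the toucher
            local += home[page] == node      # fill satisfied from local memory
        return local

    trace = [(p % 4, p // 2) for p in range(64)]
    print(local_fills(trace, 4, first_use=True))   # more fills end up local
    print(local_fills(trace, 4, first_use=False))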

U of Texas, San Antonio, High Perf Comp & Software Lab TR-94-01-01.ps.Z(120K)

X. Zhang, R. Castaneda, and W. E. Chan, ``Comparative performance evaluation of spin-lock synchronization on MIN-based and HR-based multiprocessors.''

The revised version, entitled ``Spin-lock synchronization on the Butterfly and KSR1,'' was published in IEEE Parallel & Distributed Technology, Vol. 2, Spring Issue, 1994, pp. 51-63.

Abstract
--------

Multistage Interconnection Network (MIN) and Hierarchical Ring (HR) structures are two important bases on which to build large scale shared-memory multiprocessors. The design and implementation of spin-lock synchronization algorithms on MIN-based and HR-based multiprocessors are complicated due to the complex structure of interconnection networks, the increased potential for contention of network and memory systems, and the performance effects of cache coherence in large shared-memory systems. The execution behavior of spin-locks is significantly different between MIN-based and HR-based architectures. We conduct an empirical study to evaluate and compare spin lock synchronization performance on these two types of multiprocessors. We present simple and efficient implementations of spin-lock algorithms on the BBN GP1000 and the BBN TC2000, both MIN-based multiprocessors, and on the KSR1, an HR-based multiprocessor. We report performance results for these machines, to provide a better understanding of how spin-lock algorithms can be carried out in a cost-effective manner on these two types of large scale shared-memory architectures.

U of Rochester CS 94.tr528.Scalability_of_atomic_primitives.ps.Z(118K)
Maged M. Michael, Michael L. Scott. ``Scalability of Atomic Primitives on Distributed Shared Memory Multiprocessors.'' TR 528, URCSD, July 1994.

Keywords: synchronization; scalability; fetch-and-Φ; compare-and-swap; load-linked; store-conditional; cache coherence

Many hardware primitives have been proposed for synchronization and atomic memory update on shared-memory multiprocessors. In this paper, we focus on general-purpose primitives that have proven popular on small-scale bus-based machines, but have yet to become widely available on large-scale, distributed-memory machines. Specifically, we propose several alternative implementations of fetch_and_Φ, compare_and_swap, and load_linked/store_conditional. We then analyze the performance of these implementations for various data sharing patterns, in both real and synthetic applications. Our results indicate that good overall performance can be obtained by implementing compare_and_swap in a multiprocessor's cache controllers, and by providing an additional instruction to load an exclusive copy of a line.
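
The relationship among these primitives is easy to show in software (an analogy only: the lock below stands in for the cache-controller atomicity the TR proposes, and the names are generic): any fetch_and_Φ operation reduces to a compare_and_swap retry loop.

    import threading

    class Word:
        def __init__(self, value=0):
            self.value = value
            self._lock = threading.Lock()  # emulates cache-controller atomicity

        def compare_and_swap(self, expected, new):
            with self._lock:
                if self.value == expected:
                    self.value = new
                    return True
                return False

    def fetch_and_phi(word, phi):
        # Retry until the atomic update applies; returns the old value.
        while True:
            old = word.value
            if word.compare_and_swap(old, phi(old)):
                return old

    w = Word()
    fetch_and_phi(w, lambda v: v + 1)   # fetch_and_add
    fetch_and_phi(w, lambda v: v | 8)   # fetch_and_or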

U of Texas, San Antonio, High Perf Comp & Software Lab TR-94-06-01.ps.Z(91K)

X. Zhang, Z. Xu, and L. Sun, ``Performance predictions on implicit communication systems.''

Published in Proceedings of the Sixth IEEE Symposium on Parallel and Distributed Processing (SPDP 94), IEEE CS Press, October 1994, pp. 560-568.

Abstract
--------
This paper presents a multiprocessor performance prediction methodology supported by experimental measurements, which predicts the execution time of large application programs on large parallel architectures based on a small set of sample data. We propose a graph model to describe application program behavior. In order to precisely abstract an architecture model for the prediction, important and implicit architecture parameters are obtained by experiments. We focus on performance predictions of application programs on multiprocessors with implicit communications. A large scientific simulation program is implemented using the shared-memory model on the KSR-1 and using the data-parallel model on the CM-5 for performance measurements and prediction validation. We show that experimental measurements provide strong support for the performance prediction on multiprocessors with implicit communications and complex memory systems.

U of Illinois at Urbana-Champaign, Cent of Supercomp R&D 1373.abs(1K) 1373.ps.gz(106K)
1373 - J. Moreira and C. Polychronopoulos - Autoscheduling in a Distributed Shared-Memory Environment

Chorus CS-TR-92-52.ps.Z(79K)
P. Amaral, C. Jacquemot, R. Lea. ``A model for persistent shared memory addressing in distributed systems.'' In: Proc. of IWOOOS'92, Paris, France, September 24-25, 1992. CS-TR-92-52.

U of Florida CIS tr93-028.ps.Z(106K) tr93-029.ps.Z(202K)
028 A Distributed, Replicated, Data-Balanced Search Structure

Theodore Johnson, Adrian Colbrook

Many concurrent dictionary data structures have been proposed, but usually in the context of shared memory multiprocessors. In this paper, we present an algorithm for a concurrent distributed B-tree that can be implemented on message passing computer systems. Our distributed B-tree (the {\em dB-tree}) replicates the interior nodes in order to improve parallelism and reduce message passing. The dB-tree stores some redundant information in its nodes to permit the use of lazy updates to maintain replica coherency. We show how the dB-tree algorithm can be used to build an efficient implementation of a highly parallel, data-balanced distributed dictionary, the {\em dE-tree}.

029 Voronoi Diagrams of Polygons: A Framework for Shape Representation

Niranjan Mayya (University of Florida), V. T. Rajan (IBM T.J. Watson Research Center)

McGill U, SCS SPDP93.ps.gz(84K)
SPDP93.ps: SSN, RGo, GRG, VKA: Analysis of Multithreaded Multiprocessors with Distributed Shared Memory.

U of Rochester CS 94.tr504.Compiler_optimizations_for_cache_locality_and_coherence.ps.Z(121K)
Wei Li. ``Compiler Optimizations for Cache Locality and Coherence.'' TR 504, URCSD, April 1994.

Keywords: memory hierarchy; distributed shared memory architectures; data locality; cache coherence; false sharing; compiler optimizations; loop transformations; non-singular transformations; loop tiling; banded matrix problems

Almost every modern processor is designed with a memory hierarchy organized into several levels, each of which is smaller, faster, and more expensive than the level below. High performance requires the effective use of the cached data, i.e. cache locality. Smart compiler transformations can relieve the programmer from hand-optimizing for the specific machine architectures.

In a multiprocessor system, data inconsistency may occur between memory and caches. For example, the memory and multiple caches may have inconsistent copies of the same cache block. This introduces the problem of cache coherence. Several cache coherence protocols have been developed to maintain data coherence for multiple processors. Since multiple variables may be located in the same block, false sharing can arise; it has been identified by many researchers as a major obstacle to high performance. Therefore, in a multiprocessor system, we need to avoid false sharing as well as exploit cache locality.

In this paper, we first develop a new data reuse model and an algorithm called height reduction to improve cache locality. The advantage of this algorithm is that it can improve band matrix programs as well as dense matrix programs. It is more accurate and general than existing techniques for improving cache locality, which were developed to optimize dense matrix programs. Then, with the height reduction algorithm, we extend loop tiling to exploit not only intra-tile data locality but also inter-tile data locality. We call the new tiling affinity tiling. Our experiments show that affinity tiling is less sensitive to the choice of the tile size. Finally, we show that the algorithm also helps to eliminate or reduce false sharing in multiprocessor systems. With the height reduction algorithm and affinity tiling, significant performance improvement (speedups from 2.5 to 10) has been observed on HP workstations and KSR1 multiprocessors.
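
As a generic illustration of tiling (plain loop tiling only; the TR's height reduction and affinity tiling go well beyond this), the nest below finishes each B-by-B tile before moving on, so the rows and columns a tile touches can remain cached across its inner iterations:

    def tiled_sum(A, B=64):
        # The reference pattern touches both A[i][j] and A[j][i]; tiling
        # keeps each tile's working set small enough to stay in cache.
        n, total = len(A), 0.0
        for ii in range(0, n, B):
            for jj in range(0, n, B):
                for i in range(ii, min(ii + B, n)):
                    for j in range(jj, min(jj + B, n)):
                        total += A[i][j] * A[j][i]
        return total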

U of Washington CS UW-CSE-93-12-05.PS.Z(73K)
UW-CSE-93-12-05.PS.Z 75K LEE Concord: Re-Thinking the Division of Labor in a Distributed Shared Memory System

U of Michigan EECS CSE-TR-206-94.ps.Z(289K)
@TECHREPORT{Tomko:Abraham:a:94,
  AUTHOR = {Karen Tomko and Santosh Abraham},
  TITLE = {Partitioning Regular Applications for Cache-Coherent Multiprocessors},
  INSTITUTION = {Michigan},
  YEAR = {1994},
  TYPE = {CSE-TR},
  NUMBER = {206-94},
  MONTH = MAR
}

Abstract:

In all massively parallel systems (MPPs), whether message-passing or shared-address space, the memory is physically distributed for scalability and the latency of accessing remote data is orders of magnitude higher than the processor cycle time. Therefore, the programmer/compiler must not only identify parallelism but also specify the distribution of data among the processor memories in order to obtain reasonable efficiency. Shared-address MPPs provide an easier paradigm for programmers than message passing systems since the communication is automatically handled by the hardware and/or operating system. However, it is just as important to optimize the communication in shared-address systems if high performance is to be achieved. Since communication is implied by the data layout and data reference pattern of the application, the data layout scheme and data access pattern must be controlled by the compiler in order to optimize communication. Machine specific parameters, such as cache size and cache line size, describing the memory hierarchy of the shared-address space machine must be used to tailor the optimization of the application to the memory hierarchy of the MPP.

This report focuses on a partitioning methodology to optimize application performance on cache-coherent multiprocessors. We give an algorithm for choosing block-cyclic partitions for scientific programs with regular data structures such as dense linear algebra applications and PDE solvers. We provide algorithms to compute the cache state on exiting a parallel region given the cache state on entry, and methods to compute the overall cache-coherency traffic and choose block-cyclic parameters that minimize it. Our approach is demonstrated on two applications. We show that the optimal partition computed by our algorithm matches the experimentally observed optimum, and we show the effect of cache line size on partition performance.

Amoeba (UCSC) spe92.ps.Z(95K)
spe92 Levelt, W.G., Kaashoek, M.F. Bal, H.E., and Tanenbaum, A.S.: "A Comparison of Two Paradigms for Distributed Shared Memory", Software--Practice and Experience, vol. 22, Nov. 1992, pp. 985-1010.

U of Florida CIS tr94-010.ps.Z(100K)
010 A Fast and Low Overhead Distributed Priority Lock

Theodore Johnson, Richard Newman-Wolfe

Distributed synchronization is necessary to coordinate the diverse activities of a distributed system. Priority synchronization is needed for real-time systems, or to improve the performance of critical tasks. We present a distributed priority lock that uses Li and Hudak's path compression methods to achieve a theoretical $O(\log n)$ messages per critical section request, where $n$ is the number of processors. In addition, our algorithm requires only $O(\log n)$ bits of storage per processor, by making use of distributed lists. We present performance results to show that the expected message complexity of the algorithm is indeed $O(\log n)$ per critical section request. The low storage and overhead requirements of the algorithm make it scalable and practical for implementation. In addition to its use in synchronization, our algorithm has applications to distributed shared virtual memory consistency with novel check-in/check-out semantics.
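
The path-compression idea the abstract builds on can be sketched as follows (a simplified, single-address-space toy with priorities omitted; names invented): each processor keeps a guess at the token holder, a request is forwarded along these guesses, and every processor on the path is repointed at the requester, which keeps forwarding chains short.

    def request_lock(probable_owner, requester):
        # probable_owner: proc -> its guess at the holder (None = holds token).
        path, cur = [], requester
        while probable_owner[cur] is not None:
            path.append(cur)
            cur = probable_owner[cur]        # forward along the guesses
        probable_owner[cur] = requester      # old holder hands over the token
        probable_owner[requester] = None     # requester is the new holder
        for p in path[1:]:
            probable_owner[p] = requester    # path compression
        return cur                           # previous holder

    owners = {0: None, 1: 0, 2: 1, 3: 2}
    request_lock(owners, 3)   # the chain 3->2->1->0 collapses toward node 3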

U of Maryland, College Park CS 3227(dir) 3227.ps.Z(70K)
C. Lee Giles, Mark W. Goudreau. ``Routing in Optical Multistage Interconnection Networks: a Neural Network Solution.'' CS-TR-3227, NEC Research Institute, Princeton, NJ; Department of Computer Science, University of Central Florida, Orlando, FL; and Institute for Advanced Computer Studies, University of Maryland, College Park, MD, February 1994.

Keywords: Recurrent Neural Networks, Multistage Interconnection Networks, Hopfield Neural Networks, Interconnection Routing, Optical Interconnects, Fault-tolerance

There has been much interest in using optics to implement computer interconnection networks. However, there has been little discussion of any routing methodologies besides those already used in electronics. In this paper, a neural network routing methodology is proposed that can generate control bits for an optical multistage interconnection network (OMIN). Though we present no optical implementation of this methodology, we illustrate its control for an optical interconnection network. These OMINs may be used as communication media for shared memory, distributed computing systems. The routing methodology makes use of an Artificial Neural Network (ANN) that functions as a parallel computer for generating the routes. The neural network routing scheme may be applied to electrical as well as optical interconnection networks. However, since the ANN can be implemented using optics, this routing approach is especially appealing for an optical computing environment. The parallel nature of the ANN computation may make this routing scheme faster than conventional routing approaches, especially for OMINs that are irregular. Furthermore, the neural network routing scheme is fault-tolerant. Results are shown for generating routes in a 16 × 16, 3-stage OMIN.
(Also cross-referenced as UMIACS-TR-94-21.)

U of Texas, San Antonio, High Perf Comp & Software Lab TR-94-02-01.ps.Z(137K)

X. Zhang, Y. Yan, and K. He, ``Latency metric: an experimental method for measuring and evaluating parallel program and architecture scalability.''

The revised version was published in Journal of Parallel and Distributed Computing, Vol. 22, No. 3, 1994, pp. 392-410.

Abstract
--------

Latency measures the delay caused by communication between processors and memory modules over the network in a parallel system. Using intensive measurements and simulation, we show that network latency forms a major obstacle to improving parallel computing performance and scalability. We present an experimental metric that uses network latency to measure and evaluate the scalability of parallel programs and architectures. This latency metric is an extension of the isoefficiency function and the isospeed metric. We give a measurement method for using this latency metric, and report the experimental results of evaluating the scalabilities of several scientific computing algorithms on the KSR-1 shared-memory architecture. Our analysis and experiments show that the latency metric is a practical method to effectively predict and evaluate scalability based on measured latencies inherent in the program and the architecture.

U of Rochester CS 94.tr485.Eager_combining_coherency_protocol.ps.Z(114K)
Ricardo Bianchini, Thomas J. LeBlanc. ``Eager Combining: A Coherency Protocol for Increasing Effective Network and Memory Bandwidth in Shared-Memory Multiprocessors.'' TR 485, URCSD, January 1994.

Keywords: non-uniform distribution of accesses; memory and network bandwidth; contention; scalable multiprocessors; coherency protocols

One common cause of poor performance in large-scale shared-memory multiprocessors is limited memory or interconnection network bandwidth. Even well-designed machines can exhibit bandwidth limitations when a program issues an excessive number of remote memory accesses or when remote accesses are distributed non-uniformly. While techniques for improving locality of reference are often successful at reducing the number of remote references, a non-uniform distribution of references may still result, which can cause contention both in the interconnection network and at remote memories.

Producer/consumer data, where one processor (the producer) writes data that many other processors (the consumers) must read, is a common sharing pattern in parallel programs that generates a non-uniform distribution of references. In this paper we quantify the performance impact of producer/consumer sharing as a function of memory and network bandwidth, and argue that the contention caused by this form of sharing can severely impact performance on large-scale machines. We then propose a new coherency protocol, called eager combining, which is designed to alleviate this contention. The protocol replicates the producer's data among multiple memory modules, thereby effectively increasing both the memory and network bandwidth of the producer, and dramatically decreasing the remote access latency of consumers. We compare eager combining to other techniques for reducing or eliminating contention, and use execution-driven simulation of parallel programs on a large-scale multiprocessor to show that eager combining can improve performance by a factor of 4 or more when used for programs with producer/consumer data on machines with hundreds of processors.
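
The bandwidth argument is easy to see numerically (a toy calculation, not the protocol itself): replicating the producer's page across k memory modules divides the per-module request load by roughly k.

    from collections import Counter

    def module_load(readers, replicas):
        # Each consumer directs its read at one replica (chosen by hashing).
        load = Counter()
        for r in readers:
            load[replicas[r % len(replicas)]] += 1
        return load

    print(module_load(range(256), [0]))             # one home: 256 requests on module 0
    print(module_load(range(256), [0, 8, 16, 24]))  # with replicas: 64 requests each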

U of Toronto, CSRI 244(dir)
*CSRI-244 HETEROGENEOUS DISTRIBUTED SHARED MEMORY
S. Zhou*, M. Stumm*, K. Li**, and D. Wortman*
*Computer Systems Research Institute, University of Toronto
**Department of Computer Science, Princeton University
September 1990

International Computer Science Inst (Berkeley) tr-94-004.ps.Z(58K)
Hermann Haertig. ``Near or Far.'' TR-94-004, International Computer Science Institute, Berkeley, CA, January 1994.

Keywords: massively parallel systems, logical shared address space, distributed memory architectures, programming languages

To program massively parallel computers efficiently, it is important to be aware of the nearness or farness of references. It can be a severe performance bug if a reference that the programmer intends to be near turns out to be far. This paper presents a simple way to express nearness and farness such that compile-time detection of such performance bugs becomes possible. It also allows compile-time determination of nearness in many cases, which can be used by compile-time optimization techniques to overlap communication with processing. The method relies on the type system of a strongly typed object-oriented language whose type rules are extended by three type coercion rules.

Inst for CS-FORTH 93.TR94.Locality_Based_Scheduling.ps.Z(124K)
Evangelos P. Markatos, Thomas J. LeBlanc. ``Locality-Based Scheduling in Shared-Memory Multiprocessors.'' Technical Report 94, ICS-FORTH, Heraklio, Crete, Greece, August 1993.
(to appear in) Current and Future Trends in Parallel and Distributed Computing, Albert Zomaya (Ed.), World Scientific Publishing
The last decade has produced enormous improvements in microprocessor performance without a corresponding improvement in memory or interconnection network performance. As a result, the relative cost of communication in shared-memory multiprocessors has increased dramatically. Although many applications could ignore the cost of communication and still achieve good performance on the previous generations of shared-memory machines, good performance on modern machines requires that communication be reduced or eliminated. One way to reduce the need for communication is to use scheduling policies that exploit knowledge of the location of data when assigning processes to processors, improving locality of reference by co-locating a process with the data it will require. This chapter presents an overview of the tradeoffs to be made in process scheduling, and evaluates locality-based scheduling techniques at the level of the operating system kernel, thread package, and parallelizing compiler.

International Computer Science Inst (Berkeley) tr-93-074.ps.Z(88K)
Shay Kutten, Rafail Ostrovsky, Boaz Patt-Shamir. ``How and When to Be Unique.'' TR-93-074, International Computer Science Institute, Berkeley, CA, November 1993.
One of the fundamental problems in distributed computing is how identical processors with identical local memory can choose unique IDs provided they can flip a coin. The variant considered in this paper is the asynchronous shared memory model (atomic registers), and the basic correctness requirement is that upon termination the processes must always have unique IDs. We study this problem from several viewpoints. On the positive side, we present the first protocol that solves the problem and terminates with probability 1. The protocol terminates in optimal O(log n) expected time, using O(n) shared memory space, where n is the number of participating processes. On the negative side, we show that no protocol can terminate with probability 1 if n is unknown, and that no finite-state protocol can terminate with probability 1 if the schedule is non-oblivious (i.e., may depend on the history of the shared variable). We also discuss the dynamic setting (where processes may join and leave the system dynamically), and give a deterministic protocol for the read-modify-write model that needs only 3 shared bits.

U of Manchester CS UMCS-93-7-2.ps.Z(518K)
David F. Snelling. ``The Design and Analysis of a Stateless Data-Flow Architecture.'' PhD thesis, July 1993.
Data-Flow computing is approximately 20 years old, and progress in that time has been steady but not dramatic. As new technologies have emerged they have been incorporated into new Data-Flow systems; however, Data-Flow has held onto many of its original tenets. One of these tenets, contrary to the Data-Flow ideal, is the use of explicit state for storing data structures, arrays in particular. By imposing a shared memory model, this explicit notion of state creates a fundamental barrier preventing traditional Data-Flow systems from being scalable. In this thesis, a dynamic, tagged token Data-Flow architecture, SDFA (Stateless Data-Flow Architecture), is presented which diverges from other systems in that it does not support an explicit notion of state. In the absence of explicit state, it is possible to exploit the advantages of Data-Flow, such as uniform load balancing, latency hiding, and good performance on small problems, within a distributed memory architecture. With the shared memory barrier removed, the SDFA system is scalable. The thesis sets out the case for a stateless model, develops an architecture based on this approach, presents a detailed, simulation-based analysis of the system, and proposes further research into stateless Data-Flow computing.

International Computer Science Inst (Berkeley) tr-93-063.ps.Z(682K)
Chu-Cheow Lim. ``A Parallel Object-Oriented System for Realizing Reusable and Efficient Data Abstractions.'' TR-93-063, International Computer Science Institute, Berkeley, CA, October 1993.
319 pages
We examine the use of an object-oriented language to make programming multiprocessors easier for the general programmer. We choose an object-oriented paradigm because we believe that its support for encapsulation and software reuse allows users who are writing general application programs to reuse class libraries designed by expert library writers. We describe the design, implementation and use of a parallel object-oriented language: parallel Sather (pSather). PSather has a shared address space independent of the underlying multiprocessor architecture, because we believe that the cooperative nature of parallel programs is most easily captured by a shared-memory-like model. To account for distributed-memory machines, pSather uses an abstract model in which processors are grouped in clusters. Associated with a cluster is a part of the address space with fast access; access to other parts of the address space is up to two orders of magnitude slower. PSather integrates both control and data-parallel constructs to support a variety of algorithmic styles. We have an implementation of pSather on the CM-5. The prototype shows that even on distributed-memory machines without hardware/operating system support for a shared address space, it is still practical and reasonably efficient to implement the shared address abstraction in the compiler/runtime. The experience also helps us understand the features of low-level libraries that are necessary for an efficient realization of a high-level language. For example, even though low message latency is crucial, the message-passing paradigm (active vs. passive, polling vs. interrupt-driven) is also important in deciding how easy and efficient the language implementation will be. We also study certain straightforward compiler optimizations. Several abstractions and applications have been written for the CM-5 using the shared-address cluster model, and we have achieved reasonable speedups. In some cases, we can further demonstrate good absolute performance for pSather programs (by getting their speedups relative to a 1-processor C program). Some of the abstractions are reused in several applications, to show how the object-oriented constructs facilitate code reuse. The work described here supports our optimism that pSather is a practical and efficient parallel object-oriented language. There are, however, still many issues that need to be explored in order to provide parallel programming environments as powerful as the ones we are accustomed to in sequential environments. In the conclusion, we summarize some of the possible future research directions.

Boston U CS 93-007(1K) 93-007-mermera-warp.ps.Z(250K)

Title: Using Warp to Control Network Contention in Mermera
Author: Abdelsalam Heddaya, Kihong Park, and Himanshu Sinha, Boston University
Date: June 1993

Abstract:

Parallel computing on a distributed system, such as a network of workstations, can saturate the communication network, leading to excessive message delays and consequently poor application performance. Current operating systems offer only partial support for flow control protocols that can help insulate application performance from extraneous traffic on the shared network. We examine empirically the consequences of integrating one such protocol, called Warp control