abstracts
2002 Abstracts: Plenary Presentations, Cluster Busters, Programming Models, Applications and Tools Track, Systems Administration Track, and Vendor Presentations Abstracts
Last updated: 11 October 2002
| Plenary Presentations |
||||||||||||||||||||||||||||||||||||||||||||||
| Plenary Session I |
||||||||||||||||||||||||||||||||||||||||||||||
| Title | Seismic & Linux: From Curiosity to Commodity | |||||||||||||||||||||||||||||||||||||||||||||
| Author(s) | Dr. Mike Turff | |||||||||||||||||||||||||||||||||||||||||||||
| Author Inst | PGS Data Processing | |||||||||||||||||||||||||||||||||||||||||||||
| Presenter | Dr. Mike Turff | |||||||||||||||||||||||||||||||||||||||||||||
| Abstract | The seismic processing business has always required massive amounts of compute capacity and has historically been an early adopter of anything providing price/performance improvements in this field. This has been exacerbated by the competitive nature of this business, the downward pressure on prices from the oil majors and the tender & bid process that decides the allocation of the available work. In this talk, I review the history of computer systems in PGS, the move to Linux clusters and the new applications that this price/performance step has enabled our industry to bring to the seismic processing market. I will talk from a business perspective about the acceptance of Linux clusters and show that, for the right applications, Linux clusters can be regarded as off-the-shelf commodities. | |||||||||||||||||||||||||||||||||||||||||||||
| |
||||||||||||||||||||||||||||||||||||||||||||||
| Plenary Session II |
||||||||||||||||||||||||||||||||||||||||||||||
| Title | Scaling Up and Scaling Down | |||||||||||||||||||||||||||||||||||||||||||||
| Author(s) | Dr. Daniel A. Reed | |||||||||||||||||||||||||||||||||||||||||||||
| Author Inst | National Center for Supercomputing Applications (NCSA) | |||||||||||||||||||||||||||||||||||||||||||||
| Presenter | Dr. Daniel A. Reed | |||||||||||||||||||||||||||||||||||||||||||||
| Abstract | The continuum of computing continues to expand, ranging from mobile, embedded (or soon implantable) devices to terascale and petascale clusters. New applications of distributed sensors (e.g., environmental or structural monitoring, health and safety) pose new software challenges. How do we manage millions or even billions of mobile devices, each with power, weight, processing, and memory constraints? At the other end of the continuum, large-scale science and engineering applications demand ever faster and larger clustered systems. How do we design, package, and support systems with hundreds of thousands of processors in a reliable way? This talk sketches some of the technical challenges and opportunities in realizing the ubiquitous infosphere of interconnected systems, highlighted by recent developments at NCSA and the U.S. TeraGrid. | |||||||||||||||||||||||||||||||||||||||||||||
| |
||||||||||||||||||||||||||||||||||||||||||||||
| Plenary Session III |
||||||||||||||||||||||||||||||||||||||||||||||
| Title | Challenges of Terascale Computing | |||||||||||||||||||||||||||||||||||||||||||||
| Author(s) | Dr. William Pulleyblank | |||||||||||||||||||||||||||||||||||||||||||||
| Author Inst | IBM Research | |||||||||||||||||||||||||||||||||||||||||||||
| Presenter | Dr. William Pulleyblank | |||||||||||||||||||||||||||||||||||||||||||||
| Abstract | We are rapidly moving into the domain of terascale computing, which includes the ASCI program and the Blue Gene project. In addition to the significant hardware problems to be solved, we will also face software issues that are at least as large. I will discuss these in the context of Blue Gene, a 180 teraflop/s supercomputer being built at IBM Research that will run Linux as its operating system. Enabling Linux for this platform has necessitated a number of innovations in order to achieve the targeted levels of performance. We also discuss issues of reliability and availability, the so-called autonomic issues. We relate this to current activity on grid computing. | |||||||||||||||||||||||||||||||||||||||||||||
| |
||||||||||||||||||||||||||||||||||||||||||||||
| Cluster Busters |
||||||||||||||||||||||||||||||||||||||||||||||
| Title | Scalability and Performance of Salinas on the Computational Plant | |||||||||||||||||||||||||||||||||||||||||||||
| Author(s) | Manoj Bhardwaj, Ron Brightwell, Garth Reese | |||||||||||||||||||||||||||||||||||||||||||||
| Author Inst | University of New Mexico, USA | |||||||||||||||||||||||||||||||||||||||||||||
| Presenter | Ron Brightwell | |||||||||||||||||||||||||||||||||||||||||||||
| Abstract | Parallel computing platforms composed
of commodity personal computers (PCs) interconnected
by gigabit network technology are a viable alternative
to traditional proprietary supercomputing platforms.
Small- and medium-sized clusters are now ubiquitous,
and larger-scale procurements, such as those made
recently by NSF for the Distributed Teraflops Facility
and by Pacific Northwest National Lab, are becoming
more prevalent. The cost effectiveness of these
platforms has allowed for larger numbers of processors
to be purchased. In spite of the continued increase in the number of processors, few real-world application results on large-scale clusters have been published. Traditional large parallel computing platforms have benefited from many years of research, development, and experience dedicated to improving their scalability and performance. PC clusters have only recently started to receive this kind of attention. In order for clusters of PCs to compete with traditional proprietary large-scale platforms, this type of knowledge and experience may be crucial as larger clusters are procured and constructed. The goal of the Computational Plant (Cplant) project at Sandia National Laboratories is to provide a large-scale, massively parallel computing resource composed of commodity-based PCs that not only meets the level of compute performance required by Sandia's key applications, but that also meets the levels of usability and reliability of past traditional large-scale parallel machines. Cplant is a continuation of research into system software for massively parallel computing on distributed-memory message-passing machines. We have transitioned our scalable system software architecture developed for large-scale massively parallel processing machines to commodity clusters of PCs. In this paper, we examine the performance and scalability of one of Sandia's key applications on Cplant. We present performance and scalability results from Salinas, a general-purpose, finite element structural dynamics code designed to be scalable on massively parallel processing machines. Salinas offers static analysis, direct implicit transient analysis, eigenvalue analysis for computing modal response, and modal superposition-based frequency response and transient response. In addition, semi-analytical derivatives of many response quantities with respect to user-selected design parameters are calculated.
It also includes an extensive library of standard one-, two-, and three-dimensional elements, nodal and element loading, and multi-point constraints. Salinas solves systems of equations using an iterative, multilevel solver, which is specifically designed to exploit massively parallel machines. Salinas uses a linear solver that was selected based on the criteria of robustness, accuracy, scalability and efficiency. Salinas employs a multilevel domain decomposition method, Finite Element Tearing and Interconnecting (FETI), which has been characterized as the most successful parallel solver for the linear systems applicable to structural mechanics. FETI is a mature solver, with some versions used in commercial finite element packages. For plates and shells, the singularity in the linear systems has been traced to the subdomain corners. To solve such linear systems, an additional coarse problem is automatically introduced that removes the corner point singularity. FETI is scalable in the sense that as the number of unknowns increases and the number of unknowns per processor remains constant, the time to solution does not increase. Further, FETI is accurate in the sense that the convergence rate does not deteriorate as iterates converge. Finally, the computational bottleneck in FETI, a sparse direct subdomain solve, is amenable to high performance solution methods. An eigensolver was selected for Salinas based on robustness, accuracy, scalability and efficiency. In this paper, we will describe in detail the hardware architecture of our 1792-node Cplant cluster. This machine is composed of single-processor Compaq Alpha workstations connected by a Myrinet gigabit network in a three-dimensional mesh topology. This machine has many unique characteristics, such as the ability to switch a large portion of the machine from any one of four network "heads" to another, allowing for greater flexibility in responding to user demand for classified versus unclassified computing.
In addition to hardware, we describe the system software environment of our cluster. Linux is used as the base operating system for all nodes in the cluster, and we have designed and developed software for the parallel runtime system, high-performance message passing, and parallel I/O. We will discuss the details of this software environment that influence the scalability and performance of applications. Finally, we describe the performance and scalability of Salinas on up to 1000 nodes of Cplant. We will also discuss some of the challenges related to scaling Salinas beyond several hundred processors and how these challenges were addressed and overcome. We will show that the performance and scalability of Salinas on Cplant are comparable to those achieved on proprietary large-scale parallel computing platforms. |
|||||||||||||||||||||||||||||||||||||||||||||
| Title | Real Time Response to Streaming Data on Linux Clusters | |||||||||||||||||||||||||||||||||||||||||||||
| Author(s) | Beth Plale, George Turner, and Akshay Sharma | |||||||||||||||||||||||||||||||||||||||||||||
| Author Inst | Indiana University | |||||||||||||||||||||||||||||||||||||||||||||
| Presenter | George Turner | |||||||||||||||||||||||||||||||||||||||||||||
| Abstract | Beowulf-style cluster computers
which have traditionally served batch jobs via
resource managers such as Portable Batch System
(PBS) are now under pressure from the scientific
community to support real time response to data
stream applications. That is, timely response to
events streamed from remote sensors or humans interactively
exploring the parameter space. Cluster management
must be agile and timely in its response to new
resource demands imposed by the occurrence of real-world
events. The work described in this paper
quantitatively evaluates two approaches to integrating
support for streaming applications into the framework
of a PBS managed cluster. One approach works cooperatively
with PBS to reallocate computational resources
to meet the needs of real time data analysis. The
second approach works outside of the PBS context
utilizing the real time scheduling capabilities
of the Linux kernel to acquire the resources necessary for its real
time analysis.
Introduction
Beowulf-style clusters have traditionally used resource managers such as PBS Pro [6] to optimize the utilization of a cluster’s resources. Today there is a growing need for these clusters to be responsive to non-deterministic, event-driven data streams. Upon recognition of an application-specific event occurring outside the realm of PBS and the cluster, the necessary CPU, memory, network, and file system resources must be directed toward a timely response. Most production Beowulf clusters use batch scheduling to ensure fairness among users and to maximize a cluster’s throughput. But these objectives are at odds with the dynamic nature of data stream processing [5, 4, 1, 2], where resource needs cannot be predicted in advance, and any projection of duration of usage can only be estimated because of the variability of real-world events such as storms, earthquakes, etc. This paper addresses the need for integrated support for streaming applications in a batch job scheduled system. Through a quantitative evaluation of startup time in response to an environmental event, we are exploring the solution space for real-time responsiveness to data streaming.
Motivation
An atmospheric scientist at IU studies atmosphere-surface exchange of trace chemicals such as CO2, water, and energy from a variety of ecosystems. These atmospheric flux measurements generate large streams of data; sensors operating at rates up to 10 Hz can generate data at rates of hundreds of MB/day/sensor. The remote sensors use wireless 802.11b to communicate with a central computer located remotely within the forest. The remote computer is connected via a T1 line to the IUB campus network. Under normal operation the data from the remote sensors are collected at the remote site and streamed to IU at a nominal rate of 1-3 Hz, where the event data are analyzed for particular meteorological conditions.
The response to the detection of a meteorological condition is an increase in the event rate and the attendant enlistment of additional CPU cycles to handle the high-fidelity analysis. This work is being carried out on the proto-AVIDD facility on the Bloomington campus of Indiana University. The cluster consists of five (5) dual Pentium-IV Xeon nodes, two (2) Itanium dual processor IA64 nodes, and a RAID-5 filesystem NFS served internally to the cluster. Proto-AVIDD runs RedHat 7.2 with cluster management tools provided by the Oscar 1.2 project. Proto-AVIDD is a testbed cluster used to study the issues of interactive and stream data analysis in a production-level Linux cluster. AVIDD (Analyzing and Visualizing Instrument Driven Data) will be a geographically distributed data analysis cluster sited at three of the IU campuses: Bloomington (IUB), Indianapolis (IUPUI), and Gary (IUN). Due to ongoing vendor negotiations, an explicit description of the AVIDD cluster is not possible at this time; however, it is expected to be in the range of 0.5–1 TFLOPS aggregate, with 6 TBytes of distributed cluster storage, and will leverage IU’s existing HPSS based massive data storage system. Local intra-cluster networking will be provided by a Myrinet fabric, with the inter-cluster networking utilizing Indiana’s 1 Tb/s I-Light infrastructure. By the time of the conference, it is anticipated that initial experiences and early results will be available from the production AVIDD cluster.
Approach
Application events streaming from remote sources must be analyzed in real time in order to achieve timely recognition of scientifically interesting meteorological conditions. Our approach is to employ a user space daemon resident on a dedicated communications node within the AVIDD cluster to receive the data stream and monitor it for interesting events. Under normal conditions, the daemon simply drops the events into an event cache implemented as a circular buffer.
When a meteorological event occurs, the daemon responds first by sending a message to the remote computer commanding it to reconfigure the sensor system for higher fidelity data acquisition, and second by enlisting additional resources within the cluster to process the higher fidelity data streams. This startup scenario is depicted in Figure 1.
Figure 1. Data stream processing response to application event of interest.
We are exploring the solution space for real-time responsiveness to data streaming through a quantitative evaluation of startup time in response to an environmental event. Our first approach is to work within the PBS domain to allocate resources and provide job startup services in response to the detection of developing real-world phenomena. The second approach explores a ’bullying’ relationship with PBS in which the real time scheduling features of the Linux kernel are utilized to preempt resources already allocated by PBS. The full paper will quantify response times under both approaches and explore interesting issues involved in each approach.
Startup Within Context of PBS
Resource managers such as PBS have a coherent view of a cluster’s state. They also have the tools necessary to stop, start, and signal jobs running under their control. It would appear to be a natural choice to use a resource manager’s preexisting infrastructure to implement cluster resource reallocation. In the final paper we will evaluate the steps involved in the startup of the high fidelity processes as well as the benefits and shortcomings of this approach.
Linux Real-time Scheduling Startup
High fidelity processes are scheduled and started up outside the PBS domain using normal Linux process control mechanisms. If these data analysis tasks are scheduled by the kernel scheduler as real time processes, they will be assured of being scheduled on the CPUs any time they are computable. Resources already used by a previous PBS job would only be consumed as needed. For example, memory previously used by the PBS job would be swapped out only as needed. In the final paper we will evaluate the steps involved in the startup of the high fidelity processes as well as the benefits and shortcomings of this approach.
Other Issues
Issues which will be touched upon but not discussed in detail include SSH, Globus Toolkit [3], MPI, process preemption, and process checkpointing.
A general discussion of the various steps involved in the data acquisition, data transmission, and data disposition will be provided as an introductory foundation.
[1] Shivnath Babu and Jennifer Widom. Continuous queries over data streams. In International Conference on Management of Data (SIGMOD), 2001.
[2] Fabián E. Bustamante and Karsten Schwan. Active I/O streams for heterogeneous high performance computing. In Parallel Computing (ParCo) 99, Delft, The Netherlands, August 1999.
[3] I. Foster and C. Kesselman. Globus: A metacomputing infrastructure toolkit. International Journal of Supercomputer Applications, 11(2):115–128, 1997.
[4] Sam Madden and Michael J. Franklin. Fjording the stream: An architecture for queries over streaming sensor data. In International Conference on Data Engineering (ICDE), 2002.
[5] Beth Plale and Karsten Schwan. dQUOB: Managing large data flows using dynamic embedded queries. In Proc. 9th IEEE Intl. High Performance Distributed Computing (HPDC), Los Alamitos, CA, August 2000. IEEE Computer Society.
[6] PBS Pro. Portable Batch System. http://www.pbspro.com/, 2002. |
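The event cache described in the Approach section can be illustrated with a minimal sketch. It assumes a fixed-capacity buffer in which the newest event overwrites the oldest; the names `EventCache`, `monitor`, and `is_interesting` are hypothetical and are not taken from the paper, which does not describe the daemon's implementation.

```python
from collections import deque


class EventCache:
    """Fixed-capacity circular buffer: the newest event overwrites the oldest."""

    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        # deque(maxlen=...) drops the oldest entry once the buffer is full,
        # which gives circular-buffer behaviour without index arithmetic.
        self._events = deque(maxlen=capacity)

    def push(self, event):
        self._events.append(event)

    def snapshot(self):
        """Return the cached events, oldest first."""
        return list(self._events)


def monitor(stream, cache, is_interesting):
    """Cache every event; return the first one that satisfies the predicate.

    Returning corresponds to the point at which a real daemon would request
    higher-fidelity acquisition and enlist more cluster resources.
    """
    for event in stream:
        cache.push(event)
        if is_interesting(event):
            return event
    return None
```

For example, `monitor(readings, EventCache(1024), lambda e: e > threshold)` would keep the most recent 1024 readings available for analysis at the moment a threshold crossing is detected; `readings` and `threshold` are placeholders.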
|||||||||||||||||||||||||||||||||||||||||||||
| Programming Models |
||||||||||||||||||||||||||||||||||||||||||||||
| Title | Remote Memory Operations on Linux Clusters Using the Global Arrays Toolkit, GPSHMEM and ARMCI | |||||||||||||||||||||||||||||||||||||||||||||
| Author(s) | Chona S. Guiang, A. Purkayastha and K. F. Milfeld | |||||||||||||||||||||||||||||||||||||||||||||
| Author Inst | Texas Advanced Computing Center, University of Texas - Austin, USA | |||||||||||||||||||||||||||||||||||||||||||||
| Presenter | Chona S. Guiang | |||||||||||||||||||||||||||||||||||||||||||||
| Abstract | Remote memory access (RMA) is a
communication model that offers different functionalities
from those provided in shared memory programming
and message passing. The use of one-sided communication
exhibits some of the advantages of shared memory
programming such as direct access to global data,
without suffering from the limited scalability
that characterizes SMP applications. Moreover,
RMA has advantages over MPI in two important respects:
1. Communications overhead: MPI send/receive routines generate two-way communications traffic (“handshaking”) to ensure accuracy and coherency in data transfers, and this traffic is often unnecessary. In some scientific applications (e.g., some finite difference and molecular dynamics codes), the communication workload involves accessing data that does not require explicit cooperation between the communicating processes (e.g., writes to non-overlapping locations in memory). Remote memory operations in ARMCI [1], Global Arrays (GA) Toolkit [2], and GPSHMEM [3] decouple the data transfer operation from handshaking and synchronization. There is no extra coordination required between processes in a one-sided put/get operation, so there is less of the enforced synchronization that is sometimes unnecessary. Additional calls to synchronization constructs may be included in the application as needed.
2. API: The distributed data view in MPI is less intuitive to use in scientific applications that require access to a global array. In contrast, the GA API provides a global data view of the underlying distributed data that does not distinguish between accessing local versus remote array data.
The Global Arrays (GA) Toolkit provides an SMP-like development environment for applications on distributed computing systems. GA implements a global address space, which makes it possible for elements in distributed arrays to be accessed as if they were in shared memory. The details of actual data distribution are hidden, but are still within user control through utility functions included in the library. MPI and GA library calls can be combined in a single program, as GA is designed to complement rather than replace existing implementations of message passing libraries.
The GA library consists of routines that implement one-sided operations, interprocess, collective array operations, and utility and diagnostic operations (e.g., for querying available memory, data locality). As the name implies, GA implements a global address space for arrays of data, not scalars. Therefore, GA one-sided communications are designed to access arrays of data (double, integer, and double complex), not arbitrary memory addresses. GPSHMEM is a portable implementation of the Cray Research Inc. SHMEM library [4], which is a one-sided communication library for Cray and SGI systems. Like GA, GPSHMEM implements a global address space for accessing distributed data, but GPSHMEM contains support for more data types. GPSHMEM operations are limited to symmetric data objects, that is, objects for which local and remote addresses have a known relationship (e.g., arrays in common blocks). The library routines provide the following functionalities: 1) noncollective RMA operations, including those for strided access, and atomic operations; 2) collective operations; 3) parallel environment query functions. In addition to the supported Cray SHMEM routines, GPSHMEM has added block strided get/put operations for 2D arrays. All RMA operations in GA and GPSHMEM use the Aggregate Remote Memory Copy Interface (ARMCI) library. ARMCI is a portable, one-sided communication library. On Linux clusters, ARMCI relies on low-level network communication layers (e.g., Elan for Quadrics, GM for Myrinet) to implement remote memory copies, while utilizing the native MPI implementation for collective communication operations. Noncontiguous data transfers are supported in ARMCI, which facilitates access to sections of multidimensional arrays in GA and 2D arrays in GPSHMEM. In general, MPI-2 RMA is more difficult to use than GA and GPSHMEM because of the greater number of steps involved.
A single one-sided operation includes setting up a ‘window’ for the group of communicating processes, moving the data, and ensuring completion of data transfer (among processes in the group) using different synchronization constructs for “active” or “passive” targets. RMA operations with passive targets are serialized, which degrades performance. Notwithstanding these caveats, MPI one-sided communication is part of the MPI-2 standard and consequently is more likely to gain support on a greater number of platforms than GA, GPSHMEM, and ARMCI. Previous work by Nieplocha et al. [5, 6] on implementing ARMCI on Linux clusters has demonstrated that the ARMCI put/get operations outperform the MPICH-GM implementation of MPI 1.1 send and receive for both strided and contiguous data access. We have conducted additional performance measurements of ARMCI put/get versus MPI 1.1 send/receive on the TACC IA32 and IA64 clusters, using a simple “halo exchange” prototype where each process exchanges contiguous and strided data with its nearest neighbors on a 1-D process grid. Results, as well as a further description of the experiments, are shown in Figures 1 and 2. For contiguous data transfers, the ARMCI put/get operations exhibit better performance than the equivalent MPI 1.1 send/receive over the entire range of message sizes considered on the IA64 system. On the IA32 cluster, the ARMCI put and get operations outperform MPI send and receive for message lengths exceeding 8KB. On both systems, ARMCI put/get perform better than MPI send/receive for transfer of strided data over the entire range of message sizes. Future work will perform the following experiments to address important issues of RMA operations:
1. Latency and bandwidth measurements over different message sizes will be conducted for put/get operations as implemented in LAM-MPI versus ARMCI.
The poor bandwidth obtained with the LAM-MPI implementation of the MPI-2 put/get operations, as shown in Figure 1, could be attributed to the fact that the most recent stable version does not take full advantage of the underlying GM layer. Our future study will use the most recent version of LAM-MPI, which has a completely rewritten version of its GM RPI (request progression interface) [7].
2. The transfer of contiguous array data in GPSHMEM and GA will be compared against the corresponding point-to-point and/or collective communication calls in MPI. These timing results will be compared with ARMCI measurements to obtain the overhead involved in the GA and GPSHMEM routines.
3. Applications characterized by dynamic communication patterns, irregular data structures, and noncontiguous data transfers are likely to benefit from the remote memory copy functionality that GA and GPSHMEM provide. Molecular dynamics and finite difference applications fit this description well, since both involve the exchange of sections of multidimensional arrays at every time step of the simulation. To test the efficiency and applicability of GA and GPSHMEM for these types of applications, we will evaluate the performance of a simple MD and finite difference code using three different implementations for the transfer of array sections, corresponding to the use of GA, GPSHMEM, and MPI point-to-point as well as collective communication functions. Experience with these applications will also provide insight into the general feasibility of using SMP-like programming interfaces on distributed-memory systems.
All computational work will be performed on TACC’s Linux clusters: a 32-node cluster of Pentium III-based dual-processor SMPs and an Itanium-based cluster of 20 dual-processor SMP nodes. Each cluster uses a Myrinet 2000 switch fabric.
Figure 1
Figure 2
References:
[1] http://www.emsl.pnl.gov:2080/docs/parsoft/armci/
[2] http://www.emsl.pnl.gov:2080/docs/global/
[3] K. Parzyszek, J. Nieplocha, and R. Kendall, “A Generalized Portable SHMEM Library for High Performance Computing,” in Proceedings of the Twelfth IASTED International Conference Parallel and Distributed Computing and Systems
[4] http://dynaweb.tacc.utexas.edu:8080/dynaweb/app_prog/004-2178-002/@Generic__BookTextView/2819;hf=0?DwebQuery=SHMEM#1
[5] J. Nieplocha, V. Tipparaju, A. Saify, D.K. Panda, “Protocols and Strategies for Optimizing Performance of Remote Memory Operations on Clusters,” in Proc. Workshop Communication Architecture for Clusters (CAC02) of IPDPS'02
[6] J. Nieplocha, J. Ju, E. Apra, “One-sided Communication on Myrinet-based SMP Clusters using the GM Message-Passing Library,” in Proc. CAC01 Workshop of IPDPS'01, San Francisco, 2001
[7] http://www.lam-mpi.org/MailArchives/lam/msg04345.php |
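The “halo exchange” pattern on a 1-D process grid, and the way a one-sided put differs from a matched send/receive pair, can be sketched in a single-process simulation. This is an illustration only: it does not use ARMCI, GA, GPSHMEM, or MPI, and the functions `make_grid`, `put`, and `halo_exchange` are hypothetical names introduced here. Each simulated process owns a buffer with one ghost cell at each end, and a put writes directly into a neighbour's buffer without any action by that neighbour.

```python
def make_grid(num_procs, local_n):
    """One buffer per simulated process: [left ghost, interior..., right ghost]."""
    grid = []
    for rank in range(num_procs):
        interior = [float(rank * local_n + i) for i in range(local_n)]
        grid.append([None] + interior + [None])
    return grid


def put(grid, target_rank, offset, values):
    """Simulated one-sided put: write into the target's buffer.

    The target takes no action, which is the point of the RMA model.
    """
    grid[target_rank][offset:offset + len(values)] = values


def halo_exchange(grid):
    """Each rank puts its boundary values into its neighbours' ghost cells."""
    num_procs = len(grid)
    for rank, buf in enumerate(grid):
        if rank > 0:
            # My first interior value becomes the left neighbour's right ghost.
            put(grid, rank - 1, len(grid[rank - 1]) - 1, [buf[1]])
        if rank < num_procs - 1:
            # My last interior value becomes the right neighbour's left ghost.
            put(grid, rank + 1, 0, [buf[-2]])
    # In a real RMA code, a barrier or fence would be needed here so that
    # every put has completed before any rank reads its ghost cells.
```

The grid is non-periodic in this sketch, so the outermost ghost cells stay unset. The closing comment marks where the explicit synchronization mentioned in the abstract would be added by the application.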
|||||||||||||||||||||||||||||||||||||||||||||
| Title | Supercomputing in Plain English: Teaching High Performance Computing to Inexperienced Programmers | |||||||||||||||||||||||||||||||||||||||||||||
| Author(s) | H. Neeman, J. Mullen, L. Lee & G.K. Newman | |||||||||||||||||||||||||||||||||||||||||||||
| Author Inst | Oklahoma University, USA and Worcester Polytechnic Institute | |||||||||||||||||||||||||||||||||||||||||||||
| Presenter | Henry Neeman | |||||||||||||||||||||||||||||||||||||||||||||
| Abstract | Although the field of High Performance
Computing (HPC) has been evolving rapidly, the
development of standardized software systems --
MPI, OpenMP, LAPACK, PETSc and others -- has in
principle made HPC accessible to a wider community.
However, relatively few physical scientists and
engineers are taking advantage of computational
science in their research, in part because of the
considerable degree of sophistication about computing
that HPC appears to require. Yet the fundamental
concepts of HPC are fairly straightforward and
have remained relatively stable over time. These
facts raise an important question: can scientists
and engineers with relatively modest computing
experience learn HPC concepts well enough to take
advantage of them in their research? To answer this question, HPC educators must address several issues. First, what are the fundamental issues and concepts in computational science? Second, what are the fundamental issues and concepts in HPC? Third, how can these ideas be expressed in a manner that is clear to a person with relatively modest computing experience? Finally, is classroom exposure sufficient, or is guidance required to assist investigators in incorporating HPC in their research codes? A helpful way to describe the fundamental issues of computational science is as a chain of abstractions: phenomenon, physics, mathematics (continuous), numerics (discrete), algorithm, implementation, port, solution, analysis and verification. In general, physics, mathematics and numerics are addressed well by existing science and engineering curricula -- though often in isolation from one another -- and therefore instruction should be provided on issues relating primarily to the later items, and on the interrelationships between all of them. For example, algorithm choice is a fundamental issue whose gravity many researchers with less computing experience do not appreciate; a common mistake is to solve a linear system by inverting the matrix of coefficients, without regard for performance, conditioning, or exploitation of the properties of the matrix. As for HPC, the literature tends to agree on the following fundamental issues: the storage hierarchy; instruction-level parallelism; high performance compilers; shared memory parallelism (e.g., OpenMP); distributed parallelism (e.g., MPI); hybrid parallelism. The pedagogical challenge is to find ways to express the basic concepts with minimal jargon and maximal intuitiveness. Although expressing these concepts in a manner appropriate for the target audience can be challenging, it does not need to be daunting. 
For example, the use of analogies and narratives to explain these concepts can capture the fundamental underlying principles without distracting the students with technical details. Once the students understand the basic principles, the details of how to implement HPC solutions are much easier to digest. However, it is probably not reasonable, in many cases, to expect novice programmers to immediately understand how to apply HPC concepts to their science. As a follow-on, regular interactions with experienced HPC users can provide the needed insight and practical advice to move research forward. We will discuss an ongoing effort to develop materials that express sophisticated scientific computing concepts in a manner accessible to a broad audience. In addition, we will examine a programmatic approach that incorporates both the use of these materials and the crucial contribution of follow-up. |
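The point about algorithm choice (solving a linear system by exploiting the properties of the matrix rather than inverting it) can be made concrete with a small sketch. The example below is not taken from the course materials; it assumes a tridiagonal system that needs no pivoting (for instance a diagonally dominant or symmetric positive definite matrix) and solves it with the Thomas algorithm in O(n) operations, without ever forming an inverse. The function name and argument layout are choices made for this illustration.

```python
def solve_tridiagonal(lower, diag, upper, rhs):
    """Solve a tridiagonal system with the Thomas algorithm in O(n) time.

    lower[i] is the entry left of the diagonal in row i (lower[0] is unused);
    upper[i] is the entry right of the diagonal in row i (upper[-1] is unused).
    No pivoting is performed, so the matrix is assumed not to need it.
    """
    n = len(diag)
    c = [0.0] * n  # modified upper diagonal
    d = [0.0] * n  # modified right-hand side

    # Forward elimination: remove the sub-diagonal one row at a time.
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom if i < n - 1 else 0.0
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom

    # Back substitution.
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x
```

Storing only the three diagonals also keeps memory use linear in n, whereas the explicit inverse of a tridiagonal matrix is in general dense.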
|||||||||||||||||||||||||||||||||||||||||||||
| Applications
and Tools Track Abstracts: |
||||||||||||||||||||||||||||||||||||||||||||||
| Applications
Session I: Applications Performance Analysis |
||||||||||||||||||||||||||||||||||||||||||||||
| Title | Scaling Behavior of Linear Solvers on Large Linux Cluster | |||||||||||||||||||||||||||||||||||||||||||||
| Author(s) | John Fettig, Wai-Yip Kwok, Faisal Saied | |||||||||||||||||||||||||||||||||||||||||||||
| Author Inst | NCSA/University of Illinois, USA | |||||||||||||||||||||||||||||||||||||||||||||
| Presenter | John Fettig | |||||||||||||||||||||||||||||||||||||||||||||
| Abstract | Large sparse linear systems of
equations arise in many computational science and
engineering disciplines. One common scenario occurs
when implicit methods are used for time-dependent
partial differential equations, and a nonlinear
system of equations has to be solved at each time
step. With a Newton-type approach, this in turn
leads to a linear system at each Newton iteration.
In such situations, the linear solver can account
for a large part of the overall simulation time.
Demonstrating the effectiveness of linear solvers
on large Linux clusters will help establish the
usefulness of these platforms for applications
in such diverse areas as CFD, astrophysics, biophysics,
earthquake engineering, and structural mechanics.
In this paper, we investigate the effectiveness of Linux clusters for state-of-the-art parallel linear solver software such as PETSc, Hypre, and other packages drawn from the research community. In particular, we study scaling to large processor counts and large system sizes (tens of millions of unknowns). The scalability of the algorithms, of the implementation, and of the cluster architecture will be investigated. We will focus on iterative Krylov subspace solvers like Conjugate Gradient, GMRES(k) and BiCGSTAB, coupled with parallel preconditioners such as block Jacobi, domain decomposition, and multigrid methods.
In our analysis of the parallel performance of these solvers on Linux clusters, we will identify which components of clusters (CPU, memory bandwidth, network latency/bandwidth) are the critical factors in the performance of the different software components of the parallel solvers, such as matrix-vector products, inner products, preconditioners, grid transfer operators, and coarse grid solvers. We will present results for Linux clusters based on first- and second-generation 64-bit Itanium processors from Intel, connected by a Myrinet switch and used primarily in message-passing mode.
The number of performance tools available on Linux clusters is increasing. In our performance analysis of parallel linear solvers, we will use these tools to give a more detailed performance profile of the solvers. In particular, tools that access hardware performance counters will be used to study the impact of the memory hierarchy on performance, and MPI analysis tools will be used to quantify the message-passing performance of the network for sparse linear solvers. |
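To make the solver components named in this abstract concrete, here is a minimal serial sketch of a preconditioned Conjugate Gradient iteration. It is an illustration only: it does not use PETSc or Hypre, the function names are hypothetical, and a simple point Jacobi (diagonal) preconditioner stands in for the block Jacobi preconditioner mentioned above. The comments mark the matrix-vector product and inner products, which are the kernels whose cost the abstract attributes to memory bandwidth and network performance.

```python
def dot(u, v):
    """Inner product; in a parallel solver this needs a global reduction."""
    return sum(a * b for a, b in zip(u, v))


def matvec(A, x):
    """Dense matrix-vector product; real codes use sparse storage."""
    return [dot(row, x) for row in A]


def pcg(A, b, tol=1e-10, max_iter=1000):
    """Conjugate Gradient with a point Jacobi preconditioner.

    A must be symmetric positive definite. Returns (x, iterations).
    """
    n = len(b)
    inv_diag = [1.0 / A[i][i] for i in range(n)]   # Jacobi: M^-1 = diag(A)^-1
    x = [0.0] * n
    r = list(b)                                    # residual for x = 0
    z = [d * ri for d, ri in zip(inv_diag, r)]     # preconditioned residual
    p = list(z)
    rz = dot(r, z)
    for iteration in range(max_iter):
        if dot(r, r) ** 0.5 <= tol:
            return x, iteration
        Ap = matvec(A, p)                          # one matvec per iteration
        alpha = rz / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        z = [d * ri for d, ri in zip(inv_diag, r)]
        rz_new = dot(r, z)
        beta = rz_new / rz
        p = [zi + beta * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x, max_iter
```

Each iteration performs one matrix-vector product and a handful of inner products, so the per-iteration cost profile is easy to relate to the hardware factors the abstract lists.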
|||||||||||||||||||||||||||||||||||||||||||||
| Title | Benchmarking of a Commercial Linux Cluster | |||||||||||||||||||||||||||||||||||||||||||||
| Author(s) | Daniele Tessera | |||||||||||||||||||||||||||||||||||||||||||||
| Author Inst. | Università di Pavia, Italy | |||||||||||||||||||||||||||||||||||||||||||||
| Presenter | Daniele Tessera | |||||||||||||||||||||||||||||||||||||||||||||
| Abstract | Exploiting the cumulative computing
capabilities of workstations and PCs has been pursued,
for a long time, as an inexpensive solution for
delivering high performance to many scientific
and industrial applications. Various solutions
have been proposed to allow parallel applications
to be executed over a set of independent systems.
Cluster machines, built on top of commodity off-the-shelf
components and initially developed as prototype
machines by research centers, are becoming very
popular. The increasing performance of commodity
components and the availability of interconnecting
networks able to link large numbers of such components
have fueled the development of these cluster machines.
Moreover, the development of specialized hardware
and software components tuned to match the
demands of HPC applications has resulted in cluster
machines that approach supercomputer performance
and rank among the top 500 most powerful supercomputers.
Hardware vendors offering scalable cluster machines
also testify to the maturity of cluster computing
as a cost-effective solution to the high
performance needs of scientists and numerical engineers. This paper focuses on a detailed study of the performance achieved by an IBM NetFinity supercluster when running various scientific benchmarks. Note that, although we have analyzed the performance of a specific machine, our results have broader applicability. Indeed, the IBM NetFinity adopts a typical cluster architecture based on Intel Pentium processors interconnected via both Myrinet and Ethernet networks, and runs the Linux operating system. The cluster is composed of 260 nodes, each housing two Intel Pentium processors. As a preliminary overview of the cluster performance, we have investigated the run-time behavior of a few kernels performing widely used numerical algorithms, such as matrix LU factorization, multigrid solvers, and distributed rankings. The aim of this study is to evaluate the performance benefits that numerical applications gain from specialized high-bandwidth communication components, i.e., Myrinet switches, over a typical Beowulf cluster based on Ethernet networking. For this purpose, we have analyzed the speedup achieved by the various kernels executed over either the Myrinet or the Fast Ethernet network. Our results are based on the post-mortem analysis of the low-level timing measurements collected by monitoring the executions of the various kernels. Performance models of simple distributed computations, aimed at assessing the cost of communication relative to computation, have been derived. The analysis of communication times has then focused on the wall clock times spent by processors exchanging data. Note that these times include both the time required to deliver the data over the network links and the time spent executing the communication protocols. 
As a rule of thumb, Myrinet scales better than Ethernet, even when the workload processed by each allocated processor is quite small. The speedup of a few kernels executed over Ethernet saturates when more than 32 processors are allocated. On the other hand, a surprising result is that Ethernet performance is superior when parallel kernels issue a large number of blocking send/receive operations. For each communication protocol, we have also analyzed the computation overheads, that is, the increases in computation wall clock times due to the concurrent execution of system processes that manage the communication activities. Our performance characterization study has then focused on the behavior of a complex climate model benchmark, namely the PSTSWM kernel from the Oak Ridge National Laboratory. Performance figures for various testbed simulations, varying the size of the physical grid, the number of simulated steps, and the number of allocated processors, have been analyzed. Statistical clustering techniques have then been applied to these performance figures to derive synthetic characterizations of the behaviors of the allocated processors. The identified groups of homogeneous behaviors have then been related to the characteristics of the corresponding simulations, that is, to the processed workload. As a further step towards characterizing the performance achieved by the IBM NetFinity cluster, we have studied the impact of the processor allocation policy on the execution times of the various kernels. Indeed, parallel applications can be thought of as composed of tasks to be allocated to the various processors. Since each IBM NetFinity node houses two processors, two allocation policies are available: we can allocate either one or two application tasks on each node. Note that, when only one task is allocated on each node, both processors may contribute to the execution of that task. 
Numerical kernels experience performance degradations when two tasks per node are allocated on a large number of nodes (i.e., more than 32 nodes). These performance degradations are due to increases in both computation and communication wall clock time. Indeed, an in-depth analysis of the timing measurements shows that communications between tasks allocated on the same node take longer than communications between tasks allocated on distinct nodes. Moreover, since the computation phases are somewhat synchronized across all application tasks, contention due to simultaneous memory accesses and the concurrent execution of any spawned communication processes also increases the computation wall clock times. Experimental evidence presented in the paper will show that, for some types of applications and a fixed number of nodes, allocating both processors of each node results in longer execution times than allocating only one processor per node. |
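The speedup analysis described in this abstract can be sketched in a few lines. The helper names and the timing values below are hypothetical and purely illustrative; they are not measurements from the paper. The sketch computes speedup S(p) = T(1)/T(p) and parallel efficiency S(p)/p from wall clock times, then reports the processor count beyond which adding processors yields little further gain.

```python
def speedup_table(timings):
    """Return (processors, speedup, efficiency) rows.

    timings maps processor count -> wall clock seconds and must
    contain an entry for 1 processor as the baseline.
    """
    t1 = timings[1]
    rows = []
    for p in sorted(timings):
        s = t1 / timings[p]
        rows.append((p, s, s / p))
    return rows


def saturation_point(timings, min_gain=1.1):
    """First processor count whose next measured count improves the
    speedup by less than a factor of min_gain; None if there is none."""
    rows = speedup_table(timings)
    for (p, s, _), (_, s_next, _) in zip(rows, rows[1:]):
        if s_next / s < min_gain:
            return p
    return None


# Made-up wall clock times in seconds, for illustration only.
example_timings = {1: 100.0, 2: 52.0, 4: 27.0, 8: 15.0,
                   16: 9.0, 32: 6.5, 64: 6.3}

if __name__ == "__main__":
    for p, s, e in speedup_table(example_timings):
        print(f"{p:3d} processors: speedup {s:6.2f}, efficiency {e:5.2f}")
    print("saturates at", saturation_point(example_timings), "processors")
```

The `min_gain` threshold is an arbitrary choice for this sketch; a real study would pick a criterion suited to the measurement noise.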
|||||||||||||||||||||||||||||||||||||||||||||
| Applications
Session II: Applications
Performance on Clusters |
||||||||||||||||||||||||||||||||||||||||||||||
| Title | A Comparative Study of the Performance of a CFD program across different Linux Cluster Architectures | |||||||||||||||||||||||||||||||||||||||||||||
| Author(s) | Th. Hauser, R.P. LeBeau, T. Mattox, G.P. Huang, H. Dietz | |||||||||||||||||||||||||||||||||||||||||||||
| Author Inst | Utah State University | |||||||||||||||||||||||||||||||||||||||||||||
| Presenter | Thomas Hauser | |||||||||||||||||||||||||||||||||||||||||||||
| Abstract | [Paper #26] | |||||||||||||||||||||||||||||||||||||||||||||
| Title | Scalability of a Tera-scale Linux-based Clusters from Parallel Ab initio Molecular Dynamics | |||||||||||||||||||||||||||||||||||||||||||||
| Author(s) | Hongsuk Yi, Jungwoo Hong, Hyoungwoo Park, and Sangsan Lee | |||||||||||||||||||||||||||||||||||||||||||||
| Author Inst | KISTI, Korea | |||||||||||||||||||||||||||||||||||||||||||||
| Presenter | Jungwoo Hong | |||||||||||||||||||||||||||||||||||||||||||||
| Abstract | At the beginning of 2000, KISTI
initiated the TeraCluster project to design a
Linux-based cluster with tera-scale performance.
The main goal of this project is to provide resources
composed of PC clusters that meet the level of
compute performance required by the grand challenge
applications in Korea. The UP2000 with Myrinet
and DS10 with Fast Ethernet systems, used as test platforms
of the TeraCluster project, have benefited from
many years of research dedicated to improving their
scalability and performance. For actual experiments,
we have investigated the performance of scalable
ab initio molecular dynamics and quantum chromodynamics
(QCD) applications on the test platforms and compared
it with the Cray T3E system. The parallelization is
based on the Message Passing Interface standard.
Results show that the QCD application, which requires
only nearest-neighbor communications between computing
nodes, parallelizes very well. For the ab initio
molecular dynamics, in contrast, the operations
associated with the three-dimensional fast
Fourier transformation dominate, representing the
most time-consuming part of the calculation of
C36 fullerene. Combining these results
with the performance evaluations of additional
grand challenge applications involving computational
fluid dynamics, structural analysis, and computational
chemistry, we have designed a prototype of the
TeraCluster with 516 computing nodes and built
the first phase of the TeraCluster in 2002.
TABLE I: The hardware and software components of the three different platforms and the basic results of the NPB and PMB benchmark suites.
We have built two different Linux-based PC clusters of 64 computing nodes with one server node as pilot systems [1]. The DS10 system with 466 MHz Ev6 and the UP2000 with 667 MHz Ev6 Alpha processors are connected with Fast Ethernet and Myrinet, respectively. Each computing node has 250 MB of memory and a 20 GB local IDE disk, while the server node has 128 MB of memory and a 30 GB disk. The systems run Red Hat Linux 6.0 with kernel 2.2.0. MPICH is used as the MPI communication library, and the Portable Batch System (PBS) is selected as the queuing system. The pilot project has been a success story, carried out in collaboration with a few industrial partners in Korea. Based on the performance evaluations of the major KISTI [2] applications, we built the first phase of the TeraCluster, denoted PhaseI. The details are summarized in Table I.
FIG. 1: (a) The sustained performance and (b) the corresponding speed-up of the full QCD with the L = 12 x 12^3 lattice size for three different machines.
FIG. 2: (a) Speedup of the ab initio molecular dynamics for the UP2000, DS10, and Cray T3E. The dashed line indicates the ideal speed-up. (b) Speedup of the ab initio application using VASP on the PhaseI.
We also calculate a new solid phase of carbon made from C36 fullerene. The supercell approximation is used to simulate periodic boundary conditions, and the cell size is 11 x 11 x 11 Å^3 for C36. The Kohn-Sham single-electron wavefunction is expanded in plane waves up to a cut-off energy of 36 Ry. Since C36 is a molecule, we use a single k-point in the calculation. In Fig. 2(a), we illustrate the speedup of the C36 fullerene calculation on the pilot systems and compare it with the Cray T3E. In contrast to the above results, each system shows incomplete scalability. In particular, the Cray T3E system retains linear scalability up to 16 nodes, but beyond that the speedup S increases only slowly, reaching nearly 16 with 64 computing nodes, while the DS10 scales even more poorly. 
These results underline the importance of network interface capacity for collective communication between the executing nodes. Indeed, the overhead of data communication increases rapidly as the number of working nodes increases, reflecting the significant data exchange during the 3D FFT. In Fig. 2(b), we also present the performance evaluations of ab initio molecular dynamics using the VASP package on the PhaseI. However, no scalability was found at all, even for 32 working nodes. In the remainder of this paper, we will discuss the scalability of the ab initio grand challenge problems in more detail.
[1] TeraCluster Project (http://cluster.or.kr).
[2] KISTI (http://www.hpcnet.ne.kr). |
|||||||||||||||||||||||||||||||||||||||||||||
| Title | openMosix vs Beowulf: a case study | |||||||||||||||||||||||||||||||||||||||||||||
| Author(s) | Moshe Bar, Stefano Cozzini, Maurizio Davini and Alberto Marmodoro | |||||||||||||||||||||||||||||||||||||||||||||
| Author Inst. | INFN and the University of Pisa, Italy | |||||||||||||||||||||||||||||||||||||||||||||
| Presenter | Stefano Cozzini | |||||||||||||||||||||||||||||||||||||||||||||
| Abstract | Today two main clustering paradigms
exist for the Linux environment: openMosix and
Beowulf (MPI-based). Whereas in MPI-based solutions,
the parallelization needs to be coded explicitly
through special library directives, in openMosix
the usual Unix style programming concepts apply.
Also, for clustering applications it is becoming increasingly evident that the latency of the network interconnect weighs heavily on overall system performance and throughput. In this paper we describe our evaluation of these two clustering paradigms applied to a suite of production codes currently used in a specific scientific environment recently installed at Trieste (Italy): the INFM center of excellence DEMOCRITOS. The codes represent the standard computational tools used in research on atomistic simulations. The suite can be roughly divided into three main categories of codes:
We therefore decided to provide DEMOCRITOS with two small clusters, the first implementing the Beowulf paradigm and the second the openMosix approach. Both clusters are IBM 1300 cluster solutions equipped with 1.4 GHz PIII processors and the same number of CPUs (16 dual-processor nodes). One cluster is equipped with the Myrinet high-speed network so that we can test the role of a high-speed network in both paradigms. Our benchmarking work aims to determine which solution meets precise requirements for scalability, performance and throughput. Scalability and performance are measured in a standard way and can therefore be compared against other (proprietary) parallel platforms. We judge them satisfactory simply by computing a "performance/price" ratio. These requirements are easily met by both paradigms for all classes of codes. The aspect that makes the difference is throughput, which is fundamental to guaranteeing efficient resource sharing among users and a high percentage of utilization of the computational resources. Finally, a key role is played by the "time to production" variable. It is in fact fundamental for our research to minimize the time needed to implement a scalable and efficient parallel application starting from a serial one. Our preliminary results indicate that for highly parallel codes like the ab initio ones, the standard Beowulf paradigm should be the preferred solution in terms of efficiency, scalability and throughput. We note, however, that the superiority of this solution is mainly due to the fact that highly parallel and scalable codes are already available, and therefore the time to production for this class of code is practically zero. On the other hand, for QMC codes the openMosix approach is certainly to be preferred. Efficiency and scalability are, as already said, the same as in the Beowulf case; the two main advantages relate to the other two aspects. 
Throughput is easily obtained through the load balancing technology within openMosix, and the "time to production" for serial programs is practically negligible: just run many instances of the program and the job is done. It is interesting to note that the second class of codes (classical MD), despite already being parallel, generally runs better on the openMosix cluster. These codes in fact scale only modestly (1-8 CPUs), which makes sharing resources between them and the ab initio codes (which require many processors) difficult to schedule efficiently with standard queue systems on a small cluster like ours. By contrast, the openMosix solution handles the requirements of QMC and moderately scalable codes like these very well. In the final paper we will report the results obtained on the computational suite under various lab environments, considering the following aspects: topology of nodes, type of node, and network interconnects. Using dedicated hardware for the various interconnects (Myrinet, Dolphin, Fast Ethernet, Gigabit), we will show the resulting (reproducible) scalability, performance and throughput for our computational problem. |
|||||||||||||||||||||||||||||||||||||||||||||
| Applications
Session III: Experiences
with Portability and I/O |
||||||||||||||||||||||||||||||||||||||||||||||
| Title | Adventures with Portability | |||||||||||||||||||||||||||||||||||||||||||||
| Author | Elizabeth Post | |||||||||||||||||||||||||||||||||||||||||||||
| Author Inst | Lincoln University, New Zealand | |||||||||||||||||||||||||||||||||||||||||||||
| Presenter | Elizabeth Post | |||||||||||||||||||||||||||||||||||||||||||||
| Abstract | As computers become more powerful
it becomes possible to develop computer models
that more accurately approximate the real-world
systems they are simulating. However these
models are increasingly complex, often taking years
to develop, so it is important to re-use existing
work as far as possible rather than continually “re-inventing
the wheel” and duplicating effort which could
be better spent elsewhere. With this aim of re-use scientists at Dexcel (a New Zealand dairy research institute) developed a dairy farm simulation framework with some capability to incorporate existing computer models of distinct entities in a pastoral dairy farm, e.g. pasture and crop growth models, animal metabolism models. This framework also allows run-time selection from several alternative models for each component (where they exist). With some being simpler and others more complex, this permits the same basic farm configuration to be modeled at different levels of detail, as appropriate for different research exercises. A major practical difficulty in implementing a system with this flexibility is that the various component sub-models have typically been written in a range of different computer languages. The dairy farm simulation framework itself is written in Smalltalk, and so far makes use of sub-models for the climate, pasture and animal components written in various languages such as Smalltalk, C, Fortran and Pascal. Initially the framework was developed as a research tool for agricultural scientists, and run on a Windows operating system using the Microsoft COM/DCOM protocol for interprocess communication. However there is now a requirement to investigate automated optimization of farm specification and operating parameters, and also visualisation of the associated high-dimensional objective functions. This work requires very large numbers of simulations to be performed so we are attempting to port the framework to run in parallel on a Linux cluster. While porting the existing framework and component models to run on the Linux cluster we also need to maintain the same dairy farm simulation model running on a Windows operating system. 
Ideally, as far as possible, we would like to maintain one single version that runs on both operating systems and also on a uniprocessor or in a parallel processing environment to avoid the difficulties of keeping the development of multiple versions for different environments synchronised. In practice this may not be completely achievable, but as far as possible we want to minimize the differences between these implementations. Thus we have had to use techniques that are portable between these environments, to ensure that all environments can use the same version. An added complexity is that researchers using the Windows version use a Windows interface but the parallel version must run without interacting with a Windows interface, so the code for the model itself must be independent of the code for the interface. Also, as we are using component models developed by other researchers we cannot change the component models themselves, and have to ensure that the interfaces between the framework and the component models are such that new versions of the component models may be substituted as they become available. It may also happen that alternative models for a particular component, such as the pasture or animal component, may require quite different inputs and produce outputs in different formats, or even completely different types of output. It is therefore something of a challenge to devise generic portable interfaces between the framework and the component models that will provide the necessary input parameters for any component model, and also be able to make use of the output from these component models, if necessary converting the output to a different form before it can be used in the dairy farm simulation. This paper identifies and discusses the problems to be solved in this porting process, and describes the techniques that have been used successfully to date. 
These include porting the framework and some of its component models from a Windows uniprocessor environment to a Linux parallel processing environment, the initial use of files to transfer data between processes, and the subsequent replacement of the file transfers with socket communication. This has led to some interesting exercises in portability. As we use MPI to manage the parallel processing, and have not yet persuaded Smalltalk to communicate by using MPI, we have a C program, using MPI, which manages the parallelism. This C program then starts the Smalltalk framework, which in turn communicates with its component models, written in Smalltalk, C, Fortran or Pascal, and returns a final result for each simulation to the controlling C program. The problem has been in getting different processes, written in different languages, and perhaps started and running independently of one another, to communicate and pass data between themselves. We hope to include some preliminary results of extending MPI outside its normal C/C++/Fortran domain and making it the basic inter-process communication mechanism in the framework, in a form that will also be portable across operating systems. |
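The move from file transfers to socket communication described above can be sketched as a simple request/reply exchange. This is not the authors' code: the real system uses a C/MPI controller, a Smalltalk framework, and component models in several languages, whereas the sketch below uses Python throughout for brevity. The function names, the JSON line format, and the `rain_mm` and `factor` parameters are all hypothetical, and a thread stands in for a separately started model process.

```python
import json
import socket
import threading


def component_model(conn):
    """Hypothetical component model: read one JSON request line,
    compute a result, and send back one JSON reply line."""
    with conn, conn.makefile("rw", encoding="utf-8", newline="\n") as stream:
        request = json.loads(stream.readline())
        growth = request["rain_mm"] * request["factor"]
        stream.write(json.dumps({"pasture_growth": growth}) + "\n")
        stream.flush()


def run_simulation(params):
    """Framework side: send parameters over a socket, wait for the reply."""
    framework_end, model_end = socket.socketpair()
    # A thread stands in for an independently running model process.
    worker = threading.Thread(target=component_model, args=(model_end,))
    worker.start()
    with framework_end, framework_end.makefile(
            "rw", encoding="utf-8", newline="\n") as stream:
        stream.write(json.dumps(params) + "\n")
        stream.flush()
        reply = json.loads(stream.readline())
    worker.join()
    return reply


if __name__ == "__main__":
    print(run_simulation({"rain_mm": 10, "factor": 0.5}))
```

A line-oriented text protocol is used here because it is easy to implement in every language the abstract mentions; the framework itself may exchange data in a different form.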
|||||||||||||||||||||||||||||||||||||||||||||
| Title | Experiences with the Parallel Virtual File System (PVFS) in Linux Clusters | |||||||||||||||||||||||||||||||||||||||||||||
| Author | Kent Milfeld, Avijit Purkayastha, Chona Guiang | |||||||||||||||||||||||||||||||||||||||||||||
| Author Inst | TACC/University of Texas - Austin, USA | |||||||||||||||||||||||||||||||||||||||||||||
| Presenter | Kent Milfeld | |||||||||||||||||||||||||||||||||||||||||||||
| Abstract | One of the key elements to the
Linux Cluster HPC Revolution is the high speed
networking technology used to interconnect nodes
within a cluster. It is the backbone for “sharing” data
in a distributed memory environment. This
backbone, when combined with local high-speed disks,
also provides the framework for developing parallel
I/O paradigms and implementing parallel I/O systems—an
HPC environment for “sharing” disks
in a parallel file system. While much of the emphasis in cluster computing has focused on optimizing processor performance (e.g., compilers, hardware monitors, numerical libraries, optimization tools) and increasing message passing efficiency (e.g., MPICH-GM, MPI-Pro, LAM, M-VIA), much less effort has been directed towards harnessing the parallel capacity of multiple processors and disks to obtain “HPC I/O” in clusters. This is partly due to the fact that the MPI-I/O standard was not formulated until version 2 of MPI. It is also due to the cost of dedicating processors/nodes to I/O services. (For workloads that include only a few I/O intensive applications, dedicated nodes for parallel I/O may be an underused resource, to say the least.) In addition, dedicated I/O nodes introduce heterogeneity into the system configuration, and therefore require additional resources, planning, management, and user education. The Parallel Virtual File System (PVFS) [1] is a shared file system designed to take advantage of parallel I/O service on multiple nodes within a Linux cluster. In PVFS, Parallel I/O requests are initiated on clients (compute nodes), and the parallel I/O service occurs on I/O servers (I/O nodes). The PVFS API includes a partitioning mechanism to handle common, simple-strided access[2] with a single call. The file system is virtual in the sense that it can run without kernel support (that is, in user space) in its native form (pvfs), although a UNIX I/O interface is supported through a loadable kernel module. Also, | |||||||||||||||||||||||||||||||||||||||||||||

