Last updated: April 19, 2007.
|title||Parallelism and Power in the Age of Petascale Computing|
|author(s)||Horst Simon, Lawrence Berkeley National Laboratory, USA|
We are about to enter the age of Petascale computing. By about 2009 the first Petaflop/s system will have entered the TOP500, and in the following decade we will see Petaflops computing become more commonplace. Just as Terascale computing on commodity clusters is widespread today, about ten years from now we will see widespread adoption of Petascale computers using commodity technology. By June 2016 one Petaflop/s of performance will be required to enter the TOP500 list, and the age of Exaflops computing will be upon us soon after.
In this talk we will survey two interrelated challenges for the age of commodity Petaflops computing: dealing with increasing parallelism and reducing the power consumption of future systems. We first point out that these challenges are interrelated, and that one way to lower power consumption is by increasing parallelism. We will also explain the seeming contradiction that low power solutions are not necessarily the most energy efficient solutions. We then claim that, contrary to the current hype about multicore processors, we will be able to deal with the parallelism challenge based on the experience of the HPC community over the last 15 years. Finally we will show that the problem of power consumption should be approached with a multi-tier strategy, attacking the problem at the component, system, computer room, and building-environment level. We have started such a multi-tier strategy in Berkeley, and will show some early results.
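The power-parallelism connection the abstract alludes to can be illustrated with the classic CMOS dynamic-power approximation P ≈ C·V²·f (a back-of-the-envelope sketch of our own, not from the talk): because supply voltage can be lowered along with frequency, two slower cores can match one fast core's aggregate throughput at lower total power.

```python
# Illustrative model (not from the talk): dynamic CPU power scales roughly
# as P ~ C * V^2 * f. Halving frequency while scaling voltage down lets
# twice as many cores deliver the same aggregate cycles per second at
# lower total power -- one reason more parallelism can mean less power.

def dynamic_power(capacitance, voltage, frequency):
    """Classic CMOS dynamic-power approximation: P = C * V^2 * f."""
    return capacitance * voltage**2 * frequency

# Baseline: 1 core at 2.0 GHz, 1.2 V (hypothetical numbers)
baseline = dynamic_power(1.0, 1.2, 2.0e9)

# Alternative: 2 cores at 1.0 GHz, 0.9 V (same aggregate cycles/second)
parallel = 2 * dynamic_power(1.0, 0.9, 1.0e9)

print(parallel < baseline)  # the two slower cores draw less total power
```

The same model also shows why "low power" and "energy efficient" can diverge: a very low-power core that runs too slowly may burn more total energy per job.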
|title||Advances in Regional Climate System Modeling for a Better Understanding of Land Use and Climate Change Impacts in California|
|author(s)||Norman L. Miller, Lawrence Berkeley National Laboratory, USA|
|presenter||Norman L. Miller|
|abstract||Regional Climate System Modeling (RCSM) is an outgrowth of Global Climate System Modeling (GCSM) and Numerical Weather Prediction (NWP). The Berkeley RCSM framework consists of pre- and post-processors, numerical and statistical models, and a database management interface. Our RCSM includes high-resolution atmospheric dynamics, physics, and sub-grid physically based land surface processes with detailed snow and hydrology, and coupled surface-deep groundwater, with modules for water quality and agriculture. It has been applied for climate change research in the western U.S., East Asia, and Australia, and results were used as contributions to the 2nd, 3rd, and 4th Intergovernmental Panel on Climate Change Assessment Reports. Some of our most recent advances with RCSM include long-term drought studies of the California Central Valley and water supply and demand optimization. New enhancements include ensemble simulations of soil moisture and plant functional types to generate initial conditions without the normal multi-year spin-up for equilibrium conditions, a nested fine-scale (100 m) resolution urban runoff model and estuary model, and links with economic analysis. We have implemented the latest version of the Weather Research and Forecasting (WRF) code into our RCSM and have completed a series of large multi-processor ensemble simulations evaluating pre- and post-industrial land use change and the shifts in daily minimum and maximum temperature, latent and sensible heat, and soil moisture. This presentation will provide a brief overview of climate modeling, numerical and parallel computing advances, and California land use and climate change impacts on water resources, energy, agriculture, and other sectors.|
|title||From Beowulf to Cray-o-wulf: Extending Linux Clustering Paradigm to Supercomputing Scale|
|author(s)||Peter J. Ungaro, Cray Inc., USA|
|presenter||Peter J. Ungaro|
Commodity technology trends are leading to an ever-increasing reliance on scalability for high performance computing. This talk will examine some of these trends and issues and discuss the implications for large-scale Linux cluster design and their use for scalable applications.
|Parallel I/O, File Systems, & Storage|
|title||pNFS and Linux: Working Towards a Heterogeneous Future|
|author(s)||Dean Hildebrand, Peter Honeyman, and William Adamson, University of Michigan, USA|
|abstract||Heterogeneous and scalable remote data access is a critical enabling feature of widely distributed collaborations. Parallel file systems feature impressive throughput, but sacrifice heterogeneous access, seamless integration, security, and cross-site performance. Remote data access tools such as NFS and GridFTP provide secure access to parallel file systems, but either lack scalability (NFS) or seamless integration and file system semantics (GridFTP).
Anticipating terascale and petascale HPC demands, NFSv4 architects are designing pNFS, a standard extension that provides direct storage access to parallel file systems while preserving operating system and hardware platform independence. pNFS distributes I/O across the bisectional bandwidth of the storage network between clients and storage devices, removing the single server bottleneck so vexing to client/server-based systems.
Researchers at the University of Michigan are collaborating with industry to develop pNFS for the Linux operating system. Linux pNFS features a pluggable client architecture that harnesses the potential of pNFS as a universal and scalable metadata protocol by enabling dynamic support for layout format, storage protocol, and file system policies. This paper evaluates the scalability and performance of the Linux pNFS architecture with the PVFS2 and GPFS parallel file systems.
|title||Efficient Methods for Parallel I/O|
|author(s)||Jeff Larkin, Cray, USA and Mark Fahey, Oak Ridge National Laboratory, USA|
|title||SAN Lessons Learned|
|author(s)||Andy Loftus and Chad Kerner, NCSA, USA|
|title||Practical Experiences of Setting Up, Managing, and Diagnosing a Large Parallel Filesystem|
|author(s)||Jim Laros, SNL, USA|
|Tools & Programming Environments|
|title||Detecting and Solving Memory Problems in Linux Clusters|
|author(s)||Chris Gottbrath, TotalView Technologies, LLC, USA|
|title||Automated MPI Correctness Checking: What If There Were a Magic Option?|
|author(s)||Patrick Ohly and Werner Krotz-Vogel, Intel Corporation, DE|
|title||A Framework for Scalable Parameter Estimation on Clusters|
|author(s)||Tom Bulatewicz, Daniel Andresen, Stephen Welch, Wei Jin, Sanjoy Das, and Matthew Miller, Kansas State University, USA|
|title||PerfTrack: Scalable Application Performance Diagnosis for Linux Clusters|
|author(s)||Rashawn Knapp, Kathryn Mohror, Aaron Amauba, and Karen Karavanic, Portland State University, USA; Thomas Conerly, Catlin Gabel School, USA; Abraham Neben, Wilson High School, USA; John May, Lawrence Livermore National Laboratory, USA|
|Resource Management, Networks, & Power|
|title||Anatomy of Ethernet Resiliency and Scalability for Cluster Computing|
|author(s)||Debbie Montano, Force10, USA|
|title||Grids for the Real World: Addressing Sovereignty and Ease of Use|
|author(s)||David Jackson, Cluster Resources, USA|
|title||How Long Can You Go?|
|author(s)||Wade Vinson, HP, USA|
|author(s)||Egan Ford, IBM, USA|
|title||Benefits of Centralized Service Processor Management in Clustered Environments|
|author(s)||Ivan Passos, Avocent, USA|
|title||Best Practices in Cluster Management|
|author(s)||Richard Friedman, Scali, USA|
|abstract||This session will discuss the impact that MPI technology can have in overall system performance, with a particular focus on how MPI can help optimize performance in multi-core based systems. Additionally, experience with Scali MPI Connect on various customer examples will be used to illustrate the impact an MPI can have in overall system efficiency and effectiveness.|
|title||OCS and LSF HPC: An Integrated Solution for System and Workload Management|
|author(s)||Mehdi Bozzo-Rey, Platform Computing, USA|
|Resource Management & Networks|
|title||An Architecture for Dynamic Allocation of Compute Cluster Bandwidth|
|author(s)||John Bresnahan and Ian Foster, University of Chicago, USA|
|title||Experiences Deploying a 10-Gigabit Ethernet Computing Environment to Support Regional Computational Science|
|author(s)||Jason Cope, Theron Voran and Matthew Woitaszek, University of Colorado at Boulder, USA; Adam Boggs, Sean McCreary, and Michael Oberg, National Center for Atmospheric Research, USA|
|title||The Application Level Placement Scheduler (ALPS)|
|author(s)||Michael Karo, Richard Lagerstrom and Carl Albing, Cray Inc., USA|
|Performance Analysis & Applications|
|title||A Case Study in Using Local I/O and GPFS to Improve Simulation Scalability|
|author(s)||Vincent Bannister, Microsoft, USA; Gary Howell and Eric Sills, HPC/ITD NCSU, USA; Tim Kelley and Qianyi Zhang, NCSU, USA|
|abstract||Many optimization algorithms exploit parallelism by calling multiple independent instances of the function to be minimized, and these functions in turn may call off-the-shelf simulators. The I/O load from the simulators can cause problems for an NFS file system. In this paper we explore efficient parallelization in a parallel program for which each processor makes serial calls to a MODFLOW simulator. Each MODFLOW simulation reads input files and produces output files. The application is "embarrassingly" parallel except for disk I/O. Substituting local scratch space for global file storage ameliorates synchronization and contention issues. An easier solution was to use the high-performance global file system GPFS instead of NFS. Compared to using local I/O, using a high-performance shared file system such as GPFS requires less user effort.|
|title||Visualizing I/O Performance during the BGL Deployment|
|author(s)||Andrew Uselton and Brian Behlendorf, Lawrence Livermore National Laboratory, USA|
|title||Load Balancing in Pre-Processing of Large-Scale Distributed Sparse Computing|
|author(s)||Olfa Hamdi-Larbi and Zaher Mahjoub, Faculty of Sciences of Tunis, TN; Nahid Emad, University of Versailles, FR|
|title||Linux Kernel Improvement: Toward Dynamic Power Management of Beowulf Clusters|
|author(s)||Fengping Hu and Jeffrey Evans, Purdue University, USA|
|title||HPC System Call Usage Trends|
|author(s)||Terry Jones, Lawrence Livermore National Laboratory, USA; Andrew Tauferner and Todd Inglett, IBM, USA|
|title||Compute Node Linux (CNL): From Capability to Capacity|
|author(s)||Kevin Peterson, Cray Inc., USA|
|title||Starting with Linux: A System Design Case Study|
|author(s)||John Goodhue and Win Treese, SiCortex Inc. USA|
|title||Intel Woodcrest: An Evaluation for Scientific Computing|
|author(s)||Philip Roth and Jeffrey Vetter, Oak Ridge National Laboratory, USA|
|presenter||Philip Roth and Jeffrey Vetter|
|title||The PeakStream Platform: High-Productivity Software Development for GPUs|
|author(s)||Matthew Papakipos, PeakStream, Inc. USA|
|abstract||The emerging world of multi-core processors and massively parallel systems requires a programming model that scales to the new generation of computing architectures. Existing codes written for single-core CPUs are not likely to take full advantage of multi-core technology without modification. In this talk, we will show you the PeakStream Virtual Machine and how it provides automatic parallelization of programs written in C/C++, so that developers can focus on their application rather than on the intricate details of parallelizing it. During this session, learn how we are improving development time on such computationally intense applications as synthetic aperture imaging, computed tomography scans, Monte Carlo simulation, and Black-Scholes option pricing.
|title||Effective Use of Commodity Multi-Core Systems in HPC|
|author(s)||Kent Milfeld, Kazushige Goto, Avi Purkayastha, Chona Guiang, and Karl Schulz, University of Texas at Austin, USA|
|title||The Future of Storage: Commodity Clusters and Parallel I/O|
|author(s)||Dave Fellinger and Alex Sayyah, DDN, USA|
|Technical Briefs: I/O|
|title||Considerations for Scalable Environmental Sciences Applications on Conventional HPC Linux Platforms|
|author(s)||Stan Posey, Panasas Inc., USA|
|abstract||Research organizations continue to increase their investments in computational environmental sciences and related applications, as they face growing demands of computational scientists who continue to expect more from scalable computer system platforms. Typical demands of the application workflow often include rapid single simulation job turnaround and multi-job throughput capability for users with diverse application requirements in a high-performance computing (HPC) infrastructure.
For today’s economics of HPC, the required resources of CPU cycles, large memory, system bandwidth and scalability, storage and I/O, and file and data management must all deliver the highest levels of user productivity and reliability possible from conventional systems based on commodity HPC Linux clusters. As the popularity of Linux clusters and distributed memory computing has grown, a new class of storage cluster technology has been developed that is designed to extend conventional cluster capability. These storage systems offer a large single resource of shared addressable storage, providing an improved balance between capability and capacity for effective scalability of HPC application software.
HPC and Scalable Environmental Modeling
Rapid progress in computational environmental application performance has been influenced by advanced developments in application software algorithms, balanced graph partitioning for domain-parallel schemes, and HPC cluster systems. By far the most important HPC advancement in recent years for such application software is the parallel scalability that is possible with geometry domain decomposition and distributed-memory parallelism through explicit message passing. Most environmental modeling software employs this technique today, owing to its potential for scalability on the complete range of HPC systems currently available.
From an additional HPC perspective, environmental modeling codes differ in their discretization schemes and algorithms. That is, some application software uses a structured mesh discretization vs. unstructured, and/or perhaps an explicit vs. implicit algorithm, characteristics that influence HPC performance behaviour on current microprocessor and system architectures and I/O systems. Performance and scalability can be achieved when all of these factors are considered in an overall application and HPC evaluation.
Parallel scalability of fluid flow algorithms is, in theory, independent of discretization choice, although current methods in graph partitioning (e.g. METIS) for distributed-memory parallelism favour unstructured meshes over structured meshes. This partitioning, or domain decomposition, seeks to balance the computational “work” among the partitions and minimize the amount of information that must “pass” between partition boundaries. This communication requirement between partitions is often what determines the level of parallel scalability achievable on different system architectures.
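The balance described above, equal work per partition against a minimal "cut" of crossing edges, can be made concrete with a toy sketch (an illustration of the partitioning objective only, not of the METIS algorithm itself):

```python
# Toy illustration of the domain-decomposition objective: balance vertex
# "work" across partitions while minimizing the edge cut -- edges whose
# endpoints land in different partitions, i.e. data that must "pass"
# between partition boundaries each iteration.

def edge_cut(edges, part):
    """Count edges crossing partition boundaries."""
    return sum(1 for u, v in edges if part[u] != part[v])

def partition_work(weights, part, nparts):
    """Total vertex weight (computational work) assigned to each partition."""
    work = [0] * nparts
    for v, w in enumerate(weights):
        work[part[v]] += w
    return work

# Toy mesh: 6 cells in a 2x3 grid, unit work per cell
edges = [(0, 1), (1, 2), (3, 4), (4, 5), (0, 3), (1, 4), (2, 5)]
weights = [1] * 6

rows = [0, 0, 0, 1, 1, 1]    # split along rows: balanced work, cut = 3
skewed = [0, 0, 1, 0, 0, 1]  # smaller cut (2), but work is unbalanced

print(edge_cut(edges, rows), partition_work(weights, rows, 2))
print(edge_cut(edges, skewed), partition_work(weights, skewed, 2))
```

A real partitioner trades these two objectives off at scale; the sketch only shows why neither can be optimized in isolation.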
The choice of an explicit algorithm vs. implicit can influence these communication requirements that affect scalability. Explicit algorithms typically scale better since they rely on a local computational stencil for time integration, which can minimize the exchange of information at domain interfaces. Implicit schemes involve solution of a linear system whose neighbour-dependency is larger, meaning parallel algorithms experience a “delay” of communications that must propagate among several domain interfaces.
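The locality of explicit schemes can be seen in a minimal sketch (our illustration, assuming a 1D heat equation with fixed boundaries, not taken from the abstract): each cell's update touches only its immediate neighbours, so a partitioned domain needs to exchange just one halo cell per boundary per time step.

```python
# One explicit time step of the 1D heat equation. Each new value depends
# only on the old values at i-1, i, i+1, so a domain partition needs only
# a single "halo" cell from each neighbouring partition per step -- the
# local computational stencil the text refers to.

def explicit_step(u, alpha=0.25):
    """u_new[i] depends only on u[i-1], u[i], u[i+1]; boundaries held fixed."""
    return [u[0]] + [
        u[i] + alpha * (u[i - 1] - 2 * u[i] + u[i + 1])
        for i in range(1, len(u) - 1)
    ] + [u[-1]]

u = [0.0, 0.0, 1.0, 0.0, 0.0]   # a single hot cell
u1 = explicit_step(u)
# Heat spreads one cell per step: only the hot cell and its neighbours change.
print(u1)
```

An implicit step, by contrast, couples every cell through a linear solve, which is why its communication pattern reaches across several domain interfaces.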
An additional parallel consideration is the simulation of transient fluid flow behaviour, whereby the amount of I/O required for saving data at each time step during the simulation can severely limit a system architecture’s ability to scale the run. Recent HPC storage cluster technology has achieved breakthroughs in high bandwidth over commodity interconnects, through software and I/O performance features that remove this bottleneck by decoupling the I/O requirement from the solver, improving overall run turnaround on conventional Linux clusters.
This presentation examines application-driven HPC workflow efficiencies for relevant applications in computational environmental sciences. Modeling parameters such as model size, solution schemes, range of scales (and coupled scales), and a variety of simulation conditions can produce a wide range of computational behavior and large-scale data management requirements, such that careful consideration should be given to how HPC resources are configured and balanced to satisfy increasing user requirements.
|title||Performance, Reliability, and Operational Issues for High Performance NAS Storage|
|author(s)||Matthew O'Keefe, Cray Inc., USA|
|title||Experiences with Parallel Commodity Storage|
|author(s)||David Chaffin, Texas Tech HPCC, USA|
|title||Architecting High Performance, Scalable & Highly Available Cluster Storage with Best-of-Breed Storage Software and DDN S2A Technology|
|author(s)||Jeff Denworth & Bob Woolery, DataDirect, USA|
|presenter||Jeff Denworth & Bob Woolery|
|title||HPC I/O Roadmap: The Next Five Years|
|author(s)||Gary Grider, Los Alamos National Laboratory, USA|
The talk will include a brief history of developments in scalable I/O, file systems, and storage networks for high-performance computing systems; a survey of current work in the area; and an overview of work being started to address future needs. Additionally, how parallel applications use I/O in HPC systems will be summarized. The presentation will also give a survey of many of the more difficult current and emerging issues for the HPC I/O and file systems area.
|title||Operations at Scale - Lessons to Be Remembered|
|author(s)||Robert Ballance, Sandia National Laboratories, USA|
|title||Tales of Tahoe|
|author(s)||Don Lane, U.S. Forest Service, USA|
Don Lane says about his presentation, "You've seen its [Tahoe's] beauty; now learn about the region's rich history from the earliest period through today. Hear short tales of the colorful characters who inhabited this magnificent area from the discovery of gold through today."
|Network & Cluster Optimization|
|title||Data-Intensive Cluster Optimization|
|author(s)||Benoit Marchand, eXludus, USA|
|abstract||The rapid adoption of clusters as the predominant architecture for HPC yields clear benefits for cost-effective scaling across a broad range of application disciplines. While commodity computational nodes are very cost-effective, and both throughput serial workloads and many parallelized applications scale well on clusters, there are many other application workloads which scale less effectively and can benefit from additional optimization. As the SMP architecture was “blown up” into clusters, slower network connections replaced shared-memory crossbar connections and file servers became more remote. This creates challenges for optimizing workload performance, especially for data-intensive workloads.
In this case the input data must originate from the remote file server and traverse the network; as requests increase beyond a certain level, queuing conflicts and bandwidth limitations can become severe bottlenecks. Large volumes of output data must be returned to the central file server when simulations are completed; this can also present a challenge, with many large files being written simultaneously. We present a toolkit of automated cluster optimization capabilities which can accelerate data input and output on clusters, and more effectively optimize compute node performance as well. The software toolkit provides: Parallel File Serving, Asynchronous Results Transfer, a Schedule Optimizer, and a Meta Language Processor. The Parallel File Serving module allows us to provide shared data to all nodes in a cluster simultaneously. This data replication provides much higher effective aggregate bandwidth across commodity (e.g. Gigabit Ethernet) network switches, and scales in a highly linear fashion with the number of nodes. Asynchronous Results Transfer (ART) allows output data to be returned from a large number of compute nodes to the central file server in the background; this establishes a more efficient processing pipeline, freeing the compute nodes to get back to their compute-intensive simulation work. The output data in question can be final results, intermediate results, or even checkpoint-restart files. The same technique can be used to prestage data on the input side and alert workload managers to send jobs to nodes which are already “data-hot”.
The Schedule Optimizer allows us to enhance the performance of existing workload managers (PBS, SGE, LSF, Torque, etc.) by dynamically optimizing the number of processes per core. The software is fault-tolerant and transparent to install and operate alongside existing file systems (e.g. NFS, Lustre, Panasas) and existing workload managers, without modification to existing application source or binaries. The modules may be used selectively on an application-by-application basis, as appropriate. Modifications are typically limited to minor changes in existing job scripts, and our Meta Language Processor provides a user interface to implement modified scripts. Installation on a large number of nodes can be completed in a matter of minutes, and the software is designed to be “set and forget” from a system administration standpoint. Performance results on customer clusters with real-world applications using these cluster optimization modules will be presented. Performance enhancements of a factor of 3 or more are achieved with parallel file serving and the Schedule Optimizer, and a factor of 1.5 or more with ART.
|title||Myri-10G: The Technically Superior HPC Interconnect|
|author(s)||Tom Leinberger, Myricom, USA|
Myri-10G is 10-Gigabit Ethernet from Myricom, and more. In HPC clusters using the kernel-bypass MX-10G software, Myri-10G exhibits 2.3 microsecond MPI PingPong latency, 1.2 GByte/s PingPong data rate, 2.4 GByte/s SendRecv data rate, very high application availability (very low host-CPU load) with MPI and sockets, a small and constant memory footprint in the host, and wire-speed interoperability between 10-Gigabit Ethernet and 10-Gigabit Myrinet. These benchmarks are for clusters of any size; they are not just marketing benchmarks on small clusters. MX-10G and Myri-10G NICs operate with either Myrinet or Ethernet switch networks, and carry IP traffic efficiently along with MPI traffic. The Lustre and PVFS2 cluster file systems have native MX support, and are blazingly fast with MX-10G. Myri-10G switches are available with a mix of Myrinet-protocol and Ethernet-protocol ports. In June, Myricom will start shipping a new series of Myri-10G switches with up to 512 host ports from a single switch enclosure.
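The quoted figures fit the usual first-order message-time model t(n) = L + n/B (our own illustration, not Myricom's methodology), from which one can estimate the "half-power" message size n_1/2 = L*B, the size at which half the peak bandwidth is achieved:

```python
# First-order model of point-to-point message time: t(n) = L + n/B,
# using the latency and data rate quoted in the abstract above.

LATENCY = 2.3e-6    # seconds: quoted MPI PingPong latency
BANDWIDTH = 1.2e9   # bytes/second: quoted PingPong data rate

def transfer_time(nbytes):
    return LATENCY + nbytes / BANDWIDTH

def effective_bandwidth(nbytes):
    return nbytes / transfer_time(nbytes)

# Half-power point n_1/2 = L * B: messages smaller than this are
# latency-bound; larger ones approach the peak data rate.
n_half = LATENCY * BANDWIDTH
print(round(n_half))  # message size in bytes at half of peak bandwidth
```

Under this model, a message of n_1/2 bytes achieves exactly half the asymptotic bandwidth, which is why small-message latency matters as much as peak data rate for many MPI codes.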
|title||Open Fabrics Enterprise Edition (OFED) Update|
|author(s)||Jamie Riotto, Cisco, USA|
An overview of the OpenFabrics Alliance (OFA) Open Fabrics Enterprise Edition (OFED) software. OFED is an open-source, fabric-agnostic set of host driver software supporting RDMA technologies over InfiniBand and Ethernet. This talk will give an update on the new OFED 1.2 release, covering what it contains, who the contributors were, and where it is being used in both enterprise and HPC. A roadmap will also be presented for future OFED releases and the new technologies they will contain.
|Software for Clustered Systems|
|title||Optimizing Application Performance on x64 Processor-based Systems with PGI Compilers and Tools|
|author(s)||Douglas Miles, PGI, USA|
|abstract||PGI Fortran, C and C++ compilers and tools are available on many Intel and AMD x64 processor-based Linux clusters. Optimizing performance of x64 processors often depends on maximizing SSE vectorization, ensuring alignment of vectors, and minimizing the number of cycles the processors are stalled waiting on data from main memory. The PGI compilers support a number of directives and options that allow the programmer to control and guide optimizations including vectorization, parallelization, function inlining, memory prefetching, interprocedural optimization, and others.
In this talk we provide detailed examples of the use of several of these features as a means for extracting maximum single-node performance from Linux clusters using PGI compilers and tools.
|title||Debugging and Optimizing Applications for Multicore MPP Architectures|
|author(s)||Michael Rudgyard, Allinea, USA|
As two-, four-, and potentially eight-core processors become the norm, the de facto HPC architecture is tending towards large clusters of modest 8- to 16-core shared-memory servers, potentially with co-processing devices (e.g. GPGPUs, FPGAs, ClearSpeed). Programming these machines optimally presents a number of challenges, and applications that use mixed programming models are now becoming commonplace.
In this presentation we will discuss the challenges facing today's HPC application developers, and the need for simple tools that can address mixed programming models. We will present new multicore features of Allinea's Distributed Debugging Tool (DDT) and Optimisation and Profiling Tool (OPT), and discuss our aims to provide a consolidated, scalable, yet intuitive framework for HPC developers.
|title||Improving System Performance with Scali MPI Connect|
|author(s)||Rick Friedman, Scali, USA|
|title||Adaptive Computing in HPC Today|
|author(s)||David Jackson, Cluster Resources, USA|
|abstract||Clusters that heal themselves? Clusters that learn and improve scheduling over time? Clusters that actively coordinate compute, storage, and network resources to optimize total performance? Clusters that dynamically grow and shrink and even customize themselves based on workload? Think this is science fiction? Think again!
In this presentation, we will discuss how Moab Utility Computing Suite is enabling customers to accomplish these objectives today on HPC clusters, grids, and data centers. We will discuss the benefits of systems that can dynamically adapt their resources, jobs, and policies to meet changing objectives and environmental conditions. Furthermore, we will cover how advanced high-level policies can harness these capabilities to improve both utilization and response time, as well as deliver on QOS/SLA agreements in a way never before possible.
The magic behind this solution is a virtualized batch layer that enables these objectives to be accomplished. On the outside, users only see the familiar interfaces for submitting and managing workload on a cluster that grows, shrinks, and adapts on command.
This presentation will discuss capabilities and a number of case studies on real-world sites that have been successfully utilizing this technology for years. We will also discuss industry trends that are moving adaptive computing into the mainstream.
|Upcoming Hardware Technology|
|title||HPC Technologies from Intel|
|author(s)||David Barkai and David Lombard, Intel, USA|
|presenter||David Barkai and David Lombard|
|title||Cool, Tight, Fast, Reliable HPC Clustering with Blades and InfiniBand|
|author(s)||Kent Koeninger, HP, USA|
|abstract||Processor speed alone is not the primary driver for HPC clusters, nor are HPC clusters just for academic science and industrial engineering. In addition to delivering TFLOPS, they need to run with minimum power and cooling, fit in minimum space, optimize the total cost of ownership, and scale to large tightly-connected clusters at minimum cost. Enter blades with InfiniBand. This combination is opening new enterprise-oriented HPC markets, including financial services and online gaming. In addition to the requirements above, these markets demand enterprise reliability, availability, and serviceability (RAS), including redundant configurations for resiliency. This talk will highlight the roles of HP BladeSystem c-Class clusters and InfiniBand in meeting these advancing HPC requirements.|
|title||Trends in High Performance Computing Commodity Clusters|
|author(s)||Jay Urbanski, IBM, USA|
This talk will examine trends in HPC clusters from both a hardware and software perspective. Areas of discussion will include processor technologies and the implications of multi-core, heterogeneous computing, interconnect directions, systems management and scalability, and power, cooling, and packaging optimization.
|title||High Performance for Big Science|
|author(s)||Kevin Noreen, Dell, USA|
|title||The Transition to Multi-core: Is Your Software Ready?|
|author(s)||Matthew Papakipos, PeakStream, USA|