Tutorials | 2005 Tutorials
Last updated: 24 February 2005
||
I |
||
Title |
Towards Highly Available, Scalable, and Secure HPC Clusters with HA-OSCAR |
|
Presenters |
Ibrahim Haddad, Box Leangsuksun, Stephen L. Scott |
|
Overview |
March 2004 was a major milestone for the HA-OSCAR Working Group: it marked the first public release of the HA-OSCAR software package. HA-OSCAR is an open source project that aims to provide the combined power of high availability and high performance computing. HA-OSCAR enhances a Beowulf cluster system for mission-critical applications with various high availability mechanisms, such as component redundancy to eliminate single points of failure, self-healing, failure detection and recovery, and automatic failover and fail-back. The first release (version 1.0) adds new high availability capabilities to Linux Beowulf clusters based on the OSCAR 3.0 release from the Open Cluster Group. This release of HA-OSCAR provides an installation wizard with a graphical user interface and a web-based administration tool, which allow intuitive creation and configuration of a multi-head Beowulf cluster. In addition, we have included a default set of monitoring services to ensure that critical services, hardware components, and important cluster resources are always available at the control node. HA-OSCAR also supports tailored services that can be configured and added via a Webmin-based HA-OSCAR administration tool.
This tutorial addresses in detail the design and implementation issues involved in building HA Linux Beowulf clusters using Linux and open source software as the base technology, with HA-OSCAR as its focus. We will present the HA-OSCAR architecture, review the new features of the latest release, discuss how the HA and security features were implemented, and describe our experiments covering modeling and the testing of performance and availability on real systems.
Background on HA-OSCAR: The HA-OSCAR project's primary goal is to improve the existing Beowulf architecture and cluster management systems (e.g., OSCAR, ROCKS, Scyld) while providing high-availability and scalability capabilities for Linux clusters. HA-OSCAR introduces several enhancements and new features to OSCAR, mainly in the areas of availability, scalability, and security. The new features in the initial release are head node redundancy and self-recovery from hardware, service, and application outages. HA-OSCAR has been tested to work with several OSCAR distributions; it should work with OSCAR 2.3, 2.3.1, and 3.0 based on Red Hat 9.0, and with OSCAR 4.0 based on Fedora Core 2. The first version (1.0) was released on March 22, 2004, bringing more than 5,000 hits to the HA-OSCAR site within 48 hours. The announcement was featured in LinuxWorld magazine and on O'Reilly, HPC Wire, ClusterWorld, and Slashdot. |
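To make the failure detection and failover idea concrete, here is a minimal conceptual sketch in C. This is not HA-OSCAR's implementation; the monitored address, port, polling interval, and threshold below are arbitrary examples. A monitor on the standby head node polls a critical service on the primary head node and triggers a failover action after repeated misses.

    /* Conceptual sketch only, not HA-OSCAR code: poll a critical service
     * on the primary head node and trigger failover after repeated misses. */
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stdio.h>
    #include <sys/socket.h>
    #include <unistd.h>

    static int service_alive(const char *ip, int port) {
        int ok = 0, fd = socket(AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in addr = { .sin_family = AF_INET, .sin_port = htons(port) };
        inet_pton(AF_INET, ip, &addr.sin_addr);
        if (fd >= 0) {
            ok = (connect(fd, (struct sockaddr *)&addr, sizeof addr) == 0);
            close(fd);
        }
        return ok;
    }

    int main(void) {
        const char *head = "192.168.1.1";   /* primary head node (example address) */
        int misses = 0;

        for (;;) {
            misses = service_alive(head, 22) ? 0 : misses + 1;   /* e.g. sshd */
            if (misses >= 3) {
                /* A real system would now take over the cluster alias IP,
                   restart services on the standby head, and alert the admin. */
                printf("primary head unreachable 3 times: trigger failover\n");
                break;
            }
            sleep(5);                        /* polling interval */
        }
        return 0;
    }

HA-OSCAR itself packages this kind of monitoring as configurable services managed through its Webmin-based administration tool.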
|
Outline |
Introduction
Challenges in Designing and Prototyping HA/HPC Clusters
OSCAR
HA-OSCAR
Demonstration (with 4 laptops running the latest research release of HA-OSCAR)
Conclusion
|
|
Schedule |
All Day |
|
|
||
II |
||
Title |
Architectures and Configuration of Linux Cluster I/O Solutions |
|
Presenter(s) |
Lee Ward |
|
Overview |
|
|
Outline |
General cluster/parallel FS requirements
Architectures
Hardware
General Q&A
|
|
Schedule |
Half-day morning. |
|
|
||
III |
||
Title |
Programming Linux Clusters Using the Message Passing Interface (MPI) |
|
Presenter(s) |
Purushotham V. Bangalore, Vijay Velusamy |
|
Overview |
Linux clusters continue to dominate the Top500 list of the world's fastest supercomputers. Most of the programs that run on these clusters are developed using the Message Passing Interface (MPI) standard Application Programming Interface (API). This half-day tutorial covers the basic knowledge required to write parallel programs using the message-passing programming model and is directly applicable to almost every parallel computer architecture. Efficient parallel programming involves solving a common task through cooperation between processes. The programmer must define not only the tasks that will be executed by the processors, but also how these tasks synchronize and exchange data with one another. In the message-passing model, the tasks are separate processes that send each other explicit messages to communicate and synchronize. All these parallel operations are performed via calls to message-passing interfaces that are directly responsible for interfacing with the system infrastructure connecting the actual processors together. The tutorial covers the following features of the MPI-1.2 standard: point-to-point communication (blocking and non-blocking operations), collective communication, primitive datatypes, and language bindings for C/C++ and Fortran.
Although most of the functionality required for parallel programming on clusters can be achieved using the basic features of the MPI-1.2 standard, much more flexibility and performance can be obtained using some of the advanced features of the MPI-2 standard. These interfaces are designed not only to benefit from performance-enhancing features that may be available in the parallel programming environment, but also to simplify the programming involved. The tutorial therefore also covers advanced MPI-1.2 functionality and features of the MPI-2 standard. While the MPI-1.2 standard relied on a static process model, many applications require dynamic process management, and this support was added in the MPI-2 standard; we present different approaches to using the dynamic process management capabilities it provides. Simple MPI applications also rely on a single process reading and writing data through files. To take advantage of emerging parallel file systems (e.g., Lustre, PVFS, Panasas), the MPI-2 standard provides higher-level abstractions for parallel I/O. The tutorial presents the relevant APIs, with illustrative examples, describing the features available for accessing files in parallel using derived datatypes. |
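As a small illustration of the MPI-1.2 features named above, here is a minimal sketch in C combining blocking point-to-point communication with a collective operation; the message contents and program structure are arbitrary examples.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[]) {
        int rank, size, token = 0, sum = 0;
        MPI_Status status;

        MPI_Init(&argc, &argv);                  /* start the MPI runtime  */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* this process's id      */
        MPI_Comm_size(MPI_COMM_WORLD, &size);    /* total number of tasks  */

        /* Blocking point-to-point: rank 0 sends an integer to rank 1. */
        if (rank == 0 && size > 1) {
            token = 42;
            MPI_Send(&token, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&token, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
        }

        /* Collective communication: every process obtains the sum of all ranks. */
        MPI_Allreduce(&rank, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

        printf("rank %d of %d: token=%d, sum of ranks=%d\n", rank, size, token, sum);

        MPI_Finalize();
        return 0;
    }

Compiled with an MPI wrapper compiler such as mpicc and launched with mpirun, every process runs the same program and branches on its rank.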
|
Outline |
The tutorial includes formal lectures, programming examples from real-world applications, and informal discussions. This tutorial is intended for any application developer or designer involved in developing cluster applications.
|
|
Schedule |
Half-day morning. |
|
|
||
IV |
||
Title |
||
Presenter(s) |
Ying Zhang, Shirley Moore |
|
Overview |
SvPablo is a performance analysis and visualization system for both sequential and parallel platforms that provides a graphical environment for instrumenting scientific applications written in C and Fortran and for analyzing their performance at the source code level. SvPablo supports both interactive and automatic instrumentation for C, Fortran 77, and Fortran 90 programs. During execution of the instrumented code, the SvPablo library captures and computes statistical performance data for each instrumented construct on each processor. Because it maintains only statistical data, rather than detailed event traces, the SvPablo library can characterize the performance of programs that run for days and on hundreds of processors. SvPablo captures both software performance data (such as duration and number of calls) and hardware metrics (such as floating-point operations, cycles, and cache misses) at run time. At the end of execution, the SvPablo library generates summary files for each processor or MPI task. The per-task performance files can be merged into one performance file post-mortem via a utility program. This file can then be loaded and displayed in the SvPablo GUI. Performance data are presented in the SvPablo GUI via a hierarchy of color-coded performance displays, including a high-level procedure profile and detailed line-level statistics. The GUI provides a convenient means for the user to identify high-level bottlenecks (e.g., procedures) and then explore them in increasing levels of detail (e.g., identifying a specific cause of poor performance at a source code line executed on one of many processors). In addition, when the program is executed on different numbers of processors, SvPablo's graphical scalability analysis tool calculates scalability efficiencies for each instrumented construct based on the performance files for each execution. Via a colored display and line graph, it presents the overall scaling trends and limits of an application. It also helps users identify the code sections with the best or worst scalability and discover the key areas in the application for optimization, which is extremely important when porting applications to large-scale clusters.
KOJAK is a suite of performance tools that collect and analyze performance data from high performance applications written in Fortran 77/90/95 or C/C++. Performance data are collected automatically using a combination of source code annotations or binary instrumentation and hardware counters. The analysis tools use pattern recognition to convert the raw performance data into information about performance bottlenecks relevant to developers. Once profiling and scalability analysis have been carried out using a profiling tool such as SvPablo, KOJAK can be configured to automatically instrument problem areas of the application code to generate a detailed trace file using the EPILOG library. The event traces generated by EPILOG capture MPI point-to-point and collective communication as well as OpenMP parallelism change, parallel constructs, and synchronization. In addition, data from hardware counters accessed using the PAPI library can be recorded in the event traces. KOJAK's EXPERT tool is an automatic trace analyzer that attempts to identify specific performance problems. Internally, EXPERT represents performance problems in the form of execution patterns that model inefficient behavior. These patterns are used during the analysis process to recognize and quantify inefficient behavior in the application.
Each pattern calculates a (call path, location) matrix containing the time spent on a specific behavior in a particular (call path, location) pair, where a location is a process or thread. Thus, EXPERT maps the (performance problem, call path, location) space onto the time spent on a particular performance problem while the program was executing in a particular call path at a particular location. After the analysis has finished, the mapping is written to a file and can be viewed using the CUBE display tool. The CUBE display consists of three coupled tree browsers, representing the metric, the program, and the location dimensions. At any given time, two nodes are selected, one in the metric tree and one in the call tree. Each node is labeled with a severity value. A value shown in the metric tree represents the sum of a particular metric for the entire program, that is, across all call paths and all locations. A value shown in the call tree represents the sum of the selected metric across all locations for a particular call path. A value shown in the location tree represents the selected metric for the selected call path and a particular location. To help identify metric/resource combinations with a high severity, values are ranked using colors. Thus, the CUBE display visually identifies performance bottlenecks that have been detected by EXPERT and relates them to specific source code and process/thread locations. Recent work on KOJAK has used a rapid prototyping capability written in Python to define new types of patterns, such as a specialized pattern for detecting communication bottlenecks in wavefront algorithms.
SvPablo and KOJAK have been used in the performance analysis of large scientific applications on major parallel systems, including Linux clusters, IBM SP, NEC SX, HP/Compaq Alpha, and others. The applications studied include DOE SciDAC applications being analyzed by the Performance Evaluation Research Center (PERC). This tutorial will provide an overview of SvPablo, introduce tool usage with an emphasis on scalability analysis, and present results obtained using SvPablo in the scalability analysis of a scientific application on Linux clusters (up to 1024 processors). It will also introduce a new SvPablo feature, temperature measurement and profiling, which helps users identify the "hot spots" in their applications that may present problems for the system and affect the system's state of health. Following the SvPablo presentation, the tutorial will provide an overview of KOJAK, demonstrate how profile data obtained using SvPablo can be used to select portions of source code for EPILOG instrumentation, and explain and demonstrate the use of EXPERT and CUBE. Finally, instruction will be provided on how to prototype new EXPERT analysis patterns. |
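The hardware metrics mentioned above (floating-point operations, cycles, cache misses) come from hardware counters accessed through the PAPI library. SvPablo and KOJAK gather them automatically during instrumentation; purely as a hedged illustration of what those counters are, here is a minimal sketch in C that reads two PAPI preset events around a loop. The loop and the event choices are arbitrary, and event availability depends on the processor.

    #include <papi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
        int event_set = PAPI_NULL;
        long long counts[2];
        double x = 0.0;

        /* Initialize the PAPI library and create an event set. */
        if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) exit(1);
        if (PAPI_create_eventset(&event_set) != PAPI_OK) exit(1);

        /* Count total cycles and floating-point operations (preset events;
           availability depends on the processor). */
        PAPI_add_event(event_set, PAPI_TOT_CYC);
        PAPI_add_event(event_set, PAPI_FP_OPS);

        PAPI_start(event_set);
        for (int i = 1; i <= 1000000; i++)   /* region of interest */
            x += 1.0 / i;
        PAPI_stop(event_set, counts);

        printf("cycles=%lld fp_ops=%lld (x=%f)\n", counts[0], counts[1], x);
        return 0;
    }

Linking against PAPI and running this prints the raw counts that tools such as SvPablo aggregate per instrumented construct.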
|
Outline |
Outline of tutorial:
|
|
Schedule |
Half-day morning. |
|
|
||
V |
||
Title |
||
Presenter(s) |
David Nagle |
|
Overview |
The last few years have seen significant advances in cluster-based storage, with many new systems embracing object-based storage to provide the scalability, performance, and fault tolerance necessary to meet the demands of cluster applications. Products adopting the object model include those from Panasas, Lustre, Ibrix, Centera, and Isilon. This tutorial will present the fundamentals of object-based storage, including both the underlying architectural principles and how various products have adapted those principles in their designs. The tutorial will begin with an overview of the object-based storage device (OSD) interface as defined by the ANSI/T10 standard. Topics will include the object model, the OSD command set, and OSD security. We will then describe the decoupled data/metadata storage architecture commonly found in cluster storage systems and how the OSD interface, security model, networking, and RAID play critical roles in the performance and fault tolerance of these systems. Finally, we will perform an in-depth comparison of the various object-based storage systems available today. |
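Purely as a conceptual sketch of the decoupled data/metadata architecture described above (this is not the T10 OSD command set or any vendor's API; every name and value below is hypothetical and the device access is simulated), the control and data paths can be separated as in the following C program: one metadata-server round trip yields an object layout and a capability, after which data moves directly between client and OSD.

    /* Conceptual sketch only: decoupled data/metadata path of object-based
     * storage. All names are hypothetical; real systems speak the ANSI/T10
     * OSD command set to the devices. */
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    typedef struct {
        uint64_t object_id;       /* object identifier on the OSD              */
        int      osd_id;          /* which device holds this data (simplified) */
        uint8_t  capability[20];  /* credential issued by the metadata server,
                                     verified by the OSD on every command      */
    } layout_t;

    /* Step 1 (simulated): one metadata-server round trip per open. */
    static layout_t mds_open(const char *path) {
        layout_t lo = { .object_id = 0x1001, .osd_id = 3 };
        memset(lo.capability, 0xAB, sizeof lo.capability);   /* fake credential */
        printf("MDS: resolved %s -> object 0x%llx on OSD %d\n",
               path, (unsigned long long)lo.object_id, lo.osd_id);
        return lo;
    }

    /* Step 2 (simulated): data moves directly between client and OSD,
     * bypassing the metadata server entirely. */
    static size_t osd_read(const layout_t *lo, void *buf, size_t len, uint64_t off) {
        printf("OSD %d: READ object 0x%llx, %zu bytes at offset %llu\n",
               lo->osd_id, (unsigned long long)lo->object_id,
               len, (unsigned long long)off);
        memset(buf, 0, len);   /* stand-in for the actual data transfer */
        return len;
    }

    int main(void) {
        char buf[4096];
        layout_t lo = mds_open("/scratch/results.dat");
        osd_read(&lo, buf, sizeof buf, 0);
        return 0;
    }

In a real system the capability is cryptographically bound to the object and checked by the OSD on every command, which is what allows clients to bypass the metadata server on the data path.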
|
Outline |
|
|
Schedule |
Half-day afternoon. |
|
|
||
VI |
||
Title |
||
Presenter(s) |
Eric Kretzer, Dan Lapine, Peter Enstrom |
|
Overview |
Torque is a resource manager providing control over batch jobs and distributed nodes. Maui is an advanced scheduler commonly used in conjunction with Torque and other resource managers. Torque continues to offer features that rival commercially available packages, including support for scaling to multi-teraflop clusters. Maui offers a variety of scheduling policies, including priority escalation, advance reservations, fairshare, and backfill. The Torque/Maui tandem can maximize a cluster's overall utilization at no cost, making it appealing to the majority of cluster administrators.
This tutorial aims to introduce participants to the resource manager, Torque, and its accompanying advanced scheduler, Maui. The tutorial will provide a foundation in the basics of configuring the server and client portions of Torque. Building on that foundation, we will then discuss in detail how to leverage the advanced capabilities of Maui to enforce specific resource policies, such as ensuring that a particular set of users is guaranteed a specific set of resources. |
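As a concrete illustration of the resource requests that Torque manages and Maui schedules, here is a small job script using standard PBS directives; the job name, resource values, and launch command are examples only.

    #!/bin/bash
    #PBS -N hello_mpi                 # job name
    #PBS -l nodes=2:ppn=4             # request 2 nodes with 4 processors each
    #PBS -l walltime=00:30:00         # maximum wall-clock time
    #PBS -j oe                        # merge stdout and stderr into one file

    cd $PBS_O_WORKDIR                 # directory the job was submitted from
    mpirun -np 8 ./hello_mpi          # run across the 8 allocated processors

Submitted with qsub, the job waits in the queue until Maui's policies (priority, fairshare, backfill, advance reservations) decide when and where it runs; those policies are configured on the Maui side, independently of the job script.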
|
Outline |
|
|
Schedule |
Half-day afternoon. |
|
||
VII |
||
Title |
Productive Application Development with the Intel Cluster Toolkit |
|
Presenters |
Mario Deilmann |
|
Overview |
In the introduction, we provide an overview of prerequisites, terminology, performance issues, and the nature of a parallel development approach. The main part of the tutorial is an intermediate example of how the performance of an MPI application can be optimized using Intel MPI and the Intel® Trace Collector. Advanced topics include advanced MPI tracing and more sophisticated functionality such as binary instrumentation. A short introduction to the Intel® Cluster Math Kernel Library and the Intel® MPI Benchmarks follows. Finally, we take a look at the upcoming version of the Intel® Trace Analyzer, the first version with a completely new software design and a new graphical user interface. This tutorial aims to convey detailed technical information along with a basic understanding of how different factors contribute to overall application performance and how to address them with the different tools of the Intel® Cluster Toolkit. |
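The trace-then-analyze workflow the tutorial walks through can be sketched as a short command sequence. Note that the command names below (mpiicc with its -trace link option, mpirun, and the traceanalyzer viewer for the generated .stf trace file) reflect later versions of the Intel tools and are assumptions here; the exact invocations in Intel Cluster Toolkit 1.0 may differ.

    # Build the MPI application with Intel MPI and link in the Trace Collector
    mpiicc -trace -O2 -o myapp myapp.c

    # Run as usual; each process records its communication events into a trace
    mpirun -np 4 ./myapp          # produces an .stf structured trace file

    # Inspect communication timelines and statistics in the Trace Analyzer GUI
    traceanalyzer myapp.stf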
|
Outline |
This tutorial introduces the Intel Cluster Toolkit (ICT) 1.0, a toolkit for cluster programmers. The toolkit addresses the main stages of the parallel development process, enabling developers to achieve optimized performance for Intel processor-based cluster systems. The tutorial introduces the tools that address these stages; Figure 1 illustrates how the different components interact with a user's application.
Note: the Intel Trace Analyzer was formerly known as Vampir, and the Intel MPI Benchmarks were formerly known as PMB. |
|
Schedule |
Half-day afternoon. |
|
|
||


