|
2006 Tutorials
I Cancelled
II Advanced
Cluster and Grid Management with Moab Cluster Suite
III Towards Highly Available, Scalable, and Secure Computer
Clusters with HA-OSCAR
IV Cancelled
V Object-Based Cluster Storage
VI Resource Management Using SLURM
VII Machine Room Design
VIII Intel Cluster Tools
IX HPC and MPI: Open MPI Tuning
X OpenIB
Last updated: April 26, 2006
I
|
Title |
|
Cancelled |
Presenters |
|
NA |
Overview |
|
NA
|
Outline |
|
NA
|
Schedule |
|
NA |
|
|
II
|
Title |
|
Advanced
Cluster and Grid Management with Moab Cluster Suite |
Presenter(s) |
|
Dave Jackson |
Overview |
|
Organizations are demanding more out of their
clusters. Expectations
for efficiency are rising. Management and reporting requirements
are
becoming more advanced. In every way, clusters are expected to
be more
professionally managed, more flexible, and better performing
from day 1.
In this tutorial, we will overview the high-level architecture
of the
Moab Workload Manager and discuss why it has found so much success
amongst TOP500 cluster systems. We will discuss in detail many
of the
more common requirements and vexing problems of the modern cluster
and
how Moab facilities can help manage these issues effectively.
Issues
will include general optimization, end-user empowerment, delivering
targeted levels of service, improving overall cluster availability
and uptime, and supporting dynamic services and automated cluster
re-provisioning.
The tutorial will also cover full system policy integration,
allowing orchestration and management of not just compute resources
but also license, storage, and network resources toward the
objectives most important to the organization. This includes importing
real-time
information from other services into high-level scheduling systems
and
allowing the cluster management system to control and export
information
back into other local and grid services.
Finally, this tutorial will cover the use of Moab in creating
powerful
and flexible grids connecting multiple clusters into a more manageable
entity. We will describe the creation of management and information
grids in which analysis, control, and reporting are centralized
but jobs
remain local within each cluster, as well as the more common job
grid in
which job and data migration is automatically managed. Lastly,
we will
discuss creation and management of virtualized and on-demand
clusters
and how Moab allows sites to either host or utilize these dynamically
configured and dynamically allocated resources. |
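By way of illustration only (the command names below are standard Moab command-line tools, but options and output vary by release and are not drawn from the tutorial materials), day-to-day interaction with Moab is largely through its commands:

    showq              # display active, idle, and blocked workload
    checkjob <jobid>   # diagnose why a particular job is or is not running
    showres            # list current reservations
    mdiag -n           # report node state and configuration as Moab sees it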
Outline |
|
8:30 Introduction
- Attributes of a Basic Cluster
- Evolution of a Cluster Environment over Time
- Sources of Complexity and Waste within a Matured Cluster
Environment
9:00 Moab Workload Manager Architecture
- High Level Overview of Job, Node, and User Management
Facilities
- Generalized Resource Management Interfaces
9:30 General Cluster Management Tasks
- Optimizing Cluster Performance and Availability
- Managing Politics and Providing Service Level Agreements
- Handling Transient Needs with Reservations
- Off-loading Staff through User Empowerment
- Orchestrating Compute, Network and Storage Usage
- Diagnosing Job, Node, and Policy Issues
10:00 Break
10:30 Advanced Cluster Management Tasks
- Integrating Moab Management Facilities with Peer Services
- Capacity Planning and Reporting with Historical Statistics
- Customizing Management with Generic Metrics, Properties
and Consumable Resources
- Enabling Charging and Allocation Management Facilities
- Automatically Responding to Arbitrary Events with Triggers
11:15 Grid Management
- Enabling a Grid in 60 Seconds
- Controlling Information Access and Job and Data Flow
Policies
- Credential and Security Management
- Enabling Information Services
- Reducing Staff Overhead with Control/Management Grids
- Improving Statistics with Information Grids
11:40 On Demand - Utility Computing
- Architectural Overview
- Enabling Moab to Seamlessly Utilize an Internal/External
Resource Hosting Center
- Becoming an On-Demand Hosting Center
11:55 Questions and Answers
|
Schedule |
|
Half-day morning. |
|
|
III
|
Title |
|
Towards Highly
Available, Scalable, and Secure Computer Clusters with HA-OSCAR |
Presenter(s) |
|
Ibrahim Haddad, Box Leangsuksun, Stephen L.
Scott |
Overview |
|
March 2004 was a major milestone for the HA-OSCAR
Working Group. It marked the announcement of the first public
release of the HA-OSCAR software package. HA-OSCAR is an Open
Source project that aims to provide the combined power of high
availability and high-performance computing. HA-OSCAR enhances a
Beowulf cluster system for mission-critical applications with
various high-availability mechanisms, such as component redundancy
to eliminate single points of failure, self-healing, and failure
detection and recovery, in addition to supporting automatic
failover and fail-back.
The first release (version 1.0) supports new high availability
capabilities for Linux Beowulf clusters based on the OSCAR
3.0 release from the Open Cluster Group. In this release
of HA-OSCAR, we provide an installation wizard graphical
user interface and a web-based administration tool, which
allow intuitive creation and configuration of a multi-head
Beowulf cluster. In addition, we have included a default
set of monitoring services to ensure that critical services,
hardware components, and important cluster resources are
always available at the control node. HA-OSCAR also supports
new tailored services that can be configured and added via
a WebMin-based HA-OSCAR administration tool.
This tutorial will address in detail all the design and
implementation issues related to building HA Linux Beowulf
clusters and using Linux and Open Source Software as the
base technology. The main focus of the tutorial is HA-OSCAR.
We will present the architecture of HA-OSCAR, review the new
features of the latest release, discuss how we implemented the
HA and security features, and discuss our experiments covering
modeling and testing of performance and availability on real
systems. |
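As a purely hypothetical sketch of the failover idea described above (this is illustration only, not HA-OSCAR code; the host address and takeover script are placeholders), a standby head node might poll a critical service on the primary and trigger a takeover after repeated failures:

    /* Hypothetical failure-detection loop; illustration only, not HA-OSCAR code. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>
    #include <sys/socket.h>

    /* Return 0 if a TCP connection to host:port succeeds, -1 otherwise. */
    static int service_alive(const char *host, int port)
    {
        struct sockaddr_in addr;
        int rc, fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0)
            return -1;
        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_port = htons(port);
        inet_pton(AF_INET, host, &addr.sin_addr);
        rc = connect(fd, (struct sockaddr *)&addr, sizeof(addr));
        close(fd);
        return rc;
    }

    int main(void)
    {
        const char *primary = "192.168.1.1";   /* primary head node (placeholder) */
        int failures = 0;

        for (;;) {
            if (service_alive(primary, 22) == 0) {
                failures = 0;                  /* primary is healthy */
            } else if (++failures >= 3) {      /* three misses in a row: fail over */
                /* A real HA stack would claim the service IP address, start the
                   cloned services, and notify the administrator here. */
                system("/usr/local/sbin/takeover.sh");   /* hypothetical script */
                break;
            }
            sleep(5);                          /* polling interval in seconds */
        }
        return 0;
    }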
Outline |
|
Introduction
- Introduction to Beowulf and HPC clusters
- Introduction to HA clusters
- Various levels of HA
- Linux: the commodity component of the cluster stack
- Software and hardware system architecture
Challenges in Designing and Prototyping HA/HPC
Clusters
- Booting the cluster
- Storage
- Building the disks
- Installing application servers
- Traffic distribution mechanisms
- Load balancing mechanisms
- Building redundancy at various levels in the cluster:
- Ethernet redundancy
- DHCP/TFTP/NTP/NFS servers’ redundancy
- Data redundancy using software RAID
- Automating network installations
- Automatic network RAID setups
- File systems for HA Linux clusters
OSCAR
- Introduction
- Cluster Computing Overview
- OSCAR - "The Beginning" - Overview / Strategy
- OSCAR Components (Functional areas)
- Core, Admin/Config, HPC Services
- Core Components: SIS, C3, Switcher, ODA, OPD
HA-OSCAR (40%)
- HA-OSCAR overview
- HA-OSCAR architecture and components
- HA-OSCAR comparison with Beowulf architecture
- HA features
- Multi-head builder and Self-configuration
- Monitoring
- Service monitoring
- Hardware monitoring
- Resource monitoring
- Self-healing and recovery mechanism
- Test environment
- Installation Steps
- Experiments
- Availability modeling, analysis, and uptime improvement
study between Beowulf and HA-OSCAR
- Test results
- Applications and feasibility studies
- Grid-enable HA cluster
- HA-OSCAR and Distributed Security Infrastructure integration
- HA-OSCAR and OpenMosix/LVS feasibility study
- Transparent Job Queue fault tolerance based on TORQUE
Demonstration
With four laptops running the latest research release of HA-OSCAR
Conclusion
- HA-OSCAR Roadmap
- Advanced research
- Questions and answers
|
Schedule |
|
Half-day morning. |
|
|
IV
|
Title |
|
Cancelled |
Presenter(s) |
|
N/A |
Overview |
|
N/A |
Outline |
|
N/A |
Schedule |
|
N/A |
|
|
V
|
Title |
|
Object-Based Cluster
Storage |
Presenter(s) |
|
David Nagle and Brent Welch |
Overview |
|
The last few years have seen significant advances
in cluster-based storage with new systems embracing object-based
storage to provide the scalability, performance and fault-tolerance
necessary to meet the demands of cluster applications. Products
adopting the object-model include Panasas, Lustre, and Centera.
This tutorial will present the fundamentals of object-based
storage, including the underlying architectural principles and
how various products have adapted those principles into their
product designs.
The tutorial will begin with an overview of the object-based
storage device (OSD) interface as defined by the ANSI/T10
standard. Topics will include the object-model, the OSD command
set, and OSD security. We will then describe the decoupled
data/metadata storage architecture commonly found in cluster
storage systems and how the OSD interface, security model,
networking and RAID play critical roles in the performance
and fault-tolerance of these systems. Finally, we will perform
an in-depth comparison of the various object-based storage
systems available today.
|
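To make the object model concrete, here is a small interface sketch in C. The names and signatures are invented for illustration and do not reproduce the ANSI/T10 OSD command set; the point is that data is addressed by partition and object identifiers rather than disk blocks, and that every command carries a capability issued by the metadata manager:

    /* Hypothetical OSD-style interface; invented names, not the ANSI/T10 commands. */
    #include <stdint.h>
    #include <stddef.h>

    typedef struct {
        uint64_t partition_id;   /* groups objects for management and security */
        uint64_t object_id;      /* flat identifier; the OSD maps it to blocks */
    } osd_object;

    typedef struct {
        uint8_t  key[20];        /* credential signed by the metadata manager */
        uint64_t permissions;    /* operations the holder is allowed to issue */
    } osd_capability;

    /* Create an object; space allocation is handled inside the OSD. */
    int osd_create(osd_object *obj, const osd_capability *cap);

    /* Read and write by byte offset within an object, not by block address. */
    int osd_read(const osd_object *obj, const osd_capability *cap,
                 uint64_t offset, size_t len, void *buf);
    int osd_write(const osd_object *obj, const osd_capability *cap,
                  uint64_t offset, size_t len, const void *buf);

    /* Attributes (size, timestamps, user metadata) are stored with the object. */
    int osd_get_attr(const osd_object *obj, const osd_capability *cap,
                     uint32_t attr_id, void *value, size_t len);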
Outline |
|
Available soon.
|
Schedule |
|
Half-day afternoon. |
|
|
VI
|
Title |
|
Resource Management
Using SLURM |
Presenter(s) |
|
Morris Jette |
Overview |
|
SLURM has become a very popular resource manager.
Some of its important characteristics include high scalability,
portability, security, and fault tolerance. It is also open
source and available under the GNU General Public License.
A complete build, installation and configuration of SLURM
will be performed. Attendees with a Linux laptop will be
able to do their own installation using the supplied CD.
SLURM can emulate a sizable Linux cluster, including a BlueGene
system, entirely within a laptop computer. This will allow
demonstration of realistic configuration and use issues during
the presentation.
More information about SLURM is available at http://www.llnl.gov/linux/slurm. |
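As a rough preview of the hands-on portion, a minimal SLURM installation needs little more than a short slurm.conf shared by all nodes plus the user commands. The host and partition names below are placeholders, and option names should be checked against the documentation for the release on the supplied CD:

    # Fragment of a minimal slurm.conf (placeholder host names)
    ControlMachine=head0
    NodeName=node[1-16] CPUs=2 State=UNKNOWN
    PartitionName=debug Nodes=node[1-16] Default=YES MaxTime=30 State=UP

    # Typical commands once slurmctld and slurmd are running
    sinfo                  # show partition and node state
    srun -N4 hostname      # launch a four-node parallel job
    squeue                 # list pending and running jobs
    scontrol show config   # dump the running configuration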
Outline |
|
- The role of a resource manager
- Design issues for resource management on large-scale clusters
- SLURM architecture
- SLURM commands and their use
- SLURM configuration
- Demonstration of SLURM build, installation, configuration
and use
|
Schedule |
|
Half-day afternoon. |
|
|
VII
|
Title |
|
Machine Room
Design |
Presenters |
|
Timothy Thomas |
Overview |
|
This tutorial will give the attendee an immersion
in the world of putting together new, modern data centers,
as well as some pointers on refurbishing old ones to deal with
the new world order. Topics will include the following: background
("theory") of power (demand and supply) and heat transport;
advanced design principles; example designs; overview of new
component product classes; practical direction in cooling techniques;
room layout, power distribution, cabling, and networking;
logic of interconnected systems (electrical, fire detection,
fire and security alarms, fire suppression); insurance issues;
working with facilities planning and physical plant departments;
budgeting; the nature and meaning of blueprints; in-the-trenches
experience from one or more recent projects. |
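As a simple example of the power-and-heat arithmetic the background material covers: essentially every watt a machine room draws must eventually be removed as heat. A 100 kW compute load therefore rejects roughly 100,000 W × 3.412 BTU/hr per watt, about 341,000 BTU/hr, or approximately 28 tons of cooling (1 ton = 12,000 BTU/hr), before allowing for lighting, people, or losses in the cooling plant itself.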
Outline |
|
Available soon.
|
Schedule |
|
Half-day afternoon. |
|
|
VIII
|
Title |
|
Intel Cluster
Tools |
Presenter(s) |
|
Werner Krotz-Vogel |
Overview |
|
This tutorial is about optimizing performance
with the Intel Cluster Toolkit on Linux clusters. Attendees
will learn how to use the Intel Trace Analyzer & Collector
for performance optimization of MPI applications. After an
in-depth introduction, attendees will work through several
tasks that guide them through the main features and
functionality of the tool set.
|
Outline |
|
Available soon.
|
Schedule |
|
Half-day afternoon. |
|
|
X
|
Title |
|
OpenIB |
Presenters |
|
Stephen Poole |
Overview |
|
Available soon. |
Outline |
|
Available soon.
|
Schedule |
|
Half-day morning. |
|
|
IX
|
Title |
|
HPC and MPI: Open MPI
Tuning |
Presenters |
|
Graham E. Fagg |
Overview |
|
This tutorial will familiarize the audience with some of the
advanced extensions of Open MPI. All examples and demos will
be given using the latest stable release of the Open MPI library,
a full-fledged MPI-2 library based on the cumulative
work of the LAM/MPI, LA-MPI, and FT-MPI implementations.
We will focus on two distinct topics:
- Maximizing performance in heterogeneous and multinetwork
clusters by using the dynamic component architecture of
the Open MPI implementation of the MPI standard.
- Increasing the overall performance of collective communications
based on multiple choices of runtime parameters. An
overview of the algorithmic choices of different
collective communications will be presented, as well as
their advantages and/or drawbacks. A quick introduction
to communication modeling will be provided in order to
allow those unfamiliar with the concept to gain deeper
understanding of the challenges we’re presenting.
|
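As a flavor of the hands-on sections, the sketch below is a minimal MPI program (plain standard MPI calls, nothing Open MPI specific) exercising one point-to-point exchange and one collective; with Open MPI, runtime behavior such as transport selection is then steered at launch time rather than at compile time:

    /* Minimal MPI example: one point-to-point exchange and one collective. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size, token = 0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Point-to-point: rank 0 sends a token to rank 1. */
        if (rank == 0 && size > 1) {
            token = 42;
            MPI_Send(&token, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&token, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }

        /* Collective: everyone receives the value broadcast from rank 0. */
        MPI_Bcast(&token, 1, MPI_INT, 0, MPI_COMM_WORLD);
        printf("rank %d of %d has token %d\n", rank, size, token);

        MPI_Finalize();
        return 0;
    }

For example, the point-to-point transports can be restricted at launch with something like "mpirun --mca btl tcp,self -np 4 ./a.out"; the exact component and parameter names depend on the Open MPI release, and ompi_info lists what is available.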
Outline |
|
- MPI background
- Generic nature of MPI interface
- Point-to-point communications
- Collective communications
- Hands-on installation of Open MPI on tutorial hardware
- The Open MPI component architecture
- Open MPI capabilities to be explored
- Dynamic loading of MCA components
- Runtime component selection
- Tunable runtime parameters
- In Open MPI, how to:
- Add an MCA module to an existing installation
- Allow automatic discovery of modules
- Use manual module selection
- Set runtime tunable parameters
- Select point-to-point module(s)
- Select collective module(s)
- Instructor demos; attendee hands-on demos
- Heterogeneous Environments
- MPI background
- Heterogeneous aspects of Open MPI implementation
- Multi-network (including striping across multiple
interfaces)
- Multiple endpoints (host architectures)
- Endianness
- Instructor demos; attendee hands-on demos
Collectives
- Relevant MPI background
- In the context of Open MPI, how to:
- Configure the point-to-point communications
- Configure collective communications
- Instructor demos; attendee hands-on demos
|
Schedule |
|
Half-day afternoon. |
|
|
|
|