Advances in the Scyld Beowulf System: The Third Generation
  Donald Becker
  Scyld Computing Corporation
  becker@scyld.com
  Presented with MagicPoint

Car Story
  A recently purchased E30 325is
  Bill Carlson's garage...
  Loose ball joint that couldn't be removed...
  Solution?
    If it can't be fixed with a hammer...
    Or a very large wrench...
    Use a cut-off wheel

What are we trying to Implement?
  Just clusters? No
  Scalable systems.
  Why? Because everything is now a "cluster"

Broader Approach
  Cluster:
    Independent computers
    Combined into a unified system
    Through software and networking
  Cellular Multiprocessor:
    Coupled computers run as subsystem "cells"
    Presented as a unified system
    Through software and interconnect

Previous Generation Solutions
  How have cluster problems been addressed in the past?
    Full OS installation on all nodes
    Supports user login on any node
    Administration by scripts and replicated remote commands
    Multiple consistency and synchronization tools
    Unification with a limited GUI

Second Generation Solution -- Scyld Beowulf "2000"
  Full OS installation on a single "master"
  Compute nodes designed as a computational resource
  Multistage boot
  Single point administration installation and updates
  BProc-based single process space view
  Centralized monitoring and job control

Why Change?
  Previous generation was a well-designed innovation
  BUT
  New functionality was not a one-to-one replacement
  Users resist change
  Too much focus on scalable single applications
    Increasing use of parametric execution
    Shared use of compute nodes
    Used for balancing and monitoring application servers
  Single point of failure concerns
    Single master provided all services

Third Generation Scyld System
  Multiple masters
    Shared or isolated administrative domains
    Multiple servers for replication or redundancy
  Direct PXE boot
    Legacy BeoBoot protocol for existing installations
  Abstracted VMA services
    "Pluggable" memory region transport
    Use of underlying file system
  Continuum of file system support
  Multiple state management systems
  Several different process initiation/control mechanisms

Less Exciting Third Generation Features
  Range of configuration descriptions
    Single text file for simple deployment
    Directory of node definitions
    SQL database
  Specific, descriptive error reporting
  Extensive performance counters
  Nodes log system messages to masters

What has changed in the world?
  Ubiquitous PXE network boot
  Multiple instruction set architectures
    IA64®, Opteron®, perhaps even Power-N
  Distributed file systems
    Match application semantic needs
    More candidates
    Harder choices
  More SAN storage options
  IPMI
  Experience with previous solutions

Lessons Learned ("Things you only talk about in retrospect")
  BeoBoot
    BeoBoot is just converting everything to a network boot
    Linux used in stage 1 for its
      Extensive network driver set
      Reliable TCP
    PXE is an obvious replacement
  BProc
    BProc combines separate concepts that should be isolated
      Directed process migration
      Unified process table
      Library copying
      Node state
      Cluster membership / node failure detection

Other Lessons Learned ("What were we thinking?")
  Never deploy multicast as default
    Lossy switches
    Flawed host implementations
    Undebuggable performance loss
    No native support on non-Ethernet systems
    Incompatible with mainstream advances
  Myrinet-only boot was spiffy, but pointless
    Boot discovery awkward
    Diagnostics problematic
  Do not put node assignment in the GUI
  Support everything, e.g.
    PERL, Java, and rexec on clients
  Provide examples

Other Lessons Learned ("What were we thinking?")
  Don't mix process control with Node state ("Booting")

Things we will not change
  Zero-base node boot
    Diskless administration
    No configuration on nodes
  Simple compute nodes
  Full Linux install on master
  BeoNSS: Cluster-specific Name Service
    Scalability
    Performance
    ...but we now provide a function for memorizing users
  MPI and PVM integration
    Direct execution (no mpirun)
    Scheduling hooks
    Providing an internal queuing system

Platform Changes

Why PXE Ethernet Boot is Good
  Implementation driven by broader market
    Vendors are highly motivated to implement it
    Broad NRE recovery results in low cost
  It is everywhere
    Ubiquitous on server systems
    Common on other systems
    Trivial cost to add to existing or low-end system
  It is a defined standard
    Protocol anticipates
      Multiple servers
      Multiple client architectures
    Common implementation flaws can be overcome
    Ugliness can be forgotten after boot

Cluster PXE requires great care
  Common implementation
    ISC DHCP daemon
    TFTP server
    pxe-linux or elilo
  This combination results in
    Bad scalability
    Many failure points
    No failure traceability / reportability
    DHCP boot rather than a true PXE service
    Poor control of node assignment
    Precludes multicast-TFTP

Integrated PXE server
  Issue: Unreliable boots
    Designed for workstations, not clusters
    PXE clients halt rather than reboot on timeouts
    TFTP's primitive flow control results in bandwidth capture
  Key element: loss-based flow control
    Slow booting clients to avoid fatal timeout
    Defer initial response and reply to discovery
    Delay
  Combined
    Node assignment
    Node state update
    Boot information service
    Boot file service (TFTP)

IPMI -- Intelligent Platform Management Interface
  What do we get?
    Power control independent of OS
    BIOS setup over Ethernet
    Boot process monitoring
    Consistent hardware monitoring
  Why do we care?
    Standard
    Inexpensive ($23+)
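
Sketch: deferring discovery replies

The "Integrated PXE server" slide proposes slowing booting clients by deferring the reply to discovery, so that a node never starts a file transfer it cannot finish before a fatal timeout. The following is a minimal sketch of that idea only, not the Scyld implementation. The class name `BootAdmission`, the method `defer_for`, and the parameters `max_concurrent` and `transfer_seconds` are hypothetical, and the numbers are illustrative assumptions. It models the boot file service as a fixed number of transfer slots and holds each discovery reply until a slot is free.

```python
import heapq


class BootAdmission:
    """Hypothetical helper: decide how long to hold a discovery reply."""

    def __init__(self, max_concurrent=4, transfer_seconds=2.0):
        # Both values are illustrative assumptions, not measured figures.
        self.transfer_seconds = transfer_seconds
        # One entry per transfer slot: the time at which that slot is free.
        self._free_at = [0.0] * max_concurrent
        heapq.heapify(self._free_at)

    def defer_for(self, now):
        """Return seconds to wait before answering a request seen at `now`."""
        earliest = heapq.heappop(self._free_at)
        # The transfer starts once the slot is free, never before the request.
        start = max(now, earliest)
        heapq.heappush(self._free_at, start + self.transfer_seconds)
        return start - now


if __name__ == "__main__":
    admission = BootAdmission(max_concurrent=2, transfer_seconds=2.0)
    # Five nodes power on at the same instant (t = 0).
    print([admission.defer_for(0.0) for _ in range(5)])
    # prints [0.0, 0.0, 2.0, 2.0, 4.0]
```

With two slots, the first two nodes are answered immediately and the rest are staggered in pairs. A real server would also have to keep each delay below the client's own discovery timeout, which this sketch does not model.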