Author Dreicer, Jared Samuel
Title High performance computing spare replacement hardware fault tolerance
ISBN 0496774700
Description 241 p
Note Source: Dissertation Abstracts International, Volume: 65-04, Section: B, page: 1950
Chair: Arthur B. Maccabe
Thesis (Ph.D.)--The University of New Mexico, 2004
The use of spare replacement hardware and checkpoint rollback software fault tolerance on a multiple-instruction-multiple-data (MIMD) architecture was investigated. New performance results are presented for spare node replacement after a simulated failure and for migration onto a spare node prior to a simulated failure. Spare replacement and migration onto a spare were implemented for application parameter characterization runs on 32 nodes and scaling runs from 8 to 128 nodes on a MIMD cluster. The CUMULVS system was used for fault tolerance and control features. We evaluated the spare node replacement and migration onto spare node approaches using runtime to quantify performance and demonstrate the viability of the approaches.
The principal new results of this study are that: (1) spare node replacement provides good performance at a small cost in runtime; (2) migration onto a spare provides even better performance at a small cost in runtime; and (3) a runtime breakeven point dependent on system scale is identified for both approaches relative to traditional approaches.
Results were quantified for empirical studies on 8 to 128 nodes. These studies investigated applications characterized by various computation-communication ratios, work patterns (steady, accumulate, disperse, hill, and hole), and various topologies (ring, one-to-all, and near neighbor). The decrease in the cost of commodity hardware enables strategies that can efficiently use a spare as a general means of dynamic redundancy. The gain from these approaches is a reduced mean time to repair (MTTR), because immediate access to a spare decreases recovery time. Checkpoint and rollback overhead is still incurred, but for migration onto a spare, checkpoint overhead can be dramatically reduced. The scale of distributed memory MIMD architectures continues to grow as a result of user requests for greater performance, users' increased computational requirements for finer resolution, and the decreasing cost of commodity hardware. However, these larger architectures experience an increasing frequency of component failures and subsequent loss of availability. Fault tolerance and availability are therefore important issues for high performance computing systems executing long-running applications. Our research indicates that utilizing spare replacement enhances the scalability and availability of MIMD architectures and that further research will pay important dividends.
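The link the abstract draws between a shorter MTTR and better availability can be illustrated with the standard steady-state relation A = MTTF / (MTTF + MTTR). The following is a minimal sketch, not the dissertation's model: the function names, the assumption of independent exponentially distributed node failures, and all numeric values are hypothetical and chosen only for illustration.

```python
def system_mttf(node_mttf_hours: float, nodes: int) -> float:
    """Mean time to failure of a system that fails when any one node fails.

    Assumption (not from the source): node failures are independent and
    exponentially distributed, so the system failure rate is the sum of
    the node failure rates.
    """
    if node_mttf_hours <= 0 or nodes <= 0:
        raise ValueError("node_mttf_hours and nodes must be positive")
    return node_mttf_hours / nodes


def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability A = MTTF / (MTTF + MTTR)."""
    if mttf_hours <= 0 or mttr_hours < 0:
        raise ValueError("mttf_hours must be positive and mttr_hours non-negative")
    return mttf_hours / (mttf_hours + mttr_hours)


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    node_mttf = 10_000.0      # hours between failures for a single node
    repair_no_spare = 4.0     # hours to repair or replace a node manually
    repair_with_spare = 0.1   # hours to switch over to a ready spare

    for nodes in (8, 32, 128):
        mttf = system_mttf(node_mttf, nodes)
        a_no_spare = availability(mttf, repair_no_spare)
        a_spare = availability(mttf, repair_with_spare)
        print(f"{nodes:4d} nodes: no spare {a_no_spare:.4f}, with spare {a_spare:.4f}")
```

Under these assumptions the system MTTF shrinks as the node count grows, so a fixed MTTR costs proportionally more availability at larger scale, which is consistent with the abstract's point that a readily available spare matters more as systems grow.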
School code: 0142
Host Item Dissertation Abstracts International 65-04B
Subject Computer Science
Alt Author The University of New Mexico