|
Next
Previous
Contents
- Q:
RAID can help protect me against data loss. But how can I also
ensure that the system is up as long as possible, and not prone
to breakdown? Ideally, I want a system that is up 24 hours a
day, 7 days a week, 365 days a year.
A:
High-Availability is difficult and expensive. The harder
you try to make a system be fault tolerant, the harder
and more expensive it gets. The following hints, tips,
ideas and unsubstantiated rumors may help you with this
quest.
- IDE disks can fail in such a way that the failed disk
on an IDE ribbon can also prevent the good disk on the
same ribbon from responding, thus making it look as
if two disks have failed. Since RAID does not
protect against two-disk failures, one should either
put only one disk on an IDE cable, or if there are two
disks, they should belong to different RAID sets.
- SCSI disks can fail in such a way that the failed disk
on a SCSI chain can prevent any device on the chain
from being accessed. The failure mode involves a
short of the common (shared) device ready pin;
since this pin is shared, no arbitration can occur
until the short is removed. Thus, no two disks on the
same SCSI chain should belong to the same RAID array.
- Similar remarks apply to the disk controllers.
Don't load up the channels on one controller; use
multiple controllers.
- Don't use the same brand or model number for all of
the disks. It is not uncommon for severe electrical
storms to take out two or more disks. (Yes, we
all use surge suppressors, but these are not perfect
either). Heat & poor ventilation of the disk
enclosure are other disk killers. Cheap disks
often run hot.
Using different brands of disk & controller
decreases the likelihood that whatever took out one disk
(heat, physical shock, vibration, electrical surge)
will also damage the others on the same date.
- To guard against controller or CPU failure,
it should be possible to build a SCSI disk enclosure
that is "twin-tailed": i.e. is connected to two
computers. One computer will mount the file-systems
read-write, while the second computer will mount them
read-only, and act as a hot spare. When the hot-spare
is able to determine that the master has failed (e.g.
through a watchdog), it will cut the power to the
master (to make sure that it's really off), and then
fsck & remount read-write. If anyone gets
this working, let me know.
- Always use an UPS, and perform clean shutdowns.
Although an unclean shutdown may not damage the disks,
running ckraid on even small-ish arrays is painfully
slow. You want to avoid running ckraid as much as
possible. Or you can hack on the kernel and get the
hot-reconstruction code debugged ...
- SCSI cables are well-known to be very temperamental
creatures, and prone to cause all sorts of problems.
Use the highest quality cabling that you can find for
sale. Use e.g. bubble-wrap to make sure that ribbon
cables to not get too close to one another and
cross-talk. Rigorously observe cable-length
restrictions.
- Take a look at SSI (Serial Storage Architecture).
Although it is rather expensive, it is rumored
to be less prone to the failure modes that SCSI
exhibits.
- Enjoy yourself, its later than you think.
Next
Previous
Contents
|