- Q:
I've created a RAID-0 device on /dev/sda2 and /dev/sda3.
The device is a lot slower than a
single partition. Isn't md a pile of junk?
A:
To have a RAID-0 device running at full speed, you must
use partitions from different disks. Likewise, putting
the two halves of a mirror (RAID-1) on the same disk fails to
give you any protection whatsoever against disk failure.
- Q:
What's the use of having RAID-linear when RAID-0 will do the
same thing, but provide higher performance?
A:
It's not obvious that RAID-0 will always provide better
performance; in fact, in some cases, it could make things
worse.
The ext2fs file system scatters files all over a partition,
and it attempts to keep all of the blocks of a file
contiguous, basically in an attempt to prevent fragmentation.
Thus, ext2fs behaves "as if" there were a (variable-sized)
stripe per file. If there are several disks concatenated
into a single RAID-linear, this will result in files being
statistically distributed on each of the disks. Thus,
at least for ext2fs, RAID-linear will behave a lot like
RAID-0 with large stripe sizes. Conversely, RAID-0
with small stripe sizes can cause excessive disk activity
leading to severely degraded performance if several large files
are accessed simultaneously.
In many cases, RAID-0 can be an obvious win. For example,
imagine a large database file. Since ext2fs attempts to
cluster together all of the blocks of a file, chances
are good that it will end up on only one drive if RAID-linear
is used, but will get chopped into lots of stripes if RAID-0 is
used. Now imagine a number of (kernel) threads all trying
to access this database at random. Under RAID-linear, all
accesses would go to one disk, which would not be as efficient
as the parallel accesses that RAID-0 entails.
- Q:
How does RAID-0 handle a situation where the different stripe
partitions are different sizes? Are the stripes uniformly
distributed?
A:
To understand this, let's look at an example with three
partitions: one that is 50MB, one 90MB and one 125MB.
Let's call D0 the 50MB disk, D1 the 90MB disk and D2 the 125MB
disk. When you start the device, the driver calculates 'strip
zones'. In this case, it finds 3 zones, defined like this:
Z0 : (D0/D1/D2) 3 x 50 = 150MB total in this zone
Z1 : (D1/D2) 2 x 40 = 80MB total in this zone
Z2 : (D2) 125-50-40 = 35MB total in this zone.
You can see that the total size of the zones is the size of the
virtual device, but, depending on the zone, the striping is
different. Z2 is rather inefficient, since there's only one
disk.
Since ext2fs and most other Unix
file systems distribute files all over the disk, you
have a 35/265 = 13% chance that a file will end up
on Z2, and not get any of the benefits of striping.
(DOS tries to fill a disk from beginning to end, and thus,
the oldest files would end up on Z0. However, this
strategy leads to severe filesystem fragmentation,
which is why no one besides DOS does it this way.)
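The zone arithmetic above can be sketched in a few lines of Python.
This is only an illustration of the calculation, not the md driver's
actual code; the function name strip_zones and the tuple layout are
made up for this example.

```python
def strip_zones(sizes_mb):
    """Return a list of (disks_in_zone, depth_mb, zone_total_mb) tuples.

    Each zone stripes across every disk that still has unused space.
    """
    zones = []
    consumed = 0                  # MB already used on each remaining disk
    remaining = sorted(sizes_mb)  # smallest disk runs out first
    while remaining:
        depth = remaining[0] - consumed
        if depth > 0:
            zones.append((len(remaining), depth, len(remaining) * depth))
            consumed = remaining[0]
        remaining.pop(0)
    return zones


zones = strip_zones([50, 90, 125])
for i, (disks, depth, total) in enumerate(zones):
    print(f"Z{i}: {disks} x {depth} = {total}MB")

total_mb = sum(zone[2] for zone in zones)
print(f"total {total_mb}MB, share of last zone {zones[-1][2] / total_mb:.0%}")
```

For the 50MB/90MB/125MB example this yields the same three zones
(150MB, 80MB and 35MB) and the same 265MB total as the text.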
- Q:
I have some Brand X hard disks and a Brand Y controller,
and am considering using md.
Does it significantly increase the throughput?
Is the performance really noticeable?
A:
The answer depends on the configuration that you use.
- Linux MD RAID-0 and RAID-linear performance:
If the system is heavily loaded with lots of I/O,
statistically, some of it will go to one disk, and
some to the others. Thus, performance will improve
over a single large disk. The actual improvement
depends a lot on the actual data, stripe sizes, and
other factors. In a system with low I/O usage,
the performance is equal to that of a single disk.
- Linux MD RAID-1 (mirroring) read performance:
MD implements read balancing. That is, the RAID-1
code will alternate between each of the (two or more)
disks in the mirror, making alternate reads to each.
In a low-I/O situation, this won't change performance
at all: you will have to wait for one disk to complete
the read.
But, with two disks in a high-I/O environment,
this could as much as double the read performance,
since reads can be issued to each of the disks in parallel.
For N disks in the mirror, this could improve performance
N-fold.
- Linux MD RAID-1 (mirroring) write performance:
A write must complete on all of the disks
in the mirror, because a copy of the data
must be written to each of them.
Thus, performance will be roughly equal to the write
performance to a single disk.
- Linux MD RAID-4/5 read performance:
Statistically, a given block can be on any one of a number
of disk drives, and thus RAID-4/5 read performance is
a lot like that for RAID-0. It will depend on the data, the
stripe size, and the application. It will not be as good
as the read performance of a mirrored array.
- Linux MD RAID-4/5 write performance:
This will in general be considerably slower than that for
a single disk. This is because the parity must be written
out to one drive as well as the data to another. However,
in order to compute the new parity, the old parity and
the old data must be read first. The old data, new data and
old parity must all be XOR'ed together to determine the new
parity: this requires considerable CPU cycles in addition
to the numerous disk accesses.
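The RAID-4/5 parity update described above (old data, new data and
old parity XOR'ed together) can be illustrated with a small Python
sketch. The helper names and the two-byte "blocks" are invented for
the example; the real driver works on whole disk blocks inside the
kernel.

```python
def xor_blocks(*blocks):
    """XOR equal-length byte strings together, byte by byte."""
    result = bytearray(len(blocks[0]))
    for block in blocks:
        for i, value in enumerate(block):
            result[i] ^= value
    return bytes(result)


def updated_parity(old_parity, old_data, new_data):
    """New parity after one data block changes (read-modify-write)."""
    return xor_blocks(old_parity, old_data, new_data)


# Three tiny data blocks and their parity block.
d0, d1, d2 = b"\x0f\x10", b"\xf0\x01", b"\x33\x44"
parity = xor_blocks(d0, d1, d2)

# Overwrite d1: two reads (old data, old parity), two writes (new data, new parity).
new_d1 = b"\xaa\x55"
new_parity = updated_parity(parity, d1, new_d1)

# The shortcut matches a full recomputation over all data blocks ...
assert new_parity == xor_blocks(d0, new_d1, d2)
# ... and any single lost block can be rebuilt from the survivors.
assert xor_blocks(d0, d2, new_parity) == new_d1
```

The point of the shortcut is that only the changed data disk and the
parity disk are touched, but each of them must be read and then
written, which is where the extra disk accesses come from.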
- Q:
What RAID configuration should I use for optimal performance?
A:
Is the goal to maximize throughput, or to minimize latency?
There is no easy answer, as there are many factors that
affect performance:
- operating system - will one process/thread, or many
be performing disk access?
- application - is it accessing data in a
sequential fashion, or random access?
- file system - clusters files or spreads them out
(the ext2fs clusters together the blocks of a file,
and spreads out files)
- disk driver - number of blocks to read ahead
(this is a tunable parameter)
- CEC hardware - one drive controller, or many?
- hd controller - able to queue multiple requests or not?
Does it provide a cache?
- hard drive - buffer cache memory size -- is it big
enough to handle the write sizes and rate you want?
- physical platters - blocks per cylinder -- accessing
blocks on different cylinders will lead to seeks.
- Q:
What is the optimal RAID-5 configuration for performance?
A:
Since RAID-5 experiences an I/O load that is equally
distributed across several drives, the best performance will be
obtained when the RAID set is balanced by using
identical drives, identical controllers, and the
same (low) number of drives on each controller.
Note, however, that using identical components will
raise the probability of multiple simultaneous failures,
for example due to a sudden jolt or drop, overheating,
or a power surge during an electrical storm. Mixing
brands and models helps reduce this risk.
- Q:
What is the optimal block size for a RAID-4/5 array?
A:
When using the current (November 1997) RAID-4/5
implementation, it is strongly recommended that
the file system be created with mke2fs -b 4096
instead of the default 1024 byte filesystem block size.
This is because the current RAID-5 implementation
allocates one 4K memory page per disk block;
if a disk block were just 1K in size, then
75% of the memory which RAID-5 is allocating for
pending I/O would not be used. If the disk block
size matches the memory page size, then the
driver can (potentially) use all of the page.
Thus, for a filesystem with a 4096 block size as
opposed to a 1024 byte block size, the RAID driver
will potentially queue 4 times as much
pending I/O to the low level drivers without
allocating additional memory.
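The 75% and "4 times" figures follow directly from the ratio of
block size to page size. A minimal Python sketch of that arithmetic,
assuming the 4K x86 page size mentioned above (the function names are
made up for this example):

```python
PAGE_SIZE = 4096  # bytes; x86 page size as stated in the text


def unused_page_fraction(block_size, page_size=PAGE_SIZE):
    """Fraction of each pre-allocated page left empty by one disk block."""
    return 1 - block_size / page_size


def queued_bytes(pages, block_size):
    """Pending I/O that fits in a fixed number of pages, one block per page."""
    return pages * block_size


for block_size in (1024, 4096):
    print(f"{block_size}-byte blocks: "
          f"{unused_page_fraction(block_size):.0%} of each page unused, "
          f"{queued_bytes(100, block_size)} bytes queued in 100 pages")
```

The figure of 100 pages is arbitrary; only the ratio between the two
block sizes matters.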
Note: the above remarks do NOT apply to the Software
RAID-0/1/linear drivers.
Note: the statements about 4K memory page size apply to the
Intel x86 architecture. The page sizes on Alpha, Sparc, and other
CPUs are different; I believe they're 8K on Alpha/Sparc (????).
Adjust the above figures accordingly.
Note: if your file system has a lot of small
files (files less than 10KBytes in size), a considerable
fraction of the disk space might be wasted. This is
because the file system allocates disk space in multiples
of the block size. Allocating large blocks for small files
clearly results in a waste of disk space: thus, you may
want to stick to small block sizes, get a larger effective
storage capacity, and not worry about the "wasted" memory
due to the block-size/page-size mismatch.
Note: most ''typical'' systems do not have that many
small files. That is, although there might be thousands
of small files, this would lead to only some 10 to 100MB
wasted space, which is probably an acceptable tradeoff for
performance on a multi-gigabyte disk.
However, for news servers, there might be tens or hundreds
of thousands of small files. In such cases, the smaller
block size, and thus the improved storage capacity,
may be more important than the more efficient I/O
scheduling.
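To get a feel for the space trade-off, the slack left in the last
block of each file can be estimated as follows. This is a rough
Python sketch that only counts data blocks; it ignores inodes,
indirect blocks and directories, and the file sizes are invented for
illustration.

```python
def slack_bytes(file_sizes, block_size):
    """Bytes allocated but unused when space is handed out in whole blocks."""
    total = 0
    for size in file_sizes:
        blocks = -(-size // block_size)  # ceiling division
        total += blocks * block_size - size
    return total


sample_sizes = [300, 1500, 5000, 10000]  # hypothetical file sizes in bytes
for block_size in (1024, 4096):
    print(f"{block_size}-byte blocks waste "
          f"{slack_bytes(sample_sizes, block_size)} bytes")
```

Running the same function over the real file sizes on a news spool
would show whether the smaller block size is worth it for that
system.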
Note: there exists an experimental file system for Linux
which packs small files and file chunks onto a single block.
It apparently has some very positive performance
implications when the average file size is much smaller than
the block size.
Note: Future versions may implement schemes that obsolete
the above discussion. However, this is difficult to
implement, since dynamic run-time allocation can lead to
dead-locks; the current implementation performs a static
pre-allocation.
- Q:
How does the chunk size (stripe size) influence the speed of
my RAID-0, RAID-4 or RAID-5 device?
A:
The chunk size is the amount of data contiguous on the
virtual device that is also contiguous on the physical
device. In this HOWTO, "chunk" and "stripe" refer to
the same thing: what is commonly called the "stripe"
in other RAID documentation is called the "chunk"
in the MD man pages. Stripes or chunks apply only to
RAID 0, 4 and 5, since stripes are not used in
mirroring (RAID-1) and simple concatenation (RAID-linear).
The stripe size affects both read and write latency (delay),
throughput (bandwidth), and contention between independent
operations (ability to simultaneously service overlapping I/O
requests).
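As an illustration of what the chunk size means, here is a small
Python sketch that maps an offset on the virtual device to a disk,
and counts how many disks one request touches. It assumes a plain
RAID-0 over equal-sized disks with chunks handed out round-robin;
the function names are invented for this example and this is not the
driver's code.

```python
def locate(offset_kb, chunk_kb, n_disks):
    """Map a virtual-device offset to (disk index, offset on that disk)."""
    chunk_index, within_chunk = divmod(offset_kb, chunk_kb)
    row, disk = divmod(chunk_index, n_disks)
    return disk, row * chunk_kb + within_chunk


def disks_touched(offset_kb, length_kb, chunk_kb, n_disks):
    """Number of distinct disks needed to satisfy one contiguous request."""
    first = offset_kb // chunk_kb
    last = (offset_kb + length_kb - 1) // chunk_kb
    return min(n_disks, last - first + 1)


# A 16KB read on a 4-disk array: small chunks involve every disk,
# large chunks keep the request on a single disk.
print(disks_touched(0, 16, chunk_kb=4, n_disks=4))
print(disks_touched(0, 16, chunk_kb=256, n_disks=4))
```

With small chunks even a modest read ties up the whole array; with
large chunks the other disks stay free for other requests.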
Assuming the use of the ext2fs file system, and the current
kernel policies about read-ahead, large stripe sizes are almost
always better than small stripe sizes, and stripe sizes
from about a fourth to a full disk cylinder in size
may be best. To understand this claim, let us consider the
effects of large stripes on small files, and small stripes
on large files. The stripe size does
not affect the read performance of small files: For an
array of N drives, the file has a 1/N probability of
being entirely within one stripe on any one of the drives.
Thus, both the read latency and bandwidth will be comparable
to that of a single drive. Assuming that the small files
are statistically well distributed around the filesystem,
(and, with the ext2fs file system, they should be), roughly
N times more overlapping, concurrent reads should be possible
without significant collision between them. Conversely, if
very small stripes are used, and a large file is read sequentially,
then a read will be issued to all of the disks in the array.
For the read of a single large file, the latency will almost
double, as the probability of a block being 3/4 of a
revolution or farther away will increase. Note, however,
the trade-off: the bandwidth could improve almost N-fold
for reading a single, large file, as N drives can be reading
simultaneously (that is, if read-ahead is used so that all
of the disks are kept active). But there is another,
counter-acting trade-off: if all of the drives are already busy
reading one file, then attempting to read a second or third
file at the same time will cause significant contention,
ruining performance as the disk ladder algorithms lead to
seeks all over the platter. Thus, large stripes will almost
always lead to the best performance. The sole exception is
the case where one is streaming a single, large file at a
time, and one requires the top possible bandwidth, and one
is also using a good read-ahead algorithm, in which case small
stripes are desired.
Note that this HOWTO previously recommended small stripe
sizes for news spools or other systems with lots of small
files. This was bad advice, and here's why: news spools
contain not only many small files, but also large summary
files, as well as large directories. If the summary file
is larger than the stripe size, reading it will cause
many disks to be accessed, slowing things down as each
disk performs a seek. Similarly, the current ext2fs
file system searches directories in a linear, sequential
fashion. Thus, to find a given file or inode, on average
half of the directory will be read. If this directory is
spread across several stripes (several disks), the
directory read (e.g. due to the ls command) could get
very slow. Thanks to Steven A. Reisman
<sar@pressenter.com> for this correction.
Steve also adds:
I found that using a 256k stripe gives much better performance.
I suspect that the optimum size would be the size of a disk
cylinder (or maybe the size of the disk drive's sector cache).
However, disks nowadays have recording zones with different
sector counts (and sector caches vary among different disk
models). There's no way to guarantee stripes won't cross a
cylinder boundary.
The tools accept the stripe size specified in KBytes.
You'll want to specify a multiple of the page size
for your CPU (4KB on the x86).
- Q:
What is the correct stride factor to use when creating the
ext2fs file system on the RAID partition? By stride, I mean
the -R flag on the
mke2fs command:
mke2fs -b 4096 -R stride=nnn ...
What should the value of nnn be?
A:
The -R stride flag is used to tell the file system
about the size of the RAID stripes. Since only RAID-0,4 and 5
use stripes, and RAID-1 (mirroring) and RAID-linear do not,
this flag is applicable only for RAID-0,4,5.
Knowledge of the size of a stripe allows mke2fs
to allocate the block and inode bitmaps so that they don't
all end up on the same physical drive. An unknown contributor
wrote:
I noticed last spring that one drive in a pair always had a
larger I/O count, and tracked it down to these meta-data
blocks. Ted added the -R stride= option in response
to my explanation and request for a workaround.
For a 4KB block file system, with stripe size 256KB, one would
use -R stride=64 .
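The stride is simply the stripe size expressed in file system blocks.
A minimal Python sketch of the calculation (the function name is
invented; the device argument is left out of the command on purpose):

```python
def stride(stripe_kb, block_bytes):
    """Stripe (chunk) size expressed in file system blocks."""
    stripe_bytes = stripe_kb * 1024
    if stripe_bytes % block_bytes:
        raise ValueError("stripe size is not a multiple of the block size")
    return stripe_bytes // block_bytes


block_bytes = 4096
n = stride(256, block_bytes)
print(f"mke2fs -b {block_bytes} -R stride={n} ...")
```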
If you don't trust the -R flag, you can get a similar
effect in a different way. Steven A. Reisman
<sar@pressenter.com> writes:
Another consideration is the filesystem used on the RAID-0 device.
The ext2 filesystem allocates 8192 blocks per group. Each group
has its own set of inodes. If there are 2, 4 or 8 drives, these
inodes cluster on the first disk. I've distributed the inodes
across all drives by telling mke2fs to allocate only 7932 blocks
per group.
Some mke2fs man pages do not describe the [-g blocks-per-group]
flag used in this operation.
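Why a slightly odd blocks-per-group value spreads the metadata can be
seen from a quick calculation. The Python sketch below assumes that
block group g starts at block g * blocks_per_group and that the
stripe is 4 blocks wide; both are simplifying assumptions made for
this example, not figures from the report above.

```python
def group_start_drives(blocks_per_group, chunk_blocks, n_drives, n_groups):
    """Drive on which each of the first n_groups block groups begins."""
    return [
        (group * blocks_per_group // chunk_blocks) % n_drives
        for group in range(n_groups)
    ]


# 4 drives, stripe of 4 blocks (assumed), first 8 block groups.
print(group_start_drives(8192, chunk_blocks=4, n_drives=4, n_groups=8))
print(group_start_drives(7932, chunk_blocks=4, n_drives=4, n_groups=8))
```

With 8192 blocks per group every group begins on the same drive;
with 7932 the starting drive rotates from group to group.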
- Q:
Where can I put the
md commands in the startup scripts,
so that everything will start automatically at boot time?
A:
Rod Wilkens <rwilkens@border.net> writes:
What I did is put ``mdadd -ar'' in
the ``/etc/rc.d/rc.sysinit'' right after the kernel
loads the modules, and before the ``fsck'' disk check.
This way, you can put the ``/dev/md?'' device in the
``/etc/fstab''. Then I put the ``mdstop -a''
right after the ``umount -a'' unmounting the disks,
in the ``/etc/rc.d/init.d/halt'' file.
For raid-5, you will want to look at the return code
for mdadd , and if it failed, do a
ckraid --fix /etc/raid5.conf
to repair any damage.
- Q:
I was wondering if it's possible to setup striping with more
than 2 devices in md0? This is for a news server,
and I have 9 drives... Needless to say I need much more than two.
Is this possible?
A:
Yes. (describe how to do this)
- Q:
When is Software RAID superior to Hardware RAID?
A:
Normally, Hardware RAID is considered superior to Software
RAID, because hardware controllers often have a large cache,
and can do a better job of scheduling operations in parallel.
However, integrated Software RAID can (and does) gain certain
advantages from being close to the operating system.
For example, ... ummm. Opaque description of caching of
reconstructed blocks in buffer cache elided ...
On a dual PPro SMP system, it has been reported that
Software-RAID performance exceeds the performance of a
well-known hardware-RAID board vendor by a factor of
2 to 5.
Software RAID is also a very interesting option for
high-availability redundant server systems. In such
a configuration, two CPUs are attached to one set
of SCSI disks. If one server crashes or fails to
respond, then the other server can mdadd,
mdrun and mount the software RAID
array, and take over operations. This sort of dual-ended
operation is not always possible with many hardware
RAID controllers, because of the state configuration that
the hardware controllers maintain.
- Q:
If I upgrade my version of raidtools, will it have trouble
manipulating older raid arrays? In short, should I recreate my
RAID arrays when upgrading the raid utilities?
A:
No, not unless the major version number changes.
An MD version x.y.z consists of three sub-versions:
x: Major version.
y: Minor version.
z: Patchlevel version.
Version x1.y1.z1 of the RAID driver supports a RAID array with
version x2.y2.z2 in case (x1 == x2) and (y1 >= y2).
Different patchlevel (z) versions for the same (x.y) version are
designed to be mostly compatible.
The minor version number is increased whenever the RAID array layout
is changed in a way which is incompatible with older versions of the
driver. New versions of the driver will maintain compatibility with
older RAID arrays.
The major version number will be increased if it will no longer make
sense to support old RAID arrays in the new kernel code.
For RAID-1, it's not likely that the disk layout or the
superblock structure will change anytime soon. Almost all
optimizations and new features (reconstruction, multithreaded
tools, hot-plug, etc.) do not affect the physical layout.
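The compatibility rule stated above, (x1 == x2) and (y1 >= y2), can
be written as a short Python check. The version numbers used below
are made up for illustration.

```python
def driver_supports(driver_version, array_version):
    """True if driver x1.y1.z1 can handle an array created as x2.y2.z2."""
    x1, y1, _z1 = driver_version
    x2, y2, _z2 = array_version
    # Same major version, and a minor version at least as new as the array's.
    # The patchlevel (z) does not enter into the rule.
    return x1 == x2 and y1 >= y2


print(driver_supports((0, 36, 3), (0, 35, 0)))  # newer driver, older array
print(driver_supports((0, 35, 0), (0, 36, 0)))  # older driver, newer array
print(driver_supports((1, 0, 0), (0, 36, 0)))   # major version changed
```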
- Q:
The command
mdstop /dev/md0 says that the device is busy.
A:
There's a process that has a file open on /dev/md0, or
/dev/md0 is still mounted. Terminate the process or
umount /dev/md0.
- Q:
Are there performance tools?
A:
There is a new utility called iotrace in the
linux/iotrace
directory. It reads /proc/io-trace and analyses/plots its
output. If you feel your system's block IO performance is too
low, just look at the iotrace output.
- Q:
I was reading the RAID source, and saw the value
SPEED_LIMIT defined as 1024K/sec. What does this mean?
Does this limit performance?
A:
SPEED_LIMIT is used to limit RAID reconstruction
speed during automatic reconstruction. Basically, automatic
reconstruction allows you to e2fsck and
mount immediately after an unclean shutdown,
without first running ckraid. Automatic
reconstruction is also used after a failed hard drive
has been replaced.
In order to avoid overwhelming the system while
reconstruction is occurring, the reconstruction thread
monitors the reconstruction speed and slows it down if
it's too fast. The 1M/sec limit was arbitrarily chosen
as a reasonable rate which allows the reconstruction to
finish reasonably rapidly, while creating only a light load
on the system so that other processes are not interfered with.
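Since the limit acts as a ceiling on the reconstruction rate, it also
gives a lower bound on how long a reconstruction takes. A rough
Python sketch of that arithmetic, using a hypothetical 4GB device
(the function name is made up for this example):

```python
SPEED_LIMIT_KB_PER_SEC = 1024  # the 1024K/sec value quoted in the question


def min_reconstruction_seconds(device_kb, limit=SPEED_LIMIT_KB_PER_SEC):
    """Shortest possible reconstruction time if the limit is never exceeded."""
    return device_kb / limit


device_kb = 4 * 1024 * 1024  # hypothetical 4GB device
seconds = min_reconstruction_seconds(device_kb)
print(f"at least {seconds / 60:.0f} minutes")
```

The real figure will be higher whenever other I/O competes with the
reconstruction thread.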
- Q:
What about ''spindle synchronization'' or ''disk
synchronization''?
A:
Spindle synchronization is used to keep multiple hard drives
spinning at exactly the same speed, so that their disk
platters are always perfectly aligned. This is used by some
hardware controllers to better organize disk writes.
However, for software RAID, this information is not used,
and spindle synchronization might even hurt performance.
- Q:
How can I set up swap spaces using raid 0?
Wouldn't striped swap areas over 4+ drives be really fast?
A:
Leonard N. Zubkoff replies:
It is really fast, but you don't need to use MD to get striped
swap. The kernel automatically stripes across equal priority
swap spaces. For example, the following entries from
/etc/fstab stripe swap space across five drives in
three groups:
/dev/sdg1 swap swap pri=3
/dev/sdk1 swap swap pri=3
/dev/sdd1 swap swap pri=3
/dev/sdh1 swap swap pri=3
/dev/sdl1 swap swap pri=3
/dev/sdg2 swap swap pri=2
/dev/sdk2 swap swap pri=2
/dev/sdd2 swap swap pri=2
/dev/sdh2 swap swap pri=2
/dev/sdl2 swap swap pri=2
/dev/sdg3 swap swap pri=1
/dev/sdk3 swap swap pri=1
/dev/sdd3 swap swap pri=1
/dev/sdh3 swap swap pri=1
/dev/sdl3 swap swap pri=1
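To see which swap areas end up striped together, the entries can be
grouped by their pri= option. The Python sketch below rebuilds the
fifteen entries shown above and groups them; the parsing is
deliberately simple and the function name is made up for this
example.

```python
from collections import defaultdict

# Rebuild the fstab entries quoted above: partitions 1-3 on five drives,
# with partition 1 at pri=3, partition 2 at pri=2 and partition 3 at pri=1.
DRIVES = ["sdg", "sdk", "sdd", "sdh", "sdl"]
FSTAB = "\n".join(
    f"/dev/{drive}{part} swap swap pri={4 - part}"
    for part in (1, 2, 3)
    for drive in DRIVES
)


def swap_groups(fstab_text):
    """Group swap devices by priority; equal priorities are striped together."""
    groups = defaultdict(list)
    for line in fstab_text.splitlines():
        fields = line.split()
        if len(fields) < 4 or fields[2] != "swap":
            continue
        for option in fields[3].split(","):
            if option.startswith("pri="):
                groups[int(option[4:])].append(fields[0])
    return dict(groups)


groups = swap_groups(FSTAB)
for pri in sorted(groups, reverse=True):
    print(f"pri={pri}: striped across {len(groups[pri])} devices: "
          f"{' '.join(groups[pri])}")
```

Each of the three priority levels contains one partition from each
of the five drives, which is what makes the striping effective.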
- Q:
I want to maximize performance. Should I use multiple
controllers?
A:
In many cases, the answer is yes. Using several
controllers to perform disk access in parallel will
improve performance. However, the actual improvement
depends on your actual configuration. For example,
it has been reported (Vaughan Pratt, January 98) that
a single 4.3GB Cheetah attached to an Adaptec 2940UW
can achieve a rate of 14MB/sec (without using RAID).
Installing two disks on one controller, and using
a RAID-0 configuration results in a measured performance
of 27 MB/sec.
Note that the 2940UW controller is an "Ultra-Wide"
SCSI controller, capable of a theoretical burst rate
of 40MB/sec, and so the above measurements are not
surprising. However, a slower controller attached
to two fast disks would be the bottleneck. Note also,
that most out-board SCSI enclosures (e.g. the kind
with hot-pluggable trays) cannot be run at the 40MB/sec
rate, due to cabling and electrical noise problems.
If you are designing a multiple controller system,
remember that most disks and controllers typically
run at 70-85% of their rated max speeds.
Note also that using one controller per disk
can reduce the likelihood of system outage
due to a controller or cable failure (In theory --
only if the device driver for the controller can
gracefully handle a broken controller. Not all
SCSI device drivers seem to be able to handle such
a situation without panicking or otherwise locking up).
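The throughput figures above can be put together in a
back-of-the-envelope Python sketch. It takes the 40MB/sec rated speed,
the 14MB/sec per-disk rate and the 70-85% rule of thumb from the text;
the function names are invented, and real results depend on cabling,
drivers and workload.

```python
def sustained_mb_s(rated_mb_s, percent):
    """Realistic sustained rate at a given percentage of the rated speed."""
    return rated_mb_s * percent / 100


def disks_at_full_speed(rated_mb_s, disk_mb_s, percent):
    """How many disks one controller can feed before it becomes the bottleneck."""
    return (rated_mb_s * percent) // (disk_mb_s * 100)


for percent in (70, 85):
    print(f"{percent}%: about {sustained_mb_s(40, percent):.0f}MB/sec, "
          f"room for {disks_at_full_speed(40, 14, percent)} disks at 14MB/sec")
```

By this estimate a third fast disk on the same controller would gain
little, which is one argument for adding a second controller.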