This document contains only my personal opinions and calls of judgement, and where any comment is made as to the quality of anybody's work, the comment is an opinion, in my judgement.
There are several difficulties with storage setups, and many of them arise from unwise attempts to be too clever. One of these is the attempt by drive manufacturers to compensate for MS-Windows issues and for inflated expectations by customers.
One of these expectations is the assumption that storage devices are faultless and never lose data or capacity, and that making redundant copies is quite unnecessary. As a result most storage devices are set to retry failed operations for a long time in the hope of eventually forcing them through, and amazingly this applies to writes too, as many block layers and file system designs cannot easily mark some portion of a storage area as faulty and then use another one which is not faulty.
The real, fallible storage device is then virtualized into a (pretend) infallible one which however has rather different properties from real storage devices, in particular as to latency: somewhat higher latencies when a sector that has been remapped (usually because of a fault) to another location is accessed, and enormously higher latencies while a failing operation is being retried.
Some parts of Linux have been designed by people who are used to storage being unvirtualized, and as a result they do their own retries of failing operations; and since failing operations are often clustered in time (for example when a cable becomes faulty) or in space (for example when a small part of the surface of a platter becomes faulty), this can result in very long periods in which the system becomes unresponsive. These can last dozens of minutes, as each failing operation is first retried for 1 or 2 minutes by the storage device, and then that is repeated several times by the device driver or the block layer in the kernel, or even by overeager applications.
The far better option overall is to use redundancy rather than retries to cope with failures, and to acknowledge failure early rather than over-rely on crude attempts at data recovery.
This particularly matters if further virtualization layers are used; for example a hardware RAID host adapter (HA), or a software virtual HA in a VM.
These layers often have their own fault recovery logic, and lower level recovery just makes recovery take longer. Even worse, they may have their own IO operation timeouts which may be triggered by long retry times at lower levels of the storage stack; for example a timeout on a single operation might cause some layer to consider the whole device faulty, when instead it is almost entirely working well.
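As a rough illustration of one such layer, the Linux SCSI disk driver keeps its own per-device command timeout (in seconds), quite apart from whatever retrying the drive itself does; here sdX is just a placeholder for a real device name:

cat /sys/block/sdX/device/timeout
# The usual default is 30 seconds; a value can also be written here,
# for example from a boot script, to give a device more or less rope
# before this layer gives up on an outstanding command:
echo 30 > /sys/block/sdX/device/timeout

If the drive below retries internally for 1-2 minutes, this 30 second limit can expire first, and the kernel's error handling may then reset the link or even offline a device that is merely being slow to recover.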
One of the major issues is the very long retries done by most storage devices in an attempt to work around the limitations as to error handling in MS-Windows.
Fortunately an extension to the common storage protocols has somewhat recently become somewhat popular: ERC, part of the SCT feature set of SATA/SAS. It is somewhat limited in that changes to the retry settings are not permanent: as a rule they do not survive a power cycle, so the desired values have to be set again at every boot.
SAS and enterprise grade SATA drives have reduced retry timeouts by default, but these are still often pretty long, typically 7 seconds. Some SATA consumer grade drives have had their SCT ERC settings blocked to prevent them being substituted for the far more expensive enterprise level drives (where the different SCT ERC default is the only functional difference, the others being manufacturing quality differences that are difficult to demonstrate).
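Whether a given drive honours SCT ERC, and what its current timeouts are, can be checked with smartmontools, for example (sdX again being a placeholder):

smartctl -l scterc /dev/sdX
# The values are reported in units of 0.1 seconds, so 70 means 7
# seconds; drives with SCT ERC blocked report the feature as
# unsupported or disabled.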
What is a suitable retry timeout is then a good question, and many SAS and enterprise grade drives have it set to 7 seconds, which seems way too long to me. That is surely more sensible than the 1-2 minutes of many consumer grade SATA drives, but still incredibly long: repeated read or write operations to the same area of the disk will usually incur only rotational latency (transfer time being negligible), and a typical 7200RPM drive does 120 revolutions per second, so it can manage perhaps around 100 retries per second. I think then that 1-2 seconds is plenty for a retry timeout, especially on storage systems with redundancy built in like most RAID setups, but also on desktop or laptop drives.
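On drives that accept it, the timeouts can accordingly be lowered along these lines (values again in units of 0.1 seconds):

# Set both the read and the write error recovery timeout to 2 seconds.
smartctl -l scterc,20,20 /dev/sdX
# On most drives this setting does not survive a power cycle, so it
# has to be reissued from a boot (or udev) script.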
Regrettably there are now quite a few high capacity drives with 4096-byte physical sectors, and this requires better alignment of data and filesystems on those drives than in the past. Large alignment granules are also suggested or required by other storage technologies, from parity RAID (not a good idea except in a few cases) to flash-memory based drives that have huge erase blocks.
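Whether a drive presents 4096-byte physical sectors behind 512-byte logical ones can be checked from the block layer's view of it, for example:

cat /sys/block/sdX/queue/logical_block_size
cat /sys/block/sdX/queue/physical_block_size
# A 512/4096 pair indicates a 512-byte-emulation drive, on which a
# misaligned write costs a read-modify-write of a whole 4096-byte
# physical sector.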
The ideal way to achieve such alignment is the new GPT partitioning scheme, implemented notably by recent versions of parted and gdisk, which by default align partitions to largish boundaries like 1MiB. I prefer to set the granule of alignment and allocation to 1GiB for various reasons.
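As a small sketch of that (the device name and sizes being purely illustrative), recent versions of parted can create and then check such aligned partitions:

# WARNING: mklabel destroys the existing partition table on /dev/sdX.
parted /dev/sdX mklabel gpt
# Create a partition on 1GiB boundaries, which also satisfies any
# smaller power-of-2 alignment requirement of the underlying storage.
parted -a optimal /dev/sdX mkpart primary 1GiB 501GiB
# Ask parted whether partition 1 is aligned to the device's optimal
# IO granule.
parted /dev/sdX align-check optimal 1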
However there is still a case for using the old MBR partitioning scheme on drives with less than 2TB capacity, for example because most versions of GNU GRUB don't work with the GPT scheme.
In the MBR scheme there are several awkward legacy issues, mostly that:
* partitioning tools (and some BIOSes) traditionally want partitions to begin and end on "cylinder" boundaries of 255 heads by 63 sectors, which are not power-of-2 multiples;
* the first track of 63 sectors is traditionally reserved before the first partition, and each logical partition inside an extended partition is likewise preceded by a 63-sector reserved area holding its EBR;
and the result is that partition and data starts end up misaligned with respect to 4KiB physical sectors and larger granules.
The way I have chosen, after consulting various sources, to reconcile using traditional MBR partitions with modern storage technology is to adopt these conventions:
* partitions begin and end on traditional cylinder boundaries, but only on cylinders whose byte offset is also a multiple of 32KiB;
* within each partition, the c setting of fdisk and the extended command b are used to set the LBA starting address of the data to another, coarser power-of-2 alignment, a multiple of 512 sectors, which gives a 256KiB alignment.
To compute start and end cylinders and data start sectors that follow these conventions I wrote a small Perl script; a very rough draft version of the script is:
#!/usr/bin/perl
use strict;
use integer;

my $KiB   = 1024;
my $MiB   = 1024*$KiB;
my $GiB   = 1024*$MiB;

my $SectB = 512;
my $HeadB = 63*$SectB;
my $CylB  = 255*$HeadB;

my $alPartB = 32*$KiB;
my $alFsysB = 256*$KiB;

# The first byte, origin-0, is 512, as bytes 0-511 contain
# the MBR.
my $startB = 1*$SectB;

# Now we have as parameters either <= 4 primary partition sizes
# or >= 5 sizes of which the first 3 are primary partitions and
# the rest are logical partitions. For each we want to print
# the first and last cylinders and sectors, and the start of data
# within it, given a specific *usable* size (that is, excluding
# the start of data alignment).

sub partCalc()
{
    my ($startB,$resB,$sizeB) = @_;

    # First cylinder at or after $startB whose byte offset is a
    # multiple of the partition alignment granule.
    my $startC = (($startB+$CylB-1)/$CylB);
    $startC += 1 while (($startC*$CylB) % $alPartB != 0);

    # Start of data: past the reserved area, rounded up to the
    # filesystem alignment granule.
    my $startDataB = (($startC*$CylB+$resB+$alFsysB-1)/$alFsysB)*$alFsysB;

    # Last cylinder: enough to hold the usable size, pushed out to
    # the next aligned cylinder boundary.
    my $endC = ($startDataB+$sizeB+$CylB-1)/$CylB;
    $endC += 1 while (($endC*$CylB) % $alPartB != 0
                      && (($endC-$startC+1)*$CylB) >= $sizeB);

    return ($startC,$startDataB/$SectB,$endC);
}

$startB = 0;
my $ARGC = 1;

foreach my $ARG (@ARGV)
{
    if ($ARG eq '')
    {
        if ($ARGC <= 4)
        { printf "%2d:\n",$ARGC; }
        else
        { printf "%2d: %s\n",$ARGC,"(logical partition cannot be void)"; }
    }
    else
    {
        # The first partition and each logical partition have a
        # reserved track (MBR or EBR) of 63 sectors before the data.
        my $resB = ($ARGC == 1 || $ARGC >= 5) ? 63*$SectB : 0;

        my ($startC,$startDataS,$endC)
          = &partCalc($startB,$resB,($ARG+0)*$GiB);

        printf "%2d: %6dc to %6dc (%6dc) start %10ds\n",
          $ARGC,$startC+1,$endC,($endC-$startC),$startDataS;

        $startB = $endC*$CylB;
    }
    $ARGC++;
}
This is how I used it when partitioning a 2TB drive I recently got:
tree% perl calcAlignedParts.pl 0 25 25 '' 5 409 464 928
 1:      1c to     64c (    64c) start        512s
 2:     65c to   3328c (  3264c) start    1028608s
 3:   3329c to   6592c (  3264c) start   53464576s
 4:
 5:   6593c to   7296c (   704c) start  105900544s
 6:   7297c to  60736c ( 53440c) start  117210624s
 7:  60737c to 121344c ( 60608c) start  975724032s
 8: 121345c to 242496c (121152c) start 1949391872s
and this is the resulting partitioning scheme in CHS and LBA terms:
% fdisk -l /dev/sdc

Disk /dev/sdc: 2000.3 GB, 2000398934016 bytes
255 heads, 63 sectors/track, 243201 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x00084029

   Device Boot      Start         End      Blocks   Id  System
/dev/sdc1              65        3328    26218080    7  HPFS/NTFS
/dev/sdc2            3329        6592    26218080    7  HPFS/NTFS
/dev/sdc4            6593      242496  1894898880    5  Extended
/dev/sdc5            6593        7296     5654848   82  Linux swap / Solaris
/dev/sdc6            7297       60736   429256608    7  HPFS/NTFS
/dev/sdc7           60737      121344   486833664   83  Linux
/dev/sdc8          121345      242496   973153184   83  Linux

% fdisk -l -u /dev/sdc

Disk /dev/sdc: 2000.3 GB, 2000398934016 bytes
255 heads, 63 sectors/track, 243201 cylinders, total 3907029168 sectors
Units = sectors of 1 * 512 = 512 bytes
Disk identifier: 0x00084029

   Device Boot       Start          End      Blocks   Id  System
/dev/sdc1           1028160     53464319    26218080    7  HPFS/NTFS
/dev/sdc2          53464320    105900479    26218080    7  HPFS/NTFS
/dev/sdc4         105900480   3895698239  1894898880    5  Extended
/dev/sdc5         105900544    117210239     5654848   82  Linux swap / Solaris
/dev/sdc6         117210624    975723839   429256608    7  HPFS/NTFS
/dev/sdc7         975724032   1949391359   486833664   83  Linux
/dev/sdc8        1949391872   3895698239   973153184   83  Linux
This was achieved by first using fdisk to create the partitions with the usual n command with the given CHS start and end (after setting the c option just in case), and then adjusting the LBA starting sector of each partition with the b extended command.
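Condensed into a sketch (not a verbatim session; /dev/sdX and the numbers entered are whatever the script printed), the sequence of fdisk commands is roughly:

fdisk /dev/sdX
#   c     toggle the DOS compatibility flag
#   n     create each partition, entering the start and end cylinders
#         computed by the script
#   x     switch to the expert menu
#   b     move the beginning of data in a partition: give the
#         partition number, then the aligned LBA start sector computed
#         by the script (a multiple of 512 sectors, that is 256KiB)
#   r     return to the main menu
#   w     write the partition table and exit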