This document contains only my personal opinions and calls of judgement, and where any comment is made as to the quality of anybody's work, the comment is an opinion, in my judgement.
There are various ways to achieve resilience through link redundancy, including somewhat explicit technologies like VRRP and CARP at the routing level or SMLT at the network level, and usually one has to trade off resilience, load balancing and fast switchover against each other. But there is an alternative that uses nothing more than OSPF and basic routing to achieve most of these objectives, and it seems fairly obscure, so perhaps it is useful to document it here, as I have been using it with excellent results for a while, most recently in a medium-sized, high performance network (it is an idea that has probably been discovered many times).
The basic idea is that OSPF will automagically build a global routing state from status exchanges among adjacent routers, that multiple routes to the same destination will provide load balancing (on a per-connection basis) if ECMP is available, and that OSPF will also happily propagate host routes to virtual, canonical IP addresses for hosts.
The basic setup is very simple: give each host or router a canonical IP address bound to a virtual interface, connect it to two or more subnets, run OSPF on all its physical interfaces, and let OSPF publish a host route for the canonical address, with ECMP enabled where available.
That does not take a lot of work. The effects are however quite interesting, if one refers to each such host or router not by the address of any of its network interfaces, but by that of its virtual interface. The reason is that OSPF will propagate not just routes to each physical network, but also to each virtual interface, via all the networks its node is connected to.
If ECMP is enabled and some of these routes have the same cost, load balancing will occur across all the routes that share that cost; if any of the routes becomes invalid, traffic to the virtual IP address will be instantly rerouted over whichever routes are still available. Instantly, because when an interface fails the routes through it are withdrawn, and any equal (or higher) cost route that remains will immediately be used for the next packet.
It is also very easy to use anycast addresses or other forms of host routes to distribute services in a resilient way; the canonical addresses of routers and important servers are similar to anycast addresses.
As long as connections are made from one virtual IP address to another virtual IP address, packets will eventually arrive, as OSPF creates and reshapes the set of routes across the various networks.
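To make this concrete, here is a minimal sketch of the per-node setup, assuming a Linux host or router running Quagga; the interface names, addresses and the use of area 0.0.0.0 are invented for the example, and other routing daemons or router operating systems have their own equivalents:

# Create the virtual interface and bind the canonical address to it
# (the dummy module provides the virtual interface under Linux).
ip link add dummy0 type dummy
ip link set dummy0 up
ip addr add 10.0.3.1/32 dev dummy0

# Run OSPF on the physical subnets and publish a host route for the
# canonical address; sketch of /etc/quagga/ospfd.conf (zebra must also run).
cat > /etc/quagga/ospfd.conf <<'EOF'
router ospf
 ospf router-id 10.0.3.1
 ! the two physical subnets this node is attached to
 network 10.0.1.0/24 area 0.0.0.0
 network 10.0.2.0/24 area 0.0.0.0
 ! the canonical /32 address, advertised as a host route
 network 10.0.3.1/32 area 0.0.0.0
EOF

# For load balancing the kernel needs multipath routing
# (CONFIG_IP_ROUTE_MULTIPATH) and the routing daemon needs to be
# built with ECMP support (for example Quagga's --enable-multipath).

The same pattern repeats on every node; only the canonical address and the list of attached subnets change.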
This technique has some drawbacks, mainly that every node involved has to run OSPF (and ideally ECMP), and that each canonical address adds a host route to every routing table.
As an example of a particular setup, imagine a site with two or more internal subnets, each host and router having a canonical address (and, for routers, a router id) in the 10.0.3.0/24 subnet, and a gateway router that also has an address on the Internet.
In the above discussion a canonical address is not indispensable, but it is very useful. The idea is that a given router or server cannot reliably be referred to by any of the addresses of its physical interfaces, as in most such systems or routers, when a link dies its associated interface disappears and any address bound to it vanishes as well. Therefore each system or router needs an IP address bound to a virtual interface (dummy under Linux, circuitless for some Nortel routers, loopback in the CISCO and many other cultures) in order to be always reachable no matter which particular links and interfaces are active.
For each such system there will be a host route published for its canonical address, but in most networks with dozens or hundreds of subnet routes, a few dozen or a few hundred more host routes are not a big deal, as most routers can handle thousands or even tens of thousands of routes.
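For illustration, on a Linux router with ECMP the published host route for a canonical address reachable over two subnets ends up looking roughly like this (hypothetical output, with invented addresses and interface names):

# One /32 host route with two equal-cost next hops, as installed by the
# routing daemon (hence "proto zebra").
$ ip route show 10.0.3.1
10.0.3.1 proto zebra metric 20
        nexthop via 10.0.1.1 dev eth0 weight 1
        nexthop via 10.0.2.1 dev eth1 weight 1

One host route, two equal-cost next hops: that is all the load balancing and failover machinery there is.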
This scheme is rather more reliable and simpler than the use of floating router IP addresses with VRRP or CARP or other load balancing or redundancy solutions, as it does not rely on tricks with the mapping between IP and Ethernet addresses. It can also be extended to fairly arbitrary topologies, and with the use of BGP and careful publication of routes it can be extended beyond a single AS.
It also has some other interesting properties.
I was just comparing a 1TB drive with a 500GB drive and both perform pretty well. The 1TB one (Hitachi) can do over 100MB/s through the file system:
.... Using O_DIRECT for block based I/O
Writing with putc()...       done:  61585 kB/s  82.2 %CPU
Rewriting...                 done:  29108 kB/s   1.4 %CPU
Writing intelligently...     done: 101481 kB/s   2.0 %CPU
Reading with getc()...       done:   7787 kB/s  14.4 %CPU
Reading intelligently...     done: 104591 kB/s   2.3 %CPU
Seek numbers calculated on first volume only
Seeker 1...Seeker 2...Seeker 3...start 'em...done...done...done...
              ---Sequential Output (nosync)--- ---Sequential Input-- --Rnd Seek-
              -Per Char-  -DIOBlock-  -DRewrite- -Per Char-  -DIOBlock- --04k (03)-
Machine    MB K/sec %CPU  K/sec %CPU  K/sec %CPU K/sec %CPU  K/sec %CPU   /sec %CPU
base.t 4* 400 61585 82.2 101481  2.0  29108  1.4  7787 14.4 104591  2.3  170.9  0.3

and the 500GB one (Western Digital) can do over 60MB/s:
.... Using O_DIRECT for block based I/O
Writing with putc()...       done:  52489 kB/s  70.8 %CPU
Rewriting...                 done:  25733 kB/s   1.3 %CPU
Writing intelligently...     done:  62586 kB/s   1.2 %CPU
Reading with getc()...       done:   7946 kB/s  13.4 %CPU
Reading intelligently...     done:  63825 kB/s   1.3 %CPU
Seek numbers calculated on first volume only
Seeker 1...Seeker 2...Seeker 3...start 'em...done...done...done...
              ---Sequential Output (nosync)--- ---Sequential Input-- --Rnd Seek-
              -Per Char-  -DIOBlock-  -DRewrite- -Per Char-  -DIOBlock- --04k (03)-
Machine    MB K/sec %CPU  K/sec %CPU  K/sec %CPU K/sec %CPU  K/sec %CPU   /sec %CPU
base.t 4* 400 52489 70.8  62586  1.2  25733  1.3  7946 13.4  63825  1.3  216.0  0.4

The file system used is the excellent JFS, which is still my favourite even if the news that XFS will become part of RHEL 5.4 may tilt the balance of opportunity towards it. Anyhow, even if the files used by the test above are large at 400MB, the filesystems used are fairly full, and JFS achieves transfer rates very close (80-90%) to the speed of the underlying devices for both reading and writing.
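As an aside, a crude way to cross-check the "speed of the underlying device" part of that claim is a direct-IO read of the raw device and of a large file with dd; the device and file names below are hypothetical, and this is obviously not the benchmark shown above, only a rough sanity check:

# Sequential read of the raw device, bypassing the page cache (device name assumed).
dd if=/dev/sdb of=/dev/null bs=1M count=400 iflag=direct

# Sequential read of a 400MB file on the filesystem, also with direct IO,
# to compare filesystem throughput against the raw device.
dd if=/mnt/test/bigfile of=/dev/null bs=1M count=400 iflag=direct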
I had somehow missed that intel is now selling a line of interesting RAID host adapters based on their own ARM architecture CPUs as IO processors together with LSI MegaRAID chips. They cover quite a range of configurations, and usually intel takes more care in engineering and documenting their products than most other chip and card suppliers.
As to Atom CPUs, I have been wondering how power efficient they really are, given that they seem to be around 3 times slower than an equivalent mainstream CPU. So I found this article reporting a test of the amount of energy (in watt-hours) consumed by some servers to run the same benchmarks, and it turns out that among intel CPUs the Core2Duo consumes the least energy: while its power draw is higher, it is faster, and this more than compensates. What this tells me is that the Atom is better for mostly-idle systems, that is IO- or network-bound ones, and the Core2Duo is better for mostly-busy, that is CPU-bound, ones.
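A back-of-the-envelope calculation, with invented figures, shows why: energy is power multiplied by time, so the faster chip can use less energy for the same job despite drawing more power:

# Purely illustrative numbers for a CPU-bound job.
# Atom-style box: about 30W under load, takes 3 time units to finish.
echo $((30 * 3))    # 90 energy units
# Core2Duo-style box: about 65W under load, finishes in 1 time unit.
echo $((65 * 1))    # 65 energy units: less in total, despite the higher draw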
Well, I often think that several mysterious computer issues are due to PSU faults. Indeed, recently one of my older PCs stopped working reliably: it would boot, and often allow installing some software or hardware, but then stop working abruptly, seemingly because of a hard disk issue, as IO to the disk would stop and the IO activity light would stay locked on.
Having tried some alternative hard disks and some alternative hard disk host adapter cards with the same results, I reckoned that they could not all be faulty, so I checked the voltage on one of the Berg style connectors and I was rather surprised to see that the 12V line was actually delivering 13V and, more fatally, the 5V line was actually delivering 4.5V, which is probably rather insufficient. I wonder why; the PSU was not a super-cheap one (those can catch fire) but a fairly high end Akasa one.
This is one of the stranger PSU failures that I have seen so far, where the voltage on one rail actually rises and on the other sags to just a bit too low, without the PSU failing outright.
It had to happen, yet I was still a bit surprised to see an intel Atom CPU based rackmount server, whose most notable characteristic is that it is half depth. That sort of makes sense, as one can then mount such units without rails, one in the front and one in the back of a rack.
It is also a bit disappointing to see that the hard disk is not hot pluggable and is a 3.5" one. The funniest detail however is that the motherboard chipset is actively cooled but the CPU is not.
The design logic seems to be a disposable unit for rent-a-box web server companies, where most such servers are used for relatively low traffic and small sites, and anyhow the main bottleneck is the 1Gb/s network interface, with such servers often bandwidth-limited to well less than that, typically 10-100Mb/s.
At the other end of the range the same manufacturer has announced another interesting idea, 1U and 2U server boxes with the recent i7 class Xeon 5500 CPUs, configured as a 1U/2U blade cabinet. That is mildly amusing, and seems to me the logical extension of Supermicro putting two servers side by side into a 1U box as a fixed configuration.
While reading the blog of a CentOS developer I had a huge surprise from a comment about recently discovered Red Hat plans to add XFS support to RHEL 5.4.
That is amazing, as it represents a really large change in Red Hat's strategy, which was based on an in-place upgrade from ext3 to ext4 in RHEL6, and on not introducing major new functionality in 'stable' releases. Some factors that I suppose might have influenced the decision:

- ext4 has made it into the mainline kernel, but only in a release that will be part of RHEL6, and RHEL6 has kept slipping a lot, currently to sometime (late) next year.

- Many customers need something rather more capable than ext3, and hate losing Oracle certification because of that.

The Red Hat sponsorship of XFS shifts my preferences a bit; I have been using JFS as my default filesystem for a while, as it is very stable and covers a very wide range of system sizes and usage patterns pretty well, but I might now (with regrets) move to XFS.
CentOS has had some things like XFS available in the CentOSPlus repository, and XFS has been a standard part of Scientific Linux 5 too (which also includes OpenAFS).