This document contains only my personal opinions and calls of judgement, and where any comment is made as to the quality of anybody's work, the comment is an opinion, in my judgement.
There are various ways to achieve resilience through link redundancy, including somewhat explicit technologies like VRRP and CARP at the routing level or SMLT at the network level, and usually one has to trade off resilience, load balancing and fast switchover against each other. But there is an alternative that uses nothing more than OSPF and basic routing to achieve most of these objectives, and it seems fairly obscure, so perhaps it is useful to document it here, as I have been using it with excellent results for a while, most recently in a medium-sized, high performance network (it is an idea that has probably been discovered many times).
The basic idea is that OSPF will automagically build a global routing state from status exchanges among adjacent routers, that multiple routes to the same destination will provide load balancing (on a per-connection basis) if ECMP is available, and that OSPF will also happily propagate host routes to virtual, canonical IP addresses for hosts.
The basic setup is very simple: give each host or router a canonical IP address bound to a virtual interface, connect it to two or more subnets, run OSPF on all its physical interfaces, and let OSPF publish a host route for the canonical address, with ECMP enabled where available.
That does not take a lot of work. The effects are however quite interesting, if one refers to each such host or router not by the address of any of its network interfaces, but by that of its virtual interface. The reason is that OSPF will propagate not just routes to each physical network, but also to each virtual interface, via all the networks its node is connected to.
If ECMP is enabled and some of these routes have the same cost, load balancing will occur across all the routes that share that cost; if any of the routes becomes invalid, traffic to the virtual IP address will be instantly rerouted over whichever routes are still available. Instantly, because when an interface fails the routes through it are withdrawn, and any equal (or higher) cost route that remains will immediately be used for the next packet.
It is also very easy to use anycast addresses or other forms of host routes to distribute services in a resilient way; the canonical addresses of routers and important servers are similar to anycast addresses.
As long as connections are made from one virtual IP address to another virtual IP address, packets will eventually arrive, as OSPF creates and reshapes the set of routes across the various networks.
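To make this concrete, here is a minimal sketch of the per-node setup, assuming a Linux host or router running Quagga; the interface names, addresses and the use of area 0.0.0.0 are invented for the example, and other routing daemons or router operating systems have their own equivalents:

# Create the virtual interface and bind the canonical address to it
# (the dummy module provides the virtual interface under Linux).
ip link add dummy0 type dummy
ip link set dummy0 up
ip addr add 10.0.3.1/32 dev dummy0

# Run OSPF on the physical subnets and publish a host route for the
# canonical address; sketch of /etc/quagga/ospfd.conf (zebra must also run).
cat > /etc/quagga/ospfd.conf <<'EOF'
router ospf
 ospf router-id 10.0.3.1
 ! the two physical subnets this node is attached to
 network 10.0.1.0/24 area 0.0.0.0
 network 10.0.2.0/24 area 0.0.0.0
 ! the canonical /32 address, advertised as a host route
 network 10.0.3.1/32 area 0.0.0.0
EOF

# For load balancing the kernel needs multipath routing
# (CONFIG_IP_ROUTE_MULTIPATH) and the routing daemon needs to be
# built with ECMP support (for example Quagga's --enable-multipath).

The same pattern repeats on every node; only the canonical address and the list of attached subnets change.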
This technique has some drawbacks, mainly that every node involved has to run OSPF (and ideally ECMP), and that each canonical address adds a host route to every routing table.
As an example of a particular setup, imagine a site with two or more internal subnets, each host and router having a canonical address (and, for routers, a router id) in the 10.0.3.0/24 subnet, and a gateway router that also has an address on the Internet.
In the above discussion a canonical address is not indispensable, but it is very useful. The idea is that a given router or server cannot reliably be referred to by any of the addresses of its physical interfaces, as in most such systems or routers, when a link dies its associated interface disappears and any address bound to it vanishes as well. Therefore each system or router needs an IP address bound to a virtual interface (dummy under Linux, circuitless for some Nortel routers, loopback in the CISCO and many other cultures) in order to be always reachable no matter which particular links and interfaces are active.
For each such system there will be a host route published for its canonical address, but in most networks with dozens or hundreds of subnet routes, a few dozen or a few hundred more host routes are not a big deal, as most routers can handle thousands or even tens of thousands of routes.
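For illustration, on a Linux router with ECMP the published host route for a canonical address reachable over two subnets ends up looking roughly like this (hypothetical output, with invented addresses and interface names):

# One /32 host route with two equal-cost next hops, as installed by the
# routing daemon (hence "proto zebra").
$ ip route show 10.0.3.1
10.0.3.1 proto zebra metric 20
        nexthop via 10.0.1.1 dev eth0 weight 1
        nexthop via 10.0.2.1 dev eth1 weight 1

One host route, two equal-cost next hops: that is all the load balancing and failover machinery there is.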
This scheme is rather more reliable and simpler than the use of floating router IP addresses with VRRP or CARP or other load balancing or redundancy solutions, as it does not rely on tricks with the mapping between IP and Ethernet addresses. It can also be extended to fairly arbitrary topologies, and with the use of BGP and careful publication of routes it can be extended beyond a single AS.
It also has some other interesting properties.
I was just comparing a 1TB drive with a 500GB drive and both perform pretty well. The 1TB one (Hitachi) can do over 100MB/s through the file system:
.... Using O_DIRECT for block based I/O
Writing with putc()...       done:  61585 kB/s  82.2 %CPU
Rewriting...                 done:  29108 kB/s   1.4 %CPU
Writing intelligently...     done: 101481 kB/s   2.0 %CPU
Reading with getc()...       done:   7787 kB/s  14.4 %CPU
Reading intelligently...     done: 104591 kB/s   2.3 %CPU
Seek numbers calculated on first volume only
Seeker 1...Seeker 2...Seeker 3...start 'em...done...done...done...
              ---Sequential Output (nosync)--- ---Sequential Input-- --Rnd Seek-
              -Per Char-  -DIOBlock-  -DRewrite- -Per Char-  -DIOBlock- --04k (03)-
Machine    MB K/sec %CPU  K/sec %CPU  K/sec %CPU K/sec %CPU  K/sec %CPU   /sec %CPU
base.t 4* 400 61585 82.2 101481  2.0  29108  1.4  7787 14.4 104591  2.3  170.9  0.3

and the 500GB one (Western Digital) can do over 60MB/s:
.... Using O_DIRECT for block based I/O
Writing with putc()...       done:  52489 kB/s  70.8 %CPU
Rewriting...                 done:  25733 kB/s   1.3 %CPU
Writing intelligently...     done:  62586 kB/s   1.2 %CPU
Reading with getc()...       done:   7946 kB/s  13.4 %CPU
Reading intelligently...     done:  63825 kB/s   1.3 %CPU
Seek numbers calculated on first volume only
Seeker 1...Seeker 2...Seeker 3...start 'em...done...done...done...
              ---Sequential Output (nosync)--- ---Sequential Input-- --Rnd Seek-
              -Per Char-  -DIOBlock-  -DRewrite- -Per Char-  -DIOBlock- --04k (03)-
Machine    MB K/sec %CPU  K/sec %CPU  K/sec %CPU K/sec %CPU  K/sec %CPU   /sec %CPU
base.t 4* 400 52489 70.8  62586  1.2  25733  1.3  7946 13.4  63825  1.3  216.0  0.4

The file system used is the excellent JFS, which is still my favourite even if the news that XFS will become part of RHEL 5.4 may tilt the balance of opportunity towards it. Anyhow, even if the files used by the test above are large at 400MB, the filesystems used are fairly full, and JFS achieves transfer rates very close (80-90%) to the speed of the underlying devices for both reading and writing.
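As an aside, a crude way to cross-check the "speed of the underlying device" part of that claim is a direct-IO read of the raw device and of a large file with dd; the device and file names below are hypothetical, and this is obviously not the benchmark shown above, only a rough sanity check:

# Sequential read of the raw device, bypassing the page cache (device name assumed).
dd if=/dev/sdb of=/dev/null bs=1M count=400 iflag=direct

# Sequential read of a 400MB file on the filesystem, also with direct IO,
# to compare filesystem throughput against the raw device.
dd if=/mnt/test/bigfile of=/dev/null bs=1M count=400 iflag=direct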
I had somehow missed that intel is now selling a line of interesting RAID host adapters based on their own ARM architecture CPUs as IO processors together with LSI MegaRAID chips. They cover quite a range of configurations, and usually intel takes more care in engineering and documenting their products than most other chip and card suppliers.
As to Atom CPUs, I have been wondering how power efficient they really are, given that they seem to be around 3 times slower than an equivalent mainstream CPU. So I found this article reporting a test of the amount of energy (in watt-hours) consumed by some servers to run the same benchmarks, and it turns out that among intel CPUs the Core2Duo consumes the least energy: while its power draw is higher, it is faster, and this more than compensates. What this tells me is that the Atom is better for mostly-idle systems, that is IO- or network-bound ones, and the Core2Duo is better for mostly-busy, that is CPU-bound, ones.
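A back-of-the-envelope calculation, with invented figures, shows why: energy is power multiplied by time, so the faster chip can use less energy for the same job despite drawing more power:

# Purely illustrative numbers for a CPU-bound job.
# Atom-style box: about 30W under load, takes 3 time units to finish.
echo $((30 * 3))    # 90 energy units
# Core2Duo-style box: about 65W under load, finishes in 1 time unit.
echo $((65 * 1))    # 65 energy units: less in total, despite the higher draw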
Well, I often think that several mysterious computer issues are due to PSU faults. Indeed, recently one of my older PCs stopped working reliably: it would boot, and often allow installing some software or hardware, but then stop working abruptly, seemingly because of a hard disk issue, as IO to the disk would stop and the IO activity light would stay locked on.
Having tried some alternative hard disks and some alternative hard disk host adapter cards with the same results, I reckoned that they could not all be faulty, so I checked the voltage on one of the Berg style connectors and I was rather surprised to see that the 12V line was actually delivering 13V and, more fatally, the 5V line was actually delivering 4.5V, which is probably rather insufficient. I wonder why; the PSU was not a super-cheap one (those can catch fire) but a fairly high end Akasa one.
This is one of the stranger PSU failures that I have seen so far, where the voltage on one rail actually rises and on the other sags to just a bit too low, without the PSU failing outright.
It had to happen, yet I was still a bit surprised to see an intel Atom CPU based rackmount server, whose most notable characteristic is that it is half depth. That sort of makes sense, as one can then mount such units without rails, one in the front and one in the back of a rack.
It is also a bit disappointing to see that the hard disk is not hot pluggable and is a 3.5" one. The funniest detail however is that the motherboard chipset is actively cooled but the CPU is not.
The design logic seems to be a disposable unit for rent-a-box web server companies, where most such servers are used for relatively low traffic and small sites, and anyhow the main bottleneck is the 1Gb/s network interface, with such servers often bandwidth-limited to well less than that, typically 10-100Mb/s.
At the other end of the range the same manufacturer has announced another interesting idea, 1U and 2U server boxes with the recent i7 class Xeon 5500 CPUs, configured as a 1U/2U blade cabinet. That is mildly amusing, and seems to me the logical extension of Supermicro putting two servers side by side into a 1U box as a fixed configuration.
While reading the blog of a CentOS developer I had a huge surprise from a comment about recently discovered Red Hat plans to add XFS support to RHEL 5.4.
That is amazing, as it represents a really large change in Red Hat's strategy, which was based on an in-place upgrade from ext3 to ext4 in RHEL6, and on not introducing major new functionality in 'stable' releases. Some factors that I suppose might have influenced the decision:

- ext4 has made it into the mainline kernel, but only in a release that will be part of RHEL6, and RHEL6 has kept slipping a lot, currently to sometime (late) next year.

- Many customers need something rather more capable than ext3, and hate losing Oracle certification because of that.

The Red Hat sponsorship of XFS shifts my preferences a bit; I have been using JFS as my default filesystem for a while, as it is very stable and covers a very wide range of system sizes and usage patterns pretty well, but I might now (with regrets) move to XFS.
CentOS has had some things like XFS available in the CentOSPlus repository, and XFS has been a standard part of Scientific Linux 5 too (which also includes OpenAFS).