This document contains only my personal opinions and calls of judgement, and where any comment is made as to the quality of anybody's work, the comment is an opinion, in my judgement.
127.0.0.1
as the server address, and only the local server has to be
re-configured and/or restarted.
resolver is used by almost every application, and since I use several
browser-style applications an HTTP proxy service is also widely
required. Consider the latter case again: sure, the details of the web
proxy server can be configured in a proxy.pac file, or for simpler
applications in the http_proxy environment variable; but that is
already two places, and at least in the latter case applications have
to be restarted if the server contact details change. The overall
principle is that applications
don't really need to know which server provides the service,
just the address of some intermediary that knows. And here
there is a chance to talk about a bigger story, interposition,
and in particular the transparent one.
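To make the proxy.pac point above concrete, a minimal sketch (the
proxy name is an illustrative assumption): every application evaluates
this one function, so only this one file names the actual proxy server:

function FindProxyForURL(url, host)
{
    // All applications ask here; only this file knows the real proxy.
    return "PROXY proxy.Example.com:8080; DIRECT";
}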
hooks for functions). Another form of advising can be done at dynamic
link time, by using the GNU dynamic linker's LD_PRELOAD together with
the dlsym(RTLD_NEXT, ...) function call.
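As a minimal sketch of such advising (wrapping connect() is my
illustrative choice, not anything prescribed above), a preloaded shared
object can interpose on a libc function and then chain to the real one:

/* shim.c: build with `gcc -shared -fPIC shim.c -o shim.so -ldl`
   and run a program as `LD_PRELOAD=./shim.so program`. */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>
#include <sys/socket.h>

int connect(int fd, const struct sockaddr *addr, socklen_t len)
{
    /* Find the next (real) definition of connect() along the chain. */
    static int (*next_connect)(int, const struct sockaddr *, socklen_t);
    if (!next_connect)
        next_connect = dlsym(RTLD_NEXT, "connect");
    fprintf(stderr, "interposed connect() on fd %d\n", fd);
    return next_connect(fd, addr, len);
}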
opaque capabilities and semantics similar to those of the become:
method of the recently mentioned Smalltalk-80 language: given two
capabilities it swaps them everywhere. Therefore what can be done is
to create a capability, say I, to the process ADVISOR, the entity to
be interposed, and then exchange: it with the capability, say S, to
SERVER, the entity being interposed upon. The result is that S will
then denote ADVISOR and I will denote SERVER, so that all operations
on S intended for SERVER go to ADVISOR, and the latter has I denoting
SERVER with which to pass on, if necessary, such operations to the
originally denoted entity.
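In hypothetical code (every name here is invented for illustration; no
real capability API is implied), the swap reads something like:

/* Sketch: interposition by capability exchange. */
cap_t s = cap_lookup("SERVER");   /* S: what clients hold     */
cap_t i = cap_lookup("ADVISOR");  /* I: held by the advisor   */
cap_exchange(s, i);               /* now S denotes ADVISOR and
                                     I denotes SERVER, everywhere */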
127.0.0.1
as
the server address; in this way only the per-node servers need
to be reconfigured, and they are often less numerous and
easier to reconfigure than the applications. These per-node
servers are an extra level of indirection interposed between
the applications and the resources they need, and they buy
some valuable degrees of freedom (at the cost of some small
extra load on each node).
interpository servers could be the same in all workgroups, and to have NAT applied to each workgroup, under the plausible assumption that the clients in the workgroup do not need to be addressable from outside it. This practice still has merit for various reasons, but is no longer necessary.
SMTP.group1.Example.com or IPP.group1.Example.com, and then I
configure per-workgroup DHCP servers with a DNS server list like

DNS.group1.Example.com DNS.Example.com

and a resolver search path such as

group1.Example.com Example.com

(see the DHCP sketch after the SMTP example below), and then I
configure the per-node server dæmons with relative names like SMTP
and proxy and IPP (this last not strictly necessary, as IPP works well
with broadcasts). This achieves some desirable extra flexibility and
resiliency; for example, consider SMTP:
- All applications on a node are configured with localhost as their mail server name. The local server receives each message and queues it.
- The local server is configured with the relative name SMTP as the smart-host. This will resolve to the workgroup server or to the site server depending on what is available.
- If the node is moved to another workgroup, the SMTP relative domain name will just resolve to the one for the new workgroup without any changes.
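The DHCP side of the convention above might look like this sketch of a
dhcpd.conf fragment (the subnet numbers are illustrative assumptions;
ISC dhcpd resolves the server names when it reads its configuration):

subnet 10.0.1.0 netmask 255.255.255.0 {
    # workgroup DNS first, site DNS second
    option domain-name-servers DNS.group1.Example.com, DNS.Example.com;
    # relative names like SMTP resolve via this search path
    option domain-search "group1.Example.com", "Example.com";
    option domain-name "group1.Example.com";
}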
NAPTR and SRV resource records are used to provide applications with
pointers to which servers provide which services, but these resource
records are not widely supported by servers or applications. The
simple naming convention I use by default
can be used anyhow, and I have written some time ago a nice example of
using that and the new resource records to complement non-transparent
interposition.

The script that launches YumEx (/usr/share/yumex/yumex) reads as follows:
#!/bin/bash
#
# Run yumex main python program.
#
/usr/bin/python /usr/share/yumex/yumexmain.pyc $*

in which there are no less than three (and arguably four) quite objectionable entry-level misfeatures.
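Without presuming to enumerate all of them, a more careful version
would presumably quote the arguments ("$@" rather than $*), exec the
interpreter rather than leaving a useless shell process hanging
around, and avoid hardcoding one interpreter path:

#!/bin/sh
# Run the YumEx main python program, replacing this shell entirely.
exec /usr/bin/env python /usr/share/yumex/yumexmain.pyc "$@"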
PC1 to another PC an ISO disc image:

PC1$ ls -ld grml_0.8.iso
-rw-r--r-- 1 pcuser01 pcuser01 730511360 Oct 17 16:59 grml_0.8.iso

using both UDP:

PC1$ time /usr/bin/nc -u -v -v -v 10.0.22.22 3333 < grml_0.8.iso
PC2.Example.com [10.0.22.22] 3333 (?) open
sent 679460864, rcvd 0

real 0m9.025s
user 0m0.060s
sys 0m2.258s

and TCP:

PC1$ time /usr/bin/nc -v -v -v 10.0.22.22 3333 < grml_0.8.iso
PC2.Example.com [10.0.22.22] 3333 (?) open
sent 730511360, rcvd 0

real 0m6.846s
user 0m0.056s
sys 0m1.219s

So we have about 75.3MB/s in UDP (and the transmission ends early, as probably some error happened) and a remarkable 106.6MB/s in TCP.
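(For completeness: the listening end is not shown above; it would have
been something like PC2$ /usr/bin/nc -l -p 3333 > /dev/null, just
discarding the received bytes.)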
By using strace it appears that in both cases the sender and receiver
are using 8KiB blocks. I was also running vmstat 1 on both PCs. For
TCP on the sender the relevant section looks like:
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r b swpd   free   buff  cache  si so    bi bo    in   cs us sy id wa
 0 2 2432 375644 55096 319564   0  0 62244 56 14726 2070  1  6 57 36
 0 3 2432 298972 55172 395928   0  0 76616 36 17552 2363  1  8 14 77
 0 1 2432 225884 55248 469172   0  0 73032  4 16978 2187  1  7 21 72
 0 1 2432 151516 55328 543192   0  0 74312 16 17227 2145  0  7 57 36
 0 1 2432  75484 55404 619296   0  0 76108  0 17674 2185  1  8 57 34
 0 1 2432  16860 38384 694816   0  0 77004  0 17856 2212  1  9 58 33
 0 1 2432  15452 19060 715700   0  0 75592 76 17531 2940  1 10 57 33
 0 1 2432  16484 17980 717820   0  0 70724  0 16492 2134  1  9 56 34
 0 1 2432  15844 17968 721472   0  0 77260 12 18013 2309  1  9 58 33

That looks fine, reading about 70-75MB/s from the disk (PC1 has a
rather quick SAS hard disc) with a fairly low (considering...) 14,000
interrupts/s and 2,100 context switches/s, taking about 9% CPU time,
on a Core 2 at 1.9GHz. On the receiver there is a bit more strain:
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r b swpd   free   buff  cache  si so bi bo    in   cs us sy id wa
 1 0  236 320828 10352 455568   0  0  0  0  2184 1275  1  1 97  0
 0 0  236 320828 10352 455568   0  0  0  0 32286 5385  2 30 67  0
 1 0  236 320828 10352 455568   0  0  0  0 34059 5898  2 35 64  0
 1 0  236 320828 10360 455560   0  0  0 16 31124 4904  1 29 71  0
 1 0  236 320828 10368 455552   0  0  0 28 33264 5366  1 31 68  0
 0 0  236 320828 10376 455544   0  0  0 28 32893 5058  1 30 68  0
 0 0  236 320828 10376 455544   0  0  0  0 33104 5279  1 31 69  0
 1 0  236 320828 10376 455544   0  0  0  0 34030 6114  2 33 65  0
 1 0  236 320828 10384 455536   0  0  0 12 31005 4931  2 28 71  0
 0 0  236 320828 10384 455536   0  0  0  0 33149 5293  1 31 68  0
 0 0  236 320952 10384 455536   0  0  0  0 16770 2976  3 15 82  0

That is about 33,000 interrupts/s and 5,000 context switches/s, and that takes about 30% system CPU time on a Pentium 4 at 3.2GHz.
/dev/zero as data source between PC1 and SRV1 (which is a small
enterprise-level Dell server), and got some more fairly impressive and
interesting results. When sending from PC1 to SRV1 (time and vmstat 1
output, first on PC1 and then on SRV1):
PC1$ dd bs=4k count=250000 if=/dev/zero | time /usr/bin/nc -v -v -v SRV1 3333
SRV1.Example.com [10.0.5.120] 3333 (?) open
250000+0 records in
250000+0 records out
sent 1024000000, rcvd 0
0.15user 2.57system 0:09.32elapsed 29%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+269minor)pagefaults 0swaps

there is this vmstat 1 output on PC1 (sender):
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r b swpd  free   buff  cache  si so bi bo   in    cs us sy id wa
 0 0 2432 24068 38792 713128   0  0  0  0 1263   757  1  1 98  0
 1 0 2432 23956 38792 713128   0  0  0  0 5563 29552  2 10 89  0
 1 0 2432 23956 38792 713128   0  0  0  0 9601 57596  2 23 75  0
 1 0 2432 23764 38792 713128   0  0  0  0 9765 58627  3 23 75  0
 0 0 2432 23828 38792 713128   0  0  0  0 9717 57641  2 23 75  0
 0 0 2432 23860 38792 713128   0  0  0  0 9718 58734  2 24 74  0
 0 0 2432 23860 38808 713112   0  0  0 80 9807 58052  3 23 72  2
 0 0 2432 23860 38808 713112   0  0  0  0 9788 58260  3 23 74  0
 0 0 2432 23828 38808 713112   0  0  0  0 9844 59433  3 23 74  0
 0 0 2432 23828 38808 713112   0  0  0  0 9777 59181  2 23 74  0
 0 0 2432 24084 38808 713112   0  0  0  0 6380 34628  2 13 85  0
 0 0 2432 24084 38808 713112   0  0  0  0 1375  1195  1  1 98  0

and on SRV1 (receiver):
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r b swpd   free   buff  cache  si so bi  bo   in    cs us sy id  wa
 0 0    0 990992 510860 130560   0  0  0   0 1617  3213  0  0 99   0
 0 0    0 991000 510860 130560   0  0  0   0 7987 15598  0  4 96   0
 1 0    0 991000 510860 130560   0  0  0   0 9032 17733  1  5 95   0
 0 0    0 991000 510860 130560   0  0  0   4 9000 17638  0  5 95   0
 1 0    0 991000 510860 130560   0  0  0   0 9020 17658  1  5 95   0
 0 0    0 991000 510860 130560   0  0  0   0 8999 17496  1  5 95   0
 1 0    0 991000 510860 130560   0  0  0   0 9017 17548  1  5 95   0
 1 0    0 990872 510860 130560   0  0  0 312 9010 17806  1  4 95   0
 0 0    0 990936 510860 130560   0  0  0 264 9037 17821  0  5 95   0
 0 0    0 990936 510864 130556   0  0  0 224 9018 18008  1  5 95   0
 0 0    0 990936 510864 130556   0  0  0   0 2356  4447  0  1 99   0
 0 0    0 990872 510864 130556   0  0  0 276 1082  2105  0  0 100  0

The same for sending data from SRV1 to PC1:
SRV1$ dd bs=4k count=250000 if=/dev/zero | time /usr/bin/nc -v -v -v PC1 3333
PC1.Example.com [10.0.10.1] 3333 (?) open
250000+0 records in
250000+0 records out
sent 1024000000, rcvd 0
0.17user 2.75system 0:08.73elapsed 33%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+209minor)pagefaults 0swaps

with vmstat 1 output for PC1 (receiver):
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r b swpd  free   buff  cache  si so bi bo    in    cs us sy id  wa
 0 0 2432 28516 38960 708280   0  0  0  0  1195  1107  0  1 99   0
 0 0 2432 28548 38960 708280   0  0  0  0  1169   447  1  0 99   0
 1 0 2432 28548 38960 708280   0  0  0  0  6455 10919  1  4 96   0
 1 0 2432 28532 38960 708280   0  0  0  0 17553 33503  2 13 85   0
 1 0 2432 28524 38960 708280   0  0  0  0 17605 33397  2 12 86   0
 1 0 2432 28524 38960 708280   0  0  0  0 17693 34191  3 14 84   0
 1 0 2432 28524 38960 708280   0  0  0  0 17508 33054  1 12 87   0
 1 0 2432 28524 38960 708280   0  0  0  0 17424 32889  1 12 87   0
 1 0 2432 28524 38960 708280   0  0  0  0 17466 32963  1 12 87   0
 1 0 2432 28588 38960 708280   0  0  0  0 17446 32952  1  9 90   0
 1 0 2432 28588 38960 708280   0  0  0  0 17484 32994  1 11 89   0
 0 0 2432 28716 38960 708280   0  0  0  0  7439 12942  1  5 94   0
 0 0 2432 28716 38960 708280   0  0  0  0  1152   342  0  0 100  0

and on SRV1 (sender):
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r b swpd   free   buff  cache  si so bi  bo   in    cs us sy id  wa
 0 0    0 991000 510904 131036   0  0  0   0 1062  2135  0  0 100  0
 0 0    0 991064 510904 131036   0  0  0   0 1035  2097  0  0 100  0
 2 0    0 990616 510904 131036   0  0  0   0 7950 55000  1 11 88   0
 0 0    0 990680 510904 131036   0  0  0   0 9024 63327  1 13 86   0
 0 0    0 990616 510904 131036   0  0  0   0 9029 63309  1 12 87   0
 2 0    0 990552 510904 131036   0  0  0  32 9028 63410  1 12 86   0
 0 0    0 990488 510904 131036   0  0  0 328 9063 63548  1 13 86   0
 0 0    0 990488 510912 131028   0  0  0  72 9025 63361  1 12 87   0
 2 0    0 990488 510916 131024   0  0  0 460 9071 63884  1 12 87   0
 2 0    0 990488 510916 131024   0  0  0 128 9055 63380  1 13 85   0
 0 0    0 990680 510916 131024   0  0  0   0 7875 54301  1 11 88   0
 0 0    0 990680 510916 131024   0  0  0   0 1020  2143  0  0 100  0

In both cases around 1GB is transmitted in about 9s, that is over 100MB/s, which is pretty amazing. Interestingly high CPU and interrupt numbers though:
- On PC1 (CPU: 2x Core 2 @ 1.9GHz, NIC: Broadcom BCM5754) it takes nearly 9,000 interrupts/s (and 58,000 context switches/s) and 23% of CPU time to send, while the 17,000 interrupts/s and 32,000 context switches/s for receiving consume only 12% of CPU time.
- On SRV1 (CPU: 2x Xeon @ 3GHz, NIC: 2x Intel 82541GI/PI) receiving consumes only 5% of CPU time, with 9,000 interrupts/s and 17,000 context switches/s, while sending costs 12% of CPU time with 9,000 interrupts/s (same as for receiving) and 63,000 context switches/s.

The moral is that jumbo frames are no longer indispensable to achieve
decent bandwidth, as modern CPUs and systems can sustain amazingly
high interrupt and context-switch rates; also some network interfaces
seem to have fairly effective offloading and interrupt coalescing. But
jumbo frames are still rather useful for 10Gb/s Ethernet, especially
if lower CPU utilization and latency are both desired. Note also that
such numbers are unlikely to be achievable with consumer grade
equipment.
top:

top - 23:04:36 up 6:53, 5 users, load average: 0.88, 0.34, 0.15
Tasks: 150 total, 3 running, 144 sleeping, 0 stopped, 3 zombie
Cpu0 : 32.2%us, 6.6%sy, 0.0%ni, 60.5%id, 0.0%wa, 0.0%hi, 0.7%si, 0.0%st
Cpu1 : 52.7%us, 23.0%sy, 0.0%ni, 24.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 1033328k total, 985904k used, 47424k free, 0k buffers
Swap: 5140760k total, 32k used, 5140728k free, 577660k cached

  PID USER   PR NI  VIRT  RES  SHR S %CPU %MEM    TIME+ COMMAND
 6967 root   25  0  302m 209m  15m R   99 20.8 15:47.10 python
 8931 root   15  0  168m  38m  18m S    0  3.8  0:28.68 firefox-bin
 6891 root   16  0  122m  14m  10m S    0  1.5  0:05.84 nautilus
 6977 root   15  0  112m 7380 5968 S    0  0.7  0:00.05 trashapplet
 2790 nscd   16  0  106m 1048  772 S    0  0.1  0:00.09 nscd
 6975 root   16  0 96996  10m 7172 S    1  1.0  0:07.24 wnck-applet
 6889 root   15  0 86220  14m 9452 S    0  1.4  0:04.87 gnome-panel
 6994 root   15  0 84116  10m 7580 S    0  1.0  0:02.99 mixer_applet2
 6947 root   15  0 81704 8796 7316 S    0  0.9  0:00.08 nm-applet
 3025 root   15  0 70324  43m  12m S    0  4.3  0:01.01 httpd
 3114 apache 19  0 70324  31m  816 S    0  3.2  0:00.00 httpd
 3115 apache 18  0 70324  31m  808 S    0  3.2  0:00.00 httpd
 3116 apache 18  0 70324  31m  800 S    0  3.2  0:00.00 httpd
 3117 apache 18  0 70324  31m  800 S    0  3.2  0:00.00 httpd
 3120 apache 18  0 70324  31m  800 S    0  3.2  0:00.00 httpd
 3121 apache 18  0 70324  31m  800 S    0  3.2  0:00.00 httpd
 3122 apache 18  0 70324  31m  800 S    0  3.2  0:00.00 httpd
 3123 apache 20  0 70324  31m  800 S    0  3.2  0:00.00 httpd
 6909 root   15  0 41788 2576 1912 S    0  0.2  0:00.07 bonobo-activati
 7043 root   15  0 41476  16m 7928 R   12  1.6  2:40.36 gnome-terminal
 6925 root   15  0 40448 6784 5624 S    0  0.7  0:00.02 eggcups
 6868 root   15  0 31544 6796 5532 S    0  0.7  0:00.91 gnome-settings-
 6992 root   19  0 28580 9616 6652 S    0  0.9  0:02.90 clock-applet
 6828 root   15  0 23528  18m 4180 S    1  1.8  2:59.41 Xvnc

Since I am still configuring this laptop it starts X and GNOME by
default, but the virtual and resident memory sizes are quite
impressive, even for GNOME that is (the python process is the one
running YumEx).

Smalltalk, never mind the differences between the '72 and '76
languages with that word as part of their name. But I think that
matters for various direct and indirect reasons:
paths (like $PATH) to find resources, and scan them until they find
it. Dreaming even more, such paths would be dynamic, that is they
would be produced on demand.
/etc/resolv.conf, which is read by each application as it starts. This
means that if the path needs changing, applications should be
restarted. Also, in many cases clients for a service don't use paths,
but servers do, often as part of a master-slave, or cached, or
similarly clustered service relationship. Also, server unavailability
often hangs the client, which is another annoyance. In general, client
service access libraries are designed with less resilience and
tolerance for errors than server dæmons.
/etc/resolv.conf contains just

nameserver 127.0.0.1

and I run a DNS server instance on each client. This instance is
configured to use a workgroup DNS server as a forwarder, and that is
similarly configured as a slave for the site DNS servers. Similarly
for NTP, NIS (and even LDAP directories where useful), SMTP, HTTP
proxying, and of course IPP (where servers send each other browse
lists and client based servers thus autoconfigure). For MSA style
servers for protocols like POP3 and/or IMAP4 one can configure things
so that instead of having e-mail delivered to a single site or even
inter-site wide mailbox repository, it is delivered to per-group or
even per-PC servers. For example the MTA for an organization could, on
receiving a message for user@example.com, forward it for delivery to a
workgroup server as user@acct.example.com, which could either store it
locally or forward it again to the user's PC as
user@pc12.acct.example.com.
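As a sketch, the forwarding setup of such a per-client DNS instance
could be a named.conf fragment like the following (the forwarder
address, standing for the workgroup DNS server, is an illustrative
assumption):

options {
    listen-on { 127.0.0.1; };
    forwarders { 10.0.1.53; };  // the workgroup DNS server
    forward first;
};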
127.0.0.1 as the server address, and then the local server can be
reconfigured and restarted with minimal disruption of existing
applications. This is of course useful not just to minimize the impact
of service disruptions, but also for intentionally disconnected usage,
for example for laptops.

Tier | Services
---|---
Infrastructure services | Electrical power. Ambient cooling. Network cabling.
Basic services | Node-local storage. Node-local processing. Internal network bridging and routing. External network routing (hopefully not intersite bridging!). Network firewall.
System services | DHCP and/or BOOTP, TFTP. NIS or LDAP. NTP and DNS. Backup. Updates.
Basic user services | NFS and/or SMB. SMTP, POP3 and/or IMAP4. CVS and/or SVN.
Optional user services | DBMS access. Jabber and/or IRC, NNTP. LPR and/or IPP. VoIP.
Remote access services | FTP, RSYNC. HTTP, HTTP proxying. SSH, VPN.
:->), but not much work can continue if power is missing. At the same
time though the impact is larger: e-mail can be postponed, but if
nobody at a site can send e-mail for a long period of time this can
impact a lot of activities. I would think therefore that for smaller
scope services redundancy and isolation are more important (preventing
widespread interruptions), while for wider scope ones quick time to
repair matters more (making interruptions shorter).
message passing to indicate both the passing of messages (as opposed to the sharing of state) in distributed programming, which is its proper meaning, and dynamic overloading in ordinary, shared-state procedure calls.
classes.

Term | Smalltalk-72 | Smalltalk-76
---|---|---
object | A self-contained, potentially active entity that has an input stream (the transcript) it can parse, that can send tokens to the input streams of other objects, and that can be cloned. Each object potentially interprets its input stream differently and in parallel. | A passive value that is an instance of a type description called a class, which defines a set of procedures that can be applied to values instantiated from it.
script | A set of rules defining how to parse the object's input stream and react to it. | Not part of the language.
method | Not part of the language. | A procedure body bound to a particular name.
message | A stream of tokens copied from one active object to another. | A procedure name (called selector) and a list of parameters, following a passive value to which the procedure is to be applied.
message send | Copying a sequence of tokens to another object. | Finding the class of an object, looking up the name of the procedure in the class's table of methods, and calling the method if one exists with that name.
selector follows the value and precedes the arguments (infix syntax).
The Benefits of the Message Discipline
Adding a new class of data to a programming system is soon followed by the need to print objects of that class. In many extensible languages, this can be a difficult task at a time when things should be easy. One is faced with having to edit the system print routine which (a) is difficult to understand because it is full of details about the rest of the system, (b) was written by someone else and may even be in another language, and (c) will blow the system to bits if you make one false move. Fear of this often leads to writing a separate print routine with a different name which then must be remembered. In our object-oriented system, on the other hand, printing is always effected by sending the message printon: s (where s is a character stream) to the object in question. Therefore the only place where code is needed is right in the new class description. If the new code should fail, there is no problem; the existing system is unmodified, and can continue to provide support.
The class organization and message discipline ensure that if the original message protocol is supported, then all code outside the class will continue to work without even recompiling. Moreover, the only changes required will all be within the class whose representation is being changed.

Modularity is not just an issue of "cleanliness." If N parts depend on the insides of each other, then some aspect of the system will be proportional to N-squared. Such interdependencies will interfere with the system's ability to handle complexity. The phenomenon may manifest itself at any level: difficulty of design, language overgrown with features, long time required to make a change or difficulty of debugging. Messages enforce an organization which minimizes interdependencies.

The quotes above clearly describe the advantages of dynamic
overloading as opposed to using case or switch statements, and more
weakly those of the module-based decomposition paradigm. But look a
bit further down and the confusion is apparent:
Another benefit of leaving message interpretation up to the target objects is type independence of code.

This quote shows that the implementer of the system himself is still
under the delusion that in Smalltalk-76 each object independently
parses an input stream, where instead selectors are strictly resolved
by looking them up in class based tables, returning method addresses.
In the Rectangle example, the code will work fine if the coordinates are Integer or FloatingPoint, or even some newly-defined numerical type. This allows for much sharing of code, and the ability to have one object masquerade as another.

The above quote instead is about genericity of code, where a bit of
code is generic with respect to all the classes that implement the
same procedure names. Even more clearly here:

When a send bytecode is encountered, the interpreter finds the CompiledMethod indicated by the message as follows:
1. Find the message receiver. The receiver is below the arguments on the stack. The number of arguments is indicated in the send bytecode.
2. Access a message dictionary. The original message dictionary is found in the receiver's class.
3. Look up the message selector in the message dictionary. The selector is indicated in the send bytecode.
4. If the selector is found, the associated CompiledMethod describes the response to the message.
5. If the selector is not found, a new message dictionary must be searched (returning to step 3). The new message dictionary will be found in the superclass of the last class whose message dictionary was searched. This cycle may be repeated several times, traveling up the superclass chain.
Methods correspond to programs, subroutines, or procedures. Contexts correspond to stack frames or activation records. The final structure described in this section, that of classes, is not used by the interpreter for most languages but only by the compiler. Classes correspond to aspects of the type declarations of some other languages. Because of the nature of Smalltalk messages, the classes must be used by the interpreter at runtime.

It is just unfortunate that the Smalltalk-72 terminology is so
misleading when applied to Smalltalk-76 and its successors. By the
way, the other great terminological confusion is that in C++ the word
object means memory area and not class instance, but that's another
story...
overloading and genericity.

Overloading is a property of symbols usually, and happens when the
same symbol resolves to different values in different contexts. For
example in this C-like code fragment:

void f()
{
  int i;
  { float i; i = 1.0f; }
  { char i; i = '2'; }
  i = 3;
}

the symbol i, used as the name of three distinct variables, is
overloaded and is bound to three different values (all three of
reference type, but with different type constraints). Overloading
gives rise to no ambiguity if, as in the example above, context makes
clear which particular binding of the symbol is intended.
manifest, as in:

int sqr(const int x) { return x*x; }
float sqr(const float x) { return x*x; }

void f()
{
  int i = sqr(2);
  float j = sqr(2.0f);
}

but it can also be dynamic, when the type (or the value) is not
manifest. In principle static or dynamic overload resolution can be
applied to any one or all arguments; however in many if not most OO
languages static overloading applies to all parameters, but dynamic
resolution can only apply to the first, or distinguished, parameter.
class structure tend to have a special invocation syntax for class
methods, which is usually entirely futile, as an invocation of the form

objectRef.methodName(argsList)

is in effect an infix form just equivalent to the usual prefix form

methodName(objectRef,argsList)

(except of course in Actor systems). As mentioned above, the one
peculiarity of the infix form is that it distinguishes syntactically
the first argument, which is distinguished semantically in most
languages by being the only one on which dynamic overload resolution
can be done.
genericity, which is a property of a part of program text (and not of
a symbol like overloading, or of a program like semantics): a program
text is generic if the intersection of the overloadings of the symbols
contained in it is not a single interpretation. So for example this
program text:

return (x > 0) ? x : -x;

is generic as long as the return, >, unary - and ? : operators all
have more than one binding, and the intersection is multiple, for
example when they are all defined for int and float.
x can be of any type for which the operators listed above are defined.
The C++ language has almost unbounded, implicit genericity: the only
bound in a C++ template can be to restrict a free variable to be a
template, which is required for obscure syntactical reasons. So
something like:

max(a,b) { return (a > b) ? a : b; }

is valid and useful for many types of values, all those for which a
partial ordering operator is defined. However both overloading and
genericity also have a downside: they can be abused when the same
symbol or program text is used even if the semantics are very
different, thus inviting confusion. For example it is all right if in
the max example above the operator symbol > is overloaded with respect
to integers and floating point values, but it is rather less so if the
overloading is extended to binary trees, to indicate inclusion, as
that strains the similarity of semantics that makes overloading and
genericity useful as a shorthand.

webapps, with the added complication that they be accessed from an
Apache server used as reverse proxy. One could achieve this simply
with some Apache configuration like (note the two distinct
ServerNames):
<VirtualHost *:80>
  ServerName author.Example.com
  ServerAdmin webadm@Example.com
  DocumentRoot "/usr/share"
  ProxyPass / "ajp://localhost:8071/author"
  ProxyPassReverse / "ajp://localhost:8071/author"
</VirtualHost>

<VirtualHost *:80>
  ServerName WWW.Example.com
  ServerAdmin webadm@Example.com
  DocumentRoot "/usr/share"
  ProxyPass / "ajp://localhost:8071/publish"
  ProxyPassReverse / "ajp://localhost:8071/publish"
</VirtualHost>

that is forwarding two different domains to two different webapps on
the same Tomcat domain and port. But this is impossible if the local
part of a URL must be exactly the same in the Apache front-end and in
the Tomcat back-end, as when the webapps generate absolute paths in
their responses.
<VirtualHost *:80>
  ServerName author.Example.com
  ServerAdmin webadm@Example.com
  DocumentRoot "/usr/share"
  ProxyPass / "ajp://localhost:8071/"
  ProxyPassReverse / "ajp://localhost:8071/"
  ProxyPassReverseCookiePath "/" "/"
  ProxyPassReverseCookieDomain "localhost" "localhost"
  <Location />
    Options +IncludesNoExec +Indexes
  </Location>
</VirtualHost>

<VirtualHost *:80>
  ServerName WWW.Example.com
  ServerAdmin webadm@Example.com
  DocumentRoot "/usr/share"
  ProxyPass / "ajp://localhost:8072/"
  ProxyPassReverse / "ajp://localhost:8072/"
  ProxyPassReverseCookiePath "/" "/"
  ProxyPassReverseCookieDomain "localhost" "localhost"
  <Location />
    Options +IncludesNoExec
  </Location>
</VirtualHost>
The reason for using localhost is that Tomcat 5 is the back-end for an Apache front-end, and it is good to have it bound just to localhost.
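A sketch of the relevant binding (the address attribute, which
restricts the addresses the connector listens on, is the one
assumption here; the port is as in the examples below):

<Connector address="127.0.0.1" protocol="AJP/1.3" port="8071"/>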
webapps for the same domain on different ports with Tomcat is then
somewhat unobvious, because Tomcat has a slightly odd design and
anyhow its documentation is a bit confusing as to which bits and
pieces fit together and match which parts of the incoming request URL.

servlets, where usually the servlet is identified by the first
component of the local part of the URL. Things become a little more
complicated if there are multiple real or virtual hosts, and/or
different ports on the same domain name.
connectors, and exactly one engine;
- each connector is bound to a port;
- each engine has valves and realms, and at least one host;
- each host has webapps (in effect procedure names) bound to it.
protocol://domain:port/first/rest

its resolution to a servlet goes like this:

1. The port matches a connector in some service. This selects a particular service for further processing, and thus a particular set of valves and realms.
2. The domain matches a host in that service. If there is no HTTP/1.1 Host: header to indicate a name-based virtual host, then the default host in that service is matched.
3. Once the host is identified, its appBase directory is selected.
4. The first component of the local part selects a webapp in the selected directory, which is instantiated as a servlet and passed the rest of the local part of the URL as its parameters. A particular webapp may be instantiated into several different flavors of servlet by using a different context with different parameters for the instantiation.
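As a worked example (the URL is of course illustrative): for
http://WWW.Example.com:8081/shop/cart, port 8081 selects the service
whose HTTP connector is bound to it, the Host: header (or the default
host) selects a host and therefore its appBase directory, shop selects
the webapp, and /cart is passed on to the resulting servlet.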
ROOT, and there can be only one of those per appBase directory, no
matter how many addresses or ports the connectors are bound to.
ROOT webapp, but that is not really the case, because there can be a
different bound port and a different engine, each engine having hosts
with the same name. Thus binding distinct appBase directories on the
same domain to different incoming ports requires multiple services,
for example as follows:
<Service name="Backend-1"> <Connector debug="0" protocol="AJP/1.3" port="8071"/> <Connector debug="0" protocol="HTTP/1.1" port="8081" redirectPort="8091" acceptCount="10" connectionTimeout="20000" minProcessors="1" maxProcessors="5" compression="on" compressionMinSize="2048" compressableMimeType="text/html,text/xml,text/plain" noCompressionUserAgents="gozilla,traviata" scheme="http" secure="false" enableLookups="true" useURIValidationHack="false"> </Connector> <Engine debug="0" name="Engine-1" defaultHost="localhost"> <Valve className="org.apache.catalina.valves.FastCommonAccessLogValve" directory="/var/log/tomcat5" prefix="catalina_" suffix="_1.log" pattern="combined" resolveHosts="false"/> <Realm debug="0" className="org.apache.catalina.realm.UserDatabaseRealm" resourceName="UserDatabase"/> <Host debug="0" name="localhost" appBase="webapps1" unpackWARs="true" autoDeploy="true"> </Host> </Engine> </Service> <Service name="Backend-2"> <Connector debug="0" protocol="AJP/1.3" port="8072"/> <Connector debug="0" protocol="HTTP/1.1" port="8082" redirectPort="8092" acceptCount="10" connectionTimeout="20000" minProcessors="1" maxProcessors="5" compression="on" compressionMinSize="2048" compressableMimeType="text/html,text/xml,text/plain" noCompressionUserAgents="gozilla,traviata" scheme="http" secure="false" enableLookups="true" useURIValidationHack="false"> </Connector> <Engine debug="0" name="Engine-2" defaultHost="localhost"> <Valve className="org.apache.catalina.valves.FastCommonAccessLogValve" directory="/var/log/tomcat5" prefix="catalina_" suffix="_2.log" pattern="combined" resolveHosts="false"/> <Realm debug="0" className="org.apache.catalina.realm.UserDatabaseRealm" resourceName="UserDatabase"/> <Host debug="0" name="localhost" appBase="webapps2" unpackWARs="true" autoDeploy="true"> </Host> </Engine> </Service>The above two services (meant to be the authoring and publication side of a web CMS like Magnolia) are selected depending on a port match for the relevant
connector, for ports bound to the
localhost
address in both cases; but the two
engines have
hosts with different
appBase
s for the same domain name, into
which different webapps are symbolically linked in as
ROOT
.
just work thanks to automagic spanning tree buildup). Or it may be due to scope creep, where it is very tempting to just slap LANs together temporarily by bridging (for example when corporations merge), and little by little a monster arises.
> >> FreeBSD 6.0 NFS client seems keep trying NFS access forever after
> >> NFS TCP connection is (half?) closed.

as in closed by the server but not by the client, and the client would
freeze waiting for replies. The freeze was quite long, like 15-20
minutes, because the Linux TCP stack uses a very long timeout for
retries:
tcp_retries2 (integer; default: 15)
    The maximum number of times a TCP packet is retransmitted in established state before giving up. The default value is 15, which corresponds to a duration of approximately between 13 to 30 minutes, depending on the retransmission timeout. The RFC 1122 specified minimum limit of 100 seconds is typically deemed too short.

This long Linux default was quite inappropriate in the case because
all NFS-over-TCP connections were on the same site, with high
bandwidth and low latency. If one of the connection ends is closed for
whatever reason, the connection at the other end should be closed too
rather sooner, without so many retries.
Lowering tcp_retries2 would shorten the freeze accordingly. Another
option worth looking at is IP_RECVERR (defined in <linux/errqueue.h>),
as that for example should reflect ICMP unreachability notifications
to the application level:
Enable extended reliable error message passing. When enabled on a datagram socket all generated errors will be queued in a per-socket error queue. When the user receives an error from a socket operation the errors can be received by calling recvmsg(2) with the MSG_ERRQUEUE flag set.
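As a concrete sketch (the value is an illustrative choice, not a
recommendation), the retry count can be lowered with:

# echo 8 >| /proc/sys/net/ipv4/tcp_retries2

or equivalently with sysctl -w net.ipv4.tcp_retries2=8; with the
default exponential backoff that gives up after roughly 100 seconds
rather than tens of minutes.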
localhost as the server for all those services. By using search in
/etc/resolv.conf, in particular when using DHCP, it is possible to
make the configuration of group servers and clients almost identical
and quite flexible, minimizing maintenance effort, and maximizing
resilience and upgradeability. Sometimes I like to have the same
private address range in each workgroup, with SNAT at the workgroup
router; conceivably it might even be the 169.254.0.0/16 link local
subnet (discussed also in RFC 3927).
non-standard technology, I feel that the best option is to have both a per-client and a per-group file area, with the main one being the per-client one, and to take on the annoyance of backing up the per-client file area using something like Amanda or even just plain RSYNC (or systems based on it like rsnapshot).
better the devil you know. And I just found this fabulous example of VLAN advocacy:

These virtual networks offered a new networking solution that could offer QoS due to its priority scheme and cost-saving solutions for businesses. With VLANs, there was no longer a need to physically move around office employees to connect them to the same network as their counterparts on different floors. With the help of a few Ethernet switches, the employees on the tenth floor could get all their traffic tagged with the same VLAN ID as the employees on the second floor. This method allowed them to be on the same network "virtually", and thus eliminate the expensive moving costs.

which to me sounds quite peculiar (to say the least), as what matters is access to services, not to LANs, and the impression I get is reinforced by this fascinating argument:

Better performance than a routed network

A network that uses VLAN technology is a switched network and therefore will perform better than a routed network, mainly because of the routing overhead required.

In a switched network, when using a shared link, the size of each VLAN can be decreased, which results in fewer collisions because each VLAN is an independent collision domain as far as the network layer is concerned.

Furthermore, it is also possible to group a large LAN to smaller VLANs and reduce broadcast traffic overall as each broadcast will be sent on to the relevant VLAN only.

Apart from the the routing overhead required bit, which seems to
confuse latency with bandwidth (and to ignore that routing delivers
much better locality of traffic), it is somehow forgotten that
essentially all current Ethernet networks are based on switching and
full duplex links, and there are no collisions.
A VLAN is an administratively configured LAN or broadcast domain. Instead of going to the wiring closet to move a cable to a different LAN, network administrators can accomplish this task remotely by configuring a port on an 802.1Q-compliant switch to belong to a different VLAN. The ability to move endstations to different broadcast domains by setting membership profiles for each port on centrally managed switches is one of the main advantages of 802.1Q VLANs.

I especially like the Instead of going to the wiring closet to move a
cable bit, which blithely abstracts away a very big difference between
having two physical LANs or a single physical LAN shared by multiple
VLANs.
Taking a pristine hard drive and installing a minty fresh copy of Windows XP, we made a conscious decision to banish all thoughts of autoupdates, Windows firewall and anti-malware protection.
Internet Explorer awakened with nary a sniffle or judder, and off we went in pursuit of webbery. For a couple of hours.
Unknown programs loading into the system memory popped up on-screen and vanished as quickly, but left us with the feeling the test was nearly over. And so it proved, when IE would only work in ten minute bursts, its short lifetime filled with dire warnings of impending doom and an infinity of smiley-hawking popups. Each collapse required a reboot. On a Spybot check, 188 items of malware were detected: enough to render the PC unusable.

The final time of death? Two hours thirty minutes.

Seems pretty clear to me. There are indeed lots of people who think
that owning other people's PCs is profitable (and spyware is a quite
profitable business, unfortunately).
.org area.
buyer beware, but when buyers do not, everybody suffers in a highly networked society, as trojans and spyware programs have detrimental consequences well beyond their buyers.
trusted computing) and third parties. Control of many people's PCs is an increasingly valuable goal.
* 1394c uses the same physical cable as 1000BASE-T, 100BASE-TX, 10BASE-T
* Same connector and pin-out

which sounds to me like a clever way to recycle proven mass market
hardware for a proven but less popular IO protocol.
This is by no means the first time I've witnessed this process whereby outsourcing effectively fossilises a company's IT infrastructure - the finance director makes the outsourcer take on the role of a refusenik, dreaming up all sorts of reasons why this or that enhancement or economy of scale can't be considered, because it will trigger an avalanche of contractual and staffing changes in the outsourcer.

Designing effective incentives is often difficult, the more so in IT
projects, which are usually part of cost, not profit, centres.
Between 40% and 50% of Brits won't make the effort to reclaim their devices from lost property offices, preferring to let their company pick up the bill or claim the costs back from their insurance company, [ ... ] Heathrow alone has around five laptops and ten mobile phones handed in each day.
There was a great deal of work put into the original 6-pin FW400 connector so that it would be mechanically robust. One of the primary features is that the moving parts were all in the *plug* (all connectors have moving parts ... those little springy bits that apply pressure and make sure the connection is good and tight). This way the part of the connector/socket system that wears out is in the cable. When something goes wrong (as it will in any mechanical system), you throw out the cheap, easily replaceable component ... i.e., the cable. Unfortunately, with the 4-pin design, the moving parts are in the socket ... so when you buy a slightly off-spec cable at your favorite cheapo retailer and plug it into your camcorder, you might ruin the camcorder. With the 6-pin design, you ruin the cable.

Ideally the prongs would be in the cable and the laptop would have the
holes. A smart friend pointed out that as a rule active electrical
components have the holes, as that's safer. But then one could have
the prongs in a recessed position. Not a trivial issue though.
You set a parameter for a style element, and that setting falls to the next element unless you provide it with a different element definition. This sounds like a great idea until you try to deconstruct the sheet. You need a road map. One element cascades from here, another from there.

Well, yes, but the problem is not so much with cascading as that the
specific rules are somewhat odd.

One wrong change and all hell breaks loose.

Well, that's true of most notations, and does not seem a good point to
me; debugging is necessary with CSS as with anything else. However
this subsidiary point is interesting:

If your Internet connection happens to lose a bit of CSS data, you get a mess on your screen.

Indeed, and highly structured markup of this sort is not that
resilient to mishaps. However the fix is usually easy: reloading.

The real problem is that no two browsers (let alone two versions of any one browser) interpret CSS the same way! The Microsoft browser interprets a style sheet one way, Firefox interprets it another way, and Opera a third way. Can someone explain to me exactly what kind of "standard" CSS is, anyway? There actually are Web sites that mock this mess by showing the simplest CSS code and the differing results from the three main browsers and the Safari and Linux browsers. The differences are not trivial. And because of the architecture of this dog, the bugs cascade when they are part of a larger style sheet, amplifying problems.

Well, again a highly structured notation like CSS is indeed fragile to
mishaps, and implementation problems happen with everything, so the
following conclusion:

And what's being done about it? Nothing! Another fine mess from the standards bodies.

seems unjustified. But there is a good point here: the definition of
CSS is subtle and easy to misimplement. I reckon that there is little
evidence that higher technical ability leads to greater market
success, and standards therefore should be simple and easy to
understand, and difficult to misunderstand.
professional cameras seem to have a flash hotshoe, and they tend to be
both bigger and more expensive than I'd like. I blame dumb consumers
for not realizing how important it is to have the ability to use a
flash with a significant offset from the camera lens (in particular:
hold the camera in the right hand, and the flash in the left hand,
high and pointing down, or vice versa pointing up to the ceiling).

xD memory cards, which are less common and slower than the more common
SD ones; but then there is a faster grade of xD memory cards. Too bad
that very few cameras nowadays use CompactFlash memory cards, as they
are rather preferable: they come in bigger sizes, and since they have
an ATA interface they can be rather quicker. Also, because of the ATA
interface it is easier for the card manufacturer to put in sparing
logic for overused blocks.
iSCSI and SAN and why bother, and as to iSCSI, the difference between
bus protocol and command protocol.

bus) between the host and storage device, and is about transferring
bytes between host and storage, whether the bytes be commands or data.

Before integrated drive electronics the bus protocol was also the
command protocol, with operation codes like read, write or seek being
explicit at the bus level; currently instead command or data packets
are encoded, transferred over the bus, and decoded by the IDE part of
the storage device. Which can be very sophisticated: contemporary disc
drives can have hundreds of thousands to millions of lines of firmware
code, which runs on embedded operating systems on fast 16 and even 32
bit CPUs with some MiBs of RAM.
tablespaces and not files, even if perhaps on recent Linux kernels this is no longer necessary (or one can use OCFS2).
I have a 6 disk RAID5, I just added another disk, I plan to use the kernel feature in 2.6.17 to "hot-add" an additional disk to the array (not a spare), then I plan to use xfs_growfs to grow the filesystem to the maximum size it can support.

Fascinating example of having a 5+1 array and turning it into a 6+1
array because we can.
My RAID is an IDE-SCSI system -- IDE disks, looks like SCSI to the host. It has 8 80GB drives in RAID 5, for about 560GB of usable space. I'd like it to be just one partition.

A 7+1 RAID5 is a good way to push the boundaries, and making a single
system out of it demonstrates courage :-).
All I know at this time for my RAID array is that there are 12 disks of 500 GB, so 11 * 500 GB of usable space (last disk for checksum), and it's not enough to know which value to set to sunit, it seems that sunit must not be equal to the size of one of the disk in my RAID array. I heard here and there about chunk size and stripe size. But I don't know what is the chunk size of a RAID array and how can I know it, (it's not me that bought this RAID array),

At least the author did not set up this seemingly insane 11+1
arrangement himself, but seems to have forgotten that in RAID5 the
parity is interleaved...
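For the record, the usual arithmetic, as a sketch assuming a 128KiB
chunk and 11 data disks (sunit and swidth are in 512B sectors, and
/dev/md0 is an illustrative device name):

# 128KiB chunk = 256 sectors; 11 data disks: swidth = 11*256 = 2816
mkfs.xfs -d sunit=256,swidth=2816 /dev/md0

which matches the numbers in the next quote.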
I have a 12 disk HW raid 5 with 128K stripe size. I built my 4k block XFS volume with sunit=256,swidth=2816. Everything is peachy ... or is it?

Good question! :-)
I have two external JBODs with 12 disks each. Each JBOD has two channels, 6 disks per channel, and each channel is connected to a QLogic ISP 10160 controller. Each of the JBODs is built as an md raid5 (md1 and md2). Both raid5s are mirrored (md3).

Not only an 11+1 setup, but two, mirrored ones. Perfect combination of
low performance and low resilience.
but most people will have all disks configured in one RAID5/6 and thus it is not parallel any more.

hope this does not hold true for a 15x750GB SATA raid5. ;)

Good that it is understood that RAID5 has dire write performance
implications. I agree sadly that as stated most people will default to
a wide RAID5, and it is quite entertaining to see someone going for a
RAID5 as wide as 14+1. Way not to go!
The box contains 16x750GB SATA drives combined into a single 11TB raid5 partition using md, and this partition contains a single XFS filesystem.

We are starting to get serious here: a single filesystem (that is, if
you lose any part of it, it is gone as a whole) on a 15+1 SATA RAID5.
I would expect all the 16 drives to be of the same brand and model,
and even from the same shipping carton :-).
This is in fact a 120 TB (not GB) filesystem that I am trying to build. What I am attempting to do is to take 80 1.6 TB arrays ((8 x 250 GB Raid 5 arrays) 10 arrays from 8 separate SAN's). Use LVM to make one large volume, then use xfs to format and mount this as a single filesystem. Any advice - or gotcha's would be appreciated.

This is probably my favourite. Not only is the base unit a 7+1 RAID5,
but 80 of them are linearly concatenated into a single filesystem.
ext2 usually performs very well for that too, except that it and ext3
handle overwrites and resist fragmentation pretty badly, and thus
become seek-bound even on (logically) sequential operations. ext2 and
ext3 also issue many more IO operations per second than the others.

JFS demonstrated low write throughput. We discovered that this was partially due to truncating a multi-gigabyte file taking several minutes to execute.
However, the truncate time made up only half the elapsed time of each test. Hence, even if we disregarded the truncate time, JFS would still have had the lowest sustained write rate of all the filesystems.
controller what is actually a host adapter, but that is an incorrect
use. It survives because a long time ago PC disc drives were not IDE,
and the controller was actually integrated on the WD100[36] host
adapter, not the disc drive.

IDE qualified by the type of interface, as in IDE/ATA.
write barriers. There are two variants of tagged queueing, known as
TCQ in the SCSI command protocol and NCQ in the SATA command protocol.

tagged queueing was essential to achieving good disc read performance
with many readers at different points of a disc. The reason for this
is that tagged queueing allows the host adapter or the disc itself to
reorder requests in a way that minimizes arm movements, trading a bit
of latency for throughput. Without tagged queueing requests cause
enough arm movement that the aggregate bandwidth is severely reduced.
The minor effect of tagged queueing is also that requests get done
asynchronously (as it implies mailboxing), and thus can complete in a
different order than the one they are issued in, which helps reclaim
some latency lost to the batching of requests by position.
elevator algorithm to sort requests. For a while Linux has had five
different elevator algorithms, and here is the difference between two
of them (using a slightly different test from the previous one). First
with the noop elevator and 4 streams:

# echo noop >| /sys/block/hdj/queue/scheduler; sh streams.sh 4 /dev/hdj
# 250000+0 records in
250000+0 records out
1024000000 bytes (1.0 GB) copied, 213.285 seconds, 4.8 MB/s
250000+0 records in
250000+0 records out
1024000000 bytes (1.0 GB) copied, 221.321 seconds, 4.6 MB/s
250000+0 records in
250000+0 records out
1024000000 bytes (1.0 GB) copied, 221.379 seconds, 4.6 MB/s
250000+0 records in
250000+0 records out
1024000000 bytes (1.0 GB) copied, 220.533 seconds, 4.6 MB/s

This is just 18.6MB/s in aggregate, but with the
anticipatory
scheduler throughput is a lot better:
# echo anticipatory >| /sys/block/hdj/queue/scheduler; sh streams.sh 4 /dev/hdj
# 250000+0 records in
250000+0 records out
1024000000 bytes (1.0 GB) copied, 64.3318 seconds, 15.9 MB/s
250000+0 records in
250000+0 records out
1024000000 bytes (1.0 GB) copied, 68.168 seconds, 15.0 MB/s
250000+0 records in
250000+0 records out
1024000000 bytes (1.0 GB) copied, 68.0298 seconds, 15.1 MB/s
250000+0 records in
250000+0 records out
1024000000 bytes (1.0 GB) copied, 67.5332 seconds, 15.2 MB/s

Well, surprises never end. It looks like the anticipatory elevator
batches and reorders as much as the elevator in the disc itself or the
host adapter. I still prefer having a host adapter or discs with
tagged queueing (and thus
mailboxing) as that allows reordering with lower latency and out-of-order completion, which are quite useful (except that some host adapters use very slow CPUs and thus they can add to latency by being slow to process requests, sometimes adding several milliseconds to each one). I have also tried the
deadline
elevator,
which has performance similar to the noop
one,
and again the cfq
elevator, which performs
sometimes like noop
and sometimes like
anticipatory
(probably depending on the order in
which the 4 dd
processes are scheduled).
The streams.sh script used above looks like:
#!/bin/bash

let I=0
let N=${1-'4'}
let C=250000

while [ $I -lt $N ]
do
    (let S=$I*$C
     let D=$N-$I
     sleep $D && dd if=${2-'/dev/hda'} of=/dev/null bs=4k count=$C skip=$S)&
    let I=I+1
done
tagged queueing in the server's storage subsystem and some kind of acceleration in the clients' video cards, drivers and movie player.
base# sh streams.sh 1 /dev/hdj1
base# 100000+0 records in
100000+0 records out
409600000 bytes (410 MB) copied, 6.6336 seconds, 61.7 MB/s

These 61.7MB/s are more or less the full speed of the underlying disc.
# sh streams.sh 4 /dev/hdj1
# 100000+0 records in
100000+0 records out
409600000 bytes (410 MB) copied, 56.239 seconds, 7.3 MB/s
100000+0 records in
100000+0 records out
409600000 bytes (410 MB) copied, 58.2396 seconds, 7.0 MB/s
100000+0 records in
100000+0 records out
409600000 bytes (410 MB) copied, 57.2401 seconds, 7.2 MB/s
100000+0 records in
100000+0 records out
409600000 bytes (410 MB) copied, 55.3099 seconds, 7.4 MB/s

With 4 streams this disc achieves in aggregate less than half the
bandwidth of a single stream. By comparison here is the video server,
with a 3ware host adapter and two equivalent 250GB discs in a RAID0
arrangement (for extra bandwidth of course):
# sh streams.sh 1 /dev/sda
# 100000+0 records in
100000+0 records out
409600000 bytes (410 MB) copied, 4.29543 seconds, 95.4 MB/s

The 95.4MB/s is quite good: without any special care RAID0 delivers a
bandwidth increase of over 50%, and with 4 streams things are even
better:
# sh streams.sh 4 /dev/sda
# 100000+0 records in
100000+0 records out
409600000 bytes (410 MB) copied, 16.1709 seconds, 25.3 MB/s
100000+0 records in
100000+0 records out
409600000 bytes (410 MB) copied, 15.7893 seconds, 25.9 MB/s
100000+0 records in
100000+0 records out
409600000 bytes (410 MB) copied, 16.7883 seconds, 24.4 MB/s
100000+0 records in
100000+0 records out
409600000 bytes (410 MB) copied, 14.7877 seconds, 27.7 MB/s

Not only is the aggregate bandwidth not halved, it is actually higher
(103.3MB/s), probably because of better exploitation of the RAID0
parallelism.
HDR rendering, which is partially misnamed, and is not really a
technique, but the removal of a limitation or a bug. The idea is that
in computing the luminosity of each pixel the final result needs to be
clipped, but not the intermediate values. So for example if a pixel
reflects 20% of incident light and there are two 80% light sources,
the luminosity of the pixel should be 32% and not 20%. To achieve this
graphics cards were upgraded from 8 bit integer color intensity values
to 32 bit floating point ones, for example to support Shader Model 3.0
in DirectX 9.0, which of course can be a lot slower (four times as
many bytes to process, and floating point instead of 8 bit integer).
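The 32% versus 20% arithmetic above, as a small C sketch of clipping
at the wrong point and at the right one:

#include <stdio.h>

int main(void)
{
    float reflect = 0.20f, source1 = 0.80f, source2 = 0.80f;
    float incident = source1 + source2;   /* 1.6: exceeds "full" intensity */

    /* Clipping the intermediate value loses the excess illumination... */
    float clip_early = (incident > 1.0f ? 1.0f : incident) * reflect;
    /* ...while computing in full range and clipping only at the end
       preserves it. */
    float late = incident * reflect;
    float clip_late = late > 1.0f ? 1.0f : late;

    printf("clipped early: %.2f, clipped late: %.2f\n", clip_early, clip_late);
    /* prints "clipped early: 0.20, clipped late: 0.32" */
    return 0;
}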
HDR is a misnomer for the idea of doing intermediate calculations in a
higher precision than the final result: there is no dynamic aspect to
this, and arguably the range is not high either, because the range of
the final result does not change.

HDR which simulates the adjustment of the eye to light, so that if one
looks at a bright part of a scene the apparent light level goes down
(as the eye adjusts to brightness) and if one looks at the dark parts
of a scene the apparent light level goes up (as the eye adjusts to
darkness). This is because the eye has a limit to the extent of light
levels it can perceive, and dynamically adjusts its sensitivity up or
down the scale.

HDR. This one works around the intensity range limits of cameras and
screens by photographing the same scene a number of times with
different exposures, and then combining the different photographs.
This brings both the darker and lighter parts of the scene towards the
middle of the range, so that in the resulting composite they seem to
have much the same intensity. This has a very interesting effect,
making the whole scene far more detailed and colourful than it appears
otherwise, to both camera and eye. Part of the reason is that very
dark or very bright areas do not appear colored to the eye (also
because at low light levels the eye switches to black and white
vision).