Software and hardware annotations 2008 January
This document contains only my personal opinions and calls of
judgement, and where any comment is made as to the quality of
anybody's work, the comment is an opinion, in my judgement.
- 080122
Number of NFS server instances
- Recently a performance issue with an NFS server
reminded me of a classic episode from many years ago. The
recent issue was an NFS server being very, very slow at
serving data to some clients, at around 1MB/s read or write,
even though its hardware was not overloaded and could manage
over 50MB/s locally. Tracing the NFS traffic with the classic
tcpdump showed a few seconds of continuous packet traffic
broken by several seconds of pause, resulting in a low average
rate. This pointed to some bottleneck on the server itself,
which could otherwise transfer data over the network at its
full 1Gb/s interface speed. It then transpired that a highly
parallel job had been started on a nearby cluster and its
processes were all doing IO at the same time; since there were
24 of them, they were monopolizing the 16 NFS server
processes, which had become the critical resource.
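The arithmetic of the contention is easy to sketch. The small
model below is only illustrative, not a measurement: the 16
server threads, the 24 cluster processes and the roughly 50MB/s
local rate are the figures from above, and the couple of
additional clients are an assumption:

    # Simplified model of nfsd thread contention: once there are more
    # concurrently active clients than server threads, the server's
    # aggregate rate is split among all of them and each client also
    # spends part of its time waiting for a free thread.

    nfsd_threads = 16      # NFS server processes configured on the server
    cluster_procs = 24     # parallel cluster processes all doing IO
    other_clients = 2      # other clients (assumed), e.g. those that complained
    server_mb_s = 50.0     # roughly what the server storage can do locally

    active = cluster_procs + other_clients
    share_of_threads = min(1.0, nfsd_threads / active)
    per_client_mb_s = server_mb_s / active

    print(f"{active} active clients on {nfsd_threads} server threads")
    print(f"fraction of time a client holds a thread: {share_of_threads:.2f}")
    print(f"fair share of the server's rate per client: {per_client_mb_s:.1f} MB/s")

Which, once NFS protocol and seek overheads are added on top of
the fair share, is roughly consistent with the 1MB/s rates and
long pauses that were observed.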
Which reminded me of an amusing episode long ago: on
some Sun 3 server the CPU would become very busy as soon as
more than 8 processes, and in particular NFS processes, became
active. I found that there was a page table cache with
capacity for the page tables of 8 processes, and that evicting
a table from the cache and loading a new one was rather
expensive in CPU time; neither of these was a problem in itself.
The big problem was that while processes were scheduled
FIFO (round robin), the page table cache had an LRU
(least recently used) replacement policy,
which guaranteed that with more than 8 active processes
there would be a page table cache eviction at
every context switch, as the entry evicted was always the one
that would be needed next. This because the OS team had read in
some textbook that processes should be scheduled FIFO, the
hardware team that caches should be managed LRU, and neither
could see that the combination would be catastrophic.
How did Sun solve the issue? Well, as far as I know it was
never solved for the Sun 3, and for the Sun 4 the design
team had a beautiful idea: to increase the number of page table
cache slots to 64, while keeping the LRU replacement policy for
the cache. Fortunately at the time there were few situations
where more than 64 processes would become active.
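The interaction of the two policies is easy to demonstrate with
a small simulation (a sketch of round-robin scheduling against
an LRU-managed table cache, not of the actual Sun MMU): one
process over the cache capacity is enough to force a page table
reload at every context switch, whether the cache has 8 slots
or 64.

    from collections import OrderedDict

    def miss_rate(num_processes, cache_slots, rounds=100):
        """Simulate a page-table cache with LRU replacement under strict
        round-robin scheduling; return the fraction of context switches
        that need a (costly) page-table reload."""
        cache = OrderedDict()          # process id -> present, in LRU order
        misses = switches = 0
        for _ in range(rounds):
            for pid in range(num_processes):   # round-robin order
                switches += 1
                if pid in cache:
                    cache.move_to_end(pid)     # hit: mark most recently used
                else:
                    misses += 1
                    if len(cache) >= cache_slots:
                        cache.popitem(last=False)   # evict least recently used
                    cache[pid] = True
        return misses / switches

    for procs, slots in ((8, 8), (9, 8), (16, 8), (64, 64), (65, 64)):
        print(f"{procs:3d} processes, {slots:2d} slots: "
              f"miss rate {miss_rate(procs, slots):.2f}")

With 8 processes on 8 slots the only misses are the initial cold
ones; with 9 processes the miss rate jumps to 100%, and likewise
for 65 processes on 64 slots.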
- 080119
Another difficulty with centralized computing
- In general, as already argued, I prefer
Internet-like network service structures
composed of many relatively small, workgroup-sized cells
connected by a (hierarchically meshed) backbone. The main
reasons are resilience and flexibility, as problems tend to be
local rather than global, and the sysadm overhead is not that
large if one follows reasonable tools and practices of mass
sysadm.
But there are some other concerns, which are about
scalability of system administration:
- Performance tuning
- The difficulty and inconvenience of performance tuning
grows rather faster than capacity, because if a load is
partitionable the easiest strategy is to throw discrete bits
of hardware at it. Consider serving files to 400 concurrent
users from 20 servers each supporting 20 users, or from a
single file server. Providing ample network and storage
bandwidth for 400 users is very hard, while providing it for
20 is almost trivial:
- For example, one 1Gb/s network interface might well be
enough to support 20 users, but how easily can one provide the
equivalent of 20 1Gb/s network interfaces on a 400 user file
server? One would have to use perhaps 3-4 10Gb/s interfaces and
then a distribution network to the edge user stations (the
arithmetic is sketched after this list).
- The same applies to storage bandwidth: the sort of
subsystem that can adequately serve 20 users is currently
a simple RAID with a few drives and a single filesystem of
a few TiB. A storage system capable of providing enough
capacity and especially bandwidth for 400 users is a
rather bigger undertaking, and it takes significant effort
to tune, or significant cost to buy as a ready-optimized package.
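As a back-of-envelope check of the interface example above: the
only assumed figure in the sketch below is the 20 users per
1Gb/s interface from the text, the rest is arithmetic (the 3-4
interfaces mentioned allow some headroom over the raw minimum).

    # Back-of-envelope check of the interface arithmetic above; the
    # 20 users per 1Gb/s figure is the one assumed in the text.

    users_per_gbit = 20
    total_users = 400

    per_user_gbit = 1.0 / users_per_gbit            # ~50 Mb/s each
    aggregate_gbit = total_users * per_user_gbit    # total for 400 users
    ten_gbit_ifaces = aggregate_gbit / 10           # raw minimum of 10Gb/s ports

    print(f"per user: {per_user_gbit * 1000:.0f} Mb/s")
    print(f"aggregate for {total_users} users: {aggregate_gbit:.0f} Gb/s")
    print(f"10Gb/s interfaces needed (before headroom): {ten_gbit_ifaces:.0f}")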
But there is a bigger point that reinforces both of the
above: insufficient or poorly tuned capacity has a much
bigger impact in the centralized case, because it affects
everybody; in particular, performance in the central case
must be high enough to satisfy the most demanding user,
while in the decentralized case one need only worry about
the cells where the demanding users are (and they are often
clustered by job).
- Configuration maintainability
- Much the same argument can also be made about system
administration: any suboptimal configuration, and not just
performance configuration, impacts everybody, and the
configuration of the central system must then be the union of
those suitable for all users, including the most demanding
ones. So for example a central server must not only have
every possible package installed, but also several different
versions of each package, as different users will require
different ABI levels because they use binary packages with
different version dependencies, and so on.
Of course if centralized capacity can be tuned and configured
optimally for everybody, and kept tuned and configured optimally
over time, then one gets the best possible outcome.
That is not however what my experience suggests: most
organisations are somewhat disjointed, and inevitable
imperfections and mishaps prevent perfect execution on a global
scale. Good execution in a few locales where it really
matters is already a challenging goal in an imperfect world, but
at least it is a goal that usually captures most of the business
benefit and is often more easily achievable.