This document contains only my personal opinions and calls of judgement, and where any comment is made as to the quality of anybody's work, the comment is an opinion, in my judgement.
It may seem incredible, but even quite smart people have been baffled by the inanity of the changes made to the X Window System architecture to introduce the so-called RANDR extension, in part because they are a radical change, in part because they have been rather objectionable, and went through a series of poorly designed and incomplete iterations.
As a background it used to be that the X architecture was fairly clear:
A display, usually but not necessarily display :0, was configured as a Layout, that is a set of input devices and a set of screens, that is output devices. Thanks to virtual consoles a system could have multiple displays offering completely different setups.
A screen would have a device and a monitor type for the output device associated with the device, and a set of properties, including the bit depth and bit size of the frame buffer, and the supported video modes (the intersection of those supported by the device and the monitor).
A device represents a frame buffer containing a frame to display on the monitor with the associated type.
A monitor, or more precisely a monitor type, defines the characteristics of a monitor attached to a device. The main purpose of a monitor was to define the optical properties of the displayed frame, such as its size in millimetres and its pixel density, as well as its electronic properties, most notably the maximum bandwidth of its electronics and, for monitors with a moving electron beam (CRTs), the pause times those electronics required.
I have put several examples of X11 configuration elsewhere on this site and here are the outlines of three common cases:
Section "Monitor" Identifier "TypeA" VendorName "generic" ModelName "LCD 23in" Gamma 2.2 DisplaySize 509 286 HorizSync 30-83 VertRefresh 56-75 #Bandwidth 155 EndSection Section "Device" Identifier "Card0" Driver "vesa" EndSection Section "Screen" Identifier "Screen0" Monitor "TypeA" Device "Card0" Subsection "Display" Modes "1920x1080" "1366x768" "1024x768" EndSubsection EndSection Section "ServerLayout" Identifier "Generic" Screen 0 "Screen0" InputDevice "Mice" "CorePointer" InputDevice "Keyboards" "CoreKeyboard" EndSection
Section "Monitor" Identifier "TypeA" VendorName "generic" ModelName "LCD 19in" Gamma 2.2 DisplaySize 376 301 HorizSync 31-81 VertRefresh 56-75 #Bandwidth 155 EndSection Section "Device" Identifier "Card0" VendorName "generic" BoardName "GeForce" BusID "PCI:1:0:0" Driver "nvidia" Screen 0 EndSection Section "Device" Identifier "Card1" VendorName "generic" BoardName "Radeon" BusID "PCI:2:0:0" Driver "radeon" Screen 1 EndSection Section "Screen" Identifier "Screen0" Device "Card0" Monitor "TypeA" Subsection "Display" Modes "1280x1024" "1024x768" EndSubsection EndSection Section "Screen" Identifier "Screen1" Device "Card1" Monitor "TypeA" Subsection "Display" Modes "1280x1024" "1024x768" EndSubsection EndSection Section "ServerLayout" Identifier "Layout2" Screen 0 "Screen0" Screen 1 "Screen1" RightOf "Screen0" InputDevice "Mice" "CorePointer" InputDevice "Keyboards" "CoreKeyboard" EndSection
Section "Monitor" Identifier "TypeA" VendorName "generic" ModelName "LCD 13in" Gamma 2.2 DisplaySize 286 178 HorizSync 50 VertRefresh 60 #Bandwidth 83 EndSection Section "Monitor" Identifier "TypeB" VendorName "generic" ModelName "LCD 24in or projector" Gamma 2.2 DisplaySize 518 324 HorizSync 24-94 VertRefresh 48-85 #Bandwidth 250 EndSection Section "Device" Identifier "Card0Fb0" VendorName "generic" BoardName "nVidia" BusID "PCI:1:0:0" Driver "nvidia" Screen 0 EndSection Section "Device" Identifier "Card0Fb1" VendorName "generic" BoardName "nVidia" BusID "PCI:1:0:0" Driver "nvidia" Screen 1 EndSection Section "Screen" Identifier "Screen0" Device "Card0Fb0" Monitor "TypeA" Subsection "Display" Modes "1366x768" "1280x800" "1024x768" Depth 16 EndSubsection EndSection Section "Screen" Identifier "Screen1" Device "Card0Fb1" Monitor "TypeB" Subsection "Display" Modes "1920x1200" "1280x1024" "1024x768" "800x600" Depth 24 EndSubsection EndSection Section "ServerLayout" Identifier "Layout01" Screen 0 "Screen0" Screen 1 "Screen1" RightOf "Screen0" InputDevice "Mice" "CorePointer" InputDevice "Keyboards" "CoreKeyboard" EndSection
Originally each screen could only share the input devices with other screens, and windows could not be displayed across two screens nor could they be moved from one screen to another, in part because multiple-screen systems were rare and expensive, in part because screens could have extremely different characteristics, such as color depth or pixel density, for example a monochrome portrait monitor and a color television.
However as the memory sizes of graphics units increased, and the cost and size of monitors decreased, especially with LCD monitors, systems with two (or more) identical (or nearly identical) monitors became common, and so did the desire to be able to regard multiple monitors as interchangeable tiles.
Therefore a somewhat hacked solution was added, in the form of the XINERAMA protocol extension and associated mechanism. The protocol extension allowed applications to query the X server as to the geometry of the various screens and to handle them as if they were sections of a bigger meta-screen, with the positions of the screens within it to be specified in the X server's Layout section where they are listed.
It was a somewhat inelegant retrofit, putting the burden of dealing with the situation on applications, but since the main application code involved was in window managers and libraries rather than in end user code, it was mostly painless, and respected the overall successful philosophy of the X Window System to offer simple mechanisms and leave policies to user applications.
The above architecture was very flexible in many ways, in particular allowing very diverse devices and monitors to coexist in a display, but had a limitation: all the elements above, including the number and type of monitors and devices, and their characteristics, had to be statically defined in the X server configuration.
My solution to this was simply to define the maximum number of devices and monitors that I wanted to use in the worst case, and the list of screen modes that encompassed most of the actual devices and monitors I would be using, and for out-of-the-ordinary cases just run a custom-configured X server as a separate display on another virtual console.
Otherwise the X server implementation of Xinerama could have been reworked to support enabling and disabling devices and monitors, and adding and deleting modes.
The one limitation that could not be overcome was the inability to rotate the screen, as that requires additional code to rotate the frame buffer.
Someone then decided to add another protocol extension to request screen rotation, and to overload it with dynamic screen addition and removal and resizing. However for whatever stupid reason they decided to add to rotating and resizing a completely new model which was whacked into the X server to coexist uneasily with the previous one, and which is rather uglier:
A display is made of a number of autodiscovered input devices, which cannot be explicitly configured, and exactly one screen with a frame of 8192×8192 pixels.
A screen can only be supported by one device with exactly one frame buffer.
A device on a single graphics unit could have multiple regions mapped to outputs by way of crtcs.
Outputs have monitor instances attached to them, can be enabled and disabled dynamically, and their position, pixel density, gamma, and the characteristics of the monitor can be changed dynamically.
A sample static configuration with two identical output monitors could look like:
Section "Monitor" Identifier "Monitor0" VendorName "generic" ModelName "LCD 19in" Gamma 2.2 DisplaySize 376 301 HorizSync 31-81 VertRefresh 56-75 Option "Primary" "true" Option "PreferredMode" "1280x1024" EndSection Section "Monitor" Identifier "Monitor1" VendorName "generic" ModelName "LCD 19in" Gamma 2.2 DisplaySize 376 301 HorizSync 31-81 VertRefresh 56-75 Option "Primary" "false" Option "PreferredMode" "1280x1024" Option "Right-Of" "DVI1" EndSection Section "Device" Identifier "CardR" VendorName "generic" BoardName "generic" Option "Monitor-DVI1" "Monitor0" Option "Monitor-VGA1" "Monitor1" EndSection Section "Screen" Identifier "ScreenR" Device "CardR" # 'Monitor' line in RANDR mode ignored. EndSection Section "ServerLayout" Identifier "LayoutR" Screen "ScreenR" # 'InputDevice' lines in recent servers ignored. EndSection
The equivalent dynamic configuration could be achieved with:
xrandr --newmode 1280x1024@60 108.0 \
  1280 1328 1440 1688  1024 1025 1028 1066  +HSync +VSync
xrandr \
  --addmode DVI1 1280x1024@60 \
  --addmode VGA1 1280x1024@60
xrandr \
  --output DVI1 --primary   --mode 1280x1024@60 \
  --output VGA1 --noprimary --mode 1280x1024@60 --right-of DVI1
xrandr \
  --output DVI1 --dpi 100 \
  --output VGA1 --dpi 100
The above only applies to relatively recent versions of RANDR, versions 1.2 and 1.3; previous versions are hardly usable except in narrow circumstances.
The static configuration is appallingly designed with the particularly silly idea of putting the geometry relationship among the outputs in the Monitor sections.
Probably RANDR is inspired by nVidia's TwinView which however is very much better designed, and is compatible with the old style X architecture.
Because of the vagaries of computing history some important network protocols are assigned a fixed port number, but are used for very different types of traffic.
In particular application protocols like
SSH
and
HTTP
are often used as if they were basic transport
protocols like UDP or TCP, with other
protocols layered on top, often to help with crossing network
boundaries where
NAT
or firewalls block transport protocols.
So for example SSH is used both for the interactive sessions for which it was designed, and for bulk data transfer for example with RSYNC.
This poses a problem in that the latency and throughput profiles of the protocols layered on top of SSH and HTTP can be very different, making it difficult for traffic shaping configurators like my sabishape to classify traffic correctly.
There is one way to make traffic shaping able to distinguish the different profiles of traffic borne by the same application protocol, and it is to assign to them different ports, as if they were different application protocols.
For example to use port 22 for interactive SSH traffic, but port 522 for RSYNC-over-SSH traffic. Similarly to use port 80 for interactive HTTP browsing, but port 491 for downloads.
Some of these ports are not NAT'ed or open by default in firewalls, and it is a bit sad to have to have independent local conventions but the benefit, especially avoiding the huge latency impact of bulk traffic on interactive traffic, is often substantial, and the cost is often very small, as many server dæmons can easily listen on two different ports for connections.
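For instance, with OpenSSH the same dæmon can listen on both the interactive and the bulk port, and RSYNC can be pointed at the alternate one; a minimal sketch, where the host name and paths are placeholders and the port numbers are just the local convention mentioned above:

# /etc/ssh/sshd_config: the same sshd listens on both ports.
Port 22
Port 522

# Interactive session, classified by the shaper as low-latency traffic.
ssh -p 22 user@example.com

# Bulk transfer over the alternate port, classified as bulk traffic.
rsync -a -e 'ssh -p 522' /srv/data/ user@example.com:/srv/backup/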
Having read my note about SSL issues and random number generators a smart correspondent has sent me an email to point out that such problems, and in general time-dependent problems, are made much worse by running application code, in particular but not only SSL, inside virtual machines.
Virtual machines disconnect to some extent virtual machine state from real machine state for arbitrary (even if brief) periods as the VMs get scheduled, and this completely alters the timings of events inside VMs, and in a rather deterministic way, as schedulers tend to be deterministic.
This and other aspects of virtual machines can starve the entropy pool of entropy or make pseudo random number generation much more predictable, thus weakening keys.
Some virtual machine platforms offer workarounds for this, but this is yet another reason why virtual machines are usually not a good idea.
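As a rough check, under Linux the depth of the kernel entropy pool can be watched from a shell, both on the host and inside a guest; a small sketch (the threshold of 200 bits is an arbitrary illustration):

# Current estimate of available entropy, in bits.
cat /proc/sys/kernel/random/entropy_avail

# Warn if the pool looks starved.
if test "$(cat /proc/sys/kernel/random/entropy_avail)" -lt 200
then echo 'entropy pool looks starved' 1>&2
fi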
The Dell
P2311H monitor
with a 23in display (510mm×287mm) belongs to the value
range of Dell monitors and these are my
impressions.
I got this because it was part of a package with a nice Dell Optiplex desktop. The display has a diagonal of 23in or 545mm (267mm×475mm) and it has a full resolution of 1920×1080 pixels, using a TN panel. Things that I liked:
The things I liked less:
Overall I think that the similar model U2312HM is vastly preferable as the display is much better and the cost is not much higher. Even the smaller and cheaper IPS225 is much preferable.
I have recently been using an LG IPS225 monitor with a 21.5in display (477mm×268mm) and what I liked about it:
The things I liked less:
value monitors.
Overall I think that it is a good monitor, and for the price it is very good. The stand is terrible, but it is easy to find decent VESA mount stands, in particular those that allow rotating it into portrait mode.
Since it has a 21.5in diagonal its size is much more suitable than that of monitors with a 23in or 24in display for portrait operation, which tend to be too tall, and I think it works very well in portrait mode, which I think is usually preferable, even if it is a bit too narrow (just like it is a bit too short when in landscape mode) because of the usual skewed aspect ratios.
A smart person I know also bought this model and is also using it (only) in portrait mode, having chosen carefully.
In its class the LG IPS225 is amazing value.
Today a smart person spotted that I sometimes write
for
loops with a single repetition, and asked me
why. There is more than one reason, and one is somewhat
subtle, and it is in essence to write something similar to a
with
statement from
Pascal
and similar languages, which is used to prefix a block with
the name of a datum and operate on it implicitly.
It is part of a hierarchy of control structures that is parallel to a hierarchy of data structures, as follows:
Data structure    Control structure
constants         expression
variables         assignment
records           (with) block
arrays            for
lists             while
trees             recursion
acyclic graphs    iterators
graphs            closures
The above table (which is slightly incomplete) is in order of
increasing data structure complexity, and the corresponding
control structure is the one needed to
sweep
the data structure, that is to make
full use of it. Programming consists largely of designing
data structures and then various types of sweeps through
them.
The block is the control structure appropriate for manipulating a record, and here are two examples, in C and in Pascal:
#include <math.h>

struct complex { float re, im; };

const float cmagn(const struct complex *const c)
{
    {
        const float re = c->re, im = c->im;

        return (float) sqrt((re*re) + (im*im));
    }
}
TYPE
    complex = RECORD re, im: REAL; END;

FUNCTION cmagn(c: complex): REAL;
BEGIN
    WITH c DO
    BEGIN
        cmagn := sqrt((re*re) + (im*im));
    END;
END;
The intent of the above is to make clear to both reader and compiler that the specific block is a program section specifically about a given entity.
Similarly sometimes I write in my shell scripts something
like:
for VDB in '/proc/sys/vm/dirty_bytes'
do
    if test -e "$VDB" -a -w "$VDB"
    then echo 100000000 > "$VDB"
    fi
done

which is equivalent to, but perhaps slightly clearer than:
{
    VDB='/proc/sys/vm/dirty_bytes'

    if test -e "$VDB" -a -w "$VDB"
    then echo 100000000 > "$VDB"
    fi
}
The version with for
also allows me to comment
out the value if I want to disable the setting of that
variable.
But the real reason is to convey the notion that the block is
about VDB
and it is a bit more emphatically clear
with the for
than with the generic {
block.
There are other cases where I slightly misuse existing constructs to compensate for the lack of more direct ones, both of them after some ideas or practices of Edsger Dijkstra.
In the chapter he wrote for
Structured Programming
he introduced the idea of using goto
labels as
block titles, in outline:
extern void *CoreCopy(
    register void       *const to,
    register const void *const from,
    register long unsigned     bytes
)
{
copySmallBlock:
    if (bytes < ClusterBEST)
    {
        CoreBYTECOPY(to,from,bytes);
        return to;
    }

copyHead:
    if (bytes >= ClusterDOALIGN)
    {
        long unsigned odd;

        if ((odd = ClusterREM((addressy) to)) != 0)
        {
            CoreODDCOPY(to,from,odd = ClusterBYTES - odd);
            bytes -= odd;
        }
    }

copyClusters:
    CoreCLUSTERCOPY(to,from,ClusterDIV(bytes));

copyTail:
    CoreODDCOPY(to,from,ClusterREM(bytes));

    return to;
}
I gave up on the practice of using labels as section titles
because most compilers complain about labels that are not referenced by any goto statement.
Edsger Dijkstra also introduced in
A discipline of programming
the notion that unhandled cases in if
statements
are meaningless so that this program fragment is meaningless
if n is negative:
if n >= 0 → s := sqrt(n) fi
Sometimes I use while
and nontermination to
indicate a similar effect, relying on the property of
while
that it is a precondition
falsifier
thus for example writing in a shell script
something like:
for P in ....
do
    while test -e "$P"
    do rm -f "$P"
    done
done
Thanks to mod_ssl the Apache web server can support SSL connections, but since these involve encryption they can suffer from somewhat subtle issues.
These issues are usually related to the rather complicated
X.509 certificates
but sometimes they are caused by performance problems, as
SSL connections require significant numbers of random bytes
for encryption keys and then processing time to encrypt
data.
There are in particular two cases in which random number generation can cause failed connections:
In the past few days my laptop has shown signs of very, very slow writing, at less than 1MiB/s, while at the same time its flash SSD drive when tested with hdparm would read at nearly 300MiB/s. Doing some tests with dd showed that writing with oflag=direct would run at the usual write speed of just over 200MiB/s.
This clearly pointed at some issue with the
Linux page cache
and after some searching I found that the kernel parameters
vm/dirty_... were all 0. Normal
writing speed was restored by setting them to more
appropriate values like:
vm.dirty_background_ratio = 0
vm.dirty_background_bytes = 900000000
vm.dirty_ratio = 0
vm.dirty_bytes = 1000000000
But I had set up /etc/sysctl.conf with appropriate values, and anyhow eventually those parameters changed back to zero. After some investigation the cause turned out to be that the /usr/lib/pm-utils/power.d/laptop-mode script would be run by the pm-utils power management logic. Normal behaviour was restored by listing laptop-mode in the file /etc/pm/config.d/blacklist.
The problem with laptop-mode is that it sets the vm/dirty_... parameters in the wrong order: setting the parameters that end in _bytes zeroes the parameters with a similar name ending in _ratio, so the parameters ending in _ratio are supposed to be set first, and then those ending in _bytes if available.
The reason for that is that eventually some Linux kernel contributor realized the stupidity of basing flushing on a percentage of the amount of memory available rather than on a fixed amount (usually related to IO speed), so a second set of settings was provided for that, with the automagic side-effect of zeroing the overridden percentage settings, to avoid ambiguity.
Unfortunately if the order in which the settings are made is wrong it is possible to end up with zeroes in most of them, which causes the page cache code to behave badly.
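A sketch of the order that should be safe, given the zeroing behaviour described above: set the _ratio parameters first (or not at all), and the _bytes ones last so that they are the ones that take effect; the _ratio values here are arbitrary placeholders, the _bytes ones are those I use above:

# Setting a '_bytes' parameter zeroes the matching '_ratio' one (and
# vice versa), so the '_bytes' parameters must be set last to win.
sysctl -w vm.dirty_background_ratio=5
sysctl -w vm.dirty_ratio=10
sysctl -w vm.dirty_background_bytes=900000000
sysctl -w vm.dirty_bytes=1000000000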
I had written a much better change in which values over 100 in the old settings would be interpreted as a maximum number of dirty pages, and which also fixed some other ridiculous automagic side effects.
For historical reasons I have installed Ubuntu 12.04 on a laptop in an ext4 filetree, and I have decided that I would rather mount it in data=writeback mode with barrier=1 than the default data=ordered mode with barrier=0 (something that is even more important for the ext3 filesystem).
This has been frustrating because in most contemporary distributions the Linux kernel boots into an in-memory block device extracted from an initrd image, and the root filetree is mounted by a script in that filetree using hardcoded parameters at first and then parameters copied from /etc/fstab, and this can be rather inflexible as previously noted.
Therefore at first the root filetree is mounted with data=ordered and then it is remounted with data=writeback, which fails because the data journaling mode cannot be changed on remount.
However unlike other distributions the startup script in
the Ubuntu 12 /init script in the initrd
scans the kernel boot line for
various parameters as in:
....
# Parse command line options
for x in $(cat /proc/cmdline); do
    case $x in
    init=*) ....
    root=*) ....
    rootflags=*) ....
    rootfstype=*) ....
    rootdelay=*) ....
    resumedelay=*) ....
    loop=*) ....
    loopflags=*) ....
    loopfstype=*) ....
    cryptopts=*) ....
    nfsroot=*) ....
    netboot=*) ....
    ip=*) ....
    boot=*) ....
    ubi.mtd=*) ....
    resume=*) ....
    resume_offset=*) ....
    noresume) ....
    panic=*) ....
    quiet) ....
    ro) ....
    rw) ....
    debug) ....
    debug=*) ....
    break=*) ....
    break) ....
    blacklist=*) ....
    netconsole=*) ....
    BOOTIF=*) ....
    hwaddr=*) ....
    recovery) ....
    esac
done
....
Among them there is rootflags=.... which
can be set. But to set the parameters for mounting the
root
filetree three steps may be required:
setting the desired options for the root filetree in /etc/fstab;
adding the same options to the kernel boot line with rootflags=....;
regenerating the initrd image, as that copies into it the /etc/fstab options.
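A sketch of what the three steps might look like on that Ubuntu 12 laptop; the UUID and kernel version are placeholders:

# /etc/fstab: desired options for the root filetree.
UUID=0123abcd-placeholder  /  ext4  data=writeback,barrier=1,errors=remount-ro  0 1

# Kernel boot line (for example via GRUB_CMDLINE_LINUX in /etc/default/grub):
linux /vmlinuz-3.2.0-placeholder root=UUID=0123abcd-placeholder ro rootflags=data=writeback,barrier=1

# Regenerate the initrd so that it picks up the new /etc/fstab options.
update-initramfs -u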
A little known but historically very important operating system was OS6 developed at the Oxford Computing Laboratory by Christopher Strachey and collaborators around the year 1970.
It is important because the Xerox Alto operating system was derived from it:
Alto OS (1973-76): I designed (and, with Gene McDaniel and Bob Sproull, implemented) a single-user operating system, based on Stoy and Strachey's OS6, for the Alto personal computer built at PARC [14, 15a, 22, 38, 38b]. The system is written in BCPL and was used in about two thousand installations.
The Alto was the first modern workstation, inspired by Alan Kay's goal of a Dynabook.
OS6 is also important because, being written in BCPL, it popularized that language in the USA. From BCPL Ken Thompson first derived B (which was used to implement the Thoth operating system, which eventually evolved into the V operating system and also QNX), and B then evolved into C; OS6 itself seems to have inspired several aspects of UNIX, as the OS6 paper on its stream based IO and filesystem shows.
I have been going around a bit with my older Toshiba U300 laptop and I have been using its built-in display, which like most laptop displays has poor color quality and narrow viewing angles, I hope because that reduces its power consumption.
Since then I have been back at my desk, where I use my laptop as my main system but with an external monitor, keyboard and mouse. This not only avoids wearing out the built-in ones, but the external ones also tend to be more comfortable.
In particular the monitor: not only is the display of my Philips 240PW9 monitor much larger, but its viewing angles and in particular the quality of its colors are amazingly good as previously noted, something that I appreciate more after using the built-in display of the laptop for a while.
While browsing to see the current state of disk prices, I noticed that the 256GB flash SSD I bought some months ago for around £290 now can be bought for £180.
That's a remarkably quick drop in price, probably also due to
the restoration of some hard disk production after flooding in
Thailand. I am very happy with my SSD, which I
manage carefully as to endurance
,
not just because it is quite fast, especially in random access
workloads such as metadata intensive ones, but also because I
need to worry a lot less about bumping it while operating, as
it does not have mechanical parts, never mind a high precision
low tolerance disk assembly spinning 120 times per second.
I have just had some issues with the main disk of my
minitower system, and I had a look at its state and while most
of its status as reported by smartctl -A is good
the load cycle count
is very high:
smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.1.0-3-grml-amd64] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   051    Pre-fail  Always       -       19503
  2 Throughput_Performance  0x0026   252   252   000    Old_age   Always       -       0
  3 Spin_Up_Time            0x0023   074   056   025    Pre-fail  Always       -       7911
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       128
  5 Reallocated_Sector_Ct   0x0033   252   252   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   252   252   051    Old_age   Always       -       0
  8 Seek_Time_Performance   0x0024   252   252   015    Old_age   Offline      -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       9602
 10 Spin_Retry_Count        0x0032   252   252   051    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   001   001   000    Old_age   Always       -       285014
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       114
191 G-Sense_Error_Rate      0x0022   100   100   000    Old_age   Always       -       1
192 Power-Off_Retract_Count 0x0022   252   252   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0002   064   059   000    Old_age   Always       -       33 (Min/Max 15/41)
195 Hardware_ECC_Recovered  0x003a   100   100   000    Old_age   Always       -       0
196 Reallocated_Event_Count 0x0032   252   252   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       2
198 Offline_Uncorrectable   0x0030   252   252   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0036   100   100   000    Old_age   Always       -       107
200 Multi_Zone_Error_Rate   0x002a   100   100   000    Old_age   Always       -       6548
223 Load_Retry_Count        0x0032   001   001   000    Old_age   Always       -       285014
225 Load_Cycle_Count        0x0032   072   072   000    Old_age   Always       -       285151
In the above report also note that the reallocated counts are zero, that is no permanently defective sector was found, and all the IO errors that I have seen have been transient. For comparison these are the relevant rows for the other 3 drives in that mini-tower:
# for N in b c d; do smartctl -A /dev/sd$N; done | egrep -i 'power_on|start_stop|load_cycle'
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       1891
  9 Power_On_Hours          0x0012   097   097   000    Old_age   Always       -       25505
193 Load_Cycle_Count        0x0012   098   098   000    Old_age   Always       -       2906
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       444
  9 Power_On_Hours          0x0012   099   099   000    Old_age   Always       -       10499
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       669
  4 Start_Stop_Count        0x0032   099   099   000    Old_age   Always       -       1482
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       9798
225 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       1486
From looking at the growth of the load cycle count over 30 minutes on the affected disk it looks like there is a load cycle every 2 minutes, or 30 times per hour, and indeed the total load cycle count is around 30 times the number of power-on hours.
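A quick way to estimate that rate is to sample the raw Load_Cycle_Count attribute twice, 30 minutes apart; a sketch, where the device name is a placeholder and the attribute number is 225 on this drive (193 on many others):

# Sample Load_Cycle_Count, wait 30 minutes, sample again.
C1="$(smartctl -A /dev/sda | awk '$1 == 225 || $1 == 193 {print $10}')"
sleep 1800
C2="$(smartctl -A /dev/sda | awk '$1 == 225 || $1 == 193 {print $10}')"
echo "load cycles in 30 minutes: $(($C2 - $C1))"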
That is a well-known issue for laptop drives (also here and also here) but it seemed to affect only a few desktop drives, notably low-power ones like WD Green drives.
The drive affected in my case is a Samsung Spinpoint F3 1TB (HD103SJ) which is a performance oriented desktop drive. It must be replaced, as a load cycle count of almost 300,000 is very high for a desktop drive.
While I have a script that sets parameters I like for various machines and drives, the relevant lines used to be:
# Must be done in this order, '$DISKL' can be a subset of '$DISK[SA]'.
case "$DISKS" in ?*) hdparm -qS 0 $DISKS;; esac
case "$DISKA" in ?*) hdparm -qS 0 $DISKA;; esac
case "$DISKL;$IDLE" in ?*';'?*) hdparm -S "$IDLE" $DISKL;; esac
case "$DISKL;$APMD" in ?*';'?*) hdparm -B "$APMD" $DISKL;; esac
(where $DISKS is SCSI disks, $DISKA is PATA disks, and $DISKL is laptop-style disks). I had thought that setting the spindown timer to 0 would completely disable power management, but that obviously is not the case at least with some drives, so I have added a specific setting to disable APM by default:
# Must be done in this order, '$DISKL' can be a subset of '$DISK[SA]'.
case "$DISKS" in ?*) hdparm -qS 0 -qB 255 $DISKS;; esac
case "$DISKA" in ?*) hdparm -qS 0 -qB 255 $DISKA;; esac
case "$DISKL;$IDLE" in ?*';'?*) hdparm -S "$IDLE" $DISKL;; esac
case "$DISKL;$APMD" in ?*';'?*) hdparm -B "$APMD" $DISKL;; esac
The previous default value for APM level was 254, which is the lowest setting that still enables APM, and on most drives that does not activate load cycles, but obviously on this drive it did.
Yet another small but vital detail to check and a parameter to set when buying or installing a drive, like SCT ERC.
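For the record, the SCT ERC timeouts can be inspected and set with smartctl, and APM disabled with hdparm; a sketch of what I would check on a new drive, where the device name is a placeholder and 70 deciseconds (7 seconds) is a common choice for drives in RAID sets:

# Show the current SCT error recovery control timeouts, if supported.
smartctl -l scterc /dev/sda

# Set the read and write error recovery timeouts to 7.0 seconds
# (the values are in tenths of a second).
smartctl -l scterc,70,70 /dev/sda

# Disable drive-internal APM so that it does not unload the heads on its own.
hdparm -B 255 /dev/sda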
Recently I was asking about the boundaries of some support areas and I was relieved that these did not involve any user-visible mailstores, and that the e-mail services were handled by some unfortunates with Microsoft Exchange.
When I expressed relief, explaining that there are very few mailstores that scale well, especially those with mail folders in the style of MH or Maildir, the rather smart person I was discussing with surprised me by saying that they had no problems with a largish Dovecot based mailstore.
Which to me means that even smart persons can be quite optimistic as to the future, and that the mailstore issue is indeed often underestimated.
The mailstore issue is the issue of how to archive, either short or long term, e-mail messages, which are usually relatively small individually, and can accumulate in vast numbers. This issue has become larger and larger for three very different reasons:
The combined effect is that MDA mailstores often grow because of the volume of traffic, many users retaining messages for years, and keeping them in the central mailstore instead of their own filestore.
This growth is similar to that of filestores in general, but for mailstores usually things are worse because of the sheer volume of traffic, as few users manage to create or receive hundreds of files per day.
Now the mailstore issue is that the cost profile of mailstore operations is extremely anisotropic, because there are two entirely distinct modes:
Even smart persons seem to consider only the cost of low-frequency, user-driven access to recent messages, and conclude that there is no mailstore problem. But the mailstore problem is related to bulk programmatic access by system administrators, for backup or indexing, and by users for searching.
The difficulty with bulk accesses is that they impact collections of very numerous, rather small messages, which have been received at widely different times, and thus can require a very large number of random access operations.
The mailstore issue is very similar to the much wider issue of filetrees with many small files, with the added complications of the high rates of message arrival, and the far more common bulk searches.
The main reason why the mailstore issue has a large cost is the usual one: ordinary rotating disk storage devices with mechanical positioning are largely one-dimensional and have enormous access time anisotropy across their length, which means:
The second point is the biggest issue, because it makes it hard to put messages along the length of a disk so their physical locations correlate with some useful logical ordering, such as by user or by folder or by topic.
The most common storage representations for mailstores all have problems with this:
One file per message: this can be realized in two different ways as to directories: either directories represent logical partitions of the mailstore, for example by user and by folder, or directories represent physical partitions of the mailstore, for example by arrival time, and there are indices for locating all messages belonging to a user or a folder or a topic. This is the MH or Maildir or old-type news-spool style layout.
Many messages per file: in other words messages are considered log entries, and they are logged by user, folder, or topic, with further classification by directories, and perhaps with in-band or out-of-band indices. This is the classic mailbox style layout or new-type circular news-spool layout.
Messages in a database: this is a variant on putting a list of messages in a file, where they are stored inside the tables of some kind of (usually relational) database. It is then up to the DBA to choose a suitable physical layout, and as a rule it ends up quite similar to a list of messages in a file.
Of these the most popular and at the same time the worst is the one file per message, because of two bad reasons, both related to updates:
There are however two big problems with mail stores implemented as file-per-message:
Messages are in effect logs, as they tend to arrive over time, and only the most recent and a few of the older ones are accessed with any frequency. This is a particularly damning point because MH style mailstores are particularly unsuitable for logs.
Filesystems are designed for small collections of large files, mostly because of technological constraints, usually to avoid implementing filesystem metadata as a database capable of handling large collections of small records well. Even worse, storage technology improvements tend to improve sequential streaming rather than random access, making this point even stronger.
Then there is the question of why mailstores with one file per message are popular at all, and it is mostly about them being so very tempting despite these huge issues; there are also a couple of cases where these huge issues have a small cost or the advantages are more useful:
A small, transient spool, that is an Inbox into which incoming messages are delivered, and from which they are saved into topical or historical archives.
Unfortunately many if not most mailstores grow a lot, even the spooling ones, as many email users no longer move messages to archival message collections, but leave all messages in the inbox and rely on search and indexing instead.
Interestingly the mailstore issue happened several years
earlier with
NNTP
servers, where newsspools
, which used to be
small and transient, became persistent and much larger as news
message volumes increased and most users stopped copying
messages to archival files and relied on accessing
newsgroup
history and searching it.
The same problems indeed occurred as this paper details:
Another problem one can face in maintaining news service is with the design and performance of most standard filesystems. When USENET first started out, or even when INN was first released, no one imagined that 9000+ articles/day would be arriving in a single newsgroup. Fortunately or unfortunately, this is the situation we now face. Most filesystems in use today, such as BSD's FFS (McKusick, et al., 1984) still use a linear lookup of files in directory structures. With potentially tens of thousands of files in a single directory to sort through, doubling the size of the spool (for example, by doubling the amount of time articles are retained before they are expired) will place considerably more than double the load on the server.
The solution has been switching to many messages per file, using a log style structure (circular, as usually news messages expire after a limited time) with an internal structure that avoids using expensive filesystem metadata as an external structure:
A number of other programming efforts are under way to address filesystem limitations within INN itself. One of several proposals is to allocate articles as offset pointers within a single large file. Essentially, this replaces the relatively expensive directory operations with relatively inexpensive file seeks. One can combine this with a cyclic article allocation policy. Once the file containing articles is full, the article allocation policy would "wrap around" and start allocating space from the beginning of the storage file, keeping the spool a constant size.
The Cyclic News File System (CNFS)[5] stores articles in a few large files or in a raw block device, recycling their storage when reaching the end of the buffer. CNFS avoids most of the overhead of traditional FFS-like[11] file systems, because it reduces the need for many synchronous meta-data updates. CNFS reports an order of magnitude reduction in disk activity. CNFS is part of the INN 2.x Usenet server news software. In the long run, CNFS offers a superior solution to the performance problems of news servers than Usenetfs.
Many years ago a large USENET site switched their newsspool to Network Appliance servers because their WAFL filesystem is itself log-structured and thus implicitly and to some extent turns message per file archives into something very similar on disk to log-structured files:
So, you've got 10,000 articles/day coming into misc.jobs.offered, in fact that's old, I don't know how many are coming in these days, maybe its 15,000 or so, I wouldn't be surprised if this were the case. And any time you want to look up attribute information on any one of them you need to scan linearly through the directory.
That's really bad. It's really much better to be able to scan for the file using a hashed directory lookup like you have in the XFS file system that SGI uses, the Veritas file system that you can buy for Sun, HP, and other platforms, or in our case, what we did was use the WAFL file system on the Network Appliance file servers.
My conclusion from this is that dense representations for mailstores are vastly preferable to message-per-file ones, except perhaps for small, transient mailspools (and indeed most MTAs use message-per-file mailspools), but since currently most email users keep their messages in their Inbox thanks to email access protocols like IMAP and often don't sort them by hand into archives, the ideal structure of a mailstore, as for newsspools, is that of files with internal structure containing many messages, as the internal structure is likely to be far more efficient, especially in terms of random accesses, than that of a directory with one file per message.
Which particular type of many-messages-per-file structure to use is an interesting problem. The traditional mbox structure with in-band boundaries has some defects but it has the extremely valuable advantage of being entirely text-based and therefore easily searchable and indexable with generic text based tools, and therefore it seems still to be recommended as the default, perhaps with a hint to limit the size of any mbox archive to less than 1/2 second of typical sequential IO, for example around a few dozen MiB on 2012 desktops.
An alternative used by GoogleMail is to store messages in log-structured files such as those supported by their GoogleFS, where large files containing many messages are accessed using parallel sequential scans.
The alternative is to use a DBMS, as they are specifically designed to handle many small data items. Unfortunately the most popular mail system that uses a DBMS to store messages is Microsoft Exchange, and it has had a very large number of issues due to a number of mistakes in its implementation.
Several recent mailstore server implementations like Citadel (which uses BDB), Apache's James, Open-Xchange Server, Zarafa, DBMail and some proprietary ones have DBMS backends.
While the Debian™ project continues to be strong, and some of the more restrictive practices have been improved, from a technical point of view the Debian distribution continues to have several bad aspects as previously remarked and most of these have to do with rather objectionable choices in packaging policy and tools. As to the packaging tools, that is DPKG and associated .deb tools, I have been following the comical story of how they have been extended with enormous effort and controversy to allow the installation of two packages with the same name but for different architectures, just as the toolset was previously extended to handle checksum verification of package files.
Just as the extension for multiple architectures is rather incomplete, as it leaves out the very useful and important ability to have different versions of the same package installed, similarly limited is the ability to verify package file checksums, because as I was pointing out recently to someone I was discussing packaging issues with, the list of package file checksums is not cryptographically signed:
# ls -ld whois.*
-rw-r--r-- 1 root root 172 2010-11-26 11:11 whois.list
-rw-r--r-- 1 root root 240 2010-03-20 05:09 whois.md5sums
# cat whois.list
/.
/usr
/usr/bin
/usr/bin/whois
/usr/share
/usr/share/doc
/usr/share/doc/whois
/usr/share/doc/whois/README
/usr/share/doc/whois/copyright
/usr/share/doc/whois/changelog.gz
# cat whois.md5sums
370b01593529c4852210b315e9f35313  usr/bin/whois
1835d726461deb9046263f29c8f31fe8  usr/share/doc/whois/README
971fdba71f4a262ae727031e9d631fa8  usr/share/doc/whois/copyright
78f366b8fb0bb926a2925714aa21fbe7  usr/share/doc/whois/changelog.gz
Which of course means that the checksum list cannot be used to verify the integrity of installed files except at installation time or for accidental damage. It is surely possible to use a separate integrity checker for deliberate modifications, but that only reveals changes in the file contents, not whether they are as they were when built by the distribution, which is a far more interesting notion.
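The checksum lists can still be checked against the installed files, which is useful at least for accidental damage; a small sketch (the debsums tool, where installed, automates the same check):

# The paths in the '.md5sums' file are relative to the filesystem root.
( cd / && md5sum -c /var/lib/dpkg/info/whois.md5sums )

# Or, if the 'debsums' package is installed:
debsums whois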
Note: the signature situation is also quite dubious, as many packages are not signed at all, never mind the checksum files, and only repositories are signed, which is far from adequate.
Also while the distribution checksums can be verified using an easily retrieved public key, those produced by an integrity checker need either to be copied to a secure location, or be signed with a private key on the system on which the integrity checker runs, which is rather risky.
Anyhow package repositories and their package lists are part of the dependency management layer, based on APT in Debian, and unrelated to DPKG and the package management layer (to the point that the APT dependency management tools can also handle RPM packages), and the APT tools seem to me rather better (especially Aptitude) than the package management ones.
What is amazing is that despite these grave packaging tool
issues Debian continues to be a successful distribution, in
the same crude popularity
sense that MS-Windows is successful because it is
popular, despite (among many others) even worse package and
dependency management issues (as MS-Windows barely has
either).
TIL about two different approaches to OLED display structuring: reportedly Samsung AMOLED displays use separate OLEDs for red, green and blue, each emitting the specific color, while LG uses white OLEDs with a color filter on top. I must admit that I had missed the existence of white OLEDs.
TIL also about the new display technology by Sony called Crystal LED, which is a display made of distinct red, green and blue LEDs.
The latter seems to me an evolution of LED backlights for LCD
monitors. While most LED backlights involve only a few bright
white LEDs providing illumination across the display, more
sophisticated ones have a LED per pixel (which means little unevenness of illumination and allows a wider dynamic range of luminosity for each pixel). It probably occurred to Sony engineers that tripling the number of backlight LEDs would allow them to eliminate the LCD filter altogether, transforming the display from transmissive to emissive.
The subtext to these developments is that Taiwanese and Chinese manufacturers of ordinary LCDs have invested in a lot of production capacity for LCD displays, and Japanese and Korean companies are trying to push forward with technologies that their competitors have not yet invested in.
Samsung have announced their Galaxy Tab 7.7 tablet and one of the main features is that it has a 7.7" AMOLED display with a 1280×800 pixel size, which gives it a pixel density of 200DPI.
This is the first OLED display with a diagonal larger than 3in for a mass market device, and the quick impression of the reporter from Computer Shopper UK is that it is a display of exceptional quality, both because of the fairly high DPI, and its luminosity and wide viewing angle.
As I am not particularly keen on tablets as I write a fair bit and even a laptop keyboard is much better than an on-screen keyboard, this display to me has the import that 1280×800 are pixel dimensions typical of laptops (at least before the latest rounds of increasingly skewed aspect ratios) which means that AMOLED displays may be coming to laptops, and eventually to standalone monitors.
I had seen in a recent discussion thread a misunderstanding
of
advising filesystems of the alignment properties
of the underlying storage device, in particular
stripe size
which really means alignment more than size.
Advising a filesystem of stripe size is really about
advising it that the underlying storage medium has two
different addressing granularities
and some special properties:
a write block, which can be a logical or physical write sector size.
These things happen usually with devices where the physical
sector
size is larger than the logical
sector size expected, as a matter of tradition, by most
software using the device, including filesystems. In rare
cases, such as flash
memory, the physics of
the recording medium have intrinsically different read and
write granularities.
In extreme yet common cases the effective or implied physical sector size for writing can be some order of magnitude larger than the one for reading, for example on some flash memory systems the sector size for reading is 4KiB but the sector size for overwriting is 1024KiB; the cost of simulating writing at smaller address granules can be an order of magnitude or two in sequential transfer rate.
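The effect is easy to observe with direct IO from a shell; a rough sketch of the kind of comparison I would run on a scratch file, where the path and sizes are placeholders and the exact numbers depend on the device's flash block size:

# Large aligned direct writes: should run near the device's full speed.
dd if=/dev/zero of=/scratch/testfile bs=1M count=256 oflag=direct conv=fsync

# Small direct writes: each can trigger a read-modify-write of a much
# larger flash block, often cutting the transfer rate drastically.
dd if=/dev/zero of=/scratch/testfile bs=4k count=65536 oflag=direct conv=fsync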
A hint to the filesystem about the stripe
size
is a hint that allocation of space on and writing to
the device should preferentially be done in multiples of that
number and not of the sector or block sizes; put another way
that the read and write sector sizes are different or
should be different, because while it is possible to
write sectors with the same size, a different (usually larger)
sector size for writes is far more efficient.
If the filesystem disregards this, it is possible that two bad or very bad things may happen on writing:
Most devices where as a rule writes should be done with a different sector size from reads are:
Flash SSDs, which can be read in flash pages but only written (after erasing) in flash blocks of 256KiB to 1MiB.
Parity RAID sets with a stripe size larger than the sector size of the member devices: in order to recalculate parity on writes, the whole stripe needs to be read, updated, parities recomputed and the whole stripe plus parities written back. In the case of single parity it is fairly common to use a type of parity where this can be abbreviated to reading just the old data and the parity, updating them, and writing the new data and parity to storage.
In all the cases above the insidious problem is that there is a firmware or software layer that attempts to mask the difference in sector sizes between read and write by automatically performing RMW, with often disastrous performance.
There are also a few cases where no RMW is necessary, yet performance is affected if the write sector size used is the one declared by the device, for example:
The subject of when exactly applications and various kernel
layers commit data to persistent storage is a complicated and
difficult one, and for POSIX style
systems it revolves around the speed and safety of
fsync(2),
an (underused) system call
of great
importance and critical to reliable applications.
Therefore I was astonished when in a decent presentation about fsync and related issues, Eat My Data: How Everybody Gets File IO Wrong, on slide 118, I found the news that on Mac OSX requests to commit to persistent storage are ignored for local drives:
/* Apple has disabled fsync() for internal disk drives in OS X. That caused corruption for a user when he tested a power outage. Let us in OS X use a nonstandard flush method recommended by an Apple engineer. */
Note: On Mac OSX fsync has the narrow role of flushing only the system memory block cache (which is the most conservative interpretation of the POSIX definition of fsync), and does not flush the drive cache too, for which the platform specific F_FULLFSYNC option of fcntl(2) is provided.
That is quite unsafe. While fsync is usually a very expensive operation, and Mac OSX platforms usually (MacBooks) have battery backup (being mostly laptops), just nullifying the effect of fsync is a very brave decision.
This news might explain why on a friend's Mac OSX laptop /usr/sbin/cups at some point vanished, and there were several other missing bits: it is entirely possible that during a system update the laptop crashed either because of an error, or because the battery ran out of charge, or something else.
One of the apparent mysteries of
RAID
is the impact of chunk size, which is the number of consecutive logical sectors of a stripe that lie on a single member of the RAID set.
In theory there is no need for a chunk size other than 1 logical sector, at least for RAID setups where the logical sector is not smaller than the physical one; however in practice the default chunk size of most RAID implementations is much larger than a single logical sector, often 64KiB even if the logical sector is 512B.
Obviously this is meant to improve at least some type of
performance, and it does, even if
large stripe sizes are a bad idea especially for RAID6
but to understand that one needs to look at
synchronicity
of RAID set members when they have non-uniform access times.
The problem is that when reading a whole stripe if the devices are not synchronized the current position of those devices may be different, and therefore reading all the chunks that make up a stripe involves different positioning latencies for every chunk, making the time needed to gather all chunks much longer than that needed for reading each chunk.
Making each chunk larger than a single logical sector spreads the cost of the per-chunk positioning latency over the size of the chunk, increasing throughput. Therefore having chunks of 64KiB instead of 512B means that the cost of the spread of latencies has to be incurred 128 times less often.
This however only really applies to streaming sequential
transfers, and mostly only to reading, because when writing
sectors are typically subject to write-behind
and therefore can be scheduled by the block scheduler in an
optimal order after the fact.
There is something of a myth that suggests the opposite, that small chunk sizes are best for sequential transfers as they maximize bandwidth by spreading the transfer across as many drives as possible, while large chunk sizes are best for random transfers as they maximize IOPS by increasing the chance that each single transfer only uses one device leaving the others for other transfers at other positions.
It is a bit of a myth because if the sequential transfers are streaming (that is, long) all drives are likely to be used in parallel as the reads get queued, unless the read rate is lower than what the RAID set can deliver, in which case latency will be impacted. Put another way, a small chunk size will improve sequential transfers done in sizes much larger than the chunk size, but most sequential transfers are streaming anyhow.
For random transfers the myth is more credible, because 4 transfers in parallel each reading 8 blocks serially takes less time than 2 transfers in parallel each reading 4 blocks in parallel over 2 drives. The reason is that parallelizing seeks is far more important than parallelizing block transfers, if the latter are not that large, because seeks take a lot longer.
As to that, a typical contemporary rotating storage device may have an average access time of 10ms, and be able to transfer 120MB/s, that is 1.2MB in 10ms. So 4 seeks parallelized 4-way cost 10ms, plus the data transfer time, while parallelizing 4 seeks 2-way costs 20ms plus half of the data transfer time, and the latter is better only if the data transfers take 20ms or more, that is they are 2.4MB or more. The problem of course is that random transfers may happen to fall on positions within the same chunk, which serializes seeks, but then if they are within a chunk they should be short, fast seeks.
However in general I prefer smallish chunk sizes, because the major advantage of RAID is parallelism especially for sequential transfers, and the major disadvantage is large stripe sizes in parity RAID setups.
Finally, some filesystems have layout optimizations that are
based on the chunk size: metadata is laid out on chunk
boundaries to maximize the parallelism in metadata access, and
this is quite independent of stripe alignment for avoiding
RMW
in parity RAID setups. One example is
ext4
with the -E stride=blocks parameter
to mke2fs:
This is the number of blocks read or written to disk before moving to the next disk, which is sometimes referred to as the chunk size. This mostly affects placement of filesystem metadata like bitmaps at mke2fs time to avoid placing them on a single disk, which can hurt performance. It may also be used by the block allocator.
It is recognized by XFS with the sunit=sectors parameter to mkfs.xfs:
This suboption ensures that data allocations will be stripe unit aligned when the current end of file is being extended and the file size is larger than 512KiB. Also inode allocations and the internal log will be stripe unit aligned.
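As an illustration, for a hypothetical 4-drive RAID5 set (3 data members) with a 64KiB chunk size and 4KiB filesystem blocks, the parameters might look like this, where the device name is a placeholder:

# ext4: 'stride' is the chunk size in filesystem blocks (64KiB / 4KiB = 16),
# 'stripe_width' is the stride times the number of data members (16 * 3 = 48).
mke2fs -t ext4 -b 4096 -E stride=16,stripe_width=48 /dev/md0

# XFS: 'su' is the chunk (stripe unit) size, 'sw' the number of data members;
# 'sunit'/'swidth' in 512B sectors, as quoted above, are equivalent.
mkfs.xfs -d su=64k,sw=3 /dev/md0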
Reconsidering my previously suggested resilient network setup, which was based on taking advantage of routing flexibility thanks to OSPF, and the 6to4 setup under Linux, I noticed that they both use, in very different ways, unicasting (or anycasting).
Now that is in general a really powerful technique that is quite underestimated, mostly for historical reasons, and most people tend to think in terms of subnetting and route aggregation via prefixes.
The historical reasons are somewhat contradictory: the
original ARPAnet was designed as a
mesh network
, using point-to-point
links, where routes were to individual hosts, and there weren't
many nodes (when I started using it as a kid there were 30
nodes).
Then the original IP specification had fixed-size address ranges:
An address begins with a network number, followed by local address (called the "rest" field). There are three formats or classes of internet addresses: in class a, the high order bit is zero, the next 7 bits are the network, and the last 24 bits are the local address; in class b, the high order two bits are one-zero, the next 14 bits are the network and the last 16 bits are the local address; in class c, the high order three bits are one-one-zero, the next 21 bits are the network and the last 8 bits are the local address.
In particular there can be very many ranges with 256 addresses, that is 2^21 or around two million of them. This rapidly caused concern as various organizations started connecting newly-available Ethernet LANs to the Internet, and requesting distinct 256-address ranges for each:
An organization that has been forced to use more than one LAN has three choices for assigning Internet addresses:
- Acquire a distinct Internet network number for each cable.
- Use a single network number for the entire organization, but assign host numbers without regard to which LAN a host is on ("transparent subnets").
- Use a single network number, and partition the host address space by assigning subnet numbers to the LANs ("explicit subnets").
The first, although not requiring any new or modified protocols, does result in an explosion in the size of Internet routing tables. Information about the internal details of local connectivity is propagated everywhere, although it is of little or no use outside the local organization. Especially as some current gateway implementations do not have much space for routing tables, it would be nice to avoid this problem.
The concern arose because the original ARPAnet routers and their successors were by today's standards very slow computers, with very limited memory capacity (16KiB) and this resulted in the definition of subnetting (and supernetting) on arbitrary boundaries.
However a lot of people did allocate many
portable
class C
ranges, even if in some cases as subnets of class B and class
A ranges, and this led to large increases in routing table
size, and then to a stricter policy of assigning non-portable
ranges to organizations to minimize the size of the global
routing tables, and this was made very strict with
IPv6,
where routing is (nearly completely) strictly hierarchical
instead of mesh like.
But because of the intermediate period with many disjoint class C ranges and the advent of routers based on VLSI processors and memory, even relatively low-end routers today support quite large routing tables: for example the Avaya ERS5000 series supports up to 4,000 routes (which is not much less than the up to 16,000 Ethernet addresses), and looking at a lower end brand like D-Link their most basic router is the DGS-3612G and it can support up to 12,000 routes (and 16,000 Ethernet addresses), and core routers, those that can be used as border gateways, can support hundreds of thousands to millions of routes. Linux™ based routers are reported able to handle at least as many.
Given this, it seems possible for many sites to just stop using subnet routing, and assign to each node a unique host (/32, 255.255.255.255) route.
This may seem somewhat excessive, and indeed it is, but it is feasible, and may be quite desirable, if not for all nodes, often for server nodes. The reason is that the unique, routed IP address assigned to the server is effectively location independent, and this gives at least two useful advantages:
For several reasons (including debugging) often the better option would be to give servers both a unique routable IP address, and a subnet address for each link interface it is on.
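On a Linux server that can be sketched as a /32 service address on the loopback interface plus ordinary subnet addresses on the link interfaces, with the routers carrying a host route towards it; the addresses below are documentation-range placeholders:

# On the server: a location-independent /32 service address on 'lo',
# plus a normal subnet address on the link interface.
ip addr add 192.0.2.10/32 dev lo
ip addr add 10.1.0.10/24 dev eth0

# On a router: a host route towards whichever link the server is currently
# on; in practice this would be originated by OSPF rather than statically.
ip route add 192.0.2.10/32 via 10.1.0.10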
There are some limitations to using unique routable addresses:
It is useful to note that single (or otherwise narrow) IP routes achieve at a higher protocol level the same effect as VLANs when the latter are used to achieve subnet location independence.
Note: subnet location independence with one IP subnet per broadcast domain, and VLAN tagging of Ethernet frames is used to create multiple broadcast domains over a bridged network with more than one switch.
The location independence is achieved by having in effect
individual link level routes
to each
Ethernet address in the bridged network, and that's why the
switches mentioned above have Ethernet forwarding tables
capable of holding as many as 16,000 Ethernet addresses, as
every Ethernet address in the whole infrastructure must be
forwardable from every switch.
VLAN tags and the notorious STP then reduce the costs by partitioning the network in effect into link-level subnets of which the VLAN tag is the network number.
The same effect, minus the risk of broadcast storms and other very undesirable properties of a bridged network, can be achieved by using routed IP address where IP addresses belonging to the same subnet have individual routes. Or perhaps where most IP addresses have the same route as most of those nodes are on the same link, and scattered nodes are routed to by more specific routes.
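In routing terms that is just ordinary longest-prefix matching; a small sketch of a subnet route plus a more specific host route for one node that happens to be on a different link (addresses are placeholders):

# Most of 10.1.0.0/24 is reachable via one router...
ip route add 10.1.0.0/24 via 10.0.0.1

# ...but one host of that subnet is on another link, so a more specific
# /32 route overrides the subnet route for it alone.
ip route add 10.1.0.57/32 via 10.0.0.2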
There are good arguments why IP addressing based on functional (for example workgroup) relatedness is rather less preferable than geographical (sharing of a link) relatedness, but if it is needed, it is better to bring it up to the highest protocol layer possible. Indeed my preference is for IP addresses to be strictly based on geographical relatedness (one link, one subnet), and to offer functional relatedness via the DNS with suitable naming structures.
Note: individually routable IP addresses are in effect node or even service identifiers, in effect names brought down from the application naming layer to the transport layer.
However in some important cases functional relatedness is best handled at the IP level (because of applications that don't use DNS or that resolve names only when they start), and then very specific routes, thanks to routers that can handle hundreds or thousands of them, can be quite useful.