This document contains only my personal opinions and calls of judgement, and where any comment is made as to the quality of anybody's work, the comment is an opinion, in my judgement.
My backups, during which I have also been testing IO elevators and smooth dirty page flushing, have been really slow recently. Now I would not normally notice, because I run them during the evening, but I was surprised to see the disk lights still on this morning, and that was because my nightly backup job was lasting too long. The reason was that it was running rather slower than usual, because of very high CPU utilization. On a guess I switched the copy program to use O_DIRECT (which miraculously seems to actually work as advertised since some recent Linux version, instead of having negligible effect) and CPU time dropped considerably, as the Linux page cache is so slow that IO can become CPU bound,
but there were two additional complications that made it rather
unresponsive:
The first was that the conservative CPU frequency governor completely ignores system CPU time when computing at which speed the CPU should run, with the effect that the CPU was running at 1,000MHz even when it was 95% busy; also, system CPU time tends to crowd out user processes.

The second was that the vm/drop_caches kernel parameter was still set to 1 after some recent experiments, which may have been having a continuing effect, keeping the page cache being constantly flushed.

I was in two minds whether the right fix was to use O_DIRECT or simply to switch the CPU frequency governor to the performance one, in part because with O_DIRECT it takes a largish block size, like 256KiB, to get good performance (like a 60MiB/s disk-to-disk copy). But on second thought, the Linux cache gets really swamped without O_DIRECT, because file access pattern advising still seems unimplemented, and then O_DIRECT with a large block size is the lesser of two evils.
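For concreteness, this is roughly what such an O_DIRECT copy loop looks like (a simplified sketch, not the actual copy program I used; the direct_copy() name and the 4096-byte alignment are just illustrative):

    #define _GNU_SOURCE             /* needed for O_DIRECT on Linux */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    /* Copy 'src' to 'dst' bypassing the page cache, with a 256KiB
     * block size; O_DIRECT requires the buffer (and in general the
     * offsets and sizes) to be aligned, hence posix_memalign(). The
     * last partial block of a file whose size is not a multiple of
     * the block size needs special handling, omitted here. */
    int direct_copy(const char *src, const char *dst)
    {
        const size_t bsize = 256*1024;
        void *buf;
        ssize_t got = 0;

        if (posix_memalign(&buf, 4096, bsize) != 0)
            return -1;

        int in  = open(src, O_RDONLY | O_DIRECT);
        int out = open(dst, O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT, 0644);
        if (in < 0 || out < 0) {
            perror("open");
            if (in >= 0) close(in);
            if (out >= 0) close(out);
            free(buf);
            return -1;
        }

        while ((got = read(in, buf, bsize)) > 0)
            if (write(out, buf, (size_t) got) != got) {
                perror("write");
                got = -1;
                break;
            }

        close(in);
        close(out);
        free(buf);
        return got < 0 ? -1 : 0;
    }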
Incidentally, it is now possible to flush the clean pages of the Linux page cache on demand, by setting vm/drop_caches to a non-zero value. This is very welcome as it helps in several cases, for example IO benchmarking, since the (ancient and traditional) alternative is to unmount a filesystem and remount it, which is not always convenient (note that -o remount does not have the same effect, as it really just changes mount options and does not effect a real remount, except in special cases).
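For example, a small helper along these lines is all it takes (purely illustrative; the drop_page_cache() name is mine), remembering that only clean pages are dropped, so a sync first makes it more effective:

    #include <stdio.h>
    #include <unistd.h>

    /* Drop clean cached pages by writing to /proc/sys/vm/drop_caches.
     * Only unmodified pages are discarded, so dirty data should be
     * sync()ed first; writing 2 drops dentries and inodes instead,
     * and 3 drops both. */
    int drop_page_cache(void)
    {
        FILE *f = fopen("/proc/sys/vm/drop_caches", "w");
        if (f == NULL)
            return -1;
        sync();                 /* flush dirty pages to disk first */
        fputs("1\n", f);        /* 1 = drop the page cache */
        return fclose(f);
    }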
One might think that the BLKFLUSHBUFS ioctl(2) (issued by the blockdev --flushbufs or hdparm -z commands) should do the same, or at least it is misunderstood by many to be supposed to do so, but it does not, being instead a per-block-device sync. Anyhow BLKFLUSHBUFS would apply only to a block device, while one might want to flush the unmodified buffers of a filesystem not based on a block device, for example an NFS one.
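For what it is worth, this is essentially all that blockdev --flushbufs boils down to (a minimal sketch with only token error handling; the flushbufs() name is mine):

    #include <fcntl.h>
    #include <sys/ioctl.h>
    #include <linux/fs.h>       /* for BLKFLUSHBUFS */
    #include <unistd.h>

    /* Issue BLKFLUSHBUFS on a block device, e.g. "/dev/sda"; as
     * argued above this amounts to a per-block-device sync, not to
     * dropping the clean cached pages of filesystems on it. */
    int flushbufs(const char *blkdev)
    {
        int fd = open(blkdev, O_RDONLY);
        if (fd < 0)
            return -1;
        int rc = ioctl(fd, BLKFLUSHBUFS, 0);
        close(fd);
        return rc;
    }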
The umount and mount pair is less convenient than vm/drop_caches but more selective, as it applies to a single filesystem, while vm/drop_caches applies to all unmodified cached pages. Note that it does not apply to modified (dirty) pages, being the exact complement of sync. Perhaps extending BLKFLUSHBUFS to do what many expect of it would have been a better approach.

As to NFS write performance, transfers with an async
NFS
export over UDP went at wire speed, indicating that either or
both TCP and sync
export contributed to the
measured slowdown. As to TCP, it is well known that until recent
versions of the kernel its default tuning parameters were set to suit nodes with small memory and at most 100mb/s
connections. Changing the IP and TCP tuning parameters in
/etc/sysctl.conf
to some more suitable
values like:
    # Mostly for the benefit of NFS.
    # http://WWW-DIDC.LBL.gov/TCP-tuning/linux.html
    # http://datatag.web.CERN.CH/datatag/howto/tcp.html
    net/ipv4/tcp_no_metrics_save        =1

    # 2500 for 1gb/s, 30000 for 10gb/s.
    net/core/netdev_max_backlog         =2500
    #net/core/netdev_max_backlog        =30000

    # Higher CPU overhead but higher protocol efficiency.
    net/ipv4/tcp_sack                   =1
    net/ipv4/tcp_timestamps             =1
    net/ipv4/tcp_window_scaling         =1
    net/ipv4/tcp_moderate_rcvbuf        =1

    # This server has got 8GiB of memory mostly unused.
    net/core/rmem_default               =1000000
    net/core/wmem_default               =1000000
    net/core/rmem_max                   =40000000
    net/core/wmem_max                   =40000000
    net/ipv4/tcp_rmem                   =40000 1000000 40000000
    net/ipv4/tcp_wmem                   =40000 1000000 40000000

    # Probably not necessary, but may be useful for NFS over UDP.
    net/ipv4/ipfrag_low_thresh          =500000
    net/ipv4/ipfrag_high_thresh         =2000000

made transfers with
async
export over TCP work
almost as fast as with UDP. So far I still prefer NFS over UDP for
reliable LANs, given that the Linux nfs
client
driver does not recover properly from
session problems with the server,
as imprecisely described in the
Linux NFS-HOWTO:
The disadvantage of using TCP is that it is not a stateless protocol like UDP. If your server crashes in the middle of a packet transmission, the client will hang and any shares will need to be unmounted and remounted.

but the ability to tune TCP for NFS to give almost the same performance as UDP is nice.
The bigger question, however, is that of sync versus async exports. As is well known, async on the server side violates the semantics of NFS and of the UNIX/Linux filesystem API, as the client is told that data has been committed to disk when it has not, in order to prevent pauses while the data is being flushed out. To be sure, the filesystem is usually mounted with async; the issue here is whether it is exported from the server with sync or async, as programs running on the client can always explicitly request synchronous writing on the mounted filesystem, but cannot override the async option on the server.
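For example, a client-side program can ask for synchronous behaviour in the usual POSIX ways, roughly as below (a generic sketch; the write_synchronously() name is mine, and whether the server really reaches stable storage before replying still depends on the sync/async export option):

    #include <fcntl.h>
    #include <unistd.h>

    /* Two ordinary ways for a client program to request synchronous
     * writing on a mounted (possibly NFS) filesystem: open the file
     * with O_SYNC, or call fsync() after writing (both shown here
     * for illustration). Neither can force an async-exporting
     * server to actually commit the data to disk. */
    int write_synchronously(const char *path, const void *data, size_t len)
    {
        int fd = open(path, O_WRONLY | O_CREAT | O_SYNC, 0644);
        if (fd < 0)
            return -1;
        if (write(fd, data, len) != (ssize_t) len || fsync(fd) != 0) {
            close(fd);
            return -1;
        }
        return close(fd);
    }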
In theory the sync export option should not give performance very different from async, as I am using NFS version 3, and allegedly it allows doing delayed writes even when the server is in sync mode, because the client NFS driver (transparently to the application) can explicitly request flushing on the server when needed (and the server can refetch from the client data that could not be written):
Version 3 clients use COMMIT operations when flushing safe asynchronous writes to the server during a close(2) or fsync(2) system call, or when encountering memory pressure.
NFS Version 3 asynchronous writes eliminate the synchronous write bottleneck in NFS Version 2. When a server receives an asynchronous WRITE request, it is permitted to reply to the client immediately. Later, the client sends a COMMIT request to verify that the data has reached stable storage; the server must not reply to the COMMIT until it safely stores the data.

Asynchronous writes as defined in NFS Version 3 are most effective for large files. A client can send many WRITE requests, and then send a single COMMIT to flush the entire file to disk when it closes the file.

But the write rates I observed with sync export were still half those with async export (instead of one third, as before changing the IP and TCP parameters), indicating some remaining stop-go behaviour, so I did another
network trace of an NFS session (printed then with tcpdump
-ttt
to get inter-packet times) and I noticed some
crucial moments:
At first the transfer proceeds as a stream of WRITE UNSTABLE requests, and there are no huge delays (117 microseconds). At that point the file size is 32KiB (sz 0x8000).
Then the client sends a COMMIT for the first 512KiB, presumably as it wants to get rid of those pages from its page cache, since they have probably long since been written, and then starts a new WRITE UNSTABLE from 45MiB (32768 bytes @ 0x2b0c000) which therefore gets an immediate REPLY; then around 400 packets later there is a huge 1.2s delay, and not much thereafter. Evidently in that 1.2s period the server has executed the COMMIT, and for the whole 45MiB outstanding.

Later there is another COMMIT
request from the client for
8MiB at 512KiB (8159232 bytes @ 0x80000), as evidently the client wants to free up the next 8MiB, and for 32KiB at 8.5MiB (
32768 bytes @ 0x848000), which is the next block, while there is a new
WRITE UNSTABLE
at
80MiB (32768 bytes @ 0x4f14000), then around 300 packets later another huge 1.3s delay, which probably means that the server has actually done a
COMMIT
to
80MiB instead of the requested 8.5MiB.

From this it seems that the NFS client issues COMMITs well before the end of the file, because it needs to flush pages in order to reclaim page cache memory; that it then waits for the reply to the COMMIT, as without that it cannot reuse the existing unflushed cached pages; and that the NFS server ignores the range in COMMIT requests and flushes all the modified blocks received so far, that is it acts on COMMITs irrespective of the region specified in them.

In other words a COMMIT is equivalent to fsync, and whenever any of the data cached by the NFS client must be flushed, all the data received so far by the NFS server gets written to disk, which is hardly better than NFS version 2-style synchronous writing on the server, especially as the NFS client flushes whenever it needs to reclaim some unwritten file page, not just at the end, and it only keeps a few dozen megabytes unflushed at any time.
The usual answer is to simply export with async on the server, and likely the COMMIT range has never been implemented because server-side async is how people get write performance, and then they use battery-backed servers. Indeed in my situation the server has a RAID host adapter with a huge memory cache, so it must be battery backed anyhow. However some people would still rather use sync exports, so I wanted to see if sync exports could be improved.
Ideally the NFS client would not wait synchronously for the reply to a COMMIT, but would just remember the outstanding commit and, when the NFS server responds, mark as flushed just the data sent before that COMMIT. However, as the Linux NFS FAQ elliptically says:

The Linux NFS client uses synchronous writes under many circumstances, some of which are obvious, and some of which you may not expect.

and the Linux NFS client does not do that, as illustrated above, and at some point synchronously waits for the response to the COMMIT, largely because it sends it when its cache is full (instead of, for example, periodically or when it is half full), and it is these pauses that reduce write performance.
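The bookkeeping involved would be quite simple in principle; something like the following sketch (entirely hypothetical, not how the in-kernel NFS client is actually structured), where each COMMIT carries its byte range and new WRITE UNSTABLE requests keep flowing while it is outstanding:

    #include <stdint.h>

    /* Hypothetical per-file record of an outstanding COMMIT: the
     * byte range it covers and whether the server has replied yet. */
    struct pending_commit {
        uint64_t offset;                /* start of the committed range */
        uint64_t length;                /* length of the committed range */
        int      replied;               /* non-zero once the reply arrives */
        struct pending_commit *next;    /* other COMMITs still in flight */
    };

    /* Called when the server's reply to a COMMIT arrives: only the
     * pages in [offset, offset + length) become reusable, so the
     * client never has to stall all writes waiting for the reply. */
    static void commit_replied(struct pending_commit *pc)
    {
        pc->replied = 1;
        /* mark_range_stable(pc->offset, pc->length);  hypothetical helper */
    }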
The other half of the problem is that if, by the time the COMMIT is sent, the NFS server has already flushed all the pages sent to it, then it can reply almost immediately to the client. One way to achieve this is of course to use the sync export option, but this involves too frequent waits for a reply, and the session becomes essentially half-duplex. So the server should be flushing the pages it receives from NFS clients asynchronously.
Unfortunately by default the Linux page cache does not do that, and tends to accumulate modified pages until it is forced to write them out, for example by a COMMIT. This is because its tuning parameters are set rather loosely, and allow, depending on circumstances, dozens of MBs to several GBs of modified pages to be cached in memory unflushed, to be written then all at once, something that causes further problems both to NFS clients and to interactive programs. So on the NFS server I have tightened them to:
    vm/dirty_ratio                  =40
    vm/dirty_background_ratio       =2
    vm/dirty_expire_centisecs       =400
    vm/dirty_writeback_centisecs    =200

These parameters should be tightened also on the NFS client system, as it helps to have the pages written by the application flushed and sent over the network to the NFS server in a continuous and smooth way.
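As a rough worked example of what those percentages mean on this server (assuming roughly 8GiB of dirtyable memory and ignoring the clamping logic discussed below):

    #include <stdio.h>

    /* Approximate thresholds implied by dirty_background_ratio=2 and
     * dirty_ratio=40 with about 8GiB of dirtyable memory: background
     * flushing starts at roughly 160MiB of modified pages, while
     * writers would only be throttled above roughly 3.2GiB, which is
     * why the time-based dirty_expire/dirty_writeback settings
     * matter too. */
    int main(void)
    {
        const unsigned long long mem = 8ULL << 30;          /* ~8GiB */
        printf("background flush threshold: %llu MiB\n",
               (mem * 2 / 100) >> 20);
        printf("hard dirty limit:           %llu MiB\n",
               (mem * 40 / 100) >> 20);
        return 0;
    }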
With these settings transfers were still noticeably faster with async, and ran at around 57MB/s with the sync option; but it may be good enough to just use async anyhow, because even with async, thanks to the flushing dæmon parameters above, the NFS server, instead of accumulating 300-600MB of modified pages and then writing them out all at once (the disk is attached to a host adapter with a large RAM cache), writes a steady stream of modified pages at around 60-70MB/s, with a few seconds of delay with respect to the NFS client. This minimizes the window of vulnerability to crashes, giving almost the same safety as sync, nearly as if the NFS client were indeed doing incremental COMMITs (as it should...). The effect would probably be sufficient for the application I have been tuning this server for without having to resort to sync, even if the NFS server did not have battery backup.

While doing all this I have had yet another look at the page cache tuning code in the Linux kernel, which has given me unpleasant surprises before (for example the vm/page-cluster
story).
The most recent one is this code from
mm/page-writeback.c:get_dirty_limits():
    dirty_ratio = vm_dirty_ratio;
    if (dirty_ratio > unmapped_ratio / 2)
        dirty_ratio = unmapped_ratio / 2;

    if (dirty_ratio < 5)
        dirty_ratio = 5;

    background_ratio = dirty_background_ratio;
    if (background_ratio >= dirty_ratio)
        background_ratio = dirty_ratio / 2;

    background = (background_ratio * available_memory) / 100;
    dirty = (dirty_ratio * available_memory) / 100;
    tsk = current;
    if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk)) {
        background += background / 4;
        dirty += dirty / 4;
    }
    *pbackground = background;
    *pdirty = dirty;

which embodies a large number of particularly objectionable misfeatures, both strategic and tactical. The most damning is that the amount of unflushed memory allowed outstanding is set as a function of the memory available, which is laughable, as it should be a function of disk speed; in other words it should be set as a number of pages, not as a percentage of the pages available (that is, background and dirty should be the parameters set directly).
Then there is the check that dirty_background_ratio should not be larger than dirty_ratio. Fine, but why halve it? And why do so even when it is merely equal to it? This means that if dirty_ratio is 10 and dirty_background_ratio is 9 the latter is not changed, but if dirty_background_ratio is 11 it is reset to 5, not to 10. Indeed it is reset to 5 even if it is set to 10, the same as dirty_ratio.

Moreover the kernel does adjust the effective values of dirty_ratio and dirty_background_ratio, but does so invisibly. That is:
    base# sysctl vm/dirty_ratio=8 vm/dirty_background_ratio=8
    vm.dirty_ratio = 8
    vm.dirty_background_ratio = 8
    base# sysctl vm/dirty_ratio vm/dirty_background_ratio
    vm.dirty_ratio = 8
    vm.dirty_background_ratio = 8

even if the effective value of
dirty_background_ratio
is 4, as by being equal to dirty_ratio
it has
been set to half its value.
My inclination would be to restructure it somewhat like this:

    /*
     * Work out the current dirty-memory clamping and background
     * writeout thresholds.
     *
     * If the numbers are greater than 100 they are taken to be
     * directly number of pages, else percentages of available
     * lowmem pages.
     *
     * We try to bound the resulting number of pages so that there
     * can be a minimum number of pages before the writing processes
     * or the flusher start writing out, and so that the flusher
     * activation threshold is not larger than the process
     * synchronous write one.
     */
    static void get_dirty_limits(long *const pbackground, long *const pdirty,
                                 const struct address_space *const mapping)
    {
    #ifdef CONFIG_HIGHMEM
        /* Take only lowmem into account */
        const long unsigned available_pages
            = vm_total_pages - totalhigh_pages;
    #else
        const long unsigned available_pages
            = vm_total_pages;
    #endif
        const long unsigned unmapped_pages
            = vm_total_pages
            - global_page_state(NR_FILE_MAPPED)
            - global_page_state(NR_ANON_PAGES);

        /*
         * If the value of the '/proc/sys' setting is higher
         * than 100 it is not a percentage but a number of pages
         * directly.
         */
        const long unsigned vm_dirty_pages
            = (vm_dirty_ratio > 100L)
            ? vm_dirty_ratio : +(vm_dirty_ratio*available_pages)/100L;
        const long unsigned vm_background_pages
            = (vm_background_ratio > 100L)
            ? vm_background_ratio : +(vm_background_ratio*available_pages)/100L;

        /*
         * We leave at least 8 pages unflushed, with an upper
         * limit of 50% of unmapped pages for the process
         * synchronous writing threshold, or of that threshold
         * for the flusher threshold.
         */
        const long unsigned dirty_pages
            = min(max(8UL, vm_dirty_pages), unmapped_pages/2);
        const long unsigned background_pages
            = min(max(8UL, vm_background_pages), dirty_pages);

        /*
         * Reset the '/proc/sys' variables to the actual values
         * computed here.
         */
    #if 0
        vm_dirty_ratio          = +(dirty_pages*100L)/available_pages;
        vm_background_ratio     = +(background_pages*100L)/available_pages;
    #else
        vm_dirty_ratio          = (dirty_pages > 100L)
            ? dirty_pages : +(dirty_pages*100L)/available_pages;
        vm_background_ratio     = (background_pages > 100L)
            ? background_pages : +(background_pages*100L)/available_pages;
    #endif

        *pdirty         = (long) dirty_pages;
        *pbackground    = (long) background_pages;
    }