Thursday, August 23, 2012

PostgreSQL, NUMA and zone reclaim mode on linux

Lately we've been seeing issues with zone reclaim mode on large memory multi processor NUMA linux systems.

What's NUMA?  It's just an acronym for Non-Uniform Memory Access.  This means that some memory in your system is more expensive for a particular CPU to access than its "local" memory.

You can see how much more distant the kernel considers the different zones by using the numactl command like so:

numactl --hardware
If you've got a modern multiprocessor system, you'll probably see something like this:


available: 2 nodes (0-1)
node 0 size: 48417 MB
node 0 free: 219 MB
node 1 size: 48480 MB
node 1 free: 135 MB
node distances:
node   0   1
  0:  10  21
  1:  21  10 

Here we see a distance of 10 for node 0 to access node 0 memory and 21 for node 0 to access node 1 memory.   What does distance really mean?  It's a cost parameter based on number of "hops" or buses that separate the node from the distant memory.


Now, what's zone reclaim mode?

From the linux kernel docs:


zone_reclaim_mode:

Zone_reclaim_mode allows someone to set more or less aggressive approaches to
reclaim memory when a zone runs out of memory. If it is set to zero then no
zone reclaim occurs. Allocations will be satisfied from other zones / nodes
in the system.

This is value ORed together of

1       = Zone reclaim on
2       = Zone reclaim writes dirty pages out
4       = Zone reclaim swaps pages

zone_reclaim_mode is set during bootup to 1 if it is determined that pages
from remote zones will cause a measurable performance reduction. The
page allocator will then reclaim easily reusable pages (those page
cache pages that are currently not used) before allocating off node pages.

It may be beneficial to switch off zone reclaim if the system is
used for a file server and all of memory should be used for caching files
from disk. In that case the caching effect is more important than
data locality.

Allowing zone reclaim to write out pages stops processes that are
writing large amounts of data from dirtying pages on other nodes. Zone
reclaim will write out dirty pages if a zone fills up and so effectively
throttle the process. This may decrease the performance of a single process
since it cannot use all of system memory to buffer the outgoing writes
anymore but it preserve the memory on other nodes so that the performance
of other processes running on other nodes will not be affected.

Allowing regular swap effectively restricts allocations to the local
node unless explicitly overridden by memory policies or cpuset
configurations.

I highlighted the text above because PostgreSQL depends heavily on the filesystem cache and disabling zone reclaim mode is desirable in this situation.

There's been a bit of discussion about this on the pgsql-performance mailing list here: http://archives.postgresql.org/pgsql-performance/2012-07/msg00215.php

If you've got a modern multi-socket system, odds are good that zone reclaim mode is enabled automatically on boot.  You can check this by looking at /proc/sys/vm/zone_reclaim_mode.

The biggest issue we've seen with zone reclaim mode enabled on customer multi-socket systems is the filesystem cache never filling up even when the database is much larger than RAM.  That's because the system is trying to keep some "local" memory available. After disabling zone_reclaim_mode, the filesystem cache fills up and performance improves.

So, how to disable zone_reclaim_mode?  The best way to do this is via sysctl.  Just add:

vm.zone_reclaim_mode = 0 

to /etc/sysctl.conf, save it and execute sysctl -p to load the new settings into the kernel.

Other interesting non PostgreSQL pages on NUMA/zone_reclaim_mode:


No comments:

Post a Comment