Lately we've been seeing issues with zone reclaim mode on large-memory, multiprocessor NUMA Linux systems.

What's NUMA? It's an acronym for Non-Uniform Memory Access, which means that some memory in your system is more expensive for a particular CPU to access than its "local" memory.

You can see how distant the kernel considers the memory on each node to be by using the numactl command like so:

numactl --hardware
If you've got a modern multiprocessor system, you'll probably see something like this:


available: 2 nodes (0-1)
node 0 size: 48417 MB
node 0 free: 219 MB
node 1 size: 48480 MB
node 1 free: 135 MB
node distances:
node   0   1
  0:  10  21
  1:  21  10 

Here we see a distance of 10 for node 0 to access its own memory and 21 for node 0 to access node 1's memory. What does distance really mean? It's a relative cost parameter based on the number of "hops" or buses that separate the node from the distant memory.


Now, what's zone reclaim mode?

From the Linux kernel docs:


zone_reclaim_mode:

Zone_reclaim_mode allows someone to set more or less aggressive approaches to
reclaim memory when a zone runs out of memory. If it is set to zero then no
zone reclaim occurs. Allocations will be satisfied from other zones / nodes
in the system.

This value is ORed together from:

1       = Zone reclaim on
2       = Zone reclaim writes dirty pages out
4       = Zone reclaim swaps pages

zone_reclaim_mode is set during bootup to 1 if it is determined that pages
from remote zones will cause a measurable performance reduction. The
page allocator will then reclaim easily reusable pages (those page
cache pages that are currently not used) before allocating off node pages.

It may be beneficial to switch off zone reclaim if the system is
used for a file server and all of memory should be used for caching files
from disk. In that case the caching effect is more important than
data locality.

Allowing zone reclaim to write out pages stops processes that are
writing large amounts of data from dirtying pages on other nodes. Zone
reclaim will write out dirty pages if a zone fills up and so effectively
throttle the process. This may decrease the performance of a single process
since it cannot use all of system memory to buffer the outgoing writes
anymore but it preserves the memory on other nodes so that the performance
of other processes running on other nodes will not be affected.

Allowing regular swap effectively restricts allocations to the local
node unless explicitly overridden by memory policies or cpuset
configurations.
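
To make the bitmask concrete: a value of 3 (1 | 2) means zone reclaim is on and may write out dirty pages, but won't swap. If you wanted to experiment with that (not that we recommend it for PostgreSQL), you could set it at runtime as root like so:

echo 3 > /proc/sys/vm/zone_reclaim_mode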

The file server paragraph above is the key one: PostgreSQL depends heavily on the filesystem cache, so disabling zone reclaim mode is desirable in this situation.

There's been a bit of discussion about this on the pgsql-performance mailing list here: http://archives.postgresql.org/pgsql-performance/2012-07/msg00215.php

If you've got a modern multi-socket system, odds are good that zone reclaim mode is enabled automatically on boot.  You can check this by looking at /proc/sys/vm/zone_reclaim_mode.
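
For example:

cat /proc/sys/vm/zone_reclaim_mode

A value of 1 (or any value with the low bit set) means zone reclaim is enabled; 0 means it's off.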

The biggest issue we've seen with zone reclaim mode enabled on customers' multi-socket systems is that the filesystem cache never fills up, even when the database is much larger than RAM. That's because the system is trying to keep some "local" memory available. After disabling zone_reclaim_mode, the filesystem cache fills up and performance improves.
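
One easy way to watch for this is the "cached" column of free output while the database is busy; with zone reclaim enabled, it tends to plateau well below total RAM:

free -m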

So, how to disable zone_reclaim_mode?  The best way to do this is via sysctl.  Just add:

vm.zone_reclaim_mode = 0 

to /etc/sysctl.conf, save it, and execute sysctl -p to load the new settings into the kernel.
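
You can also flip it immediately at runtime, without waiting for a reboot:

sysctl -w vm.zone_reclaim_mode=0

The /etc/sysctl.conf entry just ensures the setting survives the next reboot.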

  1. The slides from my SCALE 15x talk: Minimizing PostgreSQL Major Version Upgrade Downtime Using Slony are available here:

    https://www.slideshare.net/JeffFrost2/scale-15x-minimizing-postgresql-major-version-upgrade-downtime

    You'll want to download the PowerPoint version as it includes some embedded screencasts that don't play on SlideShare.

  2. If you're like me, you use wal-e for PITR-based backups on many of the PostgreSQL servers you manage.  While setting it up, you likely discovered that the --terse option greatly reduces the log output.  However, if you run the backup-push job in cron and have email aliases set up properly, you probably receive emails like this every time the backup-push runs:

    NOTICE:  pg_stop_backup complete, all required WAL segments have been archived

    Why is that?  Well, the default value for client_min_messages in PostgreSQL is NOTICE, which means the client (wal-e in this case) will be sent messages of NOTICE and higher priority.

    So, how do you change that?  You may have gotten this far and realized there isn't a handy command-line switch that wal-e exposes.  That's OK though: wal-e uses libpq, so we just need to use the PGOPTIONS environment variable.

    If you're using wal-e with envdir like in the documentation examples, then you can just plunk down a PGOPTIONS file in /etc/wal-e.d/env that looks like this:

    -c client_min_messages=WARNING

    and that's it. If you're using a wrapper script for wal-e instead of envdir, you just need to export that environment variable before calling wal-e like this:

    export PGOPTIONS="-c client_min_messages=WARNING"

    Now sit back and enjoy fewer cron emails in the morning.
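
    In case it's handy, here's one way to create that envdir file (the path follows the wal-e documentation layout used below):

    echo "-c client_min_messages=WARNING" > /etc/wal-e.d/env/PGOPTIONS

    envdir exports each file in that directory as an environment variable named after the file, with the file's contents as its value, so this is equivalent to the export above.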

    EDIT: The envdir method only works if bash is your cron shell (SHELL=/bin/bash in the crontab) or if you call the command like this:

    bash -c "envdir /etc/wal-e.d/env /usr/local/bin/wal-e --terse backup-push /var/lib/postgresql/9.4/main"




  3. Found a great walkthrough on setting up WAL-E to use python-swiftclient for storage in Rackspace Cloud Files: https://developer.rackspace.com/blog/postgresql-plus-wal-e-plus-cloudfiles-equals-awesome/

    Unfortunately by default, your backups use the public URL for Cloud Files and eat into metered public bandwidth.

    The way to work around this is to set the endpoint_type to internalURL instead of the default publicURL.

    You do that by setting the following environment variable:

    SWIFT_ENDPOINT_TYPE='internalURL'

    That allows WAL-E to use ServiceNet for base backups and WAL archiving, which will be much faster and won't eat into your metered public bandwidth.
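
    If you're using wal-e with envdir, that's just one more file in the environment directory (the variable name comes from the walkthrough above):

    echo 'internalURL' > /etc/wal-e.d/env/SWIFT_ENDPOINT_TYPE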

  4. Sometimes I find myself wanting to start a PostgreSQL cluster with pg_ctl on a Debian or Ubuntu system.  If you've ever tried this, you know that something like this doesn't work:

    vagrant@vagrant-ubuntu-trusty-64:/vagrant$ sudo -u postgres /usr/lib/postgresql/9.3/bin/pg_ctl -D /var/lib/postgresql/9.3/main/ start
    server starting
    vagrant@vagrant-ubuntu-trusty-64:/vagrant$ postgres cannot access the server configuration file "/var/lib/postgresql/9.3/main/postgresql.conf": No such file or directory


    because the Debian packages place the config files in /etc/postgresql/<version>/<cluster name>, not in the data directory where postgres looks for postgresql.conf by default.

    So, I used to think that this was the way to start it:

    vagrant@vagrant-ubuntu-trusty-64:/vagrant$ sudo -u postgres /usr/lib/postgresql/9.3/bin/pg_ctl -D /etc/postgresql/9.3/main/ start
    server starting
    vagrant@vagrant-ubuntu-trusty-64:/vagrant$ 2014-04-30 00:59:32 UTC LOG:  database system was shut down at 2014-04-30 00:57:49 UTC
    2014-04-30 00:59:32 UTC LOG:  database system is ready to accept connections
    2014-04-30 00:59:32 UTC LOG:  autovacuum launcher started


    And that does generally work, because the Debian-packaged postgresql.conf sets data_directory explicitly, but the proper way to do it is this:

    vagrant@vagrant-ubuntu-trusty-64:/vagrant$ sudo -u postgres /usr/lib/postgresql/9.3/bin/pg_ctl -D /var/lib/postgresql/9.3/main/ -o "-c config_file=/etc/postgresql/9.3/main/postgresql.conf" start
    server starting
    vagrant@vagrant-ubuntu-trusty-64:/vagrant$ 2014-04-30 01:00:22 UTC LOG:  database system was shut down at 2014-04-30 01:00:16 UTC
    2014-04-30 01:00:22 UTC LOG:  database system is ready to accept connections
    2014-04-30 01:00:22 UTC LOG:  autovacuum launcher started


    The real Debian way is to use pg_ctlcluster like so:

    vagrant@vagrant-ubuntu-trusty-64:~$ sudo -u postgres pg_ctlcluster 9.3 main start
    vagrant@vagrant-ubuntu-trusty-64:~$
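
    pg_ctlcluster handles the other actions the same way; for example, stopping and checking the cluster look like this:

    sudo -u postgres pg_ctlcluster 9.3 main stop
    sudo -u postgres pg_ctlcluster 9.3 main status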




  5. Sometimes I find myself wanting to grep all the query durations out of a PostgreSQL log file for sorting.  I recently discovered that I can use GNU grep's lookbehind syntax like so:

    grep -oP '(?<=duration: )[0-9\.]+' postgresql-Mon.csv

    and then sort them numerically like so:

    grep -oP '(?<=duration: )[0-9\.]+' postgresql-Mon.csv | sort -n | tail

  6. As I mentioned in my previous post, Ubuntu recently released a 3.11.0 kernel for Precise.  Unfortunately, while it appeared stable during a long burn-in process with various pgbench workloads and scale sizes, it crashed shortly after we put the machine into service, so make sure you test thoroughly on your hardware.

    This particular hardware was the Dell PowerEdge 720 that Rackspace uses.

    Guess we're going back to custom compiling kernels for a little while longer.

    Anyone else have experiences good or bad with it?

    I believe this is the same kernel that's shipping with 13.10 Saucy Salamander.

  7. This one is for those of you compiling a custom kernel or staying on 10.04 LTS because of this issue affecting database servers on the 3.x kernels: https://lkml.org/lkml/2012/10/9/210

    Ubuntu has released linux-image-3.11.0-13-generic for 12.04 in which that fix is included.  It's actually been included since 3.9rc2, but Ubuntu skipped straight to 3.11.

    And there was much rejoicing in the PostgreSQL community.  At least from those of us who manage 12.04 LTS PostgreSQL servers!

    For those of you who don't recall, here's a link to the big discussion thread:

    http://www.postgresql.org/message-id/60B572D9298D944580F7D51195DD30804357FA4ABF@VMBX125.ihostexchange.net

    One item of note on 3.11 is that the kernel.sched_migration_cost setting is now called kernel.sched_migration_cost_ns, so if you are setting that via sysctl, you will need to update your sysctl.conf or sysctl.d file after upgrade.
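
    For example, if your sysctl.conf previously contained something like this (the value here is purely illustrative):

    kernel.sched_migration_cost = 5000000

    after the upgrade it would need to read:

    kernel.sched_migration_cost_ns = 5000000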

    Now go upgrade your test servers and let us know how it performs!  I've been burning in a new server for about 24 hours with pgbench and the 3.11.0-13-generic kernel and so far it's been stable and a good performer.

  8. Just whipped up a quick Ruby script to put PgBouncer "show pools" data into Ganglia using gmetric.  I based it heavily on the postgres_gmetric script.

    You can find the pgbouncer gmetric script here: https://github.com/pgexperts/pgbouncer_gmetric

  9. You've probably heard the announcement: Martin Pitt's PPA is being deprecated.  This is in an effort to push folks towards the new official apt.postgresql.org.

    See the announcement here: http://www.postgresql.org/about/news/1432/

    Also, see the deprecation notice on Martin's PPA:
    https://launchpad.net/~pitti/+archive/postgresql

    Since we have quite a few customers using Martin's PPA for PostgreSQL packages under Ubuntu, I wrote a little shell script that switches repositories if you're using Martin's PPA, and simply installs the apt.postgresql.org repository if you're not.

    You can find it here: https://github.com/pgexperts/add-pgdg-apt-repo


  10. Since Slony doesn't automatically add tables to replication for you, sometimes it's handy to double-check that everything is being replicated.  If your Slony cluster is named "sample", then you'll have a Slony schema named "_sample". As such, you can query against the sl_table and sl_sequence tables to find unreplicated tables and sequences like so:

    SELECT pg_class.oid::regclass AS table_name
    FROM pg_class
    JOIN pg_namespace ON pg_class.relnamespace = pg_namespace.oid
    WHERE relkind = 'r'
      AND pg_class.oid NOT IN (SELECT tab_reloid FROM _sample.sl_table)
      AND pg_namespace.nspname NOT IN ('information_schema', 'pg_catalog', '_sample');

    SELECT pg_class.oid::regclass AS sequence_name
    FROM pg_class
    JOIN pg_namespace ON pg_class.relnamespace = pg_namespace.oid
    WHERE relkind = 'S'
      AND pg_class.oid NOT IN (SELECT seq_reloid FROM _sample.sl_sequence)
      AND pg_namespace.nspname NOT IN ('information_schema', 'pg_catalog', '_sample');

    Those queries are a bit unpleasant to remember, so I like to make them into views (the first view below also shows how to exclude a table you've deliberately left unreplicated) like so:


    CREATE OR REPLACE VIEW public.unreplicated_tables AS
    SELECT pg_class.oid::regclass AS table_name
    FROM pg_class
    JOIN pg_namespace ON pg_class.relnamespace = pg_namespace.oid
    WHERE relkind = 'r'
      AND pg_class.oid NOT IN (SELECT tab_reloid FROM _sample.sl_table)
      AND pg_namespace.nspname NOT IN ('information_schema', 'pg_catalog', '_sample')
      AND pg_class.relname <> 'fwd_epc_map';

    CREATE OR REPLACE VIEW public.unreplicated_sequences AS
    SELECT pg_class.oid::regclass AS sequence_name
    FROM pg_class
    JOIN pg_namespace ON pg_class.relnamespace = pg_namespace.oid
    WHERE relkind = 'S'
      AND pg_class.oid NOT IN (SELECT seq_reloid FROM _sample.sl_sequence)
      AND pg_namespace.nspname NOT IN ('information_schema', 'pg_catalog', '_sample');

    Now, that's a little handier because we can just select straight out of the view, like so:
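
    SELECT * FROM public.unreplicated_tables;
    SELECT * FROM public.unreplicated_sequences;

    Now that we have these easy-to-use views, we can use check_postgres.pl's custom_query check to set up a Nagios alert whenever there are unreplicated tables or sequences.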

    First we need to create the check commands like so:

    define command{
            command_name    check_pgsql_slony_unreplicated_tables
            command_line    /usr/lib64/nagios/plugins/check_postgres.pl --host $HOSTADDRESS$ --port=5432 --dbuser nagios --dbname mydb --action=custom_query --query="SELECT count(table_name) AS result FROM public.unreplicated_tables" --valtype=integer --warn=1 --critical=5
    }

    define command{
            command_name    check_pgsql_slony_unreplicated_sequences
            command_line    /usr/lib64/nagios/plugins/check_postgres.pl --host $HOSTADDRESS$ --port=5432 --dbuser nagios --dbname mydb --action=custom_query --query="SELECT count(sequence_name) AS result FROM public.unreplicated_sequences" --valtype=integer --warn=1 --critical=5
    }
    I apologize for all the wrapping there.

    Now that we have the commands defined, we use them in a service definition:

    define service {
        use                          postgres-service
        notifications_enabled        1
        hostgroup_name               slony_primary
        service_description          Slony Unreplicated Tables
        check_command                check_pgsql_slony_unreplicated_tables
        contact_groups               pgexperts
    }

    define service {
        use                          postgres-service
        notifications_enabled        1
        hostgroup_name               slony_primary
        service_description          Slony Unreplicated Sequences
        check_command                check_pgsql_slony_unreplicated_sequences
        contact_groups               pgexperts
    }
    And that's it! Now we'll receive an alert whenever an unreplicated table or sequence exists on the servers in hostgroup slony_primary.



