Thursday, August 04, 2011

Linux and glibc Scalability

As some of you probably already know from following the traffic on pgsql-hackers, I've been continuing to beat away at the scalability issues around PostgreSQL.  Interestingly, the last two problems I found turned out not to be internal bottlenecks in PostgreSQL.  Instead, they were bottlenecks in other software with which PostgreSQL was interacting during the test runs.

I ran some SELECT-only pgbench tests on a 64-core server and noticed something odd: pgbench was consuming very large amounts of system time.  The problem turned out to be that pgbench calls random().  Since random() does not take a random seed as an argument, it has to rely on some kind of global state, and is therefore not inherently thread-safe.  glibc handles this by wrapping a mutex around it - on Linux, this is a futex - so that executions of random() are serialized across all threads.  This works, but it slows things down significantly even on a 32-core machine, and on a 64-core machine, the slowdown is even worse.  I doubt this effect was visible on earlier releases of PostgreSQL, because the bottlenecks on the server side limited the throughput far more severely than anything pgbench was doing.  But the effect is now visible in a SELECT-only pgbench test, provided you have enough CPUs.
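Conceptually, the serialization looks something like this - a sketch of my own, not glibc's actual source - for any random() whose state is a hidden global:

    #include <pthread.h>

    /* Conceptual sketch only - not glibc's real implementation.  Because
     * the generator's state is a hidden global, every caller, in every
     * thread, must funnel through a single lock. */
    static pthread_mutex_t rand_lock = PTHREAD_MUTEX_INITIALIZER;
    static unsigned long rand_state = 1;

    long
    locked_random(void)
    {
        long result;

        pthread_mutex_lock(&rand_lock);   /* all threads serialize here */
        rand_state = rand_state * 1103515245 + 12345;  /* illustrative LCG */
        result = (long) ((rand_state >> 16) & 0x7fffffff);
        pthread_mutex_unlock(&rand_lock);
        return result;
    }

With 64 cores all hammering that one mutex, the lock itself becomes the hot spot, no matter how cheap the arithmetic inside it is.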

Tom Lane and I had a somewhat protracted discussion about what to do about this, and eventually decided to forget about using any OS-supplied random number generator, and instead use our own implementation of erand48(), now renamed pg_erand48().  This takes the random state data as an argument and is therefore both thread-safe and lock-free.
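For illustration, here is a minimal erand48()-style generator in the same spirit - a simplified stand-in, not the actual pg_erand48() code, using the standard rand48 constants:

    #include <stdint.h>

    /* Simplified sketch, not the real pg_erand48().  The 48-bit state
     * lives in caller-supplied storage, so each thread keeps its own
     * and no locking is needed. */
    typedef struct { uint64_t s; } prng48;    /* hypothetical type */

    static double
    sketch_erand48(prng48 *r)
    {
        /* standard rand48 LCG: a = 0x5DEECE66D, c = 0xB, modulus 2^48 */
        r->s = (r->s * UINT64_C(0x5DEECE66D) + 0xB) & ((UINT64_C(1) << 48) - 1);
        return (double) r->s / (double) (UINT64_C(1) << 48);
    }

Each pgbench thread can keep its own generator state in local memory, so no two threads ever touch the same data.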

With that problem out of the way, the bottleneck shifted to the server side.  pgbench was now humming along without a problem, but PostgreSQL was using enormous amounts of CPU time.  It took a while to track this down, but the bottleneck turned out to be the Linux kernel's implementation of lseek.  Linux protects each inode with a mutex, and PostgreSQL uses lseek - which takes that mutex - to find the length of the file for query planning purposes.  With enough clients, this mutex can become badly contended.  This effect is, I believe, measurable even on a 32-core box, but it's not severe there.  However, on the 64-core server I tested on, it led to a complete collapse in performance beyond 40 clients.  When I ran pgbench with the "-M prepared" option, which avoids replanning the query and therefore doesn't repeatedly invoke lseek, the performance collapse vanished.  There was still some degradation due to other contention problems, but it was not nearly as bad.
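The pattern at issue is roughly this - a schematic, not the actual PostgreSQL function:

    #include <sys/types.h>
    #include <unistd.h>

    /* Schematic of the contended pattern: asking the kernel for a file's
     * length via lseek().  On the kernels discussed here, the seek takes
     * the per-inode mutex, so many backends asking for the size of the
     * same relation file all line up on one lock.  (It also moves the
     * file offset, which real code has to account for.) */
    static off_t
    file_length(int fd)
    {
        return lseek(fd, 0, SEEK_END);  /* offset at end-of-file = length */
    }

The call looks innocuous - it transfers no data and needn't touch the disk - which is what makes the contention so surprising.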

As it turns out, I'm not the first person to run into this problem: the MOSBENCH guys at MIT, hacking on their modified version of PostgreSQL, ran into it last fall.  They, too, described the performance as "collapsing" due to lseek contention.  I thought they were exaggerating, but if anything they were understating the extent of the problem.  On this 64-core server, going from 40 clients to 56 clients led to more than a sevenfold drop in aggregate throughput.  This problem was largely masked in earlier releases of PostgreSQL by the bottlenecks in our lock manager; but, as I blogged about before, those problems are now fixed.  So, in PostgreSQL 9.2devel, it's pretty easy to hit this problem.  You just need enough CPUs.

It's not yet clear to me exactly how we're going to solve or work around this problem.  It would be nice to see it fixed in the Linux kernel, because surely this is an issue that could also affect other applications.  On the other hand, it would also be nice to see it fixed in PostgreSQL, because it doesn't seem inconceivable that it could affect other kernels.  Fixing it in PostgreSQL would presumably mean interposing some type of cache, designed in such a way as to avoid having the cache - or any single cache entry - protected by a single spinlock that 40+ CPUs can go nuts fighting over.
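To make the shape of that idea concrete, here is one possible sketch - purely illustrative, using C11 atomics for brevity, and not a proposal for the actual PostgreSQL implementation:

    #include <stdatomic.h>
    #include <stdint.h>

    /* Illustrative sketch only.  Keep the last-known length of each
     * relation file, in blocks, as a lock-free atomic value in shared
     * memory.  Readers never take a lock; only a backend that extends
     * or truncates the file ever stores to it. */
    typedef struct
    {
        _Atomic uint64_t nblocks;
    } RelSizeCacheEntry;              /* hypothetical name */

    static uint64_t
    get_cached_nblocks(RelSizeCacheEntry *entry)
    {
        /* plain atomic load: no lock, and no cache-line ping-pong unless
         * a concurrent extension is actually writing to the entry */
        return atomic_load_explicit(&entry->nblocks, memory_order_relaxed);
    }

    static void
    set_cached_nblocks(RelSizeCacheEntry *entry, uint64_t nblocks)
    {
        atomic_store_explicit(&entry->nblocks, nblocks, memory_order_relaxed);
    }

The hard part, of course, is not the data structure but the invalidation policy: knowing when the cached value can be trusted and when a backend must fall back to asking the kernel.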

15 comments:

  1. This comment has been removed by the author.

  2. Is it possible for you to send a patch to the kernel fixing this problem?

  3. There was a patch back in 2005 to convert PostgreSQL from using lseek+read to pread():
    http://archives.postgresql.org/pgsql-patches/2005-10/msg00068.php

    However, it was not accepted at the time because it didn't have a measurable benefit (on a single-core Celeron ;). Perhaps it's time to try again? It's plausible that pread doesn't need as much synchronization in kernel space.

  4. @intgr: pread() might help if we were doing an lseek() followed by a read to get the data at a particular offset, but we're not. We're using lseek() to determine the length of the file, which pread() cannot do.

  5. Why are you using lseek(2) and not fstat(2)? I don't know of a case where the former works and the latter doesn't.

  6. @Jörg Sonnenberger: I'll give that a try.

  7. > It would be nice to see it fixed in the Linux kernel

    Well, don't expect them to read your blog; send an email to the LKML.

    Also, whenever talking about Linux kernel performance, you should state the kernel version number. There are some locking improvements in every release.

    By the way, why does PostgreSQL need the file length so often in a read-only test? Is it for the planner's rowcount estimate?

  8. Things like file size tend to get checked much more often than they change. Wouldn't the RCU algorithm work better in this scenario?

  9. If PostgreSQL needs to know if the file has changed -- and PostgreSQL is the only one that should be changing it -- maybe instead of relying on the filesystem (potentially slow, especially if it has to go to disk) and system calls (slow, serialized), a shared memory buffer storing that information would be sufficient, no?

  10. I hate to ask, but have you tried a test on FreeBSD? UFS2 is reentrant, and I'd be curious to know where its FS tipping point is under this scale/workload.

    As a bonus, FreeBSD has dtrace support, so you could instrument that kind of test and extract information that would be useful regardless of the OS.

  11. I passed your blog link to an acquaintance on the kernel list. He passed it on to the relevant kernel folks. Just got an email back that patches making llseek lockless have been passed upstream to the kernel.

  12. @Jörg Sonnenberger: I tested fstat, and that is, indeed, much faster at high client counts. It appears that it might be slightly slower at low client counts, though. Also, who knows what might be true on some other operating system? It seems an uncomfortable thing to rely on. (A minimal sketch of the fstat() approach appears after the comments.)

    @PSulecki: I don't think RCU would be suitable here even if we had an implementation in PostgreSQL, because relation extension can be pretty frequent, if (for example) there's a big COPY in progress. We don't want to slow that down.

    @Jon: Yeah, that is true. The trick is that we'd need synchronization on the buffer, and that would need to be carefully thought through; else the cure might be worse than the disease. Still, it's worth thinking about...

    @Sean: No, I don't have access to any BSD boxen. If you (or anyone else out there) can provide access to one with lots of cores, I'm happy to give it a whirl.

    @dnewcomb: Great, thanks! Is there a link I should be looking at?

  13. Sorry, no link. I just got an email that mentioned Andi Kleen had posted the patches upstream. I'm not a kernel (or C) programmer :-(.

  14. https://lkml.org/lkml/2011/9/16/167

  15. @Robert, @dnewcomb: The link for the lockless lseek() is https://lkml.org/lkml/2011/9/15/399

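A minimal illustration of the fstat() alternative discussed in comments 5 and 12 - a hypothetical helper, not PostgreSQL code:

    #include <sys/stat.h>
    #include <sys/types.h>

    /* Hypothetical helper: read the file length from the inode metadata
     * via fstat() rather than lseek(SEEK_END).  Per comment 12, this was
     * much faster at high client counts on the server tested above. */
    static off_t
    file_length_fstat(int fd)
    {
        struct stat st;

        if (fstat(fd, &st) != 0)
            return (off_t) -1;        /* caller should check errno */
        return st.st_size;
    }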