Lessons learned scaling PostgreSQL database to 1.2bn records/month

“This isn’t my first rodeo with large datasets. The authentication and product management database that I designed for the largest UK public Wi-Fi provider had impressive volumes too. We were tracking authentication for millions of devices daily. However, that project had funding that allowed us to pick any hardware, any supporting services and hire any DBAs to assist with replication/data warehousing/troubleshooting. Furthermore, all analytics queries/reporting were done off logical replicas and there were multiple sysadmins who looked after the supporting infrastructure. This, by contrast, was a venture of my own, with limited funding and 20x the volume.

Others’ mistakes

This is not to say that if we did have loadsamoney we would have spent it on purchasing top-of-the-line hardware, flashy monitoring systems or DBAs (okay, maybe having a dedicated DBA would have been nice). Over many years of consulting I have developed the view that the root of all evil lies in an unnecessarily complex data processing pipeline. You don’t need a message queue for ETL and you don’t need an application-layer cache for database queries. More often than not, these are workarounds for underlying database issues (e.g. latency, poor indexing strategy) that create more issues down the line. In an ideal scenario, you want all data contained within a single database and all data loading operations abstracted into atomic transactions. My goal was not to repeat these mistakes.

Our goals

As you have already guessed, our PostgreSQL database became the central piece of the business (aptly called ‘mother’, although my co-founder insists that me calling various infrastructure components ‘mother’, ‘mothership’, ‘motherland’, etc. is worrying). We don’t have a standalone message queue service, cache service or replicas for data warehousing. Instead of maintaining supporting infrastructure, I have dedicated my efforts to eliminating bottlenecks by minimizing latency, provisioning the most suitable hardware, and carefully planning the database schema. What we have is an easy-to-scale infrastructure with a single database and many data processing agents. I love the simplicity of it — if something breaks, we can pinpoint and fix the issue within minutes. However, many mistakes were made along the way — this article summarizes some of them…”

https://medium.com/@gajus/lessons-learned-scaling-postgresql-database-to-1-2bn-records-month-edc5449b3067
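
To make the author’s “single database, atomic loads” approach concrete, here is a minimal sketch of what one data-loading agent might look like. This is my illustration, not code from the article; the table name, columns and connection string are invented, and psycopg2 is just one reasonable driver choice.

    # Minimal sketch (not from the article): a loader "agent" that ingests a
    # batch into the main database inside a single atomic transaction.
    # The "events" table and the DSN are invented for illustration.
    import psycopg2
    from psycopg2.extras import execute_values

    def load_batch(dsn, rows):
        """Insert a batch of (source_id, payload) rows atomically.

        Either every row in the batch becomes visible, or none of them do,
        so readers never observe a half-loaded batch.
        """
        conn = psycopg2.connect(dsn)
        try:
            with conn:  # commits on success, rolls back on any exception
                with conn.cursor() as cur:
                    execute_values(
                        cur,
                        "INSERT INTO events (source_id, payload) VALUES %s",
                        rows,
                    )
        finally:
            conn.close()

    # load_batch("dbname=mother user=loader", [(1, '{"k": "v"}'), (2, '{}')])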

Why Uber Engineering Switched from Postgres to MySQL

The early architecture of Uber consisted of a monolithic backend application written in Python that used Postgres for data persistence. Since that time, the architecture of Uber has changed significantly, to a model of microservices and new data platforms. Specifically, in many of the cases where we previously used Postgres, we now use Schemaless, a novel database sharding layer built on top of MySQL. In this article, we’ll explore some of the drawbacks we found with Postgres and explain the decision to build Schemaless and other backend services on top of MySQL.

The Architecture of Postgres

We encountered many Postgres limitations:

  • Inefficient architecture for writes
  • Inefficient data replication
  • Issues with table corruption
  • Poor replica MVCC support
  • Difficulty upgrading to newer releases

We’ll look at all of these limitations through an analysis of Postgres’s representation of table and index data on disk, especially when compared to the way MySQL represents the same data with its InnoDB storage engine. Note that the analysis that we present here is primarily based on our experience with the somewhat old Postgres 9.2 release series. To our knowledge, the internal architecture that we discuss in this article has not changed significantly in newer Postgres releases, and the basic design of the on-disk representation in 9.2 hasn’t changed significantly since at least the Postgres 8.3 release (now nearly 10 years old).

https://eng.uber.com/mysql-migration/

Why we lost Uber as a user…

On 07/26/2016 01:53 PM, Josh Berkus wrote:
> The write amplification issue, and its corollary in VACUUM, certainly
> continues to plague some users, and doesn’t have any easy solutions.

To explain this in concrete terms, which the blog post does not:

1. Create a small table, but one with enough rows that indexes make
sense (say 50,000 rows).

2. Make this table used in JOINs all over your database.

3. To support these JOINs, index most of the columns in the small table.

4. Now, update that small table 500 times per second.

That’s a recipe for runaway table bloat; VACUUM can’t do much because
there’s always some minutes-old transaction hanging around (and SNAPSHOT
TOO OLD doesn’t really help, we’re talking about minutes here), and
because of all of the indexes HOT isn’t effective. Removing the indexes
is equally painful because it means less efficient JOINs.

The Uber guy is right that InnoDB handles this better as long as you
don’t touch the primary key (primary key updates in InnoDB are really bad).

This is a common problem case we don’t have an answer for yet.



Josh Berkus
Red Hat OSAS
(any opinions are my own)

https://www.postgresql.org/message-id/5797D5A1.5030009%40agliodbs.com
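
Berkus’s four steps translate almost directly into code. The sketch below is my own illustration with invented table and column names: a small, heavily indexed lookup table being updated continuously. Because every updated row is covered by indexes, HOT updates don’t apply, and dead tuples accumulate faster than VACUUM can reclaim them while older transactions are still running.

    # Sketch of the bloat scenario described in the email above (my own
    # illustration; the "drivers" table and its columns are invented).
    import random
    import psycopg2

    conn = psycopg2.connect("dbname=bloat_demo")
    conn.autocommit = True
    cur = conn.cursor()

    # Step 1: a small table, but with enough rows that indexes make sense.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS drivers (
            id     int PRIMARY KEY,
            city   text,
            status text,
            rating numeric
        )
    """)
    cur.execute("""
        INSERT INTO drivers
        SELECT i, 'city_' || (i % 100), 'idle', 4.5
        FROM generate_series(1, 50000) AS i
        ON CONFLICT (id) DO NOTHING
    """)

    # Steps 2-3: the table is JOINed all over the database, so most of its
    # columns end up indexed.
    cur.execute("CREATE INDEX IF NOT EXISTS drivers_city_idx   ON drivers (city)")
    cur.execute("CREATE INDEX IF NOT EXISTS drivers_status_idx ON drivers (status)")
    cur.execute("CREATE INDEX IF NOT EXISTS drivers_rating_idx ON drivers (rating)")

    # Step 4: hammer the table with updates. Each UPDATE touches an indexed
    # column, so HOT cannot be used: every update writes new index entries
    # and leaves a dead heap tuple behind.
    for _ in range(10000):
        cur.execute(
            "UPDATE drivers SET status = %s WHERE id = %s",
            (random.choice(["idle", "busy"]), random.randint(1, 50000)),
        )

    # Watch dead tuples pile up faster than autovacuum can clean them while
    # minutes-old transactions hold cleanup back.
    cur.execute("""
        SELECT n_live_tup, n_dead_tup, last_autovacuum
        FROM pg_stat_user_tables
        WHERE relname = 'drivers'
    """)
    print(cur.fetchone())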

PostgreSQL Exercises

Welcome to PostgreSQL Exercises! This site was born when I noticed that there’s a load of material out there to help people learn about SQL, but not a great deal to make it easy to learn by doing. PGExercises provides a series of questions and explanations built on a single, simple dataset. It’s designed for use as a partner to a good book or Postgres’ excellent documentation.

The exercises on this site range from simple select and where clauses, through joins and case statements, and on to aggregations, window functions, and recursive queries. Most people who aren’t already pros should find something to test themselves with.

For an introduction to the dataset, go to Getting Started, then select an exercise category from the menu and go!

https://pgexercises.com/
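
As a flavour of the later exercise categories, here is the kind of query the site works up to: a window function layered over an aggregation. It is written against a hypothetical bookings table, not PGExercises’ actual schema, so treat the names as placeholders.

    # Illustrative only: rank members by booked slots within each facility,
    # combining GROUP BY aggregation with a window function. The "bookings"
    # table and its columns are invented placeholders.
    import psycopg2

    conn = psycopg2.connect("dbname=exercises")
    with conn, conn.cursor() as cur:
        cur.execute("""
            SELECT facility_id,
                   member_id,
                   SUM(slots) AS total_slots,
                   RANK() OVER (PARTITION BY facility_id
                                ORDER BY SUM(slots) DESC) AS booking_rank
            FROM bookings
            GROUP BY facility_id, member_id
            ORDER BY facility_id, booking_rank
        """)
        for row in cur.fetchall():
            print(row)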

The beets blog: we’re pretty happy with SQLite & not urgently interested in a fancier DBMS

Every once in a while, someone suggests that beets should use a “real database.” I think this means storing music metadata in PostgreSQL or MySQL as an alternative to our current SQLite database. The idea is that a more complicated DBMS should be faster, especially for huge music libraries.

The pseudo-official position of the beets project is that supporting a new DBMS is probably not worth your time. If you’re interested in performance, please consider helping to optimize our database queries instead.

There are three reasons I’m unenthusiastic about alternative DBMSes: I’m skeptical that they will actually help performance; it’s a clear case of premature optimization; and SQLite is unbeatably convenient.

http://beets.io/blog/sqlite-performance.html

How To Set Up Django with Postgres, Nginx, and Gunicorn on Ubuntu 16.04

Django is a powerful web framework that can help you get your Python application or website off the ground. Django includes a simplified development server for testing your code locally, but for anything even slightly production related, a more secure and powerful web server is required.

In this guide, we will demonstrate how to install and configure some components on Ubuntu 16.04 to support and serve Django applications. We will be setting up a PostgreSQL database instead of using the default SQLite database. We will configure the Gunicorn application server to interface with our applications. We will then set up Nginx to reverse proxy to Gunicorn, giving us access to its security and performance features to serve our apps.

https://www.digitalocean.com/community/tutorials/how-to-set-up-django-with-postgres-nginx-and-gunicorn-on-ubuntu-16-04
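
The PostgreSQL half of that guide essentially comes down to pointing Django at the new database. A representative settings.py fragment might look like the following; the database name, user and password are placeholders rather than the tutorial’s exact values, and the Gunicorn and Nginx pieces live in their own config files covered by the guide.

    # settings.py fragment: use PostgreSQL instead of the default SQLite
    # database. Name, user and password below are placeholders.
    DATABASES = {
        'default': {
            # On older Django releases this backend is named
            # 'django.db.backends.postgresql_psycopg2'.
            'ENGINE': 'django.db.backends.postgresql',
            'NAME': 'myproject',
            'USER': 'myprojectuser',
            'PASSWORD': 'change-me',
            'HOST': 'localhost',
            'PORT': '',
        }
    }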

PostgreSQL Indexes: First Principles

We have all heard about indexes. Yeah, that thing that’s automatically added to the primary key column and enables fast data retrieval and stuff. Sure, but have you ever asked yourself whether there are multiple types or implementations of indexes? Or which types of indexes your favourite RDBMS implements? In this blog post, we will take a step back to the beginning, exploring what indexes are, what their role is, the types of indexes, metrics and so on. And all of this in PostgreSQL…

http://eftimov.net/postgresql-indexes-first-principles
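
As a quick taste of the “multiple types of indexes” question, the sketch below (my own illustration, with an invented table) asks the server which index access methods it ships and then creates a few different kinds of index.

    # Illustration only: PostgreSQL ships several index access methods
    # besides the default B-tree. The "docs" table is invented.
    import psycopg2

    conn = psycopg2.connect("dbname=mydb")
    conn.autocommit = True
    cur = conn.cursor()

    # List the available index access methods (btree, hash, gist, gin,
    # spgist, brin, ...). Requires PostgreSQL 9.6+ for pg_am.amtype.
    cur.execute("SELECT amname FROM pg_am WHERE amtype = 'i'")
    print([name for (name,) in cur.fetchall()])

    cur.execute("""
        CREATE TABLE IF NOT EXISTS docs (
            id         serial PRIMARY KEY,
            owner_id   int,
            created_at timestamptz,
            body       jsonb,
            active     boolean
        )
    """)

    # Different index types (and shapes) suit different queries.
    # B-tree: equality and range lookups on a scalar column.
    cur.execute("CREATE INDEX IF NOT EXISTS docs_created_idx ON docs (created_at)")
    # GIN: containment queries on jsonb documents.
    cur.execute("CREATE INDEX IF NOT EXISTS docs_body_idx ON docs USING gin (body jsonb_path_ops)")
    # Partial index: only rows where active is true are indexed.
    cur.execute("CREATE INDEX IF NOT EXISTS docs_active_idx ON docs (owner_id) WHERE active")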

Records: SQL for Humans™

“Records is a very simple, but powerful, library for making raw SQL queries to most relational databases.

Just write SQL. No bells, no whistles. This common task can be surprisingly difficult with the standard tools available. This library strives to make this workflow as simple as possible, while providing an elegant interface to work with your query results.

Database support includes Postgres, MySQL, SQLite, Oracle, and MS-SQL (drivers not included)…”

https://github.com/kennethreitz/records
https://pypi.python.org/pypi/records/
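
A minimal usage sketch of what “just write SQL” looks like with Records; the connection URL and the users table are invented placeholders.

    # Illustrative Records usage; the connection URL and "users" table are
    # placeholders.
    import records

    db = records.Database('postgresql://user:pass@localhost/mydb')

    rows = db.query('SELECT name, email FROM users WHERE active = :active',
                    active=True)

    for row in rows:
        # Rows support attribute-style access to columns.
        print(row.name, row.email)

    # Result sets can also be exported in bulk, e.g. as CSV or JSON.
    print(rows.export('csv'))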

Introducing HypoPG, hypothetical indexes for PostgreSQL

“DALIBO is proud to present the first release of HypoPG, an extension that adds hypothetical indexes in PostgreSQL.

A hypothetical index is an index which doesn’t exist on disk. It’s therefore almost instant to create and doesn’t add any I/O cost, whether at creation time or at maintenance time. The goal is obviously to check whether an index is useful before spending much time, I/O and disk space to create it.

With this extension, you can create hypothetical indexes, and then check with EXPLAIN whether PostgreSQL would use them or not…”

http://www.postgresql.org/about/news/1593/
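
Driven from Python, the workflow the announcement describes might look like the sketch below. The orders table is an invented placeholder; hypopg_create_index() and the fact that plain EXPLAIN (without ANALYZE) considers hypothetical indexes are the extension’s documented behaviour.

    # Sketch of the HypoPG workflow; the "orders" table is invented.
    import psycopg2

    conn = psycopg2.connect("dbname=mydb")
    conn.autocommit = True
    cur = conn.cursor()

    cur.execute("CREATE EXTENSION IF NOT EXISTS hypopg")
    cur.execute("""
        CREATE TABLE IF NOT EXISTS orders (
            id          serial PRIMARY KEY,
            customer_id int
        )
    """)

    # Create a hypothetical index: instant, and nothing is written to disk.
    cur.execute(
        "SELECT * FROM hypopg_create_index("
        "'CREATE INDEX ON orders (customer_id)')"
    )
    print(cur.fetchone())  # (index oid, generated index name)

    # Plain EXPLAIN (not EXPLAIN ANALYZE) takes hypothetical indexes into
    # account, so we can see whether the planner would use it.
    cur.execute("EXPLAIN SELECT * FROM orders WHERE customer_id = 42")
    for (line,) in cur.fetchall():
        print(line)

    # Hypothetical indexes disappear at the end of the session, or sooner:
    cur.execute("SELECT hypopg_reset()")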