Changing Quay.io’s MySQL database to Postgres and surviving
In this blog, we look at how we changed Quay.io's database from MySQL to Postgres, and how the service survived the switch.
Context
When it launched in 2013, Quay.io became the first private container registry online. Because we were still a small startup, our architecture at the time favored simplicity of operations over large scale. Over the ensuing years, Quay.io was folded into CoreOS and then Red Hat. As our team and user base have grown, Quay.io now serves tens of petabytes of container images every month and serves a large number of enterprise customers, including Red Hat's entire container catalog.
Even though the service has grown in importance and traffic volume, our architecture has stayed mostly true to the original plan. Although we have moved from standalone virtual machines (VMs) to a managed Red Hat OpenShift cluster, the original Relational Database Service (RDS) database has remained in operation since our initial go-live.
A vital component of Quay.io is its database. Our primary responsibility as a container registry is to securely store customer images and make them accessible when needed. From an architectural standpoint, this means storing image layer blobs in S3 and using our database to keep track of which blobs are associated with which customer image. Losing that database would be a catastrophic event.
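As a rough illustration of that split, here is a minimal sketch of how a registry can keep blob bytes in S3 while the database only records which blobs belong to which image. This is not Quay's actual schema; the table and column names are invented for the example.

```python
import sqlite3

# Toy schema, not Quay's real one: the database stores only metadata and
# S3 object keys; the blob bytes themselves live in object storage.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE blob (
    id INTEGER PRIMARY KEY,
    digest TEXT UNIQUE NOT NULL,      -- content-addressable layer digest
    s3_key TEXT NOT NULL,             -- where the bytes actually live
    size_bytes INTEGER NOT NULL
);
CREATE TABLE image (
    id INTEGER PRIMARY KEY,
    repository TEXT NOT NULL,
    tag TEXT NOT NULL
);
CREATE TABLE image_blob (             -- which blobs make up which image
    image_id INTEGER REFERENCES image(id),
    blob_id INTEGER REFERENCES blob(id),
    PRIMARY KEY (image_id, blob_id)
);
""")

# Serving a pull then becomes: resolve the tag, look up its blob digests,
# and hand the client the corresponding S3 objects.
def blobs_for_image(repository: str, tag: str):
    return conn.execute("""
        SELECT b.digest, b.s3_key FROM image i
        JOIN image_blob ib ON ib.image_id = i.id
        JOIN blob b ON b.id = ib.blob_id
        WHERE i.repository = ? AND i.tag = ?
    """, (repository, tag)).fetchall()
```

The database holds the only map from customer images to S3 objects, which is exactly why losing it is the scenario we cannot tolerate.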
We were also running MySQL 5.7, which is aging and soon to reach its end of life. As traffic volumes increased, we kept vertically scaling our database until we reached the largest capacity available on AWS. Our database had grown into a significant architectural vulnerability, so we set out to replace it with a more contemporary system, one that would give us far more options for the architecture and durability of our important data.
After considering our options, we decided to switch the databases over to Aurora Postgres. Since Postgres is used by almost all Red Hat customers running Quay on-premises, we are familiar with and have tested this database extensively. Since Quay.io is hosted on AWS, we decided to stick with the AWS platform and went with Aurora. We were excited to investigate the many architectural and operational advantages that Aurora offers over a dedicated RDS instance.
The state of Quay.io before the migration
By the middle of 2023, it was evident that this migration would eventually become essential. Quay's code has always been rather connection-hungry, but this hasn't usually been a problem. Quay.io has an intentionally minimalistic design: a series of stateless OpenShift pods connected to a database (or a Redis cache). Because the majority of our traffic is reads, we've optimized to place as few hops as possible between an image pull event and the database. Inside each Quay pod runs an extensive collection of worker processes managing the registry itself, the web user interface, and asynchronous functions like garbage collection and coordinating security scans with Clair. This design is carried over from our original pre-OpenShift architecture.
On a typical day, Quay.io would carry up to 9K database connections and, depending on traffic volumes, this could climb to 10K at times.
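To see how a fleet of stateless pods adds up to numbers like that, here is a back-of-the-envelope sketch. The pod, worker, and per-worker connection counts below are made up for illustration; they are not Quay.io's real figures.

```python
# Hypothetical figures, purely illustrative: each pod runs many worker
# processes, and each worker holds its own database connection(s).
pods = 60
workers_per_pod = 30          # registry, web UI, GC, security-scan workers...
connections_per_worker = 5

steady_state = pods * workers_per_pod * connections_per_worker
print(f"steady-state connections: {steady_state}")   # 9000 with these numbers

# A traffic spike that adds pods, or makes workers open extra connections,
# moves the total quickly, which is why the connection ceiling matters.
```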
As more people have begun using Quay.io over the past few years, traffic on the site has gradually increased. Red Hat now uses Quay.io to serve all of the images in its product catalog, which has led to an ever-growing number of steady-state connections. Everything kept running smoothly on the massive MySQL instance we were using. Until it didn't, that is. During a prolonged period of high traffic, or when our read IOPS climbed too far, the worker processes inside the Quay pods would struggle to establish connections. They would then restart and request even more connections, until the influx of requests grew so large that our RDS instance would fall over.
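That failure mode compounds on itself: a worker that cannot get a connection restarts and asks again, so demand grows just as the database is least able to serve it. Here is a toy model of that feedback loop, with invented numbers; it is not a faithful simulation of RDS behavior.

```python
# Toy model of a connection storm: when demand exceeds what the database
# can grant, the unserved workers restart and come back asking again.
capacity = 10_000          # connections the instance can realistically hold
demand = 9_000             # steady-state demand before the traffic spike
spike = 1_500              # extra connections requested during the spike

demand += spike
for step in range(5):
    granted = min(demand, capacity)
    refused = demand - granted
    # Refused workers restart and retry while the steady load stays put,
    # so next round's demand is even higher than this one's.
    demand = granted + 2 * refused
    print(f"step {step}: granted={granted}, refused={refused}, next demand={demand}")
```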
The connection graph from the November 8, 2023 outage is shown here. Everything is going smoothly until a small traffic spike triggers a connection storm and, eventually, an outage. Obviously, this needed to be fixed.
Database migration
We chose AWS Data Migration Service (DMS) to manage the actual database migration. DMS would take care of many of the minor details involved in switching from MySQL to Postgres. Over the course of a few days, we would use DMS to populate a brand-new Postgres database and then keep it in sync with MySQL until the scheduled cutover.
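For reference, the shape of such a DMS setup looks roughly like the boto3 sketch below. The endpoint and replication instance ARNs are placeholders, and this is an illustrative use of the DMS API rather than the exact task definition we ran.

```python
import json
import boto3

dms = boto3.client("dms", region_name="us-east-1")

# Replicate every table in the source schema; real migrations typically
# add transformation rules for naming and type differences.
table_mappings = {
    "rules": [{
        "rule-type": "selection",
        "rule-id": "1",
        "rule-name": "include-all",
        "object-locator": {"schema-name": "%", "table-name": "%"},
        "rule-action": "include",
    }]
}

dms.create_replication_task(
    ReplicationTaskIdentifier="quay-mysql-to-postgres",
    SourceEndpointArn="arn:aws:dms:...:endpoint:mysql-source",          # placeholder
    TargetEndpointArn="arn:aws:dms:...:endpoint:aurora-pg-target",      # placeholder
    ReplicationInstanceArn="arn:aws:dms:...:rep:replication-instance",  # placeholder
    MigrationType="full-load-and-cdc",   # initial copy, then ongoing change capture
    TableMappings=json.dumps(table_mappings),
)
```

The "full-load-and-cdc" mode is what makes the multi-day approach work: the initial bulk copy is followed by continuous change capture that keeps Postgres current until cutover.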
To keep the databases consistent while we restarted our pods against the new database, the migration process would involve putting Quay.io into read-only mode. In the unlikely event that we had to revert for any reason, we could then re-enable writes and reverse the DMS synchronization to capture Postgres changes back into MySQL.
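Sketched as Python, the cutover sequence was essentially the following. Every helper here is a hypothetical stand-in that just names a runbook step; none of these are real Quay or AWS APIs.

```python
# Hypothetical helpers standing in for runbook steps; they only log the action.
def step(msg):
    print(f"[cutover] {msg}")

def set_registry_read_only(enabled: bool):
    step(f"registry read-only = {enabled}")

def wait_for_dms_to_drain(direction: str):
    step(f"waiting for {direction} DMS task to apply its backlog")

def repoint_database(target: str):
    step(f"updating Quay config to use {target} and rolling the pods")

def cutover():
    set_registry_read_only(True)            # stop new writes reaching MySQL
    wait_for_dms_to_drain("mysql->postgres")
    repoint_database("aurora-postgres")     # pods restart against Postgres
    set_registry_read_only(False)           # writes resume on the new database

def rollback():
    # Reverse path described above: a DMS task running in the opposite
    # direction carries Postgres changes back before repointing at MySQL.
    wait_for_dms_to_drain("postgres->mysql")
    repoint_database("mysql")

cutover()
```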