A guide to moving Quay.io's database from MySQL to Postgres
In this post, we share what we learned moving Quay.io's database from MySQL to Postgres.
On Sunday, November 12, 2023, our team completed the migration early in the morning, and everything went according to plan. Everything appeared to be fine when we brought up our main Quay.io Pods on Aurora Postgres. We ended the maintenance window after four hours, but our site reliability engineering (SRE) team continued to monitor the database load.
Shortly after noon, we were paged back into an incident with our SRE team: Quay.io's performance was degrading rapidly. We failed over to the secondary region and started investigating.
The problem was obvious as soon as we looked at the database connections: for whatever reason, Quay was never releasing them. We eventually hit the server's connection limit, which brought the database down.
Subsequent analysis revealed that, unlike our original MySQL RDS instance, the Aurora Postgres instance was not configured to automatically close idle connections after a set amount of time. We also concluded that the same underlying behavior, playing out more slowly, had been responsible for some of our pre-migration outages. As soon as we added a configuration setting to our Object Relational Mapping (ORM) framework to terminate idle connections, the Aurora connection count began to rise and fall steadily.
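Quay's data layer is built on the peewee ORM; the snippet below is a minimal sketch of the kind of pool setting involved, assuming peewee's pooled Postgres driver. The hostname, credentials, and values are illustrative rather than the exact change we shipped.

```python
from playhouse.pool import PooledPostgresqlDatabase

# Pooled database handle that recycles connections instead of holding them
# open forever (hypothetical endpoint and illustrative values).
db = PooledPostgresqlDatabase(
    "quay",
    host="aurora-postgres.example.internal",
    user="quay",
    password="...",
    max_connections=20,  # cap on connections held by this process
    stale_timeout=300,   # connections older than 5 minutes are recycled on reuse
)
```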
After this, we felt confident enough to switch all of our traffic back to the primary region and re-enable writes.
Moving data around is a risky endeavor.
When we got back to work on Monday morning, our support staff began escalating customer complaints that some image pulls were failing. As a registry service, we treat these escalations with the utmost importance, so our on-call team started investigating. Going through the manifests that would no longer pull, we found a pattern: they were all truncated.
Every container image has a manifest: a JSON structure that ties together the layers that make up the image and carries configuration information such as the image type. Images with many layers therefore have larger manifests. We ran several tests and confirmed that fresh pushes of large manifests worked correctly; the issue seemed limited to manifests that were already in our MySQL database before the migration.
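For context, an abridged OCI image manifest looks roughly like the following (digests and sizes are illustrative); every layer adds an entry to the layers array, which is why images with many layers have large manifests:

```json
{
  "schemaVersion": 2,
  "mediaType": "application/vnd.oci.image.manifest.v1+json",
  "config": {
    "mediaType": "application/vnd.oci.image.config.v1+json",
    "digest": "sha256:…",
    "size": 7023
  },
  "layers": [
    {
      "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
      "digest": "sha256:…",
      "size": 32654
    },
    {
      "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
      "digest": "sha256:…",
      "size": 16724
    }
  ]
}
```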
We quickly spun up our pre-migration MySQL snapshot and wrote a utility script to compare every manifest between the new Aurora Postgres instance and MySQL. With Quay.io currently serving over 60 million images, this took a while to complete. We soon found that the first version of the script, which walked the manifests sequentially, would be far too slow. After we modified it to run over ranges of manifests across multiple parallel instances, the script finished in a few hours.
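A simplified sketch of that approach is below; the table and column names, hosts, and credentials are illustrative rather than Quay's actual schema, and the real script also restored the truncated rows instead of just reporting them.

```python
"""Compare manifest bodies between MySQL (source) and Aurora Postgres (target).

Table/column names, hosts, and credentials are illustrative, not Quay's schema.
"""
import multiprocessing

import psycopg2  # Postgres driver
import pymysql   # MySQL driver

CHUNK = 100_000  # manifests per ID range handed to a worker


def compare_range(bounds):
    """Return manifest IDs whose bodies differ between the two databases."""
    low, high = bounds
    mysql = pymysql.connect(host="mysql-snapshot", user="quay", password="...", db="quay")
    pg = psycopg2.connect(host="aurora-postgres", user="quay", password="...", dbname="quay")
    mismatched = []
    with mysql.cursor() as mc, pg.cursor() as pc:
        mc.execute("SELECT id, manifest_bytes FROM manifest WHERE id BETWEEN %s AND %s",
                   (low, high))
        source = dict(mc.fetchall())
        pc.execute("SELECT id, manifest_bytes FROM manifest WHERE id BETWEEN %s AND %s",
                   (low, high))
        for row_id, body in pc.fetchall():
            if source.get(row_id) != body:  # truncated or missing on the target
                mismatched.append(row_id)
    mysql.close()
    pg.close()
    return mismatched


if __name__ == "__main__":
    max_id = 60_000_000  # roughly the number of manifests in production
    ranges = [(low, low + CHUNK - 1) for low in range(1, max_id + 1, CHUNK)]
    # Run many ranges in parallel; sequential execution was far too slow.
    with multiprocessing.Pool(processes=16) as pool:
        for bad_ids in pool.imap_unordered(compare_range, ranges):
            for row_id in bad_ids:
                print(row_id)  # candidates to restore from the MySQL snapshot
```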
While the truncated manifests were being restored, we started digging into how this had happened. Our migration verification scripts, which focused mainly on row counts and schema consistency, had missed it. A closer look at our AWS Database Migration Service (DMS) setup revealed that MySQL's "longtext" fields were being converted to the internal "nclob" type. We then noticed the following setting in the task configuration:
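The excerpt below is illustrative rather than our exact task settings; in DMS, LobMaxSize is expressed in kilobytes, so a value of 32 caps LOB columns at 32 KB.

```json
{
  "TargetMetadata": {
    "SupportLobs": true,
    "FullLobMode": false,
    "LimitedSizeLobMode": true,
    "LobMaxSize": 32
  }
}
```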
This meant that DMS was running in "Limited LOB" mode, so any "nclob" field was being truncated to 32K. We promptly scanned our MySQL schema for other places this problem could have surfaced and found that certain image labels and repository descriptions might also have been affected, so we developed additional utility scripts to handle those cases.
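A sketch of that kind of schema scan, again with illustrative connection details and the 32 KB cap from Limited LOB mode:

```python
"""Find MySQL text columns that Limited LOB mode could have truncated.

Connection details are illustrative; the 32 KB cap matches the LobMaxSize above.
"""
import pymysql

LOB_LIMIT = 32 * 1024  # Limited LOB cap, in bytes

conn = pymysql.connect(host="mysql-snapshot", user="quay", password="...", db="quay")
with conn.cursor() as cur:
    # Every TEXT-family column in the schema is a potential victim.
    cur.execute(
        """
        SELECT table_name, column_name
        FROM information_schema.columns
        WHERE table_schema = %s AND data_type IN ('text', 'mediumtext', 'longtext')
        """,
        ("quay",),
    )
    for table, column in cur.fetchall():
        # Count rows whose value would not have fit under the cap.
        cur.execute(f"SELECT COUNT(*) FROM `{table}` WHERE LENGTH(`{column}`) > %s",
                    (LOB_LIMIT,))
        (affected,) = cur.fetchone()
        if affected:
            print(f"{table}.{column}: {affected} rows exceed {LOB_LIMIT} bytes")
conn.close()
```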
Nothing is right anymore.
Over the following few weeks, customers continued to report occasional image pull failures. Our Postgres instance looked healthy overall, and we couldn't identify a consistent pattern in the failures, but we noticed that large spikes in traffic were still causing problems. Despite having an abundance of resources, the database simply seemed to get sluggish at times, and even our own internal push/pull monitors were occasionally raising alarms.
One of the challenges in troubleshooting problems like these is that they typically can't be reproduced in a load-testing or local environment. Since we knew our connection count was still too high, we started examining our production metrics. First, we compared the number of connections actually in use by the registry workers in each of the Pods against the number of connections available. This showed us that we were using only a small fraction of the thousands of connections available to us.
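We relied on our production metrics for this, but the same comparison can be sketched directly against Postgres itself (connection details are illustrative):

```python
"""Compare connections doing work against the connections the server allows."""
import psycopg2

conn = psycopg2.connect(host="aurora-postgres", user="quay", password="...", dbname="quay")
with conn.cursor() as cur:
    # Break connections down by state: 'active' ones are running queries,
    # 'idle' ones are held open by workers but doing nothing.
    cur.execute("SELECT state, COUNT(*) FROM pg_stat_activity GROUP BY state")
    for state, count in cur.fetchall():
        print(f"{state or 'unknown'}: {count}")

    # Total number of connections the server will accept.
    cur.execute("SHOW max_connections")
    print("max_connections:", cur.fetchone()[0])
conn.close()
```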
The next thing to figure out was how far we could shrink the connection pool itself. Because each worker instance gets its own connection pool, this was more complicated than changing a single number: set the limit too low and we risked another outage as requests backed up behind the workers. To begin, we isolated the connection behavior of a single worker process instance.
This was startling: it showed that even under load, the registry worker process (which handles pushes and pulls) used at most five of its ten available connections, and only for a few seconds at a time over the course of an hour.
We saw a similar trend when we examined the same graph for each registry worker process in each pod.
Even though every worker kept unused connections alive, none of them ever used more than a few connections at normal traffic volumes. It was also clear that the workload was not evenly distributed: some workers were using more connections than others. To gauge how well our load balancing was working overall, we counted the requests being sent to each of our pods.
At the OpenShift level, load was balanced evenly across all of our pods. Within each pod, however, requests were distributed less evenly. This implied that many of our worker processes were sitting idle while still holding database connections, and it meant we could reduce the total number of connections and workers within our Pods without compromising efficiency.
The web API workers and the registry worker processes had the most idle connections. To minimize any potential impact on image pulls, we decided to cut the number of web API workers in half as a first test.
After reducing the web API worker count, we saw an immediate drop in total connections. Better yet, speed and UI responsiveness improved noticeably.
That gave us the confidence to then reduce the number of registry workers from 64 per pod to 16 per pod.
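For self-managed Quay deployments, worker counts like these are tunable through Quay's documented WORKER_COUNT_* environment variables; the Deployment excerpt below is an illustrative sketch (the values shown, and whatever internal tooling Quay.io itself uses, are assumptions):

```yaml
# Illustrative excerpt of a Quay Deployment pod spec (not Quay.io's actual manifest).
env:
  - name: WORKER_COUNT_REGISTRY
    value: "16"   # reduced from 64 registry workers per pod
  - name: WORKER_COUNT_WEB
    value: "8"    # web API workers cut in half (illustrative starting count)
```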