Building Resilient Rummy Games using Amazon Aurora
Introduction
At Games24x7, we are dedicated to enhancing player experience and ensuring reliable gameplay. In line with this commitment, we have focused on optimizing our database infrastructure. Previously, our Rummy gameplay services shared a database with other, non-Rummy services (effectively a monolithic database), which occasionally led to availability issues due to cascading effects and unpredictable load generated by those services. To provide a more seamless and uninterrupted gaming experience, we have migrated our Rummy gameplay data to a dedicated, Rummy-specific database. This change ensures that our players enjoy a more stable and responsive gaming experience, free from the impact of other services.
This blog outlines the details of this initiative, the data migration from a shared, self-managed MySQL database to Amazon Aurora, and the challenges we encountered during the migration process.
Current Infrastructure
Below are the issues that can result in a database outage when working with a shared database.
The problem with this self-managed, shared database
Storage Scaling: Because the database is shared, the underlying EBS volumes are under constant stress from various services. Any storage throughput bottleneck means a cascading failure and, thus, a bad user experience. We needed to actively monitor the volumes, set alarms, and prepare runbooks for various situations. This performance and maintenance overhead strongly pushed us toward a managed solution like Amazon Aurora.
Cascading Effect: Changes made by other teams can cause downtime, directly impacting the rummy gameplay experience for players, often leading to negative reviews and decreased player engagement.
Increased Load: Non-rummy services can add significant load, leading to latency issues and a degraded gameplay experience, especially during peak times.
Vertical Scaling Limit: A shared DB made scaling difficult, as reaching a certain threshold in user transactions required disproportionate resource allocation. This prompted a shift to a rummy-specific DB, ensuring scaling depends solely on gameplay load.
Maintenance Downtime: Updates, backups, and maintenance on the shared DB can cause up to several hours of downtime for rummy gameplay, impacting user satisfaction and retention.
Which Database to Adopt: Self-Managed MySQL, RDS, or Aurora?
It was clear that we had to keep a relational database, since the rummy service relies on it to generate unique IDs such as tournamentId, tableId, and gameId as long values, and we aimed to maintain the integrity of our application architecture. However, with self-managed MySQL we faced several issues: scaling required downtime, there was no failover mechanism if the writer instance failed, and there was no self-healing feature to ensure the database kept functioning properly.
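For context, here is a minimal sketch of how a relational table hands out monotonically increasing long IDs via AUTO_INCREMENT. The table and column names are illustrative, not the actual Games24x7 schema, and pymysql is assumed as the client library.

```python
# Minimal sketch: database-generated 64-bit IDs via AUTO_INCREMENT.
# Table/column names and connection details are hypothetical.
import pymysql

conn = pymysql.connect(host="mysql-host", user="app", password="***", database="rummy")
with conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS game (
            gameId    BIGINT NOT NULL AUTO_INCREMENT PRIMARY KEY,  -- 64-bit 'long' ID
            tableId   BIGINT NOT NULL,
            createdAt TIMESTAMP DEFAULT CURRENT_TIMESTAMP
        )
    """)
    cur.execute("INSERT INTO game (tableId) VALUES (%s)", (42,))
    conn.commit()
    print("new gameId:", cur.lastrowid)  # unique ID assigned by the database
```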
Given the limitations of self-managed MySQL, we decided to evaluate cloud-based solutions like Amazon RDS and Aurora for improved scalability, reliability, and minimal downtime.
Advantages of RDS and Aurora over self-managed MySQL
- Aurora automatically grows its storage up to 128 TiB, while RDS storage can also scale but with more constraints depending on the storage and instance types.
- We performed load tests on RDS and Aurora and observed that write latency on RDS was higher in Multi-AZ setups because RDS also writes synchronously to a standby DB instance in another Availability Zone, whereas Aurora performed better than MySQL even with innodb_flush_log_at_trx_commit set to 1.
- Aurora’s storage layer is decoupled from its compute layer, allowing quick recovery whenever an instance fails: a reader instance is promoted within minutes, restoring full database availability shortly after a crash. A failover can also be triggered deliberately, as shown in the sketch after this list.
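As an illustration of this failover behavior, here is a minimal sketch, assuming boto3 and hypothetical cluster and reader identifiers, of triggering a managed failover so that a chosen reader is promoted to writer; the same promotion happens automatically when the writer instance crashes.

```python
# Minimal sketch: trigger an Aurora failover so a chosen reader is promoted.
# Cluster and instance identifiers are hypothetical.
import boto3

rds = boto3.client("rds")
rds.failover_db_cluster(
    DBClusterIdentifier="rummy-gameplay-aurora",           # hypothetical cluster
    TargetDBInstanceIdentifier="rummy-gameplay-reader-1",  # reader to promote
)
```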
Comparative Load Test Results: MySQL vs. Amazon RDS vs. Amazon Aurora
We conducted a comprehensive load test to evaluate the performance of self-managed MySQL, RDS, and Aurora by tuning two key DB parameters and analyzing latency across different combinations of their values. Let’s first understand these two parameters; a configuration sketch follows the list below.
innodb_flush_log_at_trx_commit:
- 0: Logs are flushed to disk once per second, improving performance but risking data loss if the system crashes.
- 1: Logs are flushed to disk at each transaction commit, ensuring maximum durability but potentially affecting performance.
- 2: Logs are written at each commit but flushed to disk once per second, balancing performance and durability.
sync_binlog:
- 0: Binary logs are not synchronized after each write, enhancing performance but risking data loss.
- 1: Binary logs are synchronized after each write, providing high durability with a potential performance trade-off.
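For reference, this is a minimal sketch, assuming boto3 and a hypothetical parameter group name, of how these two parameters can be applied to an RDS for MySQL instance; on a self-managed EC2 instance the equivalent is SET GLOBAL from a MySQL session, and on Aurora the corresponding settings live in the cluster parameter group.

```python
# Minimal sketch: apply the two durability-related parameters to an RDS
# DB parameter group. The parameter group name is hypothetical.
import boto3

rds = boto3.client("rds")
rds.modify_db_parameter_group(
    DBParameterGroupName="rummy-gameplay-mysql",  # hypothetical group name
    Parameters=[
        # 1 = flush the redo log at every commit (maximum durability)
        {"ParameterName": "innodb_flush_log_at_trx_commit",
         "ParameterValue": "1", "ApplyMethod": "immediate"},
        # 1 = fsync the binary log at every commit
        {"ParameterName": "sync_binlog",
         "ParameterValue": "1", "ApplyMethod": "immediate"},
    ],
)
```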
We evaluated the mean p99 latency achieved under each combination of these parameters; the observations per scenario are summarized below.
For scenario 1, self-managed MySQL on EC2 performed better than RDS. Since Aurora always operated in Multi-AZ with sync_binlog=1, there are no Aurora test records for scenarios 2 and 3.
For scenario 2, EC2 and RDS performed about the same. In this case we enabled sync_binlog, which is why latency increased significantly.
For scenario 3, EC2 performed better than RDS. In this case we disabled sync_binlog but configured RDS in Multi-AZ; the extra latency for RDS came from the synchronous Multi-AZ replication.
For scenario 4, where innodb_flush_log_at_trx_commit was set to 2, Aurora performed better than EC2 and RDS for both database versions tested.
For scenarios 5 and 6, EC2 and RDS latency was very high, and we would have to think twice before changing those parameters in production. Our current main DB follows the scenario 1 configuration, which delivers faster performance at the cost of losing up to one second of records in case of a system crash. Looking ahead, we may have to move to the scenario 6 configuration, since it ensures durability.
After analyzing the load test results, we found that Aurora outperformed the others in terms of latency while preserving database durability with innodb_flush_log_at_trx_commit=1.
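The post does not name the load-testing tool, so the following is only a minimal sketch, assuming pymysql, numpy, and a throwaway probe table, of how per-commit latency (and hence p99) can be sampled under a given parameter setting.

```python
# Minimal sketch: sample commit latency to compare parameter scenarios.
# Endpoint, credentials, and the probe table are hypothetical.
import time
import numpy as np
import pymysql

conn = pymysql.connect(host="db-endpoint", user="loadtest",
                       password="***", database="rummy")
latencies_ms = []
with conn.cursor() as cur:
    cur.execute("CREATE TABLE IF NOT EXISTS latency_probe "
                "(id BIGINT AUTO_INCREMENT PRIMARY KEY, payload INT)")
    for i in range(10_000):
        start = time.perf_counter()
        cur.execute("INSERT INTO latency_probe (payload) VALUES (%s)", (i,))
        conn.commit()  # commit cost reflects trx_commit / sync_binlog settings
        latencies_ms.append((time.perf_counter() - start) * 1000)

print("p99 commit latency (ms):", round(np.percentile(latencies_ms, 99), 2))
```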
Scaling Up: Insights from 5x Load Testing in Aurora
- We conducted load tests on Aurora, simulating up to five times the current production load to ensure the system has ample capacity for future growth. This extensive testing was crucial for validating the robustness and scalability of our infrastructure.
- In one significant test, with a load four times greater than the current production level, Aurora successfully handled 140,000 queries per second on a db.r6g.8xlarge instance type. This performance was achieved with no major impact on latency, confirming that our setup is well-equipped to manage high traffic volumes effectively (a sketch of how such throughput can be monitored follows this list).
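The sketch below shows one way such throughput can be verified, assuming boto3 and the Amazon RDS/Aurora "Queries" CloudWatch metric; the instance identifier is hypothetical, and the post does not describe the exact monitoring setup used.

```python
# Minimal sketch: read queries-per-second from CloudWatch during a load test.
# The DB instance identifier is hypothetical.
import boto3
from datetime import datetime, timedelta

cw = boto3.client("cloudwatch")
resp = cw.get_metric_statistics(
    Namespace="AWS/RDS",
    MetricName="Queries",  # average queries executed per second
    Dimensions=[{"Name": "DBInstanceIdentifier", "Value": "rummy-gameplay-writer"}],
    StartTime=datetime.utcnow() - timedelta(hours=1),
    EndTime=datetime.utcnow(),
    Period=60,
    Statistics=["Average", "Maximum"],
)
for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], "avg QPS:", round(point["Average"]))
```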
Migration Plan and Prerequisites
We have migrated both variants of the game from MySQL 5.x to Aurora MySQL 3.x (compatible with MySQL 8.x). The two rummy variants are listed below:
- Game Variant V2
- Game Variant V3
Migration of the V2 & V3 variants from self-managed MySQL to Aurora
Rummy variants V2 and V3 point to the same shared database, and we wanted to migrate them independently to avoid any hiccups. New records generated by V2 and V3 are stored in the same self-managed database, which is also used by other services that are not directly responsible for rummy gameplay.
The infrastructure setup below shows the state before the migration: traffic arrives from V2 and V3 and, at the same time, from non-rummy services. The interesting part is that the V2 and V3 variants also share some common tables within the same database; for some use cases, records generated by V2 and V3 are written to the same table.
Overcoming Migration Hurdles: How to Ensure V3 Runs Smoothly After V2 Migration
In our migration strategy, the challenge was to migrate both variants independently. Simply pointing V2 at Aurora while V3 still wrote to the older DB could lead to ID conflicts: replication was enabled from MySQL to AuroraDB for V3, and AuroraDB was shared by both V2 and V3. Since V2 and V3 share tables, we had to think carefully about auto-increment ID conflicts. New V2 records would be created directly in AuroraDB, while new V3 records would be created in MySQL and replicated to AuroraDB through the change log; the same ID could therefore be assigned twice, which could break replication and cause data loss. We made a pivotal decision to increase the auto-increment IDs in Aurora for the tables shared between V2 and V3. This adjustment is not merely a technical tweak but a strategic move designed to safeguard data integrity and maintain operational continuity.
Why the increment, you might ask? The core of our migration plan involves moving the V2 variant to Aurora while keeping the V3 variant pointing to the shared database (mainDB). By increasing the auto-increment ID in Aurora, we aim to prevent potential conflicts that could arise during this transition.
Here’s the reasoning: since V3 continues to interact with the shared mainDB, records created in V3 will be replicated to the new Aurora setup via binary logs. Concurrently, as V2 transitions to Aurora, there is a risk of ID clashes if new records generated in the mainDB (with IDs such as id1) are replicated into Aurora. If Aurora has already assigned the same ID (e.g., id1) to a V2 record, the replicated row would fail to be stored, disrupting the replication process.
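The following is a minimal sketch of this adjustment, assuming pymysql; the shared table name, endpoint, and the size of the safety gap are illustrative, not the production values.

```python
# Minimal sketch: raise AUTO_INCREMENT on a shared V2/V3 table in Aurora so
# IDs written directly by V2 cannot collide with IDs still generated in MySQL
# by V3 and replicated over. Table name, endpoint, and offset are hypothetical.
import pymysql

aurora = pymysql.connect(host="aurora-writer-endpoint", user="migration",
                         password="***", database="rummy")
OFFSET = 1_000_000_000  # hypothetical safety gap above the current maximum ID

with aurora.cursor() as cur:
    cur.execute("SELECT MAX(id) FROM shared_game_table")  # current high-water mark
    current_max = cur.fetchone()[0] or 0
    # All new V2 rows in Aurora start well above anything V3 can still produce.
    cur.execute(f"ALTER TABLE shared_game_table AUTO_INCREMENT = {current_max + OFFSET}")
```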
Post-Migration Setup: V2 Variant on Aurora
Ensuring Smooth Transition: Final Setup Steps After Migrating V3 to Aurora
Once V2 had been migrated to Aurora, we switched traffic for the V3 variant from MySQL to Aurora as well, so both V2 and V3 now send traffic to AuroraDB. We had also informed the services that read this data ahead of the V2 and V3 migrations, so that they read from AuroraDB.
Ensuring Rollback Strategy: AWS DMS as Saviour
We used AWS DMS to replicate incremental changes from Aurora back to MySQL: DMS is configured with a source and a target and applies change records from the source (Aurora) to the target (MySQL 5.7). For V2, we did not use DMS; instead, we recorded the binlog position just before the V2 migration and kept the last ID of the tables specific to V2, along with the first auto-increment ID of the shared tables in Aurora. To roll back V2, we would use these IDs to dump the incremental changes back into MySQL 5.7. After migrating V3, we used AWS DMS to sync new changes from Aurora to the new intermediate MySQL.
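The bookkeeping for the V2 rollback path might look like the following minimal sketch, assuming pymysql; the hosts, table names, and checkpoint file are hypothetical, and the actual DMS task configuration is not shown.

```python
# Minimal sketch: record the rollback checkpoint taken just before the V2
# cutover — binlog position on MySQL 5.7, last ID of a V2-specific table,
# and the first auto-increment value of a shared table in Aurora.
import json
import pymysql

mysql57 = pymysql.connect(host="mysql-5.7-host", user="migration",
                          password="***", database="rummy")
aurora = pymysql.connect(host="aurora-writer-endpoint", user="migration",
                         password="***", database="rummy")

with mysql57.cursor() as cur:
    cur.execute("SHOW MASTER STATUS")  # binlog file and position at cutover
    binlog_file, binlog_pos = cur.fetchone()[:2]
    cur.execute("SELECT MAX(id) FROM v2_only_table")  # last V2-specific ID
    last_v2_id = cur.fetchone()[0]

with aurora.cursor() as cur:
    cur.execute("SELECT AUTO_INCREMENT FROM information_schema.TABLES "
                "WHERE TABLE_SCHEMA = 'rummy' AND TABLE_NAME = 'shared_game_table'")
    first_shared_aurora_id = cur.fetchone()[0]

with open("v2_rollback_checkpoint.json", "w") as f:
    json.dump({"binlog_file": binlog_file, "binlog_pos": binlog_pos,
               "last_v2_id": last_v2_id,
               "first_shared_aurora_id": first_shared_aurora_id}, f, indent=2)
```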
About the Author
Rohit Kesarwani is a senior developer specializing in crafting innovative solutions for complex challenges, with a keen instinct for enhancing the performance of existing systems. Leveraging this expertise and a collaborative approach, Rohit has successfully delivered numerous features for our Rummy gameplay.