Needless to say, this last week has been a trial that stress-tested our systems to their limits. As a growing organization, we’d like to embark on a journey of transparency with the valued members of our 3Commas family and share not only what brought us to this situation, but also what we’re doing to fix it and scale as an organization.
Let’s start with what brought us here.
Where we were – The Bottleneck
The good news: our business has scaled tremendously over the past three years, and we now run the largest VMs with the most powerful NVMe SSDs, including eight connected NVMe RAID arrays on AWS; we couldn’t be happier about our tremendous user growth!
On the flip side, this accelerated user growth has pushed us to a ceiling in processing capacity: when the crypto market is at its most volatile, our systems are driven to, or beyond, their limits. Imagine your feet growing uncontrollably while no one makes shoes in your size… that’s where we were with our servers.
The Savior enters… sort of.
In this situation, we needed to find a custom shoemaker. Ok, enough analogies… we really needed a savior to help us out of this situation through data optimization, so we could scale alongside the growing crypto market. While we couldn’t find servers that would come close to meeting our needs, we could restructure our data so the database would demand less processing power from our existing servers. Enter Percona.
Adding an expert third-party provider into the mix brings growing pains, but if any provider was a good fit for the task, we felt most confident in Percona. Needless to say, the process didn’t go as smoothly as we’d hoped, and communication with Percona lapsed for several extended periods, resulting in downtime for our users and frustration for our internal dev team. Enter the power of social media.
The situation was stressful, so our CTO took to Twitter with our grievance, and Percona rapidly escalated the issue and stepped up their game. Problem solved…? Not quite. But we’re getting there.
Immediately following the incident, the 3Commas team was able to cut the load on our primary array in half, providing immediate band-aid relief for the outage and giving us and Percona enough room to breathe while we implement the longer-term fixes for our scaling issues.
Here’s a roadmap of what’s been done, what’s in store, and how we are working to prevent outages while maintaining consistent operations for our users (be warned of imminent tech jargon):
- We’ve spun up an additional replica and spread the load across three of our DB servers;
- We’ve rewritten some database-intensive routines to use Redis instead;
- Kafka partitioning was reworked so we can handle huge market movements more gracefully;
- Rescheduled automatic cron jobs so they can run simultaneously without overlapping, and implemented a CI/CD-level check to prevent this from recurring in the future;
- Rebuilt our stats calculation so it now builds incrementally, day by day, rather than entirely in one process;
- Optimized high-load components;
- Adjusted our internal incident escalation flow for faster response times when issues arise;
- Built an extremely thorough real-time internal dashboard for our DevOps team to identify issues before they escalate;
- Prioritized tasks for our Data Science team to develop AI tools that analyze charts and predict suspicious market activity or anomalies before they impact our systems.
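To give a flavor of what "rewriting a database-intensive routine to use Redis" means, here is a minimal cache-aside sketch. This is illustrative, not our production code: the `cached` helper and the in-memory stand-in (which mimics a Redis client's `get`/`setex` shape so the example runs without a server) are hypothetical names.

```python
import json
import time

class InMemoryStore:
    """Stand-in for a Redis client (same get / setex-with-TTL shape),
    used here only so the sketch runs without a live server."""
    def __init__(self):
        self._data = {}

    def get(self, key):
        value, expires = self._data.get(key, (None, 0.0))
        return value if time.time() < expires else None

    def setex(self, key, ttl, value):
        self._data[key] = (value, time.time() + ttl)

def cached(store, key, ttl, compute):
    """Cache-aside: serve from the store when present; otherwise run the
    expensive DB routine once and cache its JSON-serialized result."""
    hit = store.get(key)
    if hit is not None:
        return json.loads(hit)
    result = compute()
    store.setex(key, ttl, json.dumps(result))
    return result
```

The payoff is that a routine that previously hammered the primary database on every request now hits it only once per TTL window.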
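On the Kafka partitioning rework: one common pattern for surviving huge market movements (a sketch of the general technique, not necessarily what we shipped) is to append a rotating salt to hot keys, so a single busy trading pair no longer pins all of its traffic to one partition. The function names and the fanout of 4 are illustrative.

```python
import hashlib
import itertools

def partition_for(key: str, num_partitions: int) -> int:
    """Stable hash-based partition assignment (md5 keeps it
    deterministic across processes and restarts)."""
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

_salt = itertools.count()

def salted_key(pair: str, fanout: int = 4) -> str:
    """Spread a hot trading pair over `fanout` sub-keys, round-robin,
    so one volatile market can't saturate a single partition."""
    return f"{pair}#{next(_salt) % fanout}"
```

The trade-off is that consumers must now aggregate across the sub-keys of a pair, in exchange for the load fanning out instead of spiking on one partition.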
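The cron-rescheduling CI check could look something like this sketch: a function a build step can call to fail the pipeline if two daily jobs are scheduled into the same minute slot. Job names and schedules here are invented for illustration.

```python
def find_collisions(jobs: dict) -> list:
    """jobs maps a job name to its (minute, hour) daily slot.
    Returns every pair of jobs scheduled into the same slot,
    so a CI step can fail the build before the overlap ships."""
    by_slot = {}
    for name, slot in jobs.items():
        by_slot.setdefault(slot, []).append(name)
    collisions = []
    for names in by_slot.values():
        for i in range(len(names)):
            for j in range(i + 1, len(names)):
                collisions.append((names[i], names[j]))
    return collisions
```

Wiring `assert not find_collisions(schedule)` into CI is what turns "we rescheduled the jobs once" into "the jobs can never silently collide again."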
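Finally, the incremental stats idea in miniature (a hedged sketch; `incremental_stats` and its inputs are hypothetical names, not our actual pipeline): instead of recomputing the whole history in one heavy pass, each unprocessed day's delta is folded into a running total, so the daily cost stays constant no matter how long the history grows.

```python
from datetime import date, timedelta

def incremental_stats(running_total, daily_deltas, last_processed, today):
    """Fold each unprocessed day's delta into the running total,
    instead of recomputing the entire history in one process.
    Returns the updated total and the new high-water-mark date."""
    day = last_processed + timedelta(days=1)
    while day <= today:
        running_total += daily_deltas.get(day, 0.0)
        day += timedelta(days=1)
    return running_total, today
```

Each nightly run then touches only the days since the last checkpoint, which is what keeps the job small even under heavy user growth.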
What does this all mean?
In short, all of these measures are going to give us a significant safety margin while we continue to hunt for and onboard additional talented developers for our team. This hasn’t been an easy road, but with these developments the winding road is becoming straighter and we can see a clear path in front of us. We truly value our 3Commas family and look forward to a future of growth and stability.