The Instance Barely Matters (Or Not?)

So, despite everyone saying that the SaaS era is over, you’re building your super-SaaS. Everything is good: you iterate fast, you use simple technologies, and you choose PostgreSQL as your core. You have clients, you are successful!

And your system handles solid traffic. Your metrics say you handle 20k requests per second. Everything is great.

But one day you receive an alert – system performance has degraded significantly. You check and see that the bottleneck is the database. Now the system only handles 5k requests per second.

You don’t panic; you’ve read a lot about scalability, and you know you should first scale vertically, and only then think about horizontal scalability. And you have a lot of room to scale – your DB is currently on m7i.large.

So you scale the server to m7i.xlarge, and performance is back again! You go to sleep feeling you did it right.

In a few weeks you receive an alert again. Yeah, the same issue – performance is degrading.

Well, you know how to solve it. You scale again and switch to m7i.2xlarge. Again, it helps. All is good.

In a few more weeks, you receive an alert again…

What’s next? m7i.4xlarge? That’s almost $600 per month, while you were paying around $70 before these incidents started happening. Maybe it is worth looking deeper into the root causes?

What RDS instance do you actually need?

Your PostgreSQL RPS depends on whether your working set fits in RAM. I benchmarked every major RDS size to find that threshold for each – with measured RPS, latency, and cost per 1k requests by workload type.

DB Monitoring

You realise you only have metrics for the application itself: technical metrics at different levels and business metrics. But you have never really monitored your DB. So you set up monitoring for the DB – all the possible charts – and start looking for anomalies.

And you notice that your m7i.2xlarge is barely loaded. CPU never touches 30% load.

You also notice disk IO is at 100%...

So what really happened? Initially, your fresh, small DB fit entirely in the memory of your m7i.large, which gave you incredibly high RPS. But then your data started to grow, and at some point it stopped fitting in the server's RAM. Now many queries need to go to disk to fetch data instead of reading it from fast RAM. So the bottleneck moved from the server's CPU to disk IO.
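One rough way to check whether you are in this situation is the buffer cache hit ratio. Below is a minimal Python sketch, assuming you fetch the blks_hit and blks_read counters from PostgreSQL's pg_stat_database view yourself (the query is included as a string). The function names and the 0.99 threshold are illustrative assumptions, not values from the benchmarks. Keep in mind that blks_read also counts blocks served from the OS page cache, so treat the ratio as a signal rather than a precise measurement.

```python
# Query against the standard pg_stat_database statistics view.
# blks_hit: blocks found in PostgreSQL shared buffers.
# blks_read: blocks PostgreSQL had to request from outside shared buffers.
CACHE_HIT_SQL = """
SELECT sum(blks_hit) AS hits, sum(blks_read) AS reads
FROM pg_stat_database;
"""


def cache_hit_ratio(blks_hit: int, blks_read: int) -> float:
    """Share of block requests served from shared buffers (0.0 to 1.0)."""
    total = blks_hit + blks_read
    if total == 0:
        # No traffic recorded yet, so nothing has missed the cache.
        return 1.0
    return blks_hit / total


def looks_disk_bound(blks_hit: int, blks_read: int, threshold: float = 0.99) -> bool:
    """Hypothetical rule of thumb: flag the DB when the hit ratio drops below threshold."""
    return cache_hit_ratio(blks_hit, blks_read) < threshold


if __name__ == "__main__":
    # Made-up counter values, only to show the call shape.
    print(cache_hit_ratio(900_000, 100_000))
    print(looks_disk_bound(900_000, 100_000))
```

If the ratio keeps falling while CPU stays low and disk IO is saturated, that matches the pattern described above.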

Each time you scaled up, first to m7i.xlarge and then to m7i.2xlarge, you doubled the memory (8 GiB, then 16 GiB, then 32 GiB), and the dataset started fitting into memory again.

Unfortunately, the data is still growing, so you keep facing the same issue.
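To see how little time each upgrade buys, you can estimate when the dataset will outgrow the next instance. This is a minimal sketch that assumes linear growth and assumes only a fraction of the instance RAM is available for caching data; the 0.75 fraction, the function name, and the example numbers are all hypothetical.

```python
import math
from typing import Optional


def weeks_until_full(
    current_gib: float,
    growth_gib_per_week: float,
    ram_gib: float,
    usable_fraction: float = 0.75,  # assumed share of RAM usable for caching
) -> Optional[int]:
    """Weeks until the hot dataset no longer fits in usable RAM.

    Returns 0 if it already does not fit, None if it never outgrows RAM.
    """
    usable_gib = ram_gib * usable_fraction
    if current_gib >= usable_gib:
        return 0
    if growth_gib_per_week <= 0:
        return None
    return math.ceil((usable_gib - current_gib) / growth_gib_per_week)


if __name__ == "__main__":
    # Hypothetical: a 10 GiB hot dataset growing by 2 GiB per week.
    for name, ram in [("m7i.xlarge", 16), ("m7i.2xlarge", 32), ("m7i.4xlarge", 64)]:
        print(name, weeks_until_full(10, 2, ram))
```

With steady growth, every doubling of RAM only postpones the next alert, which is why the bill keeps doubling too.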

Disks

So, what can you do here?

Of course, you can try to make the data (at least the hot part) stop growing. Sometimes it's possible, but sometimes it's not.

And when it's not possible, you need to think about trade-offs.

You can keep buying more and more expensive servers with ever-increasing RAM. But you can also switch to faster disks and stay on a smaller server. The difference is cost, and you need to estimate that cost to make a proper decision.
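The comparison itself is simple arithmetic once you have numbers for each option. Here is a minimal sketch; the option names, prices, and RPS figures are made up for illustration and are not benchmark results, and the cost per 1k requests assumes the instance runs at its sustained RPS for a full 30-day month.

```python
from dataclasses import dataclass
from typing import List, Optional

SECONDS_PER_MONTH = 30 * 24 * 3600  # 30-day month


@dataclass
class Option:
    name: str
    monthly_cost_usd: float
    sustained_rps: float


def cost_per_1k_requests(opt: Option) -> float:
    """USD per 1,000 requests if the option runs at full sustained RPS all month."""
    requests_per_month = opt.sustained_rps * SECONDS_PER_MONTH
    return opt.monthly_cost_usd / requests_per_month * 1000


def cheapest_meeting(options: List[Option], required_rps: float) -> Optional[Option]:
    """Cheapest option whose sustained RPS covers the requirement, or None."""
    viable = [o for o in options if o.sustained_rps >= required_rps]
    return min(viable, key=lambda o: o.monthly_cost_usd) if viable else None


# Hypothetical numbers, NOT measured results.
OPTIONS = [
    Option("big instance, data in RAM", 600.0, 20_000),
    Option("small instance, fast disk", 250.0, 8_000),
    Option("small instance, default disk", 70.0, 2_000),
]

if __name__ == "__main__":
    best = cheapest_meeting(OPTIONS, required_rps=5_000)
    print(best.name if best else "nothing fits")
    for o in OPTIONS:
        print(o.name, cost_per_1k_requests(o))
```

Selecting on monthly cost rather than cost per request is a deliberate choice here: you pay for the instance whether or not you use its full capacity.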

So, I performed several experiments on AWS (actually, more than 150 experiments!), testing the performance of popular instances across different configurations.

And I've created a website where I've published all the results, along with an interactive tool that can help you find the best instance and disk configuration for your case.

Enter your dataset size and required RPS, and you'll see the best instances that fit your needs. You can read more about the methodology here.

Here is a part of this tool, and I invite you to visit the full website as well:

Conclusion

As you can see in the chart, ultra-high RPS is easy to achieve if the full dataset fits in memory. But if you don't need RPS that high, it might be much cheaper to invest in faster disk IO.

Be aware of your bottlenecks, and see you in the next posts :)
