How To Build Highly Available Backend App - The Big Book of Backend Engineering | Part - 2

I ran into an issue when one of our core services was not scaling well. The service, a cron job, was responsible for communicating with multiple modules (services) to gather the necessary data, combine it, and save it to the database. The data includes company data, financial reports, ratios, credit ratings, and historical data.

The first step I took was to identify the bottlenecks in the system.

When it comes to performance optimization, it's essential to identify the slowest piece of the code, as fixing this will usually give me the most significant improvement.

The bottlenecks were:

• Calling an external service multiple times

• Calling the database from a loop

Measuring performance was also a crucial step in the optimization process. I manually logged execution times between method calls using a native Node.js method.
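The text does not say which native method was used, so the following is only a minimal sketch of manual timing, assuming the global `performance.now()` available in recent Node.js versions (`console.time()` and `console.timeEnd()` would be another option). The helper name `timed` and the step label are hypothetical.

```javascript
// Wraps an async step and logs how long it took, even if the step throws.
// Assumption: `performance` is available as a global (Node.js 16+).
async function timed(label, fn) {
  const start = performance.now();
  try {
    return await fn();
  } finally {
    const elapsedMs = performance.now() - start;
    console.log(`${label}: ${elapsedMs.toFixed(1)} ms`);
  }
}

// Hypothetical usage: the step name and its body are placeholders.
async function main() {
  const rows = await timed("load companies", async () => {
    await new Promise((resolve) => setTimeout(resolve, 20)); // stand-in for real work
    return ["A", "B", "C"];
  });
  return rows.length;
}
```

Wrapping each call between modules this way makes it easy to compare the steps and spot the slowest one.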

The next step I took was to reduce the number of round trips.

A round trip between our application and a database or service can take 5-10 ms or more. When you have many round trips in your flow, the latency quickly adds up and causes delays.

We were making around 300 API calls to our data service provider (S&P Global) for a single company. So, I converted them into one call per company and had S&P Global aggregate the required data and return everything at once.

Instead of storing data one company at a time, I used AWS DynamoDB's BatchWriteItem operation to write multiple items per request.
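BatchWriteItem accepts at most 25 put or delete requests per call and may report some items as unprocessed, so the caller has to chunk the data and resend leftovers. The sketch below shows only that chunking logic. `writeBatch` is a hypothetical function standing in for the real SDK call, and `maxPasses` is an assumed safety cap, not something taken from the original setup.

```javascript
const MAX_BATCH_SIZE = 25; // BatchWriteItem limit on put/delete requests per call

// Splits an array into consecutive chunks of at most `size` items.
function chunk(items, size) {
  const chunks = [];
  for (let i = 0; i < items.length; i += size) {
    chunks.push(items.slice(i, i + size));
  }
  return chunks;
}

// `writeBatch` is a hypothetical async function that sends one batch to
// DynamoDB (for example by wrapping a BatchWriteItem call) and resolves to
// the items that were NOT processed, so they can be sent again.
async function batchWriteAll(items, writeBatch, maxPasses = 5) {
  let pending = items;
  for (let pass = 0; pass < maxPasses && pending.length > 0; pass++) {
    const unprocessed = [];
    for (const batch of chunk(pending, MAX_BATCH_SIZE)) {
      unprocessed.push(...(await writeBatch(batch)));
    }
    pending = unprocessed;
  }
  return pending; // anything still unwritten after maxPasses
}
```

With 3500 items this results in 140 requests instead of 3500 individual writes.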

In my situation, I had 3500 asynchronous calls to S&P Global, one for each of 3500 companies, with no dependencies between them. I used the Promise.allSettled() method, which lets you start multiple asynchronous tasks concurrently and wait until all of them have either fulfilled or rejected. This simple technique gave me a significant performance improvement.
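As a rough illustration of this pattern, the sketch below starts one request per company and then separates the fulfilled results from the rejected ones. `fetchCompany` is a hypothetical async function standing in for the real call to the data provider.

```javascript
// `fetchCompany` is a hypothetical async function that loads one company's data.
async function fetchAll(companyIds, fetchCompany) {
  // Start every request without awaiting, then wait for all of them to settle.
  const results = await Promise.allSettled(
    companyIds.map((id) => fetchCompany(id))
  );

  const succeeded = [];
  const failed = [];
  results.forEach((result, i) => {
    if (result.status === "fulfilled") {
      succeeded.push(result.value);
    } else {
      // Keep the id so the failed company can be retried or reported later.
      failed.push({ id: companyIds[i], reason: result.reason });
    }
  });
  return { succeeded, failed };
}
```

Unlike Promise.all(), a single rejected request does not discard the results of the others, which matters when the calls are independent.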

Now that we were making 3500 requests in a short burst, we ran into rate limiting on S&P Global’s end.

So, we navigated this problem by using retry policies.

Retry policies:

  • Exponential Backoff

  • Jitter

  • Retry limit

Exponential backoff means that if you’re faced with an API that is throttled, you shouldn’t retry right away; instead, you should wait for a period of time. This waiting period was determined by the server-level rate limit setting.

For instance, if we fail for the:

  • 1st time, we wait/sleep for 1 sec

  • 2nd time, we wait/sleep 2 sec

  • 3rd time, we wait/sleep 4 sec

  • 4th time, we wait/sleep 8 sec

So, the sleep duration doubles with every retry.

Also add jitter to the sleep interval, meaning a bit of randomness in the sleep time: instead of exactly 2 seconds, it might be 2.18 or 1.8 seconds. If a lot of requests fail at the same time and all of them use a hardcoded value of 2 seconds, they will all retry at the same moment and just get throttled again. With jitter, the retries are spread out over time instead of arriving all at once.

Finally, there should be a retry limit, because we don’t want to keep retrying forever. We use a retry limit of 4; the most common setups we’ve seen use a limit between 3 and 5.
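The three policies above can be combined in one small helper. This is a minimal sketch rather than the exact code we ran: the 1-second base delay and the limit of 4 retries follow the numbers above, while the ±10% jitter range and the names `backoffDelayMs`, `withRetry`, and `wait` are assumptions made for illustration.

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Delay before retry number `attempt` (1-based): baseMs * 2^(attempt - 1),
// scaled by a random factor in [0.9, 1.1) so callers do not retry in lockstep.
// Assumption: the ±10% jitter range is illustrative, not a recommended value.
function backoffDelayMs(attempt, baseMs = 1000, random = Math.random) {
  const exponential = baseMs * 2 ** (attempt - 1);
  const jitterFactor = 0.9 + random() * 0.2;
  return exponential * jitterFactor;
}

// Calls `task` once, then retries up to `maxRetries` times on failure.
// `wait` is injectable so the delays can be observed without really sleeping.
async function withRetry(task, { maxRetries = 4, baseMs = 1000, wait = sleep } = {}) {
  for (let failures = 0; ; failures++) {
    try {
      return await task();
    } catch (err) {
      if (failures >= maxRetries) throw err; // retry limit reached, give up
      await wait(backoffDelayMs(failures + 1, baseMs));
    }
  }
}
```

A call such as `withRetry(() => fetchCompany(id))`, where `fetchCompany` is a hypothetical request function, would wait roughly 1, 2, 4, and 8 seconds between attempts before giving up. In practice you would likely retry only on throttling responses and rethrow other errors immediately.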

Be fault tolerant

As a client consuming a rate-limited API, we need to be fault tolerant. We understood that applications with rate limiting built in will sometimes reject our requests. Since we were building a workflow on top of an external, rate-limited API, and that workflow needs to be ACID compliant, the application has to be fault tolerant and capable of rolling back some of the updates it is trying to make when requests are throttled or rate limited.

  • Create logs and upload them to S3 for compliance reasons

  • 3 different update modes

    • Create

    • Full Update → Once a week

    • Partial update → Daily → important time-sensitive fields like closing price, marketCap etc.

  • Statistics

    • Success

    • Failure
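To illustrate the rollback and success/failure statistics described in this section, here is a compensation-style sketch. It is not a real database transaction and not necessarily how our workflow was built: every step is a hypothetical `{ apply, undo }` pair, and `runWithRollback`, `processCompanies`, and `buildSteps` are names invented for this example.

```javascript
// Runs `steps` in order. Each step is a hypothetical { apply, undo } pair.
// If any apply() fails, the steps that already succeeded are undone in
// reverse order. An error thrown by undo() itself is not handled here.
async function runWithRollback(steps) {
  const applied = [];
  try {
    for (const step of steps) {
      await step.apply();
      applied.push(step);
    }
    return { ok: true };
  } catch (error) {
    for (const step of applied.reverse()) {
      await step.undo(); // compensating action for a step that already succeeded
    }
    return { ok: false, error };
  }
}

// Processes each company independently and tallies the outcomes.
// `buildSteps` is a hypothetical function returning the steps for one company.
async function processCompanies(companies, buildSteps) {
  const stats = { success: 0, failure: 0 };
  for (const company of companies) {
    const result = await runWithRollback(buildSteps(company));
    if (result.ok) stats.success++;
    else stats.failure++;
  }
  return stats;
}
```

The returned `stats` object could then be written to the run log alongside the per-company errors.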