How to Build a Resilient Backend App - The Big Book of Backend Engineering | Part 1
Last year, one of our Express.js servers was exhibiting erratic shutdown behavior: the Node.js process would crash and take down the entire production application.
I was tasked with investigating the bug and providing a solution.
The very first thing I figured we'd need was a:
Process Manager:
The application should always restart automatically in case of a failure. A failure can be caused by hardware trouble at the cloud provider, a network outage, a power failure, a bad publish from one of the open-source NPM packages we use, and more.
But there should always be a way to regenerate the server instance: an automated mechanism that restarts the process whenever our application crashes. We chose PM2 as our process manager.
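Here is a minimal sketch of what that looks like with PM2, assuming an ecosystem.config.js at the project root; the app name and entry file below are placeholders, not our real ones:

```js
// ecosystem.config.js — a minimal PM2 sketch; name and script are placeholders
module.exports = {
  apps: [
    {
      name: 'api',                      // hypothetical app name
      script: './src/server.js',        // hypothetical entry point
      autorestart: true,                // restart whenever the process exits unexpectedly
      max_restarts: 10,                 // stop retrying if we are stuck in a crash loop
      exp_backoff_restart_delay: 100,   // back off between restarts instead of hammering
    },
  ],
};
```

Starting the app with `pm2 start ecosystem.config.js` keeps it supervised, and `pm2 startup` plus `pm2 save` can bring it back after a machine reboot.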
Logging Pipeline:
While digging through hundreds of lines of logs, I realized they were far too verbose: we had to wade through line after line of noise, like looking for a needle in a haystack. It was time to go back to the drawing board and rethink our logging strategy, because the next time the app crashed we wanted to identify the problem as fast as possible and start working on a fix right away.
We created a logging pipeline that picks up our application logs from stdout and makes them visible, searchable, and visualizable in ELK (Elasticsearch, Logstash, Kibana).

Another issue: we only discovered that the application was down when a user emailed us, or when one of our engineers tried to log into the site in the morning and found it had been down the whole night. Users from multiple time zones access our site, which means some of them were denied service during peak hours in their time zone.
We needed an automated alert as soon as the site went down, instead of waiting for someone else to notify us. ElastAlert now notifies us through email and Slack whenever one of our applications is in trouble. We also assign specific codes from within the application so we can quickly identify which service is having the issue, as in the sketch below.
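As an illustration, this sketch uses pino to emit structured JSON to stdout (any JSON-lines logger shipped into Logstash would work the same way); the service codes shown are hypothetical examples of what an ElastAlert rule could match on:

```js
const pino = require('pino');

// pino writes one JSON object per line to stdout, which the pipeline ships into Elasticsearch
const logger = pino({ level: 'info' });

// Hypothetical service codes so an alert rule can tell services apart at a glance
const CODES = {
  ORDERS_DB_DOWN: 'ORD-DB-001',
  PAYMENTS_TIMEOUT: 'PAY-504',
};

logger.info({ route: '/orders', durationMs: 42 }, 'request completed');
logger.error(
  { code: CODES.ORDERS_DB_DOWN, err: new Error('connection refused') },
  'orders database unreachable'
);
// An ElastAlert rule filtering on code:"ORD-DB-001" can then trigger the email/Slack alert.
```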
Data Validation:
Now that we had a way to restart automatically without downtime, and a proper logging pipeline to search and visualize our logs, it was time to debug the error.
It turned out that one of our forms was collecting user data and inserting it into our database without proper validation, and the bad data caused the application to crash.
So the first step was to ensure proper data validation wherever we collect user data, because a user can input harmful data both intentionally and unintentionally. We enforce this with strict validation using Joi.
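A minimal sketch of that validation, assuming an Express route and placeholder field names:

```js
const Joi = require('joi');

// Placeholder schema — the fields are illustrative, not our actual form
const signupSchema = Joi.object({
  email: Joi.string().email().required(),
  age: Joi.number().integer().min(13).max(120),
}).unknown(false); // reject any field we did not explicitly allow

// Express middleware: validate before the handler ever touches the database
function validateBody(schema) {
  return (req, res, next) => {
    const { error, value } = schema.validate(req.body, { abortEarly: false });
    if (error) {
      return res.status(400).json({ errors: error.details.map((d) => d.message) });
    }
    req.body = value; // use the sanitized value from here on
    next();
  };
}

// Wired up like: app.post('/signup', validateBody(signupSchema), createUserHandler);
```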
Then we noticed that even when an exception occurred, the application should not have crashed; it should have returned a proper error message to our frontend through the API. The fact that it crashed meant we had a gap in our error handling. So we wrapped the risky code in try/catch and routed errors to a centralized exception handler that returns a clean API response, instead of letting the exception propagate up to the parent caller and bring the process down.
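Roughly, the pattern looks like this (the findOrder call is a hypothetical data-access function, not our actual code):

```js
const express = require('express');
const app = express();

app.get('/orders/:id', async (req, res, next) => {
  try {
    const order = await findOrder(req.params.id); // hypothetical data-access call
    res.json(order);
  } catch (err) {
    next(err); // hand the error to the error middleware instead of crashing the process
  }
});

// Centralized error-handling middleware: turns any routed error into a clean API response
app.use((err, req, res, next) => {
  res.status(err.statusCode || 500).json({ message: 'Something went wrong' });
});
```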
Improvement and Future-Proofing:
We implemented API versioning so we can safely migrate the application to 2.0 without breaking consumers of the 1.0 API, and we added a /health endpoint so we can periodically check whether the API is working correctly.
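A minimal sketch of URL-based versioning plus the health check, with placeholder routes:

```js
const express = require('express');
const app = express();

const v1Router = express.Router();
const v2Router = express.Router();

// Placeholder handlers — the real routers hold the actual business routes
v1Router.get('/users', (req, res) => res.json({ version: 1, users: [] }));
v2Router.get('/users', (req, res) => res.json({ version: 2, users: [] }));

app.use('/api/v1', v1Router); // existing consumers keep working
app.use('/api/v2', v2Router); // new consumers migrate at their own pace

// Liveness check that a monitor can poll periodically
app.get('/health', (req, res) => res.status(200).json({ status: 'ok' }));

app.listen(3000);
```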
Bonus point: Node.js applications are single-threaded, and any CPU-bound task will block the event loop. We configured the application to run in cluster mode to take full advantage of the underlying multi-core CPU.
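Since we already run under PM2, the simplest way to get cluster mode is through the same ecosystem file; the sketch below assumes that setup (PM2's cluster mode wraps Node's built-in cluster module):

```js
// ecosystem.config.js — cluster-mode sketch; name and script are placeholders as before
module.exports = {
  apps: [
    {
      name: 'api',
      script: './src/server.js',
      exec_mode: 'cluster', // PM2 forks workers and load-balances requests between them
      instances: 'max',     // one worker per available CPU core
    },
  ],
};
```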