[Technical Resource] Handling Application Level Failures for Business Critical Apps on the Cloud
There are various ways to mitigate hardware failure (RAID, master/slave replication, load balancers, etc.), and plenty has been written about them.
Apart from hardware failures, there are a bunch of other reasons a service can fail:
- erroneous deployment procedures like wrong configuration files being deployed
- buggy code
- encryption key being corrupted or lost.
I have encountered some of these issues in my previous life at Zoho, where we learnt, failed and learnt some more, and lived to tell the story. At ChargeBee we are trying to put many of these learnings to good use.
Back to topic.
The above-mentioned failures can be mitigated by following some of these practices.
Split the service into a Test site and a LIVE site, and expose Test to customers.
For a business-critical application, it is good practice to expose a sandbox (Test) environment that customers can integrate with from their development or staging setups. For critical apps, customers integrate and test against this sandbox on a daily basis.
- Always roll out code changes first to the "test" account. The service should be architected in a way to accommodate this, preferably at a domain level.
- Let it run for some time with customers testing it with their sandbox environment.
- Then push to LIVE accounts after validation.
With both the API and the code being tested by customers, most issues can be caught here.
This is very much applicable for services where customers interact via API. An example is our own service that provides a Test and a Live account for customers.
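The test-first rollout described above can be sketched roughly as follows. This is only an illustration: the `rollout` function, the `deploy` and `validated` callbacks, and the site names "test" and "live" are hypothetical stand-ins for whatever deployment tooling and validation checks you actually use.

```python
from typing import Callable, List


def rollout(version: str,
            deploy: Callable[[str, str], None],
            validated: Callable[[str, str], bool]) -> List[str]:
    """Deploy to the test site first; promote to live only after validation.

    Returns the list of sites that received the new version.
    """
    deployed = []

    # Step 1: the test (sandbox) site always gets the change first.
    deploy("test", version)
    deployed.append("test")

    # Step 2: in practice this check would run after customers have
    # exercised the sandbox for some time.
    if not validated("test", version):
        return deployed  # live keeps running the old version

    # Step 3: push to live only after validation passed.
    deploy("live", version)
    deployed.append("live")
    return deployed
```

If validation fails, the function returns with only "test" deployed, so LIVE accounts never see the faulty change.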
Architect the service to run as mini self-contained services in separate regions.
For business applications, data can be segmented at the customer level, and hence the application can be architected to run as mini-services. Each mini-service should serve a set of users and be deployed in a different region. If the services need to interact among themselves, they should do so as if interacting with a third-party service.
As an example, check out Mailchimp: when you log in, your account itself is segmented into a region, as captured in the domain.
https://us4.admin.mailchimp.com/.
Ideally, with this architecture, each mini-service can be deployed independently.
Before the days of Amazon & Rackspace, it was difficult for small players to have such distributed setups. Now it is much easier to have smaller services deployed across regions.
The cost should not increase radically, since instead of a few high-end servers you will have servers with lower capacity. An additional advantage is that latency due to table schema updates on databases that lock and copy (like MySQL) will be lower, as you are operating on smaller segments of data at a time.
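As a rough sketch of customer-level segmentation, each customer can be pinned to one mini-service, with the mini-service name showing up in the hostname, much like the Mailchimp URL above. All the names here (`MiniService`, `ASSIGNMENTS`, `host_for`, the customers, the regions and the `admin.example.com` domain) are made up for illustration.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class MiniService:
    name: str    # appears as the subdomain, e.g. "us4"
    region: str  # hypothetical region label where this mini-service runs


# Hypothetical registry of independently deployed mini-services.
SERVICES: Dict[str, MiniService] = {
    "us4": MiniService("us4", "us-east"),
    "eu1": MiniService("eu1", "eu-west"),
}

# Each customer is assigned to exactly one mini-service.
ASSIGNMENTS: Dict[str, str] = {
    "acme": "us4",
    "globex": "eu1",
}


def host_for(customer: str, base_domain: str = "admin.example.com") -> str:
    """Return the hostname of the mini-service that holds this customer's data."""
    service = SERVICES[ASSIGNMENTS[customer]]
    return f"{service.name}.{base_domain}"
```

Because the segment is part of the domain, a request can be routed to the right mini-service without any shared lookup on the request path.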
Deployment and monitoring tools need to be in place though.
This allows you to do: Staggered update across regions.
By staggering, the chances of noticing an issue and fixing it are much higher, and your entire customer base will not be affected.
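A staggered update could look something like the sketch below: update one region, check it, and stop at the first sign of trouble so the remaining regions stay on the old version. The `staggered_update` function and its `update` and `healthy` callbacks are hypothetical, standing in for your own deployment and monitoring tools.

```python
from typing import Callable, Iterable, List, Optional, Tuple


def staggered_update(
    regions: Iterable[str],
    update: Callable[[str], None],
    healthy: Callable[[str], bool],
) -> Tuple[List[str], Optional[str], List[str]]:
    """Update regions one at a time, halting at the first unhealthy one.

    Returns (updated_ok, failed_region_or_None, untouched_regions).
    """
    pending = list(regions)
    updated_ok: List[str] = []

    while pending:
        region = pending.pop(0)
        update(region)
        # A real rollout would let the region run for a while before checking.
        if not healthy(region):
            # Stop here: regions still in `pending` were never touched.
            return updated_ok, region, pending
        updated_ok.append(region)

    return updated_ok, None, pending
```

For example, if the second of three regions fails its health check, only the customers in that one region are exposed to the issue, and the third region is never updated.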
Everything should be backed up/versioned.
If you use a managed database service like Amazon RDS, there is an option to restore to a point in time, up to roughly the last 5 minutes. If you manage your own DB, you need to go for incremental backups.
A delayed replica (as provided by MongoDB et al.) is another good option.
Similarly, encrypted data and keys should be handled with proper versioned backups. There is no point in having backups of encrypted data if you don't have a similar backup process for the keys.
Update in sets.
Even within the same mini-service, rolling out a key update should preferably be staggered.
Example: if you have a process to change keys every 30 days, you do not have to re-encrypt all the data in one go. You should instead update the data in sets over an hour or more. You would have to store the key version along with the encrypted data to support this.
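Here is a minimal sketch of storing the key version next to the encrypted data and re-encrypting in sets. Everything in it is an assumption made for illustration: the `Record` type, the in-memory `keys` dictionary and the function names are hypothetical, and `_toy_cipher` is a placeholder that is NOT secure. A real system would use a proper authenticated cipher from a vetted library and keep keys in a key store.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, List


def _toy_cipher(key: bytes, data: bytes) -> bytes:
    """Placeholder XOR keystream. NOT secure; stands in for a real cipher.

    XOR is symmetric, so the same function encrypts and decrypts.
    """
    stream = b""
    counter = 0
    while len(stream) < len(data):
        stream += hashlib.sha256(key + counter.to_bytes(4, "big")).digest()
        counter += 1
    return bytes(a ^ b for a, b in zip(data, stream))


@dataclass
class Record:
    key_version: int   # stored alongside the ciphertext
    ciphertext: bytes


def encrypt(keys: Dict[int, bytes], version: int, plaintext: bytes) -> Record:
    return Record(version, _toy_cipher(keys[version], plaintext))


def decrypt(keys: Dict[int, bytes], record: Record) -> bytes:
    # The stored version tells us which key to use, old or new.
    return _toy_cipher(keys[record.key_version], record.ciphertext)


def reencrypt_batch(keys: Dict[int, bytes], records: List[Record],
                    new_version: int, batch_size: int) -> int:
    """Move at most batch_size records to new_version; return how many moved."""
    changed = 0
    for i, rec in enumerate(records):
        if changed == batch_size:
            break
        if rec.key_version != new_version:
            records[i] = encrypt(keys, new_version, decrypt(keys, rec))
            changed += 1
    return changed
```

Calling `reencrypt_batch` repeatedly (say, from a scheduled job) migrates the data one set at a time. Records on the old and the new key remain readable throughout, as long as the old key is retained until the last set is done.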
Finally, all of these can only help you mitigate the issues. Monitoring and management of the servers need to be robust enough to capture any failure so that you can act on it immediately. Investing in tools like New Relic, Pingdom/site24x7.com and other home-grown, app-specific monitoring tools could make the difference between a lynch mob and a few customers you could reach out to personally.
Not all of these need to be in place while starting out, but there has to be a plan to make a natural progression towards this.
About the Author:
KP Saravanan (KPS) was an architect at Zoho and is now the CTO & Co-founder of ChargeBee, the Chennai-based startup that offers a Subscription Management & Recurring Billing solution for SMBs.
KPS was one of the early employees at Zoho, from the days when it was called Vembu Systems. He has in-depth knowledge of the complete application stack and has built secure, scalable business applications at Zoho for the past 13+ years.