Chaos engineering: steps to apply it to your application


In recent years, new hosting and application development practices (micro-services) have led us to rethink how our applications communicate with each other and how we serve our customers. Multiplying services makes applications easier to control, both in development and from a business standpoint, but it also brings its share of problems: the sources of error multiply as well.

Chaos testing, or Chaos engineering, is a philosophy that requires developers to take into account possible failures that may occur on an application and thus prepare themselves to face a chaotic situation, namely:

  • Application errors,
  • Infrastructure errors,
  • Network errors,
  • More generally, any unexpected error.

Netflix teams initiated the practice in 2011 and described a set of principles to follow, the [Principles of Chaos](http://principlesofchaos.org/). I invite you to read them before going further. We will then look at the steps to introduce Chaos engineering in your teams and thus obtain an application that is resilient to failure.

All the failures mentioned above can (and should) be anticipated by developers when building new applications, but applications already in production are sometimes far from resilient to all these sources of error. That's why I thought it would be interesting to share the steps that will, I hope, lead you to a more resilient application.

My first piece of advice is: do not break services in your production environment right away. You have pre-production or staging environments, so use them first! There is no need to damage production when you can anticipate these failures elsewhere; your users and customers should not have to suffer from this.

Then, as stated above, errors can come from the infrastructure, the network or the application itself! It is up to you, the developers, to implement services that can withstand these failures, sometimes including infrastructure or network problems. Finally, proceed step by step. We will now look at the different steps you can take, not necessarily in the order given.

Some information before starting

I am deliberately not naming tools at this point, because everyone is free to implement their own chaos rules. The important thing is to observe and react to a failure you have caused; a few simple command lines are enough to start experimenting on your environments.

To kill a process, a simple kill is enough:

$ kill -9 <pid>

Likewise, if your applications are deployed on a Kubernetes cluster, you can delete a random pod:

$ kubectl delete "$(kubectl get pods -o name | shuf -n1)"

As you can see, chaos rules can be written quite simply. Now let's look at the different cases you can start observing.

Add latency

To start gently, you can simply add latency to your servers. You should then start to see various timeout problems on internal and external services, which can degrade your application. These errors are very frequent and are caused by high load or poor connectivity at your provider, so you must take them into account. To add network latency, you can connect to a machine and simply use the "tc" command (for traffic control):

# Add 500ms of latency
$ tc qdisc add dev eth0 root netem delay 500ms

# Verify that the rule has been applied
$ tc -s qdisc
qdisc netem 8002: dev eth0 root refcnt 2 limit 1000 delay 500.0ms

# Delete the rule
$ tc qdisc del dev eth0 root netem

An easy and effective way to start testing and observing how your application behaves under latency!
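Once latency is injected, the interesting question is whether each caller gives up after a reasonable delay instead of hanging. Here is a minimal sketch of that idea, assuming GNU coreutils `timeout` is installed; `call_with_deadline` and the fallback text are hypothetical names chosen for the illustration.

```shell
# Run an external command with a deadline and print a fallback answer
# when the command is too slow or fails.
# Assumption: GNU coreutils `timeout` is available on the machine.
# call_with_deadline is a hypothetical helper, not an existing tool.
call_with_deadline() {
    deadline_seconds=$1
    fallback=$2
    shift 2
    # "$@" must be an external command (timeout cannot run shell functions).
    if output=$(timeout "$deadline_seconds" "$@" 2>/dev/null); then
        printf '%s\n' "$output"
    else
        # Deadline exceeded or command failed: degrade instead of hanging.
        printf '%s\n' "$fallback"
        return 1
    fi
}
```

For instance, `call_with_deadline 1 "recommendations unavailable" sleep 3` prints the fallback after about one second. The non-zero return status lets the caller log that a degraded answer was served.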

Cut off your scheduled tasks

Without directly breaking your application, you can start by thinking about what would happen if your asynchronous jobs (sending emails, data synchronization, ...) stopped working. These jobs are not directly visible to your users, and it sometimes doesn't matter if they run a little late, as long as they run.

Let's take the example of denormalizing data in order to render it on a front end: when indexing new data, remember to suffix the new index with a timestamp so that it doesn't affect your current data. Your current data must also be stored in a dedicated index (or better, behind an alias pointing to an index), so that you can switch at the end of the denormalization job. This ensures that if the job fails, the live index is not affected and the "old" data is still displayed on your front end.
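The "timestamped index plus alias" pattern can be illustrated without any search engine. The sketch below is only a filesystem analogy, not the API of a particular indexing server: a timestamped directory plays the role of the new index and a `current` symlink plays the role of the alias. The function name and the directory layout are assumptions made for the example.

```shell
# Filesystem analogy of "timestamped index + alias" (all names hypothetical).
# Each run builds into a fresh timestamped directory; the `current` symlink
# is only repointed when the job succeeds.
rebuild_and_switch() {
    data_root=$1
    shift
    mkdir -p "$data_root" || return 1
    # Timestamp suffix plus a random part so two runs never share a directory.
    new_index=$(mktemp -d "$data_root/index-$(date +%Y%m%d%H%M%S)-XXXXXX") || return 1
    # "$@" is the denormalization job; it receives the new directory as last argument.
    if "$@" "$new_index"; then
        # Success: repoint the alias at the freshly built index.
        ln -sfn "$(basename "$new_index")" "$data_root/current"
    else
        # Failure: discard the partial index, the alias still targets the old data.
        rm -rf "$new_index"
        return 1
    fi
}
```

Readers always go through `current`, so a job that crashes halfway never exposes partial data. Old directories are not cleaned up here; a real job would also prune them.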

Of course, for jobs that absolutely must run on time (granting access rights after an order placed upstream, for example), make sure that they execute correctly and that alerting and retries are in place.
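Retry and alerting can be sketched in a few lines. This is one possible shape under simple assumptions: the job is any command whose exit status signals success, and `alert` is a hypothetical hook that only writes to stderr here, where a real one would notify the team.

```shell
# Hypothetical alert hook: replace with your paging or notification tool.
alert() {
    printf 'ALERT: %s\n' "$*" >&2
}

# Retry a job a bounded number of times, doubling the delay between
# attempts, and raise an alert when every attempt has failed.
run_with_retry() {
    max_attempts=$1
    delay_seconds=$2
    shift 2
    attempt=1
    while [ "$attempt" -le "$max_attempts" ]; do
        if "$@"; then
            return 0
        fi
        # Wait before the next attempt, but not after the last one.
        if [ "$attempt" -lt "$max_attempts" ]; then
            sleep "$delay_seconds"
            delay_seconds=$((delay_seconds * 2))
        fi
        attempt=$((attempt + 1))
    done
    alert "job '$*' failed after $max_attempts attempts"
    return 1
}
```

A call such as `run_with_retry 4 5 ./grant_access_rights.sh` (the script name is made up) would try four times, waiting 5, 10 and 20 seconds between attempts.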

Cut your event publisher/subscriber server

When your applications exchange data with a pub/sub server (publisher / subscriber), you should also expect that it may be unavailable. Even when your server is in cluster mode, you are unfortunately not protected from a crash.

You must therefore ensure that every event notification that could not be sent to the pub/sub server is stored so that it can be sent again as soon as the server is available. It is much better to catch up later than to lose data that is important for your business.
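A store-and-forward mechanism of this kind could look like the sketch below. It assumes a hypothetical `PUBLISH_CMD` that reads one event on standard input and exits non-zero when the broker is unreachable; `SPOOL_DIR` and the function names are also invented for the example.

```shell
# Store-and-forward sketch (all names are hypothetical).
# PUBLISH_CMD reads one event on stdin and fails when the broker is down.
SPOOL_DIR=${SPOOL_DIR:-/tmp/event-spool}
PUBLISH_CMD=${PUBLISH_CMD:-false}

publish_event() {
    event=$1
    if printf '%s\n' "$event" | $PUBLISH_CMD; then
        return 0
    fi
    # Broker unavailable: keep the event on disk instead of losing it.
    mkdir -p "$SPOOL_DIR" || return 1
    spool_file=$(mktemp "$SPOOL_DIR/event-$(date +%Y%m%d%H%M%S)-XXXXXX") || return 1
    printf '%s\n' "$event" > "$spool_file"
}

flush_spool() {
    [ -d "$SPOOL_DIR" ] || return 0
    for spool_file in "$SPOOL_DIR"/event-*; do
        [ -f "$spool_file" ] || continue
        # Delete an event only once the broker has accepted it.
        if $PUBLISH_CMD < "$spool_file"; then
            rm -f "$spool_file"
        else
            return 1
        fi
    done
}
```

Calling `flush_spool` periodically (from a scheduled task, for example) replays what was missed. File names only carry a one-second timestamp, so ordering between events stored within the same second is not guaranteed in this sketch.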

Cut your database

We reach a critical point here: in general, when a database becomes unavailable, many applications are affected because they can no longer read or write their data. In addition to advising you to run your databases as a cluster, I also suggest that you expect them to become unavailable: corrupted data, failed network connection, ...

The most important thing here is to reassure your users and show them something useful. If some information is cached in the user's local storage, take advantage of it and display it, for want of anything better.
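The same reflex applies to any layer that keeps a copy of the data. As a rough sketch, assuming a hypothetical query command and a file-based cache directory (`CACHE_DIR` and `read_with_fallback` are invented names), a read path with a stale fallback could be written like this:

```shell
# Read-through with stale fallback (names are hypothetical).
# Try the database first and refresh the cached copy on success;
# serve the last cached copy when the query fails.
CACHE_DIR=${CACHE_DIR:-/tmp/read-cache}

read_with_fallback() {
    cache_key=$1
    shift
    mkdir -p "$CACHE_DIR" || return 1
    cache_file="$CACHE_DIR/$cache_key"
    # "$@" is the command that queries the database.
    if fresh=$("$@" 2>/dev/null); then
        printf '%s\n' "$fresh" > "$cache_file"
        printf '%s\n' "$fresh"
    elif [ -f "$cache_file" ]; then
        # Database unreachable: show the last known value instead of an error.
        cat "$cache_file"
    else
        # Nothing cached for this key: the caller must handle the failure.
        return 1
    fi
}
```

The data served this way may be out of date, which is usually acceptable for display purposes but not for writes.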

Delete your data

Your database may still be available, but the problem could very well come from your data being corrupted or erased as a result of a flaw in your application. In this case, you must be able to detect the problem and quickly re-import a stable, recent backup.

In the same way, it is very important to test your backups! A backup can turn out to be faulty at any time, and it would be a pity to be unable to restore a recent one after a data loss because you never checked that it works.
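A restore drill can be automated. The following sketch assumes file-based backups made with `tar` and checked with `sha256sum` (GNU coreutils); the function names and file layout are hypothetical, and a database backup would need its own restore command in place of `tar`.

```shell
# Create a compressed archive and record a checksum of every file in it.
# Assumptions: tar and sha256sum are installed; names are hypothetical.
make_backup() {
    source_dir=$1
    archive=$2
    tar -czf "$archive" -C "$source_dir" . || return 1
    (cd "$source_dir" && find . -type f -exec sha256sum {} + | sort -k2) > "$archive.sha256"
}

# Restore the archive into a scratch directory and compare the restored
# files with the checksums recorded at backup time.
verify_backup() {
    archive=$1
    scratch=$(mktemp -d) || return 1
    if tar -xzf "$archive" -C "$scratch" 2>/dev/null \
        && (cd "$scratch" && sha256sum -c >/dev/null 2>&1) < "$archive.sha256"; then
        status=0
    else
        status=1
    fi
    rm -rf "$scratch"
    return "$status"
}
```

Running `verify_backup` on a schedule, and alerting when it fails, turns "we have backups" into "we have backups that restore".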

Cut off your micro-services

Your micro-services are most likely reached in one of two ways: through an upstream API gateway (GraphQL?) over HTTP or gRPC, or purely through events (consumers / producers). In either case, you must expect outages on these applications and make sure, at a minimum, that they do not jeopardize your entire application. The feature backed by the failing micro-service (favorites management, for example) may become unavailable, but the other features should continue to work. Better still, you can tell your users that you are having trouble retrieving their favorites and take the opportunity to show them the latest available content, for want of anything better.
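The isolation described here can be pictured as a page assembled section by section. In this sketch each section is a command (in a real system, a call to one micro-service); the function name and the notice text are assumptions made for the illustration.

```shell
# Graceful degradation sketch (names and messages are hypothetical).
# Every section is fetched independently, so a failing service only
# affects its own part of the page.
render_page() {
    for section in "$@"; do
        if section_output=$("$section" 2>/dev/null); then
            printf '%s\n' "$section_output"
        else
            # Tell the user what is missing and keep rendering the rest.
            printf '%s: temporarily unavailable, showing the latest content instead\n' "$section"
        fi
    done
}
```

With a working `catalogue` command and a failing `favorites` command, `render_page catalogue favorites` still prints the catalogue and replaces only the favorites section with a notice.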

If your infrastructure allows it, another possible solution is to serve users from a fallback cache, using an eviction strategy such as LRU (Least Recently Used) or LFU (Least Frequently Used) depending on the case. The data would not necessarily be completely up to date, but users would at least have some content available and, in most cases, would not notice anything. Of course, a fallback cache can grow large, which is why it is important to estimate the volume of data it could hold and to control what you cache.

Increase the complexity of chaos

Once you can handle most of the failures that can hit your infrastructure, it is time to scale up and prepare to handle several failures in parallel. That is chaos. Several micro-services can indeed become unavailable at the same time: if your product recommendation service was supposed to fall back on another micro-service that returns the latest products in your catalogue, for example, and that one is down too, you must be able to find another solution.
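To provoke overlapping failures, the random selection used earlier for a single pod can be extended to several targets at once. A small sketch, assuming GNU `shuf` and hypothetical service names, which only prints what it would stop:

```shell
# Pick several distinct targets at random so that their failures overlap.
# Assumption: GNU coreutils `shuf` is available; service names are made up.
pick_targets() {
    count=$1
    shift
    printf '%s\n' "$@" | shuf -n "$count"
}

# Dry run: print the targets instead of stopping them. Replace the echo
# with your own stop command once you are ready to run the experiment.
chaos_round() {
    count=$1
    shift
    for target in $(pick_targets "$count" "$@"); do
        echo "would stop: $target"
    done
}
```

For example, `chaos_round 2 favorites recommendations catalogue search` selects two of the four services. Keeping the dry run as the default makes it harder to break something by accident.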

If your application is deployed in several geographical areas, make one area completely unavailable to check that your customers are redirected to the second one. This is where you start using tools such as [Chaos Monkey](https://netflix.github.io/chaosmonkey/), initially developed for Netflix's own needs on these topics.

You are ready to bring chaos to production

Remember, until now we were working on non-production environments. At some point you will need to test these different cases in production. A good practice is to organize "game days": a full day devoted to injecting chaos into your infrastructure, with your teams mobilized and ready to step in to test their fallback solutions and/or restore service if something breaks. Even when something does break, the exercise benefits your project because it lets you improve the resilience of your application as you go, so don't be afraid to get there.

Conclusion

Chaos engineering is approached step by step because, in the life cycle of a project, it is usually one of the last steps, once the application is stable and in production. It makes your application resilient to failures, but it also prepares your teams to deal with incidents, which can be really frustrating when they occur: the goal is to be able to resolve them quickly and also to minimize the chances of them happening.