How Can Chaos Engineering Help Achieve Cyber Resilience?

Nov 16, 2020


Enterprise infrastructure is becoming more distributed, and hence more complex, because of technological innovation, digitization, and the proliferation of IoT devices. The current pandemic has added to this complexity by accelerating cloud migration projects, the adoption of cloud-native solutions, and employees working from home (WFH).

A survey jointly conducted by McAfee and the Center for Long-Term Cybersecurity indicates that, across the industry verticals surveyed, hybrid cloud is the most popular deployment architecture across all major cloud providers, including AWS, Microsoft Azure, GCP, Oracle Cloud, Alibaba Cloud, and IBM Cloud. The complexity of the hybrid cloud makes it difficult for the cybersecurity team to get a single view of the entire system. The team has to implement and manage security controls across the multiple environments where enterprise applications are hosted. All this boils down to the security team's inability to understand and predict cascading effects: if something goes wrong in one part, how will it affect other systems and applications that are not directly related to the affected one?

On top of this, Advanced Persistent Threats (APTs) are getting stealthier: attackers can stay in the organization's environment and move laterally without being noticed. Identifying these stealthy attacks becomes difficult when the security team does not have a complete picture of the normal state of the organization's entire IT environment. This makes it even harder to comprehend the security health and operating state of an application and its hosting environment.

These are the main challenges in achieving cyber resilience. So how can we tackle them?

The answer is to apply Chaos Engineering techniques to enterprise cybersecurity.

“The speed, scale, and complex operations within microservice architectures make them tremendously difficult for humans to mentally model their behavior. If the latter is even remotely true how is it possible to adequately secure services that are not even fully comprehended by the engineering teams that built them.” — Aaron Rinehart CTO Founder — Verica

What is Chaos Engineering?

The term Chaos Engineering first gained popularity about a decade ago, when Netflix created a tool called Chaos Monkey that randomly took a node of its production network offline to force teams to react accordingly. Chaos Monkey was effective: it helped the Netflix team keep its streaming service online and reduce dependencies between cloud servers.

Chaos Engineering is not about adding chaos to the production environment. Rather, it is about understanding the chaos already present in a complex, distributed production environment, proactively understanding the impact of failure, and mitigating it before it affects the business.

To understand what Security Chaos Engineering is, we need to understand two terms. Both are defined by Aaron Rinehart, the creator of the ChaoSlingr security chaos testing tool, and Charles N in their blog.

Security Chaos Engineering is the discipline of instrumentation, identification, and remediation of failure within security controls through proactive experimentation to build confidence in the system’s ability to defend against malicious conditions in production.

(Security) Chaos Experiments are foundationally rooted in the scientific method, in that they seek not to validate what is already known to be true or already known to be false, rather they are focused on deriving new insights about the current state.

How does Chaos Engineering Help in Cyber Resilience?

Resilience is about keeping business-critical services and applications running at an optimal level despite a cyberattack, or bouncing back from one as fast as possible. To achieve this, we need to improve the availability of critical applications and services. Please find more information on cyber resilience here.

The main practical challenge in achieving cyber resilience is the large gap between what the security team believes the security posture to be, based on its understanding of the complexities involved, and what the security posture is in reality. An organization's security posture is not static: it is affected by changes in the business environment, changes in attack vectors, attackers' ever-evolving tactics, and the availability of cyberattack-as-a-service offerings on the dark web. What was implemented and tested yesterday may not be applicable today.

This is where one of the techniques developed by the pioneers of Chaos Engineering for cybersecurity helps: Continuous Verification of the cybersecurity posture. This technique uses experimentation to discover security and availability weaknesses before they become business-disrupting incidents. In a Continuous Verification experiment, the hypothesis is built around what the security team believes the security posture is and how it will behave during an attack. This hypothesis is then tested through experiments. If the results disprove the hypothesis, the team has inputs for changing the security architecture to build better resilience.
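As a rough illustration of the hypothesis-and-experiment idea, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `Experiment` class, the `run_experiment` helper, the `MONITORED_PORTS` set, and the simulated "unauthorized port" condition are hypothetical and do not come from ChaoSlingr or any other real tool. The environment is simulated in memory, so no real system is touched.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Experiment:
    """A security chaos experiment: a stated belief and a way to check it."""
    hypothesis: str
    inject: Callable[[], dict]     # introduces the condition, returns what was observed
    holds: Callable[[dict], bool]  # True if the observation matches the hypothesis


def run_experiment(experiment: Experiment) -> dict:
    """Inject the condition once and compare the observation with the hypothesis."""
    observation = experiment.inject()
    return {
        "hypothesis": experiment.hypothesis,
        "observation": observation,
        "hypothesis_holds": experiment.holds(observation),
    }


# --- Hypothetical, simulated environment (no real systems are touched) ---
MONITORED_PORTS = {22, 3389}  # assumption: the only ports the alerting rules cover


def open_unauthorized_port(port: int = 2222) -> dict:
    """Simulate a misconfiguration: an unauthorized port is opened."""
    return {"port": port, "alert_raised": port in MONITORED_PORTS}


experiment = Experiment(
    hypothesis="Opening an unauthorized port raises an alert",
    inject=open_unauthorized_port,
    holds=lambda observation: observation["alert_raised"],
)

result = run_experiment(experiment)
if not result["hypothesis_holds"]:
    # A disproved hypothesis is the useful outcome: it is an input for improvement.
    print("Hypothesis disproved, finding:", result["observation"])
```

In this toy setup the team believes every unauthorized port raises an alert, but the alerting rules only cover two ports, so the experiment disproves the hypothesis and surfaces the gap before an attacker does.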

According to Casey Rosenthal, CEO of Verica: “You can improve the security and availability by enabling engineers to optimize for reversibility. In complex systems you are going to make mistakes, what differentiates a fragile system from a robust system is your ability to identify mistakes and roll forward or roll back and change the decisions that led up to the conditions we don’t like quickly. With agile, now we can make architectural decisions that can improve reversibility. It is called chaos engineering but it is not stuff to break the stuff in production, the point is to fix the stuff in production. It is not you are engineering chaos, it is assumed that in complex systems you already have chaos so how do you engineer your ways around that.”

Cybersecurity Chaos Engineering Framework

The following diagram depicts the cybersecurity chaos engineering framework.

To use Chaos Engineering techniques, we need to start with an understanding of the organization's current security posture. Understanding the posture of a complex hybrid environment is a difficult job; technical and architectural design documents, together with current security testing, audit, assessment, and policy compliance reports, help to build a better picture. On the basis of this understanding, we can build a hypothesis about how this security posture handles cyberattacks and how the different components of the security architecture respond to attack situations. This includes the tools, technologies, processes, and the policies configured as part of tool deployment.

The next step is to build experiments to test this hypothesis. If an experiment validates the hypothesis, what the security team thinks and the way the security architecture actually responds are in line. Given the number of successful attacks, this is probably a rare case. If the experiment does not match the hypothesis, it provides inputs to improve the security posture. These inputs can be configuration changes, fine-tuning, policy improvements, changes in architecture, integrations, or the addition or removal of technology components. The improvements then need to be implemented, and the entire cycle repeats.
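The cycle described above (understand the posture, form hypotheses, run experiments, implement improvements, repeat) can be sketched as a simple loop. This is only a sketch under stated assumptions: the posture is reduced to a dictionary of hypothetical control names, `experiment_confirms` and `remediate` are stand-ins for real experiments and real improvement work, and `max_rounds` exists only to keep the example finite.

```python
from typing import Dict, List, Tuple

Posture = Dict[str, bool]     # control name -> is it actually effective?
Hypothesis = Tuple[str, str]  # (statement, control the statement relies on)


def experiment_confirms(posture: Posture, control: str) -> bool:
    """Stand-in for a real experiment: the hypothesis holds only if the control works."""
    return posture.get(control, False)


def remediate(posture: Posture, control: str) -> None:
    """Stand-in for a real improvement (config change, tuning, new component)."""
    posture[control] = True


def verification_cycle(
    posture: Posture, hypotheses: List[Hypothesis], max_rounds: int = 3
) -> List[List[str]]:
    """Run experiments, apply improvements, repeat. Returns failed controls per round."""
    history: List[List[str]] = []
    for _ in range(max_rounds):
        failed = [
            control
            for _, control in hypotheses
            if not experiment_confirms(posture, control)
        ]
        history.append(failed)
        if not failed:
            break  # every hypothesis matched reality in this round
        for control in failed:
            remediate(posture, control)
    return history


# Hypothetical posture: what is really effective, not what the team believes.
posture = {"edr_alerting": True, "storage_public_access_block": False, "vpn_mfa": False}
hypotheses = [
    ("Malware on an endpoint raises an alert", "edr_alerting"),
    ("A storage bucket cannot be made public", "storage_public_access_block"),
    ("VPN logins without MFA are rejected", "vpn_mfa"),
]

history = verification_cycle(posture, hypotheses)
print(history)
```

Here the first round finds two controls that do not behave as the team assumed, and the second round confirms the fixes. In a real environment the loop would not simply terminate, because the posture keeps drifting as the business and the threat landscape change.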


Digital transformation, the availability of large quantities of data, interconnectedness, the loss of boundaries between organizational IT assets and the places from which they are accessed, and the use of AI/ML-based tools to make decisions are making an organization's IT infrastructure complex. Cyberattacks are carried out not only by lone attackers; they are sometimes sponsored by nation-states, with the intention not only of financial gain but also of destroying businesses and governments. This is why we need to take proactive steps to make the business cyber resilient. Chaos Engineering gives us a way to understand how a complex environment behaves under real-world attacks and provides inputs to improve it proactively.



