The Five Pillars of Resilience Engineering
Keeping systems up and running has become even more critical given today's distributed workforce. Here are five ways to keep your engineering team ready for anything.
In today’s “Always On” world, being available at the infrastructure level is not enough. Services need to do more than respond to requests: every integration point must work properly, and each service’s core function in your ecosystem of applications must perform the way you expect and at the pace you expect. A resilient engineering team is always necessary, especially at my company, where identity is central to everything we do.
It’s always critical to keep systems up and running, but it’s more critical than ever given today’s distributed workforce. My team has been practicing resilience engineering for the past 12 years, and along the way we have created some unique ways to drive it home across our engineering team. Here are five ways to get started:
Monitoring and Visibility
It’s critical to implement constant monitoring so your team can act quickly in an emergency. You have to monitor at the application level, identify your critical user flows, and create synthetic transactions and heuristic monitoring that catch signs of disruption before your customers’ experience starts to degrade.
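As a rough illustration of a synthetic transaction paired with a simple heuristic, the Python sketch below times a scripted user flow and flags degradation when too many recent runs fail or miss a latency budget. Every name here (run_synthetic_check, DegradationDetector, the window size, and the thresholds) is a hypothetical choice for this example, not a description of any particular monitoring product.

```python
import time
from collections import deque
from dataclasses import dataclass
from typing import Callable, Deque


@dataclass
class CheckResult:
    ok: bool              # the flow completed without raising
    latency_s: float      # wall-clock duration of the flow
    within_budget: bool   # True only if ok and fast enough


def run_synthetic_check(flow: Callable[[], None], latency_budget_s: float) -> CheckResult:
    """Run one synthetic transaction (a scripted critical user flow) and time it."""
    start = time.monotonic()
    try:
        flow()
        ok = True
    except Exception:
        ok = False
    latency = time.monotonic() - start
    return CheckResult(ok, latency, ok and latency <= latency_budget_s)


class DegradationDetector:
    """Heuristic: alert when the share of bad checks in a sliding window is too high."""

    def __init__(self, window: int = 10, max_bad_ratio: float = 0.3):
        # Assumed defaults; tune them to your own traffic and tolerance.
        self.results: Deque[bool] = deque(maxlen=window)
        self.max_bad_ratio = max_bad_ratio

    def record(self, result: CheckResult) -> bool:
        """Store a result and return True if the window now looks degraded."""
        self.results.append(result.within_budget)
        bad = self.results.count(False)
        return bad / len(self.results) > self.max_bad_ratio
```

In practice the flow callable would script a real critical path, such as logging in with a test account, and a scheduler would call run_synthetic_check every few seconds and page someone when record returns True.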
One way you can challenge your engineers to prepare for the unknown is through regular games and testing opportunities like SRT (site reliability testing) and outage simulations. In these games, we divide the team in half. One team is tasked with understanding how to monitor several metrics of a new technology to ensure it’s working correctly and to take manual action if needed to restore service when a disruption is identified. The other team will purposely introduce several disruption modes and monitor how they affect the system. It’s okay -- and even encouraged -- to push teams over the edge, forcing them to reassess themselves and learn for next time.
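For the team introducing disruptions, one lightweight approach is to wrap a dependency call so it fails at a chosen rate. The sketch below is only an illustration of that idea in Python; with_fault_injection and InjectedFault are hypothetical names, and a real exercise would likely inject faults at the network or infrastructure layer as well.

```python
import random
from typing import Any, Callable


class InjectedFault(Exception):
    """Marks a failure that was introduced on purpose during an exercise."""


def with_fault_injection(
    func: Callable[..., Any],
    failure_rate: float,
    rng: random.Random,
) -> Callable[..., Any]:
    """Return a wrapper around func that raises InjectedFault at the given rate."""

    def wrapper(*args: Any, **kwargs: Any) -> Any:
        # rng.random() is in [0.0, 1.0), so 0.0 never fails and 1.0 always fails.
        if rng.random() < failure_rate:
            raise InjectedFault("simulated disruption")
        return func(*args, **kwargs)

    return wrapper
```

Passing a seeded random.Random makes a given game repeatable, which helps when the monitoring team wants to replay a disruption they missed the first time.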
A “Redundancy is King” Attitude
Resilience engineering requires having no single point of failure and proactively planning where you might need “backup.” This can look like multiple cells, each supported by several servers and all backed by different data centers. When you send your credentials to authenticate and one subsystem isn’t working, the request can be redirected to another, so the authentication works and appears seamless to the end-user. We’ve spent a lot of time understanding failure modes and making sure our architecture can immediately work around those modes.
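The redirect-on-failure idea can be sketched in a few lines of Python. This is a simplified illustration under assumptions, not how any production identity system is built: each "cell" is modeled as a callable, and CellUnavailable, AllCellsFailed, and authenticate_with_failover are hypothetical names.

```python
from typing import Callable, List, Sequence


class CellUnavailable(Exception):
    """Raised by a cell that cannot serve the request right now."""


class AllCellsFailed(Exception):
    """Raised when every redundant cell was unavailable."""


def authenticate_with_failover(
    cells: Sequence[Callable[[str, str], bool]],
    username: str,
    password: str,
) -> bool:
    """Try each cell in order and return the first answer any cell gives."""
    errors: List[CellUnavailable] = []
    for cell in cells:
        try:
            # A definitive answer (True or False) is returned immediately.
            return cell(username, password)
        except CellUnavailable as exc:
            # Only infrastructure failures trigger a redirect to the next cell.
            errors.append(exc)
    raise AllCellsFailed(f"{len(errors)} cell(s) unavailable")
```

Note the design choice: a rejected password is a valid answer and is not retried elsewhere, so failover hides outages from the end-user without changing the outcome of the authentication itself.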
Always remember that redundancy should be considered at all levels, not only within your infrastructure but also with the third-party providers or services you rely on.
A “No Mysteries” Mindset
Embracing a “no mystery” culture comes down to being willing and motivated to find the root cause of any issue that happens in your production system, no matter the complexity. Every engineer must maintain a mindset of curiosity and exploration and never settle for not knowing.
I like to occasionally remind my team about what happened when we didn’t implement this mindset and how much additional work it created. Several years ago, we had a recurring issue around 6 am each Monday that eventually caused customer disruption. At first, we assumed it was related to normal load coming to the system, but because it was only happening in one of the cells, that theory was quickly dismissed. We had to start hosting watch parties at 4:30 am, with engineers monitoring different parts of the application and infrastructure. Eventually, after many weeks, we found the actual root cause and fixed it. But the team still remembers those disruptive 4:30 am watch parties, and they serve as a powerful reminder of the need to never leave a mystery lingering long enough to cause customer disruption.
Strong Automation
Automation is an absolute requirement, but the only thing worse than having no automation at all is having bad automation. A bug in your automation can take an entire system down faster than a human can restore it and bring it back to operation.
The key to implementing effective automation is to treat it as production software, meaning strong software development principles should apply. Even if your automation starts as a small number of scripts, you need to consider a release cycle, testing automation, deployment, and rollback procedures. This may seem like overkill for your team initially, but your whole system will eventually depend on your automation making the right decisions and having no bugs when executing. It’s hard to retrofit good SDLC processes for your automation if they’re not incorporated from the beginning.
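One way to picture a rollback procedure built into automation from day one is the small Python sketch below. It is a minimal example under assumptions: apply_with_rollback and its three callables are hypothetical, and real automation would also log each step and handle a failing rollback.

```python
from typing import Callable


def apply_with_rollback(
    apply: Callable[[], None],
    verify: Callable[[], bool],
    rollback: Callable[[], None],
) -> bool:
    """Apply a change, verify it, and roll back if it did not hold.

    Returns True if the change was kept, False if it was rolled back.
    """
    try:
        apply()
        if verify():
            return True
    except Exception:
        # A crash during apply or verify is treated like a failed verification.
        pass
    rollback()
    return False
```

Because the step is an ordinary function with injectable parts, it can be unit tested with fakes before it ever touches production, which is exactly the kind of discipline that is hard to retrofit later.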
The Right Team
An organization that practices and prioritizes resilience engineering starts with its people. Long gone are the days when an engineer would write software and then pass it off for someone else to test it and run it. Today, every engineer is responsible for ensuring their software is robust, reliable, and always on. Resilience engineering is hard and requires a lot of passionate engineers, so make sure you reward and recognize your team; ensure they know you understand the complexity of the challenges.
This takes a cultural shift and starts with who you hire. When you’re interviewing, ensure you hire people who are proud of what they’ve built in previous roles and who get satisfaction from solving tough problems while keeping a product running.
And finally, remember that merely stating these components of resilience engineering isn’t enough; bake them into your organization’s culture. Incorporate games and sayings, and ensure everyone feels like an owner, so that you win as a team and, ultimately, keep your customers satisfied.
About the Author
Hector Aguilar is the President of Technology at Okta, and is responsible for running engineering and technology. His focus is developing strategic planning for the direction of product development activities and managing the engineering team, as well as business technology and corporate IT. Prior to Okta, Hector served in a variety of roles at ArcSight since its inception, driving technology development as the CTO and Vice President of Software Development for the company during its successful IPO in 2008 and after its acquisition by Hewlett Packard.