The second pillar of an Engineering Culture
An Engineering Culture consists of eight pillars. The second pillar is called Appropriate Continuity. In line with the DevOps mantra “You build it, you run it!”, we think autonomous teams should be fully responsible for their software. On this page, we investigate the run part of that mantra. What should we focus on when building an application that also has to run in production, and what measures should we take to keep it running smoothly for our users?
Finding the appropriate level of continuity
Ask a business person how often a certain application may be down, and the answer is most often “Never”, or at least “As little as possible”. Because of that, the software industry focuses heavily on continuity. That is not a bad thing in itself, but every step you take to make an application more robust also adds complexity, which makes it harder to add new features or roll them out to your end-users. Finding the right balance is key. You probably don’t need the same level of continuity for all application components.
What matters most is how a loss of continuity impacts your users. Can your users still use the application while work piles up on the back end because something there is down? Depending on the scenario, that might not affect users for hours or even days. If that is the case, it can be smart to invest less in the stability of the back-end process than in the front end your users are facing.
This is a very simple scenario, but it is meant to make you think about these trade-offs. In the end, it is a business decision which areas deserve investment in more continuity.
Architecting for continuity
As mentioned before, “you probably don’t need the same level of continuity for all application components”. When building an application, you should take this into account right from the start and design an architecture that can handle these differences. An architecture that is loosely coupled and uses asynchronous, event-based messaging makes it possible to scale components independently and to add continuity where it is needed. Read more on these types of architecture in the “State-of-the-art Software Engineering” article.
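To make the idea of loose coupling through events a bit more concrete, here is a minimal sketch in Python. It is only an illustration: the class names (OrderFrontend, InvoiceBackend), the event shape, and the in-process queue are assumptions made for this example. In a real landscape the queue would typically be a message broker or event bus.

```python
import queue


class OrderFrontend:
    """User-facing component: accepts an order and publishes an event."""

    def __init__(self, events):
        self._events = events

    def place_order(self, order_id):
        # Publishing the event is the only coupling to the back end;
        # the front end does not wait for the back end to respond.
        self._events.put({"type": "order_placed", "order_id": order_id})
        return f"accepted {order_id}"


class InvoiceBackend:
    """Back-end component: consumes events whenever it is available."""

    def __init__(self, events):
        self._events = events
        self.processed = []

    def drain(self):
        """Process every event currently waiting and return how many were handled."""
        handled = 0
        while True:
            try:
                event = self._events.get_nowait()
            except queue.Empty:
                return handled
            self.processed.append(event["order_id"])
            handled += 1


events = queue.Queue()  # stand-in for a message broker (assumption)
frontend = OrderFrontend(events)
backend = InvoiceBackend(events)

# The back end is "down": orders are still accepted and events pile up.
replies = [frontend.place_order(f"order-{n}") for n in range(3)]
backlog = events.qsize()

# The back end recovers and catches up on the backlog.
handled = backend.drain()
```

In this sketch the users get their three orders accepted while the back end is unavailable, and the backlog of three events is processed once it returns. The front end and the back end can therefore be given different levels of continuity.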
Being able to add new features without impacting continuity is also part of building modern applications. To read more on that, see our article on “Smooth Delivery”. Releasing to production frequently, in an automated and controlled way, is a really good way to improve continuity in general, because downtime is often caused by deployments or configuration changes.
Having an event-driven architecture built from loosely coupled components or services is one thing. Beyond that, having the right tools in your application landscape helps you measure continuity and find problems before they impact your users.
When building continuity into your application landscape it’s also important to know that these measures actually work in case of disaster. Chaos Engineering is a practice that can help in finding out if your application is as robust as you thought it would be.
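A very small chaos experiment could look like the sketch below. It is an illustration under stated assumptions, not a description of any particular chaos tool: the functions inject_faults, price_service, and get_price are hypothetical, and the fault is injected in-process rather than in real infrastructure. The experiment deliberately makes a dependency fail and checks whether the continuity measure (falling back to a cached value) still gives users an answer.

```python
import random


def inject_faults(func, failure_rate, rng):
    """Wrap func so that a fraction of calls fail, simulating a flaky dependency."""

    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            wrapper.injected += 1
            raise ConnectionError("injected fault")
        return func(*args, **kwargs)

    wrapper.injected = 0  # number of faults injected so far
    return wrapper


def price_service(product_id):
    # Hypothetical downstream dependency used only for this example.
    return {"product_id": product_id, "price": 10.0, "source": "live"}


def get_price(product_id, dependency, cache):
    """Continuity measure under test: fall back to the last known price."""
    try:
        result = dependency(product_id)
        cache[product_id] = result
        return result
    except ConnectionError:
        cached = cache.get(product_id)
        if cached is None:
            return None  # no fallback available: the user is impacted
        return {**cached, "source": "cache"}


def run_experiment(calls=1000, failure_rate=0.3, seed=42):
    """Inject faults into the dependency and count how many calls were still answered."""
    rng = random.Random(seed)  # seeded so the experiment is repeatable
    dependency = inject_faults(price_service, failure_rate, rng)
    cache = {"sku-1": price_service("sku-1")}  # cache warmed before the experiment
    answers = [get_price("sku-1", dependency, cache) for _ in range(calls)]
    return {
        "calls": calls,
        "injected": dependency.injected,
        "answered": sum(answer is not None for answer in answers),
    }


report = run_experiment()
```

If the number of answered calls equals the number of calls while faults were being injected, the fallback held up in this experiment. If it does not, you have found a weakness before a real outage did.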
Traditional monitoring focuses on measuring things that we know might go wrong, for example disk space (disks might run out of free space). Modern architectures make applications more and more complex, and so many things can go wrong that this type of monitoring alone is no longer enough.
Observability is the term for a more modern approach to monitoring, in which you focus on being able to find issues wherever they occur (the “Unknown Unknowns”), whereas traditional monitoring focuses on the “Known Unknowns”.
There are many tools that help you measure continuity and support observability through logs, metrics, and traces. As described in the first paragraph, the most important thing to focus on is user impact.
Creating a Service Level Objective (SLO) based on Service Level Indicators (SLIs) is a way to set internal guidelines for the level of continuity you are aiming for, so that you can steer towards it.
If you treat the room your objective leaves below 100% as an “Error Budget” (a term from the SRE space), you can use that time to run experiments or learn how users are impacted. The SLIs that make up an SLO can focus on actual user impact and are based on the logs, metrics, and traces from instrumentation in your application.
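The arithmetic behind an SLI, an SLO, and an error budget can be sketched in a few lines of Python. This is a simplified illustration: the Request record, the 500 ms latency threshold, and the function names are assumptions chosen for this example, and in practice the numbers would come from your own logs, metrics, and traces.

```python
from dataclasses import dataclass


@dataclass
class Request:
    status: int        # HTTP status code returned to the user
    latency_ms: float  # time the user waited for the response


def is_good(request, max_latency_ms=500.0):
    """A request counts as 'good' if it did not fail server-side and was fast enough."""
    return request.status < 500 and request.latency_ms <= max_latency_ms


def sli(requests, max_latency_ms=500.0):
    """Service Level Indicator: the fraction of requests that were good."""
    if not requests:
        return 1.0
    good = sum(1 for r in requests if is_good(r, max_latency_ms))
    return good / len(requests)


def error_budget_remaining(requests, slo_target, max_latency_ms=500.0):
    """Share of the error budget left: 1.0 is untouched, 0.0 is spent, below 0 is overspent."""
    allowed_bad = (1.0 - slo_target) * len(requests)
    if allowed_bad == 0:
        return 0.0  # a 100% objective (or no traffic) leaves no budget to spend
    actual_bad = sum(1 for r in requests if not is_good(r, max_latency_ms))
    return 1.0 - actual_bad / allowed_bad


def allowed_downtime_minutes(slo_target, days=30):
    """Time-based view of the same budget over a window of the given length."""
    return (1.0 - slo_target) * days * 24 * 60


# Example window: 994 good requests, 4 server errors, 2 responses that were too slow.
window = (
    [Request(200, 120.0)] * 994
    + [Request(500, 80.0)] * 4
    + [Request(200, 900.0)] * 2
)
measured = sli(window)
remaining = error_budget_remaining(window, slo_target=0.99)
monthly_budget = allowed_downtime_minutes(0.999)
```

With an SLO of 99%, this window of 1,000 requests may contain 10 bad ones. Six were bad, so roughly 40% of the budget is left. Viewed as time, an SLO of 99.9% over 30 days leaves about 43 minutes of budget.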
Here are some resources to learn more about continuity, observability, and SRE: