Why do we need full-stack automation?
Infrastructure automation can deliver many benefits, summarized as speed, repeatability, and the ability to work at scale with reduced risk.
Automation is a key component of functional software-defined infrastructure and of distributed, dynamic applications. Below are additional benefits of full-stack automation.
Automated self-service frameworks enable users to requisition infrastructure on demand, including:
- Standard infrastructure components such as database instances and VPN endpoints
- Development and testing platforms
- Hardened web servers and other application instances, along with the isolated networks and secured internet access that make them useful, safe, and resistant to errors
- Analytics platforms such as Apache Hadoop, Elastic Stack, InfluxData, and Splunk
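A self-service framework like the one described above is, at its core, a catalog that maps request types to provisioning routines. The following is a minimal sketch in Python; all function and catalog names are hypothetical, and a real implementation would call a cloud or orchestration API instead of returning records.

```python
# Hypothetical self-service catalog: each entry maps a requestable
# infrastructure type to a function that provisions it.

def provision_database(name):
    # A real provisioner would call a cloud API here; this sketch
    # just returns a record describing the requested instance.
    return {"type": "database", "name": name, "status": "provisioned"}

def provision_web_server(name):
    return {"type": "web-server", "name": name, "status": "provisioned"}

CATALOG = {
    "database": provision_database,
    "web-server": provision_web_server,
}

def self_service_request(kind, name):
    """Dispatch a user's on-demand request to the matching provisioner."""
    if kind not in CATALOG:
        raise ValueError(f"unknown catalog item: {kind}")
    return CATALOG[kind](name)

print(self_service_request("database", "orders-db"))
```

Because the catalog is data, new infrastructure offerings can be added without changing the dispatch logic.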
Scale on demand
Apps and platforms need to be able to scale up and down in response to traffic and workload requirements, and to use heterogeneous capacity; an example is burst-scaling from private to public cloud with appropriate traffic shaping. Cloud platforms may provide the ability to automatically scale (autoscale) VMs, containers, or workloads on a serverless framework.
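The core of an autoscaling policy is a function that turns an observed metric into a desired replica count. The sketch below uses a proportional rule similar in spirit to the Kubernetes Horizontal Pod Autoscaler formula; the parameter names and bounds are illustrative, not any platform's actual API.

```python
import math

def desired_replicas(current, metric_value, target, min_r=1, max_r=10):
    """Proportional autoscaling sketch: scale the replica count by the
    ratio of observed load to target load, clamped to [min_r, max_r]."""
    if metric_value <= 0:
        return min_r
    proposed = math.ceil(current * metric_value / target)
    return max(min_r, min(max_r, proposed))

# 2 replicas averaging 90% CPU against a 50% target -> scale out to 4.
print(desired_replicas(current=2, metric_value=90, target=50))
```

Clamping to a minimum and maximum is what keeps a burst of traffic, or a misreported metric, from scaling a service to zero or to unbounded cost.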
Observability
An observable system enables users to infer the internal state of a complex system from its outputs. Observability (sometimes abbreviated as o11y) can be achieved through platform and application monitoring, and through proactive production testing for failure modes and performance issues. But in a dynamic operation that includes autoscaling and other automated behaviors, complexity increases and entities become ephemeral. A recent report by observability framework provider Datadog states that the average lifetime of a container under orchestration is only 12 hours; microservices and functions may live for only seconds. Making ephemeral entities observable and testing in production are only possible with automation.
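One practical consequence of ephemerality is that metrics must be keyed by stable labels (such as a service name) rather than by instance identity, since individual containers come and go. A minimal sketch, with illustrative sample data and label names:

```python
from collections import defaultdict

# Samples reported by short-lived containers. The "instance" ids are
# ephemeral; only the "service" label is stable across redeploys.
samples = [
    {"instance": "web-7f9c", "service": "web", "requests": 120},
    {"instance": "web-2b1a", "service": "web", "requests": 80},
    {"instance": "api-9d3e", "service": "api", "requests": 200},
]

def aggregate_by_service(samples):
    """Roll ephemeral per-instance counts up to the stable service label."""
    totals = defaultdict(int)
    for s in samples:
        totals[s["service"]] += s["requests"]
    return dict(totals)

print(aggregate_by_service(samples))
```

The two `web` containers may never exist at the same time, but the aggregated series for the `web` service remains continuous and queryable.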
Automated problem mitigation
Some software makers and observability experts recommend what is known as Chaos Engineering. This philosophy is based on the assertion that failure is normal: as applications scale, some parts are always failing. Because of this, apps and platforms should be engineered to:
- Minimize the effects of issues: Recognize problems quickly and route traffic to alternative capacity, ensuring that end users are not severely impacted, and that on-call operations personnel are not unnecessarily paged.
- Self-heal: Allocate resources according to policy and automatically redeploy failed components as needed to return the application to a healthy state in current conditions.
- Monitor events: Remember everything that led to the incident, so that fixes can be scheduled, and post-mortems can be performed.
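The self-healing behavior above is often implemented as a reconciliation loop: compare desired state with observed state and redeploy the difference, as an orchestrator would. A minimal sketch, with hypothetical component names:

```python
# Self-healing reconciliation sketch: given a desired replica count per
# component and the currently healthy instances, compute the redeploy
# actions needed to return the application to a healthy state.

def reconcile(desired, observed):
    """Return (component, count) actions for each under-replicated component."""
    actions = []
    for component, want in desired.items():
        have = observed.get(component, 0)
        if have < want:
            actions.append((component, want - have))
    return actions

desired = {"web": 3, "api": 2}
observed = {"web": 1, "api": 2}   # two web instances have failed
print(reconcile(desired, observed))
```

Running this pass on a schedule, and recording every action it takes, also produces the event trail needed for post-mortems.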
Some proponents of Chaos Engineering go further, using automation tools to cause controlled (or random) failures in production systems. This continually challenges Dev and Ops to anticipate issues and build in more resilience and self-healing ability. Open source projects like Chaos Monkey and “Failure-as-a-Service” platforms like Gremlin are purpose-built for breaking things, both at random and in much more controlled and empirical ways. An emerging discipline is called “failure injection testing.”
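The essence of failure injection is a wrapper that makes a call fail with a configurable probability. The sketch below is illustrative only; it is not the actual API of Chaos Monkey or Gremlin.

```python
import random

def inject_failures(probability, rng=random.random):
    """Decorator sketch of failure injection testing: each call to the
    wrapped function raises with the given probability."""
    def decorator(func):
        def wrapper(*args, **kwargs):
            if rng() < probability:
                raise RuntimeError("injected failure")
            return func(*args, **kwargs)
        return wrapper
    return decorator

@inject_failures(probability=0.0)   # 0.0 here; raise it to exercise failure paths
def handle_request():
    return "ok"

print(handle_request())
```

Setting the probability to 1.0 in a staging environment is a controlled, empirical way to verify that retries, fallbacks, and alerting actually fire.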