...
Ingenious places a high emphasis on system reliability, which is reflected in its structured approach to rollback procedures. These procedures are built on Kubernetes, specifically on Deployments with rolling updates. Each time a Deployment's pod template is updated, Kubernetes creates a new ReplicaSet and keeps the previous ones as revision history. If a new release has an issue, Ingenious can roll back by reverting the Deployment to an earlier ReplicaSet, which keeps disruption to a minimum. As a fail-safe, Ingenious can also release a service directly from its master branch. Together, Kubernetes' native rollback and the option to deploy directly from the master branch let Ingenious address deployment anomalies quickly and maintain a consistent, reliable service for its users.
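As an illustration of the rollback path described above, the following minimal Python sketch builds the kubectl command that reverts a Deployment to an earlier revision. It only assembles the argument list and does not contact a cluster. The function name, the service name "checkout-service", and the namespace "production" are hypothetical and are not taken from Ingenious' actual tooling; the kubectl subcommand and flags are the standard ones.

```python
from typing import List, Optional


def build_rollback_command(
    deployment: str,
    namespace: str = "default",
    revision: Optional[int] = None,
) -> List[str]:
    """Build the kubectl argument list that rolls a Deployment back.

    Without a revision, kubectl reverts to the previous revision.
    With a revision, it reverts to that specific revision.
    """
    if revision is not None and revision < 1:
        raise ValueError("revision must be a positive integer")

    command = [
        "kubectl", "rollout", "undo",
        f"deployment/{deployment}",
        "--namespace", namespace,
    ]
    if revision is not None:
        command.append(f"--to-revision={revision}")
    return command


if __name__ == "__main__":
    # Hypothetical service and namespace names, for illustration only.
    print(" ".join(build_rollback_command("checkout-service", "production")))
```

Returning the argument list instead of executing it keeps the sketch safe to run anywhere; in practice the list could be passed to `subprocess.run` on a machine with cluster access.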
Escalation Procedures in Case of Incidents
Ingenious adopts a structured and tiered approach to incident management, ensuring that any disruptions to the service are addressed swiftly and effectively.
Incident Classification:
System Impaired:
The impact on the application is limited in both scope and severity. There's no overt user impact, and the business ramifications are relatively mild. This might entail minor inconveniences or slight disruptions to non-critical business processes.
Production System Impaired:
Here, the application experiences a noticeable degradation in a production environment. Users might encounter errors or difficulties at a discernible rate. The business impact is moderate, with risks including decreased productivity and revenue loss. However, a feasible workaround is often available to mitigate the most significant business impacts.
Production System Down:
This is the most severe classification: the application is essentially unusable in a production setting. Users face a substantial rate of errors, and the business implications are critical, ranging from revenue loss to potential data integrity issues. Typically, no workaround can be put in place within 30 minutes.
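The three classifications above could be encoded roughly as follows. This is a sketch under stated assumptions: the level names and the 30-minute workaround threshold come from the text, but the input fields and the exact decision rule are hypothetical, since the levels are described qualitatively.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Severity(Enum):
    SYSTEM_IMPAIRED = "System Impaired"
    PRODUCTION_SYSTEM_IMPAIRED = "Production System Impaired"
    PRODUCTION_SYSTEM_DOWN = "Production System Down"


@dataclass(frozen=True)
class IncidentSignals:
    """Hypothetical inputs; the field names are assumptions."""
    in_production: bool
    users_affected: bool
    application_unusable: bool
    # Estimated minutes until a workaround is in place; None if none is known.
    workaround_minutes: Optional[int] = None


# Threshold taken from the "Production System Down" description.
WORKAROUND_LIMIT_MINUTES = 30


def classify(signals: IncidentSignals) -> Severity:
    """Map incident signals to a severity level (assumed decision rule)."""
    no_quick_workaround = (
        signals.workaround_minutes is None
        or signals.workaround_minutes > WORKAROUND_LIMIT_MINUTES
    )
    if signals.in_production and signals.application_unusable and no_quick_workaround:
        return Severity.PRODUCTION_SYSTEM_DOWN
    if signals.in_production and (signals.users_affected or signals.application_unusable):
        return Severity.PRODUCTION_SYSTEM_IMPAIRED
    return Severity.SYSTEM_IMPAIRED
```

For example, an unusable production application with no known workaround maps to `Severity.PRODUCTION_SYSTEM_DOWN`, while the same outage with a workaround available within 30 minutes maps to `Severity.PRODUCTION_SYSTEM_IMPAIRED`.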
...
Notification and Response: Once an incident is identified, internal stakeholders are promptly notified via Slack, and the monitoring system generates the incident report automatically. When a customer reports an outage, a support ticket is opened. First responders depend on the nature of the incident: the Solution team handles incidents raised through a customer ticket, while the developers responsible for the affected module handle incidents confined to that module. Slack is the primary channel for both internal coordination and customer communication. For key customers, additional channels such as email or phone calls are also used.
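The routing rules above can be sketched as two small functions. This is an illustrative reading of the text, not Ingenious' actual tooling: the names are hypothetical, and the case the text does not cover (an incident that is neither customer-reported nor confined to one module) deliberately returns `None` instead of guessing a responder.

```python
from enum import Enum
from typing import List, Optional


class ReportSource(Enum):
    MONITORING = "monitoring"            # automatically generated incident report
    CUSTOMER_TICKET = "customer_ticket"  # support ticket opened by a customer


def first_responder(
    source: ReportSource,
    affected_module: Optional[str] = None,
) -> Optional[str]:
    """Return who responds first, or None where the text defines no rule."""
    if source is ReportSource.CUSTOMER_TICKET:
        return "Solution team"
    if affected_module is not None:
        return f"developers responsible for {affected_module}"
    return None


def communication_channels(is_key_customer: bool) -> List[str]:
    """Slack is always used; key customers also get email or phone calls."""
    channels = ["slack"]
    if is_key_customer:
        channels += ["email", "phone"]
    return channels
```

For instance, `first_responder(ReportSource.MONITORING, "billing")` returns `"developers responsible for billing"`, where "billing" is a made-up module name.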
Timeliness: While there isn't a strictly defined timeframe, Ingenious prioritizes addressing incidents as swiftly as possible. The response urgency is proportionate to the incident's severity.
Post-Incident Analysis: When it is deemed necessary, which is rare, Ingenious conducts a post-incident review, commonly referred to as a post-mortem. The review identifies the root causes of the incident and produces clear action items. These action items are followed through so that lessons are learned and preventative measures are put in place.