Monitoring, On-Call, Rollback, and Incident Response

Unified Monitoring with Prometheus in a Micro-Service Environment

Ingenious uses Prometheus as a single, unified monitoring system designed for its micro-service architecture. Because each micro-service is deployed independently, every deployment is monitored through the same system, so health and performance metrics are captured consistently across services. This speeds up the diagnosis and resolution of issues.


Comprehensive Instrumentation Across Layers

Ingenious instruments every layer of its infrastructure. At the foundation, managed Kubernetes nodes are monitored for CPU usage, RAM allocation, I/O operations, and related metrics. One level up, the Docker containers that encapsulate the individual micro-services are monitored as well. Application instances are instrumented to expose the real-time performance and health of each app. At the top level, the services themselves, which may be composed of multiple micro-services or application instances, are instrumented too, giving an end-to-end view of system health.
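
As an illustration of what per-layer instrumentation can look like on the wire, the following minimal sketch renders gauge samples in the Prometheus text exposition format, with a label identifying the layer. The metric name, label names, and values are hypothetical and not taken from Ingenious's setup; a real service would more likely use an official Prometheus client library than hand-rendered text.

```python
from typing import Dict, List, Tuple

# One sample = (labels, value). The label set is an assumption for illustration.
Sample = Tuple[Dict[str, str], float]


def _escape(value: str) -> str:
    """Escape a label value as required by the text exposition format."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_gauge(name: str, help_text: str, samples: List[Sample]) -> str:
    """Render one gauge metric family in Prometheus text exposition format."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} gauge"]
    for labels, value in samples:
        if labels:
            # Sort labels so the output is deterministic.
            pairs = ",".join(
                f'{key}="{_escape(val)}"' for key, val in sorted(labels.items())
            )
            lines.append(f"{name}{{{pairs}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical samples, one per layer described above.
    print(
        render_gauge(
            "demo_cpu_usage_ratio",
            "CPU usage ratio (hypothetical metric).",
            [
                ({"layer": "node", "instance": "node-1"}, 0.42),
                ({"layer": "container", "instance": "orders-7d9f"}, 0.17),
            ],
        )
    )
```

Using one metric name with a layer label, rather than one metric name per layer, is only one possible convention; it keeps dashboards able to compare layers with a single query.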


Prometheus/Grafana Integration with CI/CD

In line with modern DevOps practice, Ingenious treats its monitoring tools, Prometheus and Grafana, like any other application in its ecosystem. They are built and configured through the same Continuous Integration/Continuous Deployment (CI/CD) approach, using git for version control and Jenkins for automation. This keeps deployment practices consistent and lets updates or configuration changes to the monitoring tools be rolled out quickly and reliably.


Operations and On-Call Duty with Slack

Ingenious uses Slack as the primary communication channel for operations and on-call duty. Both automated alerts from the monitoring system and messages from team members arrive in Slack, so the right people are notified immediately and can begin diagnosing and resolving the issue. This keeps the team responsive and helps minimize downtime.
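
A minimal sketch of how an automated alert could be posted to Slack, assuming a Slack incoming webhook is used (the source does not say which Slack integration Ingenious relies on). The function names, the message layout, and the webhook URL are hypothetical placeholders.

```python
import json
import urllib.request


def build_slack_alert(alert_name: str, severity: str, summary: str) -> bytes:
    """Build the JSON body for a Slack incoming webhook message."""
    text = f"[{severity}] {alert_name}: {summary}"
    return json.dumps({"text": text}).encode("utf-8")


def send_slack_alert(webhook_url: str, body: bytes, timeout: float = 5.0) -> int:
    """POST the message to the webhook and return the HTTP status code."""
    request = urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


if __name__ == "__main__":
    body = build_slack_alert(
        "HighErrorRate", "Production System Impaired", "5xx rate above threshold"
    )
    print(body.decode("utf-8"))
    # send_slack_alert("<your-webhook-url>", body)  # placeholder URL, not a real one
```

Building the payload separately from sending it keeps the message format testable without network access.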

 

Robust Rollback Procedures with Kubernetes

Ingenious bases its rollback procedures on Kubernetes, specifically on Deployments with rolling updates. Each time a Deployment's pod template is updated, Kubernetes creates a new ReplicaSet and keeps earlier ones as revision history. If a new release has a problem, the Deployment can be rolled back to a previous revision, which brings the corresponding earlier ReplicaSet back into service with minimal disruption. As a fail-safe, Ingenious can also always release a service directly from its master branch. Together, Kubernetes' native rollback and the ability to deploy from master let Ingenious address deployment problems quickly and keep the service reliable for its users.
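
The rollback described above maps onto the standard kubectl rollout commands. The sketch below only builds the command lines so they can be reviewed or passed to a process runner; the deployment name, namespace, and revision number are hypothetical, and how Ingenious actually triggers rollbacks (manually or from Jenkins) is not stated in the source.

```python
from typing import List, Optional


def history_command(deployment: str, namespace: str = "default") -> List[str]:
    """Command that lists the recorded revisions of a Deployment."""
    return [
        "kubectl", "rollout", "history",
        f"deployment/{deployment}",
        "--namespace", namespace,
    ]


def rollback_command(
    deployment: str,
    namespace: str = "default",
    to_revision: Optional[int] = None,
) -> List[str]:
    """Command that rolls a Deployment back.

    Without to_revision, kubectl reverts to the previous revision.
    """
    if to_revision is not None and to_revision < 1:
        raise ValueError("to_revision must be a positive revision number")
    command = [
        "kubectl", "rollout", "undo",
        f"deployment/{deployment}",
        "--namespace", namespace,
    ]
    if to_revision is not None:
        command.append(f"--to-revision={to_revision}")
    return command


if __name__ == "__main__":
    # Hypothetical deployment and namespace names.
    print(" ".join(history_command("checkout", "prod")))
    print(" ".join(rollback_command("checkout", "prod", to_revision=3)))
```

The resulting list can be handed to subprocess.run(command, check=True) on a machine with cluster access.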

 

Escalation Procedures in Case of Incidents

Ingenious adopts a structured and tiered approach to incident management, ensuring that any disruptions to the service are addressed swiftly and effectively.

Incident Classification:

  1. System Impaired:

    • The impact on the application is limited in both scope and severity. There is no overt user impact, and the business consequences are relatively mild, such as minor inconveniences or slight disruptions to non-critical business processes.

  2. Production System Impaired:

    • The application is noticeably degraded in a production environment. Users encounter errors or difficulties at a discernible rate. The business impact is moderate, with risks including decreased productivity and revenue loss. However, there is often a feasible workaround that mitigates the most significant business impact.

  3. Production System Down:

    • The most severe classification: the application is essentially unusable in production. Users face a substantial rate of errors, and the business implications are critical, ranging from revenue loss to potential data integrity issues. Typically, no workaround is available within 30 minutes.
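
The three levels can be expressed as a small decision function. This is a sketch under stated assumptions: the input fields and the order of the checks are an interpretation of the descriptions above, not a documented Ingenious rule set, and treating every non-production incident as the lowest level is an added assumption.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    SYSTEM_IMPAIRED = 1
    PRODUCTION_SYSTEM_IMPAIRED = 2
    PRODUCTION_SYSTEM_DOWN = 3


@dataclass
class IncidentFacts:
    """Hypothetical inputs a responder would fill in during triage."""
    in_production: bool
    users_see_errors: bool
    app_unusable: bool
    workaround_within_30_min: bool


def classify(facts: IncidentFacts) -> Severity:
    """Map triage facts to one of the three incident levels."""
    if not facts.in_production:
        # Assumption: incidents outside production never exceed level 1.
        return Severity.SYSTEM_IMPAIRED
    if facts.app_unusable and not facts.workaround_within_30_min:
        return Severity.PRODUCTION_SYSTEM_DOWN
    if facts.app_unusable or facts.users_see_errors:
        return Severity.PRODUCTION_SYSTEM_IMPAIRED
    return Severity.SYSTEM_IMPAIRED


if __name__ == "__main__":
    outage = IncidentFacts(
        in_production=True,
        users_see_errors=True,
        app_unusable=True,
        workaround_within_30_min=False,
    )
    print(classify(outage).name)
```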

Notification and Response: When an incident is identified, internal stakeholders are notified promptly via Slack, and the monitoring system generates the incident report automatically. When a customer reports an outage, a support ticket is opened. The first responder depends on the nature of the incident: if a customer creates a ticket, the Solution team takes charge; if the incident is localized, the developers responsible for the affected module step in. Slack is the primary channel for both internal coordination and customer communication. For key customers, additional channels such as email or phone calls are used.
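
The routing rules in this paragraph are simple enough to write down directly. The sketch below is illustrative only: the function names and the string values are hypothetical, and the source does not describe any tooling that automates this routing.

```python
from typing import List


def first_responder(reported_by_customer: bool) -> str:
    """Pick the first responder according to how the incident was raised."""
    if reported_by_customer:
        # A customer-created ticket is handled by the Solution team.
        return "Solution team"
    # Otherwise the incident is treated as localized to one module.
    return "developers of the affected module"


def customer_channels(is_key_customer: bool) -> List[str]:
    """Channels used to keep a customer informed during an incident."""
    channels = ["slack"]
    if is_key_customer:
        channels += ["email", "phone"]
    return channels


if __name__ == "__main__":
    print(first_responder(reported_by_customer=True))
    print(customer_channels(is_key_customer=True))
```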

Timeliness: There is no strictly defined response timeframe. Ingenious addresses incidents as quickly as possible, with urgency proportionate to the incident's severity.

Post-Incident Analysis: When it is deemed necessary, which is rare, Ingenious conducts a post-incident review (post-mortem). The review identifies the root causes of the incident and produces clear action items, which are then followed up so that lessons are learned and preventative measures are put in place.