Introduction to Software Observability at Hubtel
In our mission to Drive Africa forward by enabling everyone to find and pay for everyday essentials, gaining deep insights into our platform's operations is crucial. Observability allows us to answer key questions such as:
- How is the platform performing?
- How are the applications on the platform performing?
- What experiences are our customers having on the platform?
- How much resource is the platform consuming?
An observability solution helps us address these questions and many more, providing a comprehensive understanding of our system's behavior.
What is Observability?
Observability is the ability to understand what is happening inside a system from its external outputs, that is, to infer the system's internal state from the data it produces. This capability is crucial for complex, distributed systems, where it often uncovers issues or behaviors on our platform that we did not anticipate. Observability emphasizes a holistic understanding of the system's behavior.
Why do we need Observability?
The benefits of adopting this approach include:
- Our systems are not only operational but also measurable, debuggable, and performant.
- As engineers, we are empowered to diagnose, predict, and resolve issues on our platform quickly and effectively, ensuring an always-available platform for our customers.
- We are well-equipped to support our product managers and research analysts in understanding user behavior on our platform, providing key insights to improve our product experiences.
- With observability, we shift from being reactive to proactive. This approach allows us to detect and resolve issues before they affect users, identify opportunities to optimize performance, and gather insights to enhance overall reliability.
Observability is essential for supporting continuous improvement in our software lifecycle, enabling data-driven decisions and fostering accountability.
How does Observability work?
Observability is achieved by collecting and analyzing telemetry data, which are the outputs of our systems. Telemetry involves gathering and sending data from remote systems to a central location for monitoring and analysis. This data, also known as telemetry signals, helps us understand system performance, health, and user experience. By leveraging these insights, we can make informed decisions, quickly address issues, and continuously improve our platform's reliability and efficiency.
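To make the "gather locally, send to a central location" idea concrete, here is a minimal sketch using only the Python standard library. The `TelemetryBuffer` class, the `central_store` list, and the signal names are hypothetical and exist only for illustration. A real setup would typically ship batches over the network to a telemetry backend rather than append them to a list.

```python
import json
import time
from typing import Callable, Dict, List


class TelemetryBuffer:
    """Collects signals locally and ships them in batches to a central sink."""

    def __init__(self, export: Callable[[str], None], batch_size: int = 3):
        self._export = export          # hypothetical transport, e.g. a network call
        self._batch_size = batch_size
        self._pending: List[Dict] = []

    def record(self, kind: str, name: str, value) -> None:
        """Buffer one signal and flush once the batch is full."""
        self._pending.append(
            {"kind": kind, "name": name, "value": value, "timestamp": time.time()}
        )
        if len(self._pending) >= self._batch_size:
            self.flush()

    def flush(self) -> None:
        """Serialize whatever is pending and hand it to the exporter."""
        if not self._pending:
            return
        payload = json.dumps(self._pending)
        self._pending = []
        self._export(payload)


# A plain list stands in for the central location (assumption for this sketch).
central_store: List[str] = []
buffer = TelemetryBuffer(export=central_store.append, batch_size=2)

buffer.record("metric", "cpu.usage_percent", 41.5)
buffer.record("log", "payment.accepted", "order completed")  # fills the batch, triggers a flush
buffer.record("metric", "requests.count", 1)
buffer.flush()                                                # ship the remainder
```

Batching is one common design choice here: it trades a small delay for fewer calls to the central location.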
Types of Telemetry Signals
Telemetry signals can be categorized into:
- Metrics: Numeric measurements and recordings of system state, such as CPU usage, memory consumption, request rates, web vitals, and error rates. Metrics provide a high-level view of system health and performance over time.
- Logs: Detailed point-in-time records of system events, capturing information about activities, errors, transactions, and user monitoring events. Logs are crucial for debugging and tracing issues.
- Traces: Records of a single transaction or journey through a distributed system, tracking the flow of a request across multiple services and components. Traces help identify bottlenecks and diagnose performance issues in distributed systems.
- Profiles: Insights into resource usage and performance characteristics of applications. Profiles help identify performance hotspots, optimize code, and ensure efficient resource use.
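The signal types above can be sketched as simple data shapes. This is an illustrative model only, not the schema of any particular observability tool: the `Metric`, `LogRecord`, and `Span` classes, their fields, and the `handle_payment` function are all assumptions made for this example. Profiles are left out because they normally come from a profiler (for example, Python's built-in `cProfile` module) rather than from records the application builds itself.

```python
import json
import time
import uuid
from dataclasses import asdict, dataclass, field


@dataclass
class Metric:
    """A numeric measurement sampled at a point in time."""
    name: str
    value: float
    unit: str
    timestamp: float = field(default_factory=time.time)


@dataclass
class LogRecord:
    """A point-in-time record of a single event."""
    level: str
    message: str
    attributes: dict
    timestamp: float = field(default_factory=time.time)


@dataclass
class Span:
    """One timed step of a request; spans sharing a trace_id form a trace."""
    trace_id: str
    name: str
    start: float
    end: float

    @property
    def duration_ms(self) -> float:
        return (self.end - self.start) * 1000.0


def handle_payment(amount: float) -> list:
    """Hypothetical request handler that emits one signal of each kind."""
    trace_id = uuid.uuid4().hex
    start = time.perf_counter()
    # ... the real work would happen here ...
    end = time.perf_counter()
    return [
        Metric(name="payments.requests", value=1, unit="count"),
        LogRecord(level="INFO", message="payment accepted",
                  attributes={"amount": amount, "trace_id": trace_id}),
        Span(trace_id=trace_id, name="handle_payment", start=start, end=end),
    ]


if __name__ == "__main__":
    for signal in handle_payment(25.0):
        print(type(signal).__name__, json.dumps(asdict(signal)))
```

Note that the log record carries the same `trace_id` as the span. Sharing an identifier like this is one way to connect a log line back to the request that produced it.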
Gathering Telemetry Data
We gather telemetry data using tools and services that collect, process, and analyze metrics, logs, traces, and profiles. This can be done in two ways:
- Auto Instrumentation: Uses tools and libraries to automatically collect telemetry data without manual code changes. Examples include agents or plugins that integrate with our applications and infrastructure.
- Manual Instrumentation: Requires developers to add code to collect telemetry data, providing more control and customization. Developers can use APIs and SDKs from observability tools.
Combining both methods ensures comprehensive coverage and deep insights into system performance, health, and user experience.
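The difference between the two approaches, and how they combine, can be illustrated with a small standard-library sketch. The `auto_instrument` decorator, the `COUNTERS` and `DURATIONS_MS` stores, and the `checkout` function are hypothetical names invented for this example. Real auto instrumentation is usually provided by an agent or library that hooks into frameworks for you; the decorator below only imitates the idea that the function body stays untouched.

```python
import functools
import time
from collections import defaultdict

# Hypothetical in-memory stores standing in for a real telemetry backend.
COUNTERS = defaultdict(int)
DURATIONS_MS = defaultdict(list)


def auto_instrument(func):
    """Record call counts and durations without editing the function body."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            # Runs whether the call succeeds or raises.
            elapsed_ms = (time.perf_counter() - start) * 1000.0
            COUNTERS[f"{func.__name__}.calls"] += 1
            DURATIONS_MS[func.__name__].append(elapsed_ms)
    return wrapper


@auto_instrument
def checkout(items: list) -> float:
    """Sum the prices of (name, price) pairs."""
    total = sum(price for _, price in items)
    # Manual instrumentation: a business-specific measurement that only
    # the developer knows is worth recording.
    COUNTERS["checkout.items_sold"] += len(items)
    return total


if __name__ == "__main__":
    checkout([("bread", 10.0), ("milk", 5.5)])
    print(dict(COUNTERS))
```

The generic signals (calls, latency) come for free from the wrapper, while the domain-specific counter has to be written by hand, which is why the two methods complement each other.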