Managed Kubernetes Observability

The Managed Kubernetes Observability service provides a fully integrated monitoring, logging, and alerting solution for Kubernetes clusters and workloads. It ensures centralized collection, storage, and analysis of metrics and logs while enabling proactive alerting, delivering complete observability across infrastructure and applications. This comprehensive solution combines:

Unified metrics & logs collection to aggregate infrastructure metrics, application performance data, and container logs in a single platform
Capacity management through real-time resource utilization tracking and visualization
Ruleset for proactive and reactive alerting to allow preemptive issue resolution and quick response to incidents
Multi-channel notification and alerting system for flexible alert routing with support for major communication platforms
Dashboard catalog for comprehensive visualization of infrastructure and application health and performance

Reason why

The Managed Kubernetes Observability service consists of three main parts. Together, they provide a comprehensive solution for monitoring, logging, and alerting. All components are fully managed with regular updates, scaling, and security patching.

Data Collection

Data collection is implemented with the industry-standard tools Prometheus (metrics) and Loki (logs).

The service collects metrics at every layer - from infrastructure resources and Kubernetes components to application microservices. This includes:

Cluster health metrics (API server performance, etcd latency, node conditions)
Workload resource consumption (CPU/Memory utilization at pod/container level)
Network performance metrics (latency, packet loss, DNS resolution)
Storage I/O patterns and volume capacity trends
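
As an illustration of how such metrics can be retrieved, the following minimal sketch builds a request URL for the instant-query endpoint of the Prometheus HTTP API (/api/v1/query). The base URL is a hypothetical placeholder, and the PromQL expression is only an example based on the common cAdvisor metric container_cpu_usage_seconds_total; the actual endpoint and available metrics depend on the installation.

```python
from urllib.parse import urlencode

# Hypothetical address; the real Prometheus URL depends on the installation.
PROMETHEUS_URL = "http://prometheus.example.internal:9090"


def instant_query_url(base_url: str, promql: str) -> str:
    """Build the URL for the Prometheus instant-query endpoint."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


# Example expression: CPU usage per pod, averaged over the last 5 minutes.
cpu_by_pod = "sum by (namespace, pod) (rate(container_cpu_usage_seconds_total[5m]))"

url = instant_query_url(PROMETHEUS_URL, cpu_by_pod)
print(url)
```

Sending the request (and any authentication it needs) is left out on purpose, since it depends on how the cluster exposes Prometheus.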

Log management features include:

Centralized log aggregation with automatic tagging
Preserved log context through Kubernetes metadata
Long-term archival with configurable retention policies

Data Visualization

The solution offers both a catalog of pre-configured dashboards and flexible customization options. The dashboard catalog includes:

Cluster Monitoring - Cluster level health, performance and capacity metrics
Namespace Monitoring - Resource utilization and performance metrics per namespace
Set of dashboards for specific exporters, such as blackbox exporter for web-endpoint and cert monitoring
Various application specific dashboards, e.g. for Velero, Loki, Longhorn, Falco, etc.

Alerting & Notification System

Alerting & Notification capabilities include:

Pre-configured alerts for critical cluster events
Customizable alert rules using PromQL expressions
Scheduled maintenance windows for alert suppression
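
To give an idea of what a custom alert rule based on a PromQL expression can look like, here is a minimal sketch that assembles a PrometheusRule resource (the rule format used by the Prometheus Operator) as a Python dictionary and prints it as JSON. The rule name, namespace, threshold, and labels are assumptions chosen for illustration; how custom rules are handed over to the managed service is not covered here.

```python
import json


def make_alert_rule(name, expr, duration, severity, summary):
    """Return one alerting rule in the Prometheus rule format."""
    return {
        "alert": name,
        "expr": expr,            # PromQL expression that triggers the alert
        "for": duration,         # how long the expression must hold before firing
        "labels": {"severity": severity},
        "annotations": {"summary": summary},
    }


# Hypothetical rule: node memory usage above 90% for 10 minutes.
rule = make_alert_rule(
    name="NodeMemoryHigh",
    expr="(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) > 0.9",
    duration="10m",
    severity="warning",
    summary="Memory usage on {{ $labels.instance }} is above 90%",
)

# Wrap the rule in a PrometheusRule resource (names are placeholders).
manifest = {
    "apiVersion": "monitoring.coreos.com/v1",
    "kind": "PrometheusRule",
    "metadata": {"name": "custom-rules", "namespace": "monitoring"},
    "spec": {"groups": [{"name": "custom.rules", "rules": [rule]}]},
}

print(json.dumps(manifest, indent=2))
```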

Supported notification channels:

Email (SMTP/API-based)
Messaging platforms (e.g. Microsoft Teams)
Webhook endpoints
Google Chat (cAlert)
Telegram

Included Components and Configurations

Monitoring

The Kube-Prometheus Stack, an integral part of our observability framework, includes Prometheus, Alertmanager, and a set of exporters and Prometheus rules for monitoring Kubernetes clusters.

Component	Description
Prometheus Operator	The Prometheus Operator deploys and manages Prometheus, an open-source monitoring tool widely used in cloud-native environments, especially with Kubernetes. Prometheus stores metrics as time series data with timestamps and labels and supports flexible, powerful queries through its query language, PromQL. It collects data using a pull model but can also accept pushed metrics via an intermediary gateway.
Alertmanager	Prometheus Alertmanager is closely integrated with Prometheus Operator. It takes care of deduplicating, grouping, and routing alerts to the correct receiver. It also manages silencing and inhibition of alerts. The Alertmanager configuration allows defining complex notification workflows and routing based on labels.
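
The deduplication, grouping, and label-based routing described above can be pictured with a small toy model. This is a simplified sketch, not Alertmanager's actual implementation; the receiver names, route matchers, and grouping labels are hypothetical.

```python
from collections import defaultdict

# Hypothetical routes: the first route whose labels all match wins.
ROUTES = [
    ({"severity": "critical"}, "on-call-webhook"),
    ({"team": "platform"}, "platform-teams-channel"),
]
DEFAULT_RECEIVER = "email"


def pick_receiver(labels):
    """Return the receiver of the first route matching the alert's labels."""
    for matchers, receiver in ROUTES:
        if all(labels.get(key) == value for key, value in matchers.items()):
            return receiver
    return DEFAULT_RECEIVER


def group_alerts(alerts, group_by=("alertname", "namespace")):
    """Drop duplicate alerts, then group the rest per receiver and group key."""
    groups = defaultdict(list)
    seen = set()
    for labels in alerts:
        fingerprint = frozenset(labels.items())
        if fingerprint in seen:  # identical label set: treat as duplicate
            continue
        seen.add(fingerprint)
        key = (pick_receiver(labels), tuple(labels.get(k, "") for k in group_by))
        groups[key].append(labels)
    return dict(groups)


alerts = [
    {"alertname": "PodCrashLooping", "namespace": "shop", "severity": "critical"},
    {"alertname": "PodCrashLooping", "namespace": "shop", "severity": "critical"},
    {"alertname": "DiskFillingUp", "namespace": "infra", "team": "platform", "severity": "warning"},
    {"alertname": "CertExpiring", "namespace": "web", "severity": "warning"},
]

grouped = group_alerts(alerts)
for (receiver, group_key), members in grouped.items():
    print(receiver, group_key, len(members))
```

Each resulting group corresponds to one notification sent to its receiver, which is why a burst of identical alerts does not flood the channel.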

Extended Metrics Retention

By default, Prometheus is configured to store metrics for a maximum of 30 days. If required, Grafana Mimir can be used to extend metrics retention beyond the standard 30 days. It offers robust scalability and long-term storage capabilities, ensuring efficient handling of large data volumes. Please let us know your interest in extended metrics retention by participating in this survey.

Log Management

Component	Description
Grafana Loki	Grafana Loki is a scalable, efficient log aggregation system designed to collect, store, and query logs from applications and infrastructure. Unlike traditional log management solutions, Loki does not build a full-text index of the log content. It indexes only a small set of labels per log stream, which keeps storage costs low and makes it highly efficient for large-scale deployments.
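
Loki keeps its index small by indexing only the labels of each log stream, not the log text itself. The following toy model sketches that idea: streams are looked up by label set, and the log lines are filtered only afterwards. It is an illustration under simplified assumptions, not Loki's implementation; the class name and the sample labels are hypothetical.

```python
from collections import defaultdict


class TinyLogStore:
    """Toy log store: the index holds label sets only, never log content."""

    def __init__(self):
        self._streams = defaultdict(list)  # frozenset of labels -> log lines

    def push(self, labels: dict, line: str) -> None:
        self._streams[frozenset(labels.items())].append(line)

    def query(self, selector: dict, contains: str = "") -> list:
        """Select streams by labels first, then filter their lines by substring."""
        wanted = set(selector.items())
        results = []
        for stream_labels, lines in self._streams.items():
            if wanted <= stream_labels:  # every selector label is on the stream
                results.extend(line for line in lines if contains in line)
        return results


store = TinyLogStore()
store.push({"namespace": "shop", "app": "api"}, "GET /cart 200")
store.push({"namespace": "shop", "app": "api"}, "upstream timeout after 30s")
store.push({"namespace": "shop", "app": "worker"}, "job timeout")
store.push({"namespace": "infra", "app": "dns"}, "timeout resolving host")

# Roughly comparable to the LogQL query {namespace="shop", app="api"} |= "timeout"
hits = store.query({"namespace": "shop", "app": "api"}, contains="timeout")
print(hits)
```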

Visualization

Grafana is provided to visualize the collected data. It is a central component of our managed observability service and is installed via the Grafana Operator.

Component	Description
Grafana Operator	Grafana Operator is a Kubernetes operator for Grafana. It lets you deploy Grafana, dashboards, data sources, and more as Kubernetes resources.
Grafana	Grafana is a powerful, open-source platform for data visualization. It allows you to query, visualize, alert on, and understand your metrics no matter where they are stored, and it provides tools to turn time series database (TSDB) and log data into graphs and visualizations on one central platform.
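
Managing dashboards as Kubernetes resources can be sketched as follows. The example assumes the GrafanaDashboard resource of Grafana Operator v5 (API group grafana.integreatly.org/v1beta1); the operator version in use is not stated here, and the resource name, namespace, instance selector label, and dashboard content are hypothetical placeholders.

```python
import json

# Minimal dashboard model; real dashboards are usually exported from Grafana.
dashboard_model = {
    "title": "Example Namespace Overview",
    "uid": "example-namespace-overview",
    "panels": [],
}

# GrafanaDashboard resource as assumed for Grafana Operator v5.
dashboard_cr = {
    "apiVersion": "grafana.integreatly.org/v1beta1",
    "kind": "GrafanaDashboard",
    "metadata": {"name": "example-namespace-overview", "namespace": "monitoring"},
    "spec": {
        # Selects the Grafana instance(s) that should load this dashboard.
        "instanceSelector": {"matchLabels": {"dashboards": "grafana"}},
        # The dashboard definition is embedded as a JSON string.
        "json": json.dumps(dashboard_model),
    },
}

print(json.dumps(dashboard_cr, indent=2))
```

Because the dashboard is an ordinary Kubernetes resource, it can be versioned and reviewed like any other manifest.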

Extensions

In addition to the robust Kube-Prometheus Stack as well as Log Management with Grafana Loki, we offer a suite of enhancements that are exclusively managed and continuously developed by ONZACK. These enhancements are designed to complement and extend the core functionalities of Kube-Prometheus Stack, Grafana Loki and Grafana, providing advanced features that cater specifically to the unique needs of our customers. Our managed service ensures that these additions are not only up-to-date but also seamlessly integrated, allowing for a more comprehensive and efficient observability solution.

Component	Description
Custom Dashboards	We provide a continuously evolving catalog of dashboards, with new dashboards added regularly. Additionally, users can create and manage their own dashboards in Grafana, tailored to their specific monitoring needs for personalized visualization of metrics, logs, and traces.
Capacity Management	Visualizations and notifications allowing capacity management for economic and performance optimization.
Proactive and reactive alerts	Standard ruleset for proactive and reactive alerting to allow preemptive issue resolution and quick response to incidents.

Additional Features

While we continuously develop and enhance our features, we highly value feedback and insights from our users regarding future developments. To ensure that our service evolves in a direction that benefits all, we invite you to participate in this survey. Your input will help shape the future of our managed observability services.

Maintenance & Support

Maintenance work is performed as outlined in the Service Levels - Maintenance Work section.

Version and Feature Support

We support only the most current versions of the tools. Support for any version ends when the original developers end their support for it. We support only those features that are designated as "General availability" in the developers' release and life cycle documentation.

Upgrade Policy

ONZACK discontinues support for a major version two months after the release of the latest major.minor version. Upgrades to the new major version must therefore be completed within two months of that release. If a customer requires a longer transition period, the service continues to operate but transitions to "unmanaged" status. A service in "unmanaged" status is no longer actively maintained by ONZACK, and any previously applicable Service Level Agreements (SLAs) no longer apply.
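
The two-month upgrade window can be worked through with a short calculation. The sketch below assumes that "two months" means calendar months and that the deadline day itself still counts as within the window; both are interpretations for illustration, not contractual definitions, and the dates are made up.

```python
import calendar
from datetime import date


def add_months(start: date, months: int) -> date:
    """Add calendar months, clamping to the last day of shorter months."""
    month_index = start.month - 1 + months
    year = start.year + month_index // 12
    month = month_index % 12 + 1
    day = min(start.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def service_status(release: date, upgraded_on, today: date, window_months: int = 2) -> str:
    """Return 'managed' or 'unmanaged' for a service facing a new major version."""
    deadline = add_months(release, window_months)
    if upgraded_on is not None and upgraded_on <= deadline:
        return "managed"      # upgrade completed within the window
    if today <= deadline:
        return "managed"      # window still open
    return "unmanaged"        # window missed; the service keeps running


# Hypothetical release date of a new major version.
release = date(2024, 12, 31)
print(add_months(release, 2))
print(service_status(release, upgraded_on=None, today=date(2025, 3, 15)))
```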

Service Levels

This service is currently available with the service levels Best Effort and Business Hours. For details about the service levels, please refer to the Service Levels section.

Pricing

Prices don't include initial setup, infrastructure costs, storage costs and VAT. For infrastructure costs, please refer to the cloud provider pricing.

Initial setup is charged hourly; for rates, please refer to engineering services. The required effort depends on the complexity of the setup but typically ranges from 2 to 6 hours. We are happy to provide an estimate as part of the quote. Please contact sales@onzack.com.

	Best Effort	Business Hours
Managed Kubernetes Observability	CHF 180.00	CHF 340.00

Business Hours Service Level Requirements

The Business Hours service level for Managed Kubernetes Observability requires the same service level for the underlying platform (e.g. Rancher Kubernetes Cluster).

Where is this service available?

The Managed Kubernetes Observability service is designed to operate seamlessly on our Managed Rancher Kubernetes service. For more details, please see the Managed Rancher Kubernetes section.

Upon request, we are happy to discuss additional deployment possibilities to meet your specific requirements. Please contact us at sales@onzack.com.