In a complex environment of legacy applications and new-generation microservices running on IBM Z with Linux and OpenShift, you want to provide full-stack observability while meeting stringent security requirements. How do you use Elasticsearch effectively?
The environment in this paper consists of two OpenShift clusters running alongside 250+ virtual machines, hosting a mixed application landscape of approximately 60 percent monolithic Java workloads and 40 percent microservices-based applications built on Nginx, Tomcat, Spring Boot, and Oracle 19.
The core challenge: achieving unified observability across logs, infrastructure metrics, and application performance in an air-gapped environment where no external connectivity is available and application container images cannot be modified.
The environment already had Prometheus in place for infrastructure-level metrics. However, Prometheus alone addresses only one dimension of observability. It does not provide application-level tracing or request-level visibility required for root cause analysis — it cannot surface what a specific application was doing when an incident occurred, why a request failed, or where latency was introduced in a service call chain. Without centralized log management and APM, the three observability layers that Elastic unifies — logs, metrics, and traces — were operating in silos, with two of the three absent.
This paper walks through the architecture and implementation decisions for each layer: log collection using OpenShift’s Cluster Log Forwarder, infrastructure and container metrics using Metricbeat, and application performance monitoring using OpenTelemetry — all deployed within a fully air-gapped environment with immutable application images.
The Constraints That Defined the Architecture
Before picking any tool, understanding what the environment rules out entirely is the necessary first step. Two constraints eliminate every standard approach:
Fully air-gapped. No internet access at any stage. No pulling images from any public registry at runtime. Every image, every binary, every dependency must be sourced externally, serialized using docker save, transferred through a secure intake channel via SCP, loaded with docker load, tagged, and pushed to the internal OpenShift registry before it can be referenced in any deployment. This changes how you plan every single component of the stack, because each addition costs 1 to 2 days of transfer time.
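The intake cycle above can be sketched as a script. Every name here — image, archive, hosts, registry address — is a placeholder, not a value from this environment:

```shell
#!/bin/sh
# Offline image intake sketch. First half runs on a connected staging host,
# second half inside the air-gapped network. All names are placeholders.
IMAGE="docker.elastic.co/beats/metricbeat:8.19.0"
ARCHIVE="metricbeat-8.19.0.tar"
INTERNAL_REGISTRY="registry.internal.example:5000"

# On the connected staging host: pull and serialize the image.
docker pull "$IMAGE"
docker save -o "$ARCHIVE" "$IMAGE"

# Transfer through the secure intake channel.
scp "$ARCHIVE" intake-host:/secure/intake/

# Inside the air-gapped network: load, retag, and push to the internal registry.
docker load -i "/secure/intake/$ARCHIVE"
docker tag "$IMAGE" "$INTERNAL_REGISTRY/beats/metricbeat:8.19.0"
docker push "$INTERNAL_REGISTRY/beats/metricbeat:8.19.0"
```

Every deployment manifest then references the `$INTERNAL_REGISTRY` path, never the public one.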
Immutable application images. No modification to running containers is permitted. The standard way to deploy an APM agent in Java is to bake it into the container image during the build, or to use a mutating admission webhook that patches the pod spec at scheduling time. Both are off the table — the first because images cannot be rebuilt, the second because it requires cluster-level admission controller privileges that are not available in this environment. The instrumentation approach must work entirely at the pod spec level without any image changes and without any cluster-wide webhook.
Layer 1: Centralized Log Management with CLF
OpenShift’s Cluster Log Forwarder (CLF) uses a Vector-based collector running as a DaemonSet on every node. CLF is native to the OpenShift platform, which makes it the appropriate mechanism for log collection in this environment — it operates within OpenShift’s security model without requiring additional images or elevated privileges beyond what the Logging Operator already manages.
The key design decision is pipeline separation. Rather than routing everything into a single Elasticsearch index, configure two distinct CLF pipelines — one for application logs and one for audit logs — each with its own outputRef pointing to a dedicated Elasticsearch index. During initial testing, both log types were included in both pipeline inputRefs, which caused duplicate forwarding to both indices. This is an easy misconfiguration to miss in CLF setups — the corrected configuration is as follows:
pipelines:
  - name: application-logs
    inputRefs:
      - application
    outputRefs:
      - elasticsearch-app
    labels:
      logs: "application-logs"
  - name: application-audit
    inputRefs:
      - audit
    outputRefs:
      - elasticsearch-audit
    labels:
      logs: "application-audit"
Application logs land in the app-write index and audit logs in the audit-write index. Keeping them separate is not just an organizational preference — audit logs from the OpenShift API server have a completely different schema from application logs. Mixing them into a single index creates mapping conflicts in Elasticsearch and makes querying significantly harder.
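A hypothetical illustration of the conflict, using made-up index and field names rather than anything from this environment: once Elasticsearch dynamically maps a field as an object, a later document carrying the same field as a plain string is rejected.

```shell
# First document (audit-style) maps "status" as an object.
curl -s -X POST "https://elasticsearch:9200/mixed-logs/_doc" \
  -H 'Content-Type: application/json' \
  -d '{"status": {"code": 403, "reason": "Forbidden"}}'

# Second document (app-style) sends "status" as a string
# and is rejected with a mapper_parsing_exception.
curl -s -X POST "https://elasticsearch:9200/mixed-logs/_doc" \
  -H 'Content-Type: application/json' \
  -d '{"status": "started"}'
```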
One architectural detail worth noting: Elasticsearch is not deployed inside the OpenShift cluster. It runs on a dedicated VM on the internal network, managed separately from the OpenShift environment. CLF connects to it over the internal network — fully within the air-gapped infrastructure, but outside the cluster boundary. This is a deliberate separation of concerns, keeping the data store independent of the container platform.
Since cert-manager is not available in an air-gapped environment, TLS certificates are generated internally. Store the CA certificate as a Kubernetes secret in the openshift-logging namespace:
oc create secret generic elastic-ca-secret \
--from-file=ca.crt=ca.crt \
-n openshift-logging
The CLF output configuration connects to the external Elasticsearch VM and references the secret for TLS verification:
outputs:
  - name: elasticsearch-app
    type: elasticsearch
    url: https://:9200
    secret:
      name: elastic-es-secret
    tls:
      ca:
        secretName: elastic-ca-secret
        key: ca.crt
  - name: elasticsearch-audit
    type: elasticsearch
    url: https://:9200
    secret:
      name: elastic-es-secret
    tls:
      ca:
        secretName: elastic-ca-secret
        key: ca.crt
All log data stays within the internal network. No external transmission at any point. The platform processes 10 to 20 GB of logs per day across these pipelines, all queryable through Kibana dashboards scoped to cluster, node, pod, and application level.
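As a sketch of what querying the separated indices looks like, the following searches the application index for one namespace over the last hour. The field names follow the OpenShift logging data model; the host and namespace values are placeholders:

```shell
# Hypothetical search against the app-write index (host and
# namespace are placeholders; field names per OpenShift logging).
curl -s -X GET "https://elasticsearch:9200/app-write/_search" \
  -H 'Content-Type: application/json' \
  -d '{
    "query": {
      "bool": {
        "filter": [
          { "term":  { "kubernetes.namespace_name": "payments" } },
          { "range": { "@timestamp": { "gte": "now-1h" } } }
        ]
      }
    }
  }'
```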
Layer 2: Infrastructure Metrics with Metricbeat and kube-state-metrics
With Prometheus already providing infrastructure-level metrics, the gap is not metric collection itself — it is that Prometheus has no native path into Elasticsearch, where all observability data needs to land for a unified Kibana view. Metricbeat ships natively to Elasticsearch without any translation layer, which keeps the stack simpler. This matters in air-gapped environments where every additional component requires a controlled offline transfer cycle to introduce.
Metricbeat runs as a DaemonSet — one pod per node — using the Kubernetes module to collect node-level and pod-level metrics simultaneously from the Kubelet API on port 10250.
metricbeat.modules:
  - module: kubernetes
    metricsets:
      - node
      - pod
      - container
      - system
      - volume
    period: 10s
    host: ${NODE_NAME}
    hosts: ["https://${NODE_NAME}:10250"]
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    ssl.verification_mode: "full"
    ssl.certificate_authorities:
      - /usr/share/metricbeat/certs/ca.crt
The CA certificate is mounted into the Metricbeat container from the elastic-ca secret via a volume mount defined in the DaemonSet spec:
volumes:
  - name: elastic-ca
    secret:
      secretName: elastic-ca
volumeMounts:
  - name: elastic-ca
    mountPath: /usr/share/metricbeat/certs
    readOnly: true
Running on OpenShift requires a privileged SCC for the Metricbeat DaemonSet, since host-level access is needed to collect node metrics. Scope the service account permissions precisely to what the monitoring stack requires — nothing broader. TLS verification is set to full using internally generated and trusted certificates, consistent with the security posture of the environment.
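Scoping the SCC grant to the service account rather than the namespace can be done as follows; the service account name is an assumption, and the namespace reuses the one seen elsewhere in this setup:

```shell
# Bind the privileged SCC to the Metricbeat service account only,
# never to the whole namespace (names are assumptions).
oc adm policy add-scc-to-user privileged \
  -z metricbeat \
  -n elastic-observability
```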
For Kubernetes object-state visibility — deployment health, ReplicaSet status, StatefulSet conditions, PVC states — kube-state-metrics is deployed as a separate service. Metricbeat consumes it through a second module configuration pointing at the kube-state-metrics service endpoint.
- module: kubernetes
  metricsets:
    - state_node
    - state_pod
    - state_container
    - state_deployment
    - state_replicaset
    - state_statefulset
    - state_persistentvolumeclaim
  period: 30s
  hosts:
    - "kube-state-metrics.elastic-observability.svc.cluster.local:8080"
Both metric streams flow into Elasticsearch and surface in Kibana. The result is a correlated view of infrastructure health — CPU, memory, disk, pod lifecycle, and Kubernetes object state — alongside the log data from Layer 1, in the same interface.
Layer 3: Application Performance Monitoring Without Image Modification
Instrumenting Java applications for APM without modifying container images requires a non-intrusive injection mechanism. The application estate in this environment consists of approximately 60 percent monolithic Java applications running on Tomcat and 40 percent Spring Boot microservices — none of which can be rebuilt or modified.
The standard OpenShift-native approach for OTel Java agent injection is the OpenTelemetry Operator with a cluster-wide mutating admission webhook that automatically patches pod specs at scheduling time. That approach does not fit this environment — it requires cluster-admin level permissions to register the webhook, which are not available here.
The approach is namespace-scoped OTel auto-instrumentation, which achieves the same result without a cluster-wide webhook. Label the target namespace to opt into instrumentation:
oc label namespace <target-namespace> opentelemetry-injection=enabled
Add a single annotation to the application pod:
metadata:
annotations:
instrumentation.opentelemetry.io/inject-java: "true"
The Operator handles the rest automatically — it injects an init container that copies the OTel Java agent binary into a shared emptyDir volume, mounts it into the application container, and sets JAVA_TOOL_OPTIONS to load the agent at JVM startup. The application image is never modified. The only configuration required at the application level is the service name:
env:
  - name: OTEL_SERVICE_NAME
    value: "<your-service-name>"
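For annotation-based injection to work, the Operator also needs an Instrumentation resource in the target namespace that tells it which agent image to use and where to export. A minimal sketch, assuming an internally mirrored agent image and an in-cluster Collector service named otel-collector (all names here are placeholders):

```yaml
# Hypothetical Instrumentation resource for the OTel Operator.
# The agent image must point at the internal registry mirror.
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: java-instrumentation
  namespace: <target-namespace>
spec:
  java:
    image: registry.internal.example:5000/otel/autoinstrumentation-java:latest
  exporter:
    endpoint: http://otel-collector:4317
```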
Traces are exported to an OTel Collector running inside the cluster as a Deployment with a ClusterIP service — no external endpoint, consistent with the air-gapped constraint. The Collector receives traces over gRPC on port 4317 and forwards to the Elastic APM Server over HTTP (running 8.19.x, which supports OTLP ingestion natively on port 8200):
receivers:
  otlp:
    protocols:
      grpc:
      http:
exporters:
  otlphttp:
    endpoint: http://apm-server:8200
The design is validated on the first application. Each additional application requires one namespace label and one pod annotation — no new architectural decisions per application.
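For reference, the effect of the injection on a pod can be sketched as follows. This is an approximation, not a literal dump: the exact volume names and agent paths are Operator internals and vary by version, and the registry path is a placeholder.

```yaml
# Approximate shape of an Operator-mutated pod spec (sketch only).
spec:
  initContainers:
    - name: opentelemetry-auto-instrumentation
      image: registry.internal.example:5000/otel/autoinstrumentation-java:latest
      # Copies the agent jar into the shared volume before the app starts.
      command: ["cp", "/javaagent.jar", "/otel-auto-instrumentation/javaagent.jar"]
      volumeMounts:
        - name: opentelemetry-auto-instrumentation
          mountPath: /otel-auto-instrumentation
  containers:
    - name: app
      env:
        # Loads the agent at JVM startup without touching the image.
        - name: JAVA_TOOL_OPTIONS
          value: "-javaagent:/otel-auto-instrumentation/javaagent.jar"
      volumeMounts:
        - name: opentelemetry-auto-instrumentation
          mountPath: /otel-auto-instrumentation
  volumes:
    - name: opentelemetry-auto-instrumentation
      emptyDir: {}
```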
Result
With all three layers operational, the environment moves from fragmented, reactive monitoring to a unified observability platform. Log data from across both clusters — approximately 10 to 20 GB per day — is centralized and searchable in Elasticsearch, with application and audit streams indexed separately for schema consistency and query performance. Infrastructure and container metrics from 250+ virtual machines and both OpenShift clusters flow into Kibana through Metricbeat, correlated with log data in the same interface. APM instrumentation is active across the Java application estate with zero image modifications, providing request-level tracing and latency visibility that was entirely absent before.
The architecture described in this paper applies to any OpenShift environment operating under similar constraints — air-gapped, restricted, or compliance-bound. The specific combination of CLF for logs, Metricbeat for metrics, and namespace-scoped OTel for APM is the appropriate Elastic Stack pattern for environments where standard agent-based approaches are not permitted.
Key Implementation Considerations
The following patterns apply to any OpenShift environment operating under similar constraints:
- Use OpenShift CLF instead of adding a separate log shipper. It is native to the platform and eliminates one more image to manage in an air-gapped lifecycle.
- Separate log pipelines by type at the CLF level, not at the Elasticsearch level. Schema differences between audit and application logs will cause mapping conflicts if mixed into a single index.
- If metrics need to flow into Elasticsearch, Metricbeat’s native output is simpler than adding a Prometheus remote write pipeline as an intermediary layer.
- For APM where images cannot be modified and cluster-wide admission webhooks are not available, namespace-scoped OTel instrumentation via annotations is the right approach. The cluster-wide mutating webhook is what requires cluster-admin — the instrumentation itself does not.
- Plan the offline image lifecycle before starting the deployment, not during it. Every image addition costs days. Knowing this upfront changes how the work is sequenced.