In a complex environment of legacy applications and new-generation microservices running on IBM Z with Linux and OpenShift, you want to provide full-stack observability while meeting stringent security requirements. How do you use Elasticsearch effectively?
The environment in this paper consists of two OpenShift clusters running alongside 250+ virtual machines, hosting a mixed application landscape of approximately 60 percent monolithic Java workloads and 40 percent microservices-based applications built on Nginx, Tomcat, Spring Boot, and Oracle 19.
The core challenge: achieving unified observability across logs, infrastructure metrics, and application performance in an air-gapped environment where no external connectivity is available and application container images cannot be modified.
The environment already had Prometheus in place for infrastructure-level metrics. However, Prometheus alone addresses only one dimension of observability. It does not provide application-level tracing or request-level visibility required for root cause analysis — it cannot surface what a specific application was doing when an incident occurred, why a request failed, or where latency was introduced in a service call chain. Without centralized log management and APM, the three observability layers that Elastic unifies — logs, metrics, and traces — were operating in silos, with two of the three absent.
This paper walks through the architecture and implementation decisions for each layer: log collection using OpenShift’s Cluster Log Forwarder, infrastructure and container metrics using Metricbeat, and application performance monitoring using OpenTelemetry — all deployed within a fully air-gapped environment with immutable application images.
The Constraints That Defined the Architecture
Before picking any tool, understanding what the environment rules out entirely is the necessary first step. Two constraints eliminate every standard approach:
Fully air-gapped. No internet access at any stage. No pulling images from any public registry at runtime. Every image, every binary, every dependency must be sourced externally, serialized using docker save, transferred through a secure intake channel via SCP, loaded with docker load, tagged, and pushed to the internal OpenShift registry before it can be referenced in any deployment. This changes how you plan every single component of the stack, because each addition costs 1 to 2 days of transfer time.
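The intake cycle above can be sketched as a script. Every name here — image, archive, hosts, registry address — is a placeholder, not a value from this environment:

```shell
#!/bin/sh
# Offline image intake sketch. First half runs on a connected staging host,
# second half inside the air-gapped network. All names are placeholders.
IMAGE="docker.elastic.co/beats/metricbeat:8.19.0"
ARCHIVE="metricbeat-8.19.0.tar"
INTERNAL_REGISTRY="registry.internal.example:5000"

# On the connected staging host: pull and serialize the image.
docker pull "$IMAGE"
docker save -o "$ARCHIVE" "$IMAGE"

# Transfer through the secure intake channel.
scp "$ARCHIVE" intake-host:/secure/intake/

# Inside the air-gapped network: load, retag, and push to the internal registry.
docker load -i "/secure/intake/$ARCHIVE"
docker tag "$IMAGE" "$INTERNAL_REGISTRY/beats/metricbeat:8.19.0"
docker push "$INTERNAL_REGISTRY/beats/metricbeat:8.19.0"
```

Every deployment manifest then references the `$INTERNAL_REGISTRY` path, never the public one.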
Immutable application images. No modification to running containers is permitted. The standard way to deploy an APM agent in Java is to bake it into the container image during the build, or to use a mutating admission webhook that patches the pod spec at scheduling time. Both are off the table — the first because images cannot be rebuilt, the second because it requires cluster-level admission controller privileges that are not available in this environment. The instrumentation approach must work entirely at the pod spec level without any image changes and without any cluster-wide webhook.
Layer 1: Centralized Log Management with CLF
OpenShift’s Cluster Log Forwarder (CLF) uses a Vector-based collector running as a DaemonSet on every node. CLF is native to the OpenShift platform, which makes it the appropriate mechanism for log collection in this environment — it operates within OpenShift’s security model without requiring additional images or elevated privileges beyond what the Logging Operator already manages.
The key design decision is pipeline separation. Rather than routing everything into a single Elasticsearch index, configure two distinct CLF pipelines — one for application logs and one for audit logs — each with its own outputRef pointing to a dedicated Elasticsearch index. During initial testing, both log types were included in both pipeline inputRefs, which caused duplicate forwarding to both indices. This is an easy misconfiguration to miss in CLF setups — the corrected configuration is as follows:
pipelines:
  - name: application-logs
    inputRefs:
      - application
    outputRefs:
      - elasticsearch-app
    labels:
      logs: "application-logs"
  - name: application-audit
    inputRefs:
      - audit
    outputRefs:
      - elasticsearch-audit
    labels:
      logs: "application-audit"
Application logs land in the app-write index and audit logs in the audit-write index. Keeping them separate is not just an organizational preference — audit logs from the OpenShift API server have a completely different schema from application logs. Mixing them into a single index creates mapping conflicts in Elasticsearch and makes querying significantly harder.
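A hypothetical illustration of the conflict, using made-up index and field names rather than anything from this environment: once Elasticsearch dynamically maps a field as an object, a later document carrying the same field as a plain string is rejected.

```shell
# First document (audit-style) maps "status" as an object.
curl -s -X POST "https://elasticsearch:9200/mixed-logs/_doc" \
  -H 'Content-Type: application/json' \
  -d '{"status": {"code": 403, "reason": "Forbidden"}}'

# Second document (app-style) sends "status" as a string
# and is rejected with a mapper_parsing_exception.
curl -s -X POST "https://elasticsearch:9200/mixed-logs/_doc" \
  -H 'Content-Type: application/json' \
  -d '{"status": "started"}'
```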
One architectural detail worth noting: Elasticsearch is not deployed inside the OpenShift cluster. It runs on a dedicated VM on the internal network, managed separately from the OpenShift environment. CLF connects to it over the internal network — fully within the air-gapped infrastructure, but outside the cluster boundary. This is a deliberate separation of concerns, keeping the data store independent of the container platform.
Since cert-manager is not available in an air-gapped environment, TLS certificates are generated internally. Store the CA certificate as a Kubernetes secret in the openshift-logging namespace:
oc create secret generic elastic-ca-secret \
--from-file=ca.crt=ca.crt \
-n openshift-logging
The CLF output configuration connects to the external Elasticsearch VM and references the secret for TLS verification:
outputs:
  - name: elasticsearch-app
    type: elasticsearch
    url: https://:9200
    secret:
      name: elastic-es-secret
    tls:
      ca:
        secretName: elastic-ca-secret
        key: ca.crt
  - name: elasticsearch-audit
    type: elasticsearch
    url: https://:9200
    secret:
      name: elastic-es-secret
    tls:
      ca:
        secretName: elastic-ca-secret
        key: ca.crt
All log data stays within the internal network. No external transmission at any point. The platform processes 10 to 20 GB of logs per day across these pipelines, all queryable through Kibana dashboards scoped to cluster, node, pod, and application level.
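As a sketch of what querying the separated indices looks like, the following searches the application index for one namespace over the last hour. The field names follow the OpenShift logging data model; the host and namespace values are placeholders:

```shell
# Hypothetical search against the app-write index (host and
# namespace are placeholders; field names per OpenShift logging).
curl -s -X GET "https://elasticsearch:9200/app-write/_search" \
  -H 'Content-Type: application/json' \
  -d '{
    "query": {
      "bool": {
        "filter": [
          { "term":  { "kubernetes.namespace_name": "payments" } },
          { "range": { "@timestamp": { "gte": "now-1h" } } }
        ]
      }
    }
  }'
```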
Layer 2: Infrastructure Metrics with Metricbeat and kube-state-metrics
With Prometheus already providing infrastructure-level metrics, the gap is not metric collection itself — it is that Prometheus has no native path into Elasticsearch, where all observability data needs to land for a unified Kibana view. Metricbeat ships natively to Elasticsearch without any translation layer, which keeps the stack simpler. This matters in air-gapped environments where every additional component requires a controlled offline transfer cycle to introduce.
Metricbeat runs as a DaemonSet — one pod per node — using the Kubernetes module to collect node-level and pod-level metrics simultaneously from the Kubelet API on port 10250.
metricbeat.modules:
  - module: kubernetes
    metricsets:
      - node
      - pod
      - container
      - system
      - volume
    period: 10s
    host: ${NODE_NAME}
    hosts: ["https://${NODE_NAME}:10250"]
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    ssl.verification_mode: "full"
    ssl.certificate_authorities:
      - /usr/share/metricbeat/certs/ca.crt
The CA certificate is mounted into the Metricbeat container from the elastic-ca secret via a volume mount defined in the DaemonSet spec:
volumes:
  - name: elastic-ca
    secret:
      secretName: elastic-ca
volumeMounts:
  - name: elastic-ca
    mountPath: /usr/share/metricbeat/certs
    readOnly: true
Running on OpenShift requires a privileged SCC for the Metricbeat DaemonSet, since host-level access is needed to collect node metrics. Scope the service account permissions precisely to what the monitoring stack requires — nothing broader. TLS verification is set to full using internally generated and trusted certificates, consistent with the security posture of the environment.
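Scoping the SCC grant to the service account rather than the namespace can be done as follows; the service account name is an assumption, and the namespace reuses the one seen elsewhere in this setup:

```shell
# Bind the privileged SCC to the Metricbeat service account only,
# never to the whole namespace (names are assumptions).
oc adm policy add-scc-to-user privileged \
  -z metricbeat \
  -n elastic-observability
```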
For Kubernetes object-state visibility — deployment health, ReplicaSet status, StatefulSet conditions, PVC states — kube-state-metrics is deployed as a separate service. Metricbeat consumes it through a second module configuration pointing at the kube-state-metrics service endpoint.
- module: kubernetes
  metricsets:
    - state_node
    - state_pod
    - state_container
    - state_deployment
    - state_replicaset
    - state_statefulset
    - state_persistentvolumeclaim
  period: 30s
  hosts:
    - "kube-state-metrics.elastic-observability.svc.cluster.local:8080"
Both metric streams flow into Elasticsearch and surface in Kibana. The result is a correlated view of infrastructure health — CPU, memory, disk, pod lifecycle, and Kubernetes object state — alongside the log data from Layer 1, in the same interface.
Layer 3: Application Performance Monitoring Without Image Modification
Instrumenting Java applications for APM without modifying container images requires a non-intrusive injection mechanism. The application estate in this environment consists of approximately 60 percent monolithic Java applications running on Tomcat and 40 percent Spring Boot microservices — none of which can be rebuilt or modified.
The standard OpenShift-native approach for OTel Java agent injection is the OpenTelemetry Operator with a cluster-wide mutating admission webhook that automatically patches pod specs at scheduling time. That approach does not fit this environment — it requires cluster-admin level permissions to register the webhook, which are not available here.
The approach is namespace-scoped OTel auto-instrumentation, which achieves the same result without a cluster-wide webhook. Label the target namespace to opt into instrumentation:
oc label namespace <target-namespace> opentelemetry-injection=enabled
Add a single annotation to the application pod:
metadata:
annotations:
instrumentation.opentelemetry.io/inject-java: "true"
The Operator handles the rest automatically — it injects an init container that copies the OTel Java agent binary into a shared emptyDir volume, mounts it into the application container, and sets JAVA_TOOL_OPTIONS to load the agent at JVM startup. The application image is never modified. The only configuration required at the application level is the service name:
env:
  - name: OTEL_SERVICE_NAME
    value: "<your-service-name>"
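For annotation-based injection to work, the Operator also needs an Instrumentation resource in the target namespace that tells it which agent image to use and where to export. A minimal sketch, assuming an internally mirrored agent image and an in-cluster Collector service named otel-collector (all names here are placeholders):

```yaml
# Hypothetical Instrumentation resource for the OTel Operator.
# The agent image must point at the internal registry mirror.
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: java-instrumentation
  namespace: <target-namespace>
spec:
  java:
    image: registry.internal.example:5000/otel/autoinstrumentation-java:latest
  exporter:
    endpoint: http://otel-collector:4317
```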
Traces are exported to an OTel Collector running inside the cluster as a Deployment with a ClusterIP service — no external endpoint, consistent with the air-gapped constraint. The Collector receives traces over gRPC on port 4317 and forwards to the Elastic APM Server over HTTP (running 8.19.x, which supports OTLP ingestion natively on port 8200):
receivers:
  otlp:
    protocols:
      grpc:
      http:
exporters:
  otlphttp:
    endpoint: http://apm-server:8200
The design is validated on the first application. Each additional application requires one namespace label and one pod annotation — no new architectural decisions per application.
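For reference, the effect of the injection on a pod can be sketched as follows. This is an approximation, not a literal dump: the exact volume names and agent paths are Operator internals and vary by version, and the registry path is a placeholder.

```yaml
# Approximate shape of an Operator-mutated pod spec (sketch only).
spec:
  initContainers:
    - name: opentelemetry-auto-instrumentation
      image: registry.internal.example:5000/otel/autoinstrumentation-java:latest
      # Copies the agent jar into the shared volume before the app starts.
      command: ["cp", "/javaagent.jar", "/otel-auto-instrumentation/javaagent.jar"]
      volumeMounts:
        - name: opentelemetry-auto-instrumentation
          mountPath: /otel-auto-instrumentation
  containers:
    - name: app
      env:
        # Loads the agent at JVM startup without touching the image.
        - name: JAVA_TOOL_OPTIONS
          value: "-javaagent:/otel-auto-instrumentation/javaagent.jar"
      volumeMounts:
        - name: opentelemetry-auto-instrumentation
          mountPath: /otel-auto-instrumentation
  volumes:
    - name: opentelemetry-auto-instrumentation
      emptyDir: {}
```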
Result
With all three layers operational, the environment moves from fragmented, reactive monitoring to a unified observability platform. Log data from across both clusters — approximately 10 to 20 GB per day — is centralized and searchable in Elasticsearch, with application and audit streams indexed separately for schema consistency and query performance. Infrastructure and container metrics from 250+ virtual machines and both OpenShift clusters flow into Kibana through Metricbeat, correlated with log data in the same interface. APM instrumentation is active across the Java application estate with zero image modifications, providing request-level tracing and latency visibility that was entirely absent before.
The architecture described in this paper applies to any OpenShift environment operating under similar constraints — air-gapped, restricted, or compliance-bound. The specific combination of CLF for logs, Metricbeat for metrics, and namespace-scoped OTel for APM is the appropriate Elastic Stack pattern for environments where standard agent-based approaches are not permitted.
Key Implementation Considerations
The following patterns apply to any OpenShift environment operating under similar constraints:
- Use OpenShift CLF instead of adding a separate log shipper. It is native to the platform and eliminates one more image to manage in an air-gapped lifecycle.
- Separate log pipelines by type at the CLF level, not at the Elasticsearch level. Schema differences between audit and application logs will cause mapping conflicts if mixed into a single index.
- If metrics need to flow into Elasticsearch, Metricbeat’s native output is simpler than adding a Prometheus remote write pipeline as an intermediary layer.
- For APM where images cannot be modified and cluster-wide admission webhooks are not available, namespace-scoped OTel instrumentation via annotations is the right approach. The cluster-wide mutating webhook is what requires cluster-admin — the instrumentation itself does not.
- Plan the offline image lifecycle before starting the deployment, not during it. Every image addition costs days. Knowing this upfront changes how the work is sequenced.