xAI’s mission is to create AI systems that can accurately understand the universe and aid humanity in its pursuit of knowledge. The Observability team builds and operates core infrastructure to monitor, debug, and optimize the performance and reliability of systems. This role focuses on building and maintaining the observability platform, owning metrics, logs, tracing, and alerting to empower engineering teams to operate services at scale and improve reliability.
Salary
USD 180,000 - 440,000
Requirements
Skills
Go
Rust
Scala
Distributed systems
Telemetry architecture
Infrastructure at scale
Prometheus
Grafana
OpenTelemetry
VictoriaMetrics
ClickHouse
Kafka
Redis
Large-scale time series databases
Kubernetes
Responsibilities
Design and implement scalable observability infrastructure for metrics, logging, and tracing.
Build high-performance telemetry pipelines that handle massive ingestion volumes.
Develop APIs, query engines, and UIs that allow engineers to get real-time insights into their services.
Define and enforce best practices for instrumentation, alerting, and reliability across the company.
Partner with infrastructure and product teams to deeply integrate observability into our internal platforms.
Own the reliability, scalability, and performance of the observability stack end-to-end.