Member of Technical Staff - Inference at xAI

On-site - Palo Alto, CA

xAI’s mission is to create AI systems that can accurately understand the universe and aid humanity in its pursuit of knowledge. The Member of Technical Staff - Inference will design and optimize large-scale model serving systems for Grok, building a high-performance inference platform to serve millions of users daily. The role involves architecting scalable distributed infrastructure, optimizing latency and throughput, building high-concurrency serving systems, benchmarking, fine-tuning inference engines, developing custom trace tools, creating CI/CD infrastructure, and accelerating research on scaling compute and model-hardware co‑design.

Salary

USD 180,000 - 440,000

Requirements

Skills

Deep low-level systems programming (C/C++ or Rust)
Experience with large-scale, high-concurrency production serving
Experience with GPU inference engines (vLLM, SGLang, Triton, TensorRT-LLM, etc.)
Strong background in system optimizations: batching, caching, load balancing, parallelism
Low-level inference optimizations: GPU kernels, code generation
Algorithmic inference optimizations: quantization, speculative decoding, distillation, low-precision numerics
Experience with testing, benchmarking, and reliability of inference services
Experience designing and implementing CI/CD infrastructure for inference

Responsibilities

Architect and implement scalable distributed infrastructure for model serving (load balancing, auto-scaling, batch scheduling, global KV cache)
Optimize latency and throughput of model inference under real production workloads
Build reliable, high-concurrency serving systems that serve billions of users with 100% uptime, 0% error rate, and excellent tail latency
Benchmark, fine-tune, and accelerate inference engines (including low-level GPU kernel work and code generation)
Develop custom tools to trace, replay, and fix issues across the full stack — from orchestration down to GPU kernels
Create robust CI/CD infrastructure for seamless endpoint deployment, image publishing, and inference engine updates
Accelerate research on scaling test-time compute, RL rollout, and model-hardware co-design for next-generation systems

Technologies

C
C++
Rust
vLLM
SGLang
Triton
TensorRT-LLM
GPU kernels
Code generation
Load balancing
Auto-scaling
Batch scheduling
Global KV cache
CI/CD infrastructure
