03.2024 - 08.2024 | Candy.AI | Remote

AI/ML Lead (Contract)
Single-handedly scaled AI capabilities from a nascent stage to serving millions of users. Rebuilt the platform from a handful of Azure VM GPUs into a hyperscale, autoscaling GKE-on-GCP fleet. Engineered the foundational data pipelines and built a broker-based inference layer for high-throughput generative workloads. Stood up intra-cluster distributed storage for model weights and adapters, shortening cold starts and cutting storage costs.
5 direct reports | AI | Infrastructure
Key wins
Scale from thousands to millions
Built from scratch: pure IaC
Cost-efficient GKE GPU fleets
Technologies
GCP | GKE | Terraform | Helm | ArgoCD | GitOps | K6 | PyTorch/Transformers | Stable Diffusion | LLM | Pub/Sub
Responsibilities & achievements
Single-handedly revolutionized a startup's AI capabilities, scaling them from a nascent stage to serving millions of users.
Platform & Orchestration
- Fleet rebuild: from a handful of Azure VM GPUs to a fully orchestrated, hyperscale GKE-on-GCP environment with NVIDIA GPU autoscaling, supporting multi-GPU and diverse generative workloads
- Cloud-native IaC discipline: Terraform, Helm, ArgoCD, and GitOps
- Just-in-time launch: delivered the GKE fleet ahead of media and TV exposure; the fleet held the resulting ramp from hundreds of thousands to millions of users at 99.99% uptime
- Dev → prod gating: containerized AI workflows and inference for consistent deployment and controlled promotion across environments
Inference & Generative Workloads
- Data pipelines: engineered the foundations that fed training and inference workflows
- 'SuperBooga' inference broker: an event-driven bus fronting GPU inference; nodes pull requests from a pub/sub queue and serve one request at a time per GPU, avoiding the performance hit a single GPU takes from concurrent inference
- GPU & instance right-sizing: through performance and load testing, matched the appropriate GPU and instance family to each workload while keeping a cost-efficient footprint
- Distributed weights & adapters: intra-cluster storage kept in sync with the upstream source and fanned out to inference containers on demand, shortening cold-start times and cutting storage costs
ML Engineering & Practice
- End-to-end product delivery: translated aspirational and conceptual product requirements into shipped features, covering both the supporting software and custom-trained PyTorch/Transformers models
- Performance discipline: custom stress-testing for LLM and Stable Diffusion, anchored in a K6-driven load-testing and observability culture