Experience
03.2024 - 08.2024 | Candy.AI | Remote

AI/ML Lead (Contract)

Single-handedly scaled AI capabilities from a nascent stage to millions of users. Rebuilt the platform from a handful of Azure VM GPUs into a hyperscale, autoscaling GKE-on-GCP fleet. Engineered the foundations of the data pipelines and built a broker-based inference layer for high-throughput generative workloads. Stood up intra-cluster distributed storage for model weights and adapters, accelerating cold starts and cutting storage costs.

5 direct reports | AI | Infrastructure
Key wins
Scale from thousands to millions
Built from scratch: pure IaC
GKE GPU cost-efficient fleets
Technologies
GCP, GKE, Terraform, Helm, ArgoCD, GitOps, K6, PyTorch/Transformers, Stable Diffusion, LLM, Pub/Sub
Responsibilities & achievements

Single-handedly revolutionized a startup's AI capabilities, scaling them from a nascent stage to millions of users.

Platform & Orchestration

  • Fleet rebuild: from a handful of Azure VM GPUs to a fully orchestrated, hyperscale GKE on GCP environment with NVIDIA GPU autoscaling, supporting multi-GPU and diverse generative workloads
  • Cloud-native IaC discipline: Terraform, Helm, ArgoCD, and GitOps
  • Just-in-time launch: delivered the GKE fleet ahead of media and TV exposure — fleet held the resulting ramp from hundreds of thousands to millions of users at 99.99% uptime
  • Dev → prod gating: containerized AI workflows and inference for consistent deployment and controlled promotion across environments

Inference & Generative Workloads

  • Data pipelines: engineered the foundations that fed training and inference workflows
  • 'SuperBooga' inference broker: an event-driven bus fronting GPU inference — nodes pull from a pub/sub queue and serve one request at a time per GPU, avoiding the performance hit a single GPU takes from concurrent inference
  • GPU & instance right-sizing: used performance and load testing to match the appropriate GPU and instance family to each workload while keeping a cost-efficient footprint
  • Distributed weights & adapters: intra-cluster storage kept in sync with the upstream source and fanned out to inference containers on demand, accelerating cold-start times and cutting storage costs
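
The pull-based, one-request-per-GPU pattern described for the inference broker can be illustrated with a minimal sketch. This is not the actual 'SuperBooga' code: an in-process `queue.Queue` stands in for the pub/sub queue, and the names `gpu_worker`, `serve`, and `run_inference` are hypothetical, chosen only for illustration.

```python
import queue
import threading


def gpu_worker(gpu_id, jobs, results, run_inference):
    """Pull one job at a time, so this GPU never serves two requests at once."""
    while True:
        job = jobs.get()              # blocks until a job (or sentinel) arrives
        if job is None:               # sentinel: no more work for this worker
            break
        # The next job is only pulled after this call returns.
        results.put((job, gpu_id, run_inference(gpu_id, job)))


def serve(requests, num_gpus, run_inference):
    """Fan requests out to one worker per GPU; jobs must be hashable here."""
    jobs = queue.Queue()              # assumption: stands in for the pub/sub queue
    results = queue.Queue()
    workers = [
        threading.Thread(target=gpu_worker, args=(i, jobs, results, run_inference))
        for i in range(num_gpus)
    ]
    for w in workers:
        w.start()
    for r in requests:
        jobs.put(r)
    for _ in workers:                 # one shutdown sentinel per worker
        jobs.put(None)
    for w in workers:
        w.join()

    out = {}
    while not results.empty():
        job, gpu_id, output = results.get()
        out[job] = (gpu_id, output)
    return out
```

Because workers pull rather than receive pushed work, a busy GPU simply does not take another request until it finishes the current one, and backlog accumulates in the queue instead of on the device.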

ML Engineering & Practice

  • End-to-end product delivery: translated aspirational & conceptual product requirements into shipped features — both the supporting software and custom-trained PyTorch/Transformers models
  • Performance discipline: custom stress-testing for LLM and Stable Diffusion, anchored in a K6-driven load-testing and observability culture