Experience
03.2024 – 08.2024 · Candy.AI · Remote

AI/ML Lead (Contract)

Single-handedly scaled the startup's AI capabilities from a nascent prototype to serving millions of users. Architected hyperscale Kubernetes infrastructure on GCP. Designed a state-of-the-art inference platform with up to 400% efficiency improvement. Pioneered "SuperBooga," a broker-based solution for high-throughput generative inference.

5 direct reports · AI · Infrastructure
Key wins
  • Scaled from thousands to millions of users
  • Built from scratch: pure IaC
  • Cost-efficient GKE GPU fleets
Technologies
GCP · GKE · Terraform · Helm · GitOps · K6 · PyTorch · Stable Diffusion · LLM
Responsibilities & achievements
  • Single-handedly scaled a startup's AI capabilities from a nascent prototype to serving millions of users
  • Architected and implemented a robust, hyperscale Kubernetes infrastructure on GCP, evolving from hand-managed VM instances to a fully orchestrated, cloud-native environment supporting multi-GPU and diverse generative workloads
  • Leveraged Terraform, Helm, and GitOps to build a comprehensive Infrastructure as Code (IaC) solution
  • Scaled seamlessly from thousands to millions of users while maintaining 99.999% uptime
  • Designed and implemented a state-of-the-art inference platform, dramatically reducing operational costs and improving model-serving efficiency by up to 400%
  • Pioneered "SuperBooga," a low-latency, high-throughput broker-based solution that sustains high-pressure generative inference without system collapse for both LLM and Stable Diffusion workloads
  • Containerized AI workflows and inference, ensuring consistent deployments across dev and prod environments
  • Spearheaded a culture of rigorous load testing (using K6) and observability
  • Developed custom stress-testing solutions for LLM and Stable Diffusion
  • Transformed "star-trek-fantasy" JIRA tickets into tangible PyTorch models
  • Cut costs by over two-thirds (a 200%+ reduction) while performing at a 99.999% SLA
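The broker-based pattern behind an item like "SuperBooga" can be sketched in a few lines. This is a minimal illustration only, under assumed names (`InferenceBroker`, `submit`, `worker` are hypothetical; the actual SuperBooga design is not public): a bounded queue sits between clients and GPU workers, so bursts of requests apply back-pressure instead of collapsing the serving layer.

```python
import queue
import threading

class InferenceBroker:
    """Hypothetical sketch of a broker between clients and inference workers."""

    def __init__(self, max_pending: int = 100):
        # Bounded queue: once it fills, producers block (back-pressure)
        # rather than overwhelming the workers behind it.
        self.jobs: queue.Queue = queue.Queue(maxsize=max_pending)

    def submit(self, prompt: str) -> queue.Queue:
        # Each request carries its own single-slot reply queue.
        reply: queue.Queue = queue.Queue(maxsize=1)
        self.jobs.put((prompt, reply))  # blocks when the broker is saturated
        return reply

    def worker(self, model) -> None:
        # Workers pull jobs at their own pace, decoupled from request rate.
        while True:
            prompt, reply = self.jobs.get()
            reply.put(model(prompt))  # run inference, hand back the result
            self.jobs.task_done()

broker = InferenceBroker(max_pending=8)
# Stand-in "model": a real worker would wrap an LLM or Stable Diffusion pipeline.
threading.Thread(target=broker.worker,
                 args=(lambda p: p.upper(),), daemon=True).start()

result = broker.submit("hello").get(timeout=5)
```

The key design choice is the bounded queue: workers never see more load than they can drain, which is what keeps throughput high without latency collapse under pressure.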