01.2021 - 08.2022 · foodpanda · Singapore

Principal DevOps Engineer

Ran APAC infrastructure for the Delivery Hero group on a macro-scale ArgoCD ecosystem, standing up new Local Business Units, countries, and regions quickly and safely. Drove self-service GitOps that absorbed the bulk of Jira hot-requests, contributed upstream to argo-rollouts, ingress-nginx, and Atlantis, and kept a fleet of thousands of distributed services running without a self-inflicted incident.

7 direct reports · APAC SRE

Key wins

Macro-scale ArgoCD

Multiple upstream contributions

Self-service IaC state

APAC LBU enablement

Technologies

Kubernetes, ArgoCD, ArgoRollouts, ArgoWorkflows, Terraform, Python, Go, Helm, Kustomize, ingress-nginx, Atlantis, AWS, EKS, Datadog

Responsibilities & achievements

The first Principal Engineer role of its kind at foodpanda, running APAC infrastructure not just for foodpanda but across the entire Delivery Hero group. Each cluster served a self-contained Local Business Unit: a country or region running its own business independently. Large-scale containerised computing, driven by IaC, on high-availability, high-concurrency systems.

Platform & Orchestration

Kubernetes at scale, with high concurrency and resilience: thousands of distributed services, thousands of nodes, and tens of thousands of ingress resources under management
Cluster lifecycle on AWS/EKS: Terraform-imported the existing clusters into a blue/green blueprint, then spun up new clusters inside the same VPC and network, communicating internally and presenting as a single perimeter entity, a "metacluster"
Macro-scale ArgoCD ecosystem: a custom plugin, a shared chart, and a layered DRY footprint repo. Reproducible infrastructure built on meta-modular abstractions, so the group could stand up new Regions and Countries quickly and safely
ArgoCD and ArgoRollouts (Design, Deployment, Customisation, Workshops)
Kubernetes Tailoring (MPA, Controllers, Advanced Scheduling, Affinity, etc.)
Self-service GitOps absorbed the bulk of the Jira Service Desk hot-request types — engineers spent their time reviewing PRs instead of crafting them

Reliability, Cost & Resilience

Observability: Datadog as the observability stack across the fleet
Cost engineering: introduced Spot and hybrid capacity, with on-demand fallback to tolerate instance-exhaustion events
GameDays: twice-yearly drills exercising failure scenarios — AZ-outage availability among them

People & Practice

Led and looked after a small, talented APAC SRE squad of 7, with zero attrition and zero self-inflicted incidents during my tenure
Mentorship & upskilling: invested in team certifications (Terraform especially); the modules they shipped reached a level the team had not produced before, with continuous support so no engineer flew alone
Platform engineering at group scale: the macro-scale ArgoCD ecosystem and self-service GitOps enabled developer teams across many Local Business Units to ship to production quickly and safely
Hackathons: drove the hackathon program with deliberate themes — disabilities and accessibility among them, surfacing features like ingredient-to-condition filtering (diabetes, allergies, G6PD, gluten) and visual-impairment support, several of which shipped to production
Tooling & Automation (Python, Terraform, Go, JS, HTML, CSS, Helm, Kustomize)
Infrastructure Security and Hardening (Topology & Perimeter)

Upstream Contributions

argo-rollouts: advanced canary across multiple rollout providers, so north-south and east-west traffic splits can be routed independently
kubernetes/ingress-nginx: incremental update capacity for fleets running tens of thousands of ingress resources, plus controller-runtime observability (timings, counts)
Atlantis: hardened the admin portal and added proper SemVer support, so Terraform modules absorb minor Terraform binary updates automatically across a large IaC codebase

Share is Care