Site Reliability Engineer
Current
Mountain View, California, United States
- Spearheaded resource optimization efforts, running stress tests to determine hardware usage limits and personally reclaiming 669 GPUs and over $1.9 million/yr worth of CPU cores.
- Piloted the first migration to Monitoring-as-Code within my department, cutting the time to batch-deploy and update alarms by 97% and unifying alert rules in a single source-of-truth repository.
- Coordinated Disaster Recovery (DR) efforts for my team, overseeing 6 DR drills and 2 stress tests and updating my team's DR plan to align with the company-wide DR Framework.