Site Reliability Engineer
Current
Operational Support & Monitoring
- Serve as the primary operational support for large-scale, distributed applications deployed on AWS and GCP, ensuring system reliability and optimal performance.
- Monitor system availability and evaluate health metrics, implementing proactive alerts to address potential issues before escalation.
- Troubleshoot and resolve production issues across services and hosting stack layers.
- Configure and manage Grafana Cloud with Prometheus for metrics monitoring and Loki for log aggregation, enhancing system observability.
Infrastructure
- Design, build, and maintain scalable infrastructure to support thousands of concurrent users across AWS and GCP environments.