Lead Site Reliability Engineer
Current
- Hands-on experience planning and deploying cluster upgrades and maintenance end to end, ensuring smooth transitions while preserving cluster availability and data integrity.
- Full-stack troubleshooting skills across network, application and distributed services layers.
- Support services from development through production via infrastructure design, software platform development, stress testing, capacity planning, performance tuning, and overall system health monitoring. Maintain services on clusters.
- Hands-on system administration experience, including automation and orchestration of Linux/Windows using Chef, Puppet, Ansible, and containers (Docker, Kubernetes, etc.).
- Work extensively with IaC technologies such as Terraform, using it to automate all infrastructure needs in AWS and Azure.
- Configured application monitoring and alerts for on-prem and cloud applications using Prometheus with Grafana and Dynatrace.