Tony Holmes

Tony Holmes Email and Phone Number

SRE Leader @ Apple
Mountain View, CA, US
Tony Holmes's Location
Mountain View, California, United States
Tony Holmes's Contact Details

Tony Holmes personal email

n/a
About Tony Holmes

Seasoned veteran with extensive engineering leadership experience across multiple industries. Highly effective customer- and product-centric leader specializing in building high-performance teams and connecting every level of an offering to the customer experience via cutting-edge observability and insights.

Tony Holmes's Current Company Details
Apple

SRE Leader
Mountain View, CA, US
Website:
apple.com
Employees:
163018
Tony Holmes Work Experience Details
  • Apple
    SRE Leader
    Apple
    Mountain View, CA, US
  • Affirm
    Head of SRE
    Affirm Apr 2024 - Present
    San Francisco, California, US
    Building and evolving the SRE function at Affirm.
  • Apple
    SRE Leader
    Apple Feb 2022 - Mar 2024
    Cupertino, California, US
    The ACS Platform SRE team is responsible for building and operating the platform on which the customer-loved Apple Services are built. I led the SRE teams responsible for FaceTime, iMessage, iCloud Edge, Push Notification, and inter-service communications.
    ● Created a culture of strong psychological safety, enabling engineers to be honest and open
    ● Empowered SMEs to drive decisions and inform leadership instead of "asking permission"
    ● Applied toil reduction measures to reduce on-call load and interrupts, increasing project throughput
    ● Eliminated on-call overload by aligning teams with, and funding only, the most critical services
    ● Decisively eliminated toxic behaviors, and directly resolved pre-existing inter-team and interpersonal conflicts
    ● Effectively used the Code Red framework to provide stability and focus for critical service improvement
    ● Championed strategic bets where expected impact exceeded normally accepted (short) timeframes
    ● Technologies: Java, Go, Python, k8s, L4/L7 LBs (nginx, custom), DNS, service discovery, BGP, ECMP, DDoS, (m)TLS
    Outcomes and Results:
    ● Toil/operational work reduced from 80% (Apr 2022) to 63% (Feb 2023)
    ● Promotion of a manager from M1 to M2, and of 5 engineers to Senior
    ● Defined and rolled out Service Criticality to all of iCloud
    ● Improved overall iCloud service intercommunication stability
  • Wish
    Head of Site Reliability Engineering (SRE)
    Wish Mar 2021 - Jan 2022
    San Francisco, CA, US
    The SRE team was responsible for defining and implementing SRE principles across Wish.
    ● Transformed the SRE organization at Wish from an operational to a customer-service-centric model
    ● Created and codified critical business-wide processes including platform/service criticality, incident management (IM), unbiased performance/promo templates and process, SLI/SLO creation, precise service performance measurement, and improved on-call
    ● Implemented toil reduction strategies including SLO definition, aligned alerts against SLOs, and enabled self-service
    ● Instilled a culture of constant incremental improvement and earlier SRE engagement by implementing the PRR (production readiness review) process directly into the SDLC workflow, to streamline and accelerate launches
    ● Established relationships with Engineering, Product, Infrastructure, and Security orgs to champion Reliability focus, align the SRE roadmap with key initiatives, and steer skill priorities for SRE hiring
    ● Mentored Product and Engineering leads on Reliability and Software best practices and org structure
    ● Enabled engineering teams to improve services by surfacing performance and cost efficiency data
    ● Technologies: k8s, L4/L7 LBs (nginx, ELB), AWS, DNS, HashiCorp Vault, BGP, ECMP, Python, Go, Prometheus
    Outcomes and Results:
    ● Delivered a clear prioritization framework and rubric for consistent company-wide use
    ● Doubled SRE team size by attracting and Staff+ SREs to own and lead domains
    ● Shortened the new-product production deployment process from 16+ to 6 weeks
    ● Successful adoption of Code Yellow and large-scale IM process across organizations
    ● Reduced SLI measurement error by 94% and reduced end-user latency by 40%
    ● Increased utilization of cloud resources from <25% to 55% on average, saving $4.5M monthly
  • YouTube
    Site Reliability Engineering Manager
    YouTube Dec 2019 - Feb 2021
    San Bruno, CA, US
    The YouTube Search/ML SRE team owned the Search and Machine Learning infrastructure. When I was hired, the team owned 5 total service domains, which imposed too much cognitive load. My first act was to identify domains of alignment and reorganize into three teams.
    ● Resolved gaps in the feature cost tooling (predictive cost/capacity impact) by identifying the source of the gap, finding a complementary solution, PoC'ing, and driving adoption through all of YouTube
    ● Developed programs to improve End User qualitative experience and create contextual mitigation strategies
    ● Leveraged previous remote-team experience to seamlessly transition into a Work From Home context during COVID onset, successfully onboard new hires, and maintain strong team communications and delivery velocity
    ● Championed the End-User journey as the primary health metric (versus system health)
    Outcomes and Results:
    ● Drove a 95% reduction in End User session-level errors of all severities
    ● Served a 35% COVID-19 increase in traffic while decreasing deployed footprint by 25%
    ● Identified and fixed a low-level issue, saving 10% in service-dedicated server costs
    ● Reduced Search p99 tail latency from 120s+ to 20ms
  • Riot Games
    Sr Engineering Manager
    Riot Games Aug 2018 - Jul 2019
    Los Angeles, CA, US
    Led the Core Identity Services team responsible for Identity, DNS, and Authentication platform services.
    ● Developed and mentored Managers, Leads, and ICs in career and craft development
    ● Co-driver of a new org structure to increase the engineering-to-manager ratio from 1:1 to 3:1
    ● Engaged with customers and stakeholders outside of the org to build strong alliances and identify gaps in current solutions to inform future strategies and roadmaps for a cloud-first PaaS offering
    ● Established the first formal documented PIP process, which became the template
    Outcomes and Results:
    ● Rebuilt the team to address performance issues, resulting in 10x delivery throughput
    ● Delivered a successful DNS consolidation in 6 weeks after 3 previous failed attempts over 5 years
  • Netflix
    Sr Engineering Manager
    Netflix May 2016 - Aug 2018
    Los Gatos, CA, US
    The Labs team is responsible for supporting the platform and availability of the internal developer version of the Netflix application, managing pre-production and released external partner hardware (e.g., LG, Sony) for certification, and supporting the internal Certification and Test teams and external partners through the certification process.
    ● Restructured the single team into 3 domains of focus, and drove the creation of a test and certification PaaS offering supporting internal and external stakeholders, including bootstrapping SRE into the local org
    ● Mentored ICs, Leads, and Managers in career and craft development, in both technical and soft skills
    ● Engaged with partner teams to effectively surface critical metrics and signals for alerting
    ● Introduced cloud and datacenter paradigms to build a new high-density device test environment
    ● Created a physical device container and pluggable interface (patented) to simplify deployment
    ● Unified Netflix Reference Application (RefApp) pipelines into a single release framework that supported diverse cadences for external customers (6 month) and internal development (rapid iteration)
    Outcomes and Results:
    ● Increased device (consumer quality) availability from 40% to 96%, and RefApp use for testing by 100x
    ● Decreased AWS cost from $1.2M to under $200k yearly
    ● Reduced average RefApp deployment time from 3.5 days to 10 seconds
  • LinkedIn
    Site Reliability Engineering Manager
    LinkedIn Sep 2015 - May 2016
    Sunnyvale, CA, US
    Led the Identity Site Reliability team responsible for all Profile Data, Connections, and Network Service (SaaS).
    ● Resolved deep morale and technical issues, and rebuilt the group into a motivated and cohesive team
    ● Mentored and developed each of the engineers into the SME for their areas of interest and development
    ● Implemented SRE principles to align and prioritize work via OKRs, SLOs, and a toil reduction plan
    ● Improved the quality of the observability/metrics platform by eliminating 95% of (non-useful) data
    Outcomes and Results:
    ● Mentored an IC into management, one to Staff level, one to Senior level
    ● Established and published the first internal service SLOs and drafted SLAs, used OKRs to prioritize work
    ● Scaled profile capacity from 400k rps to 1 million rps
    ● Built a measurement system to measure and validate capacity and SLOs
    ● Delivered the first and only zero-incident year end / New Year (2016)
  • Sysomos / Marketwired
    Sr Engineering Manager
    Sysomos / Marketwired Jul 2013 - Jun 2015
    The Systems Engineering team owned all aspects of the Sysomos social analytics and Marketwired PR service including the infrastructure, networking, colocation, systems management, security, deployments, and HA / disaster recovery. I led a team of 2 Managers, 2 engineering ICs, and 45 Contractors.
    ● Drove the move from an ad-hoc, unstructured platform to an SOA-based 3-tier architecture, allowing the legacy systems to operate and scale without code change
    ● Led the design and build-out of an 8000 node Hadoop platform across two datacenters, and used this as an opportunity to mentor the Managers and ICs into primary ownership of domains
    ● Spearheaded the design of a next-gen virtualized Hadoop PoC with IBM and our developers to reduce data-center footprint by 98%
    ● Managed all aspects of the RFP and RFQ process, OPEX and CAPEX reporting for the infrastructure / hardware, colocation and datacenter contracts, and service partnerships
    Outcomes and Results:
    ● Automated deployments reduced delivery from days to minutes
    ● Reduced data-center costs by 30% by scaling down systems during off-peak periods
    ● Reduced incident rates from several per night to one per week
    ● SOA pattern decoupled and isolated legacy code, accelerating new development
  • Datawire Communication Networks Inc. / First Data
    Team Lead/Senior Systems Administrator
    Datawire Communication Networks Inc. / First Data May 2010 - May 2013
    The Systems Platform team was responsible for all operational aspects of a PCI-1 / SOX compliant transaction transport network supporting 330,000 merchants in 28 datacenters worldwide.
    ● Primary SME for yearly PCI / SOC compliance audits
    ● Drove modernization of the platform from single-use servers to a virtualized environment, improving security, observability, deployments, and rollbacks
    ● Planned and executed a mission-critical load balancer migration and modernization of the observability stack
    ● Lead architect of distributed back-office environment to improving Primary Security SME for and SSL/TLS services, responsible for exploit mitigation and large certificate migration
    ● Designed and deployed SSL acceleration to improve capacity and performance
    Outcomes and Results:
    ● Back office: SLA availability improved from 99.5% to 99.99%, costs reduced by 73%, and record update latency cut from 5min to under 1s
    ● SSL improvements increased total network capacity from 20k rps to over 120k rps with no additional hardware
    ● Virtualization platform reduced deployment/rollback from 5 hours to 30mins (<1min emergency)
  • Sunwing Travel Group / Signature Vacations
    Team Lead/Senior Unix Administrator
    Sunwing Travel Group / Signature Vacations Oct 2007 - Apr 2010
    Team Lead of the Systems Admin team focused on security best practices, business continuity and disaster recovery plans, budgeting, and infrastructure evolution planning.
    Primary responsibility was the security, reliability, and performance of heterogeneous Unix systems, storage, and WebLogic-based Java applications.
    Drove automation efforts to minimize manual intervention and reduce downtime.
  • Crosswinds Internet Communications Inc
    Co-Founder / CTO / Systems Architect
    Crosswinds Internet Communications Inc 1997 - Oct 2007
    Architected and deployed a secure and reliable free Web and Email based service for 1.8 million users.
    100% distributed office; led a team of 12 engineers around the world to deliver services.
    Responsible for all technical systems for production service, internal development, and productivity.
  • Leitch Technology International Inc
    Design Engineer
    Leitch Technology International Inc Apr 1996 - Jun 1998
    Designed and built custom hardware and OS for video data and control network (Still File projects).
    Created roadmap and implemented mixed Unix/Windows network for global R&D with disaster recovery, data backup and recovery, and redundant connectivity.
    Reverse engineered and built a clean-room implementation of a compatible IPX protocol for FreeBSD.
    Contributed to software and hardware process standards, security and performance consulting.

Tony Holmes Skills

Linux Unix High Availability Servers Networking Data Center System Architecture Load Balancing Disaster Recovery Cloud Computing Network Security Team Leadership Operating Systems MySQL Network Architecture Integration Shell Scripting Unix Shell Scripting VPN Ubuntu Red Hat Linux IT Infrastructure Operations TCP/IP Network Monitoring Tools Apache Hardware FreeBSD Virtualization Perl Security Firewalls Windows PHP HP-UX LAMP Administration Xen DRBD VMware ESX ESXi OSPF PostgreSQL DNS Hadoop IT Strategy System Deployment Technical Vision Agile Methodologies Distributed Systems Scalability Organizational Design Software Development Life Cycle Software Development Strategy Performance Management Engineering Project Management Web Services Scrum Amazon Web Services Python SQL Leadership Management Performance Improvement Cross Functional Team Leadership Technical Direction Site Reliability Engineering Microservices Engineering Leadership

Tony Holmes Education Details

  • University of Waterloo
    Computer Engineering with Co-op

Frequently Asked Questions about Tony Holmes

What company does Tony Holmes work for?

Tony Holmes works for Apple.

What is Tony Holmes's role at the current company?

Tony Holmes's current role is SRE Leader.

What is Tony Holmes's email address?

Tony Holmes's email address is to****@****nds.net

What schools did Tony Holmes attend?

Tony Holmes attended University of Waterloo.

What are some of Tony Holmes's interests?

Tony Holmes is interested in Children, Economic Empowerment, Politics, Environment, Education, Science and Technology, Disaster and Humanitarian Relief, and Health.

What skills is Tony Holmes known for?

Tony Holmes has skills like Linux, Unix, High Availability, Servers, Networking, Data Center, System Architecture, Load Balancing, Disaster Recovery, Cloud Computing, Network Security, Team Leadership.

Who are Tony Holmes's colleagues?

Tony Holmes's colleagues are Paige Miller, Rao Priyanka, Kapil Singh, Sriram Moorthy, Joy Chinen, Yiğit Efe Karaş, Alixandra Peña.
