Only AI Jobs


Sr. Technical Program Manager - AI/ML Hardware Health & Stability, Global Data Center Operations

ID: 4683

Type: Full-time

Category: Others

Company Name: Amazon Data Services, Inc.

Location: USA, WA, Seattle - Seattle - United States

Salary: 148,700.00 - 201,200.00 USD annually

Job Description

The Central Operations team within Amazon Web Services (AWS) Infrastructure is seeking a Senior Technical Program Manager to drive the health, stability, and operational excellence of new hardware deployments across our global data center fleet. This role uniquely blends technical program management with strategic account management to ensure our GenAI and high-performance computing infrastructure delivers maximum value to customers.

As a Sr. TPM, you will be the technical advocate and strategic advisor for operational support of new AI/ML hardware platforms. You will serve as the central owner of operational health (failure rate, repair efficacy, repair dwell time, break/fix process improvement) while driving cross-functional initiatives to improve these key performance indicators. You will work at the intersection of hardware engineering, data center operations, and service teams like EC2—translating complex technical data into actionable insights and leading programs that accelerate capacity delivery while maintaining the highest standards of operational health.

This is not a sales role, but rather an opportunity to be the 'voice of the customer' and the 'voice of operations' for critical infrastructure that powers AWS's most demanding workloads. You will craft and execute strategies to optimize new hardware deployments, proactively identify and remediate stability issues, and establish best practices that scale across AWS's global infrastructure.


Key job responsibilities
Hardware Health & Stability Leadership

- Own the end-to-end health and stability metrics for new AI/ML hardware platforms, establishing KPIs and routines that provide real-time visibility into operational performance
- Conduct deep-dive analyses of hardware failures to identify root causes and drive systematic improvements
- Lead cross-functional investigations, experiments, and post-mortem processes, ensuring lessons learned translate into preventive measures and design improvements
- Develop and maintain hardware health scorecards that inform leadership decisions on deployment readiness, capacity planning, and risk mitigation

Technical Program Management

- Manage complex, multi-phase infrastructure projects involving hardware engineering, supply chain, data center operations, and software teams across multiple time zones
- Establish and maintain program schedules, budgets, and resource plans, proactively identifying and mitigating risks to delivery timelines
- Facilitate technical deep dive sessions to troubleshoot diagnostic and repair issues, remove blockers, and accelerate project delivery
- Design and implement processes that eliminate non-value-add activities and optimize deployment velocity without compromising quality

Strategic Account Management

- Serve as the primary operational point of contact for new platforms across software and hardware teams, summarizing platform operational status and path-to-green
- Build trusted advisor relationships with data center operations, hardware engineering, and service teams to understand their operational needs and technical challenges
- Translate operational feedback and customer requirements into engineering priorities and hardware and process improvement roadmaps
- Provide strategic technical guidance on AI/ML deployment strategies, best practices, and operational procedures
- Advocate for operational excellence, ensuring that hardware health considerations are integrated into capacity planning and service delivery decisions

Cross-Functional Collaboration & Influence

- Partner with hardware engineering teams to influence design decisions based on operational data and field performance
- Collaborate with new product introduction and hardware engineering teams to ensure quality gates are met before launch
- Work with monitoring and automation teams to implement appropriate signals to ensure customer commitments are met
- Drive alignment across diverse stakeholders including engineering, operations, finance, and executive leadership
- Present technical assessments and recommendations to senior leadership, clearly articulating trade-offs, risks, and business impact

Basic Qualifications

- 5+ years of technical product or program management experience
- 7+ years of experience working directly with engineering teams
- Experience in root cause analysis and error correction, identifying changes to procedures and systems to implement long-term fixes and avoid repeating issues
- Experience leading process improvements
- Strong written and verbal communication skills, with experience communicating with technical and non-technical audiences, including senior leadership

Preferred Qualifications

- Experience in technical account management, business relationship management, or consulting
- Knowledge of Six Sigma tools, Lean techniques, PMP, or similar standards
- Experience in server technologies such as thermal, mechanical, power, and signal integrity
- Experience managing UltraServer, high-performance computing, or AI/ML infrastructure deployments

Amazon is an equal opportunity employer and does not discriminate on the basis of protected veteran status, disability, or other legally protected status.

Our inclusive culture empowers Amazonians to deliver the best results for our customers. If you have a disability and need a workplace accommodation or adjustment during the application and hiring process, including support for the interview or onboarding process, please visit https://amazon.jobs/content/en/how-we-hire/accommodations for more information. If the country/region you’re applying in isn’t listed, please contact your Recruiting Partner.

The base salary range for this position is listed below. Your Amazon package will include sign-on payments and restricted stock units (RSUs). Final compensation will be determined based on factors including experience, qualifications, and location. Amazon also offers comprehensive benefits including health insurance (medical, dental, vision, prescription, Basic Life & AD&D insurance and option for Supplemental life plans, EAP, Mental Health Support, Medical Advice Line, Flexible Spending Accounts, Adoption and Surrogacy Reimbursement coverage), 401(k) matching, paid time off, and parental leave. Learn more about our benefits at https://amazon.jobs/en/benefits.



USA, WA, Seattle - 148,700.00 - 201,200.00 USD annually

Company Information

Company Name: Amazon Data Services, Inc.

Company Website: https://aws.amazon.com

Company Address: 410 Terry Ave N, Seattle, WA 98109-5210, United States

Amazon Data Services, Inc. is a corporate subsidiary within the Amazon corporate family that functions as an organizational and operational entity supporting Amazon’s cloud infrastructure and related enterprise technology operations in the United States. The company appears in Amazon’s publicly filed subsidiary lists and corporate filings and is associated with the activities required to own, operate, lease, maintain and administer the physical and network infrastructure that underpins Amazon Web Services (AWS) and other Amazon technology operations. As such, Amazon Data Services, Inc. is best understood not as a separate product-brand consumer-facing company, but as a technology-focused legal and operational arm that facilitates the delivery of large-scale cloud computing, storage, and networking services offered by Amazon’s cloud businesses.

Overview and scope

Amazon Data Services, Inc. operates within Amazon’s broader corporate structure and is engaged primarily in activities that support the provision of cloud computing infrastructure and related services. Public corporate disclosures list the entity among Amazon’s domestic subsidiaries, and business records and filings indicate the company’s role in managing aspects of data center operations and infrastructure assets. While the subsidiary itself does not market a separate set of retail products to end customers under its own brand, it performs essential back-office, property, and operational functions that enable the availability, resilience, and expansion of Amazon’s cloud platforms.

Core business activities

The core activities attributable to Amazon Data Services, Inc. are centered on infrastructure ownership and operations, including matters commonly associated with data center and cloud platform support: acquisition, leasing and management of data center properties; implementation and maintenance of power, cooling and facility systems; coordination of network connectivity and backbone infrastructure; logistics and physical security for technology sites; and compliance with regional regulatory, environmental and safety requirements for infrastructure operations. In addition, the entity plays a role in contractual and administrative arrangements required to support cloud service delivery (such as vendor and supplier agreements for data center equipment, construction and facility services) and in certain cases holds title or leases for physical locations used by Amazon’s cloud and technology businesses.

Relationship to AWS products and services

Although Amazon Data Services, Inc. itself does not market end-user cloud services under a distinct consumer brand, its operations are integral to the delivery of Amazon Web Services (AWS). AWS is the public-facing suite of cloud products provided by Amazon, offering on-demand infrastructure and platform services such as compute (Amazon EC2), object storage (Amazon S3), managed databases (Amazon RDS and Amazon DynamoDB), serverless computing (AWS Lambda), networking services (Amazon VPC, AWS Direct Connect), content delivery (Amazon CloudFront), and a broad portfolio of platform, security and analytics services. The physical infrastructure and site operations managed or supported by entities like Amazon Data Services, Inc. are the foundation on which these AWS offerings are hosted and delivered to customers globally.

Operational and compliance responsibilities

As part of Amazon’s enterprise infrastructure organization, Amazon Data Services, Inc. is involved in ensuring high availability, operational continuity and compliance of critical facilities. This includes meeting industry and regulatory requirements for data center operations, implementing redundancy and disaster recovery planning, and participating in certification processes where applicable. The company’s activities support the technical and operational reliability expected by enterprise and public-sector customers who use cloud services for production workloads. Public communications and regulatory filings from Amazon and AWS describe extensive investments in physical infrastructure, security, and compliance regimes, areas in which data services subsidiaries participate through ownership, management, or contractual arrangements.

Role within Amazon’s corporate and legal structure

Amazon Data Services, Inc. functions as a corporate vehicle used by Amazon to manage specific infrastructure-related assets and obligations. In large multinational technology companies, such subsidiaries are commonly used for administrative clarity, legal and tax structuring, asset management, and focused operational control of real estate and data center portfolios. Filings from Amazon identify numerous related subsidiaries with distinct legal names; Amazon Data Services, Inc. is one such entity that appears across public records and filings associated with Amazon’s technology operations.

Public-facing presence and contact

There is no separate consumer website or distinct public product catalog for Amazon Data Services, Inc.; instead, the public-facing information that relates to the company’s operational domain is made available through Amazon and Amazon Web Services channels. For customers seeking products or services supported by the company’s infrastructure, AWS’s official online resources describe the portfolio of cloud services, technical documentation, operational commitments, and compliance information. For corporate, legal or regulatory inquiries about subsidiary entities, Amazon’s corporate filings and investor relations disclosures provide formal references to the company’s legal entities and their roles within the broader Amazon organization.