Cloud Hardware Development Engineer, Cloud AI/ML/storage server teams

ID: 4839

Type: Full-time

Category: Others

Company Name: Amazon Data Services, Inc.

Location: USA, WA, Seattle; USA, CA, Cupertino - Seattle - United States

Salary: 157,300.00 - 212,800.00 USD annually (Cupertino); 136,000.00 - 184,000.00 USD annually (Seattle)

Job Description

As a Cloud Hardware Development Engineer, you will be an end-to-end owner of storage and/or accelerator (AI/ML/GPU) server platforms — from New Product Introduction (NPI) through fleet health in production. You own the full lifecycle: design, development, qualification, launch, and ongoing operational excellence of servers running at scale in the AWS fleet.

You will work closely with internal customers to understand their technical needs and business goals, leveraging your experience with server design and the knowledge of various teams to architect solutions we deploy at scale. To deliver your products, you will work with an interdisciplinary team of component, firmware, power, mechanical, electrical, test, qualification, and manufacturing engineers, and lead our ODMs (design and manufacturing partners) to bring these servers to the data center. After launch, you own the fleet: monitoring quality, driving reliability improvements, and ensuring servers continue to meet customer requirements throughout their operational life.

This role demands deep technical curiosity and the willingness to jump in and personally solve the hardest problems. When a complex system failure occurs — whether during NPI qualification or in a production fleet of hundreds of thousands of servers — you roll up your sleeves, dive into the details across hardware, firmware, software, and physical layers, and drive to root cause. You don't wait for someone else to figure it out.

You will own end-to-end system reliability — proactively identifying deficiencies and driving toward zero-touch operations where automation detects, diagnoses, and resolves issues before customer impact. You will decompose complex server system problems (testability, reliability, diagnostics) into deliverable tasks and features, leading delivery yourself and through others in parallel.

This is a fast-paced, intellectually challenging position. You'll work with thought leaders in multiple technology areas, hold high standards for yourself and everyone you work with, and constantly look for ways to improve your products' performance, quality, and cost. We're changing an industry, and we want individuals who are ready for this challenge and want to reach beyond what is possible today.


Key job responsibilities
NPI — New Product Introduction
- Own the end-to-end NPI lifecycle for storage and/or accelerator (AI/ML/GPU) server platforms — from architecture definition through design, qualification, manufacturing ramp, and launch
- Lead technical solutions for complex server and rack system architectural challenges
- Work with ODM/manufacturing partners to develop, validate, and manufacture server products at scale
- Develop functional specifications, design verification plans, and test procedures
- Drive qualification and readiness milestones, ensuring new platforms meet performance, reliability, and cost targets before fleet deployment
- Identify and resolve technical risks early in the development cycle — don't let problems reach production

Fleet Health, Diagnostics & Automation
- Own fleet health for the server platforms you launch — reliability doesn't end at ship
- Design and implement predictive failure detection systems using telemetry, sensor data, error trending, and log correlation to identify hardware issues before they cause customer impact
- Drive toward zero-touch operations: help build detection, diagnosis, and remediation of faults without human intervention
- Debug complex system failures in time-sensitive settings — personally diving deep when the problem demands it
- Perform root cause analysis correlating across firmware, kernel, driver, thermal, power, and physical layers

Systems Design & Technical Depth
- Apply expertise across hardware, software, system design, x86 architecture, processes, and operations (compute, storage, network, GPU)
- Design and implement solutions to address system-level issues at large scale
- Decompose complex server system problems (testability, reliability, diagnostics) into deliverable tasks and features
- Collaborate with hardware, software, manufacturing, supply chain, and product management teams

Cross-Team Collaboration
- Work closely with internal customers to ensure new server hardware meets data path and control path requirements
- Identify potential problems early when onboarding new servers into customer ecosystems
- Collaborate across Hardware Engineering, component, firmware, test, qualification, and integration teams
- Partner with datacenter operations to close the loop between field failures and design improvements

A day in the life
Your day-to-day responsibilities include interfacing with internal and external customers to understand product requirements and facilitate system development on top of your server designs. You will learn about the operational challenges facing our existing fleet, with the goal of improving the current customer experience and developing improved systems for future designs. You will work directly with vendors and ODMs (manufacturing partners) to scale your product. Some days you're reviewing a new platform design with your ODM; other days you're deep in logs and telemetry data chasing a failure mode across the fleet. You thrive on that range.

Basic Qualifications

- Experience in developing functional specifications, design verification plans and functional test procedures
- Bachelor's degree or above in electrical engineering, computer engineering, or equivalent
- Proficiency in English-language communication, both written and verbal
- Experience with design & innovation and research & development
- Knowledge of operating systems, hardware, storage, network, security, database administration and cloud infrastructure
- Experience in server technologies such as thermal, mechanical, power, and signal integrity
- 5+ years of professional work (non-internship) experience

Preferred Qualifications

- 5+ years of experience in hardware design and validation of components, subsystems, and systems
- Experience in server technologies: board design, high-speed bus design and signal integrity, failure analysis, server components (CPU, GPU, SSDs, memory), BIOS, BMC, and networking
- Experience developing and executing test procedures for mechanical or electrical systems/components
- Experience working with ODMs/manufacturers through the product development and manufacturing lifecycle
- Experience building predictive failure detection or proactive remediation systems at fleet scale
- Experience with storage/compute/GPU/accelerator platforms including integration, diagnostics, or performance validation
- Familiarity with PCIe topology, NVLink, NVMe, and accelerator interconnects
- Experience with large-scale datacenter or cloud environments

Amazon is an equal opportunity employer and does not discriminate on the basis of protected veteran status, disability, or other legally protected status.

Los Angeles County applicants: Job duties for this position include: work safely and cooperatively with other employees, supervisors, and staff; adhere to standards of excellence despite stressful conditions; communicate effectively and respectfully with employees, supervisors, and staff to ensure exceptional customer service; and follow all federal, state, and local laws and Company policies. Criminal history may have a direct, adverse, and negative relationship with some of the material job duties of this position. These include the duties and responsibilities listed above, as well as the abilities to adhere to company policies, exercise sound judgment, effectively manage stress and work safely and respectfully with others, exhibit trustworthiness and professionalism, and safeguard business operations and the Company’s reputation. Pursuant to the Los Angeles County Fair Chance Ordinance, we will consider for employment qualified applicants with arrest and conviction records.

Our inclusive culture empowers Amazonians to deliver the best results for our customers. If you have a disability and need a workplace accommodation or adjustment during the application and hiring process, including support for the interview or onboarding process, please visit https://amazon.jobs/content/en/how-we-hire/accommodations for more information. If the country/region you’re applying in isn’t listed, please contact your Recruiting Partner.

The base salary range for this position is listed below. Your Amazon package will include sign-on payments and restricted stock units (RSUs). Final compensation will be determined based on factors including experience, qualifications, and location. Amazon also offers comprehensive benefits including health insurance (medical, dental, vision, prescription, Basic Life & AD&D insurance and option for Supplemental life plans, EAP, Mental Health Support, Medical Advice Line, Flexible Spending Accounts, Adoption and Surrogacy Reimbursement coverage), 401(k) matching, paid time off, and parental leave. Learn more about our benefits at https://amazon.jobs/en/benefits.



USA, CA, Cupertino - 157,300.00 - 212,800.00 USD annually
USA, WA, Seattle - 136,000.00 - 184,000.00 USD annually

Company Information

Company Name: Amazon Data Services, Inc.

Company Website: https://aws.amazon.com

Company Address: 410 Terry Ave N, Seattle, WA 98109-5210, United States

Amazon Data Services, Inc. is a corporate subsidiary within the Amazon corporate family that functions as an organizational and operational entity supporting Amazon’s cloud infrastructure and related enterprise technology operations in the United States. The company appears in Amazon’s publicly filed subsidiary lists and corporate filings and is associated with the activities required to own, operate, lease, maintain and administer the physical and network infrastructure that underpins Amazon Web Services (AWS) and other Amazon technology operations. As such, Amazon Data Services, Inc. is best understood not as a separate product-brand consumer-facing company, but as a technology-focused legal and operational arm that facilitates the delivery of large-scale cloud computing, storage, and networking services offered by Amazon’s cloud businesses.

Overview and scope
Amazon Data Services, Inc. operates within Amazon’s broader corporate structure and is engaged primarily in activities that support the provision of cloud computing infrastructure and related services. Public corporate disclosures list the entity among Amazon’s domestic subsidiaries, and business records and filings indicate the company’s role in managing aspects of data center operations and infrastructure assets. While the subsidiary itself does not market a separate set of retail products to end customers under its own brand, it performs essential back-office, property, and operational functions that enable the availability, resilience, and expansion of Amazon’s cloud platforms.

Core business activities
The core activities attributable to Amazon Data Services, Inc. are centered on infrastructure ownership and operations, including matters commonly associated with data center and cloud platform support: acquisition, leasing and management of data center properties; implementation and maintenance of power, cooling and facility systems; coordination of network connectivity and backbone infrastructure; logistics and physical security for technology sites; and compliance with regional regulatory, environmental and safety requirements for infrastructure operations. In addition, the entity plays a role in contractual and administrative arrangements required to support cloud service delivery (such as vendor and supplier agreements for data center equipment, construction and facility services) and in certain cases holds title or leases for physical locations used by Amazon’s cloud and technology businesses.

Relationship to AWS products and services
Although Amazon Data Services, Inc. itself does not market end-user cloud services under a distinct consumer brand, its operations are integral to the delivery of Amazon Web Services (AWS). AWS is the public-facing suite of cloud products provided by Amazon, offering on-demand infrastructure and platform services such as compute (Amazon EC2), object storage (Amazon S3), managed databases (Amazon RDS and Amazon DynamoDB), serverless computing (AWS Lambda), networking services (Amazon VPC, AWS Direct Connect), content delivery (Amazon CloudFront), and a broad portfolio of platform, security and analytics services. The physical infrastructure and site operations managed or supported by entities like Amazon Data Services, Inc. are the foundation on which these AWS offerings are hosted and delivered to customers globally.

Operational and compliance responsibilities
As part of Amazon’s enterprise infrastructure organization, Amazon Data Services, Inc. is involved in ensuring high availability, operational continuity and compliance of critical facilities. This includes meeting industry and regulatory requirements for data center operations, implementing redundancy and disaster recovery planning, and participating in certification processes where applicable. The company’s activities support the technical and operational reliability expected by enterprise and public-sector customers who use cloud services for production workloads. Public communications and regulatory filings from Amazon and AWS describe extensive investments in physical infrastructure, security, and compliance regimes, areas in which data services subsidiaries participate through ownership, management, or contractual arrangements.

Role within Amazon’s corporate and legal structure
Amazon Data Services, Inc. functions as a corporate vehicle used by Amazon to manage specific infrastructure-related assets and obligations. In large multinational technology companies, such subsidiaries are commonly used for administrative clarity, legal and tax structuring, asset management, and focused operational control of real estate and data center portfolios. Filings from Amazon identify numerous related subsidiaries with distinct legal names; Amazon Data Services, Inc. is one such entity that appears across public records and filings associated with Amazon’s technology operations.

Public-facing presence and contact
There is no separate consumer website or distinct public product catalog for Amazon Data Services, Inc.; instead, the public-facing information that relates to the company’s operational domain is made available through Amazon and Amazon Web Services channels. For customers seeking products or services supported by the company’s infrastructure, AWS’s official online resources describe the portfolio of cloud services, technical documentation, operational commitments, and compliance information. For corporate, legal or regulatory inquiries about subsidiary entities, Amazon’s corporate filings and investor relations disclosures provide formal references to the company’s legal entities and their roles within the broader Amazon organization.