Sr. System Development Engineer, Cloud AI/ML/storage server teams

ID: 4827

Type: Full-time

Category: Others

Company Name: Amazon Development Center U.S., Inc.

Location: USA, WA, Seattle; USA, CA, Cupertino

Salary: 173,900.00 - 235,200.00 USD annually

Job Description

We are seeking an experienced Systems Development Engineer to lead the development of automation software, diagnostic tooling, and fleet health infrastructure for our server platforms. You will work across multiple teams and organizations to build scalable, reliable systems that keep our storage and accelerated (AI/ML) compute fleet healthy — with a vision toward zero-touch operations where automation detects, diagnoses, and resolves issues without human intervention.

You will be a technical leader solving complex architectural problems that may not be well-defined in advance. You will own your team's systems, proactively identify deficiencies, and write scalable, robust code to solve issues before they impact customers. You will decompose large, difficult server testability, reliability, and diagnosis problems into straightforward tasks and components, leading delivery both yourself and through others in parallel, using a combination of hardware, software, system design, processor architecture, diagnostics, and operations knowledge.

You will collaborate with a variety of roles (SDEs, SDETs, Mechanical/Electrical/Hardware Engineers, TPMs, Managers, Principals) and organizations through server conception, test validation, qualification, launch, and operations — driving high quality and reliability into current and future designs for AWS server solutions. You will also work closely with ODMs and Design Partners to ensure our tooling, diagnostics, and automation requirements are met throughout the hardware development lifecycle (NPI).

Key job responsibilities
Fleet Health & Predictive Infrastructure
- Build and own the automation infrastructure responsible for the health of the server fleet across storage and accelerator (AI/ML) compute platforms
- Design and implement predictive failure detection systems using telemetry, sensor data, error trending, and log correlation to identify hardware issues before they cause customer impact
- Drive toward zero-touch operations — building automation that detects, diagnoses, triages, and remediates hardware and software faults without human intervention
- Develop monitoring tools, dashboards, and alerting systems to provide real-time visibility into fleet health across lab and production environments
- Define and track fleet health metrics (failure rates, mean time to detect, mean time to repair, first-time fix rate, predictive accuracy)

Debugging & Troubleshooting
- Debug and resolve complex system-level issues across storage, compute, GPU, and networking in production environments
- Troubleshoot Linux boot and runtime failures across x86 and ARM architectures, including PCIe, power, NIC, NVMe, and GPU subsystems
- Perform root cause analysis on hardware failures — correlating across firmware, kernel, driver, and physical layer to isolate faults
- Build diagnostic tooling that automates root cause identification and reduces reliance on manual triage

Systems Development & Automation
- Lead the definition and development of software, automation, and enabling tools for server hardware programs; track and report progress
- Design and build scalable system-level software with focus on durability, availability, security, and diagnostics
- Develop and maintain device drivers for Linux on ARM and x86 architectures
- Build automation solutions using modern programming languages (Python, Ruby, Java, C/C++, etc.)
- Work with OS internals, storage subsystems, and accelerator/GPU software stacks in Linux-based environments
- Build, manage, and deploy CI/CD pipelines for rapid deployment of code changes to org-owned and customer-owned systems

Cross-Team Collaboration
- Work across internal HWEng teams to ensure new server hardware addresses data path and control path functionality needed by dependent service teams
- Work closely with internal customers to identify, early on, any potential problems with onboarding new servers (storage or accelerated compute) into their ecosystem
- Engage with ODMs and design partners on testability, diagnostic, and automation requirements during hardware design and development
- Contribute to server design to improve robustness, testability, diagnosability, and reliability
- Partner with datacenter operations teams to close the loop between field failures and design improvements

A day in the life
Systems Development Engineers in AWS Hardware Engineering wear many hats. From orchestration tooling development to hardware integration to kernel driver debugging, we dive deep into problems across the breadth of AWS. Our teams are directly responsible for launching and maintaining server hardware in the fleet — including storage servers powering distributed storage platforms and AI/ML accelerator servers with GPUs. Located in Seattle and Cupertino, we work with internal development teams, ODMs, and design partners to deliver servers deployed in datacenters worldwide.

Basic Qualifications

- 6+ years of non-internship professional software development experience
- 6+ years of systems design, software development, operations, automation, and process improvement experience
- 6+ years of experience designing or architecting (design patterns, reliability, and scaling) new and existing systems
- 5+ years of experience programming with at least one modern language such as C++, C#, Java, Python, Golang, PowerShell, or Ruby
- Experience with Linux/Unix
- Experience leading the design, build and deployment of complex and performant (reliable and scalable) software solutions in production

Preferred Qualifications

- Knowledge of engineering practices and patterns for the full software/hardware/networks development life cycle, including coding standards, code reviews, source control management, build processes, testing, certification, and livesite operations
- Experience taking a leading role in building complex software or computing infrastructure that has been successfully delivered to customers
- Experience building predictive failure detection or proactive remediation systems at fleet scale
- Experience with Linux kernel driver development
- Experience with storage, compute, and GPU/accelerator platforms (NVIDIA), including driver integration, diagnostics, or performance validation
- Experience with distributed storage systems (block, object, or file)
- Familiarity with server hardware architecture, BMC/IPMI, firmware, PCIe topology, NVLink, and hardware diagnostics
- Experience working with ODMs or hardware design partners through the product development lifecycle
- Experience building zero-touch or self-healing automation for large-scale infrastructure
- Experience working in large-scale datacenter or cloud environments
- Track record of rapidly coming up to speed on new engineering disciplines and making impactful decisions
- Experience with hardware bring-up, validation, and fleet-wide deployment
- Familiarity with telemetry pipelines, anomaly detection, and operational metrics at scale

Amazon is an equal opportunity employer and does not discriminate on the basis of protected veteran status, disability, or other legally protected status.

Los Angeles County applicants: Job duties for this position include: work safely and cooperatively with other employees, supervisors, and staff; adhere to standards of excellence despite stressful conditions; communicate effectively and respectfully with employees, supervisors, and staff to ensure exceptional customer service; and follow all federal, state, and local laws and Company policies. Criminal history may have a direct, adverse, and negative relationship with some of the material job duties of this position. These include the duties and responsibilities listed above, as well as the abilities to adhere to company policies, exercise sound judgment, effectively manage stress and work safely and respectfully with others, exhibit trustworthiness and professionalism, and safeguard business operations and the Company’s reputation. Pursuant to the Los Angeles County Fair Chance Ordinance, we will consider for employment qualified applicants with arrest and conviction records.

Our inclusive culture empowers Amazonians to deliver the best results for our customers. If you have a disability and need a workplace accommodation or adjustment during the application and hiring process, including support for the interview or onboarding process, please visit https://amazon.jobs/content/en/how-we-hire/accommodations for more information. If the country/region you’re applying in isn’t listed, please contact your Recruiting Partner.

The base salary range for this position is listed below. Your Amazon package will include sign-on payments and restricted stock units (RSUs). Final compensation will be determined based on factors including experience, qualifications, and location. Amazon also offers comprehensive benefits including health insurance (medical, dental, vision, prescription, Basic Life & AD&D insurance and option for Supplemental life plans, EAP, Mental Health Support, Medical Advice Line, Flexible Spending Accounts, Adoption and Surrogacy Reimbursement coverage), 401(k) matching, paid time off, and parental leave. Learn more about our benefits at https://amazon.jobs/en/benefits.



USA, CA, Cupertino - 173,900.00 - 235,200.00 USD annually

Company Information

Company Name: Amazon Development Center U.S., Inc.

Company Website: https://www.amazon.com

Company Address: 410 Terry Avenue North, Seattle, WA 98109, USA

Amazon Development Center U.S., Inc. is a United States-based legal entity and operating subsidiary affiliated with Amazon.com, Inc., established to support the parent company’s software development, engineering, research and product delivery activities within the United States. As an operating arm within the broader Amazon corporate structure, this entity functions primarily as a development center that consolidates engineering, product management, research and related professional services for Amazon’s array of consumer, retail, cloud and device products. The subsidiary serves as an organizational vehicle for recruiting and employing technical and non-technical staff, conducting product and systems development work, and coordinating cross-functional programs that feed into Amazon’s global platforms and services. At a high level, the core activities of Amazon Development Center U.S., Inc. include software engineering and application development; systems design and infrastructure engineering; data science and machine learning research; product and program management; testing and quality assurance; and operational support for services deployed across Amazon’s businesses. The work performed by teams associated with this development center contributes to the design, build, deployment and maintenance of scalable back-end services, consumer-facing features and enterprise tools. These activities typically span areas such as e-commerce retail systems, inventory and fulfillment technology, payments and checkout systems, recommendation and personalization engines, search and catalog services, advertising technology, cloud infrastructure services, developer tools, voice and device platforms (including Alexa and related embedded software), and mobile and web application front ends. The main products and services supported through the development center are those that reflect Amazon’s primary business lines. 
This includes contributions to Amazon.com’s retail and marketplace platform (catalog management, customer experience features, seller systems and order processing), Amazon Web Services (cloud compute, storage, data services and developer tooling), Alexa and devices (voice assistant features, device firmware, and cloud-based voice services), Kindle and digital content services (content distribution and reading applications), Prime membership features (media delivery, recommendations, and benefits management), and advertising technology (ad serving, targeting and measurement solutions). In each area, the development center’s personnel typically work on software components, APIs, distributed systems, automation, security, reliability engineering and performance optimization to enable scalable, resilient services used by millions of customers worldwide. Beyond product development, the development center supports organizational activities such as engineering hiring and on-boarding, developer training programs, localized research initiatives, collaboration with academic partners, and participation in global product planning and feature roll-outs. Teams at the center often collaborate with other Amazon business units and international development centers to integrate new features into global releases, implement region-specific regulatory or compliance requirements, and adapt large-scale services to meet diverse customer and market needs. Amazon Development Center U.S., Inc. operates within the broader industry positioning of Amazon, which on official company sources identifies itself as a technology company that operates in e-commerce, cloud computing, digital content, and consumer electronics. The subsidiary’s work aligns with those areas by focusing on technology development and operational engineering that underpin Amazon’s public-facing services and internal platforms. 
Official Amazon communications highlight continuous investment in product innovation, infrastructure, and customer experience improvements; the activities of development centers and engineering subsidiaries are core enablers of those investments. In practical terms, the development center serves as a hub for multidisciplinary engineering teams that apply modern software-development practices, cloud-native architectures and machine learning methods to solve large-scale problems. Typical operational emphases include building distributed systems with high availability, implementing secure and compliant data-handling practices, optimizing latency and throughput for customer transactions, automating deployment pipelines and monitoring, and using data-driven experimentation to inform product decisions. While Amazon Development Center U.S., Inc. is a legal corporate entity with administrative and operational responsibilities, its identity and output are tightly integrated with Amazon’s global product and service ecosystem. The value proposition of the center lies in providing specialized engineering capacity and localized operational capabilities that accelerate the creation, iteration and delivery of software and services that support Amazon’s customers, sellers, partners and internal stakeholders. As with other Amazon development centers, the subsidiary contributes to both feature development and the foundational technologies that enable Amazon’s large-scale commerce, cloud and device businesses. For public inquiries and general information about Amazon’s corporate structure, product lines and corporate mission, Amazon’s official corporate communications and website provide primary documentation. Amazon publicly states its overarching mission and describes its principal business segments and technology investments, which contextualize the role that U.S.-based development centers and subsidiaries play within the company’s broader strategy and operations.