Principal Architect, System Software - Orbital Data Center
Job details
- Location: Santa Clara or
- Work type: Remote
- Compensation: $272,000 - $431,250/yr
- Posted: yesterday
- Apply on: nvidia.wd5.myworkdayjobs.com
About this role
Space-1 is NVIDIA's first Orbital Data Center (ODC) module: a Vera Rubin–class compute platform engineered for low-Earth-orbit (LEO) missions. It is the first step in a multi-generation orbital roadmap to speed up AI adoption. We are looking for a strong technical architect to own the end-to-end system software architecture for Space-1 and successor orbital platforms. You will architect the full stack, from applications and libraries through the data center stack, BMC and BIOS firmware, manageability and telemetry, the host OS, GPU and CPU drivers, and CUDA, to deliver a production-ready inference platform that operates reliably in the radiation, thermal-cycling, and remote-operations environment of LEO. You will partner closely with the orbital hardware system architecture team, drive customer use cases with constellation operators, align architecture with mission requirements, and bring the best orbital AI products to market. Join us at the forefront of technological advancement.
What you'll be doing:
Own the system architecture for the inference stack and other applications running on this class of products, and make it resilient to faults that occur in space.
Co-architect with the orbital hardware system architecture team to define interfaces, partitioning, and trade-offs across silicon, board, firmware, OS, and AI workload layers for 5-year LEO missions.
Own end-to-end system software architecture for Space-1 and successor Orbital Data Center modules — covering data center stack, BMC firmware, BIOS, host OS, GPU/CPU drivers, CUDA, DCGM, and manageability telemetry as a single integrated stack.
Define the manageability architecture for an unreachable, autonomous data center: remote bring-up, in-orbit firmware update, dual-module redundancy, fault containment, recovery from SEU/SEFI events, and telemetry for fleets ranging from tens to millions of nodes.
Architect rad-tolerant system software behaviors — ECC handling, memory scrubbing, latch-up mitigation, deterministic recovery, and graceful degradation through 5 years and up to ~8,000 thermal cycles in dawn–dusk sun-synchronous orbit.
Drive Redfish, MCTP, PLDM, and constellation-level management protocols across BMC, BIOS, and host software so customers can operate orbital fleets with the same tools they use on the ground.
Define the minimum BMC feature set, pin budget, boot architecture (rugged M.2 / VPX-class options), and dual-module redundancy strategy in partnership with platform and mechanical engineering.
Partner with cloud and constellation customers (SpaceX, Blue Origin, Starcloud, Planet, Cowboy Space, and others) to translate mission requirements — orbit, duty cycle, NSA PHIPs security, post-quantum networking (CX9), inference SLAs — into actionable platform software architecture.
Drive reliability and optimization in the system software architecture from an orbital data center viewpoint, including correct operation through eclipse periods and idle-power retention strategies.
Work closely with the bring-up team and resolve issues at Speed of Light from first silicon through first launch. Own quality, reliability, and telemetry performance of the system software delivered with each ODC module shipped to customers.
What we need to see:
15+ years of relevant experience in server/platform system software, spanning compute libraries, BMC firmware, BIOS, host OS, drivers, and manageability.
BS, MS, or PhD in EE/CS or a related field (or equivalent experience).
Working experience in building AI infrastructure and systems in space. Proven record of architecting and delivering platform software for large-scale data centers or mission-critical embedded systems.
Strong knowledge of server architecture, data center manageability, and full-stack integration of firmware with OS and accelerator software. Hands-on experience with data center health management workflows, telemetry, and fault management at scale.
Solid understanding of hardware management interfaces (USB, SMBus/I2C, PCIe) and proficiency with modern management protocols including Redfish, MCTP, and PLDM.
Strong, demonstrable skills in C/C++ and Python.
Experience programming and debugging server platforms, including pre-silicon and platform bring-up environments.
Experience with SCM tools (e.g., Git, Perforce) and project management tools such as Jira.
Excellent written and oral communication skills, a strong work ethic, a strong sense of teamwork, a love of producing quality work, and a commitment to finishing your tasks every single day.
You are a self-starter who loves finding creative solutions to complicated problems and stays hands-on with coding.
Ways to stand out from the crowd:
Experience architecting platform software for space, aerospace, defense, or other radiation-, thermal-, and vibration-constrained environments, including SEU/SEFI mitigation, ECC strategy, TID/SEE qualification, and rad-hard design partitioning. Experience as part of a startup or initiative directly related to space data centers.
Hands-on experience with autonomous, remote, or unreachable data center operations: in-orbit or in-field firmware update, dual-module redundancy, and recovery without physical access. Hands-on experience with x86 or ARM (Grace/Vera) system architecture and the NVIDIA AI software stack (CUDA, DCGM, DOCA/OFED, GPU drivers, DGX OS).
Familiarity with NSA PHIPs security, post-quantum networking, and aerospace standards (VPX, MIL-STD shock/vibe, NASA EEE-INST-002).
Proven technical leadership driving large, complex programs with 50+ engineers across firmware, OS, driver, and AI stack teams.
Skilled in reviewing hardware schematics and PCB layouts for debugging, design verification, and collaboration with hardware engineers.
NVIDIA's invention of the GPU in 1999 sparked the growth of the PC gaming market, redefined modern computer graphics, and revolutionized parallel computing. More recently, GPU deep learning ignited modern AI, the next era of computing, with the GPU acting as the brain of computers, robots, and self-driving cars that can perceive and understand the world. Today, we are increasingly known as “the AI computing company.” We're looking to grow our company and establish teams with the most thoughtful people in the world. NVIDIA is leading the way in groundbreaking developments in Artificial Intelligence, High-Performance Computing, and Visualization. Our invention serves as the visual cortex of modern computers and is at the heart of our products and services. Our work opens up new universes to explore, enables amazing creativity and discovery, and powers what were once science fiction inventions, from artificial intelligence to autonomous cars. NVIDIA is seeking exceptional individuals like you to help us drive the next wave of artificial intelligence.
Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. The base salary range is 272,000 USD - 431,250 USD. You will also be eligible for equity and benefits.
This posting is for an existing vacancy.
NVIDIA uses AI tools in its recruiting processes.
NVIDIA is committed to fostering an inclusive work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.
