Role Description
This role offers the opportunity to lead the development of next-generation AI and high-performance computing (HPC) systems that push the boundaries of technology. You will drive the design and implementation of large-scale GPU cluster architectures, from hardware to system software, enabling cutting-edge machine learning workloads. Collaborating across global teams, you will create reference architectures that guide industry deployments and shape the future of AI datacenters. This position combines technical leadership, innovation, and hands-on engineering in a fast-paced, remote-friendly environment. Ideal candidates thrive on solving complex challenges, mentoring high-performing teams, and influencing large-scale computing systems used across multiple industries. You will have a direct impact on AI research and enterprise applications by building systems that redefine performance and scalability.
-
Lead a team designing and developing next-generation HPC and AI cluster architectures using advanced technologies.
-
Build end-to-end systems for high-performance machine learning, including physical architecture, network topology, and system software stacks.
-
Author reference architectures to guide future AI and HPC supercomputing systems.
-
Collaborate with cross-functional teams on cluster architecture, system bringup, and integration of new technologies.
-
Enable deployment of large-scale datacenter systems and ensure performance optimization across hardware and software components.
-
Mentor team members, promote knowledge sharing, and foster a collaborative, innovative engineering environment.
Qualifications
-
Bachelorโs degree in Applied Science, Engineering, or a related field (Masterโs or PhD preferred), or equivalent experience.
-
8+ years of experience in HPC, AI, or related fields, including 3+ years in technical leadership roles.
-
Proven ability to lead high-performing engineering teams, particularly in distributed and diverse environments.
-
Strong software development and automation skills using languages such as Go, Python, or Ansible.
-
Expertise in high-performance computing, datacenter architecture, and distributed systems.
-
Creative problem-solving skills, teamwork, and effective communication across technical and executive stakeholders.
-
Comfortable working in remote-friendly, fast-paced, and globally distributed teams.
-
Bonus: Experience with multi-GPU/multi-node training, HPC storage systems, InfiniBand/RoCE networking, or open-source monitoring tools (Prometheus, Grafana).
Benefits
-
Competitive base salary ranging from $224,000 to $356,500 USD, based on experience and location.
-
Eligibility for equity participation and performance-based incentives.
-
Comprehensive healthcare coverage, including medical, dental, and vision plans.
-
Retirement savings programs and financial wellness support.
-
Flexible work arrangements and generous paid time off.
-
Opportunities to work with cutting-edge AI and HPC technologies on high-impact projects.
-
Professional development programs, mentorship, and team-building initiatives.
-
Inclusive, collaborative work environment that encourages innovation and knowledge sharing.
Company Description