Role Description
This role exists because Hyperstack is scaling its AI cloud platform and building out the infrastructure that powers production ML workloads for thousands of customers. As AI Studio capabilities grow and the platform takes on increasingly complex training, fine-tuning, and inference workloads, we need someone to own the MLOps layer: the systems, tooling, and practices that make large-scale AI workloads reliable, observable, and repeatable in production. You'll have direct ownership over ML platform reliability, deployment workflow engineering, and the operational standards that underpin how AI workloads run on Hyperstack, end to end.
Role positioning:
- This is a lead individual contributor role.
- You'll set the technical direction for MLOps on the platform.
- Work directly with Product and Engineering.
- Take end-to-end ownership of the systems that make AI workloads run in production.
- No hand-holding, lots of impact.
What you'll be doing:
- Own the design, implementation, and evolution of core MLOps systems across Hyperstack, including the infrastructure and workflows that underpin AI Studio.
- Build and improve systems that orchestrate model training, fine-tuning, evaluation, and deployment, engineered for long-running, resource-intensive GPU workloads.
- Own production readiness across ML infrastructure: monitoring, alerting, incident response, and continuous improvement based on real-world usage.
- Define and embed strong MLOps practices across teams: model versioning, reproducibility, deployment safety, rollback strategies, and environment management.
- Provide technical leadership through architecture decisions, implementation guidance, and shared standards, working closely with Product, Engineering, and cross-functional teams.
Qualifications
- Proven experience designing, building, and operating production ML infrastructure, platform systems, or MLOps workflows in cloud environments.
- Hands-on Python development skills, with experience building backend systems, automation, and developer or platform tooling.
- Experience supporting LLM, generative AI, or fine-tuning workflows in production, including training, evaluation, deployment, inference, and lifecycle management.
- Production-grade experience with Docker, Kubernetes, CI/CD, and infrastructure-as-code in real, operational environments.
- Experience owning complex, asynchronous, or resource-intensive workloads end to end, including orchestration, reliability, observability, and incident response.
- Ability to work cross-functionally and provide technical leadership through influence, shaping standards, direction, and ways of working across engineering teams.
Requirements
- Exposure to GPU-intensive, distributed, or performance-sensitive ML workloads.
- Experience building internal developer platforms or tooling that improve experimentation, reproducibility, and delivery speed for ML teams.
- Background in cloud infrastructure, platform products, or technically complex B2B software.
Benefits
- Competitive salary and annual discretionary bonus scheme.
- Employee wellbeing benefits.
- 25 days of holiday, plus public holidays.
- Flexible working arrangements (remote or hybrid, depending on role and location).
- Real ownership and autonomy, with the trust to take initiative and experiment.
- The opportunity to make a visible, meaningful impact as we scale.
- Clear career progression and growth opportunities in a fast-growing company.
- A collaborative, international culture built on trust, transparency, and ownership.
- The chance to help shape NexGen Cloud's team, culture, and future alongside ambitious, mission-driven colleagues.