Role Description
Our Vision AI platform runs where the data is generated (on-premises, inside government facilities, and at the network edge), not in a hyperscaler cloud. That means the infrastructure has to be bulletproof: GPU clusters provisioned correctly, Kubernetes workloads scheduled efficiently across heterogeneous compute, storage performing at the throughput that AI training and inference demand, and a network capable of handling high-bandwidth, low-latency sensor data at scale.
As our MLOps / AI Infrastructure Engineer, you will own all of it. You will:
- Rack, configure, and operate the on-premises compute and GPU infrastructure that powers the platform.
- Build and maintain the Kubernetes clusters that orchestrate AI workloads.
- Design the networking fabric that ties edge nodes to core compute.
- Implement the MLOps pipelines that take models from development to production.
- Work directly with our AI/ML engineers, the Lead Architect, and on-site client technical teams to ensure the platform runs reliably in environments that are often air-gapped, physically secured, and subject to strict government compliance requirements.
Qualifications
- 6+ years of infrastructure engineering experience, with at least 3 years managing GPU compute clusters or HPC environments in production.
- Deep hands-on expertise with NVIDIA GPU infrastructure: driver lifecycle management, CUDA, DCGM, MIG, NVLink topologies, and the NVIDIA GPU Operator for Kubernetes.
- Production-level Kubernetes administration experience on bare metal: cluster provisioning, upgrades, CNI/CSI configuration, RBAC, and day-2 operations.
- Strong networking fundamentals: BGP, VLAN segmentation, RDMA/RoCE or InfiniBand configuration, load balancing, and firewall policy management.
- Hands-on experience with software-defined storage (Ceph, Rook-Ceph, or MinIO) in AI/HPC workload contexts, including performance tuning, capacity planning, and failure recovery.
- Practical MLOps experience: model serving infrastructure (Triton or equivalent), experiment tracking (MLflow or Kubeflow), and GitOps-based model deployment pipelines.
- Working knowledge of NIST SP 800-171 controls and the ability to translate them into concrete infrastructure configurations and audit evidence.
- Proficiency with infrastructure-as-code tooling: Terraform or Ansible for reproducible, auditable infrastructure builds.
- Strong Linux systems administration skills (RHEL/Rocky Linux or Ubuntu), including kernel tuning, storage I/O optimization, and systemd service management.
- Excellent written communication for producing infrastructure runbooks, network diagrams, and compliance documentation in a remote-first environment.
Preferred Qualifications
- Experience with air-gapped or classified network environments and the operational discipline they require (offline package mirrors, USB-controlled media transfers, etc.).
- Familiarity with CMMC Level 2/3 assessment processes and evidence collection.
- Experience with NVIDIA DGX systems, DGX BasePOD reference architectures, or the NVIDIA AI Enterprise software stack.
- Knowledge of distributed training frameworks (PyTorch DDP, DeepSpeed, Megatron-LM) and their infrastructure requirements, useful for supporting AI/ML engineering teammates.
- Experience deploying Kubernetes at the edge: K3s, MicroK8s, or NVIDIA Jetson-based edge clusters.
- Familiarity with observability stacks: Prometheus, Grafana, Loki, OpenTelemetry, and DCGM Exporter for GPU telemetry dashboards.
- US Person status or active security clearance (advantageous for certain client-site engagements).
- Background in SCADA, ICS, or OT network environments relevant to critical infrastructure clients.
Benefits
- Hands-on ownership of some of the most demanding AI infrastructure in the public sector: H200 GPU clusters, high-bandwidth interconnects, and purpose-built on-premises deployments.
- A technically rigorous environment where your infrastructure decisions directly affect the reliability of mission-critical government operations.
- Competitive, globally benchmarked compensation, including base salary, equity, and performance bonus.
- Fully remote with an async-first culture; periodic travel to client facilities and team on-sites for cluster deployments and planning.
- Access to cutting-edge NVIDIA hardware, early access to new GPU generations, and a budget for relevant certifications (NVIDIA, CKA/CKS, RHCSA, etc.).
- Collaboration with a Lead Architect and an engineering team who treat infrastructure as a product, not just a cost center.
Company Description
Centific is a frontier AI data foundry that curates diverse, high-quality data, using our purpose-built technology platforms to empower the Magnificent Seven and our enterprise clients with safe, scalable AI deployment. Our team includes more than 150 PhDs and data scientists, along with more than 4,000 AI practitioners and engineers. We harness the power of an integrated solution ecosystem, comprising industry-leading partnerships and 1.8 million vertical domain experts in more than 230 markets, to create contextual, multilingual, pre-trained datasets; fine-tuned, industry-specific LLMs; and RAG pipelines supported by vector databases. Our zero-distance innovation™ approach to GenAI can reduce costs by up to 80% and bring solutions to market 50% faster.
Our mission is to bridge the gap between AI creators and industry leaders by bringing best practices in GenAI to unicorn innovators and enterprise customers. We aim to help these organizations unlock significant business value by deploying GenAI at scale, ensuring they stay at the forefront of technological advancement and maintain a competitive edge in their respective markets.