Certified NVIDIA AI Infrastructure & Kubernetes Platform Engineer @IT Search Corp
AI / ML
Salary usd 100 - 130 p..
Remote Location
🇺🇸 USA Only
Job Type full-time
Posted 2d ago

[Hiring] Certified NVIDIA AI Infrastructure & Kubernetes Platform Engineer @IT Search Corp

2d ago - IT Search Corp is hiring a remote Certified NVIDIA AI Infrastructure & Kubernetes Platform Engineer. 💸 Salary: USD 100 - 130 per hour 📍 Location: USA

Role Description

We are seeking a highly skilled AI Infrastructure & Kubernetes Platform Engineer with a proven track record in deploying and managing NVIDIA DGX-based AI clusters, orchestrating containerized AI workloads using Kubernetes, and ensuring secure, high-throughput operations across InfiniBand-powered networks. The ideal candidate will hold a combination of Kubernetes certifications (CKA, CKAD, CKS) and NVIDIA certifications (NCA-AIIO, NCP-AIO, NCP-AII, NCP-AIN), coupled with hands-on training in DGX, BlueField, and high-speed network operations. This position plays a key role in supporting AI/ML infrastructure at scale, enabling efficient training and inference for complex models, and integrating NVIDIA's cutting-edge compute, storage, and fabric solutions with modern DevOps practices.

Core Responsibilities

  • AI Infrastructure Operations
    • Deploy and manage NVIDIA DGX BasePODs and SuperPODs for high-performance AI workloads.
    • Oversee DGX system lifecycle operations including provisioning, monitoring, firmware upgrades, and capacity planning.
    • Operate Base Command Manager to manage GPU clusters, schedule workloads, and integrate with MLOps tools.
    • Perform DGX node health validation, NCCL interconnect testing, and NVLink topology verification following new deployments or hardware changes.
  • Kubernetes Platform Engineering
    • Architect secure and scalable Kubernetes clusters optimized for GPU-accelerated workloads using NVIDIA GPU Operator.
    • Leverage expertise from CKA/CKAD/CKS to develop, deploy, and secure AI applications on Kubernetes.
    • Implement CI/CD pipelines and GitOps methodologies for deploying and managing ML workflows.
  • High-Performance Networking & DPUs
    • Administer InfiniBand networks and BlueField DPUs using Unified Fabric Manager (UFM).
    • Enable NVLink/NVSwitch performance across GPU nodes and tune fabric configurations for minimal latency and maximum throughput.
    • Use BlueField for offloading storage, firewalling, and telemetry, enhancing AI workload security and performance.
  • Security & Compliance
    • Apply best practices from the CKS certification to secure containerized AI environments.
    • Configure runtime security, secrets management, network segmentation, and auditing using DPU-enhanced Kubernetes deployments.
    • Support zero-trust architecture initiatives by enforcing workload identity, RBAC policies, and supply chain integrity across AI container images and model artifacts.
  • Monitoring, Telemetry & Optimization
    • Monitor GPU, CPU, and I/O performance using NVIDIA DCGM, Prometheus, Grafana, and Base Command APIs.
    • Tune system performance and model training pipelines for cost-efficiency and throughput.
    • Build and maintain operational runbooks, incident response playbooks, and SLA reporting dashboards covering GPU utilization, thermal thresholds, and fabric health.

Qualifications

  • Certified Kubernetes Administrator (CKA)
  • Certified Kubernetes Application Developer (CKAD)
  • Certified Kubernetes Security Specialist (CKS)
  • NVIDIA Certified Associate: AI Infrastructure & Operations (NCA-AIIO)
  • NVIDIA Certified Professional: AI Infrastructure (NCP-AII)
  • NVIDIA Certified Professional: AI Operations (NCP-AIO)
  • NVIDIA Certified Professional: AI Networking (NCP-AIN)

Requirements

  • Expertise With:
    • DGX System, BasePOD, and SuperPOD Administration
    • BlueField DPU Configuration & Operations
    • InfiniBand Fabric and UFM Management
    • Base Command Manager for workload orchestration
  • Technical Skills:
    • Kubernetes, Helm, GPU Operator, Kubeflow
    • DevOps tools: Ansible, Terraform, GitOps, CI/CD pipelines
    • Storage: NFS, BeeGFS, Lustre
    • Networking: RoCE, InfiniBand, DPU offload, gRPC, RDMA
    • Programming/scripting: Python, YAML, Bash