Role Description
Principal Platform Engineer - Remote. Supporting a production ML platform on Google Cloud.
Experience: 8-10+ years in DevOps with at least 2 of those years operating and maintaining ML production workloads.
Core Responsibilities
- Infrastructure Management: Design, deploy, and maintain elastically scaling cloud infrastructure on GCP and containerization tooling such as Kubernetes for high-performance ML workloads.
- CI/CD Pipeline Development & Maintenance: Build automated pipelines for training, testing, and deploying machine learning models using tools such as Jenkins, GitHub Actions, or Airflow.
- Model Monitoring & Maintenance: Implement observability tools to track model drift, accuracy, latency, and performance degradation in production.
- Collaboration: Bridge the gap between data engineers, ML engineers, and backend and frontend engineers to ensure smooth production operation.
- ML Observability: Deploy tools that empower individual teams to monitor their own workloads. Implement comprehensive monitoring for system health (latency, uptime) alongside ML-specific metrics such as feature drift, prediction accuracy, and data distribution shifts to ensure long-term model reliability; cover non-ML workload and production metrics as well.
- Participate in the on-call rotation and help manage security posture to ensure compliance with standards such as SOC 2.
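The monitoring and alerting responsibilities above can be sketched in a few lines; this is a toy illustration only (the class name, window size, and 250 ms p95 threshold are hypothetical), and a production platform would export these metrics to Prometheus, Grafana, or Cloud Monitoring rather than hand-roll checks like this.

```python
from collections import deque

class LatencyMonitor:
    """Toy sliding-window latency monitor with a p95 alert threshold."""

    def __init__(self, window=100, p95_threshold_ms=250.0):
        self.samples = deque(maxlen=window)  # keep only the most recent observations
        self.p95_threshold_ms = p95_threshold_ms

    def record(self, latency_ms):
        self.samples.append(latency_ms)

    def p95(self):
        # nearest-rank 95th percentile over the current window
        ordered = sorted(self.samples)
        idx = max(0, int(0.95 * len(ordered)) - 1)
        return ordered[idx]

    def should_alert(self):
        # require a minimum sample count so a single slow request can't page anyone
        return len(self.samples) >= 20 and self.p95() > self.p95_threshold_ms

mon = LatencyMonitor(window=50, p95_threshold_ms=250.0)
for ms in [40, 55, 60] * 10:      # healthy traffic
    mon.record(ms)
print(mon.should_alert())         # False: p95 well under threshold
for ms in [400, 420, 390] * 10:   # sustained latency regression
    mon.record(ms)
print(mon.should_alert())         # True
```

The same window-plus-threshold shape applies to accuracy or drift metrics; only the recorded value and the direction of the comparison change.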
Qualifications
- GCP at depth: IAM, org policies, VPC Service Controls, Secret Manager, Artifact Registry, Cloud DNS. Multi-project estate design (admin / apps / data separation across dev / QA / prod).
- Kubernetes / GKE at depth: cluster topology, upgrades, node pools, Kustomize overlays, Helm. In-cluster operators: ArgoCD, ESO, cert-manager, argo-rollouts, cloudnative-pg, external-dns, kubescape.
- Istio service mesh at depth: VirtualServices, Gateways, ingress passthrough, sidecar injection, mTLS (PeerAuthentication), telemetry. Istio-native across every workload.
- Kong API gateway: Kong Operator + Kong Gateway.
- Terraform + Atlantis: module design, state management, a multi-hundred-file estate, GitOps Terraform workflow.
- In-depth familiarity with GitHub and GitHub Actions: ArgoCD in production (Kustomize-based), release-promotion pipelines, GitHub workflows.
- Secrets management: GCP Secret Manager + External Secrets Operator; familiarity with SOPS or equivalent GitOps-on-secrets patterns.
- Identity / auth: Auth0 (Terraform provider), Dex (in-cluster), Google Groups for IAM.
- Networking + security: VPC-SC perimeter design, private GKE, GCP load balancers, in-cluster security scanning, SOC 2 posture, supply-chain hygiene.
- Data / ML orchestration: operating Airflow in production and an ML-serving stack (Triton, vLLM, LiteLLM, MLflow, Opik).
- Databases: Cloud SQL for PostgreSQL (regional HA, private IP, SSL-enforced, Google sql-db Terraform module); BigQuery (datasets, tables, IAM, scheduled MERGE queries); in-cluster PostgreSQL via cloudnative-pg; in-cluster Elasticsearch; ClickHouse (external) a plus; GCS as object / model store.
- Automation / bootstrap: Ansible (cluster bootstrap and recovery).
- Scripting: Python, Bash.
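As a flavor of the operational scripting the last bullet implies, here is a minimal retry-with-exponential-backoff helper; the function names are illustrative, not part of any stack listed above, and the `sleep` parameter is injectable purely so the example runs instantly.

```python
import time

def retry(fn, attempts=4, base_delay=0.1, exceptions=(Exception,), sleep=time.sleep):
    """Call fn(), retrying with exponential backoff on failure.

    Illustrative ops-scripting helper, e.g. for wrapping a flaky health
    check or a rate-limited cloud API call.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except exceptions:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * (2 ** attempt))  # 0.1s, 0.2s, 0.4s, ...

# usage: a call that succeeds on the third try
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return "ok"

print(retry(flaky, sleep=lambda s: None))  # -> ok
```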
Requirements
- Past experience with continuous monitoring of model accuracy.
- Experience detecting data drift and concept drift.
- Experience setting up alerts for anomalies or performance drops.
- Experience logging and auditing predictions.
- Kubernetes certification (CKA / CKAD / CKS), GCP Professional Cloud Architect or Security Engineer certification, ClickHouse ops, Loki.
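The drift-detection requirements above can be illustrated with a population stability index (PSI) sketch in pure Python. The bin edges, sample data, and the 0.2 alert threshold are common rules of thumb chosen for this example, not values specified by the role.

```python
import math

def psi(expected, actual, edges):
    """Population Stability Index between two samples over fixed bin edges.

    PSI = sum over bins of (a_i - e_i) * ln(a_i / e_i), where e_i and a_i
    are the baseline and current bin proportions; a common rule of thumb
    treats PSI > 0.2 as significant distribution drift.
    """
    def proportions(values):
        counts = [0] * (len(edges) - 1)
        for v in values:
            for i in range(len(edges) - 1):
                if edges[i] <= v < edges[i + 1]:
                    counts[i] += 1
                    break
        # small epsilon so empty bins don't divide by zero inside the log
        return [max(c / len(values), 1e-6) for c in counts]

    e = proportions(expected)
    a = proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

edges = [0.0, 0.25, 0.5, 0.75, 1.0001]
baseline = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9] * 10  # spread across bins
shifted = [0.8, 0.85, 0.9, 0.95] * 20                     # mass moved to the top bin

print(round(psi(baseline, baseline, edges), 6))  # 0.0: identical distributions
print(psi(baseline, shifted, edges) > 0.2)       # True: significant drift
```

The same comparison run on a model's score distribution (rather than a raw feature) is one simple proxy for concept drift when ground-truth labels arrive late.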