Machine Learning Engineer - ML Training Platform @Pluralis Research
Category: Software Development
Salary: equity-heavy compensation with a competitive base salary
Remote location: Australia
Job type: Full-time
Posted: 1 month ago

Pluralis Research is hiring a remote Machine Learning Engineer - ML Training Platform. 💸 Salary: equity-heavy compensation with a competitive base salary. 📍 Location: Australia.

Role Description

This role involves implementing a novel substrate for training ML models in a distributed fashion over consumer-grade internet connections.

  • Design and implement large-scale distributed training systems optimized for heterogeneous hardware operating under low-bandwidth, high-latency conditions.
  • Develop and optimize parallel training strategies (data, tensor, and pipeline parallelism) with custom sharding techniques that minimize communication overhead.
  • Optimize GPU utilization, memory efficiency, and compute performance across distributed nodes.
  • Implement robust checkpointing, state synchronization, and recovery mechanisms for long-running, fault-prone training jobs.
  • Build monitoring and metrics systems to track training progress, model quality, and system bottlenecks.
  • Architect resilient training systems where nodes can fail, networks can partition, and participants can dynamically join or leave.
  • Design and optimize peer-to-peer topologies for decentralized coordination across non-co-located nodes.
  • Implement NAT traversal, peer discovery, dynamic routing, and connection lifecycle management.
  • Profile and optimize communication patterns to reduce latency and bandwidth overhead in multi-participant environments.

Qualifications

  • Strong experience building and operating distributed systems in production.
  • Hands-on expertise with distributed training frameworks (FSDP, DeepSpeed, Megatron, or similar).
  • Deep understanding of parallelism strategies (data, tensor, and pipeline parallelism).
  • Expert-level Python with production experience (concurrency, error handling, retry logic, clean architecture).
  • Strong networking fundamentals: P2P systems, gRPC, routing, NAT traversal, distributed coordination.
  • Experience optimizing GPU workloads, memory management, and large-scale compute efficiency.

Requirements

  • 5+ years of experience in distributed systems and large-scale ML training.

Benefits

  • Equity-heavy compensation with meaningful ownership in a mission-driven company.
  • Competitive base salary for senior engineering roles in Australia.
  • Visa sponsorship available for exceptional candidates.
  • Remote-first with optional access to our Melbourne hub.
  • World-class team: teammates previously worked at Google, Amazon, Microsoft, and leading startups.

Before You Apply

Be aware of the location restriction for this remote position: Australia.
‼ Beware of scams! When applying for jobs, you should NEVER have to pay anything.