Role Description
As a Senior ML Operations Engineer, you will support teams building and deploying AI-driven solutions, helping them overcome common challenges in the ML development, deployment, and support lifecycle. This role offers autonomy in decision-making, encouraging a proactive approach to solving complex issues and finding optimal solutions for scalable systems.
Team Composition
- 1 Engineering Manager
- 4 Senior ML Ops Engineers
Work Environment
The role offers a flexible work schedule: you can adapt your working hours, provided you attend all team meetings. The team follows a Scrum-based Agile methodology.
As a qualified expert, you will
- Serve as a force multiplier for development teams by creating golden paths that remove roadblocks and improve ideation and innovation.
- Collaborate with other engineers, product managers, and internal stakeholders in an Agile environment.
- Provide mentorship and technical guidance, and perform code reviews for team members.
- Design and deliver on projects end-to-end with little to no guidance.
- Provide support to teams building and deploying AI applications by addressing common pain points in the ML development lifecycle (MLDLC).
- Learn constantly and be passionate about discovering new tools, technologies, libraries, and frameworks (commercial and open source) that can be leveraged to improve PitchBook's AI capabilities.
- Support the vision and values of the company through role modeling and encouraging desired behaviors.
- Participate in various cross-functional company initiatives and projects as requested.
- Contribute to strategic planning in a way that ensures the team is building exceptional products that bring real business value.
- Evaluate frameworks, vendors, and tools that can be used to optimize processes and costs with minimal guidance.
Qualifications
- Degree in Computer Science, Information Systems, Machine Learning, or a similar field preferred (or equivalent practical experience).
- 5+ years of hands-on software development experience with Python (Java experience with strong Python proficiency also considered).
- 4+ years of experience designing and building distributed software systems and architectures.
- 3+ years of hands-on experience deploying and operating Machine Learning services in production.
- Experience supporting ML lifecycle operations, including post-deployment monitoring and maintenance.
- Experience with a cloud-native stack, with a practical understanding of containerization and orchestration technologies such as Docker and Kubernetes.
- Demonstrated experience with SQL and NoSQL database design and implementation.
- Ability to decompose complex problems into iterative, well-defined solutions.
- Strong problem-solving abilities with a focus on building scalable, efficient, and maintainable systems.
- Strong communication and collaboration skills, with the ability to engage effectively with internal customers across various cultures and regions.
- Ability to be a team player who can also work independently.
- Experience working across multiple development teams is a plus.
Desired Skills
- Experience with cloud platforms (AWS, Google Cloud Platform, or Azure).
- Proficiency in GitOps practices and CI/CD pipeline development and management.
- Integration experience with observability tools (Prometheus, Grafana) and building instrumented, production-ready systems.
- Experience provisioning and managing Large Language Models through managed services (Azure OpenAI, Google Vertex AI, Amazon Bedrock).
- Hands-on experience with LLM gateways (LiteLLM) and agentic frameworks (LangGraph, LangSmith, or similar).
- Practical experience with vector embedding models and vector databases (Pinecone, Weaviate, Milvus, pgvector).
- Experience building Retrieval-Augmented Generation systems and evaluating both retrieval quality and generation performance.
- Cloud-native experience with services like Amazon SageMaker, Google Vertex AI, or Azure ML.
- Familiarity with ML frameworks and tools: PyTorch, TensorFlow, scikit-learn.
- Experience with data infrastructure: Redis, Elasticsearch, Apache Kafka.
- Experience with ML experiment tracking and model management: Weights & Biases, MLflow, Kubeflow.
- API development with FastAPI or similar frameworks.
- Java programming experience is a plus.
Benefits
- Build great tech solutions by joining a team of experts who create custom, cutting-edge solutions for world-renowned businesses and fuel client growth.
- Enjoy the freedom of fully remote work with a flexible working schedule.
- Rely on a stable workload and a stable income, with a laptop and licensed software provided.
- Benefit from performance and merit reviews, and build your skills through personal development plans, the corporate library, public speaking support, and more.
- Work with a like-minded team that cares about what it does and how it does it.
- Join company-wide tech and cultural events, and contribute to meaningful CSR initiatives that resonate with your values.
Interview Process
- Pre-Screening with the recruiter
- Tech Interview (ML Fundamentals, Software Engineering) + Live Coding (up to 1.5 hours)
- ML System Design Interview (up to 60 min)
- Interview with Engineering Manager (60 min)
- Final Client Interview (60 min)