Role Description
RV LIFE is looking for a Senior DevOps & Infrastructure Lead to help us stabilize, document, and modernize the infrastructure behind our products. This is a hands-on senior role for someone comfortable inheriting real production systems, reducing operational risk, improving reliability, and moving us toward a documented, secure, automated, infrastructure-as-code operating model.
We run production across DigitalOcean, AWS, Cloudflare, and other hosting providers, and are consolidating onto managed, infrastructure-as-code platforms. We need deep, hands-on expertise across these environments.
This role focuses on the infrastructure path to reliability; application-level architecture changes are handled in partnership with our engineering team. It is not just about keeping servers alive. It is about building durable practices that reduce single-person dependency, improve visibility, and make our systems safer to operate.
This is not a standard 9-to-5 role. Production issues do not keep business hours, so it carries real on-call responsibility: you need to be reachable and able to respond when unforeseen incidents arise.
What You'll Do
- Administer and improve existing DigitalOcean infrastructure.
- Support and improve Linux-based production server environments.
- Migrate self-managed databases onto managed database services, with validated failover, backups, and recovery.
- Move applications onto managed runtimes (including Laravel Cloud where it fits), replacing manual deploy processes with automated, repeatable pipelines.
- Expand and harden our use of Cloudflare for edge, static hosting, caching, and security.
- Build a clear inventory of servers, services, databases, domains, access paths, backups, monitoring, and operational risks.
- Create and maintain practical runbooks for common and emergency infrastructure workflows.
- Improve incident response, escalation paths, monitoring, logging, and alerting.
- Review and improve backup, restore, and disaster-recovery procedures.
- Identify recurring manual work and convert it into safer procedures, scripts, automation, or infrastructure-as-code.
- Help define infrastructure-as-code standards and move appropriate infrastructure into repeatable, version-controlled workflows.
- Work with AWS services where needed (Lambda, VPC, IAM, CloudWatch, S3, SSM/Secrets Manager, queues).
- Use AI tools to accelerate discovery, documentation, scripting, troubleshooting, and automation, with strong production-safety judgment.
- Partner with engineering leadership to prioritize infrastructure risk and modernization; track work clearly in Jira/GitHub and communicate proactively about risks, tradeoffs, and blockers.
What Success Looks Like
- In the first 30-60 days, you'll take ownership of how we see and operate our infrastructure, building on what we already track and closing the gaps.
  - You'll validate and take ownership of what already exists:
    - Our infrastructure inventory and server map
    - Our monitoring and alerting
    - Our DNS / Cloudflare configuration
    - Our prioritized infrastructure risk register
  - You'll create what we're missing:
    - An access and credential map
    - Verified backup and restore status for critical systems (tested, not assumed)
    - Runbooks for the highest-risk operational workflows
- In the first 90 days, you'll move us toward a durable, consolidated model. Success means:
  - The first core database migrated to a managed service, with a tested restore, plus a clear, sequenced plan for the rest.
  - The first application running on a managed runtime (DigitalOcean App Platform or Laravel Cloud).
  - The first static frontend served from Cloudflare Pages.
  - A measurably stronger edge security posture.
  - Critical systems no longer understood by only one person; common tasks have documented procedures; manual processes are being converted to automation; AI is used safely to reduce toil.
Qualifications
- Senior-level experience operating production infrastructure.
- Deep, hands-on Linux server administration: operating, securing, and troubleshooting manually managed production servers (LAMP/LEMP, system services, cron, networking, SSH) directly at the command line, not only through a cloud console.
- Experience with DigitalOcean, Linode, AWS EC2, bare VPS hosting, or comparable environments.
- Senior-level database operations experience: migrating self-managed MySQL to a managed service, replication, backup validation, restore testing, and IO isolation.
- Strong Cloudflare experience across DNS, WAF, CDN and caching behavior, page rules, Workers, Pages, and Zero Trust/Access, including traffic routing and origin protection.
- Experience with PHP/Laravel application environments and with a managed Laravel runtime (Laravel Cloud and/or DigitalOcean App Platform).
- Experience with Datadog or a comparable observability platform for monitoring, alerting, dashboards, logs, and incident investigation.
- Experience with infrastructure-as-code tools such as Terraform, Pulumi, AWS CDK, Serverless Framework, or CloudFormation.
- Experience with CI/CD pipelines and deployment automation.
- Practical AWS experience (Lambda, IAM, VPC, CloudWatch, S3, SSM/Secrets Manager, queues).
- Good judgment around production safety, access control, secrets, backups, and incident response.
- Willingness to carry real on-call responsibility and respond to production incidents outside normal business hours; this is not a strict 9-to-5 role.
- A habit of documenting what you learn and creating runbooks others can follow.
- Practical experience using AI tools (ChatGPT, Claude, Cursor, GitHub Copilot, or similar), with strong judgment about where human verification is required.
- Ability to work independently in a small, remote engineering organization where practical ownership matters more than bureaucracy.
Nice to Have
- Experience migrating manually managed services onto managed platforms or IaC.
- Experience moving static frontends onto Cloudflare Pages.
- Experience migrating MongoDB, OpenSearch, or Valkey/Redis onto managed services.
- Experience supporting Node.js, React, and React Native alongside PHP.
- Experience helping organizations reduce infrastructure bus-factor risk.
- Experience working with external DevOps/security partners or auditors.
Who You Are
- You take ownership without waiting to be told every next step.
- You are calm and practical during incidents.
- You can inherit messy systems without being judgmental or reckless.
- You prefer consolidating on platforms we already run over adding new vendors.
- You document as you go.
- You use AI as leverage but do not blindly trust its output; you verify, test, and apply senior judgment before anything touches production.
- You know when to automate and when to stabilize first.
- You communicate clearly with technical and non-technical stakeholders.
- You understand that reliability is not just uptime: it is visibility, repeatability, recovery, and shared understanding.
- You want to leave infrastructure better than you found it.