Role Description
You will join a small team responsible for the stability, performance, and security of our server infrastructure: bare metal, VMs, databases, queues, networking, and infrastructure security. Our philosophy is simple: the team owns its systems end-to-end, and every engineer should be able to diagnose and fix issues in their area of responsibility. This role is for someone who enjoys working with real infrastructure (OS, hardware/virtualization, networks), not just cloud abstractions.
What You'll Do
- Work primarily with on-premises infrastructure (bare metal and VMs): setup, maintenance, troubleshooting
- Drive clarity in ambiguous situations by defining requirements, assumptions, and next steps
- Own automation projects end-to-end (design → rollout → maintenance)
- Improve how we operate: harden and tune systems, and strengthen the team's operational hygiene
- Keep the platform stable, fast, and secure: servers, web servers, databases, queues
- Investigate production incidents across OS / networking / infrastructure layers, apply temporary mitigations, coordinate with developers, and participate in post-mortems
- Participate in on-call rotations
- Use AI in all aspects of day-to-day work: researching, troubleshooting, developing
Qualifications
- 4+ years as a DevOps Engineer / SRE (or very close responsibilities)
- Real, hands-on experience with servers (VMs, bare metal) at the OS level and below: configuring, troubleshooting, digging into “why it’s broken”
- Confident Linux skills (we use Ubuntu). We expect you to be comfortable with the core tools from Linux Crisis Tools
- Solid understanding of networking basics; ability to configure and troubleshoot iptables
- Ansible + Git
- Experience with Bash or Python scripting for automation/observability
- Production/on-call experience: diagnosing incidents, restoring service, participating in post-mortems
- Ownership and attention to detail. Downtime is expensive: five years ago, 10 minutes of downtime cost us $100k, and today it costs even more
Nice to Have
- ClickHouse, MongoDB: what each database is used for, monitoring, troubleshooting performance and slow queries, sharding
- Kafka: operating clusters at scale (topic moves, broker replacements, tuning)
- Redis: high-load tuning, replication, sharding, performance monitoring
- Elasticsearch: configuration, scaling, sharding/cluster management
- HAProxy / Nginx: load balancing, SSL/TLS, caching, reverse proxying, performance monitoring
- OS tuning: kernel/network stack/filesystem parameters for high-load systems
- Full Disk Encryption on LVM: We use Clevis + Tang in production
- Infrastructure Security: Teleport, HashiCorp Vault
Bonus Points
Great if you’ve worked with any of the following:
- VictoriaMetrics and how it differs from the Prometheus stack
- Complex CI/CD pipelines. We use scripted Jenkins pipelines
- Bare-metal Kubernetes: provisioning, networking (MetalLB or alternatives), isolation from the internet, scaling across providers (like OVH, Hetzner), and integration with existing infrastructure
- Flux and GitOps
- Terraform
Benefits
- 31 days off
- 100% paid telemedicine plan
- Home Office Setup Assistance: the company offers assistance with purchasing furniture (office chair, office desk, monitor) and other items to create a comfortable workspace
- English learning courses
- Relevant professional education
- Gym or swimming pool
- Co-working
- Remote working