CloudLinux

Senior IaaS / Kubernetes Platform Engineer (worldwide remote, work anywhere)

Posted Yesterday

In-Office or Remote

Hiring Remotely in Tbilisi

Senior level

The Senior IaaS/Kubernetes Platform Engineer will design, implement, and operate a multi-tenant Kubernetes platform and its underlying infrastructure, with a focus on Kubernetes, storage, and network engineering, reliability, automation, and proactive improvement.

CloudLinux is a global remote-first company. We are driven by our principles: do the right thing, put employees first, and stay remote-first. We deliver high-volume, low-cost Linux infrastructure and security products that help companies increase the efficiency of their operations. Everyone on our team supports one another and does what they can to ensure we are all successful.

Check out our website for more information: https://cloudlinux.com/

We are looking for a Senior IaaS / Kubernetes Platform Engineer to join our Infrastructure Department and become a key contributor to the design, implementation, and operation of our private cloud and multi-tenant Kubernetes platform.

Our infrastructure powers 500+ VMs across multiple datacenters, serving 20+ engineering teams. We are in the process of evolving from an OpenNebula-based virtualization platform toward a Kubernetes-native multi-tenant cloud with KubeVirt for VM orchestration — while maintaining reliability and operational excellence throughout the transition.

You will work alongside the existing IaaS Tech Lead and Network Engineer, and must be capable of independently owning and operating the full IaaS stack (compute, storage, networking, bare metal) if needed. This is not a "Kubernetes-only" role — it requires deep infrastructure generalist skills combined with Kubernetes platform expertise.

What You Will Do

Kubernetes Platform Engineering (Primary Focus — 40%)

Design, build, and operate a multi-tenant Kubernetes platform using Cluster API (CAPI) with bare-metal providers (Metal3/Sidero).
Implement hard multi-tenancy using vCluster (Loft Labs) or similar technology, providing isolated Kubernetes API servers per tenant.
Deploy and manage KubeVirt for VM orchestration within Kubernetes, including CPU pinning, NUMA awareness, and HugePages configuration.
Implement GitOps-driven infrastructure using ArgoCD or Flux as the single source of truth for all cluster configurations.
Deploy and manage Policy-as-Code using Kyverno or OPA Gatekeeper for admission control, resource quotas, and security policies.
Build self-service capabilities using Crossplane or similar Kubernetes-native infrastructure provisioning tools.

Storage Engineering (20%)

Operate and optimize Ceph distributed storage clusters (currently 1 PiB raw, 149 OSDs, Quincy 17.2.5).
Manage Rook-Ceph operator deployments at scale on modern Kubernetes (v1.28+).
Implement storage tiering: Ceph for bulk storage, local NVMe for high-IOPS workloads, LINSTOR/DRBD or TopoLVM for ultra-fast replicated storage.
Design and implement per-VM / per-tenant I/O isolation on shared Ceph clusters.
Manage CDI (Containerized Data Importer) for VM image lifecycle in KubeVirt environments.

Networking (15%)

Deploy and manage overlay networks for pod networking, micro-segmentation, and WireGuard/IPsec encryption.
Implement Cluster Mesh for multi-datacenter pod-to-pod connectivity.
Configure Multus CNI and SR-IOV for multi-NIC VM support in KubeVirt.
Work with physical network infrastructure: Juniper switches (JunOS), BGP (eBGP/iBGP), EVPN/VXLAN, VLANs.
Maintain IPsec site-to-site connectivity between datacenters.

Reliability and Operations (15%)

Practice SRE discipline: define and maintain SLOs with error budgets, implement proactive capacity management with 6-12 month forecasting.
Design and execute chaos engineering experiments to validate system resilience.
Participate in on-call rotation for IaaS infrastructure (OpenNebula, Ceph, networking).
Write and maintain runbooks, DRP documentation, and postmortem analyses.
Drive proactive improvement: identify reliability risks, performance bottlenecks, and toil — then propose and implement solutions without waiting for incidents.

Infrastructure as Code and Automation (10%)

Develop and maintain Terraform/OpenTofu modules for multi-cloud infrastructure provisioning.
Write Ansible playbooks for bare-metal server configuration and fleet management.
Automate infrastructure lifecycle: PXE boot images, hardware provisioning (Foreman), IPMI management.
Implement FinOps practices: cost attribution, resource utilization analysis, right-sizing recommendations using OpenCost/Kubecost.

Requirements

Must have

5+ years in infrastructure/platform engineering roles, with at least 3 years operating production Kubernetes clusters (not just deploying apps on K8s, but building and managing the platform itself).
Production experience with at least 3 of the following:

KubeVirt or similar VM-on-K8s technology
Cluster API (CAPI) for declarative cluster lifecycle management
Cilium or Calico (advanced CNI with eBPF or BGP integration)
Rook-Ceph or other Kubernetes storage operators at scale (100+ OSDs)
ArgoCD or Flux for GitOps-driven infrastructure management

Deep Linux systems knowledge: kernel tuning, networking stack (iptables/nftables, routing, bonding, VLAN), filesystem operations, performance troubleshooting.
Ceph distributed storage experience: cluster operations, OSD lifecycle, pool management, performance tuning, troubleshooting degraded states.
Infrastructure as Code: Terraform/OpenTofu + Ansible at production scale.
Bare-metal infrastructure experience: IPMI/iDRAC, PXE boot, RAID configuration, hardware diagnostics, datacenter operations.
Networking fundamentals: BGP, VLAN, IPsec/WireGuard, DNS, load balancing.
Strong written and verbal English (B2+ minimum) — documentation, postmortems, and cross-team communication are in English.
Proactive mindset: demonstrated history of identifying problems before they become incidents and driving improvements without being asked.

Nice to have

Experience building multi-tenant Kubernetes platforms (vCluster, Capsule, or custom namespace isolation).
Crossplane or similar Kubernetes-native infrastructure abstraction.
Policy-as-Code: Kyverno, OPA Gatekeeper, or Kubewarden.
Container security: image signing (Sigstore/cosign), runtime security (Falco), sandboxed execution (Kata Containers, gVisor).
SRE practices: SLO/SLI design, error budget policies, chaos engineering (LitmusChaos, Chaos Mesh), incident management frameworks.
FinOps: OpenCost, Kubecost, cloud cost optimization.
Immutable OS experience: Talos Linux, Flatcar Container Linux, or similar.
OpenNebula experience (we are migrating FROM it, so understanding it accelerates the transition).
Experience with LINSTOR/DRBD or TopoLVM for local high-performance storage.
SR-IOV and DPDK experience for hardware-accelerated networking.
Experience migrating from traditional virtualization (VMware, OpenNebula, Proxmox) to Kubernetes/KubeVirt.
Grafana LGTM stack (Mimir, Loki, Tempo) for observability.
Compliance environment experience (SOC2, ISO 27001, NIS2).
Go or Python programming for infrastructure tooling.
Experience with Juniper JunOS switch configuration.

What we’re looking for

Proactive mindset. Our current IaaS workload is still around 50% unplanned work, including incidents and ad hoc support requests. We’re looking for someone who can reduce that through better automation, preventive controls, and more resilient systems.
Platform-minded. You look for ways to replace repetitive support work with scalable solutions, for example, building self-service workflows instead of provisioning VMs manually, or introducing automated QoS policies instead of handling limits case by case.
Able to work across the current and future stack. We operate OpenNebula and Ceph today while moving toward a Kubernetes-native platform. This role requires someone who can keep the current environment reliable while helping build the next stage in a practical way.
Transparent in communication. We value technical discussions, architectural decisions, and incident reviews happening in shared channels and documented formats. That includes ADRs, postmortems, and clear written updates.
Focused on knowledge sharing. You document your work, write runbooks as you go, and help make the platform easier for others to operate and support.
Strong English communication. Documentation, postmortems, Jira updates, Slack discussions, and cross-team collaboration are conducted in English.

Benefits

What's in it for you?

A focus on professional development.
Interesting and challenging projects.
Fully remote work with flexible working hours, so you can schedule your own day and work from any location worldwide.
24 days of paid vacation per year, 10 days of national holidays, and unlimited sick leave.
Compensation for private medical insurance.
Co-working and gym/sports reimbursement.
Budget for education.
The opportunity to receive a reward for the most innovative idea that the company can patent.

By applying for this position, you consent to the processing of your personal data as described in our Privacy Policy (https://cloudlinux.com/candidate-privacy-notice), which provides detailed information on how we maintain and handle your data.

Top Skills

Ansible

ArgoCD

BGP

Calico

Ceph

Cilium

GitOps

IaaS

IPsec

Kubernetes

KubeVirt

Linux

OpenNebula

Terraform

WireGuard
