CloudLinux
Senior IaaS / Kubernetes Platform Engineer (worldwide remote, work anywhere)
CloudLinux is a global remote-first company. We are driven by our principles: do the right thing, employees first, and remote first. We deliver high-volume, low-cost Linux infrastructure and security products that help companies increase the efficiency of their operations. Everyone on our team supports one another and does what they can to ensure we are all successful.
Check out our website for more information: https://cloudlinux.com/
We are looking for a Senior IaaS / Kubernetes Platform Engineer to join our Infrastructure Department and become a key contributor to the design, implementation, and operation of our private cloud and multi-tenant Kubernetes platform.
Our infrastructure powers 500+ VMs across multiple datacenters, serving 20+ engineering teams. We are in the process of evolving from an OpenNebula-based virtualization platform toward a Kubernetes-native multi-tenant cloud with KubeVirt for VM orchestration — while maintaining reliability and operational excellence throughout the transition.
You will work alongside the existing IaaS Tech Lead and Network Engineer, and must be capable of independently owning and operating the full IaaS stack (compute, storage, networking, bare metal) if needed. This is not a "Kubernetes-only" role — it requires deep infrastructure generalist skills combined with Kubernetes platform expertise.
What You Will Do
Kubernetes Platform Engineering (Primary Focus — 40%)
- Design, build, and operate a multi-tenant Kubernetes platform using Cluster API (CAPI) with bare-metal providers (Metal3/Sidero).
- Implement hard multi-tenancy using vCluster (Loft Labs) or similar technology, providing isolated Kubernetes API servers per tenant.
- Deploy and manage KubeVirt for VM orchestration within Kubernetes, including CPU pinning, NUMA awareness, and HugePages configuration.
- Implement GitOps-driven infrastructure using ArgoCD or Flux as the single source of truth for all cluster configurations.
- Deploy and manage Policy-as-Code using Kyverno or OPA Gatekeeper for admission control, resource quotas, and security policies.
- Build self-service capabilities using Crossplane or similar Kubernetes-native infrastructure provisioning tools.
Storage Engineering (20%)
- Operate and optimize Ceph distributed storage clusters (currently 1 PiB raw, 149 OSDs, Quincy 17.2.5).
- Manage Rook-Ceph operator deployments at scale on modern Kubernetes (v1.28+).
- Implement storage tiering: Ceph for bulk storage, local NVMe for high-IOPS workloads, LINSTOR/DRBD or TopoLVM for ultra-fast replicated storage.
- Design and implement per-VM / per-tenant I/O isolation on shared Ceph clusters.
- Manage CDI (Containerized Data Importer) for VM image lifecycle in KubeVirt environments.
Networking (15%)
- Deploy and manage overlay networks for pod networking, micro-segmentation, and WireGuard/IPsec encryption.
- Implement Cluster Mesh for multi-datacenter pod-to-pod connectivity.
- Configure Multus CNI and SR-IOV for multi-NIC VM support in KubeVirt.
- Work with physical network infrastructure: Juniper switches (JunOS), BGP (eBGP/iBGP), EVPN/VXLAN, VLANs.
- Maintain IPsec site-to-site connectivity between datacenters.
Reliability and Operations (15%)
- Practice SRE discipline: define and maintain SLOs with error budgets, implement proactive capacity management with 6-12 month forecasting.
- Design and execute chaos engineering experiments to validate system resilience.
- Participate in on-call rotation for IaaS infrastructure (OpenNebula, Ceph, networking).
- Write and maintain runbooks, DRP documentation, and postmortem analyses.
- Drive proactive improvement: identify reliability risks, performance bottlenecks, and toil — then propose and implement solutions without waiting for incidents.
Infrastructure as Code and Automation (10%)
- Develop and maintain Terraform/OpenTofu modules for multi-cloud infrastructure provisioning.
- Write Ansible playbooks for bare-metal server configuration and fleet management.
- Automate infrastructure lifecycle: PXE boot images, hardware provisioning (Foreman), IPMI management.
- Implement FinOps practices: cost attribution, resource utilization analysis, right-sizing recommendations using OpenCost/Kubecost.
Requirements
Must have
- 5+ years in infrastructure/platform engineering roles, with at least 3 years operating production Kubernetes clusters (not just deploying apps on K8s, but building and managing the platform itself).
- Production experience with at least 3 of the following:
  - KubeVirt or similar VM-on-K8s technology
  - Cluster API (CAPI) for declarative cluster lifecycle management
  - Cilium or Calico (advanced CNI with eBPF or BGP integration)
  - Rook-Ceph or other Kubernetes storage operators at scale (100+ OSDs)
  - ArgoCD or Flux for GitOps-driven infrastructure management
- Deep Linux systems knowledge: kernel tuning, networking stack (iptables/nftables, routing, bonding, VLAN), filesystem operations, performance troubleshooting.
- Ceph distributed storage experience: cluster operations, OSD lifecycle, pool management, performance tuning, troubleshooting degraded states.
- Infrastructure as Code: Terraform/OpenTofu + Ansible at production scale.
- Bare-metal infrastructure experience: IPMI/iDRAC, PXE boot, RAID configuration, hardware diagnostics, datacenter operations.
- Networking fundamentals: BGP, VLAN, IPsec/WireGuard, DNS, load balancing.
- Strong written and verbal English (B2+ minimum) — documentation, postmortems, and cross-team communication are in English.
- Proactive mindset: demonstrated history of identifying problems before they become incidents and driving improvements without being asked.
Nice to have
- Experience building multi-tenant Kubernetes platforms (vCluster, Capsule, or custom namespace isolation).
- Crossplane or similar Kubernetes-native infrastructure abstraction.
- Policy-as-Code: Kyverno, OPA Gatekeeper, or Kubewarden.
- Container security: image signing (Sigstore/cosign), runtime security (Falco), sandboxed execution (Kata Containers, gVisor).
- SRE practices: SLO/SLI design, error budget policies, chaos engineering (LitmusChaos, Chaos Mesh), incident management frameworks.
- FinOps: OpenCost, Kubecost, cloud cost optimization.
- Immutable OS experience: Talos Linux, Flatcar Container Linux, or similar.
- OpenNebula experience (we are migrating FROM it, so understanding it accelerates the transition).
- Experience with LINSTOR/DRBD or TopoLVM for local high-performance storage.
- SR-IOV and DPDK experience for hardware-accelerated networking.
- Experience migrating from traditional virtualization (VMware, OpenNebula, Proxmox) to Kubernetes/KubeVirt.
- Grafana LGTM stack (Mimir, Loki, Tempo) for observability.
- Compliance environment experience (SOC2, ISO 27001, NIS2).
- Go or Python programming for infrastructure tooling.
- Experience with Juniper JunOS switch configuration.
What we’re looking for
- Proactive mindset. Our current IaaS workload is still around 50% unplanned work, including incidents and ad hoc support requests. We’re looking for someone who can reduce that through better automation, preventive controls, and more resilient systems.
- Platform-minded. You look for ways to replace repetitive support work with scalable solutions, such as building self-service workflows instead of provisioning VMs manually, or introducing automated QoS policies instead of handling limits case by case.
- Able to work across the current and future stack. We operate OpenNebula and Ceph today while moving toward a Kubernetes-native platform. This role requires someone who can keep the current environment reliable while helping build the next stage in a practical way.
- Transparent in communication. We value technical discussions, architectural decisions, and incident reviews happening in shared channels and documented formats. That includes ADRs, postmortems, and clear written updates.
- Focused on knowledge sharing. You document your work, write runbooks as you go, and help make the platform easier for others to operate and support.
- Strong English communication. Documentation, postmortems, Jira updates, Slack discussions, and cross-team collaboration are conducted in English.
Benefits
What's in it for you?
- A focus on professional development.
- Interesting and challenging projects.
- Fully remote work with flexible working hours, so you can schedule your own day and work from any location worldwide.
- 24 days of paid vacation per year, 10 days of national holidays, and unlimited sick leave.
- Compensation for private medical insurance.
- Co-working and gym/sports reimbursement.
- Budget for education.
- The opportunity to receive a reward for the most innovative idea that the company can patent.
By applying for this position, you consent to the processing of your personal data as described in our Privacy Policy (https://cloudlinux.com/candidate-privacy-notice), which provides detailed information on how we maintain and handle your data.


