Our mission is to build groundbreaking AI products from scratch. We're looking for a Senior Platform Engineer to architect and build the high-availability, scalable platform that will power our entire AI operation.
Our platform will be built on a multi-region Azure foundation (AKS + Cosmos DB + Event Hubs). We are just starting to build our Platform team, and you will be a founding member. You won't just be operating a platform; you will be building it from the ground up: from the Terraform code for our AKS clusters to the CI/CD pipelines for our models. This is a hands-on role focused on engineering & automation. We work according to SRE best practices with the goal of creating a platform that will achieve 99.9%+ availability.
What You'll Do:
- Build the Platform from Scratch:
  - Code new AKS clusters, networking (VNet), and IAM guardrails using Terraform and Helm charts.
  - Create 'golden' Docker images, GitOps pipelines (ArgoCD/Flux), automatic node provisioning, and scaling policies for both CPU and GPU workloads.
  - Design and implement the core MLOps infrastructure, including artifact repositories, model registries, and feature stores.
- Automate for Reliability:
  - Implement and fine-tune our observability stack: Azure Monitor metrics, Prometheus, Grafana dashboards.
  - Build automated recovery mechanisms and chaos engineering tests to proactively find and fix weaknesses in the system.
- Champion Platform Best Practices:
  - Work with development teams to ensure they are building reliable, observable, and secure applications from day one.
  - Create runbooks and documentation to prepare for future incident management.
Key Responsibilities:
- IaC Development and Maintenance: Manage our infrastructure state with Terraform Cloud or Atlantis.
- Kubernetes Operations: Handle version upgrades, manage node pools (including GPU nodes), and define network policies.
- Data Environment Reliability: Ensure the reliability of our data stores (e.g., Cosmos DB geo-replication, Event Hubs consumer group management).
- Security Hardening: Implement security best practices, including CVE scanning for Docker images and regular patching of AKS node images.
- Observability Pipeline: Manage log processing, alerting rules, and capacity forecasting to stay ahead of problems.
- Support AI Engineers: Provide a self-service platform and tooling that enables AI Engineers to train, deploy, and monitor their models with minimal friction.
What You'll Bring:
- 5+ years of experience in a DevOps, SRE, or Platform Engineering role.
- Deep, hands-on experience with at least one major cloud provider (Azure is a strong plus).
- Proven experience with containerization (Docker) and orchestration (Kubernetes) in a production environment.
- Expertise in Infrastructure as Code (Terraform is a must).
- Strong programming skills in a scripting language (Python is a strong plus).
- Experience building and maintaining production-grade CI/CD systems.
- A proactive mindset focused on preventing incidents rather than just reacting to them.
What We Offer:
- A Green-field Opportunity: You will be building a state-of-the-art AI platform from the ground up, using the best tools for the job.
- A Modern Toolkit: Work with GitHub, Kubernetes, Managed Grafana, Terraform, and the latest Azure AI services.
- Real Impact: Your work is the foundation upon which our entire AI strategy is built. You are a critical enabler for the entire team.
- Focus on Engineering, Not Firefighting: In the initial phase, your role is 100% focused on building and automating, not on reactive, on-call firefighting.
- A Laid-back, Senior Team: We have one daily stand-up, then we focus on deep work.
- Competitive Salary.
- Home-office friendly, with a cool HQ in Budapest.