Our partner is driving digital transformation and building innovative, customer-focused energy solutions while transitioning to a net-zero world. This hands-on role combines site reliability engineering (SRE) principles with observability leadership to ensure that the platforms deliver scalable, highly available, and self-service-ready experiences for both consumers and internal development teams. Join us in shaping the next generation of platform operations while building a culture of proactive service support in a dynamic and innovative environment.
Tasks
- Ensure the reliability, performance, and scalability of large-scale, cloud-based applications and infrastructure.
- Define and continuously enhance reliability-focused metrics (latency, error rates, saturation, availability) and business/application-specific metrics (e.g., login success rate).
- Develop and implement automated solutions to enhance operational efficiency, including infrastructure as code (IaC) practices for scalable and consistent environments.
- Partner with engineering teams to embed observability and reliability into CI/CD pipelines, ensuring issues are caught before reaching production.
- Monitor applications and infrastructure end-to-end to guarantee smooth and efficient performance, including proactive alerting and tracing.
- Identify and address system issues, driving incident response and root cause analysis to maintain uptime and reliability.
- Collaborate with software developers, engineers, and operations teams to optimize system performance and reduce toil.
- Implement automated solutions for incident resolution to ensure proactive platform stability, with the aim of detecting and resolving issues before they impact consumers.
- Participate in an on-call rotation for P1/P2 major incidents, fostering a blameless culture and continuous improvement in incident handling.
Requirements
- Bachelor’s degree in Computer Science, Software Engineering, IT Operations, or a related field
- 5+ years of relevant experience in DevOps engineering
- Proficient in Python, Ruby, or JavaScript (React, Angular, Next.js)
- Skilled in Bash or PowerShell scripting
- Strong communication and stakeholder management
- Proven experience in observability for global, high-availability platforms
- Familiar with monitoring tools (Grafana, ELK, Splunk) and querying observability data
- Logical troubleshooting approach and technical curiosity
- Solid understanding of networking concepts and protocols
- Experience managing Azure DevOps for efficient CI/CD
- Infrastructure experience with AWS or Azure
- Strong Linux/Unix systems knowledge
- Proficient with infrastructure-as-code tools (Ansible, Puppet, Chef, Terraform)
- Skilled in cloud infrastructure services: identity, networking, storage, databases, containers, serverless
Our Offer
- Hybrid work model (2 days in the office, 3 days working from home)
- Life & health insurance, medical care package
- Modern office with collaboration and chill-out spaces
- Learning and development opportunities to grow your career
- Inclusive, respectful company culture with strong support for diversity
- No travel requirement and no relocation needed