A Site Reliability Engineer (SRE) focuses on building and operating scalable, reliable systems that support business services. The role blends software engineering and systems administration to reduce operational toil and improve resilience. This position suits candidates with strong coding skills, deep systems knowledge and a passion for automation and observability.
This job description outlines the role, duties, and required qualifications for hiring an experienced Site Reliability Engineer to ensure platform availability and performance.
Site Reliability Engineer Job Profile
The Site Reliability Engineer drives reliability and scalability across production services by applying software engineering practices to operations. They design, implement and maintain automation for deployment, monitoring and incident response.
SREs collaborate with development teams to define service-level objectives and continuously improve system performance, resilience, and efficiency through tooling, runbooks, and capacity planning.
Site Reliability Engineer Job Description
The Site Reliability Engineer will be responsible for maintaining high availability of critical services, developing automation to reduce manual intervention and creating robust observability for platform behaviour. This includes designing fault-tolerant architectures, performing root cause analysis and leading post-incident reviews to prevent recurrence.
Key activities include building and owning CI/CD pipelines, managing infrastructure as code, tuning performance, and ensuring secure, compliant deployments. The SRE will operationalise best practices for release safety, such as blue-green and canary deployments, and guide teams on scalability trade-offs.
The role requires proactive capacity planning and cost control for cloud resources, driving optimisation through telemetry and analytics. The ideal candidate will champion resilience engineering, mentor peers on incident management processes and contribute to an on-call rota to support 24/7 service delivery.
Site Reliability Engineer Duties and Responsibilities
- Design, implement and operate reliable, scalable systems and services.
- Develop automation for provisioning, configuration and deployments using infrastructure as code.
- Create and maintain monitoring, logging and alerting to provide clear observability.
- Define and measure service-level objectives and indicators, and drive improvements.
- Lead incident response and conduct post-incident reviews with actionable remediation.
- Optimise system performance, capacity and cloud costs through continuous analysis.
- Build and maintain CI/CD pipelines and deployment strategies such as canary releases.
- Collaborate with software engineers to ensure reliability is considered early in designs.
- Develop runbooks, playbooks and documentation for on-call engineers and teams.
- Mentor junior engineers and promote reliability engineering practices across the organisation.
Site Reliability Engineer Requirements and Qualifications
- Bachelor's degree in Computer Science or Engineering, or equivalent experience.
- Proven experience in reliability engineering, platform engineering or SRE roles.
- Strong programming skills in Python, Go, Ruby or similar languages for automation.
- Experience with cloud platforms such as AWS, Azure or Google Cloud and their native services.
- Deep knowledge of containers, Kubernetes and orchestration technologies.
- Familiarity with infrastructure as code tools such as Terraform, CloudFormation or Ansible.
- Proficiency with observability stacks: Prometheus, Grafana, ELK, Jaeger or similar.
- Experience designing and operating CI/CD pipelines and release automation.
- Strong troubleshooting skills and experience with incident management processes.
- Excellent communication skills and ability to work cross-functionally in distributed teams.
