Jobgether has ALL remote jobs globally. We match you to roles where you're most likely to succeed and provide feedback on every application to help you learn. No more guesswork, application black holes, or recruiter ghosting in your job search.
For one of our clients, we are looking for a Senior Site Reliability Engineer to work remotely from Europe.
As a Senior Site Reliability Engineer, you will be responsible for designing, maintaining, and optimizing reliable and scalable systems. You will track performance metrics, improve system reliability through automation, and ensure best practices for incident management. With your expertise in cloud services, container orchestration, and system performance, you will drive initiatives to enhance the efficiency and robustness of the infrastructure while collaborating closely with engineering teams to design systems built for high availability. This is a key role for someone passionate about building and maintaining resilient systems that ensure seamless operations at scale.
Accountabilities
- Develop and maintain systems that are reliable, scalable, and efficient.
- Define and track Service Level Objectives (SLOs) and Service Level Indicators (SLIs) to ensure optimal system performance.
- Conduct blameless post-incident reviews, identify root causes, and implement preventive actions.
- Automate operational tasks and incident responses, and contribute to system performance optimizations.
- Work with engineering teams to ensure systems are designed for reliability, scalability, and maintainability.
- Continuously evaluate and improve system performance, capacity, and cost efficiency.
- Participate in the on-call rotation, providing troubleshooting and resolution support for critical issues.
Requirements
- Bachelor's degree in Computer Engineering or a similar discipline.
- 5+ years of experience in a Site Reliability Engineer or similar role.
- 3+ years of experience with AWS services and container orchestration tools.
- 2+ years of Kubernetes experience.
- Strong knowledge of observability tools and principles (monitoring, logging, tracing).
- Hands-on experience with Terraform for infrastructure as code.
- Proficiency in at least one programming language (e.g., Python, Go, Java).
- Experience in incident management, postmortem analysis, and risk mitigation.
- Familiarity with messaging systems such as SNS and SQS, and experience with CI/CD tools.
- Fluency in English with strong communication skills.
Benefits
- Fully remote role, with flexibility to work from anywhere.
- Competitive salary and performance-based incentives.
- Health insurance coverage.
- Annual wellness and learning credits to support professional growth.
- Work-from-anywhere stipend.
- Annual all-expenses-paid company retreat to an exciting destination.
- Inclusive, diverse, and collaborative work environment.