Uptime.com is hiring a Site Reliability Engineer to join our team and help us build reliable, robust software solutions for our customers.

As a Site Reliability Engineer, you will have the opportunity to work remotely and collaborate with a talented team of engineers. You will be responsible for monitoring system performance, troubleshooting technical issues, deploying code changes, and ensuring the best possible customer experience.

We are looking for someone who has extensive experience in cloud infrastructure, distributed systems engineering, scripting, and automation, along with container tooling such as Docker and Kubernetes. You should also have the skills needed to manage service outages and keep systems available by writing scalable software.

If you are a problem solver, have excellent communication skills, and thrive in a fast-paced environment, we want to hear from you!

What you'll do:

1. Design and Manage Cloud-Based Systems:

- Design robust, scalable, and highly available SaaS systems on AWS.

- Use AWS services such as EKS, EC2, and AWS Lambda to run web servers and serverless background tasks.

2. Scalability and Redundancy:

- Implement AWS Auto Scaling and Elastic Load Balancing to handle varying loads and traffic spikes.

- Ensure high availability by deploying services across multiple Availability Zones and using Amazon RDS with Multi-AZ deployment for databases.

3. Automated Provisioning and Deployment:

- Use tools like Terraform for infrastructure provisioning and Ansible for configuration management.

- Integrate CI/CD pipelines using tools such as GitHub Actions to ensure consistent, repeatable, and efficient deployment processes.

- Employ Git for version control and implement automated testing to ensure reliability.

4. Monitoring and Reliability:

- Set up comprehensive monitoring solutions using Prometheus, Grafana, and CloudWatch.

- Configure these tools to track key performance metrics and set up alerts for any critical performance indicators.

- Implement log management using Amazon CloudWatch Logs or the ELK Stack and distributed tracing with AWS X-Ray.

5. Security:

- Secure the SaaS system against DoS and DDoS attacks using AWS WAF and AWS Shield.

- Manage access control with AWS IAM and ensure network security through security groups and network ACLs.

- Develop strategies for both prevention and active mitigation of security threats.

6. Incident Management and Troubleshooting:

- Diagnose and resolve complex technical issues related to website monitoring, including network troubleshooting and performance issues.

- Conduct traceroute analysis, review CDN functionality, and check firewall/security group settings to resolve regional issues.

7. Collaboration and Communication:

- Work closely with cross-functional teams to ensure seamless integration and operation of monitoring services.

- Provide clear documentation of processes, configurations, and troubleshooting steps.

- Communicate effectively with team members and stakeholders to maintain a smooth workflow.

8. Continuous Learning and Improvement:

- Stay updated with the latest industry trends, tools, and technologies in cloud infrastructure, DevOps, and website monitoring.

- Participate in continuous learning opportunities and apply new knowledge to improve Uptime.com's services and operations.

Requirements

Educational Background:

• Bachelor’s degree in Computer Science, Information Technology, or a related field (preferred but not required).

Experience:

• At least 5 years of experience in Site Reliability Engineering (SRE) or DevOps roles.

• Proven experience in designing and managing highly available, scalable systems in a cloud environment, particularly AWS.

Technical Skills:

• Linux Administration: Strong knowledge of Linux server administration and scripting languages, particularly Python.

• Containerization: Experienced with containerization technologies such as Docker and Kubernetes.

• Scripting Languages: Proficient in modern scripting languages like Python and Bash.

• Microservices Architecture: Deep understanding of modern microservices-based architectures and operations.

• Cloud Services: Hands-on experience with AWS services, including EC2, RDS (with Multi-AZ deployments), AWS Lambda, Auto Scaling, and Elastic Load Balancing.

• Provisioning and Configuration Management: Familiarity with Terraform for infrastructure provisioning and Ansible for configuration management.

• Automated Provisioning/Deployment: Knowledge of automated provisioning and deployment tools, CI/CD pipeline integration (e.g., GitHub Actions), version control (e.g., Git), and automated testing.

Operational Skills:

• Defensive Coding: Practices defensive coding techniques and patterns to ensure high availability and reliability.

• Monitoring and Reliability: Experience with monitoring tools like Loki and with implementing log management solutions such as Amazon CloudWatch Logs or the ELK Stack.

• Incident Management: Strong problem-solving skills with the ability to diagnose and resolve complex technical issues, particularly related to network troubleshooting and performance optimization.

Security:

• Security Best Practices: Knowledge of security best practices, including the use of AWS WAF, AWS Shield, IAM, security groups, and network ACLs to prevent and mitigate DoS and DDoS attacks.

Soft Skills:

• Collaboration: Demonstrates excellent problem-solving skills and strong collaboration abilities with cross-functional teams.

• Communication: Excellent written and verbal communication skills.

• Remote Work: Thrives in a fast-paced, remote work environment.

• Continuous Improvement: Commitment to continuous learning and improvement, staying updated with the latest industry trends and technologies.

Benefits

How We Will Support Your Growth and Success:

• Collaborative Environment: Engage in meaningful collaborations with executives, leadership, and cross-functional teams, including engineering, marketing, and business operations. This exposure provides a comprehensive understanding of our business and fosters a holistic approach to problem-solving.

• Innovative Industry Exposure: Dive into the dynamic world of monitoring, observability, and SRE.

• Supportive Team Culture: Join a team of passionate, dedicated professionals united in our goal to build the best monitoring service in the world. Our supportive culture encourages knowledge sharing, mutual respect, and collective success.

• Fully Remote Work Arrangements: Embrace the flexibility of working from home, anywhere in the world.

• Unlimited Paid Time Off: Enjoy the freedom of unlimited paid time off, including vacation, sick days, and public holidays. This benefit extends to all employees, including our global contractors.

• Family Leave Policies: Comprehensive family leave policies, including maternity and paternity leave.

• Diverse and Inclusive Workplace: Be part of a company that values diversity and fosters an inclusive environment where everyone feels valued and empowered.