Site Reliability Engineer

📍 Location: Remote

About the Role

Join our SRE team to ensure the reliability, performance, and availability of our production systems. You will design and implement monitoring solutions, incident response procedures, and automation tools.

Key Responsibilities

  • Design and implement comprehensive monitoring and alerting systems
  • Respond to and resolve production incidents
  • Conduct post-mortem analysis and implement preventive measures
  • Optimize system performance and scalability
  • Automate operational tasks and workflows
  • Define and track SLIs, SLOs, and error budgets
  • Build and maintain disaster recovery procedures

Required Qualifications

  • Bachelor's degree in Computer Science or a related field
  • 3-5 years of experience in SRE or DevOps
  • Strong programming skills (Python, Go, or similar)
  • Experience with monitoring tools such as Prometheus, Grafana, or Datadog
  • Deep understanding of Linux systems and networking
  • Experience with incident management and on-call rotations
  • Knowledge of distributed systems and microservices

Nice to Have

  • Experience with chaos engineering
  • Knowledge of performance testing tools
  • Understanding of database internals
  • Experience with capacity planning
  • Contributions to open-source projects

What We Offer

  • Competitive compensation
  • On-call rotation compensation
  • Flexible working hours
  • Professional development budget
  • Health benefits
  • Work-life balance support

Ready to Apply?

Join our team and work on exciting projects with cutting-edge technologies.

Quick Info

Location: Remote
Employment Type: Full-time