Site Reliability Engineer
📍 Location: Remote
About the Role
Join our SRE team to ensure the reliability, performance, and availability of our production systems. You will design and implement monitoring solutions, incident response procedures, and automation tools.
Key Responsibilities
- Design and implement comprehensive monitoring and alerting systems
- Respond to and resolve production incidents
- Conduct post-mortem analysis and implement preventive measures
- Optimize system performance and scalability
- Automate operational tasks and workflows
- Define and track SLIs, SLOs, and error budgets
- Build and maintain disaster recovery procedures
Required Qualifications
- Bachelor's degree in Computer Science or related field
- 3-5 years of experience in SRE or DevOps
- Strong programming skills (Python, Go, or similar)
- Experience with monitoring tools (Prometheus, Grafana, Datadog)
- Deep understanding of Linux systems and networking
- Experience with incident management and on-call rotation
- Knowledge of distributed systems and microservices
Nice to Have
- Experience with chaos engineering
- Knowledge of performance testing tools
- Understanding of database internals
- Experience with capacity planning
- Contributions to open-source projects
What We Offer
- Competitive compensation
- On-call rotation compensation
- Flexible working hours
- Professional development budget
- Health benefits
- Work-life balance support
Quick Info
Location: Remote
Employment Type: Full-time
Other Open Positions
DevOps Engineer
Build and maintain CI/CD pipelines, manage Kubernetes clusters, and implement infrastructure as code.
MLOps Engineer
Design and deploy ML pipelines, manage model lifecycle, and ensure scalable ML infrastructure.
Cloud Architect
Design cloud-native solutions, optimize cloud infrastructure, and lead cloud migration projects.