Senior Site Reliability Engineer
At Crisis Text Line, our mission is to promote mental well-being for people wherever they are. Our technology powers real-time crisis support across the U.S. and globally, connecting people to help when they need it most.
We’re looking for a Senior Site Reliability Engineer to strengthen and scale the infrastructure behind our crisis care platform. In this role, you’ll ensure our systems are reliable, resilient, and observable—ready for every moment someone reaches out. You’ll bridge development and operations, champion automation, and drive a culture of reliability across engineering.
If you want to build systems that deliver help when it matters most, this is your opportunity.
Responsibilities
Automation & Infrastructure as Code
- Develop and maintain automation tools and frameworks to reduce manual operations.
- Implement Infrastructure as Code (IaC) using tools such as Terraform, CloudFormation, or similar.
- Build and manage CI/CD pipelines to support rapid, reliable deployments.
- Create self-service tooling and platforms that empower development teams.
System Reliability & Performance
- Design, implement, and maintain scalable, reliable, and secure infrastructure to support business-critical applications.
- Lead incident response, conduct root-cause analysis, and implement long-term preventive fixes.
- Optimize system performance through capacity planning and resource utilization improvements.
Monitoring & Observability
- Design and implement robust monitoring, logging, and alerting systems.
- Build dashboards and metrics that provide visibility into system and service health.
- Establish observability best practices across microservices and distributed systems.
- Reduce alert fatigue through intelligent alerting, automation, and clear runbooks.
Qualifications
Required
- 6–8+ years in Infrastructure, SRE, Platform, or DevOps engineering with strong Python and Linux/Unix fundamentals.
- Advanced AWS expertise (EC2, EKS/ECS, S3, IAM, VPC) and hands-on Kubernetes + Docker in production.
- Proficiency with Terraform and Infrastructure as Code best practices.
- Experience owning CI/CD pipelines, deployment automation, and promotion workflows.
- Observability and reliability experience (Datadog/Prometheus/Grafana, incident response, RCA).
- Security-minded engineering approach, ideally with exposure to regulated or healthcare environments.
Preferred
- Strong architectural thinking and ability to modernize legacy systems.
- Experience with trunk-based development, GitOps practices, or developer tooling.
- Demonstrated mentorship, technical guidance, or influence on engineering best practices.
- Effective collaborator with high ownership, able to operate independently and support developers.
- AWS cost-optimization experience or familiarity with Aurora/database performance.
- Experience working on mission-critical, distributed, or global-scale platforms.
Department Summary:
At Crisis Text Line, the Engineering, Product, and Design teams make up the Build department. Together, we design and deliver the trusted, innovative Crisis Care Platform that supports people in their most critical moments worldwide.
Our Vision is to:
- Deliver the most trusted, innovative, and easy-to-use Crisis Care Platform in the industry and drive unprecedented levels of growth for people in need worldwide.
- Ensure that every user feels a sense of community on our platform, allowing us to build trust and grow our impact.
- Enable our global affiliates to power their crisis support operations in an efficient and dynamic environment.
- Provide a services/API-first architecture that is based on federated sources of data and infuses predictive insights (ML and otherwise) into every aspect of our Platform and Experience.
Reliable High-Speed Internet Required: Must have a stable high-speed internet connection to support seamless remote collaboration, virtual meetings, online job tasks, etc.
The full salary range for this position, across all United States geographies, is $115,192 to $160,018. The upper portion of the salary range is typically reserved for existing employees who demonstrate strong performance over time. Starting salary will vary by location, qualifications, and prior experience; during the interview process, candidates will learn the starting salary range applicable for their location. We pay competitively in the tech-forward nonprofit space and offer a robust benefits package.
Only candidates in the following states will be eligible for employment: CA, CO, CT, FL, GA, IL, IN, IA, MD, MA, MI, MO, NJ, NM, NY, NC, OH, PA, TN, TX, UT, VA, WA.
Benefits:
Crisis Text Line employee benefits are thoughtfully designed using an equity lens, acknowledging that we are all unique human beings with individual life circumstances that require flexibility and support.
Benefits include:
- 20 paid holidays, including:
  - Federal holidays like Juneteenth and Labor Day
  - Election Day
  - Holiday break from December 24 through January 1
  - 2 renewal days
  - 2 floating holidays
- Flexible paid time off, including:
  - 15 vacation days
  - 3 personal days
  - 7 sick days
- Medical, dental, and vision benefits for the staff member and family at no cost to the employee
- 403(b) retirement plan (the nonprofit equivalent of a 401(k)): 3% contribution by Crisis Text Line to support building financial wellness, regardless of personal contribution
- 12 weeks paid parental leave (after 6 months of employment)
- Student loan repayment (after 2 years of continuous full-time service)
- Family support through a virtual childcare platform
- Stipends/Allowances:
  - Mental health (Monthly)
  - Internet Service (Monthly)
  - Professional Development (Annual)
  - Wellness (Annual)
  - Home office setup (One-time/First year)
(Benefits are only for US-based employees. International benefits may differ.)
This is a remote-only position.
Please note: The following eligibility applies to U.S.-based roles only.
Only candidates in the following states will be eligible for employment: CO, CT, FL, GA, IL, IN, MD, MA, MI, NJ, NM, NY, NC, PA, TN, TX, UT, VA, WA.