Site Reliability Engineer

Posted: Wednesday, 05 May 2021 - San Francisco
You probably haven’t run into a company like Olark before.

We are 39 people distributed around the globe, working together to fundamentally change the way people communicate with businesses. We care deeply about customer service, which is reflected in our All Hands Support model. Our passion for and dedication to our more than 10,000 customers are a direct result of our people-centric org model.

You will play a pivotal role in helping define our engineering culture by contributing to our positive organization and fulfilling our company goals of providing the best possible experience to our customers. We do this by architecting and maintaining infrastructure to help thousands of people build stronger connections with their customers via real-time web chat.

Your primary responsibilities will be:
  • Automate Server and Application Provisioning: As part of the engineering team, you will help automate server and application provisioning using configuration management systems on a large scale to support our applications.
  • Build Scalable Infrastructure: You will architect scalable systems to support running our applications. These systems should be resilient to high loads of traffic and hardware failure.
  • Ensure High Service Availability: You will work towards maximizing the availability of our services to our customers. You will implement and maintain monitoring systems that will ease failure detection and recovery. You will also have primary emergency response responsibilities when infrastructure issues are escalated.
  • Customer Support: Every member of our team spends at least one day per month supporting our customers via Olark chat itself. You'll get a chance to talk directly to our customers and get firsthand experience using the product you'll build!

You will help us solve key engineering challenges:
Our backend infrastructure is composed of hundreds of virtual machines running on Rackspace’s public cloud. We maintain this configuration with Puppet and attempt to minimize busy work by implementing automation as we develop our infrastructure. We’re constantly thinking about how to grow while improving reliability via automation, monitoring, alerting, and self-healing systems. Some of the challenges you will help us solve are:

  • How do we effectively apply Software Engineering principles to infrastructure and operational challenges?
  • How do we build and deploy software that enables minimum latency for real-time messaging on a large scale?
  • How can we architect a distributed backend infrastructure that can serve hundreds of thousands of customers and is resilient to different types of failure?
  • How do we design systems that are able to heal themselves after unexpected failures with minimum human interaction?
  • How do we reliably store terabytes of transcripts and customer interactions?
  • How do we maintain real-time state across mobile, desktop, and web?
  • How do we engineer systems that allow users with sub-standard network connections to have a good real-time chat experience?

  • Your technical background and experience: You are a software developer with excellent Python skills. You have a solid understanding of different concurrency paradigms and have applied them in the past. You are very familiar with web technologies and patterns for scaling software like message queues, caching, and load balancing.
  • You are a strong software developer with an eye for automation opportunities: You have a strong desire to reduce manual, error-prone tasks by automating them through solid engineering practices.
  • Solid understanding of network programming and configuration: You’ve used command line tools to debug networking issues and have an understanding of different networking protocols. You also know about the challenges of building software in the cloud and are familiar with tools to manage infrastructure.
  • Self-directed, but open to collaboration: You are happy to work independently, but know when to ask for help. You know how to give and receive direct feedback. You are a team player, comfortable contributing to projects and open to leading them. You deal well with ambiguity and realize it is an opportunity to experiment.
  • Excellent communicator: You are able to clearly communicate your thoughts, especially in text-based communication, to both technical and non-technical audiences. You know when a conversation should be in chat, Skype, or face-to-face.
  • Empathetic: You realize listening is just as important as speaking your mind. You assume good faith of your teammates when there is conflict and are curious about understanding their perspectives.
  • Comfortable with or open to working remotely: The majority of your team will be distributed and you’ll communicate via remote tools: Slack, email, video conferencing, etc.
  • Meticulous attention to detail: You should be able to review code written by other engineers and find room for improvement. You should be able to write test suites for your code that exercise complicated code paths and prevent production mishaps. Every time we push out new code, we have an opportunity to delight thousands of customers (or risk giving them a bad day).
  • Write good code: You should have examples of code you have written that is easy to read, maintainable, and testable. You should be able to decompose complicated problems into elegant solutions anyone on the engineering team can understand.  
  • Always learning: You are a curious individual who is constantly learning about new technologies and enjoys sharing what you've learned with your teammates.
  • Significant production experience: You have 2+ years of experience shipping and maintaining production code that is used by thousands of people. You have run into the edge cases of operating at scale, and can teach us how to avoid them.

You can expect a lot from us:
First off, make sure to read about our team culture and our values. You can also get a sense of our history. Beyond what you read there, as a member of the engineering team you can expect:
  • A great remote culture and team: Even though we’re geographically dispersed, our team makes the effort to connect with one another, and we provide in-person opportunities to further strengthen that bond. We genuinely like each other.
  • A life outside of work: Olarkers generally work 40-hour weeks. Work is a marathon, not a sprint. We are building a company for the long haul.
  • Quality-driven culture: We strive to automate testing and heavily monitor our production system. Our goal is to ensure that any end-user issues are short-lived and limited in scope.
  • Time to focus: We have one required weekly engineering meeting and a team-wide sync, and we are very conscious of protecting the time we need to stay focused.

Olark is committed to diversity in its workforce. Olark is an equal employment opportunity employer and considers qualified applicants without regard to gender, sexual orientation, gender identity, race, veteran or disability status.