SurveyMonkey | Senior SRE | San Mateo, CA | Full-time | Onsite
We're hiring a Senior SRE to work on our Transit team. The Transit team is responsible for the infrastructure around data in transit - caching, messaging, queueing, and application routing (nginx). We're slowly transitioning into more of a "tier one" team, which means we'll be responsible for those systems plus others, such as service discovery and site performance.
We're looking for someone to help us maintain these systems and, perhaps more importantly, chart the right path for an AWS migration. One of my favorite parts of this job is that we work incredibly closely with engineering teams to help them choose the right tools for the job, and then help them implement their use cases around those tools. Our ideal applicant doesn't necessarily have experience with any of these specific tools, but does have general ops and/or application engineering experience and is willing and excited to learn the other side. Ideally you're also excited to mentor - a big part of this job is working with engineering teams and helping them build with the infrastructure in mind.
I find the best way to explain what our team does is to share some of the problems we've been working on:
# Redis Performance
For legacy reasons, we're running multiple instances of Redis on a single VM. They've recently started clobbering each other in competition for disk I/O. We've been working on various projects to migrate to new hardware, split the instances across separate VMs, and reduce or remove the use of Redis entirely.
# Site Routing
As we move more services to AWS, we've hit some complications with how we route web requests. How can we ensure the same routing expectations are met in AWS as in our on-prem datacenter, without massively increasing our operational burden or running multiple different systems that perform the same task in different datacenters?
# Network Traffic Reduction
Or, as we've taken to calling it, "the service call diet". This primarily manifests as a change in how we use our caching systems (other than Redis). Some of our services make hundreds of calls to memcache for a single request, which means any network latency can be magnified a hundredfold. We've been working with engineering teams to establish caching best practices and to build a new internal caching library that simplifies and abstracts our caching infrastructure - see the sketch below for a flavor of the idea.
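To put rough numbers on the magnification: at ~0.5 ms per intra-datacenter round trip, 100 serial cache calls add ~50 ms to a single request, while one batched lookup costs about a single round trip. Here's a minimal sketch of the kind of batching wrapper such a library might offer - pymemcache is a real client library, but the CacheFacade wrapper, get_batch, and load_user_rows names are hypothetical illustrations, not our actual internal API:

    # Illustrative sketch only: batch N cache lookups into one round trip
    # using pymemcache's get_many/set_many. Values are assumed to be
    # str/bytes (pymemcache's default, absent a custom serializer).
    from pymemcache.client.base import Client

    class CacheFacade:
        def __init__(self, server=("127.0.0.1", 11211)):
            self._client = Client(server)

        def get_batch(self, keys, loader, ttl=300):
            # One network round trip for every key we need.
            found = self._client.get_many(keys)
            missing = [k for k in keys if k not in found]
            if missing:
                # e.g. one batched DB query instead of N point lookups.
                loaded = loader(missing)
                # Warm the cache so the next request is all hits.
                self._client.set_many(loaded, expire=ttl)
                found.update(loaded)
            return found

    # Instead of N sequential gets (N round trips), a request handler
    # asks for everything it needs up front:
    #   cache = CacheFacade()
    #   rows = cache.get_batch(["user:%d" % u for u in user_ids],
    #                          load_user_rows)

The point isn't the specific client - it's collapsing hundreds of round trips into a handful, which is where most of that magnified latency goes away.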
If this sounds like something you'd be interested in, please email me [email hidden]. Come help us build this team!