Site Reliability Engineer at Thumbtack
San Francisco, CA, US
We're looking for an exceptionally talented engineer to take the lead in managing our growing infrastructure, ensuring our site stays up and performs well, and refining our processes for operating our production systems. Working closely with the rest of our engineering team, you'll have a great deal of authority in designing and implementing the hardware and software systems we use to host, manage and monitor our production environment.

Thumbtack's infrastructure has always been managed by our software engineers, but we're not experts and don't know all the best practices for configuring, monitoring and managing a complex site. Our traffic has effectively outgrown our limited operational know-how (an hour of downtime means real lost revenue at this point), and while there haven't been any major disasters, we recognize it's time to take our operations to the next level. Our Python deploys could be much smoother, our monitoring could be more systematic and accessible, and our alerting could be much less noisy.

We're looking for someone to own Thumbtack engineering operations and push us forward. As the authority on operations here, you'll plan and execute how we manage and monitor our site as it grows. You'll continually look for new ways to make our systems more reliable and easier to manage, incorporating third-party tools when available and writing software of your own when nothing else fits the bill. You'll anticipate performance bottlenecks and provision new hardware as necessary. Finally, we'd love to find someone who's excited to learn, expanding their skills and expertise as the site continues to grow and develop.

Our current infrastructure:

Our site operates on about a dozen dedicated Linux machines running RHEL5, managed via Puppet
Our main data stores are Postgres (website backend) and Mongo (internal analytics); we also make use of Redis and Memcached
We use Munin, Graphite and a handful of custom tools for monitoring and alerting
We practice continuous deployment using a custom one-click deployment system written in Python; auxiliary systems are deployed directly via Puppet

About you:

Expert in Linux administration, security and configuration management
Deep knowledge of the steps involved in serving a web request and experience dealing with the corresponding infrastructure components
Fanatical about monitoring
Enjoy diagnosing and fixing misbehaving and underperforming Linux servers
Fluent with the shell and comfortable writing tools in Python to automate our operations and development processes
Experience tuning database performance is a plus
Comfortable working with a great deal of autonomy
Excited to continually learn, grow and share knowledge
