Some companies project an image of an engineering paradise, where their technologies are awesome, the staff are stunning, and the keyboards are made of gold. But after joining, you discover that the technologies are crummy, the staff can be terrible, and your right shift key is sticky. It's a company where dumb stuff happens but nobody can fix it, and the future outlook is dull. The paradise was a facade.
Netflix is not like that.
After two years I still find Netflix an amazing place to work. What's surprised me most is the lack of unpleasant surprises. Before I joined, Netflix looked awesome, but I wondered if there was some catch or downside that I'd only learn after working here. I have yet to find one.
How does Netflix do it? I think the culture deck goes a long way to explain how, and it's still true. In short: everyone is professional, we work well together, innovate (freedom and responsibility), and have a good work/life balance. It works well.
This is also the most challenging job of my life, in part because of what I've chosen to work on. So far I've worked on kernel and hypervisor internals and on various runtimes and databases, created many new performance tools, and performed distributed systems analysis as an SRE.
Me?...an SRE? Yes, I've volunteered to join the on-call rotation for the Core SRE team, where I get to do distributed systems analysis and SRE work. Core SRE is the central SRE team at Netflix. My colleague, Dave Hahn, gave a good talk about it at AWS re:Invent.
Working on Core issues at Netflix has been a great opportunity for me to grow professionally, expanding from my typical specialty of systems analysis to doing distributed systems analysis at scale. Others on the performance team (which I'm on) also help Core, and our teams are co-located. The kind of work we do is similar – troubleshooting complex software written by others.
Netflix is composed of many microservices, and each has its own team that is on-call. Much of the time, service teams have set up early alerts so that they are notified and can fix issues before they affect customers. Core gets paged when there is a customer-impacting issue.
When we get paged, we'll often find that a service team is already working on the issue and about to deploy a fix. Sometimes, however, we find that no one is working the issue – you're it! You have to figure out what – in the massive Netflix ecosystem – is broken, and help get it fixed ASAP. At the very least, Twitter is a slot machine of people saying that Netflix is down. If it's a really bad outage, major news networks are speculating about what Netflix is doing to fix it. They're talking about you, and the decisions you make in the next five minutes. Pressure's on!
I was both excited and a little terrified to do Core on-call, but for a few reasons it's been easier than I thought:
- Bad incidents are rare (thanks to planning, architecture, and regular Chaos testing).
- I'm never working an incident alone: other staff join to help, and I can page others.
- We have staff who are experts at distributed systems analysis, from whom I can learn.
- We have specialized tools that make this kind of analysis much easier.
- I'm only on-call about 3-4 days per month.
This has helped me gain some experience in distributed systems analysis at massive scale, and with the new methodologies and metrics involved: upstream vs downstream latency, timeout and failover statistics, error rates, auto-correlations, etc. I look forward to learning more and getting better at this – it's been a highlight so far of my time at Netflix.
Some other highlights from two years at Netflix:
- Building Java CPU flame graphs, helping take this from a crazy idea (modifying HotSpot) to production.
- Developing new flame graph features: differentials (and using them for CPI flame graphs), and search.
- Developing a toolkit of Linux perf and ftrace tools: perf-tools, and using them in production.
- Getting some experience with FreeBSD on systems running heavy workloads.
- Learning some Xen internals, and contributing a Xen patch.
- rxNetty vs Tomcat analysis: getting deep on new vs old technologies.
- Many conference talks, including my first BSD conferences, JavaOne, and LinuxCon.
- Contributing to Linux BPF and developing new bcc tracing tools.
- Playing cricket on the Netflix cricket team!
- Working with awesome staff.
Plus a few more highlights I can’t talk about (yet). I also have many interesting projects I'll be working on next, which makes for a long todo list. More highlights to come.
Should you work here?
Check the culture deck. It's not for everyone, and that's ok – being open allows people to self-select. You can also read my own perspective about working at Netflix in a previous post, where I talked more about culture and recruiting. It's still true. And as with my last post, no one at Netflix asked me to write this one. I'm writing because I think our company culture is worth talking about, whether it helps you join us or helps you improve the culture at your own company.
If the Netflix culture and company do sound attractive, and you're a senior professional with relevant skills and a top performer, I'd recommend checking our jobs page. Most of these jobs are based in Los Gatos, where we've just opened two new buildings and two more are on the way. These have staff at desks in a semi-open office layout, with private places you can visit to work.
A problem you might have, if you are a senior engineer, is the phenomenon of "the more you know, the more you don't know". If you've dug deep on technologies in the past, you may have been exposed to their vast complexities, and how much you've yet to learn. This can hurt your confidence, and that's a problem when applying for jobs. I don't know of a great solution other than being self-aware that you're falling into this trap, and chatting to Netflix staff about it.
As with our engineers, we try to hire the best recruiters in the industry. So if you only talk to one recruiter this year, make it a Netflix recruiter!
Good luck, and come say hi if you join (I'm in D2 near a window overlooking the basketball court). You might also see me at the upcoming SREcon16, where I'm giving a talk on performance checklists for SREs.