Stripe’s Will Larson on engineering and infrastructure management

Former Engineering Manager, Intercom

March 15, 2018

As a startup scales, the importance of infrastructure engineers simply can't be overstated. They're the ones making sure your app is secure, that uptime looks good, and that the rest of your engineering org has the right tools to build features your users need and want.

Will Larson has managed infrastructure teams for some of the biggest names in software. Today he’s leading Foundation Engineering at Stripe. Partnering with the Infrastructure, Data and Developer Productivity teams, his group builds the tools that support every Stripe engineer and keep Stripe reliable and performant.

Previously, Will was an engineering leader at Digg and then at Uber, where he scaled infrastructure engineering from a small team to more than 70. All the while, he’s been sharing his latest thoughts on infrastructure, devtools, management and more, on his Irrational Exuberance blog.

I hosted Will on our podcast, where we talked about how he keeps his team innovating, hiring generalists vs. specialists, managing in a high growth environment and more. If you enjoy the conversation, check out more episodes of our podcast. You can subscribe to it on iTunes or grab the RSS feed in your player of choice.

What follows is a lightly edited transcript of our chat.

Todd Royal: Will, welcome to Inside Intercom. Can you just give us a feel for what you’ve been doing in your career to date, as well as the work you’re doing today at Stripe?

Will Larson: When I graduated from college I was actually teaching English in Japan for a year, and that’s when I started doing a lot of tech blogging. One of the blog posts was about this Yahoo! product called Yahoo! BOSS, a build-your-own-search service, and they contacted me. There I was teaching English – and although I had studied computer science, I had no experience – and somehow this blog post turned into a job at Yahoo! It was a miracle. I was at Yahoo! for a couple of years, learned a ton there, and then joined Digg. I always describe Digg as Reddit but better. It went bankrupt and the product failed, so there’s no actual evidence that it’s better in any way, but in my heart it certainly is a superior product.

One of the exciting things about a company that is going through a little bit of change, a little bit of chaos, is there is a lot of opportunity, and for me that opportunity was getting to move into management. I stayed at Digg until the company closed and through this talent acquisition where we joined SocialCode. After a few years there I started thinking about what I wanted from my career. Until then, it had been very haphazard, and I thought about what I wanted to do next.

I wanted to work at a company that was growing very quickly. At that time, the fastest-growing company by some definition was Uber, and so I went. One of my friends who had worked at Digg had moved there, and I went to work on the infrastructure engineering team. That was just a transformational experience. I learned more in the two years I was there – going from 200 to 2,000 engineers – than in the entire rest of my career, really. But I also got a little bit worn out. Such chaotic growth was a lot to be part of.

About two years ago I joined Stripe, where I’m working with the foundation engineering team. That’s our developer tooling, our data engineering, and our infrastructure engineering group.

The tenets of foundation engineering at Stripe

Todd: What’s the nature of the work you’re doing day to day with the foundation team?

Will: We have such a large remit that we work on quite a few different things, but there are core properties of infrastructure and core things we do for the company. We think about security as our north star. Stripe moves money, so security is important for us. It’s better for us to be down than it is for us to be up and insecure.

Secondly, companies depend heavily on us, so we spend a lot of time on the property of reliability. Third, a little bit unusually, we think about usability. Externally, our biggest customers are the merchants that depend on us, but internally we have more than 300 engineers who depend on the tooling, platform and infrastructure we provide. How can we make sure that we’re creating as much leverage as possible for these engineers who depend on us? We’re really thinking about building infrastructure as a product to serve them.

Finally, we think about latency. How can we make it fast? How can we make it responsive? As Stripe becomes an increasingly mature company, efficiency will get more important for us, but right now we’re very focused on making sure that we’re providing the very best possible experience to the merchants. Our internal operating efficiency in terms of our infrastructure spend, etc., is a little bit lower on our priority list.

Making the shift from maintenance to innovation

Todd: You wrote a great blog post about the struggle for infrastructure engineering teams to shift from working on maintenance and tech debt to delivering innovative features. Why do you think that is, and where do you see the tension between these two modes of working?

Will: That’s a really interesting and important question. When I first joined Uber, I was actually hired as a DevOps manager, which was kind of funny in the sense that I literally didn’t know what DevOps was. One of the things I found as I was desperately Googling “What is DevOps?” was a book called “The Phoenix Project”, written by Gene Kim. It’s a great introduction to DevOps and in particular the Kanban style of managing projects.

The core tenet of the book is that you only get value when you ship things. There’s this point in infrastructure management when you are a little bit on fire, when you have so much stuff you need to do. You have to get extremely myopic and extremely focused on just shipping things. There’s no value in doing work, there’s only value in finishing work. This is the core part of toil management of teams that are pretty underwater. You spend so much time making sure that you ship as many things as possible, and part of that means not spending any time on things that don’t add to shipping.

This gets really challenging on the flip side, when you come out of this firefighting mode. All of a sudden, you need to figure out what to work on. You need to go from just solving problems to actually selecting problems to solve. One of the challenges is that all of your muscles for learning from your users have atrophied, because you just don’t do it.

Think about something like product management. When you are working through toil there’s not a huge role for product management in infrastructure groups. Then you pop out and all of a sudden you need product managers, but you haven’t probably prioritized hiring them, or you haven’t built this relationship with them. Making that switch from being very focused on managing the completion of projects to user discovery, user validation and building these relationships can be a rocky time for infrastructure groups.

Todd: What you said about only getting value when you ship really resonates with me. At Intercom, one of our engineering principles is that shipping is our company’s heartbeat. There’s a lot of emphasis on making sure that we’re shipping regularly, and it can be a struggle.

To your point on product management, how does engineering at Stripe partner with product management? What are the unique pieces of your development process?

Will: Like most small companies, initially Stripe had this “everyone does everything”, generalist model. Back then engineers were doing all of the technical writing, all of the documentation, writing the code, doing the product management, and reaching out to the business partners and trying to establish partnerships. We had a great culture of the generalist, and that really worked for us for quite a while. One of the special things about Stripe is that so many of the early employees, especially those who are still there at Stripe in increasingly large roles, started out as real generalists, with an incredibly broad skill set.

When you get larger, sometimes it helps to have someone who’s done it before. It helps to have someone who’s not learning on the fly. We’ve moved more and more towards having specialists in certain high-leverage spots across the company. One of those is definitely product management.

Back to the point of us switching from toil to innovation and learning how to actually get real feedback: people don’t like to give negative feedback, even if your software or your tool doesn’t work well. People tend to say what they think you want to hear. Learning how to get people to tell you what they actually feel about your software is a skill that product managers are hugely valuable for.

In terms of Stripe’s product management, we’re adding more and more product managers and we’re defining their role more clearly. Historically, the engineering management role and the product management role had a lot of overlap.

There’s a classic challenge where very few infrastructure engineering groups have product management. They tend to rely on the senior engineers or the engineering managers. That’s a blessing and a curse. The blessing is that you are your own customer. You’ve used these tools, and you are building for people whose needs you understand. The curse is that sometimes being your own customer blinds you, particularly as you’ve been in infrastructure longer and longer as opposed to using the tools. It can be easy to get a little bit disconnected from what the users actually need.

This is definitely something that we’re thinking about. How can we use product management? Right now we’re really focused on building the skills of our existing engineers and existing engineering management in product management, but I suspect even in infrastructure we’ll have product management in the long term.

Hiring in a high growth environment

Todd: With the growth you’ve experienced at companies like Stripe and Uber obviously comes a lot of hiring. What are some of the things you’ve been doing at Stripe to stay productive and stay focused during that kind of growth?

Will: When I think about hyper-growth, companies growing this quickly, the most important thing is being very reality-based. What I mean by that is you can add folks to a team that is growing quickly, but it will actually slow the team down. In the long run that team will be more effective and will have more capacity, but in the short run it’s going to get slower. You have to be very honest about the trade-offs that you’re making in terms of maximizing for certain durations of time. Sometimes these trade-offs are really uncomfortable because there is no acceptable trade-off to make and you have to make a trade-off anyway. Be very consequent in your reasoning and in understanding the consequences.

That said, there are a few things that really matter, and the thing that is most important, more important than literally anything else, is onboarding. Stripe has spent a lot of time trying to make our onboarding process as effective as possible. We have this program called Dev Start, which is the onboarding bootcamp model, a little bit similar to what Facebook has done. We get [new hires] to ship a very simple commit on the first day. Also, in the first month they ship something meaningful, usually something that they can see as a user-facing thing. That’s incredibly powerful in terms of building a community of new hires, and as you grow quickly, maintaining a community is one of the very important things to do. It’s also powerful in terms of that rapid experience of getting a commit up, getting it deployed, maybe debugging it if something goes a little bit wrong and actually building competency with the day-to-day tools.

Todd: The other thing that comes with hiring is evaluating skill sets and deciding what things you want to focus on as a company for the future. What are the skills that you value most now among engineers in today’s market?

Will: I don’t think a lot about skills in the literal sense. There is a challenge here, which is that there are just so many different technologies. Think about the JavaScript ecosystem, which has so rapidly gone from jQuery to Bootstrap to Angular to React. Now there’s Elm out there. Suppose you had specialized in any one of these technologies. In JavaScript, it now feels like there’s an 18-month lifecycle for different frameworks. It’s evolving so quickly.

I do think there are three meta skills that are incredibly important. Those are problem selection, which means picking valuable problems; solution validation, which is actually making sure that your approach solves the problem; and then third, execution. Something that I think is incredibly important across all of those is the ability to communicate well and to be an effective collaborator in a larger community of developers around you. In terms of Stripe’s general theory around hiring generalists when possible and specialists when highly leveraged, I do love to get people who have general experience. But the things I find most important in the folks that we hire and the folks that I’ve worked with are problem selection, solution validation and then pure execution.

Todd: The fact that you’re talking about problem selection and solution validation suggests that you’re really looking for engineers who are problem-oriented and focused on outcomes. We’re that way too. Our software engineers are actually called product engineers internally for that reason. We want to find people who are focused on solving problems rather than just the technology.

Evaluating a jump into management

Todd: Last thing on growth: As you’re hiring and onboarding engineers at this rate, you obviously need management as well to jump in, corral the teams and lead people. From one engineering manager to another, what advice would you have for engineers who are thinking about making that leap into management?

Will: Career advice is always dangerous, because the devil is in the details. I think the most important thing is to move into management for the right reasons. Historically, at many companies there’s the sense that management is a promotion or that there’s different compensation. Increasingly at the top-tier companies in the Valley and I imagine elsewhere, it’s possible to go quite far in your technical career and get compensated equivalently to the management role.

One of the illusions of management is that you have a lot more control or you have a lot more authority or power. You have a lot more opportunities to make decisions for more folks as a manager, but also the constraints of being a good manager are quite challenging. You have to keep your team happy, you have to make sure your team is doing something they love, you have to keep your peers happy, you have to make sure that the product managers, designers or other engineering managers you’re working with are happy with what you’re doing, and you have to keep the company and your management chain happy. To actually do all of those is, to me, the hallmark of a wonderful manager, but it also doesn’t necessarily give you a ton of flexibility in the decisions you make. At times you can have more decisions to make but less actual control in making them. Going into management because you want to help is critical. Going into it for control or for compensation definitely doesn’t work.

One of the choices you do get to make is in what company and what environment you move into management. Picking a company that has a healthy management culture is so important. So much of what you will become as a manager is based on your first experience, and you as an experienced engineer will get a number of opportunities to move into management. I personally leapt on the first one that I got, and occasionally I regret that, because I came up in a bit of a firefighting, chaos management culture, and I’ve had to unlearn a lot of those lessons over time. Really try to find a company where you want to become like the managers there, and start your management career in that environment.

Todd: That’s an interesting way to think about it. Are there things that have stood out to you as being indicative of strong management culture or things that you would say engineers thinking about making that move could count as positive signs?

Will: Is there a management culture or are there many different types of management cultures? Something I’ve been thinking about a lot is that in fast-growing companies or very early companies, management is largely around change management. How do you solve more and more and more different problems? This is very much an evolutionary role where you’re always dealing with a new problem and you’re trying to get a reasonable, good enough solution in place and then move on to another problem. At more established or slightly slower-growing companies it’s much more of an optimization game. How do you make the team a little bit happier? How do you get a slightly stronger person in the door to fill in for someone who has left? How do you improve your relationship with someone a little bit more? How do you get a few more story points done in this sprint? It’s a very different feel.

A lot of the challenge that folks have in terms of first moving into management, but also moving between management roles, is not understanding that there are these two extremely different skill sets, and thinking they can take what they mastered at a more mature company in terms of making the folks who are there happy and iterating incrementally, and apply it to a place where they have to be extremely consequent about solving problems as sufficiently as possible and then moving on. They’re extremely different outlooks for success.

Speeding up your time to alignment

Todd: At the end of 2017 you published a Twitter thread of lessons learned, and one that really struck me reads, “In a scaling organization, the ability to be consistently aligned within and across teams is a marker of excellence. ‘Time to alignment’ is your reorg success metric.” What does successful time to alignment look like to you?

In a scaling organization, the ability to be consistently aligned within and across teams is a marker of excellence. “Time to alignment” is your reorg success metric.

— Will Larson (@Lethain) December 29, 2017

Will: Alignment is about having truly shared goals and everyone agreeing on those shared goals. As you change the team composition, as you move teams to different areas or refocus them on different problems, it damages trust a little bit, and you have to rebuild that trust. When I think about time to alignment, it’s how long does it take to rebuild that trust and for the teams to start operating effectively again. A litmus test here is the time it takes before you can comfortably ask a team that’s just been reorged or moved to do something they don’t want to do without being afraid that they’re going to revolt or get upset with you. How long does it take to actually build that relationship and that trust again after the last change, to be able to stretch them a little bit into doing something uncomfortable?

A lot of this is just communication. Communicating honestly and with some vulnerability. One of the hardest parts is that organizational changes are often solving a problem that no one really wants to talk about. Maybe there’s a manager who’s really struggling, but you can’t just say, “We’re doing a re-org because this person is struggling.” Instead you talk about doing a re-org to solve a proxy goal. That’s when you end up with communications that people read and that don’t resonate with them. It sounds fake, and it’s because you aren’t able to talk about the actual goals. Avoiding having to talk about proxy goals is quite important, but if you do have to pick a proxy goal – sometimes you really do – make sure you pick one that aligns as closely with your actual goal as possible, so that people don’t read this plan or this reasoning and think of the classic “They left to go spend time with their family,” where everyone’s thinking, “This is a lie.” You have to be careful about triggering that sensation in folks.

Todd: Are there ways in which you’ve seen that really break down in the past?

Will: The marker of re-orgs not going well is the time to next re-org. As a general rule of thumb, frequent re-orgs tend to imply that the org changes are not going super well. Thinking about user testing and solution validation, I find that you can do this with anything, including a re-org proposal. If you go to some of the impacted teams, find someone you can trust, and run it by them, they can tell you immediately if it actually resonates. Maybe this is not a marker specifically of what’s wrong, but if you actually go do some user validation with the users of this re-org, they can almost always tell you. Sometimes you have to communicate carefully to get around the proxy goal versus real goal element, but really, just ask people if it makes sense. If they say no, it probably doesn’t, and you should keep working on it.

Todd: You’re also an avid reader, and one of the things that you often do is share key takeaways from what you’ve been reading on your blog. What’s one book that you would recommend every engineer read, and why?

Will: The most interesting book I’ve read this year was “Good Strategy, Bad Strategy”, which is a great exploration of why sometimes when we try to solve problems, our strategies just don’t work. But really the book that I’ve been most influenced by is “Thinking in Systems: A Primer”, by Donella Meadows. It’s a book about solving problems, but more about understanding problems. In particular, humans tend to think about things causally, like the server is down because we pushed a piece of bad code, but sometimes problems are much more complicated. You upgrade a small dependency, which makes your API a little bit slower, which causes connections to start backing up in your proxy, which causes health checks to start flapping, which causes APIs to start failing 10 percent of the time. And it’s really thinking about that type of problem, where it’s not just A then B, but actually these interrelated events, that I think “Thinking in Systems” gave me a great set of tools to work on. It’s a book that I would recommend to anyone.
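
Editor’s aside, not part of the conversation: the chain of events Will describes can be sketched as a toy stock-and-flow model in the spirit of “Thinking in Systems”. The sketch below is only an illustration under invented assumptions. The function name `simulate_proxy` and every number in it (worker count, tick length, queue limit, latencies) are hypothetical and do not describe Stripe’s systems, and queue overflow stands in for the flapping health checks.

```python
def simulate_proxy(service_ms, ticks=50, arrivals_per_tick=100,
                   workers=10, tick_ms=100, queue_limit=200):
    """Return how many requests are dropped in each tick.

    Toy model with hypothetical numbers: a proxy queues requests in
    front of an API whose workers each take `service_ms` per request.
    """
    # Flow out of the queue: requests the API can finish in one tick.
    capacity = (workers * tick_ms) // service_ms
    queue = 0  # the "stock": requests waiting in the proxy
    dropped_per_tick = []
    for _ in range(ticks):
        queue += arrivals_per_tick          # inflow
        queue -= min(queue, capacity)       # outflow
        overflow = max(0, queue - queue_limit)
        queue -= overflow                   # a full queue rejects the excess
        dropped_per_tick.append(overflow)
    return dropped_per_tick


healthy = simulate_proxy(service_ms=10)   # before the dependency upgrade
degraded = simulate_proxy(service_ms=11)  # API is 10 percent slower

print(sum(healthy))        # 0
print(degraded.index(10))  # 20
print(degraded[-1])        # 10
```

In this toy run, a 10 percent slowdown shows no symptom at all for the first 20 ticks while the queue quietly fills, and then 10 of every 100 requests start failing. The delay between cause and visible effect is what makes this kind of problem hard to reason about as a simple “A then B”.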

Todd: Lastly, where can our listeners go to keep up with your writing, insights, and any more recommendations that you have?

Will: One of my great regrets in life is that when I was 13 I came up with this online handle, lethain, which I think is what was really cool to do then. Now I have this domain, Twitter alias and email address that is actually nonsensical. It’s not the cyber wizard level of nonsensical, but it still makes literally no sense. I write a lot on my blog at lethain.com. I call it Irrational Exuberance, which is I think a really interesting phenomenon from economics.

Todd: Thanks Will. It was really great talking to you and hopefully we can have you back sometime.

Will: Thank you so much. It was a pleasure to be here.

If you’re interested in continuously shipping and solving valuable problems in the process, come work with us at Intercom. We’re hiring!