
Gene Kim’s keynote from Actifio’s Data Driven 2020 Conference, The Unicorn Project and the 5 Ideals, discussed application development performance and how organizations make the good-to-great digital transformation.
Hello everyone at the Data Driven Conference. My name is Gene Kim. I’ve been studying high-performing technology organizations since 1999. That was a journey that started back when I was the founder of a company called Tripwire, in the information security space. Our goal was always to understand these amazing organizations that simultaneously had the best delivery performance in development, the best stability and reliability in operations, as well as the best posture of security and compliance. We wanted to understand how these amazing organizations made their good-to-great transformation so that other organizations could replicate those amazing outcomes.
So you can imagine in a journey that spans 20 years, there were many surprises, but by far the biggest surprise was how it took me into the middle of the DevOps movement, which I think is urgent and important. I think the last time any industry was disrupted to the extent that our industry is being disrupted today was likely in manufacturing, when it was revolutionized through the application of Lean principles. And I think that is exactly what DevOps is. You take those same Lean principles, apply them to the technology value stream that we work with every day, and you end up with these amazing patterns that allow organizations to do tens, hundreds, or even hundreds of thousands of deployments per day while retaining world class reliability and security. That’s something that I never would have thought possible 10 years ago.
I’ve learned so much since The Phoenix Project came out in 2013. When we wrote The DevOps Handbook, we started talking about our definition of DevOps. It is specifically the architectural practices, technical practices and cultural norms that enable us to increase our ability to deliver applications and services quickly and safely. This enables rapid experimentation and innovation, the fastest possible delivery of value to our customers while ensuring world class security, reliability and stability. And we care about that because that allows us to win in the marketplace.
I love that definition because it doesn’t actually say what DevOps is, but it describes the outcomes we aspire to. But there’s a definition that I like even more than that, and it doesn’t come from me. It comes from my friend, Jon Smart, a partner in Enterprise Agility at Deloitte. He says DevOps is simply this: It’s better value sooner, safer and happier. I love this definition because it is as accurate as the definition I gave you, and it’s so difficult to argue against because who wouldn’t want better value sooner, safer and happier?
Maybe just to set the context of why I think this is so important. What we tried to describe in The Phoenix Project was this downward spiral that exists in every organization. If you think about the story of technical debt, the image that evokes it better than any other I can think of is this. It is the accumulation of all the crap that we put into our data centers, our code, applications and configurations that slows us down. The definition of technical debt that Ward Cunningham came up with was that it is the phenomenon where what we do now makes it difficult to make changes later. So this is bad, but not as bad as what it becomes, which is this.
And so, when this happens, everything becomes harder. It leads us to a sense of powerlessness, a sense of preordained failure. It doesn’t matter whether we’re in operations, whether we are in development, whether we are a product manager in the business, whether we are information security or compliance, and it especially affects us if we are in data engineering.
And so, what I wanted to work on for the last three years were the problems that I think still remain. First is the absence of all the invisible structures required to fully enable developer productivity.
Second is a problem orthogonal to the one DevOps worked on. The DevOps community rightly pointed out that it was so difficult to get code to where it needed to go, which is in production so that customers get value.
There’s this even bigger problem around data. So often data is stuck in systems of record and it takes weeks, months or even quarters to get it to where it needs to go, which is in the hands of developers and people in the business units so that they can make better decisions. What is so problematic is that when you do manage to make the change, you end up breaking thousands of reports in production, so that’s not so good.
Third, there’s often very strong opposition to these newer, better ways of working. There’s ambiguity about what behaviors we need from our leaders to support such a transformation. These are the problems I wanted to explore in The Unicorn Project.
In The Phoenix Project we had the Three Ways. We had the four types of work. In The Unicorn Project I use the Five Ideals, which describe very clearly the problem areas that I think are so important.
The first ideal is locality and simplicity. The second ideal is focus, flow and joy. The third ideal is improvement of daily work. The fourth ideal is psychological safety. And the fifth is customer focus.
What I’ll do in the remainder of this presentation is go through each one of these ideals and show you what ideal looks like, and what not so ideal looks like. But before I do that, I just want to take a moment to describe one of the biggest aha moments I’ve had. Something I wish I had learned before The Phoenix Project came out.
This was based on the research that I have done with Dr. Nicole Forsgren and Jez Humble since 2013. This is The State of DevOps report. Over those years, we’ve surveyed over 35,000 respondents, really trying to understand what high performers look like, and what behaviors enable great performance. Decisively, every year, what we’ve found is that the business value of DevOps is even higher than we thought.
For six years in a row, we found that high performers exist and they massively outperform their non-high performing peers. We know they’re doing deployments more frequently. So high performers, they do multiple deployments per day, which is 200 times more frequently than their peers. We know that when they do a deployment, they can complete those deployments in one hour or less.
In other words, how quickly can we go from changes put into version control, through integration, through testing, through deployment, until they are actually running in production? They can do it in one hour or less. That’s 100 times faster than their peers.
But they’re getting far better outcomes as well. When they do a deployment, they’re eight times more likely to have it succeed without causing a service outage, service impairment, a security breach or a compliance failure. And when something goes wrong, which Murphy’s Law guarantees will happen, the high performers can fix it in one hour or less. That’s three orders of magnitude faster than their peers.
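As a rough illustration of the four measures just described (deployment frequency, deployment lead time, deployment success and time to restore), here is a minimal Python sketch. The Deployment record, its field names and the sample timestamps are all hypothetical; they are assumptions for illustration, not data or definitions taken from The State of DevOps report.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass(frozen=True)
class Deployment:
    """One production deployment (hypothetical record layout)."""
    committed_at: datetime                  # change entered version control
    deployed_at: datetime                   # change running in production
    failed: bool = False                    # caused an outage or impairment
    restored_at: Optional[datetime] = None  # service restored, if it failed


def deploys_per_day(deploys, days):
    """Deployment frequency over an observation window of `days` days."""
    return len(deploys) / days


def median_lead_time(deploys):
    """Median time from version control to running in production."""
    return median(d.deployed_at - d.committed_at for d in deploys)


def change_failure_rate(deploys):
    """Fraction of deployments that caused a failure."""
    return sum(d.failed for d in deploys) / len(deploys)


def median_time_to_restore(deploys):
    """Median time from a failed deployment to restored service, or None."""
    durations = [
        d.restored_at - d.deployed_at
        for d in deploys
        if d.failed and d.restored_at is not None
    ]
    return median(durations) if durations else None


# Made-up sample: four deployments in one working day, one of which failed.
base = datetime(2020, 1, 6, 9, 0)
deploys = [
    Deployment(base, base + timedelta(minutes=30)),
    Deployment(base + timedelta(hours=2),
               base + timedelta(hours=2, minutes=50),
               failed=True,
               restored_at=base + timedelta(hours=3, minutes=20)),
    Deployment(base + timedelta(hours=4), base + timedelta(hours=4, minutes=40)),
    Deployment(base + timedelta(hours=6), base + timedelta(hours=6, minutes=20)),
]

print(deploys_per_day(deploys, days=1))
print(median_lead_time(deploys))
print(change_failure_rate(deploys))
print(median_time_to_restore(deploys))
```

The median is used here only because it is robust to a single slow deployment; the report's own survey methodology is not reproduced in this sketch.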
What we find, again, year over year, is that high performers massively outperform their non-high performing peers. And so, over the years we’ve looked at other dimensions of performance. We know that high performers, because they’re integrating information security objectives into everyone’s daily work, are spending half the amount of time remediating security issues.
But another dimension that is even more exciting to me is that it’s not just about IT performance or software delivery performance. It is about organizational performance. We know that high performers are also twice as likely to exceed profitability, market share and productivity goals. And for not-for-profits, it is the same multiple of performance. They’re twice as likely to achieve organizational and mission goals regardless of how they measure them, whether it’s customer satisfaction, quantity or quality.
And so, for me, this really points to something very simple: if achieving the mission requires technology, DevOps principles and patterns help achieve it.
There are other ways to measure organizational performance. In high performers, employees are twice as likely to recommend their organization as a great place to work to their friends and family. There’s so much research out there that shows that employee Net Promoter Score is highly correlated with growth and even profitability.
And so, beyond the numbers, what does this mean for me? To me, it really paints what the opposite of technical debt is. To me it says, “To what degree can we safely, quickly, reliably and securely achieve all the goals, dreams and aspirations of the businesses that we serve?” I think that is a big part of why we’re in this industry to begin with.
I had mentioned The Five Ideals. What I’d like to do next is go through each one of them and describe why I think they are so important.
The first ideal is locality and simplicity. I think one of the best case studies that demonstrates the problems around locality and simplicity is the birth and death of a technology called Sprouter at Etsy.
This is a story about how engineers work together to create value for customers. In 2008, these are the dark days of Etsy. This is when engineers said, “I will not … The holiday season went so poorly that I will probably not be around for the next holiday season.” But even they knew that they had this big problem.
The problem was that in order to create value for customers, two teams would have to work together. The Devs would work in the front end, so in their case, it was PHP. And the DBAs would work on the backend, inside of stored procedures, inside of PostgreSQL. That means that in order to ship a feature these two teams would have to communicate, coordinate, schedule, prioritize, and potentially deconflict and sequence the work.
And so, their goal was to enable these teams to work more independently. And so they created something called Sprouter. The vision was to allow these two teams to work independently and meet in the middle inside of Sprouter. Sprouter stands for stored procedure router. And as Ian Malpass, one of the senior engineers, said, “This would require a degree of synchronization and coordination that was rarely achieved to the point where almost every deployment became a mini outage.” If you are doing 30 or 50 deployments per day, this is a real problem.
And so, as part of the great engineering rebirth of Etsy, one of the many things they wanted to do was to kill Sprouter. They wanted to do this by enabling Devs to put all their changes in the front end without any reliance on other teams on the backend.
What they found was that in every part of the property where they eliminated Sprouter, suddenly, code deployment lead times went way down and the quality outcomes went way up. What I find so marvelous about this is that it is probably one of the best demonstrations of Conway’s Law. It’s not enough to move teams around on an org chart. We must have a software architecture that is congruent to it.
In other words, when we went from two teams to three teams, code deployment lead times went up and quality outcomes went way down. As we went from three teams to one team, where there was no need to communicate whatsoever, then the quality outcomes went way up, and the lead times went down. That’s amazing because it’s only three, two or one teams.
But in most complex organizations, we’re not talking about three teams, we could be talking about scores of teams. The way we observe that is by looking at how we do deployments or movement of data. If we want to initiate a deployment, there might be scores of steps that we need to transit through. We have to create environments. We have to create the test data. We have to get the configurations and user accounts created. Then there is middleware, change approval boards, enterprise architecture boards, firewall rule changes. So it doesn’t take a lot to go wrong before you’re looking at code deployment lead times measured in weeks, months or even quarters.
And so, one of the biggest surprises to me in the State of DevOps journey is that architecture is one of the top predictors of performance as measured by these characteristics. To what extent can we make large scale changes to our parts of the system without permission from anyone else? To what degree can we complete our work without a lot of fine-grained communication and coordination with people outside of our team? To what degree can we deploy and release our service on demand, independent of other services we rely upon? I love this one. To what degree can we do testing on demand without the use of a scarce integrated test environment, of which there are never enough, so we’re always waiting in line?
This is a point of tremendous coupling. If we can do all those things, then certainly we can do deployments during normal business hours with negligible downtime.
At one point in my career, I would’ve said, “Architecture never really impacted how daily work is performed.” And yet, I have now changed my mind: there’s nothing that dictates how we do our daily work more than architecture.
Let’s talk about measures. In The Phoenix Project, I think the best measure that describes both is the bus factor. In other words, how many people need to be hit by a bus before the project, service or organization is in grave jeopardy? The bus factor in The Phoenix Project was one. It was Brent. No major piece of work could get done without Brent. No outage could be fixed without Brent. And so ideally, you want a bus factor that is much larger. We shouldn’t be reliant on one person like that. We should be reliant upon a team. Or better yet, a team of teams.
So the corresponding metric in The Unicorn Project is the lunch factor. In other words, to get something major done, how many people do you need to take out to lunch? Is it the Amazon ideal of the two pizza team, where each two pizza team is able to work independently to create value for customers? Or do you need to feed everyone in the building? Or worse yet, if you are reliant on lots of other groups, you have to take 43 different people out to lunch, most of whom have never heard of you. So ideally, you want a lunch factor that is very, very low. Ideally, constrained to that two pizza team.
So the ideal is that if we need to get something done for our customers, we can do what we need to do in one file, one module, one application, one container, and make all the needed changes there.
Not ideal is that to make any needed change, you have to understand all the files, all the modules, all the data sources. That speaks to locality of changes. In the ideal, the changes we make can be independently implemented and tested, isolated from all the other components. That’s the notion of composability.
Not ideal is that in order for us to get any assurance our changes will actually work, we have to test them in the presence of every other component. That often pulls us into these integrated test environments which, as we talked about, are a point of tremendous coupling.
So it’s not about just locality of changes, it’s about how we make decisions. In the ideal, every team has the expertise, capability and authority to do what our customers want. Not ideal is that in order to do anything that the customer needs, we have to go up two, over two, and then down two.
A visual depiction of this is this. Right? In order to make a decision, we have to go up two, over two and down two. This is what some of my friends call The Square. Right? They would consider themselves lucky if they don’t have to do the return path as well to get two engineers to talk together to solve a customer problem.
This goes to one of two books that are some of my favorites over the last 10 years. One of these is Team of Teams by General Stanley McChrystal, Chris Fussell and Dave Silverman. This is the amazing story of the Joint Special Operations Task Force in Iraq, where they were battling a far smaller, nimbler adversary in 2004. It is this amazing story of how they were able to eventually defeat the enemy by pushing decision making to the edges. Moving from a batch system to a pull system. I think it’s one of the most amazing books about what new leadership looks like and about the nature of work. I would recommend this to anybody, especially in technology.
The last thing I want to mention on the first ideal is about data. In the ideal, every team has access to the data they need, on demand, quickly, accurately and securely. Not ideal is that in order to get the data I need, I have to wait months for the integration to get set up, for the data transfers to occur, and you just hope and pray that when you finally get it, you don’t break every other report in the organization.
I think this is so important because these days, in most organizations, 30-50% of employees use data or manipulate data as part of their daily work. That’s even a larger population than the software development community. This is the basis of my claim. The data side of the equation might be even more important than the code and DevOps side of the equation.
One of my favorite scenes in The Unicorn Project is in the famous Phoenix Project release, where all the prices disappear from the mobile app and the e-commerce site. This is caused by a byte order mark in the pricing data file. It just shows how dangerous it is when we transport data across the organization. This is something that actually happened to me in The State of DevOps report. So as many say, data is not only the new oil, data is the new soil. So the first ideal is all about locality and simplicity.
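To make that failure mode concrete, here is a small Python sketch of how a UTF-8 byte order mark at the start of a data file can silently break a lookup. The file layout (a CSV with sku and price columns), the function names and the sample values are all hypothetical; the book does not spell out the file format, so treat this as one plausible way such a bug could arise rather than what happened in the story.

```python
import codecs
import csv
import io


def read_prices_naive(raw: bytes) -> dict:
    """Decode as plain UTF-8; a leading BOM stays glued to the first header."""
    text = raw.decode("utf-8")
    reader = csv.DictReader(io.StringIO(text))
    # With a BOM present, the first column is named "\ufeffsku", not "sku",
    # so row.get("sku") returns None for every row.
    return {row.get("sku"): row["price"] for row in reader}


def read_prices(raw: bytes) -> dict:
    """Decode with "utf-8-sig", which drops a leading BOM if there is one."""
    text = raw.decode("utf-8-sig")
    reader = csv.DictReader(io.StringIO(text))
    return {row["sku"]: row["price"] for row in reader}


# Hypothetical pricing file written by a tool that prepends a BOM.
pricing_file = codecs.BOM_UTF8 + b"sku,price\nA100,9.99\nB200,4.50\n"

print(read_prices_naive(pricing_file).get("A100"))  # lookup by SKU fails
print(read_prices(pricing_file).get("A100"))        # lookup by SKU works
```

The two functions differ only in the codec name, which is part of why this class of bug is so easy to ship: nothing raises an error, the prices are simply no longer found.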
The second ideal is what I think the outcomes are when we do it right, which is focus, flow and joy. So much of this is informed by the Clojure functional programming language. And so, to put this into context, for decades I self-identified as an Ops person, despite getting my graduate degree in computer science with a focus on compiler design. I think that was always motivated by my observation that it was Ops where the saves were made. It was Ops who saved us from terrible developers who didn’t care about quality. It was Ops who protected our data from adversaries because it certainly wasn’t security.
And yet four years ago, learning Clojure, the hardest thing I ever had to learn, reintroduced the joy of programming into my daily life. And so, I think my main lesson here is that development is not only so fun, but you can build so many things with so little effort these days.
The famous French philosopher, Claude Levi-Strauss, would say of certain tools, “Is it a good tool to think with?” I think functional programming, especially as it relates to data, shows that there are better tools to think with.
One of the core things is immutability, the notion that you should not be mutating data in place. This was pioneered in the programming languages community, in LISP, ML, Haskell, Clojure, and so many others. But this shows up in all our work these days, especially in infrastructure and operations.
If you look at Docker, that is fundamentally all about immutability. You can create a container, but you can’t change it, or at least, it won’t persist. You really need to create a new one.
Kubernetes shows that not just at a component level but at the system level, or the system of systems level. What’s interesting to me is that even things like Apache Kafka, which so many data pipelines rely upon, are based on an immutable data model built around event sourcing. Now the things around us are facts. You can’t change facts. And so, we subscribe to topics to be notified of them.
Version control is fundamentally about immutability. That’s why you get yelled at if you overwrite the commit history. And so it shows that we’re using immutability to create a simpler view of the world, and allows us to build more complex things.
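A tiny sketch may help show what "facts you can't change" looks like in code. This is plain Python rather than Clojure or Kafka, and the PriceChanged event, the function names and the sample values are hypothetical, chosen only for illustration. Events are frozen records, the log is only ever extended into a new tuple, and the current state is derived by folding over the log instead of being edited in place.

```python
from dataclasses import dataclass
from functools import reduce


@dataclass(frozen=True)
class PriceChanged:
    """An immutable fact: at some point, this SKU's price became this value."""
    sku: str
    price_cents: int


def apply_event(state: dict, event: PriceChanged) -> dict:
    """Return a new state dict; the previous one is left untouched."""
    return {**state, event.sku: event.price_cents}


def current_prices(log: tuple) -> dict:
    """Derive the current view by replaying every event in order."""
    return reduce(apply_event, log, {})


# Appending creates a new log; the old log still describes the old world.
log_v1 = (PriceChanged("A100", 999),)
log_v2 = log_v1 + (PriceChanged("A100", 899), PriceChanged("B200", 450))

print(current_prices(log_v1))
print(current_prices(log_v2))
```

Because nothing is overwritten, any earlier state can be rebuilt by replaying a prefix of the log, which is the same property that makes rewriting commit history so disruptive.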
To me, the big lesson is in the ideal when we’re using these better tools to think with, all our best time and energy is focused on solving the business problem, and you’re having fun.
Not ideal is when all your time is spent trying to solve problems you don’t even want to solve, like writing YAML configuration files, or trying to figure out how to escape filenames in Makefiles, or writing Bash scripts.
I’ve got to tell you, one of the biggest and strangest things to occur in this journey is that there are things that I used to enjoy maybe a decade ago that I now detest. I hate dealing with anything outside of my application. I hate connecting anything to anything else because it’ll take me a week. I hate updating dependencies. I really don’t understand, or am not very good at, secrets management. I don’t like Bash, YAML, patching. Kubernetes is mostly confusing to me and I still can’t write deployment files on my first try. I can’t figure out why my cloud costs are so high.
And so, I don’t mean to diminish any of this work. In fact, it is critical, especially around security. It just says that what used to be fun for me is actually distracting to me these days because I just want to solve the business problems that I want to solve.
I think this is why we are so reliant on, and get so much value out of, platforms. It is amazing to me how much we can do in these developer productivity platforms. We can get things like monitoring, deployment, environment creation, security scans, subscribing to data topics, all on demand, self-service, without getting permission from anyone else, without having to open up a ticket. These are conditions that allow us to work with a sense of immediacy and fast feedback, which allows us to have a sense of focus, flow and, I think, maybe even joy.
And so this is why I make the claim that there’s never been a better time to be in infrastructure and operations. The best days for infrastructure and operations are certainly ahead of us, not behind us.
My basis for this comes from the work of Dr. Mihaly Csikszentmihalyi. He gave, I think, one of the best TED talks of all time, called Flow, The Secret to Happiness. He says the state of flow is what we feel when we are so engrossed in work that we enjoy, and are getting such satisfaction out of it, that we lose track of time. We might even lose sense of self. Right? That transcendental experience when we’re truly doing the work that we love.
Flow is what platforms enable because they allow developers to just work on the business problem as opposed to having to search Google or Stack Overflow for how to write particularly difficult YAML configuration files.
So for focus, flow and joy, ideal is when you can implement and test whether the feature works on your Dev laptop, and learn whether it works within seconds. Not ideal is that the only way you can determine whether it’ll work is by going to production, which is terrible, or by going to pre-production and having to wait minutes, hours, days or even weeks. These are the things that destroy flow.
So the first ideal is locality and simplicity. Second ideal is all about focus, flow and joy.
The third ideal is all about improvement of daily work. The not ideal is TWWADI. In other words, “The way we’ve always done it.” What we really want is ideal, MTBTT. This is, “Make tomorrow better than today.” This is Google SRE Principle number two.
So improvement of daily work is something that showed up in The Phoenix Project as well. The notion that the improvement of daily work is even more important than daily work, itself.
Let’s go to an example of not ideal. In The DevOps Handbook, we wrote about the famous Fremont General Motors manufacturing plant. This is a notorious plant because for decades, it was not only the worst performing automotive plant in North America, it was among the worst performing plants around the world. There were so many documented problems, because there were no effective procedures in place to detect problems during the assembly process, nor were there explicit procedures on what to do when problems were found.
So as a result, there were instances of engines being put in backwards, cars missing steering wheels or tires, or cars having to be towed off the assembly line because they wouldn’t start. I think that’s a pretty persuasive case of not ideal.
What do we want instead? In the ideal, we’re putting as much feedback into our system, in as many areas of the system as possible, so that we can get feedback sooner, faster and cheaper, with as much clarity as possible between cause and effect.
Why do we do this? It’s because the more assumptions we can invalidate, the more we learn. And the more we learn, the more we are out-learning the competition.
At the heart of a body of knowledge called the learning organization is this notion that in order to win in the marketplace, you must out-learn the competition.
One of the most famous examples of this, without a doubt, is the famous Andon Cord in the Toyota Production System. In 2011, I got to spend a week at the University of Michigan. I came to learn about the Toyota Production System, including probably the most famous tool from that body of knowledge, which is the Andon Cord. It was amazing to see in the field portion of the training that plants modeled after the Toyota Production System have, on top of every work center, a cord that everyone is trained to pull when something goes wrong.
So if I create a defective part, I pull the cord. If I get a defective part from someone else, I pull the cord. If I don’t have any parts to work on, I pull the cord. And even if my work takes longer than documented, so if it was supposed to take 55 seconds, but it took 1:20, I pull the cord.
So everyone knows now, what happens when you pull the cord, is that the entire assembly line stops, if it can’t be resolved within 55 seconds. What I didn’t know was how many times in an average day the Andon Cord is pulled in a Toyota plant.
And so, in your head just think, how many times is the Andon Cord pulled in a typical plant? The answer is this, 3,500 times a day.
My first reaction was, “That’s impossible.” Right? What sort of idiot would pull the cord 3,500 times a day? There are two answers to that. One is that if you don’t fix the system then and there, you’re going to allow technical debt to accrue downstream where it’ll become a lot more expensive or maybe even impossible to fix.
But there is an element to the answer that I think is even more profound, which is that if we don’t put in the systemic fix then and there, we’re going to have the same problem 55 seconds later.
That’s the nature of the daily workaround. Daily workarounds exist in technology work as well, but because our work typically takes longer than 55 seconds, it’s far less visible. But trust me, it is just as destructive as it is in the manufacturing domain. Which gets to my point, is that greatness is never free. We always have to pay down technical debt.
And so, I’m going to share with you the story of where technical debt comes from and why it’s so important to reduce it or eliminate it using only up and down arrows.
Someone told me 20 years ago that when dealing with executives, it was small numbers and primary colors. What I found in my journey is that when you reach a certain level of seniority, that is not sufficient. You must stick with something even simpler, which is up or down.
Here we go. Imagine a situation in your career where you’ve had to get to market. This is when you have to put the pedal to the metal to just work on features. And so maybe this is to be first to market, or maybe it’s because you need to get parity in the marketplace. And so, when you do this, this is when you take shortcuts, and this is when you build up technical debt and risks. When that happens, this is what drives down quality, which drives up the number of defects. The story doesn’t end there.
If we fast forward the clock, what happens is that our feature velocity goes down. The reason is that the number of defects keeps going up, maybe even to the point where we are spending all our time on defects. This is when defects dominate daily work. This is when site reliability tanks. This is when we go slower and slower. Customers leave. Morale plunges and engineers leave because everything is now so hard.
And so, who hasn’t felt this? A friend of mine, he tweeted this at me. He said, “Exactly. I have felt this.” In 2015, a certain reference feature would take 15-30 days. Three years later the same class of feature takes 10 times longer. And so, if you’ve ever felt like this is happening to you, it’s because this is a cause and effect thing we can explain. And so, this even happens if you add more developers, and you have this feeling that you’re going slower and slower.
One of the best examples of this, one that shows the lethality of technical debt, comes from Risto Siilasmaa, the Chairman of Nokia. This is the second book that I recommend to everybody. He wrote this amazing book that’s based on his journey of joining the Nokia board in 2008. He described this amazing story of what it was like to be on the board of a company with a very dominating chairman. There’s this unflinching view of his own effectiveness, and it’s an amazing story about data as well.
But there was one phrase in the book that I thought was profound. He described a scene where, as a board director of Nokia, he learned that generating a Symbian build would take 48 hours. He said it felt like being hit in the head with a sledgehammer, because he knew that if it took two days for an engineer to learn whether their feature worked or would have to be redone, then the entire platform they were building upon was an illusion, and all their promises and aspirations were impossible.
This is what actually drove them to Windows Mobile, which actually didn’t treat them so well either. But that was actually a far better bet than staying on the Symbian OS. It is an amazing story.
Nokia eventually went from a $98 billion market cap to $7 billion, so that didn’t treat them so well.
But let’s look at the ones that did survive it. eBay, Microsoft, Google, Amazon, Twitter, LinkedIn, Etsy. These are all companies that were nearly killed by technical debt but were able to overcome it.
One of the most famous examples of this is the Microsoft security stand-down. In 2002, just following the summer of worms, SQL Slammer, Nimda and Code Red were threatening the very survival of Microsoft, which led to the famous memo that Bill Gates wrote, called the Trustworthy Computing memo, where he said, among many other things, how important security was. If any developer had to choose between security and a feature, he said, work on security because the survival of the organization depends upon it.
And so, this is a pattern that you see in every one of those organizations that I shared with you before. They took features down to zero, which allowed them to pay down technical debt, which allowed them to increase quality, which allowed them to drive defects down. Maybe not to zero, but down to something that is manageable on a daily basis, which led to an order of magnitude increase in their ability to ship features. This is what allowed those organizations to survive as opposed to Nokia, who didn’t.
It doesn’t come from just CEOs. It comes from the product community as well. Marty Cagan wrote a fantastic book called Inspired: How to Create Products Customers Love. He pioneered and trained generations of product owners on how to build features. He said, “You must reserve 20% of all Dev and Ops cycles to pay down technical debt.”
Why does he believe this? Is because he spent two years as VP of Product Management at Ebay. He said, “Because of technical debt, I couldn’t ship a major feature in two years. So if you don’t pay your 20% tax, the inevitable outcome is that you will pay 100% tax, where you will be building no user visible features at all.”
For the third ideal, ideal is that we have 3-5% of our developers, our best developers, dedicated to improving developer productivity. Google famously has 1,500 developers working on dev productivity. That’s a spend of $1 billion a year. Microsoft is probably spending two to three times as much, over 3,000 developers. That’s ideal.
Not ideal is that the only people working on dev productivity and data pipelines are the summer interns and people not good enough to be developers.
In what I think is this amazing story of karmic continuance, Satya Nadella, the current CEO of Microsoft, said in a town hall meeting last year, “If any developer has to choose between working on a feature or working on dev productivity, work on dev productivity, because dev productivity is how we use interest to our benefit.” We’re using compounding interest to our benefit, whereas technical debt uses it to our detriment. That is improvement of daily work being even more important than daily work itself.
Ideal number four is psychological safety. This is addressed in The State of DevOps report. But one of the things I’ve learned is that psychological safety is being brilliantly and heroically pioneered by the DevOps Enterprise Summit community.
Since 2014, my area of passion has been studying DevOps, not in the unicorns, the Facebooks, Amazons, Netflixes and Googles, but instead in the horses: large, complex organizations, the largest brands across every industry vertical, that have been around for decades or maybe even centuries.
What has been amazing to me is that over the years we’ve collected, God, now probably over 400 case studies of organizations across almost every industry vertical, showing that DevOps is not only possible, but is being used to win and even survive in the marketplace.
I’m going to share with you a couple of stories that I find particularly meaningful, especially around the use of data.
One is the story from Heather Mickman, Senior Director of Development at Target. The next is the work that she did when she became the VP of Platform Development at Optum, the world’s largest health care company. And then Adidas, one of the largest athletic brands.
The amazing story about Heather Mickman at Target was the business problem she set out to solve in 2014, which was this. In order for any developer to get access to system-of-record data, they would often have to wait six to nine months for the integration teams to create that point-to-point connection. It meant that they were always waiting.
What was so amazing was that they created something called the API enablement project, where they copied and synchronized all the data into a next-generation system of record in a NoSQL database. Any development team could add, change and even remove data in this next-generation system of record, on demand, through versioned APIs.
What is astounding is that over the years, hundreds of initiatives, whether it was an in-store app or ship-to-store, were all enabled by this data enablement project.
And so in 2019, she talked about how she’s doing this again at Optum, the world’s largest health care provider and insurance company, showing that this matters not just in the retail space but also for the ability to deliver quality health care to citizens all around the world.
It’s not just retail and health care, it’s also in manufacturing. Earlier this year, Daniel Eichten, VP of Enterprise Architecture, and Fernando Cornago, VP of Platforms, described how they were doing the same thing at Adidas to enable manufacturing and their e-commerce channels, going directly to the consumer to be able to compete. Their e-commerce operation is billions of dollars a year, growing at somewhere between 20 and 30% a year. It just shows how important data is to survive and compete in the marketplace.
But the reason why I bring this up is this astounding dimension of quality that Heather Mickman showed. I got to follow Heather Mickman around at Target for three days. This was in 2015. I saw many amazing things, but one of the most amazing was this certificate that hung on her desk. It looked like it was printed on an inkjet printer with Print Shop Pro. The modern version of Print Shop Pro is PowerPoint. It said, “Lifetime Achievement Award presented to Heather O’Sullivan Mickman for annihilating TEP and LARB.”
That obviously begged the question, “What are TEP and LARB?” She told me TEP stood for the Technology Evaluation Process, and LARB stood for the Lead Architecture Review Board.
Why was that a problem, and why did she get an award for annihilating it? She said, “Whenever you would want to do something new, you would have to fill out the TEP form, which eventually would earn you the right to pitch at the LARB meeting. You walk into this room and there are all the lead architects there. Development architects and enterprise architects on one table, the ops and security architects on the other table. Then you get peppered with questions. They’ll start arguing with each other. And then they would say, ‘Come back next month, and here are 30 more questions for you to answer.’”
Her reaction was, “No one in my engineering organization should go through this process. None of the 2,000+ engineers at Target should have to go through this. In fact, why do we even have this process at all?” She said, “No one could really remember. There were some vague memories about some terrible, unspeakable disaster that happened decades ago, but what exactly that disaster was has been lost in the mists of time.” And yet, the process still remained.
And so, due to her endless and relentless lobbying, they did dismantle the TEP and LARB, earning her this incredible certificate that hung proudly on her wall for years and years. This is such a demonstration of what I think is required from leadership. It’s not about commanding, controlling and directing. It really is about guiding and enabling.
This is something that we tested in The State of DevOps report. We asked 15 questions across five axes. Vision: to what degree does the leader understand the vision of the organization? To what extent can they get in front of that? Not just to be relevant, but to help with the achievement of the large problems facing the organization.
Intellectual stimulation. To what degree does the leader challenge basic assumptions of how we do work? In other words, just because we did something 20 years ago doesn’t mean that we need to be doing it the same way now.
Inspirational communication. To what degree does a leader inspire pride in this initiative? To what degree can they overcome fears, especially in functional silos, and create coalitions around them that can help overcome these powerful, ancient orders?
Supportive leadership. Personal recognition. These are the things that come from servant leadership.
We found that these are the factors that double the likelihood of organizations being high performers.
And so, just to put this into perspective, the whole Unicorn Project was essentially a testament to all the amazing work being done by the DevOps enterprise community.
And so, if you follow The State of DevOps report, you will recognize some elements of this. This is the organizational typology model by Dr. Ron Westrom, who 20 years ago discovered that in health care organizations, patient safety was highly correlated with culture. The organizations with the worst patient outcomes had these characteristics.
Information is hidden. Messengers of bad news are “shot”. Bridging between teams is discouraged. Failures are covered up. New ideas are crushed.
Right in the middle are bureaucratic cultures. But the organizations with the best patient outcomes had these generative qualities.
We seek information. We train messengers to tell bad news. Just like in modern engineering organizations, they train engineers to participate in or lead blameless post-mortems. We share responsibility because we know that security’s not just InfoSec’s job, it’s everybody’s job. Just like availability is not just Ops’s job. Right? It is everybody’s job. This requires us to bridge between teams. And when failures happen, it causes a genuine sense of inquiry, and new ideas are welcomed.
In the research for The Unicorn Project, it was so fun to revisit the work that was done at Google in Project Aristotle and in Project Oxygen, which all got rebranded into an initiative called Project ReWork.
At the heart of this quest was a multiyear study where they wanted to understand what made the best teams great. They found that the top factor was psychological safety, as measured by: to what degree do members on a team feel safe to say what they really think, to point out problems without fear of being ridiculed or punished? That was an even stronger predictor of performance than dependability, structure and clarity, meaning of work, and impact of work. It just shows that psychological safety is so important.
So in the DevOps community we love talking about things like blameless post-mortems or chaos monkeys, where we randomly inject faults into production environments. But none of that is possible if we don’t have psychological safety.
Ideal number one was locality and simplicity. Second ideal was focus, flow and joy. Third ideal is improvement of daily work. Fourth ideal is psychological safety. And that gets us to customer focus.
Just to put this into perspective, I want to share probably one of the best demonstrations of this I’ve seen in my career.
I got to spend a day with the CEO of Compuware, the famously resurgent mainframe vendor. I went with my friend, Dr. Mik Kersten. As we’re walking to their offices, I look on the agenda and I see that the first item up is a data center tour. I immediately felt embarrassed. I told Mik, “I’m sorry. I don’t know what we’re getting into here. I’m not sure what we’re going to learn by looking at halon extinguishers.”
And yet, what I saw was such an aha moment. When you walked into the data center, you saw probably 50,000 square feet, but it was mostly empty. There were two Z mainframes, because they are a mainframe software vendor. But what you saw on the data center floor were these outlines, like in a murder scene, of where the data center racks used to be. In each one of the outlines was a tombstone describing the application and the business process that used to run there. Things like their old ERP system, their MIA financials, their HR systems, email. There’s a sign that probably says, “Over 14 tons of obsolete equipment removed.”
And so what I think is so amazing about this is that it showed the best example of carving out context to feed core.
In other words, Dr. Geoffrey Moore talks about Core vs Context. Core is what creates durable, lasting business advantage that customers are willing to pay for. Context is everything else.
And so, that empty data center represented $8 million of operational spend that could then be reallocated to R&D. So payroll, ERP, that’s very important, but customers don’t care. Right? Customers don’t pay a premium because someone’s payroll services are world class. They are willing to pay for the greatness that R&D creates.
So the biggest danger that Dr. Geoffrey Moore talks about is context starving core. Here is such a great example of shrinking down context to enable core.
Not ideal is that functional silo managers prioritize silo goals over the grander business goals. Whereas in the ideal, everyone is looking at the work that they do, having unflinching conversations about, “Does this really create lasting, durable business advantage, or is this context?” And if it’s context, is this work that we should be doing at all, or is this something we should be relying on an external vendor for, because it’s their core competency and it shouldn’t be ours?
Why do I think this is important? Because we are entering an era in this age of software and data where technical practices and how we use data to business advantage are without doubt what will create winners and losers in the marketplace.
I love this quote. “The world is changing very fast. Big will not beat small anymore. Instead it is the fast beating the slow.” That is why I think The Five Ideals are so important in terms of how we manage our organizations.
Thank you so much for allowing me to be a part of your day at The Data Driven Conference, and thank you to our Actifio hosts. If you’re interested in anything I presented in this session, work with the team from Actifio. They can give you all of the presentation materials.
If you’re interested in getting any of the other materials like the talks from DevOps Enterprise and all the excerpts of the books I’ve written, just simply send an email to realgenekim@SendYourSlides.com, subject line, devops. You’ll get an automated response within a minute or two.
Thank you so much for allowing me to be a part of your day. I’m very much looking forward to the session with the Actifio team tomorrow. Thank you and see you then.
Gene Kim is a multiple award-winning CTO, researcher and author, and has been studying high-performing technology organizations since 1999. He was founder and CTO of Tripwire for 13 years. He has written six books, including The Unicorn Project (2019), The Phoenix Project (2013), The DevOps Handbook (2016), the Shingo Publication Award winning Accelerate (2018), and The Visible Ops Handbook (2004-2006) series. Since 2014, he has been the founder and organizer of the DevOps Enterprise Summit, studying the technology transformations of large, complex organizations.
In 2007, ComputerWorld added Gene to the “40 Innovative IT People to Watch Under the Age of 40” list, and he was named a Computer Science Outstanding Alumnus by Purdue University for achievement and leadership in the profession.
He lives in Portland, OR, with his wife and family.