Amazon recently completed a genuinely massive migration project from Oracle to AWS databases. This is an impressive accomplishment, demonstrating both technical prowess and sustained corporate willpower. The data model of a large system permeates the entire system, and over time becomes a source of great fragility. The broader industry refers to such projects as Digital Transformations. Outside of the technology industry, these tend to be changes to existing systems to accommodate new business models, often models enabled by B2B and B2C experiences on the internet. Connectivity touches everything these days: supply chain, shipping logistics, support and service, manufacturing, R&D, and of course, disruptive new models enabled by connected mobile apps.
The last project that I led was a digital transformation. Multiple existing, overlapping systems, many operating at scale, were causing a highly fragmented customer experience that negatively impacted not only sales and support, but also the internal view of the customer.
At least at the beginning of the project, there seemed to be broad agreement that converging and simplifying these systems would lead to real gains in customer satisfaction and sales.
My experience as a technology leader up to that point had been working on or designing internal systems-level platforms, or products offered for sale. I had a good track record, including at least one system of world-class scale, availability, and correctness.
How hard could converging a bunch of user space apps be?
Stunningly hard, as it turns out. I gained a lot of understanding and respect for what IT folks live day to day.
If you find yourself in this situation, consider:
Double and Triple Check Your Assumptions About Business Value
I think we got this one mostly right. To do it required a search across the company for the handful of folks with the history and a pulse on how things really work, and should work. Studying the existing systems, and listening carefully to the folks who understand them, helps, at least to understand how things work now. Despite working at the same company for 25 years, and being knowledgeable about the internals of many of our products, I discovered systems that had been trained over the years to serve subtle and devilishly complex business models. This should have been a red flag: Can we change all of these models? All at once? In what order? What will be required, training-wise? What if the changes negatively impact the business? Will they?
Get the Reference Data Model Right
Once you have reverse engineered the existing systems, it’s time to find the data. Sometimes it will be densely packed in a cluster of SQL databases. This is a gift, although it might not always seem so. Sometimes the data will have been copied (“denormalized”), and nobody is quite sure which copy is the master now, or how things stay in sync. Or if they do. If your system is old enough, there will be multiple brittle, schema-transforming connectors between islands of data, and attempts over time at re-convergence in the form of layers, which often contain copies of data.
The reference model is an exercise to identify the set of cohesive (“like”) and decoupled objects that map to real-world business and customer things, virtual or physical, from the viewpoint of: how would we build this now, with perfect knowledge of the requirements? If the data model is right, the object models and layering will often fall out naturally, and the system will evolve easily over time. If it is not right, each wave of smaller and smaller changes will become increasingly expensive, and large changes simply impossible. The system will also not be auditable, or provably correct. Or actually correct.
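As a sketch of what "cohesive and decoupled" means in practice, here is a tiny hypothetical reference model (the entities and fields are invented for illustration, not taken from any real system). Each object describes exactly one real-world thing and refers to other objects by identity, never by embedded copy:

```python
from dataclasses import dataclass

# Hypothetical reference model. Each object is cohesive (it models one
# real-world thing) and decoupled (it references other objects by ID,
# never by a denormalized copy of their fields).

@dataclass(frozen=True)
class Customer:
    customer_id: str
    legal_name: str

@dataclass(frozen=True)
class Product:
    product_id: str
    name: str

@dataclass(frozen=True)
class Entitlement:
    entitlement_id: str
    customer_id: str   # a reference, not a copy of Customer
    product_id: str    # a reference, not a copy of Product
```

Because every fact is owned in exactly one place, there is one master for each attribute, and the "who is master, and do the copies stay in sync?" questions never arise.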
This is the first step in figuring out where, ideally, you would like to go. A side effect of this exercise will be a much deeper understanding of the existing system and the business requirements.
Inventory Everything, Classify It, and Make a Plan
It’s tempting to green light a complete system reboot. Rewrite it all. Engineers love writing new things, with new tools. Iterating on crusty old things, not so much. It’s a mistake. There will be parts of the system so complex and brittle that the correct strategy is to stabilize them, wrap them and put them in surgical maintenance, with a plan that they live forever. This will do some damage to the reference data model. Plan for it. Other parts of the system will be in maintenance until they are replaced entirely. And some parts can and should be written on top of new objects and APIs. The structure and sequencing of this is critical, and will evolve over time.
One thing to pay particular attention to is scale bombs. These are systems with a component that scales non-linearly with offered load, in a business environment where offered load is growing. Sometimes non-linearly. One strategy can be to scale up with hardware and hacks until these systems can be replaced. Sometimes you can get just the nasty load off the old systems, and let them run forever with smaller loads. Sometimes, due to schema issues, the scale problems are not alleviated until all of the load is off the systems. Complete shutdown. Modelling, calculating, and monitoring time-to-explode is critical. These systems often take months or quarters, and the very best engineers, to fix. And if the business stops, well, the business stops. It won’t be good.
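The time-to-explode calculation can start as something very simple: compound the offered load forward month by month until the component's non-linear resource cost crosses its ceiling. A minimal sketch, with a made-up quadratic cost curve and made-up numbers:

```python
from typing import Optional

def months_to_explode(current_load: float, capacity: float,
                      monthly_growth: float,
                      scaling_exponent: float = 2.0,
                      horizon: int = 120) -> Optional[int]:
    """Months until a component whose resource cost grows as
    load ** scaling_exponent exhausts `capacity`, or None if it
    survives the whole horizon. All parameters are illustrative."""
    load = current_load
    for month in range(1, horizon + 1):
        load *= 1.0 + monthly_growth   # compound the offered load
        if load ** scaling_exponent > capacity:
            return month               # the bomb goes off here
    return None

# Example: load 100 today, 5% monthly growth, quadratic cost,
# capacity of one million cost units.
print(months_to_explode(100, 1_000_000, 0.05))  # → 48
```

Even a crude model like this turns "we should replace that someday" into "we have four years, and the fix takes quarters", which is the conversation you actually need to have with the business.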
It is also important to get credit for this work. Otherwise this cost will get booked by default to the transformation project, which will, on its own, likely go over budget.
Layers and Re-platforming
There is nothing wrong with Service Oriented Architecture. It is a prerequisite for a decoupled distributed system. What you have to be a little careful of is a layer that adds SOA to part of an existing system where the data model really needs to be changed, but isn’t. Engineers like building layers. The new top edge is often cleaner, easier to understand, and the systems above it more maintainable. Once they are rewritten.
In cases where you have a system in surgical maintenance, and the data model doesn’t need to change, or can’t be changed, SOA API layers can make sense.
There are two forms of SOA layers. The simple one re-exposes a chunk of the existing data model, semantically unchanged. The danger here is that you burn considerable time, money and political capital on a disruptive change that accomplishes nothing toward the real business goals of digital transformation. The more complex one attempts to re-converge disjoint systems in the API layer. This can work, and is easier than fiddling with the base layers of the system. But there are real ceilings to this approach, and getting the balance right between fundamentally fixing the data model and faking it is one of the most important architectural decisions you will make.
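The re-converging form can be sketched as an API-layer merge of disjoint backend records into one object. Everything here is hypothetical (the record shapes, the precedence rules, the two source systems); the point is the shape of the technique and where its ceiling shows up:

```python
def converged_customer(crm: dict, billing: dict) -> dict:
    """Present one customer object stitched together from two disjoint
    systems: CRM wins on identity fields, billing wins on financials.
    (Hypothetical record shapes, for illustration only.)"""
    merged = {
        "customer_id": crm["customer_id"],
        "name": crm.get("name") or billing.get("name"),
        "balance": billing.get("balance", 0),
    }
    # The ceiling of the approach: conflicts the layer cannot resolve
    # must be surfaced, not silently papered over. Fixing them for real
    # requires changing the base data model, not the façade.
    if billing.get("customer_id") not in (None, crm["customer_id"]):
        merged["conflict"] = "customer_id mismatch between CRM and billing"
    return merged
```

The façade is cheap to build and genuinely useful to callers, but every unresolvable conflict it flags is a debt that only a real data-model fix can retire.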
A similar danger lies in re-platforming. Engineers like to re-platform. Move to modern tools. Virtualize. NoSQL. Node.js. Re-platforming, in the absence of fundamentally evolving the system toward real business goals, is often a distraction.
On this project, our team acquired a reputation for being good at managing crusty old systems. Unsurprisingly, in retrospect, there were people at the door willing to hand us crusty old systems, with headcount and budget. We took them. Many team members were happy, because they had bigger teams and bigger charters. Many of these systems were taking real load, with real customers, and were quite important to the business. But as it turned out, many of them were time-bombed, or audit-bombed, or compliance-bombed, and contained all manner of other surprises. Take on too many of these, and you will lose focus and turn your digital transformation project into an IT shop. But without the discipline that real IT shops have.
The cost of these things will get booked by default to the transformation project, which will, on its own, have gone over budget.
As teams get very large, in the best case you get more confusion, and in the worst, duplication and deliberate sabotage.
Distributed Apps Need an Architecture
I should have gotten this right earlier, but didn’t. I was focused on the evolving data model, service componentization, layering, making sure we were tracking to convergence over time, and that real business problems were being fundamentally enabled. What I missed is that distributed apps, like distributed systems, have errors. In the case of apps, often subtle and complex errors. Sufficiently complex systems cannot in practice be debugged completely. And if they could, they wouldn’t stay that way for long. It is critical to have an architecture for error handling, retries, orchestration, and compensation, so that the app, left to run on its own, converges toward correctness. A sort-of-OK middle state is that errors are caught, and corrected manually. A bad state is that the app silently converges toward incorrectness.
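The "converges toward correctness" property usually takes the form of a reconciliation loop that runs continuously: compare intended state against observed state, attempt a repair for each drift, and escalate what it cannot fix. A minimal sketch (the state shapes and the fix callback are hypothetical):

```python
from typing import Callable

def reconcile(expected: dict, actual: dict,
              apply_fix: Callable[[str, str], bool]) -> list:
    """One pass of a convergence loop: detect drift between the intended
    state and the observed state, attempt an automated repair, and
    return the keys that still need manual attention."""
    unresolved = []
    for key, want in expected.items():
        if actual.get(key) != want:
            if apply_fix(key, want):       # e.g. retry, compensate, re-emit
                actual[key] = want
            else:
                unresolved.append(key)     # escalate to manual repair
    return unresolved
```

Run on a schedule, a loop like this is what separates the OK middle state (errors caught, fixed by hand) from the bad one (silent drift): every pass either repairs drift or makes it visible.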
I now believe that most complex distributed apps silently converge toward incorrectness. And Support and Audit catch and repair some of it. You can experience this up close if you try a data-layer migration of a layer that’s been around for a good while. A good chunk of it moves, but then the long tail of schema errors festers. It’s often surprising that the app seems as correct and available as it does, given the crap in the databases.
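That long tail becomes visible the moment you validate rows against the target schema before moving them. A sketch of the partitioning step (the schema and rows are invented for illustration):

```python
def partition_for_migration(rows: list, required: dict) -> tuple:
    """Split rows into those that conform to the target schema and can
    move now, and the long tail that must be repaired (manually or by
    script) before it can move. `required` maps column name -> type."""
    moved, quarantined = [], []
    for row in rows:
        conforms = all(isinstance(row.get(col), typ)
                       for col, typ in required.items())
        (moved if conforms else quarantined).append(row)
    return moved, quarantined
```

The `moved` pile is usually encouragingly large on day one; the `quarantined` pile is where the months go.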
It can feel like an unwinnable battle. But not being intentional only makes it worse.
Organization and Teams
For much of my career in the capacity of technical leader, I got lucky. I worked at a company that was attracting top talent, and I was working on things that made creating opportunities to learn and grow relatively easy. Over the years, many talented people saved me from myself. This project was different. While we had many experienced and exceptional engineers, we also had a large percentage of early-in-career people. It was a little hard to recruit and retain the very top tier of experienced talent. The app, while critical to the business, was correctly not the primary focus of the CEO, and wasn’t aligned with our core products and services in the way that, for example, fulfillment or search is at Amazon, the feed and ads at Facebook, or mobile and maps at Apple and Google. We had scale challenges, but for the most part mega scale was not required. If you were primarily interested in learning new marketable skills, we were not at the cutting edge.
Scale I got.
In a climate like this, you have to deploy your best people very thoughtfully, invest in growing people, and remember to explain, at team scale, what you are doing and why it matters. Again. The latter was not my comfort zone, and the organization suffered for it.
There exists another organizational challenge in large teams on such projects. If you create a maintenance team and a new-thing team, then the new-thing folks miss out on critical knowledge about how things actually work. And if you combine them, you can lose focus on the system in maintenance. Or focus too much on it. We (mostly) erred on the side of combining them, which I think was the best choice. At least at the beginning. Toward the end, the folks working on the system in maintenance were beginning to work on migration tools.
The investment and opportunity cost to running the business on the system in maintenance can be very high.
The cost of these things will get booked by default to the transformation project, which will, at this point, be way over budget and late.
At some point, you have to rally to a milestone, suspend disbelief, finish, migrate, plan for and deploy it. Successfully. Without toasting the team. Much. Then, do it again. Until you are done.
Rolling it Out
There was an internal debate about the degree of business process and policy change that the new system (only) would enable. This had the benefit of getting people excited about the transition. But in practice, trying to roll out a new system, run migrations, train people on new systems and tools, and implement policy changes was too complex, and too risky. I pivoted to a strategy of building near-perfect backward compatibility into the new system, rolling it out silently, and then demonstrating how new business process and policy (often ready to go under the covers) could be implemented quickly. This was effective, but added a lot of cost.
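Near-perfect backward compatibility usually amounts to an adapter in front of the new model that reproduces the legacy wire format exactly, quirks included, so that nothing downstream notices rollout day. A hypothetical sketch (both record shapes are invented):

```python
def legacy_order_view(order: dict) -> dict:
    """Serve the old wire format from the new data model, quirks and
    all, so existing clients and reports see no change at rollout.
    Both shapes here are hypothetical."""
    status_codes = {"open": "O", "shipped": "S"}   # legacy one-letter codes
    return {
        "ORDER_NO": order["order_id"],             # old flat, uppercase keys
        "CUST": order["customer_id"],
        "STATUS": status_codes.get(order["status"], "U"),
    }
```

Each such adapter is pure cost against the end state, but it decouples the risky migration from the visible process change, which is exactly what made the silent rollout possible.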
All up, the cost of keeping the systems in maintenance running, migrations, and backward compatibility consumed the bulk of the team’s capacity. More than half. That, I hadn’t anticipated. IT folks are laughing.
For the first half of the project, the management chain above me was short, and reported to the CEO. My immediate manager and his manager understood the strategy, the value, and the complexity, and had confidence in the team. I took this for granted, and focused on building things. Until it was gone.
If you have read this far, perhaps you are wondering whether this was ultimately a success. I retired 5 months ago, after being a technical leader for 6 years on that project. And we were not done. So no, I didn’t feel successful. I would have done things very differently, with the wisdom of hindsight.
That said, we changed business strategy in a very big way, mid-project. It was a necessary and correct decision, but it cost us 2 years of investment.
We defused a number of scale bombs and improved correctness and availability, and kept a large and growing business growing.
The systems-in-maintenance teams delivered flawlessly. They kept things running, at scale, and waited for the new-thing team to catch up. They were always helpful.
The new system supported a fast growing new business with immense scale and unique business model requirements. Successfully.
Existing systems were starting to be migrated to the new system with success.
And a very good team is still working it.