Imagine you’ve opted for a Big Bang rewrite instead of a gradual one.
Here’s how we can visualize your current situation:
You’re in the tunnel. There is no light at the end of the tunnel. At least, not yet.
You soldier on. You are confident there is light at the end of the tunnel, but you don’t know how long the tunnel goes. Your kids are following you, asking: Are we there yet? How much longer must we go on?
You don’t have the faintest clue — but you’ve got to pretend you do. Until, finally, you begin to see rays of light.
The problem is that you can only see the beams of light when you’re nearly out.
The darkness of Big Bang rewrites sucks. So what makes them such an incredibly foolish idea in the first place?
The Hubris of Big Bang Rewrites
I’ve never worked anywhere where a single person knew how the complete old product worked, from either a functional or a technical perspective.
This lack of total understanding makes perfect sense. A software product is a complex system that evolved over time. Many of the people who worked on it have since left the company, taking their understanding with them. The reasons behind decisions have been lost.
There is almost never a single person still at the company who has touched all parts of a product from both a functional and a technical perspective. And even if we possessed such a mythical person, because it’s a complex system, they couldn’t accurately predict how it’s going to behave in all situations.
The reason you’re doing a rewrite is that you don’t know how your current product works from a technical perspective. If you knew how it worked, you would be able to gradually rewrite it. The big bang rewrite becomes an option because all you see is spaghetti, and you don’t know how the spaghetti works.
The lack of knowledge about the old product is precisely why you are in the worst position possible to do a big bang rewrite. You don’t even know how your current product works, so it’s crazy to think you’re capable of doing a complete rebuild without missing some crucial things along the way.
When you opt for a Big Bang Rewrite, the following usually happens:
You’re gifting your competition a head start of (usually) many years. This is equal to the duration of the full rewrite.
The bigger the project, the more risk and uncertainty, and the more likely you are to experience delays.
The more delays you experience, the more pressure your team gets. The more pressure your team gets, the more technical shortcuts get taken.
The more technical shortcuts get taken, the more likely you are to end up with a rebuild that isn’t significantly better than the old one.
At some point someone will be fed up with all the delays and uncertainty. They will pound their fist on the table and force an unrealistic deadline on the development teams.
This deadline often results in cutting corners and crippling technical debt, which can make the rebuild even worse than the original product.
To prevent this ticking time bomb from exploding, what should a company be doing instead of tunneling through the darkness?
Time is Ticking When You’re Busy Tunneling
Every day, you’re still in the tunnel. Time is ticking and morale is slipping. The longer it takes, the more miserable everyone becomes. You don’t know how long you can keep the kids happy — marching forward in complete darkness with no end in sight.
Most resources are funneled towards the new product, which isn’t live and can’t be used until it has been fully completed. You’re burning money without getting anything in return. Meanwhile, the old product, still alive and kicking, actively used and sold, is being quietly neglected. And with every day spent in the tunnel, people in the organization begin to care a little less about it.
Why improve something you’re going to sunset in the near future? Every change is now carefully examined through the lens of the following question: Does this change even matter if the rewrite is coming soon?
The engineers working on the old product are frustrated too. They are usually the ones who know best how the old product works, yet they’re doomed to keep it alive until someone finally decides to put it down.
And then they’ll have to work on a new product that they had no part in building and might not even fully agree with.
The Lie of Progress
The problem with a big bang rewrite is simple: nobody can tell you exactly how long it will take. The 11 Laws of Software Estimation apply, most notably:
“2. No matter what you do, estimates can never be fully trusted.”
“4. Estimates become more reliable closer to the completion of the project. This is also when they are the least useful.”
“10. Breaking all the work down to the smallest details to arrive at a better estimate means you will deliver the project later than if you hadn’t done that.”
The timeline only becomes clear when you’re almost there. But we hate uncertainty, so we fill the void with paper victory plans and charts. These forecasts and plans allow us to pretend to our leaders that we can see the whole path from beginning to end. This is a lie. Most of our plans and estimates are just a ritual to produce the illusion of control and satisfy our craving for a false sense of certainty.
That desire for a false sense of certainty slows down progress. It’s only by doing the work that we discover the work that must be done. Instead, we inject our estimates and projections with the fog of speculation, over-fitting our plans and sabotaging them in the process.
To make things worse, the team working on the rebuild is emotionally invested. They need it to succeed, because they can’t go back to the bullshit of working on the old product, which means they’re unlikely to admit how uncertain and risky things really are. So they paint a confident picture and tell the story they hope is true, not the real story.
Instead of wishful thinking and spreading false narratives, what should we be doing when we’re in the tunnel?
Produce a (False) Total Picture of Progress
Since you’re already in the tunnel, you’ve got the benefit of already having completed work. Hold a session with all teams involved in the rebuild to discuss the following:
List all the rough chunks of work that are necessary for the rebuild (including work you’ve already completed).
For the work you’ve completed, add the actuals. If you have the original, unaltered (a priori) estimates, compare them with the actuals and calculate how far off you were on average.
Group the uncompleted chunks of work together with completed items of similar size. Now you have estimates for the uncompleted work.
Add some padding for things you’ve missed, preferably based on the completed items where you most likely also missed things. Make sure you also include migration of users or other work (if applicable), and time for E2E testing. Then apply a nudge factor based on how far your estimates were from the actuals (a sketch of this calculation follows below).
Prepare everyone upfront by saying this exercise will be extremely draining and frustrating. If you can’t involve all teams directly, then make sure you involve a representative and useful set of delegates from the different teams involved. Ideally you finish the session in less than a day, something like 4 hours.
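For what this calculation can look like in practice, here’s a minimal sketch in Python. All the numbers, names, and the 20% padding are made up purely for illustration; the point is the shape of the arithmetic, not the values.

```python
# Minimal sketch of the order-of-magnitude forecast described above.
# Every number and name here is hypothetical, purely for illustration.

completed = [  # (a priori estimate, actual), both in weeks
    (4, 7),
    (6, 9),
    (2, 4),
]

# Nudge factor: how wrong were we on average? (actual / estimate)
nudge = sum(actual / estimate for estimate, actual in completed) / len(completed)

# Uncompleted chunks, sized by grouping them with similar completed items.
remaining = [4, 6, 6, 2, 8]  # weeks

# Padding for missed work, user migration, E2E testing: a guess, tune it.
padding = 0.2

forecast = sum(remaining) * nudge * (1 + padding)
print(f"nudge factor: {nudge:.2f}")              # 1.75 in this made-up example
print(f"remaining work: ~{forecast:.0f} weeks")  # an order of magnitude, not a promise
```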
The developers who attend the session will say: “Please give us more time. Don’t pressure us, and we’ll be able to produce a much more reliable estimate.” This is a lie, because no matter what you do, your forecast will be wrong.
We want to waste as little time as possible producing it, precisely because we know it will be wrong. No matter how much time you spend, the forecast will always be extremely noisy and filled with (wrong) assumptions.
We want to have an idea about the order of magnitude. We also want to prevent the following law from kicking in:
“10. Breaking all the work down to the smallest details to arrive at a better estimate means you will deliver the project later than if you hadn’t done that.”
The more time you spend on it, the more you’ll be over-fitting and injecting noise, and your forecast will still be wrong. You’ll only have wasted more time producing a wrong forecast.
The point isn’t to produce an accurate estimate, because that’s impossible, but to produce a guess of the order of magnitude. Let’s say you’ve completed the exercise with your teams. Congratulations! Now you’ve got an estimate for how long you expect the whole rebuild to take. Remember, this estimate is completely full of shit and certainly wrong.
But at least now you have an indication of the order of magnitude, produced with very little wasted time, that you can use to make decisions. In most cases, whatever guess we produce will underestimate the true length of the complete rebuild, because the last 10-20% takes most of the time.
Now that we have a guess about the order of magnitude, we can explore the implications together.
Does the Rebuild Still Make Sense?
Let’s say the forecast for the rebuild time is 5 years. Then we can talk about:
Is this timeline acceptable?
If it’s not acceptable, how much faster would our rebuild need to be for it to be acceptable?
What can we do to speed up the rebuild?
If we cannot speed up the rebuild to achieve an acceptable time-to-market, what should we be doing instead?
Regarding point 2, adding more people is rarely the answer, because until those people are up and running, you will lose precious time. Often they lack so much domain knowledge that they will actually slow others down with all their questions.
I’ve been in many situations where companies paid many millions to temporarily add new development teams, while those teams added zero value because the development speed they contributed was canceled out by how much they slowed the other teams down.
Some of the questions you should be asking about the current rebuild:
Are we busy with over-engineering? Often when doing a rebuild, people are scared and want to do it right this time. As a consequence, they set the quality bar extremely high. This is foolish, because:
Every day it takes to build the new product means that existing users are exposed to the shitty quality of the old product. The goal isn’t to get it perfect from the first day of the rebuild, but to launch something that’s sufficiently better, and that we can make awesome over time.
Don’t let the perfect be the enemy of the good. Set a quality level that’s better than the old product, and make sure you can improve it over time.
What features should we leave out, or add only after the first version? Companies are usually not aggressive enough with this, and as a result they spend many months rebuilding features nobody uses in the old product, and nobody will use in the new product either.
Just to hammer the point home: remove features as aggressively as possible from the first version. Too often I’ve seen features cut close to the completion of a project, after the teams had already wasted many months on them. Be as aggressive as possible now, or you’ll regret having delivered a lot less later.
After you’ve gone through these points, how much have you reduced the overall timeline? Is this now acceptable or not?
I even worked at a company where they scrapped the rebuild and restarted a new one. A small team of a handful of people did a proof of concept: another rebuild with a different architecture, in parallel to the existing teams. In less than 2 months, this rebuild had made more progress than the old one. We used this information to decide to scrap the old rebuild and go for the new one.
Big bang rebuilds are difficult and treacherous. Most of them fail to deliver the expected value by the time we’re done, so what should we be paying attention to to make sure we’re on the right track?
Rebuilds That Show Quick Progress Are Finished Quickly
The single most important thing you should pay attention to when doing a rebuild is the following: rebuilds that show quick progress are finished quickly.
This may seem extremely obvious, but let’s say you’ve been working for 6 months and you’ve barely seen anything working. Something is going terribly wrong, and in all likelihood your rebuild will be a disaster. You’re in the tunnel, and nobody is seeing any light.
All the rebuilds I’ve been a part of that were successful made quick progress. The unsuccessful ones made slow progress, often because they chose an approach that was over-engineered or too complicated.
When doing rebuilds, it’s absolutely crucial that you follow an approach where you will quickly see light at the end of the tunnel. Light at the end of the tunnel is having something that works. Something that really works: it’s in the hands of real users, and they tell you it works for them.
This is why I much prefer gradual rewrites: you immediately put value in the hands of users. You’re swiftly seeing light at the end of the tunnel. You dodge the time pressure and stress of big bang rewrites, because everybody relaxes when they see real-world results.
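If you’re wondering what a gradual rewrite looks like mechanically, a common pattern is strangler-fig routing: send each request to the new system once its feature is migrated, and to the old system otherwise. Here’s a minimal hypothetical sketch; all the names are invented, and real setups often do this in a reverse proxy or API gateway rather than in application code.

```python
# Minimal strangler-fig routing sketch: the rewrite goes live one feature
# at a time instead of all at once. All names here are hypothetical.

class LegacyApp:
    def handle(self, feature: str, payload: dict) -> str:
        return f"old product handled {feature}"

class NewApp:
    def handle(self, feature: str, payload: dict) -> str:
        return f"rewrite handled {feature}"

legacy_app, new_app = LegacyApp(), NewApp()

# Grows one entry at a time as each rewritten feature ships to real users.
MIGRATED_FEATURES = {"search", "invoicing"}

def route_request(feature: str, payload: dict) -> str:
    """Route to the rewrite once a feature is migrated, else to the old product."""
    if feature in MIGRATED_FEATURES:
        return new_app.handle(feature, payload)
    return legacy_app.handle(feature, payload)

print(route_request("search", {}))     # served by the rewrite, live today
print(route_request("reporting", {}))  # still served by the old product
```

Every entry added to that set is a ray of light: real users running on new code, long before the full rewrite is done.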
If you can’t do a gradual rewrite, or don’t want to, then at least make sure your rebuild shows quick progress.
Slow rebuilds are incredibly risky, and in all likelihood, you’re building something that’s going to be a failure. Because if the rebuild is slow, you can be pretty damn sure that adding new features will also be slow. You’ve missed the point of the rebuild.
You’re often doing the rebuild because you want to build new features faster and with higher quality, not to keep delivering features slowly, as you already do right now, with slightly better quality.
If you’re adding new features slowly, it’s probably better to pull the plug and opt for a gradual rewrite. Don’t let the sunk cost fallacy keep you going on the same path that will set you up for failure in the future as well.
If you’re seeing light at the end of the tunnel, and you’re close: keep going!
If you’re not seeing the light, try to create a situation where you will quickly stumble on rays of light at the end of the tunnel. That’s the way out of the darkness: making quick progress towards the results we want, not pretending we’re making progress through pretty Gantt charts.
Special thanks to , and Jasenko Ramljak for their feedback.