How Not To Do A Migration
Over the weekend, LinkedIn migrated their data center from Chicago to LA. To do this, LinkedIn offlined their service. In short, this was the wrong way to do it.
LinkedIn initially scheduled the outage to be five hours, but it turned out to be much more: the official LinkedIn Twitter account asks people to hold on just a bit more at the six hour mark, and the official blog post about the migration has a comment from someone claiming the site is still down at the 13 hour mark. In today’s world of highly available computing, there is no reason why this had to happen.
From what I can tell, LinkedIn did a complete exchange of data centers; they apparently shut down the world to test the synchronization between the two sites. This just has me scratching my head. LinkedIn is a rather large enterprise, and I’m sure they have a few bucks they could set aside for a redundant testing environment which would eliminate the need for a complete outage.
Off the top of my head, there are four other ways I would have solved this problem. Check them out after the jump.
- Catch Me I’m Falling
LinkedIn uses Oracle and MySQL on the backend (slide 9 of this presentation — I’m a data guy, so that’s my initial focus), although pretty much all of their tech stack listed on the deck would support this approach. Both databases support a high degree of replication and failover capabilities (Oracle more than MySQL, but there’s a slight price difference, too). You would be able to sync up the two data centers in a largely automatic way (assuming the configuration is well executed); you could test by picking a smaller pair of instances (one in Chicago, the other in LA) and switching back and forth a few times to make sure the switched database catches up.
I also happen to know that LinkedIn is one of the big proponents of NoSQL, as they use Voldemort for their NoSQL data persistence layer. I’m only familiar with Voldemort in passing, but one of its claims to fame is its support of partitioning and failover. So, the Oracle/MySQL strategy above is still applicable, so long as Voldemort does what it claims it can do…
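To make the "switch back and forth and make sure the other side catches up" test concrete, here is a minimal sketch in Python. It is a toy simulation, not LinkedIn's process and not real Oracle or MySQL tooling: the `Node` class, `replicate`, and `switchover` are hypothetical names I made up, and the "log" is just an in-memory list standing in for a replication stream.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Toy stand-in for a database instance (hypothetical, not a real driver)."""
    name: str
    log: list = field(default_factory=list)  # ordered list of applied writes


def replicate(primary: Node, replica: Node, batch: int) -> int:
    """Copy up to `batch` missing log entries to the replica; return remaining lag."""
    start = len(replica.log)
    replica.log.extend(primary.log[start:start + batch])
    return len(primary.log) - len(replica.log)


def switchover(primary: Node, replica: Node, batch: int = 2, max_rounds: int = 100):
    """Swap roles only after the replica has caught up with the primary."""
    for _ in range(max_rounds):
        if replicate(primary, replica, batch) == 0:
            return replica, primary  # the old replica becomes the new primary
    raise RuntimeError(f"{replica.name} did not catch up; aborting switchover")


chicago = Node("chicago")
la = Node("la")
primary, replica = chicago, la

# Switch back and forth a few times, writing to whichever site is primary.
for round_no in range(3):
    primary.log.extend(f"write-{round_no}-{i}" for i in range(5))
    primary, replica = switchover(primary, replica)
    assert primary.log == replica.log  # both sites agree after each switch
```

The point of the sketch is the guard in `switchover`: the roles only swap once the lag is zero, and the test aborts loudly if the replica never catches up. A real test would read the lag from the database's own replication status rather than comparing lists.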
Net Result:
Some performance hiccups during the selective switchovers, but the sync process is largely invisible to the user, particularly the Oracle one if you choose the option which mines the redo logs and occurs pretty much entirely out of band to the transactional processing.
- Half A Loaf’s Better Than None
Remove a significant fraction (like 33-40%) of the operating hardware in the existing data center(s) and use that fraction to do the necessary testing with the new data center. The remaining hardware continues to serve the site to the existing clients; publish what you’re doing so everyone expects the degraded performance, and schedule the testing period for a nice long weekend in your major client market (like, say, Thanksgiving weekend). Once the test’s complete, roll the new data center into the load balancing process and return the test fraction to normal service.
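Whether pulling a third or more of the fleet is tolerable comes down to simple arithmetic. The sketch below is only an illustration: the fleet size and the expected holiday load are numbers I made up, and the function names are hypothetical.

```python
def serving_capacity(total_servers: int, test_percent: int) -> int:
    """Servers left serving traffic after pulling `test_percent` of the fleet."""
    pulled = total_servers * test_percent // 100
    return total_servers - pulled


def can_absorb(total_servers: int, test_percent: int, expected_load_percent: int) -> bool:
    """True if the remaining servers cover the expected load.

    `expected_load_percent` is the expected traffic as a percentage of what
    the full fleet can handle (an assumed planning input).
    """
    needed = -(-total_servers * expected_load_percent // 100)  # ceiling division
    return serving_capacity(total_servers, test_percent) >= needed


# Assumed fleet of 300 servers with 40% pulled for testing.
remaining = serving_capacity(300, 40)
quiet_weekend_ok = can_absorb(300, 40, 55)   # load at 55% of full capacity
normal_weekday_ok = can_absorb(300, 40, 70)  # load at 70% of full capacity
```

With these assumed numbers the remaining hardware covers a quiet holiday weekend but not a normal weekday, which is why the timing of the test matters.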
Net result:
Some people are inconvenienced by the lower performance, but the service remains available.
- More Than Just Books
Rent space and computation power from Amazon sufficient to host the community during the test and sync process.
Net result:
Aside from a different performance profile, it is unlikely anyone would have even noticed.
- Leasing More Than Just Office Furniture
You may not want to trust Amazon with your business model (a somewhat understandable position). In that case, lease the hardware from a vendor and stand up a redundant environment in the same data center. Fail over to the redundant cluster, test the sync on the original hardware, then end the test and use whatever mirroring process you used to stand up the redundant environment to catch the original back up. Fail back to the original and off you go. Return the leased hardware to the vendor.
Net Result:
No one would ever notice a thing. The catch is that leasing all that hardware would be rather expensive as well as resource intensive.
In the interests of full disclosure, I have no knowledge of LinkedIn’s infrastructure nor of their operational processes. There may be a very good reason why they elected for the outage, but I’m just not seeing it. If anyone reading this has some insight you’d like to share, please do. I’d like nothing better than to be wrong and learn from LinkedIn’s experience.
– Update –
It’s several days later and LinkedIn is still having problems. But I will give them credit for being upfront about it.