Like many folks, I’ve adopted Dropbox as my main cloud document syncing and backup service. The Dropbox service is typically so slick and transparent that one hardly notices it; one just expects the latest file updates to be there when switching to another machine. I use it to keep my files synchronized on my four production Mac laptops and my iPad, and happily it even supports OS X 10.4 Tiger on my going-on-fifteen-year-old Pismo G4 PowerBooks.
So it was a jolt last weekend when Dropbox refused to sync my files. I worked around the problem provisionally by using email, and happily all is well again.
Dropbox Head of Infrastructure Akhil Gupta has posted a post-mortem blog entry explaining what caused the issue. He notes:
On Friday evening our service went down during scheduled maintenance. The service was back up and running about three hours later, with core service fully restored by 4:40 PM PT on Sunday.
For the past couple of days, we’ve been working around the clock to restore full access as soon as possible. Though we’ve shared some brief updates along the way, we owe you a detailed explanation of what happened and what we’ve learned.
We use thousands of databases to run Dropbox. Each database has one master and two replica machines for redundancy. In addition, we perform full and incremental data backups and store them in a separate environment.
On Friday at 5:30 PM PT, we had a planned maintenance scheduled to upgrade the OS on some of our machines. During this process, the upgrade script checks to make sure there is no active data on the machine before installing the new OS.
A subtle bug in the script caused the command to reinstall a small number of active machines. Unfortunately, some master-replica pairs were impacted which resulted in the site going down.
Your files were never at risk during the outage. These databases do not contain file data. We use them to provide some of our features (for example, photo album sharing, camera uploads, and some API features).
To restore service as fast as possible, we performed the recovery from our backups. We were able to restore most functionality within 3 hours, but the large size of some of our databases slowed recovery, and it took until 4:40 PM PT today for core service to fully return.
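To make the failure mode concrete, here is a minimal Python sketch of the kind of pre-install check the post-mortem describes: confirm a machine holds no active data before reinstalling its OS. This is not Dropbox's script. The `Host` fields, the role labels, and the function names are all hypothetical, invented purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Host:
    # All fields are hypothetical; a real inventory would look different.
    name: str
    role: str          # "master", "replica", or "idle" (assumed labels)
    active_bytes: int  # live data the host reports holding


def safe_to_reinstall(host: Host) -> bool:
    """Allow a reinstall only if the host is idle AND holds no live data."""
    return host.role == "idle" and host.active_bytes == 0


def plan_reinstall(hosts):
    """Split hosts into names to reinstall and names to skip."""
    to_reinstall = [h.name for h in hosts if safe_to_reinstall(h)]
    skipped = [h.name for h in hosts if not safe_to_reinstall(h)]
    return to_reinstall, skipped


fleet = [
    Host("db-01", "master", 512),
    Host("db-02", "replica", 512),
    Host("spare-01", "idle", 0),
    Host("spare-02", "idle", 128),  # labeled idle but still holds data
]
to_reinstall, skipped = plan_reinstall(fleet)
print("reinstall:", to_reinstall)
print("skip:", skipped)
```

The sketch requires both conditions to hold, so a machine mislabeled as idle but still carrying data (like `spare-02` here) is skipped rather than wiped. A check that trusts only one signal is where a subtle bug of the sort described could slip through.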
For the full report visit here: