2010-09-27

Archiving on hard disk

How I archive at home:

The difference between an archive and a backup


The most important difference is that an archive is the master copy, while a backup is not. If you lose the archive copy, you risk losing the content permanently. Because the archive is the master copy, it is sensible to keep a backup of it.


Media choice


I used to archive to CD-Rs and DVD-Rs. They had a great GiB/$ ratio, which was an important factor for me. They were also immune to some kinds of damage: power surges, file-system corruption, fat-finger deletion, etc. But they were inconvenient to handle because you need many of them. It was not so bad 10 years ago when I had my first CD-R drive, because my data was small relative to a CD-R. But over the years my data grew and grew, and I've gotten lazier about feeding DVD-Rs into the drive.

Now I archive to hard disk. It has become economical to do so. As of today, the best ratio for a SATA hard drive [1] on newegg.com is 14.70 GiB/$. You'd need at least 2 of them, which brings it to 7.35 GiB/$. The cheapest DVD-R media goes for $18.00 per 100 single-layer discs [2], equivalent to 24.27 GiB/$. DVD-Rs are still cheaper, but it takes a lot of time to write to them.

Burning one DVD-R takes 15 mins; writing the equivalent amount, 4.37 GiB, to a hard disk through USB2 (assuming a write speed of 20 MiB/s) takes about 3.7 mins. Let's assume that we are optimizing for time, so we do not bother validating the archive (re-reading), and that the data exactly fills a DVD-R. The first assumption gives a big advantage to the DVD-R because it eliminates a large validation overhead: the media has to be ejected and re-inserted. The second assumption also eliminates the overhead of changing DVD-R media.

So the time saved is 11.3 mins / 4.37 GiB = 2.59 mins/GiB. If you have to archive 100 GiB, copying it to a hard drive is quicker by 100 GiB * (15 - 3.7) mins / 4.37 GiB = 4.31 hours. In reality the time saved, and thus the time-cost difference, is even more pronounced, as you don't have to wait in front of the computer while the data is being copied. I am usually present for less than 5 minutes total: initiate copying, initiate validation. The time difference that matters to you (you not being a slave to the DVD-R drive overlord) is 100 GiB * 15 mins / 4.37 GiB - 5 mins = 5.63 hours per 100 GiB.

For a time saving of 5.63 hours per 100 GiB, I'd pay the cost difference of 100 * (1/7.35 - 1/24.27) = $9.49 per 100 GiB for the hard drive.


[1] $95 for 1.5 TB HD is $95 for 1396.98 GiB = 14.70GiB/$.
[2] SL DVD+R capacity: 2295104 2KiB sectors = 4.37 GiB. $18 for 100 discs = 24.27GiB/$.
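
For the record, here is a small awk sketch that redoes the arithmetic above, with the prices and capacities from the footnotes hard-coded as assumptions (small rounding differences aside):

  awk 'BEGIN {
    hd_gib_per_usd   = 1396.98 / 95 / 2       # master + backup drive: 7.35 GiB/$
    dvd_gib_per_usd  = 100 * 4.37 / 18        # 100 discs at $18: 24.27 GiB/$
    hd_mins_per_disc = 4.37 * 1024 / 20 / 60  # 4.37 GiB at 20 MiB/s: ~3.7 mins
    printf "extra cost of HD: $%.2f per 100 GiB\n", 100 * (1/hd_gib_per_usd - 1/dvd_gib_per_usd)
    printf "time saved:       %.2f hours per 100 GiB\n", 100 * (15 - hd_mins_per_disc) / 4.37 / 60
  }'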


Backup of archive


According to some articles, a hard drive is prone to losing its magnetism when stored for a long time. This supposedly affects recent, high-density disks because the bits are packed closely together. Other causes can also affect a hard drive: the motor lubrication may evaporate or chemically degrade, preventing the platters from spinning correctly, or a power surge may fry some of the drive electronics.

Since the archive copy is the master copy, it is prudent to back it up. The easiest way is to have a similarly-sized drive. Then you can just do a bit-by-bit copy from the master archive to the backup archive, which is faster than a filesystem-level copy if the drive is mostly full.
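
A bit-by-bit copy is just dd from one block device to the other. A minimal sketch, assuming the master shows up as /dev/sdb and the backup as /dev/sdc (check with dmesg or similar first; swapping if= and of= would overwrite the master):

  dd if=/dev/sdb of=/dev/sdc bs=1M
  kill -USR1 "$(pgrep -x dd)"    # on Linux, GNU dd prints its progress so far on SIGUSR1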

The backup media should not be of the same make and model. Hard drives are manufactured in batches, and the same defect is usually present in every drive in a batch. If you buy two drives of the same make and model at the same time, they may come from the same batch.



Partitioning archive disk


If the archive disk is big, you may want to partition it into smaller pieces. Partitioning can limit the extent of filesystem corruption, as long as you operate on (mount) one partition at a time.

How big a partition should be is up to you. For my data, I'd partition a 500GB drive into 2 250GB slices, and a 1.5TB drive into 3 500GB slices.
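
As a sketch, the 1.5TB layout could be made with parted roughly like this (the device name /dev/sdb and the ext3 filesystem are assumptions, not recommendations):

  parted -s /dev/sdb mklabel gpt
  parted -s /dev/sdb mkpart arch1 ext3 0% 33%
  parted -s /dev/sdb mkpart arch2 ext3 33% 67%
  parted -s /dev/sdb mkpart arch3 ext3 67% 100%
  mkfs.ext3 /dev/sdb1    # repeat for /dev/sdb2 and /dev/sdb3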


Encryption


Encryption is optional, but it can be useful on Windows.

Windows eagerly mounts all partitions on a disk, which defeats the intention of having multiple partitions. However, encrypting each partition lets you mount only the partitions you need. FreeOTFE or TrueCrypt is my encryption software of choice on Windows.


Operation & maintenance


After archiving to the master disk, I follow up with a bit-by-bit copy of the affected partition to the backup disk. This shows another benefit of partitioning: if you have a gargantuan drive, you don't have to touch all of it.
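
Per partition this looks like the whole-disk copy above, just with partition devices, and a verification pass is cheap enough to add. Again the device names are assumptions, and cmp expects the two slices to be the same size:

  dd if=/dev/sdb2 of=/dev/sdc2 bs=1M   # copy only the slice that changed
  cmp /dev/sdb2 /dev/sdc2              # a silent exit (status 0) means the copies match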

After copying to the master disk, I take both disks offline. I use SATA drives for my archive disks and have 2 SATA docks. The disks are offline (unpowered, unplugged, in safe storage) most of the time. The master is online only when I need to read from or write to it. The backup is online whenever I modify the master. Whenever the backup is online, there is a brief window of risk where both drives may be destroyed (e.g. by a power surge).

The window can be eliminated. One way is to copy the master partition to an always-online disk, take the master offline, bring the backup online, and then write the copy of the master partition from the online disk to the backup disk. This is too cumbersome for me, and I have no need for this kind of data guarantee.
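
If you did want it, the staging step is just a dd round trip through an image file; the device names and the staging path are assumptions for illustration:

  dd if=/dev/sdb2 of=/mnt/stage/part2.img bs=1M   # master online, backup still offline
  # take the master offline, bring the backup online, then:
  dd if=/mnt/stage/part2.img of=/dev/sdc2 bs=1M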

Around every New Year, I run badblocks -n over the whole of each disk to rejuvenate the 'magnetism'. This takes a lot of time, about 5.5 hours/100GiB when connected through USB2. An alternative, according to this article, is to simply do a read. This can be accomplished with dd if= of=/dev/null [bs=1M].
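
Concretely, the yearly pass looks something like this (the device name is an assumption; -s just shows progress):

  badblocks -n -s /dev/sdb            # non-destructive read-write pass
  dd if=/dev/sdb of=/dev/null bs=1M   # the lighter, read-only alternative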


Update: A quick reading of Predicting Archival Life of Removable Hard Disk Drives (circa 2008) shows:
  • a hard drive can easily survive 20 years (in a simulated environment),
  • the 2.5" form factor has a better survival rate than 3.5" drives,
  • storing a drive at 20C increases its survivability.
I honestly am not bothered by these for my personal usage. I'd likely move the content over to some new media when SATA becomes obsolete. I am pretty sure that SATA will become obsolete within the next 10 years. I'd keep upgrading the media to stay ahead of obsolescence. It is no fun trying to retrieve something from an obsolete interface.

Badblocks: non-destructive read-write mode

With the -n flag ("non-destructive read-write mode"), badblocks does the following sequence in every iteration:
  1. read data
  2. write test pattern (the test pattern is ~0, i.e. 0xffffffff as a 32-bit int)
  3. read test pattern
  4. compare test pattern
  5. write data
It will try to write back the original data when terminated with the following signals:
  • SIGHUP
  • SIGINT
  • SIGPIPE
  • SIGTERM
  • SIGUSR1
  • SIGUSR2
Killing it with SIGKILL is dangerous: at best, you have a 3/4 chance of corrupting your disk (the only state in which SIGKILL is safe is step 1, reading the original data).
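
So if a pass has to be cut short, send one of the handled signals instead, for example (assuming a single badblocks process is running):

  kill -INT "$(pgrep -x badblocks)"   # badblocks tries to write the original data back before exiting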