Backups, backups, backups – Captain's Log Supplemental

Now that Zeta is effectively operational, I’ve turned my master plan to its next stage: not losing data.

Cream’s drives had a very simple arrangement. One drive, designated “Master”, or M:, was the base of all the file shares kept there. A second drive, “Backups”, or Z:, was kept next to it, and a scheduled task would run a robocopy in mirror mode three times a week to sync the drives up. Nice and simple and cheesy, and for bonus points the mirrored drive racked up about half the power on hours over the past few years. For monitoring purposes, the log was saved to one of Cream’s internal drives which was periodically imaged to the host specific backups area on the master drive.

Zeta on the other hand doesn’t have the curse of NT, but I did kind of like this simplicity. It fits my recovery model where having an easily recovered copy of data is desirable, but changes are infrequent enough that rolling up the backups every 2 or 3 days is probably okay. At first, I decided to create one disk as ext4 to function as the backups, because dependable and trusted, while making the other xfs to function as the master, because that’s the default in AlmaLinux 9.

This created one small problem however, in that getting rsync to play nice with the SELinux, POSIX ACLs, and a few extended attributes proved to be a pain in my ass! For SELinux, you can just relabel the drive after. Not something I want to scale up to 8 TB but not too bad for the actual storage use (2 TB) today. But then we’ve got the issue of the POSIX ACLs and extended attributes used on my file share infrastructure.

Turns out that rsync’s --archive flag effectively breaks the flags you would want for synchronizing these, and then leaves you to go fiddle around with permission masks. So, I said fuck that. I was rather disappointed in rsync at that, but let’s face it, acls and xattrs aren’t that popular when 1970s unix permissions are an 80% solution.

After taking suitable backups (one local, one remote) of the critical files, I set about turning to tools that I know how to fuck with. The backup drive was sacrificed to create one disk of a RAID1 Mirror, and since madam allows specifying the drives like missing /dev/sdwhatever or vice versa, it was easy to spawn the array in degraded form. Then sync the data to the array from the master drive, before wiping and adding it to the mirror’s missing slot. About 10 or 11 hours later syncing at max speed, everyone’s all riled up and gone through the reboot test.

How did I migrate the data if rsync was being a bugger, you ask? Well, it’s slow as hell, but cp --archive and tar --acls --selinux --xattrs really does do what you want when you’re Rooty Tooty and want a lossless copy :P.

In the past, I would typically have used LVM2 pools to manage this sort of operation. It’s overly complicated command line administrata, but hey, it works well and it has features I like, such as snapshots and storage pools. The advantage for me of mdadm is that it is very simple to manage thanks to fewer moving parts.

Having been “That guy” at some point in my career who ended up writing the management software my old job used for mdadm software raid in their audio IRDs, and then later extended to custom hardware built ontop of firmware raid, I know how to use mdadm and more importantly, how reliable it is — and how easy it is to recover a mirror without fucking up. Which, you know, is like the number one way your data goes bye, bye when recovering, right next to oh shit the drive died before it was synced. As much as I appreciate LVM2, it’s got enough moving parts that I’m more leery about the failure scenarios. More importantly, I have more experience with mdadm failure and recovery than I do with LVM.

Of course, this does create a new problem and its own solution. Since my backup drive is now in hot sync with the master drive, it is no longer uber idle enough to be considered a ‘backup’. No, it’s redundancy to buy time to replace drives before the entire array goes to the scrap yard.

This doesn’t really change my original recovery scenario: which is “Go buy two drives if one fails,” it just means that there is a higher probability that both drives will actually fail closer together when that happens. What’s the solution to this? Why, my favorite rule of data storage: ALWAYS HAVE A BACKUP! Thus, a third drive will be entering the picture upon which to do periodic backups of the array, and be kept separate and offline when not being refreshed.

In practice though, this will be more like a fourth drive; in the sense of ‘smaller disk, most important data’ and ‘big ass disk, all the data’. My spare archive drives are large enough to easily do the former, and one can basically contain the entire ‘in use’ storage or close to it, but none of my spares for sporadic backup has the capacity to handle the entire array.