Tuesday, April 3, 2012

Getting Started with Cyphertite Remote Backup

Cyphertite is a new remote backup tool from Conformal Systems that, in many ways, is similar to other popular services such as Backblaze, Mozy, Carbonite, and JungleDisk. Linux and FreeBSD users will find similarities with tarsnap. It's a bit of a crowded market, no? However, four attributes set Cyphertite apart:
  1. Storage that's cheaper than AWS at $0.10/GB (no charge for bandwidth)
  2. The amazing combination of compression, deduplication, and client-side encryption
  3. Strong cryptography (White Paper - PDF)
  4. An open source client open for anyone to review
Since cyphertite may not yet be well-known in the general IT industry, I thought it would be helpful to explain exactly what it is and how it works. This post actually grew out of my notes for myself as I began setting it up and using it internally. Since the Windows and Mac OS X clients aren't ready yet, this howto guide assumes a Unix or Linux server (I used OpenBSD). Even if you only need to back up Windows files, you can install Services for NFS on Windows Server, set up OpenBSD on an old spare box or VM, and mount the NFS share on it.
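For example, once the Windows share is exported over NFS, mounting it on the OpenBSD box could look something like this (the hostname and paths here are made up for illustration):

  mount -t nfs winserver:/export/data /mnt/windata

You would then point your backups at /mnt/windata.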

Also, please note that cyphertite has a wiki, live chat and email support from Conformal, a forum, and man pages. Although it took me some time to grok everything, I had no problem getting help thanks to these resources. The developers were even kind enough to review this guide for accuracy. If you're not familiar with Conformal, you may also want to check out some of their other initiatives.

How it Works

Cyphertite behaves similarly to tar in that it archives files and uses similar flags. However, instead of operating on tarballs, you work with "ctfiles", which are not actually file archives but more like a representation of the files that have been encrypted and archived remotely. During the backup process, cyphertite:
  1. Breaks up every file into 256KB chunks.
  2. Records the chunk's SHA-1 hash in a local sqlite database.
  3. Compresses the chunk.
  4. Encrypts the chunk using AES-XTS.
  5. Records the encrypted chunk's SHA-1 hash in the localdb.
  6. Records metadata in a ctfile about the file being archived, including its path, permissions, and the list of hashes of the chunks that make it up.
  7. Archives the chunks at the remote backup destination. 
If a chunk's hash pair matches a previously uploaded chunk, it is not uploaded again, thus implementing "chunk-level" deduplication. Another benefit of the hashing is file integrity verification. So while the database is a collection of hashes of the chunks of your files that have been backed up, the ctfiles are necessary to recreate the files from chunks when you're performing a restore.

The ctfiles end in .ct and are generally stored remotely. However, since they can grow to a significant size, they are also cached locally, in what is referred to as the "metadata cache directory." It is possible to disable 'remote mode', in which case the ctfiles aren't stored on the remote server. This is called 'local mode': the ctfiles will not be backed up, so if you lose them, you will not be able to recover your data. In addition, you won't have the ability to cull (delete) old data. Since the ctfiles are encrypted with the same algorithm used to encrypt the chunks, Conformal doesn't have access to this metadata, so there is virtually no downside to remote mode.

When performing operations on a ctfile, you don't pass the full path of the ctfile, just its name. Cyphertite checks its .conf file for the path to the cache directory, or else uses the remote ctfile.
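To build some intuition for the first two steps, here is a rough sketch using standard Unix tools. This is purely illustrative (the file name is made up, and cyphertite does all of this internally, also compressing and encrypting each chunk before upload):

  # Split a file into 256KB chunks, then hash each chunk.
  # Identical chunks yield identical hashes, which is how
  # chunk-level deduplication detects data it already has.
  split -b 262144 /home/horace/bigfile chunk_
  sha1sum chunk_*        # on OpenBSD, use: sha1 chunk_*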

Getting Started  

First, download and install cyphertite, and complete initial setup. Now, let's run a full backup of our /home directory:
  ct -cvRf allhomefiles_1.ct /home

The ctfile will be generated and stored in the cache directory. Its name will be prepended with a timestamp, called a 'tag', so it will look like 20120330-143853-allhomefiles_1.ct. If you have the following line in your cyphertite.conf file:

  ctfile_remote_auto_differential = 1

running the same exact command again will create a new incremental ctfile with the same basic file name, but with a new tag, and with a vastly smaller file size. I recommend this for typical usage, as you will save yourself a lot of processing time by only calculating hashes and writing metadata for new or changed files. The new ctfile will thus only reference these new or changed files, and only chunks from these would be uploaded. Running the following command will force a new full or "Level 0" backup, which then creates a new full ctfile:

  ct -0vRf allhomefiles_1.ct /home
Since you are still referencing the original ctfile, cyphertite will reuse metadata from it and from all subsequent incremental ctfiles, and will only archive new chunks. The new ctfile, however, will contain all the metadata, not just an incremental amount.
Think of this as a full backup where cyphertite is smart enough to upload only what it needs. Remember that remotely, data is stored only as encrypted chunks, not in any sort of file or path structure. The new level 0 ctfile will reference all the chunks it needs to recreate every file you wanted to back up. So when the original level 0 ctfile and its subsequent incremental ctfiles are deleted, and the data is culled, everything associated with this new level 0 ctfile will remain intact.

Let's list all remote ctfiles (the -m flag stands for 'remote'):

  ct -mt

You can manually delete a ctfile (I'll show how this can be automated shortly). To remove a remote ctfile, check the output from the above command for the exact filename, and type:

  ct -mef 20120330-143853-allhomefiles_1.ct

Note that the -e flag stands for 'erase'. You can then delete the locally cached ctfile as well. To remove all the chunks of data associated with the deleted ctfiles that are not referenced by any other ctfiles, run:

  ctctl cull

This will also cull anything that doesn't meet the parameters stored in the following options in cyphertite.conf:

  ctfile_max_differentials = 29
  ctfile_cull_keep_days = 30

The first one, despite its name, will run 29 incremental backups, then force a level 0, then run 29 more incremental backups, and so on. Assuming a daily job, it will run a full backup once a month, with incrementals in between. The second line will auto-delete old ctfiles after the specified time period, along with their associated data (if unreferenced by other ctfiles). Our example above tells cyphertite to keep only 30 days' worth of ctfiles and data.
With the above two options set, you can effectively run cyphertite from a single, repeated command, and it will create new level 0 ctfiles and cull your old data.
In general, you should schedule 'ctctl cull' to run about as often as you run level 0 backups. Culling is an 'expensive' operation, as it must identify the hashes of all the blocks to keep. Since most data sets change little over time, culling daily usually costs more than it saves; weekly or monthly is typically enough.
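Putting it together, the relevant fragment of cyphertite.conf for this setup might look like the following. This is just the subset of options discussed in this guide (ctfile_differential_allfiles is covered in the restore section below); your real file will also contain the account, crypto, and cache-directory settings from initial setup:

  ctfile_remote_auto_differential = 1
  ctfile_max_differentials = 29
  ctfile_cull_keep_days = 30
  ctfile_differential_allfiles = 1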

Schedule It

Typical usage would be to run a regular incremental backup daily, with a level 0 backup monthly (configured by the ctfile_max_differentials option as noted above), and culling after the regular backup process is finished. Your crontab would look like this:

  # min hour DoM month DoW command
  0 0 * * * ct -cvRf allhomefiles_1.ct /home   # Daily incremental
  0 12 1 * * ctctl cull                        # Monthly cull of old ctfiles
  0 12 2 * * ctctl cull                        # Monthly cull of old data

Make sure you run the backup command manually the first couple of times, so that you know how long the operation takes to complete and can schedule the jobs so they don't overlap. This is critical, as both cyphertite and cull need access to the database.
With the above schedule, as long as you have properly set ctfile_max_differentials and ctfile_cull_keep_days, you will have regular full and incremental backups and will automatically delete old data.
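If you want extra insurance against overlapping runs, you can wrap each cron job in a simple lock script. This is a generic sketch of my own, not part of cyphertite, and the lock path is arbitrary:

  #!/bin/sh
  # ct-locked.sh -- run the given command only if no other
  # instance holds the lock; mkdir is atomic, so this is safe.
  LOCKDIR=/var/run/cyphertite.lock
  if mkdir "$LOCKDIR" 2>/dev/null; then
          trap 'rmdir "$LOCKDIR"' EXIT
          "$@"
  else
          echo "another cyphertite job is still running; skipping" >&2
          exit 1
  fi

The crontab entries then become, e.g., '0 0 * * * /usr/local/bin/ct-locked.sh ct -cvRf allhomefiles_1.ct /home'.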

Restoring Data

To restore the entire contents of our /home folder from the most recent backup to the root directory:

  ct -C / -xf allhomefiles_1.ct

One thing to note is that cyphertite mimics tar in its trailing-slash (un)awareness, unlike rsync. When archiving a directory, it doesn't matter whether you back up /home or /home/; either way, the parent directory is included, so keep this in mind when restoring. That is why I restore the /home backup to the root directory, so that I don't end up with /home/home/.

Note that the above command will restore the latest version of all the files that have ever been archived using that ctfile, including the original level 0 and all subsequent incrementals, even if those user files were later deleted. If you want to restore only the files referenced in the most recent backup, and not the ones that had been deleted prior to it, ensure you have this option set in your cyphertite.conf:

  ctfile_differential_allfiles = 1

To restore a specific file (or set of files) from a particular date to a directory named 'recovered', we can operate on a tagged ctfile and use a regular expression or glob:

  ct -C recovered -xf 20120330-162629-allhomefiles_1.ct '*important_file.xlsx'

The leading * is necessary because glob matching is done against the full path; without it, the pattern wouldn't match. (The quotes keep your shell from expanding the glob itself.)
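If you just need to confirm the exact archived path before restoring, ct's tar-like flags should let you list a ctfile's contents (check ct(1) to confirm the exact invocation on your version):

  ct -tf 20120330-162629-allhomefiles_1.ct

The listing shows the full paths as archived, which also explains why the leading * is needed in the glob above.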
If you don't remember the exact name of the file you want to recover, you can use the cyphertitefb command to enter a shell that lets you navigate a virtual filesystem as stored in a ctfile, and restore just the files you need, using standard Unix commands:
$ cyphertitefb allhomefiles_1.ct
ct_fb> cd /home/horace
ct_fb> ls
ct_fb> get important_file-new-version3.23-saved-final.xlsx
ct_fb> quit

The important file will have been copied to your local cyphertite server, where you can then present it back to the user. 
I hope you have found this introduction to this amazing tool useful.
