Snapshots are kind of like Cousin Eddie from National Lampoon’s Christmas Vacation. They’re not the Clark Griswold that backups are, but they have their charm. Snapshots are primitive, utilitarian creatures: easy to use and done in the blink of an eye. Often called the “poor man’s backup”, snapshots can save you in a pinch and cost less than backups but, like Cousin Eddie, they come with some glaring flaws despite our awkward love for them. Backups offer a marvelous range of features, including Continuous Data Protection (“CDP”), replication between sites, application awareness/synchronization, and much more. So when do you choose one or the other? Or both? Let’s explore.
What’s a Snapshot?
A snapshot is a non-application-aware feature built into most modern storage arrays, or implemented on top of local disk as part of the operating system (e.g. using Logical Volume Manager in Linux). Snapshots happen in a fraction of a second (“blink of an eye”): all existing data within the scope of the snapshot (usually a volume, although some snapshot mechanisms work at the file level), which is otherwise read/write, is marked read-only. This is often referred to as a “point-in-time copy”. All new writes to the disk are saved separately. This is often loosely referred to as “copy-on-write”, although strictly speaking copy-on-write copies the original block aside before it is overwritten, and sending new writes to a separate area, as described here, is usually called “redirect-on-write”. Either way, the point-in-time data is preserved.
With each subsequent snapshot, the current write area of the disk is marked read-only and a new write area is started. If a portion of data to be read has not changed, it comes from the read-only section of the disk; if it has changed, it comes from the newest write area that contains it. Using the scheduling mechanisms in your storage array or operating system (such as cron in Linux), you can soon have a series of snapshots at the interval of your choosing (daily, weekly, etc.).
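The read path through a snapshot chain can be sketched with a toy in-memory model. This is only an illustration of the idea described above, not how any particular storage array or LVM implements it; the class and method names (`SnapshotVolume`, `read_at`, and so on) are made up for the example.

```python
class SnapshotVolume:
    """Toy model: a chain of read-only layers plus one live write area."""

    def __init__(self):
        self.frozen = []   # read-only layers, oldest first
        self.active = {}   # live write area: block number -> data

    def write(self, block, data):
        # New writes never touch the frozen layers.
        self.active[block] = data

    def snapshot(self):
        """Freeze the current write area and start a new one."""
        self.frozen.append(self.active)
        self.active = {}
        return len(self.frozen) - 1   # id of the snapshot just taken

    def read(self, block):
        """Return the newest version of a block, searching newest layer first."""
        for layer in [self.active, *reversed(self.frozen)]:
            if block in layer:
                return layer[block]
        return None   # block was never written

    def read_at(self, snap_id, block):
        """Return the block as it looked when snapshot `snap_id` was taken."""
        for layer in reversed(self.frozen[: snap_id + 1]):
            if block in layer:
                return layer[block]
        return None


vol = SnapshotVolume()
vol.write(0, "A1")
vol.write(1, "B1")
monday = vol.snapshot()          # point-in-time copy
vol.write(0, "A2")               # lands in the new write area

print(vol.read(0))               # A2 (changed since the snapshot)
print(vol.read(1))               # B1 (unchanged, served from the read-only layer)
print(vol.read_at(monday, 0))    # A1 (the point-in-time view)
```

Note that taking the snapshot copies no data at all, which is why it completes so quickly regardless of volume size.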
However, a snapshot is non-application-aware, meaning that transactions in progress are not captured, only what had made it to disk just before the snapshot was taken. As such, databases and other applications that require a “consistent state” may not restore properly when you need them in the future. The restored snapshot copy of a database or other “consistent state” application may not start, or may be corrupt and require extensive repair. Sometimes it will work just fine. It’s a roll of the dice, highly dependent on the exact moment the snapshot occurred and what the database was doing at that moment. Nonetheless, once you have snapshots, various features can be leveraged, such as “mounting”, “rollback”, or “consolidation” of snapshots.
Snapshot Features: Mounting
Snapshots taken at a volume level can be “mounted”, or attached as a separate accessible volume alongside the production one they underpin. If you had deleted a file or corrupted a virtual machine (“VM”), you could just copy from your mounted snapshot volume, taken at a date/time before the deletion or corruption, over to your production volume and voila! Back in business. You could also use the read-only mount of a given volume to run reports or do other work where you don’t want a change to occur but want to access the data from that point in time. Sometimes backups team up with snapshots, mounting a read-only volume and then backing it up. This is popular with KVM hypervisors, for example (using Amanda, Bacula and other backup software), where there is no backup API available like there is in VMware or Hyper-V, or anytime you want to access production data without interfering with it. The downside is that there’s no “changed block” awareness with read-only mounted snapshot volumes, so you’re backing up everything, every time. Backup software either works with built-in vendor mechanisms or implements its own changed block driver to compensate, which makes real backup software much more efficient, granular and easy to manage than mere snapshots, and spares you the manual grind of even a simple restore, where you copy from the past to the present using snapshot mounted volumes.
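To see why changed block awareness matters, here is a minimal sketch contrasting a full copy of a mounted snapshot with an incremental pass driven by a list of changed blocks. It is a simplified illustration under the assumption that every write can be observed; real changed block drivers and vendor APIs work at a much lower level, and `TrackedVolume` is a hypothetical name.

```python
class TrackedVolume:
    """Toy volume that remembers which blocks changed since the last backup."""

    def __init__(self, blocks):
        self.blocks = list(blocks)
        self.dirty = set()          # block numbers written since last backup

    def write(self, index, data):
        self.blocks[index] = data
        self.dirty.add(index)       # the "changed block" bookkeeping

    def full_backup(self):
        """What a mounted-snapshot copy does: read every block, every time."""
        return dict(enumerate(self.blocks))

    def incremental_backup(self):
        """Read only blocks changed since the previous backup pass."""
        changed = {i: self.blocks[i] for i in sorted(self.dirty)}
        self.dirty.clear()
        return changed


vol = TrackedVolume(["zero"] * 1000)
vol.write(7, "new data")
vol.write(42, "more new data")

print(len(vol.full_backup()))         # 1000 blocks read
print(len(vol.incremental_backup()))  # 2 blocks read
```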
Snapshot Features: Rollback
A rollback command causes the writes within the scope of the command to be abandoned in favor of an earlier read-only copy of the data, which is now made read/write. Depending on your snapshot implementation, this can be done on a per-volume, per-file or per-VM basis. This also happens in the blink of an eye, the caveat being that when done on a per-volume basis, both unwanted and wanted changes to the data may be lost as the volume “rolls back” to the prior point in time. This “mixed result” is why snapshot mounting is so popular: it lets you “restore only what you need”, though it is also hugely time-consuming and administratively burdensome.
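The “mixed result” of a per-volume rollback is easy to see in a small model. The sketch below treats a volume as a list of layers (oldest first, with the live write area last); the function names and file names are hypothetical and chosen only for illustration.

```python
def rollback(layers, snap_index):
    """Return the chain rolled back to snapshot `snap_index`.

    `layers` is a list of dicts, oldest first; the last dict is the live
    write area. Everything written after the chosen snapshot is discarded
    and a fresh, empty write area is placed on top.
    """
    if not 0 <= snap_index < len(layers) - 1:
        raise IndexError("no such snapshot")
    return layers[: snap_index + 1] + [{}]


def view(layers):
    """Flatten a chain into what a reader of the volume would see."""
    merged = {}
    for layer in layers:            # oldest first, so newer layers win
        merged.update(layer)
    return merged


chain = [
    {"report.doc": "v1", "db.vmdk": "healthy"},        # snapshot 0
    {"db.vmdk": "corrupted", "invoice.xls": "new"},    # writes since snapshot 0
]

restored = rollback(chain, 0)
print(view(restored))
# {'report.doc': 'v1', 'db.vmdk': 'healthy'}
```

The corrupted VM disk is healthy again, but `invoice.xls`, a change you wanted to keep, is gone along with it.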
Snapshot Features: Consolidation
A consolidation command takes the series of snapshots specified in the scope of the command and merges them together. This is a necessary reality, since there is a danger of corruption, data loss or other issues if snapshot chains get too long, particularly with large or rapidly changing datasets. Additionally, long snapshot chains result in poor performance, which impacts applications. Consolidation is therefore a necessary evil: it impacts performance while it runs, and it may itself fail or cause corruption, resulting in data loss. However, that risk is generally limited to operating system implementations, since storage array vendors do a very good job of providing solid implementations. But nothing is perfect, and since you’re playing with one copy of your data (albeit saved and marked in snapshot chains with varying functionality for use and restores), backups are king for a reason, not the least of which is the peace of mind of having a second copy of your data somewhere other than your production array.
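Conceptually, consolidation merges adjacent layers of the chain so that reads have fewer layers to search, at the cost of losing the intermediate points in time. A minimal sketch, again using a hypothetical list-of-dicts model rather than any real array’s on-disk format:

```python
def consolidate(layers, first, last):
    """Merge layers[first..last] (inclusive) into a single layer.

    `layers` is a list of dicts, oldest first. Within the merged range the
    newest version of each block wins.
    """
    if not 0 <= first <= last < len(layers):
        raise IndexError("range outside the snapshot chain")
    merged = {}
    for layer in layers[first : last + 1]:   # oldest first, newer overwrites
        merged.update(layer)
    return layers[:first] + [merged] + layers[last + 1 :]


def view(layers):
    """Flatten a chain into the data a reader would currently see."""
    merged = {}
    for layer in layers:
        merged.update(layer)
    return merged


chain = [{"a": 1, "b": 1}, {"a": 2}, {"c": 3}, {"b": 4}]
shorter = consolidate(chain, 0, 2)

print(len(chain), "->", len(shorter))   # 4 -> 2
print(view(chain) == view(shorter))     # True: current data is unchanged
```

The current view of the data is identical before and after, but the two intermediate restore points inside the merged range no longer exist.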
Many people I know use both snapshots and backups simply because they do different things that meet different needs, and where they overlap, that isn’t a bad thing; it’s just extra peace of mind. The important thing is knowing where each fits. If you have databases, snapshots alone won’t protect them, since an application-aware backup is required. Conversely, if your operating system or hypervisor doesn’t have changed block APIs and/or functionality, then a mounted snapshot may be the only way to back up such data, in which case snapshots work hand-in-hand with an additional backup mechanism, or for cost reasons may be the only line of defense. Most storage systems implementing snapshots also have a replication capability (sometimes referred to as a “snap-vault” and/or a standalone replication feature), something that is built into most modern backup software too, since disaster recovery (DR) is paramount to business operations.
And hey, even cousin Eddie knew when to use backups over snapshots — or both!
Snapshots: Best Practices
- Limit your snapshot frequency: keep the interval between snapshots to between 4 hours and 1 day. Snapshot too often and performance will suffer.
- Limit your snapshot chains: keep between 1 and 10 days of snapshots depending on the volume and rate of change of your data. If you have a lot of change or mass volumes of change daily then even shorter windows are advised, after which snapshots should be consolidated.
- Don’t depend on snapshots as a backup mechanism: particularly true for databases and other “consistent state” applications. If you do, it’s a question of when, not if, you will suffer the consequences of data loss, corruption, and/or repair.
- Trust but verify: mount a snapshot, copy some data elsewhere and analyze it. Make sure snapshots are actually working.
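The “trust but verify” step can be partly automated. The sketch below assumes you have already mounted a snapshot and copied some of its data elsewhere; it then compares checksums file by file. The function names and the example paths in the comment are hypothetical, and this only proves the copy is readable and matches the snapshot, not that an application such as a database will start from it.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Checksum a file in chunks so large files don't fill memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(snapshot_root, copy_root):
    """Return relative paths that are missing from, or differ in, copy_root."""
    snapshot_root, copy_root = Path(snapshot_root), Path(copy_root)
    problems = []
    for src in snapshot_root.rglob("*"):
        if not src.is_file():
            continue
        rel = src.relative_to(snapshot_root)
        dst = copy_root / rel
        if not dst.is_file() or sha256_of(src) != sha256_of(dst):
            problems.append(str(rel))
    return sorted(problems)


# Example call (both paths are placeholders for your own mount points):
#   bad = find_mismatches("/mnt/snap-monday", "/restore-test")
#   print("OK" if not bad else f"{len(bad)} files differ: {bad}")
```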
Snapshots: Pros
- Easy to configure
- Low cost
- Fast execution
Snapshots: Cons
- Only one copy of the data
- Doesn’t work with databases and applications requiring a “consistent state”
- Prone to data loss, corruption or repair if mismanaged, or when bugs in operating system level implementations occur
Backups: Pros
- More than one copy of your data
- Offsite copies of the data via replication to a second repository
- Performance impact limited by using integrated or installed changed block drivers
- 15-minute granular “backup passes” to scan for changed data
- Easy self-service restore to any of those granular points in time
- Ability to restore an entire volume, file or virtual machine as needed
- Easy to do multiple backups/restores simultaneously
- Best-in-class GUI, CLI and API interfaces
- Ability to integrate with applications, like databases and others, for which snapshots cannot guarantee a consistent state
Backups: Cons
- Higher cost due to multiple copies of the dataset and/or cost of bandwidth for offsite replication
- More complex to install initially but easier to operate long term
- May require a per-VM agent to be installed, with a reboot required, to gain changed block driver functionality