Just Backup Copies Are Not Sufficient Anymore
Every business needs backup copies of its primary datasets, and these copies are typically stored in a backup storage system. However, the most important reason to have backup copies is recovery. In a disaster-recovery (DR) scenario, what matters is how fast the applications can be recovered. If you always have to rehydrate datasets from a backup storage system to a primary storage system before you can power on VMs, recovery simply takes too long.
Ransomware is an acute type of DR scenario in which you need to experiment rapidly to find clean restore points and clean data across different backup copies. If this experimentation cannot be done fast enough, you are left with no choice but to pay the ransom. Hence, ransomware recovery, and more broadly disaster recovery, needs a new type of storage system that can both store backup copies and instantly run workloads directly from those copies, without going through a time-consuming rehydration process. Additionally, the backup storage system needs to save copies as immutable objects so that they cannot be corrupted, whether accidentally or maliciously.
Ransomware Recovery Requirements
Here is what is expected from a backup storage system that can help with speedy ransomware recovery after a malware infection has been detected in the primary datasets.
- Time travel – deep history of copies
Ransomware can linger for weeks or months, slowly encrypting the primary datasets. The backup storage system needs to store enough backup copies across time to recover from.
- Instant VM power-on – for rapid experimentation
It is not always possible to pinpoint which backup copies are infected. So, we need to power applications on and off from different copies to check which applications are good, bad, or partially good. This is an iterative process because different malware strains infect systems differently. This rapid experimentation is not possible without instant VM power-on.
- Immutable backups
Some backup storage systems expose all their data via an open NFS protocol. This is dangerous because backups can be mutated, and malware can easily encrypt all the backup copies.
- Storage of last resort
People resort to recovering from the backup storage system when all else has failed. If the backup copies themselves are corrupted because of a software bug in the system (not malware), then there is no recourse. Therefore, it is especially important to ensure that backup copies are available and in good shape during an emergency.
- Cost efficiency
Of course, the storage system must be cost efficient in steady-state to fit the budget.
These are some of the important properties needed from a backup storage system to address ransomware recovery requirements. Traditional storage systems satisfy some of these requirements, but not all. We need a new filesystem design to tackle all these needs. There are other requirements beyond just the storage system, but we will focus on the storage system in this article.
A New Multi-purpose Cloud Filesystem
We have created a new cloud filesystem, deployed as part of VMware Cloud Disaster Recovery, to solve the above challenges. It is challenging to create a storage system that stores backup copies at low cost in steady state and can also instantly run workloads when needed. It is almost like having two different storage systems (one for storing backup copies and one for running workloads) merged into one, with data immutability on top. We have solved this puzzle and other challenges by using a multitude of techniques to create a new multi-purpose Scale-out Cloud Filesystem (SCFS), which is described below.
Scale-out Cloud Filesystem Architecture
2-Tier Design
SCFS has a 2-tier design: we use EC2 with local NVMe for IO performance (cache tier), and we use S3 to store all data (capacity tier). (Note: SCFS runs on AWS today, but we plan to support other cloud platforms in the future.) The 2 tiers let us scale performance and capacity independently. A small cache tier is sufficient for handling incoming backup data in steady-state (i.e., backup-mode) and keeps the cost low. When needed (i.e., in recovery-mode), the cache tier can be expanded on demand to a much bigger size so that we can run workloads directly from the filesystem. The 2-tier design allows SCFS to easily switch between backup-mode and recovery-mode.
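As a rough illustration of how the two modes differ only in the cache tier, here is a minimal Python sketch. The names `Mode`, `CacheTier`, and `cache_tier_for`, and all node counts and sizes, are assumptions made up for this example; the article does not give real sizing numbers or describe SCFS's actual configuration.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    BACKUP = "backup"      # steady state: ingest backups cheaply
    RECOVERY = "recovery"  # emergency: run workloads from the filesystem


@dataclass(frozen=True)
class CacheTier:
    """Hypothetical description of the EC2/NVMe cache tier."""
    nodes: int
    nvme_gb_per_node: int

    @property
    def total_cache_gb(self) -> int:
        return self.nodes * self.nvme_gb_per_node


# Illustrative sizes only (assumed, not taken from the product).
CACHE_PROFILES = {
    Mode.BACKUP: CacheTier(nodes=1, nvme_gb_per_node=500),
    Mode.RECOVERY: CacheTier(nodes=8, nvme_gb_per_node=2000),
}


def cache_tier_for(mode: Mode) -> CacheTier:
    """Pick the cache tier for a mode; the S3 capacity tier is unaffected."""
    return CACHE_PROFILES[mode]
```

The point of the sketch is that switching modes changes only the cache profile. The data in the capacity tier stays where it is, which is why no rehydration step is needed.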
Log-Structured Filesystem (LFS)
The idea of the log-structured filesystem was described in a 1992 paper by Mendel Rosenblum, who is also a co-founder of VMware, and John Ousterhout. The crux of the idea is to append incoming data to a sequential log and do the garbage collection later. Most storage back ends (HDD, SSD, S3) are not good at random writes, but all are very good at sequential writes. This makes LFS a good basis for designing backup storage, primary storage, and even an OLTP database.
We employ LFS techniques to store data in S3. As shown in Figure 1, all incoming backup data is converted into large sequential segments of about 10 MB, and these segments are stored as S3 objects. S3 is excellent at large sequential IOs, which allows backups to be stored in S3 at high speed.
A big additional benefit of LFS is that all data is appended to a log, and existing data is never overwritten. One of the biggest dangers to stored OLD backup data is incoming NEW backup data. Typical storage systems write new data to random empty locations on disk (think Swiss-cheese holes). The problem is that new backups may then overwrite blocks on disk that contain old backups, for example because of a software bug. If the latest backups are infected with ransomware, that is a very bad situation: you have lost the old backups, and the new backups are infected.
SCFS uses the LFS design to completely eliminate this overwrite problem. All new incoming data “always” goes to new locations (because it is a log). By design, there is never any danger of overwriting blocks containing old backups.
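The append-only behavior can be sketched in a few lines of Python. This is an illustrative toy, not SCFS code: the class `SegmentLog`, its methods, and the key format are assumptions, and a plain dict stands in for S3. It packs incoming writes into segments and stores each sealed segment under a new key, so earlier segments are never rewritten.

```python
from dataclasses import dataclass, field

SEGMENT_SIZE = 10 * 1024 * 1024  # ~10 MB, the segment size mentioned above


@dataclass
class SegmentLog:
    """Append-only log that packs writes into segments (toy sketch).

    `object_store` is a plain dict standing in for S3 (key -> bytes).
    """
    segment_size: int = SEGMENT_SIZE
    object_store: dict = field(default_factory=dict)
    _buffer: bytearray = field(default_factory=bytearray)
    _next_segment: int = 0

    def append(self, data: bytes) -> tuple:
        """Append data and return its (segment_id, offset, length) location."""
        # Seal the open segment first if this write would not fit.
        # A write larger than one segment becomes an oversized segment here.
        if self._buffer and len(self._buffer) + len(data) > self.segment_size:
            self.flush()
        location = (self._next_segment, len(self._buffer), len(data))
        self._buffer.extend(data)
        return location

    def flush(self) -> None:
        """Seal the open segment and write it as one new object."""
        if not self._buffer:
            return
        key = f"segment-{self._next_segment:08d}"
        assert key not in self.object_store  # sealed segments are never rewritten
        self.object_store[key] = bytes(self._buffer)
        self._buffer = bytearray()
        self._next_segment += 1

    def read(self, location: tuple) -> bytes:
        """Read data back from a sealed segment or from the open one."""
        segment_id, offset, length = location
        if segment_id == self._next_segment:
            segment = self._buffer
        else:
            segment = self.object_store[f"segment-{segment_id:08d}"]
        return bytes(segment[offset:offset + length])
```

Because every flush uses a fresh key, a newer backup has no code path that touches the bytes of an older one. Space is reclaimed later by garbage collection, which this sketch leaves out.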
LFS is like a Christmas tree and has many more amazing gifts (too many to cover in this article). The combination of LFS and the 2-tier design is what makes SCFS a multi-purpose filesystem.
Immutability
A filesystem also needs metadata (pointers) to remember where all the data blocks are stored. We use content-based crypto-hashes as pointers for data blocks. A content-based hash is derived from the data itself, so a block cannot be changed without its hash changing too. Each backup is represented by a tree of crypto-hashes, and the root of the tree is also a crypto-hash (see the use of Merkle trees in blockchains). All of this makes each backup immutable, and these backups are hidden and not directly accessible to the outside world. Even for recovery, a backup is never used directly. We clone the needed backup into a new object for recovery purposes, which leaves the original backup copy untouched. The clone operation is instantaneous no matter how many backups are being cloned or how big they are.
Additionally, SCFS checks the data integrity of each backup copy every single day. The goal is to ensure that your backup copies are ready for use when you need them in a ransomware-attack emergency.
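To make the hash-tree idea concrete, here is a minimal Python sketch of a content-addressed store with a Merkle-style tree on top. It is an illustration under assumptions, not the SCFS on-disk format: SHA-256, the `leaf:`/`node:` prefixes, and the names `ContentStore`, `build_tree`, `read_blocks`, and `clone` are all invented for this example.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


class ContentStore:
    """Maps hash -> bytes. A key is derived from its content, so entries never change."""

    def __init__(self):
        self._objects = {}

    def put(self, data: bytes) -> str:
        key = sha256_hex(data)
        self._objects.setdefault(key, data)  # same content maps to the same key
        return key

    def get(self, key: str) -> bytes:
        return self._objects[key]


def build_tree(store: ContentStore, blocks: list) -> str:
    """Store data blocks and interior nodes; return the root hash of the backup."""
    if not blocks:
        raise ValueError("a backup needs at least one block in this sketch")
    level = [store.put(b"leaf:" + block) for block in blocks]
    while len(level) > 1:
        pairs = [level[i:i + 2] for i in range(0, len(level), 2)]
        # An interior node is just the list of its children's hashes.
        level = [store.put(b"node:" + ",".join(pair).encode()) for pair in pairs]
    return level[0]


def read_blocks(store: ContentStore, root: str) -> list:
    """Walk the tree from a root hash and return the data blocks in order."""
    data = store.get(root)
    if data.startswith(b"leaf:"):
        return [data[len(b"leaf:"):]]
    children = data[len(b"node:"):].decode().split(",")
    return [block for child in children for block in read_blocks(store, child)]


def clone(catalog: dict, source: str, target: str) -> None:
    """Clone a backup by copying one root pointer; no block data is copied."""
    catalog[target] = catalog[source]
```

In this sketch a clone costs the same regardless of backup size, because it copies a single root hash. Writing changed data through a clone would produce new blocks and a new root, while the original root and everything reachable from it stay as they were.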
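The article does not describe how the daily check is implemented. One plausible approach, assuming blocks are keyed by a SHA-256 hash of their content, is to re-hash every stored block and compare the result with its key. The function name `find_corrupt_blocks` and the dict-based store are hypothetical.

```python
import hashlib


def find_corrupt_blocks(blocks: dict) -> list:
    """Return the keys whose stored bytes no longer hash to the key.

    `blocks` maps a SHA-256 hex digest to the bytes stored under it.
    """
    return sorted(
        key for key, data in blocks.items()
        if hashlib.sha256(data).hexdigest() != key
    )
```

An empty result means every block still matches its pointer. Any key that is returned identifies a block that was damaged after it was written, whatever the cause.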
Outcomes
Let’s see how SCFS addresses the ransomware recovery requirements:
- Time travel – deep history of copies
Each SCFS instance can store a million backups efficiently, so going back in time is easy.
- Instant VM power-on – for rapid experimentation
The 2-tier architecture allows SCFS to temporarily go from “backup-mode” to “recovery-mode”, where we launch bigger EC2 compute instances with bigger caches to get the higher performance needed to run the workloads. Additionally, SCFS is live-mounted to VMware Cloud clusters as a secure and hidden NFS datastore. Recovering from backups is as simple as cloning the backup copies, registering the VMs in vCenter, and powering on the VMs (these steps are fully automated by our software). This rapid cloning and power-on allows customers to quickly bring up VMs, experiment, power off, and try other recovery points until the issue is resolved.
- Immutable backups
There are a few things SCFS does to protect backups from damage:
  - Each backup is represented by a set of content-based crypto-hash trees (Merkle trees), and this makes each backup immutable.
  - All backups are hidden from the normal access modes (such as NFS). Even viewing backups indirectly via the UI requires multi-factor authentication.
  - A backup’s data cannot be changed. A backup must first be cloned if there is a need to use that data or modify it.
  - VMware Cloud DR also supports multi-factor authentication, role-based access controls, and different administrator domains to protect against ransomware gaining DR administrator privileges.
- Storage of last resort
SCFS uses LFS to eliminate the danger of new backups accidentally overwriting an old backup copy (e.g., due to a software bug). Additionally, all data is verified every day to ensure that your backup copies are ready for use when you actually need them in a ransomware-attack emergency.
- Cost efficiency
SCFS uses many techniques to keep the costs down in steady-state:
  - Using S3 for storing data is cost-efficient, and it gets better with data reduction applied on top of it.
  - In steady-state “backup-mode”, SCFS uses very few EC2 caching resources to keep the costs low and only launches more resources in the rare “recovery-mode” situations. The 2-tier architecture makes this possible.
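The experimentation loop described under instant VM power-on (clone a copy, power on, inspect, power off, try another recovery point) can be sketched as a simple search. This is an assumed illustration, not the product's automation: `newest_clean_restore_point` is a made-up name, and `is_clean` is a hypothetical callback standing in for the whole clone, power-on, and inspection cycle.

```python
from typing import Callable, Optional, Sequence


def newest_clean_restore_point(
    restore_points: Sequence[str],
    is_clean: Callable[[str], bool],
) -> Optional[str]:
    """Walk restore points from newest to oldest and return the first clean one.

    `restore_points` is ordered oldest -> newest. `is_clean` is a hypothetical
    callback for "clone the copy, power on the VM, inspect it, power it off".
    """
    for point in reversed(restore_points):
        if is_clean(point):
            return point
    return None  # no clean copy found in the retained history
```

The sketch scans linearly instead of bisecting because a copy can be good, bad, or partially good, so infection status is not guaranteed to change at a single point in time. Each call to `is_clean` is one experiment, which is why the cost of a single clone and power-on dominates the total recovery time.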
Get Rapid Ransomware Recovery
The flexibility provided by the 2-tier design and the data immutability provided by the LFS techniques and crypto-hashes make the Scale-out Cloud Filesystem very well suited for rapid ransomware recovery. It’s a core component of VMware Cloud Disaster Recovery.
With estimates that an organization is attacked by ransomware every 11 seconds in 2021[1], now is the time to improve your ransomware recovery capabilities.
- Check out the VMware Cloud Disaster Recovery step-by-step guide to get started on your DR and ransomware recovery journey
- Utilize the new credit card online purchasing option to get hands-on experience with VMware Cloud Disaster Recovery
- Reach out to your VMware partner or VMware salesperson if you have further questions
[1] https://cybersecurityventures.com/cybercrime-damage-costs-10-trillion-by-2025/