Mirage and MongoDB, Part 3: Deep Dive
By Yan Aksenfeld, Member of Technical Staff, VMware
This 4-part series discusses the addition of MongoDB to Mirage.
Part 1 introduced the benefits of adding MongoDB to Mirage to enhance the performance of Mirage when working with a large number of small files.
Part 2 discussed new MongoDB components in Mirage, designing Mirage for high availability, and MongoDB considerations of Mirage installations and upgrades.
This blog post takes a closer look at the underlying technology behind MongoDB with Mirage.
Terminology for This Blog Post
Following are definitions of several technical terms used in this blog post:
- Volume – A logical area of storage on one or more hard disks.
- Deduplicated – Data deduplication is a technique used for efficiently storing copies of repeated data so that the duplicated data itself is not written to a volume more than once.
- SIS volume – A volume that makes use of single-instance storage (SIS). SIS is a method of implementing data deduplication. When SIS is used to write data to a volume, each unique set of data is written once, and further instances are replaced with a small reference that points to the already stored version.
- Reference counter – Used in the deduplication process on a SIS volume. The reference counter keeps track of the number of references to the unique data. The maximum number of references is 255, meaning that if a set of data is referenced 500 times, the reference counter limit is surpassed, and the data will be stored on the volume twice.
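The reference-counter limit can be illustrated with a short calculation. This is a minimal sketch, assuming each stored copy absorbs up to 255 references before another copy must be written; the function name `copies_needed` is hypothetical and is not part of Mirage.

```python
MAX_REFERENCES = 255  # per-copy reference limit stated in the definition above


def copies_needed(reference_count: int, limit: int = MAX_REFERENCES) -> int:
    """Return how many physical copies a SIS volume would hold for the given
    number of references, assuming each copy can take up to `limit` references."""
    if reference_count <= 0:
        return 0
    # Integer ceiling division: one extra copy for every `limit` references.
    return (reference_count + limit - 1) // limit


print(copies_needed(255))  # 1
print(copies_needed(256))  # 2
print(copies_needed(500))  # 2, matching the example in the definition
```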
MongoDB and Small-File Deduplication
Storing files in a SIS volume results in additional input/output operations per second (IOPS) to ascertain and update the reference counter of each file accessed. It also means having to store another copy of unique data every time the reference counter limit is reached. Storing small files in a MongoDB database in Mirage instead of in a SIS volume addresses these performance issues.
The MongoDB database uses a very fast, indexed table of key-value entries. Each small file is stored as a record in the database, where the record’s key is the unique signature of the file, and the record’s value is the content of the file. As described in Part 1, each file saved to the MongoDB database is small, under 64 KB, making it possible to store all of these files in a database.
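The key-value layout described above can be sketched with an in-memory dictionary. This is an illustration only, not Mirage's implementation: the class `SmallFileStore` is hypothetical, and SHA-256 is an assumed stand-in for the file signature, because this post does not say which signature algorithm Mirage uses.

```python
import hashlib

SMALL_FILE_LIMIT = 64 * 1024  # "under 64 KB", per the text


class SmallFileStore:
    """Toy key-value store: key = file signature, value = file content."""

    def __init__(self) -> None:
        self._records: dict[str, bytes] = {}

    def put(self, content: bytes) -> str:
        if len(content) >= SMALL_FILE_LIMIT:
            raise ValueError("only files under 64 KB belong in this store")
        # Assumption: SHA-256 stands in for Mirage's unique file signature.
        signature = hashlib.sha256(content).hexdigest()
        # Identical content maps to the same key, so it is stored only once.
        self._records.setdefault(signature, content)
        return signature

    def get(self, signature: str) -> bytes:
        return self._records[signature]

    def __len__(self) -> int:
        return len(self._records)


store = SmallFileStore()
key_a = store.put(b"[settings]\nmode=fast\n")
key_b = store.put(b"[settings]\nmode=fast\n")  # same content, different endpoint
print(key_a == key_b)  # True
print(len(store))      # 1
```

Because the key is derived from the content, looking up a file and detecting a duplicate are the same indexed operation.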
Unlike traditional volume-based deduplication, which removes duplicates only within a single volume, MongoDB deduplicates files system-wide. This saves space because Mirage keeps only a single copy of each small file: a small file never appears more than once in an entire Mirage implementation.
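The difference between per-volume and system-wide deduplication can be shown by counting stored copies. This is a rough sketch with made-up data; the volume names and the two helper functions are hypothetical.

```python
def copies_per_volume(volumes: dict[str, list[bytes]]) -> int:
    """Each volume deduplicates independently, so a file that exists on
    two volumes is stored twice."""
    return sum(len(set(files)) for files in volumes.values())


def copies_system_wide(volumes: dict[str, list[bytes]]) -> int:
    """One shared store deduplicates across every volume."""
    return len(set().union(*volumes.values()))


volumes = {
    "volume1": [b"a", b"b", b"a"],
    "volume2": [b"a", b"c"],
}
print(copies_per_volume(volumes))   # 4
print(copies_system_wide(volumes))  # 3
```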
When Small Files Are Stored in MongoDB
Small files are stored in MongoDB during:
- The centralization of new CVDs (image backup).
- Steady-state (incremental) uploads.
- Base or app layer captures, downloads, and updates.
- Driver library imports.
- The restore process.
- The flattening process for Mirage snapshots.
- CVD integrity checks.
Mirage Performance with MongoDB
MongoDB improves the performance of Mirage because small files are stored in a highly deduplicated, high-performance database that keeps a substantial number of the frequently accessed files in RAM. Download operations (reads) consume less disk I/O because small files are read from cache. Upload operations (writes) to Mirage volumes are reduced by 50 to 80 percent, sometimes reaching a ten-fold improvement in performance, when small files are written to MongoDB. Because MongoDB reduces the need to access Mirage storage volumes, access times are faster.
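To compare the percentage figures with the ten-fold figure, it can help to put them on the same scale. The sketch below is plain arithmetic; it assumes that "ten-fold" refers to a ten-fold reduction in writes, which is an interpretation and not something the measurements above state explicitly. Both function names are hypothetical.

```python
def reduction_factor(percent_reduced: float) -> float:
    """Convert 'writes reduced by N percent' into an 'N-fold fewer writes' factor."""
    if not 0 <= percent_reduced < 100:
        raise ValueError("percent_reduced must be in the range [0, 100)")
    return 100 / (100 - percent_reduced)


def percent_reduced(factor: float) -> float:
    """Convert an 'N-fold fewer writes' factor into a percentage reduction."""
    if factor < 1:
        raise ValueError("factor must be at least 1")
    return 100 - 100 / factor


print(reduction_factor(50))  # 2.0  (half the writes)
print(reduction_factor(80))  # 5.0  (one fifth of the writes)
print(percent_reduced(10))   # 90.0 (one tenth of the writes)
```

Under that assumption, the 50 to 80 percent range corresponds to 2 to 5 times fewer writes, and the ten-fold case corresponds to a 90 percent reduction.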
Upload times are also improved because better deduplication rates decrease disk write times. Note that this improvement is noticeable only after MongoDB is populated with a substantial number of files. Populating the database can take a couple of weeks from the date of a Mirage upgrade or fresh installation.
During this initial stage of writing data into MongoDB, the system does not show a decrease in the number of IOPS or an increase in upload speed. The write I/O rate of the disks where MongoDB is located is higher during this stage.
After most files are in MongoDB, and the database size has reached a steady state, the I/O rate to the disk will decrease, and performance will be substantially improved.
MongoCacheSizeGB is an important configuration parameter that was introduced in Mirage 5.6. It allows you to set the amount of memory dedicated to the MongoDB service. Large implementations (above 5,000 CVDs) might benefit from increasing the default value of 4 GB memory. This will increase the performance of MongoDB on large systems as more files are cached in memory. If you notice the MongoDB service constantly using 4 GB of RAM, consider testing with a larger amount of cache to improve performance.
Note: Increase this value only if the service routinely reaches the default 4 GB memory usage limit.
To set the parameter, use the Mirage CLI on the Mirage Management server:
- Open a command prompt.
- Run the command "C:\Program Files\Wanova\Mirage Management Server\Wanova.Server.Cli.exe" localhost. (The quotation marks are needed because the path contains spaces.)
- Run the command setConfigParam MongoCacheSizeGB <NEW_VALUE>, where <NEW_VALUE> is replaced by a number representing the new memory cache size in GB.
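If the new value is chosen by a script, a small helper can validate it before it is typed into the Mirage CLI. This is a sketch only: the function `build_set_cache_command` is hypothetical, it only builds the command text shown in the last step above, and it assumes the value must be a whole number of gigabytes.

```python
DEFAULT_CACHE_GB = 4  # default value mentioned in the text


def build_set_cache_command(new_value_gb: int) -> str:
    """Return the Mirage CLI command text for a new MongoCacheSizeGB value.

    Assumption: the value is a positive whole number of GB.
    """
    if isinstance(new_value_gb, bool) or not isinstance(new_value_gb, int):
        raise TypeError("cache size must be an integer number of GB")
    if new_value_gb <= 0:
        raise ValueError("cache size must be positive")
    return f"setConfigParam MongoCacheSizeGB {new_value_gb}"


print(build_set_cache_command(8))  # setConfigParam MongoCacheSizeGB 8
```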
As we discussed in Part 2, multiple Mirage Management servers are recommended so that you have a highly redundant MongoDB database. As a reminder, each Mirage Management server runs a separate MongoDB replica node.
When you install a Mirage Management server, it will be part of a MongoDB replica set called MirageRS. This replica set is the component protecting the MongoDB database, and is responsible for keeping all redundant MongoDB nodes up to date. The replica set and its components are created as follows:
- When adding a single Mirage Management server, the replica set is created and the first MongoDB node is added to it. This node is automatically marked as the primary node.
- When adding a second Mirage Management server, a second node is added to the replica set and marked as secondary, and the Arbiter service is enabled. (Each additional Mirage Management server is added as a secondary node.)
- When there are two or more nodes in the replica set, the primary node begins a replication process to each secondary node. You will notice the growth of the MongoDB database size on these secondary nodes until they reach the size of the MongoDB database on the primary node.
- The status of a secondary node will remain Down in the Mirage Console until the replication is complete. After replication, the secondary node will be marked as Up. For more information about MongoDB replication, see the Replication section of the online MongoDB Manual.
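The sequence above can be modeled in a few lines. This is a simplified sketch of the described behavior, not Mirage or MongoDB code: the class `ReplicaSetSketch` and the host names are hypothetical, and the primary node is assumed to be reported as Up as soon as it is created.

```python
class ReplicaSetSketch:
    """Toy model of how MirageRS grows as Mirage Management servers are added."""

    def __init__(self, name: str = "MirageRS") -> None:
        self.name = name
        self.nodes: list[dict[str, str]] = []
        self.arbiter_enabled = False

    def add_management_server(self, host: str) -> dict[str, str]:
        if not self.nodes:
            # First server: the replica set is created with a primary node.
            node = {"host": host, "role": "primary", "status": "Up"}
        else:
            # Later servers: secondary nodes, shown as Down until replicated.
            node = {"host": host, "role": "secondary", "status": "Down"}
        self.nodes.append(node)
        if len(self.nodes) >= 2:
            # The Arbiter service is enabled once a second node exists.
            self.arbiter_enabled = True
        return node

    def replication_complete(self, host: str) -> None:
        for node in self.nodes:
            if node["host"] == host:
                node["status"] = "Up"


rs = ReplicaSetSketch()
rs.add_management_server("mgmt1")
rs.add_management_server("mgmt2")
print([(n["role"], n["status"]) for n in rs.nodes])
# [('primary', 'Up'), ('secondary', 'Down')]
rs.replication_complete("mgmt2")
print(rs.nodes[1]["status"], rs.arbiter_enabled)  # Up True
```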
The Arbiter, controlled by the Arbiter service, does not hold any data. Its sole responsibility is to vote during an election process. If a node has failed, and an election of the primary node is initiated, the Arbiter casts the tie-breaking vote that determines which node becomes the new primary node. It makes sure the secondary node chosen as the new primary is fully synchronized, available to the Mirage Management servers, and marked as Up. This is especially useful when two nodes have different, and conflicting, sets of data. Such a situation is sometimes called a split-brain scenario. For more details about the Arbiter, see the Replica Set Arbiter section of the online MongoDB Manual.
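Why a data-less Arbiter matters can be seen from the majority rule that MongoDB replica sets use for elections: a primary can be elected only when a majority of the voting members can reach each other. The sketch below is a simplified illustration of that rule with hypothetical function names; it does not model synchronization state or node priorities.

```python
def majority(total_voters: int) -> int:
    """Smallest number of votes that is more than half of all voting members."""
    return total_voters // 2 + 1


def can_elect_primary(total_voters: int, reachable_voters: int) -> bool:
    """A primary can be elected only if a majority of voters is reachable."""
    return reachable_voters >= majority(total_voters)


# Two data nodes, no Arbiter: losing one node leaves 1 of 2 voters.
print(can_elect_primary(total_voters=2, reachable_voters=1))  # False

# Two data nodes plus the Arbiter: losing one data node leaves 2 of 3 voters.
print(can_elect_primary(total_voters=3, reachable_voters=2))  # True
```

With an odd number of voters, the two halves of a partitioned set cannot both hold a majority, which is how the extra vote helps avoid a split brain.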
Summary of Part 3
The following topics were discussed in Part 3 of the blog post series on Mirage and MongoDB:
- Technical details of small Mirage files in MongoDB
- Additional information about performance
- The relationship between replica nodes and the election process