An easy and fast way to back up FlashArray with Rubrik - Part 2: backup and restore deep dive

[ NOTE: machine translation with the help of DeepL translator without additional proofreading and spell checking ]


[Authors: Marco Hieronymus and Marcel Düssil]


For this blog series, we decided to split the content into two posts. The first post focused on the necessary configuration within Rubrik, VMware and Pure Storage. This second post covers the backup and restore operations and provides deeper technical insight.


Technical overview


Basically, "everyone cooks with water", whether CommVault, Cohesity, Veritas, Veeam, ArcServe, EMC Avamar, EMC NetWorker, IBM TSM ... Rubrik, too, uses the interface that the hypervisor provides for virtual machine backups. With VMware vSphere, every backup tool uses the vStorage APIs for Data Protection (VADP for short). VADP is the successor to VMware Consolidated Backup (VCB) and has been the common standard for backing up vSphere since vSphere 4. Combined with Changed Block Tracking (CBT), VADP is the best practice for fast, incremental data protection.
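
To make the idea behind CBT a bit more tangible, here is a minimal Python sketch. It does not use the VADP API; the class TrackedDisk and its methods are hypothetical names chosen for illustration. The disk simply remembers which blocks were written since the last backup, so an incremental backup only has to copy those blocks.

```python
class TrackedDisk:
    """Toy virtual disk that records which blocks changed (the CBT idea)."""

    def __init__(self, block_count):
        self.blocks = [b""] * block_count
        self.changed = set()  # block indexes written since the last backup

    def write(self, index, data):
        self.blocks[index] = data
        self.changed.add(index)

    def incremental_backup(self):
        """Return only the changed blocks and reset the tracking."""
        delta = {i: self.blocks[i] for i in sorted(self.changed)}
        self.changed.clear()
        return delta


disk = TrackedDisk(block_count=8)
disk.write(0, b"boot")
disk.write(5, b"data")
first = disk.incremental_backup()   # contains blocks 0 and 5
disk.write(5, b"data-v2")
second = disk.incremental_backup()  # contains only block 5
print(sorted(first), sorted(second))
```

The second backup transfers a single block although the disk has eight, which is the effect that keeps incremental backups small and fast.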

During my research I came across an interesting comparison of VADP and VCB. It is striking how much of what we take for granted today with VADP was possible only to a limited extent, or not at all, with VCB.

Source: VMware Knowledge Base

Hyper-V, of course, uses the Volume Shadow Copy Service (VSS) rather than VADP for backups. Each hypervisor vendor (Nutanix AHV, Citrix XenServer, ...) provides its own mechanisms and interfaces.


So at this point I want to be clear: Rubrik has neither revolutionized the backup/restore process nor reinvented the wheel. In my eyes, the art lies in the user experience and in how it feels to work with the tool. Some products can be used intuitively by everyone, while others are more complex and harder to use.


The following technical insight therefore applies, more or less, to every backup tool that uses VADP.


Rubrik's backup workflow


(1) Connect to vCenter and initiate the backup job.

(2) Run the pre-backup script inside the VM to put the application into a consistent state.

(3) Create a VMware snapshot. See section below: How Snapshots work.

(4) Using the Pure Storage REST API, create a storage snapshot of the volume underlying the datastore on the FlashArray. The datastores can also be distributed across multiple physical arrays (provided that all arrays are connected to Rubrik).

(5) Remove the VMware snapshot.

(6) Run the post-snap script in the VM.

(7) Mount the volume snapshot on the ESXi server. In the background, a new volume is created from the volume snapshot taken in step (4). A rescan is performed on the ESXi HBA to discover the new volume. The new volume is then presented to the host and mounted, and a datastore is created from it. Important: the created storage volume and datastore are only temporary.

(8) Create a proxy VM. The temporary proxy VM is based on the same resources as the original VM (template) and can be placed on any ESXi server with network connectivity to the arrays. The proxy VM is created powered off and without any network configuration. The datastore created in step (7) is now attached to the proxy VM.

(9) Read the data via the proxy VM. VMware CBT is used here to transfer only the blocks that have changed since the last backup.

(10) Clean up the proxy VM.

(11) Unmount the temporary datastores/volumes.

(12) Clean up the array volumes/snapshots.

(13) Execute post-backup script (if configured).
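
The thirteen steps above can be read as one orchestration routine. The following Python sketch only models the ordering; every function and step name in it is hypothetical, and no Rubrik, vSphere or Pure Storage API is called. That the temporary objects are cleaned up even when the data transfer fails is an assumption of this sketch, not documented product behavior.

```python
def run_backup(execute):
    """Run the workflow steps in order; `execute` performs one named step."""
    # Steps 1-6: the VMware snapshot only lives until the array snapshot exists.
    for step in ("connect_vcenter", "pre_backup_script", "create_vm_snapshot",
                 "create_array_snapshot", "remove_vm_snapshot", "post_snap_script"):
        execute(step)
    try:
        # Steps 7-9: work on the temporary copy, not on the production VM.
        for step in ("mount_volume_copy", "create_proxy_vm", "read_changed_blocks"):
            execute(step)
    finally:
        # Steps 10-12: assumed to run even if the transfer above failed.
        for step in ("cleanup_proxy_vm", "unmount_temp_datastore",
                     "cleanup_array_objects"):
            execute(step)
    # Step 13: only reached after a successful run.
    execute("post_backup_script")


log = []
run_backup(log.append)
print(log)
```

The point worth noticing is the position of remove_vm_snapshot: it comes directly after the array snapshot and long before any data is read, which is why the VMware snapshot stays so short-lived.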


In practice


Backup


Rubrik can, of course, also take a snapshot "on demand". This can be done for individual VM objects, groups and SLA domains. Below we trigger a snapshot for our SLA domain.

As we can see, the VMware snapshot existed for just 6 seconds and was removed again immediately. The delta file therefore stays minimal.

Large delta files can have a significant impact on systems with a high change rate. For example, the redo/delete operations needed to remove a snapshot can cause timeouts for systems and clients. In addition, backups can fail if a system is shut down while a snapshot operation is running.
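
A quick back-of-the-envelope calculation shows why the short snapshot lifetime matters. The sketch assumes that the delta file grows with the guest's write rate multiplied by the snapshot lifetime. This is an upper bound, because repeated writes to the same block do not add space. The write rate of 50 MB/s and the 4-hour comparison value are made-up example figures, not measurements from our environment.

```python
def estimate_delta_mb(write_rate_mb_s, snapshot_seconds):
    """Upper bound for delta growth: every write lands in the delta file."""
    return write_rate_mb_s * snapshot_seconds


short_lived = estimate_delta_mb(50, 6)            # 6-second snapshot, as in our run
long_lived = estimate_delta_mb(50, 4 * 60 * 60)   # snapshot held for 4 hours
print(short_lived, long_lived)
```

Under these assumptions the short-lived snapshot collects at most 300 MB, while a snapshot held for four hours could grow to 720,000 MB, all of which would have to be merged back when the snapshot is removed.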


How Snapshots work


When a snapshot is taken, ideally the guest operating system is quiesced. If the virtual machine is powered on when the snapshot is taken, VMware Tools is used to quiesce the file system in the virtual machine. Quiescing a file system is the process of bringing the on-disk data of a physical or virtual computer into a state suitable for backups. This includes operations such as flushing modified buffers from the operating system's in-memory cache to disk.


When a snapshot is created, it consists of the following files: <vm>-<number>.vmdk and <vm>-<number>-delta.vmdk.


A set of the above snapshot files for each virtual disk is associated with the virtual machine at the time of the snapshot. These files are called child disks, redo logs, or delta links/files. A child disk can later serve as the parent disk for further child disks. Starting from the original parent disk, each child disk is a redo log that leads back, step by step, from the current state of the virtual disk to the original.
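
The parent/child relationship can be pictured as a stack of layers. The following Python sketch is a strongly simplified model, not VMware's on-disk format; the class DiskChain and its methods are hypothetical. Writes go to the newest child disk, reads walk back through the chain until a layer contains the block, and a consolidation merges all layers back into one disk.

```python
class DiskChain:
    """Toy model of a parent disk plus child disks (redo logs)."""

    def __init__(self, base_blocks):
        self.layers = [dict(base_blocks)]  # layers[0] is the original parent

    def snapshot(self):
        # New writes go to a fresh, empty child disk; older layers stay frozen.
        self.layers.append({})

    def write(self, block, data):
        self.layers[-1][block] = data

    def read(self, block):
        # Walk from the newest child back towards the original parent.
        for layer in reversed(self.layers):
            if block in layer:
                return layer[block]
        return None

    def consolidate(self):
        # Merge all child disks into the parent, oldest first.
        merged = {}
        for layer in self.layers:
            merged.update(layer)
        self.layers = [merged]


chain = DiskChain({0: "os", 1: "app-v1"})
chain.snapshot()
chain.write(1, "app-v2")
before = (chain.read(0), chain.read(1), len(chain.layers))
chain.consolidate()
after = (chain.read(0), chain.read(1), len(chain.layers))
print(before, after)
```

The model also hints at why long chains and large child disks are expensive: every read may have to visit several layers, and the merge has to touch every block the children contain.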

HINT: If the virtual hard disk is larger than 2 TB, the redo log file has the format <vm>-<number>-sesparse.vmdk.

.vmsd


This .vmsd file is a database of virtual machine snapshot information and is the primary source of information for the snapshot manager. The file contains line entries that define the relationships between snapshots and the child disks for each snapshot.

<vm>Snapshot<number>.vmsn


A .vmsn file contains the current configuration and, optionally, the active state of the virtual machine. If the virtual machine's memory state is captured, the state of a powered-on virtual machine can be restored. With snapshots that do not include memory (keyword: crash-consistent), only the state of a powered-off virtual machine can be restored. Memory snapshots take longer to create than a "simple" snapshot.

How do snapshots work?


The VMware API enables VMware and third-party products to perform operations on virtual machines and their snapshots.


The following is a list of common operations that can be performed on virtual machines and snapshots using the API:


CreateSnapshot: Creates a new snapshot of a virtual machine. This also updates the current snapshot.


RemoveSnapshot: Removes a snapshot and deletes all associated storage.


RemoveAllSnapshots: Removes all snapshots associated with a virtual machine. If no snapshots are associated with a virtual machine, this operation simply returns a success message.


RevertToSnapshot: Changes the execution state of a virtual machine to the state of that snapshot. This is equivalent to the Snapshot Manager's Go To option when using the vSphere/VI Client GUI.


Consolidate: Merges the hierarchy of redo logs. This option is available in vSphere 5.0 and later.
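
To illustrate the semantics of these operations, here is a small Python model. It does not call the VMware API; the class SnapshotTree and its method names are hypothetical, and it treats the snapshots as a simple linear list, whereas a real virtual machine can have a branching snapshot tree. That the current snapshot falls back to its parent on removal is an assumption of this sketch.

```python
class SnapshotTree:
    """Toy, linear model of the snapshot operations listed above."""

    def __init__(self):
        self.snapshots = []   # oldest first; a real VM can have a branching tree
        self.current = None

    def create_snapshot(self, name):
        self.snapshots.append(name)
        self.current = name   # creating a snapshot also updates the current one

    def remove_snapshot(self, name):
        position = self.snapshots.index(name)
        self.snapshots.remove(name)
        if self.current == name:
            # Assumed behavior: fall back to the parent snapshot, if there is one.
            self.current = self.snapshots[position - 1] if position > 0 else None

    def remove_all_snapshots(self):
        self.snapshots.clear()
        self.current = None
        return "success"      # also reported when there was nothing to remove

    def revert_to_snapshot(self, name):
        if name not in self.snapshots:
            raise KeyError(name)
        self.current = name


tree = SnapshotTree()
print(tree.remove_all_snapshots())   # no snapshots yet, still a success
tree.create_snapshot("before-patch")
tree.create_snapshot("after-patch")
tree.revert_to_snapshot("before-patch")
print(tree.current, tree.snapshots)
```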

A request to create, remove, or revert a snapshot for a virtual machine is sent from the client to the server through the VMware API. The request is forwarded to the ESX host that currently hosts the virtual machine.


If the snapshot includes the memory option, the ESX host writes the virtual machine memory to disk.

HINT: The virtual machine is frozen for the duration of the memory write operation. The duration of the freeze cannot be determined in advance and depends on the performance of the disk in question and the amount of memory written.

If the snapshot includes the quiesce option, the ESX host requests the guest operating system to quiesce the disks using VMware Tools.


Depending on the guest operating system, the quiescing operation can be performed by the sync driver, the vmsync module, or the Microsoft Volume Shadow Copy (VSS) service.

During snapshot removal, the process can take a long time if the child disks are large. This can lead to timeout error messages from VirtualCenter or the VMware Infrastructure Client.


This is where the benefits of storage snapshot-based backups come in handy.


In parallel, we can observe on the FlashArray that the storage snapshot was created with the attributes of the source volume. It is important to note that, thanks to data reduction, the snapshot does not consume its full capacity.

The created volume is then mounted as a vSphere datastore and prepared for attachment to the proxy VM. The activity log shows the process while it is running.

We can see here that no "guest credentials" have been stored. These would be mandatory for an application-consistent backup (including data held in working memory). We therefore create a crash-consistent backup, and the backup does not fail at this point. This has to be kept in mind for systems such as databases: there it is important to ensure application consistency, because otherwise corruption after a restore cannot be ruled out.

After the HBA rescan on the ESX host, the volume is given a new signature and registered as a datastore. The proxy VM mentioned above is then created directly with the corresponding vDisks from the datastore on the cloned volume, and the actual backup is taken from the proxy VM.

HINT: It is also possible to dedicate an ESX host to the proxy VMs and thus take the backup load off the production hosts, if necessary.

After the backup, the proxy VM and the datastore are immediately cleaned up (unregister/delete). Again, each step can be tracked in the Activity Log.


In the vSphere task console, we can see that the whole process (despite the myriad of constraints and the lack of performance of the test environment) was completely finished within 4 minutes.

After the backup has completed, the GUI also shows the "Latest Snapshot" in the overview as well as in the calendar view. A neat and clear solution.

Our backup job is hereby completed.


Restore


A great added value of Rubrik is the fast indexing of files in a kind of Google-like search database. This means that files can be quickly searched for and found within the appliance using keywords.
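
How Rubrik implements its index internally is not covered here. Purely to illustrate why a pre-built index makes keyword searches fast, the following Python sketch builds a small inverted index over file paths. All function names and paths are hypothetical.

```python
import re
from collections import defaultdict


def build_index(paths):
    """Map each lowercase token of a path to the set of paths containing it."""
    index = defaultdict(set)
    for path in paths:
        for token in re.findall(r"[a-z0-9]+", path.lower()):
            index[token].add(path)
    return index


def search(index, keyword):
    """Look the keyword up directly instead of scanning every path."""
    return sorted(index.get(keyword.lower(), set()))


paths = [
    "C:/Users/alice/Documents/budget_2020.xlsx",
    "C:/Users/bob/Desktop/budget_draft.docx",
    "C:/Windows/System32/drivers/etc/hosts",
]
index = build_index(paths)
print(search(index, "budget"))
```

The index is built once per backup, and every later search is a single dictionary lookup, regardless of how many files the backup contains.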

Of course, we have several options for restoring:

  • Mount Virtual Machine: a VM can be started directly from the backup without a restore. An NFS datastore is mounted in the virtual environment and the system is started (without any changes to the guest VM). All changes are discarded once the mount is removed. Here we noticeably benefit from the computing power of the Rubrik nodes.

  • Mount Virtual Disks: Similar to the previous option, individual virtual hard disks can be provisioned, for example to extract content from the guest system.

  • Instantly Recover: Restarts the VM on the Rubrik cluster with full network connectivity.

  • Export: Restores the VM to a hypervisor datastore.

  • Recover Files: This is probably the most common way to recover data. Fortunately, in most cases only files from the guest OS are needed. Recover Files can restore files granularly to the original location or provide individual files for download.

In the sections below we show Instantly Recover, Mount Virtual Disks and Recover Files.

Recover Files


The recovery of individual files can be performed via a self-explanatory wizard. There is not much to say here. The guest files are extracted from the backup and made available for download.


Instantly Recover & Mount Virtual Disks


Let's say a critical business application running in a VM is down. This failure affects most of your users. With Rubrik, you can perform an instant recovery with no data loss. Simply select your latest protection point (snapshot) and perform a live mount with just a few clicks.


Rubrik will then transfer all relevant hot blocks to SSD and present this VM to your VMware environment, powering up the machine in seconds. And yes, even if your business application is several terabytes in size, this process only takes a few seconds.


Instantly Recover


With Instantly Recover you only have to specify the host for the mount. In the background, the machine is then automatically registered under the original name + timestamp of the snapshot + ascending number of the VM. The "defective" or old machine is powered off and renamed.


Mount Virtual Disk


For a vDisk mount, a few more "clicks" are necessary. First we select the vDisk to mount, then we choose a mount VM (yes, we are mounting to a Veeam server here, and it works!), and the mount starts right away.



Within vSphere we can now see a datastore (NFS mount from Rubrik) presenting a vDisk to the mount VM.

In the guest OS (PURE-VEEAMVBR-1), we can use Disk Management to initialize this volume, assign a drive letter, and access the files directly through Explorer.

Unmounting the vDisk is again done via the GUI. I did not have to take the volume offline within Windows beforehand; the process went through without problems here as well.

It is of course also possible to move the data from the NFS datastore directly to persistent production storage using built-in vSphere tools such as Storage vMotion.

The option "Remove local entry after Storage vMotion" then cleans up the configuration (i.e. the mount) accordingly.


Pure Storage Auditing


I hope Pure's auditing functions for executed operations and sessions are already known. If not, you should definitely take a look at them. You can find them in the GUI under System > Users, as well as in the CLI via pureaudit. All operations and logins on the system are logged there and can be used for tracking configuration changes and logins as well as for troubleshooting.

That's all there is to say about the Pure Storage and Rubrik integration. We hope that the simplicity and, above all, the potential and benefits of both solutions came across well in the articles.


Finally, one more thing: it remains exciting. In the near future we will also report on the integration of Rubrik with Pure's object storage, FlashBlade.