
Planning a DRP Solution for VMware Mirage Infrastructure

By Eric Monjoin

When I first heard about VMware Mirage in 2012, when Wanova was acquired by VMware and Mirage was integrated into our portfolio, it was seen more as a backup solution for desktops, or as a tool for migrating from Windows XP to Windows 7. And, with an extension, it was possible to easily migrate a physical desktop to a virtual one. So most of the time when we had to design a Mirage solution and the question of DRP or HA came up, the answer was, “Why back up a backup solution?” Mirage was not seen as a strategic tool.

This has changed, and VMware Mirage is now fully integrated into the End-User Computing (EUC) portfolio as a tool to manage user desktops across different use cases. Of course, we still have the backup and migration use cases, but we also have more and more customers who use it to ensure desktops conform to IT rules and policies, which makes a reliable Mirage infrastructure essential. In this post we’ll describe how to design a reliable infrastructure, or at least give the key points for different scenarios.

Let’s first have a look at the different components of a Mirage infrastructure:


Figure 1 – Basic VMware Mirage Components

  1. Microsoft SQL Database—The MS SQL database contains all the configuration and settings of the Mirage infrastructure. This component is critical: if the database fails, then all Mirage transactions stop, and the Mirage services—the Mirage Management Server service and the Mirage Server service—stop as well.
  2. SMB Shared Volumes—These can be hosted on any combination of NAS devices or Windows servers; desktop files, apps, base layers, and USMT files are all stored on these volumes (except small files and metadata).
  3. Mirage Management Server—This is used to manage the Mirage infrastructure, but it also acts as a MongoDB server instance on Mirage 5.4 and later. If it fails, administration is not possible until a new one is installed, and there is no way to recover desktops, since the small files stored in the MongoDB are no longer available.
  4. Mirage Server—This is the server the Mirage clients connect to. Typically, several Mirage servers are installed and placed behind load-balancers to provide redundancy and scalability.
  5. Web Management—A standard Web server service can be used to manage Mirage using a Web interface instead of the Mirage Management Console. The installation is quite simple and does not require extra configuration, but note that data is not stored on the Web management server.
  6. File Portal—Similar to Web management above, it is a standard Web server service used by end users to retrieve their files using a Web interface, and again, data is not stored on the file portal server.
  7. Mirage Gateway—This is used by end users to connect to Mirage infrastructure from an external network.

Now, let’s take a look at the different components of VMware Mirage and see which components can be easily configured for a reliable and redundant solution:

  • Mirage Management Server—This one is straightforward, and actually mandatory: because of MongoDB, we need to install at least one more management server, and MongoDB will synchronize automatically. Then use a VIP on a load-balancer as the connection point so traffic is routed to any available management server. The maximum number of Mirage management servers is seven due to MongoDB restrictions. Keep in mind that more than two members can reduce performance, since each write operation to the database must wait for acknowledgement from all members. The recommended number of management servers is two.
  • Mirage Server—By default we install at least two Mirage servers: one Mirage server per 1,000 or 1,500 centralized virtual desktops (CVDs), depending on the hardware configuration, plus one for redundancy, and we use load-balancers to route client traffic to any available Mirage server.
  • Web Management and File Portal—Since these are just Web applications installed over Microsoft IIS servers, we can deploy them on two or more Web servers and use load-balancers in order to provide the required redundancy.
  • Mirage Gateway—This is an appliance, and the approach is the same as for the previous components; we just have to deploy an additional appliance and configure load-balancers in front of them. Like the Mirage server, there is a limit on the number of connections per Mirage Gateway, so do not exceed one appliance per 3,000 endpoints, and add one for resiliency.

Note: Most components can be used with a load-balancer in order to get the best performance and prevent issues like frequent disconnections, so it is recommended to configure the load-balancer to support the following:

  • Two TCP connections per endpoint, and up to 40,000 TCP connections for each Mirage cluster
  • Change MSS in FastL4 protocol (F5) from 1460 to 900
  • Increase timeout from five minutes to six hours
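
As an illustration only, the last two items map to a custom FastL4 profile on an F5; the profile and virtual server names below are hypothetical, and attribute names can vary by TMOS version, so treat this as a sketch rather than a reference configuration:

    # create a FastL4 profile with a 6-hour idle timeout and an MSS override of 900
    tmsh create ltm profile fastl4 mirage-fastl4 defaults-from fastL4 idle-timeout 21600 mss-override 900

    # attach the profile to the Mirage virtual server (virtual server name is an example)
    tmsh modify ltm virtual vs_mirage profiles replace-all-with { mirage-fastl4 }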

Basically, all Mirage components can be easily deployed in a redundant way, but they rely on two other external components that are key and work jointly: the Microsoft SQL database and the SMB shared volumes. This means we have to pay special attention to which scenario is preferred:

  • Simple backup
  • Database continuity
  • Or full disaster recovery

The level of effort required is not the same for each and depends on the required RPO/RTO.

So let’s have a look at the different scenarios:

  1. Backup and Restore—This solution consists of performing a backup and restore of both the Microsoft SQL database and the storage volumes in case a major issue occurs on either component. It seems relatively simple to implement and looks inexpensive as well. It can be used when the required RPO/RTO is not aggressive: you have a few hours to restore the service, and data centralized since the last backup does not need to be restored, because re-centralizing it from the endpoints is automatic and quick. Remember, even if you lose your Mirage storage, all data is still available on the end-users’ desktops; it will just take time to centralize them again. However, this is not an appropriate scenario for large infrastructures with thousands of CVDs, as it can take months to re-centralize all the desktops. If you want to use this solution, make sure that both the Microsoft SQL database and the SMB volumes are backed up at the same time. Basically, this means stopping the Mirage services and MongoDB, performing a backup of the database using SQL Server Management Studio, and taking a snapshot of the storage volumes (see the command sketch after this list). In case of failure, you have to stop Mirage (if it has not already stopped by itself), restore the last database backup, and revert to the latest snapshot on the storage side. Keep in mind you must follow this sequence: first stop all Mirage services, and then the MongoDB services.
  2. Protect the Microsoft SQL Database—Some customers are more focused on keeping the database intact, and this implies using Microsoft SQL clustering. However, VMware Mirage does not use ODBC connections, so it is not aware that it has to move to a different Microsoft SQL instance if the main one fails. The solution is to use Microsoft SQL AlwaysOn technology, which is a combination of Microsoft SQL clustering and the Microsoft failover cluster. It provides synchronization of “non-shared” volumes between nodes, as well as a virtual IP and a virtual network name that move to the remaining node in case of disaster, or during a maintenance period.
  3. Full Disaster Recovery/Multi-Site Scenario—This last scenario concerns customers who require full disaster recovery between two data centers with demanding RPO/RTO targets. All components are duplicated at each data center, with load-balancers to route traffic to any Mirage management server, Mirage server, or Web management/file portal IIS server. This implies using the second scenario to provide Microsoft SQL high availability, and also performing synchronous replication between two storage nodes. Be aware that synchronous replication can significantly affect storage controller performance. While this is the most expensive of the scenarios, since it requires extra licenses, it is the fastest way to recover from a disaster. An intermediate scenario could be to have two Mirage management servers (one per data center), but to shut down the Mirage services and replicate the SQL database and storage volumes over the weekend.
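
For the first scenario, the quiesce-and-backup sequence could be scripted along the following lines. This is only a sketch: the Windows service names, SQL Server instance, database name, and paths are assumptions to adapt to your environment, and the snapshot of the SMB volumes itself is taken on the storage array while the services are stopped.

    rem Stop the Mirage services first, then MongoDB (service names are assumptions)
    net stop "Wanova Mirage Server"
    net stop "Wanova Mirage Management Server"
    net stop "MongoDB"

    rem Back up the Mirage database (instance, database name and path are examples)
    sqlcmd -S SQL01\MIRAGE -Q "BACKUP DATABASE MirageDB TO DISK = N'D:\Backup\MirageDB.bak' WITH INIT"

    rem Take the array-side snapshot of the SMB volumes here, then restart in reverse order
    net start "MongoDB"
    net start "Wanova Mirage Management Server"
    net start "Wanova Mirage Server"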


Figure 2 – Example Multi-Site Mirage Infrastructure

For scenarios two and three, the installation and configuration of Microsoft SQL AlwaysOn in a VMware Mirage infrastructure is explained in more detail in the corresponding white paper.


Eric Monjoin joined VMware France in 2009 as a PSO Senior Consultant after spending 15 years at IBM as a Certified IT Specialist. Passionate about new challenges and technology, Eric has been a key leader in the VMware EUC practice in France. Recently, Eric moved to the VMware Professional Services Engineering organization as a Technical Solutions Architect. Eric is certified VCP6-DT, VCAP-DTA and VCAP-DTD, and was awarded vExpert for the fourth consecutive year.

Virtual SAN Stretch Clusters – Real World Design Practices (Part 2)

By Jonathan McDonald

This is the second part of a two-part blog series, as there was just too much detail for a single post. For Part 1 see: http://blogs.vmware.com/consulting/2016/01/virtual-san-stretch-clusters-real-world-design-practices-part-1.html.

As I mentioned at the beginning of the last blog, I want to start off by saying that all of the details here are based on my own personal experiences. It is not meant to be a comprehensive guide to setting up stretch clustering for Virtual SAN, but rather a set of pointers to show the type of detail most commonly asked for. Hopefully it will help you prepare for any projects of this type.

Continuing on with the configuration, the next set of questions regarded networking!

Networking, Networking, Networking

With sizing and configuration behind us, the next step was to enable Virtual SAN and set up the stretch clustering. As soon as we turned it on, however, we got the infamous “Misconfiguration Detected” message for the networking.

In almost all engagements I have been a part of, this has been a problem, even though the networking team said it was already set up and configured. This always becomes a fight, but it gets easier with the new health check interface and its multicast checks. Generally, when multicast is not configured properly, you will see something similar to the screenshot shown below.

JMcDonald VSAN Pt2 (1)

It definitely makes the process of going to the networking team easier. The added bonus is there are no messy command line syntaxes needed to validate the configuration. I can honestly say the health interface for Virtual SAN is one of the best features introduced for Virtual SAN!

Once we had the networking configured properly, the cluster came online and we were able to complete the configuration, including stretch clustering, the proper vSphere high availability settings and the affinity rules.

The final question that came up on the networking side was about the recommendation that L3 is the preferred communication mechanism to the witness host. The big issue with using L2 is the potential that traffic could be redirected through the witness site in the case of a failure, and the witness link is sized for a substantially lower bandwidth requirement. A great description of this concern is in the networking section of the Stretched Cluster Deployment Guide.

In any case, the networking configuration is definitely more complex in stretched clustering because the network spans multiple sites. Therefore, it is imperative that it is configured correctly, not only to ensure that performance is at peak levels, but to ensure there is no unexpected behavior in the event of a failure.

High Availability and Provisioning

All of this talk finally led to the conversation about availability. The beautiful thing about Virtual SAN is that with the “failures to tolerate” setting, you can ensure there are between one and three copies of the data available, depending on what is configured in the policy. Gone are the long conversations of trying to design this into a solution with proprietary hardware or software.

A difference with stretch clustering is that the maximum “failures to tolerate” is one. This is because we have three fault domains: the two sites and the witness. Logically, when you look at it, it makes sense: more than that is not possible with only three fault domains. The idea here is that there is a full copy of the virtual machine data at each site. This allows for failover in case an entire site fails as components are stored according to site boundaries.

Of course, high availability (HA) needs to be aware of this. The way this is configured from a vSphere HA perspective is to assign the percentage of cluster resources allocation policy and set both CPU and memory to 50 percent:
JMcDonald VSAN Pt2 (2)

This may seem like a LOT of resources, but when you think of it from a site perspective, it makes sense: if an entire site fails, the virtual machines from that site will be able to restart in the surviving site without issue.

The question came up as to whether or not we allow more than 50 percent to be assigned. Yes, you can set it to use more than half of the resources, but there might be an issue in case of a failure, as not all virtual machines may be able to start back up. This is why it is recommended that 50 percent of resources be reserved. If you do want to configure more than 50 percent of the resources for virtual machines, it is still possible, but not recommended. That configuration generally consists of setting a priority on the most important virtual machines so HA will start up as many as possible, starting with the most critical ones. Personally, I recommend not going above 50 percent for a stretch cluster.

An additional question came up about using host and virtual machine affinity rules to control the placement of virtual machines. Unfortunately, the assignment to these groups is not easy during the provisioning process and did not fit cleanly into the virtual machine provisioning practices used in the environment. vSphere Distributed Resource Scheduler (DRS) does a good job of ensuring balance, but more control was needed than simply relying on DRS to balance the load. The end goal was that placement in the appropriate site could happen automatically as part of provisioning, without extra manual steps for staff.

This discussion boiled down to the need for a change to provisioning practices. Currently, it is a manual configuration change, but it is possible to use automation such as vRealize Orchestrator to automate deployment appropriately. This is something to keep in mind when working with customers to design a stretch cluster, as changes to provisioning practices may be needed.

Failure Testing

Finally, after days of configuration and design decisions, we were ready to test failures. This is always interesting because the conversation always varies between customers. Some require very strict testing and want to test every scenario possible, while others are OK doing less. After talking it over we decided on the following plan:

  • Host failure in the secondary site
  • Host failure in the primary site
  • Witness failure (both network and host)
  • Full site failure
  • Network failures
    • Witness to site
    • Site to site
  • Disk failure simulation
  • Maintenance mode testing

This was a good balance of tests to show exactly what the different failures look like. Prior to starting, I always go over the health status windows for Virtual SAN as it updates very quickly to show exactly what is happening in the cluster.

The customer was really excited about how seamlessly Virtual SAN handles errors. The key is to be operationally prepared and ensure the comfort level is high with handling the worst-case scenario. When starting off, host and network failures are always very similar in appearance, but showing this is important, so I suggested running through several similar tests just to ensure that the tests are accurate.

As an example, one of the most common failure tests requested (which many organizations don’t test properly) is simulating what happens if a disk fails in a disk group. Simply pulling a disk out of the server does not replicate what would happen if a disk actually fails, as a completely different mechanism is used to detect this. You can use the following commands to properly simulate a disk actually failing by injecting an error.  Follow these steps:

  1. Identify the disk device in which you want to inject the error. You can do this by using a combination of the Virtual SAN Health User Interface, and running the following command from an ESXi host and noting down the naa.<ID> (where <ID> is a string of characters) for the disk:
     esxcli vsan storage list
  2. Navigate to /usr/lib/vmware/vsan/bin/ on the ESXi host.
  3. Inject a permanent device error to the chosen device by running:
    python vsanDiskFaultInjection.pyc -p -d <naa.id>
  4. Check the Virtual SAN Health User Interface. The disk will show as failed, and the components will be relocated to other locations.
  5. Once the re-sync operations are complete, remove the permanent device error by running:
    python vsanDiskFaultInjection.pyc -c -d <naa.id>
  6. Once completed, remove the disk from the disk group and uncheck the option to migrate data. (This is not a strict requirement because data has already been migrated as the disk officially failed.)
  7. Add the disk back to the disk group.
  8. Once this is complete, all warnings should be gone from the health status of Virtual SAN.
    Note: Be sure to acknowledge and reset any alarms to green.

After performing all the tests in the above list, the customer had a very good feeling about the Virtual SAN implementation and their ability to operationally handle a failure should one occur.

Performance Testing

Last, but not least, was performance testing. Unfortunately, the 10G networking was not available while I was onsite. I would not recommend using a gigabit network for most configurations, but since we were not yet in full production mode, we went through many of the performance tests anyway to get an excellent baseline of what performance looks like on the gigabit network.

Briefly, because I could write an entire book on performance testing, the quickest and easiest way to test performance is with the Proactive Tests menu which is included in Virtual SAN 6.1:

JMcDonald VSAN Pt2 (3)

It provides a really good mechanism to test different types of workloads that are most common – all the way from a basic test, to a stress test. In addition, using IOmeter for testing (based on environmental characteristics) can be very useful.

In this case, to give you an idea of performance test results, we were pretty consistently getting a peak of around 30,000 IOPS with the gigabit network with 10 hosts in the cluster. Subsequently, I have been told that once the 10G network was in place, this actually jumped up to a peak of 160,000 IOPS for the same 10 hosts. Pretty amazing to be honest.

I will not get into the ins and outs of testing, as it very much depends on the area you are testing. I did want to show, however, that it is much easier to perform a lot of the testing this way than it was using the previous command line method.

One final note I want to add in the performance testing area is that one of the key things (other than pure “my VM goes THISSSS fast” type tests), is to test the performance of rebalancing in the case of maintenance mode, or failure scenarios. This can be done from the Resyncing Components Menu:

JMcDonald VSAN Pt2 (4)

Boring by default perhaps, but when you either migrate data in maintenance mode, or change a storage policy, you can see what the impact will be to resync components. It will either show when creating an additional disk stripe for a disk, or when fully migrating data off the host when going into maintenance mode. The compliance screen will look like this:

JMcDonald VSAN Pt2 (5)

This can represent a significant amount of time, and it is incredibly useful when testing normal workloads such as data being migrated during the enter-maintenance-mode workflow. Full migrations of data can be incredibly expensive, especially if the disks are large, or if you are using gigabit rather than 10G networks. Oftentimes, convergence can take a significant amount of time and bandwidth, so this allows customers to plan for the amount of data to be moved while in maintenance mode, or in the case of a failure.

Well, that is what I have for this blog post. Again, this is obviously not a conclusive list of all decision points or anything like that; it’s just where we had the most discussions that I wanted to share. I hope this gives you an idea of the challenges we faced, and can help you prepare for the decisions you may face when implementing stretch clustering for Virtual SAN. This is truly a pretty cool feature and will provide an excellent addition to the ways business continuity and disaster recovery plans can be designed for an environment.


Jonathan McDonald is a Technical Solutions Architect for the Professional Services Engineering team. He currently specializes in developing architecture designs for core virtualization and software-defined storage, as well as providing best practices for upgrading and health checks for vSphere environments.

Virtual SAN Stretch Clusters – Real World Design Practices (Part 1)

By Jonathan McDonald

This is part one of a two-part blog series, as there was just too much detail for a single post. I want to start off by saying that all of the details here are based on my own personal experiences. It is not meant to be a comprehensive guide for setting up stretch clustering for Virtual SAN, but a set of pointers to show the type of detail that is most commonly asked for. Hopefully it will help prepare you for any projects that you are working on.

Most recently in my day-to-day work I was asked to travel to a customer site to help with a Virtual SAN implementation. It was not until I got on site that I was told that the idea for the design was to use the new stretch clustering functionality that VMware added to the Virtual SAN 6.1 release. This functionality has been discussed by other folks in their blogs, so I will not reiterate much of the detail from them here. In addition, the implementation is very thoroughly documented by the amazing Cormac Hogan in the Stretched Cluster Deployment Guide.

What this blog is meant to be is a guide to some of the most important design decisions that need to be made. I will focus on the most recent project I was part of; however, the design decisions are pretty universal. I hope that the detail will help people avoid issues such as the ones we ran into while implementing the solution.

A Bit of Background

For anyone not aware of stretch clustering functionality, I wanted to provide a brief overview. Most of the details you already know about Virtual SAN still remain true. What it really amounts to is a configuration that allows two sites of hosts connected with a low latency link to participate in a virtual SAN cluster, together with an ESXi host or witness appliance that exists at a third site. This cluster is an active/active configuration that provides a new level of redundancy, such that if one of the two sites has a failure, the other site will immediately be able to recover virtual machines at the failed site using VMware High Availability.

The configuration looks like this:

JMcDonald Stretched Virtual SAN Cluster 1

This is accomplished by using fault domains and is configured directly from the fault domain configuration page for the cluster:

JMcDonald Stretched Virtual SAN Cluster 2

Each site is its own fault domain which is why the witness is required. The witness functions as the third fault domain and is used to host the witness components for the virtual machines in both sites. In Virtual SAN Stretched Clusters, there is only one witness host in any configuration.

JMcDonald Stretched Virtual SAN Cluster 3

For deployments that manage multiple stretched clusters, each cluster must have its own unique witness host.

The nomenclature used to describe a Virtual SAN Stretched Cluster configuration is X+Y+Z, where X is the number of ESXi hosts at data site A, Y is the number of ESXi hosts at data site B, and Z is the number of witness hosts at site C.

Finally, with stretch clustering, the current maximum configuration is 31 nodes (15 + 15 + 1 = 31 nodes). The minimum supported configuration is 1 + 1 + 1 = 3 nodes. This can be configured as a two-host virtual SAN cluster, with the witness appliance as the third node.

With all these considerations, let’s take a look at a few of the design decisions and issues we ran into.

Hosts, Sites and Disk Group Sizing

The first question that came up—as it almost always does—is about sizing. This customer initially used the Virtual SAN TCO Calculator for sizing and the hardware was already delivered. Sounds simple, right? Well perhaps, but it does get more complex when talking about a stretch cluster. The questions that came up regarded the number of hosts per site, as well as how the disk groups should be configured.

Starting off with the hosts, one of the big things discussed was the possibility of having more hosts in the primary site than in the secondary. For stretch clusters, an identical number of hosts in each site is a requirement. This makes it a lot easier from a decision standpoint, and when you look closer the reason becomes obvious: with a stretched cluster, you have the ability to fail over an entire site. Therefore, it is logical to have identical host footprints.

With disk groups, however, the decision point is a little more complex. Normally, my recommendation here is to keep everything uniform. Thus, if you have 2 solid state disks and 10 magnetic disks, you would configure 2 disk groups with 5 disks each. This prevents unbalanced utilization of any one component type, regardless of whether it is a disk, disk group, host, network port, etc. To be honest, it also greatly simplifies much of the design, as each host/disk group can expect an equal amount of love from vSphere DRS.

In this configuration, though, it was not so clear because one additional disk was available, so the division of disks cannot be equal. After some debate, we decided to keep one disk as a “hot spare,” so there was an equal number of disk groups—and disks per disk group—on all hosts. This turned out to be a good thing; see the next section for details.

In the end, much of this is the standard approach to Virtual SAN configuration, so other than site sizing, there was nothing really unexpected.

Booting ESXi from SD or USB

I don’t want to get too in-depth on this, but briefly: when you boot an ESXi 6.0 host from a USB device or SD card, Virtual SAN trace logs are written to RAMdisk, and these logs are not persistent. This actually serves to preserve the life of the device, as the amount of data being written can be substantial. When running in this configuration, the logs are automatically offloaded to persistent media during shutdown or a system crash (PANIC). However, if you have more than 512 GB of RAM in the hosts, these devices are unlikely to have enough space to store this volume of data, because they are generally not that large. As a result, logs, Virtual SAN trace logs, or core dumps may be lost or corrupted because of insufficient space, and the ability to troubleshoot failures will be greatly limited.

So, in these cases it is recommended to configure a drive for the core dump and scratch partitions. This is also the only supported method for handling Virtual SAN traces when booting ESXi from a USB stick or SD card.
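
As a rough sketch of what that configuration can look like from the ESXi shell (the datastore path and syslog target are examples, and the scratch location is set through the standard ScratchConfig advanced option, which requires a reboot to take effect):

    # Point the scratch location at persistent storage (example path; reboot required)
    esxcli system settings advanced set -o /ScratchConfig/ConfiguredScratchLocation -s "/vmfs/volumes/datastore1/.locker-esx01"

    # Enable a core dump partition, letting ESXi choose a suitable local device
    esxcli system coredump partition set --enable true --smart

    # Forward logs to an external syslog target such as vRealize Log Insight (example host)
    esxcli system syslog config set --loghost="udp://loginsight.example.com:514"
    esxcli system syslog reload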

That being said, when we were in the process of configuring the hosts in this environment, we saw the “No datastores have been configured” warning message pop up, meaning persistent storage had not been configured. This is what triggered the whole discussion. In the vSphere Web Client the warning looks like this:

JMcDonald Stretched Virtual SAN Cluster 4

In the vSphere Client, this error also comes up when you click the Configuration tab:

JMcDonald Stretched Virtual SAN Cluster 5

The spare disk turned out to be useful because we were able to use it to configure the ESXi scratch and core dump partitions. This is not to say we were seeing crashes, or even expected to; in fact, we saw no unexpected behavior in the environment up to this point. Rather, since this was a new environment, we wanted to ensure we’d have the ability to quickly diagnose any issue, and having this configured up front saves significant time in support. This is, of course, speaking from first-hand experience.

In addition, syslog was set up to export logs to an external source at this time. Whether using the syslog service that is included with vSphere, or vRealize Log Insight (amazing tool if you have not used it), we were sure to have the environment set up to quickly identify the source of any problem that might arise.

For detailed instructions, see the VMware Knowledge Base articles on configuring persistent scratch and core dump locations.

I guess the lesson here is that when you are designing your Virtual SAN cluster, make sure you remember that having persistence available for logs, traces and core dumps is a best practice. If you have a large memory configuration, the easiest approach is to install ESXi to a hard drive and place the scratch and core dump partitions there as well. This also simplifies post-installation tasks, and will ensure you can collect all the information support might require to diagnose issues.

Witness Host Placement

The witness host was the next piece we designed. Officially, the witness must be in a distinct third site in order to properly detect failures. It can either be a full host or a virtual appliance residing outside of the virtual SAN cluster. The cool thing is that if you use an appliance, it actually appears differently in the Web client:

JMcDonald Stretched Virtual SAN Cluster 6

For the witness host in this case, we decided to use the witness appliance rather than a full host. This way, it could be migrated easily, because the networking to the third site was not set up yet. As a result, for the initial implementation while I was onsite, the witness was local to one of the sites, and would be migrated as soon as the networking was set up. This is definitely not a recommended configuration, but for testing—or for a non-production proof of concept—it does work. Keep in mind that a site failure may not be properly detected unless the cluster is properly configured.

With this, I conclude Part 1 of this blog series; hopefully, you have found this useful. Stay tuned for Part 2!


Jonathan McDonald is a Technical Solutions Architect for the Professional Services Engineering team. He currently specializes in developing architecture designs for core virtualization and software-defined storage, as well as providing best practices for upgrading and health checks for vSphere environments.

 

VMware vRealize Automation 7.0 – Finally the most desirable features and capabilities in a VMware Private Cloud Automation Solution have arrived!

By Cory Allen, RadhaKishna (RK) Dasari and Shannon Wilber

Anyone who has previously worked with vRealize Automation 6.x or earlier versions of VMware Cloud Automation Center 5.x understands just how challenging it has historically been to manage the overall planning, design, deployment and architecture of an end-to-end private cloud automation solution.

Over the past few years, our Professional Services Engineering team has worked extensively with VMware engineering organizations and with many diverse customer cloud automation deployment scenarios globally. Our team extensively tests and researches the most effective and proven methodologies for implementing vRealize Automation as a solution, to help mitigate and reduce the historical implementation challenges.

So guess what?  We are really excited and impressed with the new vRealize Automation 7.0… :)

VMware finally has a new private cloud solution and product with the launch of vRealize Automation 7.0, which delivers a much easier and more user-friendly deployment method using the slick new vRealize Automation Installation Wizard. This new installation and configuration wizard enables a simplified and centralized deployment unlike anything you have seen in vRealize Automation before. The challenges of installation and configuration are a thing of the past, having been solved in a whole new way with this release of vRealize Automation.

In particular, during the early testing and validation our team performed during the betas, we quickly became very impressed with this new capability and immediately realized just how valuable the new vRealize Automation installation would be in terms of efficiency, ease of use, intuitiveness, and stability for VMware field delivery teams and customers globally. The most significant change our team observed, which was very cool, is that you no longer have to perform so many manual installation tasks, which can get really tedious during the deployment of a complex, highly available vRA solution.

So now let’s get to the best part – where the fun begins – some basic questions and answers.

  1. How has the installation and configuration of vRealize Automation 7.0 changed from previous versions to include 6.x and 5.x?
    Response: Well, good news! The new installation features and capabilities offered include the option to deploy a Minimal Deployment and an Enterprise Deployment.
  • The vRealize Automation Installation Wizard Minimal Deployment offers a simple non-distributed installation without high availability shown here:
    PSE Deployment 1
  • The vRealize Automation Installation Wizard provides an option for an Enterprise Deployment distributed with or without high availability options shown here:

PSE Expanded Deployment

  2. Now I have two options to perform the installation, so what is the big deal?

    Response: That is not all the installation has to offer; we are just getting started! The new installation wizard now has a cool new gadget called the “Prerequisite Checker,” available within the wizard. This feature leverages a new IaaS Management Agent installed on each IaaS host so that the Prerequisite Checker can, from a central location within the wizard, automate the installation and configuration of all Microsoft Windows Server IaaS prerequisites. No longer do architects and engineers have to manually log on to each server, configure the prerequisites, reboot, or worse, forget to install all required prerequisites. The new vRealize Automation Installation Wizard Prerequisite Checker takes care of it all. Just be careful not to get too comfortable here; there is more fun to come.

The vRealize Installation Wizard Installation Prerequisites Page to Download Management Agents is shown here:

PSE Deployment 3

vRealize Installation Wizard New “Prerequisite Checker” scanning IaaS Hosts within the Enterprise Deployment option:

PSE Deployment 4

The vRealize Installation Wizard’s new “Prerequisite Checker” flags Windows IaaS server hosts that do not meet the requirements with a “Some prerequisites are not met” notification. The wizard has a “Fix” button that enables this cool new tool to automatically fix each IaaS host with a simple click of a button.

PSE Prerequisite Checker 1

The cool new “Prerequisite Checker” will notify you when all Windows IaaS machines have had all prerequisites installed and configured. Guess what? If for some reason a Windows IaaS machine goes offline during the process, simply wait for the machine to come back online and click “Run” or “Fix” a second time to resume where you left off. Pretty cool stuff, huh?

PSE Prerequisite Checker 2

 

PSE Prerequisite Checker 3

  3. Wow! That is pretty cool. After the “Prerequisite Checker” is done completing installation tasks, what other cool automated and fancy things can vRealize Automation 7.0 do? What about authoring infrastructure or applications? Has anything changed to enable an easier and more streamlined method for developing infrastructure and applications? Response: Check out the new (and unified) blueprint design canvas for authoring infrastructure virtual machines and applications with Machine Types and Software Components, which now includes Application Services as a standard service. The new blueprint designer is probably the coolest, and most significant, new feature in this release: it allows building everything from simple virtual machines to complex application-based blueprints, all within a single design canvas.

New Unified Blueprint Design Canvas for Authoring Infrastructure and Applications

PSE Unified Blueprint Design

  4. Now that the first question has been addressed for new unified blueprints, tell me how Application Services works with the new service authoring model. Response: Here is an overview of how the new Application Services works compared to the 6.x versions of vRealize Automation: Application Services (formerly Application Director) used to be an optional application blueprint authoring feature delivered as a separate, external virtual appliance that had to be manually deployed, configured, and integrated with vRealize Automation 6.x environments. With those older versions, application blueprints had to be created manually as a separate component from single- and multi-machine blueprints. Now, in vRealize Automation 7.0, Application Services is a standard feature offered as part of the overall deployment and is available within the unified blueprint; it is no longer a separate virtual appliance or external component. Application Services, which now includes Software Components, has integrated authoring capabilities in vRealize Automation 7.0 and runs as a service on each vRealize Automation 7.0 appliance, scaling as appliances are added into the overall architecture.

New Capabilities to Create Applications and Software Components in the Unified Blueprint

PSE Unified Blueprint 2

VMware’s customers will be very impressed that they can now create multi-tier blueprints with combined infrastructure and application dependencies to author software components.

Create a Software Component such as an Apache Web Server Service

PSE Unified Blueprint 3
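
The install lifecycle action behind a Software Component like this is essentially just a script. As a purely hypothetical illustration (package names and commands assume a RHEL/CentOS 7 template and are not taken from the product), an Apache install script could be as simple as:

    #!/bin/bash
    # Hypothetical install action for an "Apache Web Server" software component
    yum -y install httpd      # install the Apache package
    systemctl enable httpd    # start the service at boot
    systemctl start httpd     # start it now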

  5. Wow, that is very cool: I can now create infrastructure virtual machines, create Software Components, and then add my applications by dragging and dropping them onto the design canvas? How can I add networking or security policies to my unified blueprints? Response: In addition, another super cool new capability of the unified blueprint is the option to integrate NSX on-demand networking and security components into these blueprints. This is a radical new networking and security enhancement within the blueprint design canvas that allows authoring of services in one single unified designer within vRealize Automation 7.0, including the following NSX network components:
  • On-Demand Private Networks
  • On-Demand NAT Networks
  • On-Demand Routed Networks

Security Components:

  • On-Demand Security Groups
  • Existing Security Groups
  • Security Tags

New Blueprint with NSX On-Demand Networking and Security “Drag & Drop” Authoring Functionality

PSE Unified Blueprint 4


Cory Allen has been in the IT world for over 18 years and is a Technical Solutions Architect for VMware. He has been with VMware since June of 2015, working on the PSE Cloud Automation team. Before coming to VMware, he was with Parker Hannifin as the Hybrid Cloud Architect, where his main task was to design and build a private internal cloud that was fully automated and able to scale to the whole organization worldwide.

RadhaKrishna (RK) Dasari is a Technical Solutions Architect for the Professional Services Engineering team. He specializes in developing architecture designs and service kits for vRealize Automation and vCloud Air. Prior to VMware, RK had a career spanning thirteen years at Dell as a software developer, software architect and pre-sales solutions architect.

Shannon Wilber is a Technical Solution Architect at VMware with eighteen years of information technology experience with architecture design, implementations and infrastructure management for commercial and enterprise scale IT solutions for Software Defined Data Centers and Hybrid Cloud environments. Proven team leadership and technical experience to guide complex IT projects through the planning, design, implementation, migration and optimization stages for diverse IT solutions for customers globally. Shannon is also a VMware Certified Professional (VCP5-DCV), EMC Proven Professional and EMC Certified Cloud Infrastructure Specialist.

VMware vRealize Operations Python Adapter – A Hidden Treasure

By Jeremy Wheeler

Even more power comes out of VMware vRealize Operations when enabling the vRealize Operations Python Adapter, adding additional intelligent monitoring and action capabilities.

To do this, execute the following steps:

Image 1:

JWheeler Image 1

  1. Select ‘Solutions’
  2. Select ‘VMware vSphere’
  3. Select ‘vCenter Python Adapter’

Add your vCenters, and match what you configured under the ‘vCenter Adapter’ section, shown above item 3 in Image 1.

What Does This Do for Me?

When viewing the default ‘Recommendations’ dashboard, you might see something such as the following in your ‘Top Risk Alerts For Descendants’:

Image 2:

JWheeler Image 2

By selecting the alert, you will be presented with another dialog to dig into the object we should inspect:

Image 3:

JWheeler Image 3

After I select ‘View Details’ it will present me with the object details of the virtual machine ‘av_prov1’.

Image 4:

JWheeler Image 4

Without Python Adapters configured you will not see the ‘Set Memory for VM’ button; with it configured it will be visible under the ‘Recommendations’ section.

Image 5:

JWheeler Image 5

After selecting ‘Set Memory for VM’ you will be presented with a new dialog (Image 5). Here we can see what the new memory recommendation would be and adjust or apply it. Additionally, if you want the changes to happen now, you can select Power-Off/Snapshot. Without powering off the virtual machine, vRealize Operations will attempt to hot-add the additional memory if the OS will support it.

Image 6:

JWheeler Image 6

Once you select ‘Begin Action’ you will see the dialog in Image 6.


Jeremy Wheeler is an experienced senior consultant and architect for VMware’s Professional Services Organization, End-user Computing specializing in VMware Horizon Suite product-line and vRealize products such as vROps, and Log Insight Manager. Jeremy has over 18 years of experience in the IT industry. In addition to his past experience, Jeremy has a passion for technology and thrives on educating customers. Jeremy has 7 years of hands-on virtualization experience deploying full-life cycle solutions using VMware, CITRIX, and Hyper-V. Jeremy also has 16 years of experience in computer programming in various languages ranging from basic scripting to C, C++, PERL, .NET, SQL, and PowerShell.

Jeremy Wheeler has received acclaim from several clients for his in-depth and varied technical experience and exceptional hands-on customer satisfaction skills. In February 2013, Jeremy also received VMware’s Spotlight award for his outstanding persistence and dedication to customers and was nominated again in October of 2013.

Cloud Pod Architecture and Cisco Nexus 1000V Bug

By Jeremy Wheeler

I once worked with a customer who owned two vBlocks spread across two data centers and ran the Nexus 1000V for the virtual networking component. They deployed VDI, and when we enabled Cloud Pod Architecture, global data replication worked great; however, all of the connection servers in the remote pod would show as red or offline. I found that we could not telnet to the internal pod or remote pod connection servers over port 8472; all other ports were good. VMware Support confirmed the issue was with the Nexus 1000V: a bug in the N1KV involving a TCP Checksum Offload function.

The specific ports in question are the following:

VMware View Port 8472 – The View Interpod API (VIPA) interpod communication channel runs on this port. View Connection Server instances use the VIPA interpod communication channel to launch new desktops, find existing desktops, and share health status data and other information.

Cisco Nexus 1000V Port 8472 – VXLAN; Cisco posted a bug report about 8472 being dropped at the VEM for N1KV: Cisco Bug: CSCup55389 – Traffic to TCP port 8472 dropped on the VEM

The bug report identifies TCP checksum offload as the root cause, with only port 8472 packets affected. If removing the N1KV isn’t an option, you can disable TCP offloading on the connection servers.

To Disable TCP Offloading

  • On the Windows server, open the Control Panel and select Network Settings > Change Adapter Settings.
    JWheeler Ethernet Adapter Properties 1
  • Right-click each of the adapters (private and public), select Configure from the Networking tab, and then click the Advanced tab. The TCP offload settings for the adapter are listed there.
    JWheeler Ethernet Adapter Properties 2

I recommend disabling the following settings:

  • IPv4 Checksum Offload
  • Large Receive Offload (was not present for our vmxnet3 advanced configuration)
  • Large Send Offload
  • TCP Checksum Offload

You would need to do this on each of the VMXNET3 adapters on each connection server at both data centers. Once the offloads were disabled (it did cause the NIC to blip), we were able to telnet between the data centers on port 8472 again.
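
If you would rather script the change than click through the adapter properties, a rough PowerShell sketch for Windows Server 2012 or later follows. The adapter name is an example, and the advanced-property display names vary by VMXNET3 driver version, so list them first and adjust accordingly:

    # See which offload-related properties the driver exposes
    Get-NetAdapterAdvancedProperty -Name "Ethernet0" | Where-Object { $_.DisplayName -like "*Offload*" }

    # Disable checksum and large send offload on the adapter (repeat per adapter and per connection server)
    Disable-NetAdapterChecksumOffload -Name "Ethernet0"
    Disable-NetAdapterLso -Name "Ethernet0"

    # Disable a specific advanced property by its display name (display name is driver-dependent)
    Set-NetAdapterAdvancedProperty -Name "Ethernet0" -DisplayName "IPv4 Checksum Offload" -DisplayValue "Disabled"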

After making these adjustments you should be able to log in to the View Admin portal and see all green for the remote connection servers. I have tested and validated this, and it works as intended. For more information, I recommend reading Understanding TCP Segmentation Offload (TSO) and Large Receive Offload (LRO) in a VMware environment (2055140).


Jeremy Wheeler is an experienced senior consultant and architect for VMware’s Professional Services Organization, End-user Computing specializing in VMware Horizon Suite product-line and vRealize products such as vROps, and Log Insight Manager. Jeremy has over 18 years of experience in the IT industry. In addition to his past experience, Jeremy has a passion for technology and thrives on educating customers. Jeremy has 7 years of hands-on virtualization experience deploying full-life cycle solutions using VMware, CITRIX, and Hyper-V. Jeremy also has 16 years of experience in computer programming in various languages ranging from basic scripting to C, C++, PERL, .NET, SQL, and PowerShell.

Jeremy Wheeler has received acclaim from several clients for his in-depth and varied technical experience and exceptional hands-on customer satisfaction skills. In February 2013, Jeremy also received VMware’s Spotlight award for his outstanding persistence and dedication to customers and was nominated again in October of 2013.

Horizon View 6.2 and Blackscreens

By Jeremy Wheeler

With the release of Horizon View 6.2 and the vSphere 6.0 Update 1a comes new features – but also possible new issues. If you have an environment running Horizon 6.2 and anything below vSphere 6.0 Update 1, you might see some potential issues with your VDI desktops. VMware has introduced a new video driver (version 6.23) in View 6.2 that greatly improves speed and quality, but to utilize this fully you need to be on the latest vSphere bits. Customers who have not upgraded to the latest bits have reported VDI desktops black-screening and disconnecting. One fix for those difficult images is to upgrade/replace the video driver inside the Guest OS of the Gold Image.

To uninstall the old video driver inside your Gold Image Guest OS follow these steps:

  1. Uninstall the View Agent
  2. Delete Video Drivers from Windows Device Manager
    • Expand Device Manager and Display Adapters
    • Right-click on the VMware SVGA 3D driver and select Uninstall
      JWheeler Uninstall
    • Select the checkbox ‘Delete the driver software for this device.’
      JWheeler Confirm Device Uninstall
  3. Reboot and let Windows rescan
  4. Verify that Windows is using its bare-bones SVGA driver (if not, delete the driver again)
  5. Install View Agent 6.2

Note: Do NOT update VMware Tools, or you will have to repeat this sequence, unless you have also upgraded the View Agent.

Optional Steps:

If you want to update the video driver without re-installing the View Agent, follow these steps:

  1. Launch View Agent 6.2 installer MSI (only launch the installer, do not proceed through the wizard!)
  2. Change to the %temp% folder and sort the contents by date/time
  3. Look for the most recent long folder name, for example:
    JWheeler Temp File Folder
  4. Change into the directory and look for the file ‘VmVideo.cab’
    JWheeler VmVideo
  5. Copy ‘VmVideo.cab’ file to a temp folder (i.e., C:/Temp)
  6. Extract all files from the VmVideo.cab file. You should see something like this:
    JWheeler Local Temp File
  7. You can execute the following type of syntax for extraction (see the example after these steps):
    – extract /e /a /l <destination> <drive>:\<cabinetname>
    Reference Microsoft KB 132913 for additional information.
  8. You need to rename each file, so remove the prefix ‘_’ and anything after the extension of the filename. Example:
    JWheeler Local Disk Temp Folder 2
  9. Install View Agent 6.2 video drivers:
    1. Once rebooted in the device manager expand ‘Display Adapter’
    2. Right-click on the ‘Microsoft Basic Display Adapter’ and click ‘Update Driver Software’
    3. Select ‘Browse my computer for driver software’
    4. Select ‘Browse’ and point to the temp folder where you expanded and renamed all the View 6.2 drivers
    5. Select ‘Next’ and complete the video driver installation.
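
For step 7, a concrete invocation could look like the following; the paths are examples only:

    extract /e /a /l C:\Temp\VmVideo C:\Temp\VmVideo.cab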

After completing these steps to re-install the View Agent and/or replace the video drivers, you will need to do the following:

  1. Power-down the Gold Image (execute any power-down scripts or tasks as you normally do)
  2. Snapshot the VM
  3. Modify the View pool to point to the new snapshot
  4. Execute a recompose

Special thanks to Matt Mabis (@VDI_Tech_Guy) on discovering this fix.


Jeremy Wheeler is an experienced senior consultant and architect for VMware’s Professional Services Organization, End-user Computing specializing in VMware Horizon Suite product-line and vRealize products such as vROps, and Log Insight Manager. Jeremy has over 18 years of experience in the IT industry. In addition to his past experience, Jeremy has a passion for technology and thrives on educating customers. Jeremy has 7 years of hands-on virtualization experience deploying full-life cycle solutions using VMware, CITRIX, and Hyper-V. Jeremy also has 16 years of experience in computer programming in various languages ranging from basic scripting to C, C++, PERL, .NET, SQL, and PowerShell.

Jeremy Wheeler has received acclaim from several clients for his in-depth and varied technical experience and exceptional hands-on customer satisfaction skills. In February 2013, Jeremy also received VMware’s Spotlight award for his outstanding persistence and dedication to customers and was nominated again in October of 2013.

Configuring NSX-v Load Balancer for use with vSphere Platform Services Controller (PSC) 6.0

By Romain Decker

VMware introduced a new component with vSphere 6, the Platform Services Controller (PSC). Coupled with vCenter, the PSC provides several core services, such as Certificate Authority, License service and Single Sign-On (SSO).

Multiple external PSCs can be deployed to serve one (or more) products, such as vCenter Server, Site Recovery Manager or vRealize Automation. When deploying the Platform Services Controller for multiple services, the availability of the Platform Services Controller must be considered. In some cases, having more than one PSC deployed in a highly available architecture is recommended. When configured in high availability (HA) mode, the PSC instances replicate state information between each other, and the external products (vCenter Server, for example) interact with the PSCs through a load balancer.

This post covers the configuration of an HA PSC deployment with the benefits of using NSX-v 6.2 load balancing feature.

Due to the relationship between vCenter Server and NSX Manager, two different scenarios emerge:

  • Scenario A where both PSC nodes are deployed from an existing management vCenter. In this situation, the management vCenter is coupled with NSX which will configure the Edge load balancer. There are no dependencies between the vCenter Server(s) that will use the PSC in HA mode and NSX itself.
  • Scenario B where there is no existing vCenter infrastructure (and thus no existing NSX deployment) when the first PSC is deployed. This is a classic “chicken and egg” situation, as the NSX Manager that is actually responsible for load balancing the PSC in HA mode is also connected to the vCenter Server that uses the PSC virtual IP.

While scenario A is straightforward, you need to respect a specific order in scenario B to prevent any loss of connection to the Web Client during the procedure. The solution is to deploy a temporary PSC in a temporary SSO site to do the load balancer configuration, and to repoint the vCenter Server to the PSC virtual IP at the end. Both paths are summarized in the workflow below.

RDecker PSC Map

Environment

NSX Edge supports two deployment modes: one-arm mode and inline mode (also referred to as transparent mode). While inline mode is also possible, NSX load balancer will be deployed in a one-arm mode in our situation, as this model is more flexible and because we don’t require full visibility into the original client IP address.

Description of the environment:

  • Software versions: VMware vCenter Server 6.0 U1 Appliance, ESXi 6.0 U1, NSX-v 6.2
  • NSX Edge Services Gateway in one-arm mode
  • Active/Passive configuration
  • VLAN-backed portgroup (distributed portgroup on DVS)
  • General PSC/vCenter and NSX prerequisites validated (NTP, DNS, resources, etc.)

To offer SSO in HA mode, two PSC servers have to be installed, with NSX load balancing them in active/standby mode; active/active mode is currently not supported by the PSC.

The way SSO operates, it is not possible to configure it as active/active. The workaround for the NSX configuration is to use an application rule and to configure two different pools (with one PSC instance in each pool). The application rule will send all traffic to the first pool as long as the pool is up, and will switch to the secondary pool if the first PSC is down.

The following is a representation of the NSX-v and PSC logical design.

RDecker PSC NSX

Procedure

Each step number refers to the above workflow diagram. You can take snapshots at regular intervals to be able to rollback in case of a problem.

Step 1: Deploy infrastructure

This first step consists of deploying the required vCenter infrastructure before starting the configuration.

A. For scenario A: Deploy two PSC nodes in the same SSO site.

B. For scenario B:

  1. Deploy a first standalone Platform Services Controller (PSC-00a). This PSC will be used temporarily during the configuration.
  2. Deploy a vCenter instance against the PSC-00a just deployed.
  3. Deploy NSX Manager and connect it to the vCenter.
  4. Deploy two other Platform Services Controllers in the same SSO domain (PSC-01a and PSC-02a) but in a new site. Note: vCenter will still be pointing to PSC-00a at this stage. Use the following options:
    RDecker PSC NSX Setup 1
    RDecker PSC NSX Setup 2

Step 2 (both scenarios): Configure both PSCs as an HA pair (up to step D in KB 2113315).

Now that all required external Platform Services Controller appliances are deployed, it’s time to configure high availability.

A. PSC pairing

  1. Download the PSC high availability configuration scripts from the Download vSphere page and extract the content to /ha on both the PSC-01a and PSC-02a nodes. Note: Use KB 2107727 to enable the Bash shell so that you can copy files into the appliances with SCP.
  2. Run the following command on the first PSC node:
    python gen-lb-cert.py --primary-node --lb-fqdn=load_balanced_fqdn --password=<yourpassword>

    Note: The load_balanced_fqdn parameter is the FQDN of the PSC virtual IP on the load balancer. If you don't specify the --password option, the default password will be "changeme".
    For example:

    python gen-lb-cert.py --primary-node --lb-fqdn=psc-vip.sddc.lab --password=brucewayneisbatman
  3. On the PSC-01a node, copy the content of the directory /etc/vmware-sso/keys to /ha/keys (a new directory that needs to be created).
  4. Copy the content of the /ha folder from the PSC-01a node to the /ha folder on the additional PSC-02a node (including the keys copied in the step before).
  5. Run the following command on the PSC-02a node:
python gen-lb-cert.py --secondary-node --lb-fqdn=load_balanced_fqdn --lb-cert-folder=/ha --sso-serversign-folder=/ha/keys

Note: The load_balanced_fqdn parameter is the FQDN of the load balancer address (or VIP).

For example:

python gen-lb-cert.py --secondary-node --lb-fqdn=psc-vip.sddc.lab --lb-cert-folder=/ha --sso-serversign-folder=/ha/keys

Note: If you're following KB 2113315, don't forget to stop the configuration here (at the end of section C in the KB).
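As referenced in step 1 of the PSC pairing above, the scripts and keys are exchanged between the two appliances over SCP. A minimal sketch, assuming the Bash shell has already been enabled on both nodes (KB 2107727) and using the node names from the examples above:

# Run on PSC-01a: copy the /ha folder (scripts plus the copied keys) to PSC-02a
scp -r /ha root@psc-02a.sddc.lab:/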

Step 3: NSX configuration

An NSX edge device must be deployed and configured for networking in the same subnet as the PSC nodes, with at least one interface for configuring the virtual IP.

A. Importing certificates

Open the configuration of the NSX Edge services gateway that will host the load balancing service for the PSC, and add a new certificate in the Settings > Certificates menu (under the Manage tab). Use the content of the previously generated /ha/lb.crt file as the load balancer certificate and the content of the /ha/lb_rsa.key file as the private key.

RDecker PSC Certificate Setup
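Before pasting the files into the NSX UI, it can be worth sanity-checking the generated certificate and key from the PSC shell. A quick sketch, assuming openssl is available on the appliance:

# Check the subject (it should contain the load-balanced FQDN) and the validity dates
openssl x509 -in /ha/lb.crt -noout -subject -issuer -dates
# Confirm the private key matches the certificate: the two digests should be identical
# (enter the key passphrase if prompted)
openssl x509 -in /ha/lb.crt -noout -modulus | openssl md5
openssl rsa -in /ha/lb_rsa.key -noout -modulus | openssl md5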

B. General configuration

Enable the load balancer service and logging under the global configuration menu of the load balancer tab.

RDecker PSC Web Client

C. Application profile creation

An application profile defines the behavior of a particular type of network traffic. Two application profiles have to be created: one for HTTPS protocol and one for other TCP protocols.

Parameters                    | HTTPS application profile | TCP application profile
Name                          | psc-https-profile         | psc-tcp-profile
Type                          | HTTPS                     | TCP
Enable Pool Side SSL          | Yes                       | N/A
Configure Service Certificate | Yes                       | N/A

Note: The other parameters shall be left with their default values.

RDecker PSC Edge

D. Creating pools

NSX load balancer virtual servers of type HTTP/HTTPS perform a web protocol sanity check against their backend server pools. We do not want that check applied to the backend pools used by the TCP virtual server. For that reason, separate pools must be created for the PSC HTTPS virtual IP and the TCP virtual IP.

Four pools have to be created: two different pools for each virtual server (with one PSC instance per pool). An application rule will be defined to switch between them in case of a failure: traffic will be sent to the first pool as long as that pool is up, and will switch to the secondary pool if the first PSC is down.

Parameters   | Pool 1              | Pool 2              | Pool 3              | Pool 4
Name         | pool_psc-01a-http   | pool_psc-02a-http   | pool_psc-01a-tcp    | pool_psc-02a-tcp
Algorithm    | ROUND-ROBIN         | ROUND-ROBIN         | ROUND-ROBIN         | ROUND-ROBIN
Monitors     | default_tcp_monitor | default_tcp_monitor | default_tcp_monitor | default_tcp_monitor
Members      | psc-01a             | psc-02a             | psc-01a             | psc-02a
Monitor Port | 443                 | 443                 | 443                 | 443

Note: While you could use a custom HTTPS health check, I selected the default TCP monitor in this example.

RDecker PSC Edge 2 (Pools)

E. Creating application rules

This application rule contains the logic that performs the failover between the pools (for each virtual server), matching the active/passive behavior of the PSC high availability mode. The ACL checks whether the primary PSC is up; if the first pool is down, the rule switches traffic to the secondary pool.

The first application rule will be used by the HTTPS virtual server to switch between the corresponding pools for the HTTPS backend servers pool.

# Detect if pool "pool_psc-01a-http" is still UP
acl pool_psc-01a-http_down nbsrv(pool_psc-01a-http) eq 0
# Use pool " pool_psc-02a-http " if "pool_psc-01a-http" is dead
use_backend pool_psc-02a-http if pool_psc-01a-http_down

The second application rule will be used by the TCP virtual server to switch between the corresponding pools for the TCP backend servers pool.

# Detect if pool "pool_psc-01a-tcp" is still UP
acl pool_psc-01a-tcp_down nbsrv(pool_psc-01a-tcp) eq 0
# Use pool " pool_psc-02a-tcp " if "pool_psc-01a-tcp" is dead
use_backend pool_psc-02a-tcp if pool_psc-01a-tcp_down

RDecker PSC Edge 3 (app rules)

F. Configuring virtual servers

Two virtual servers have to be created: one for HTTPS protocol and one for the other TCP protocols.

Parameters          | HTTPS Virtual Server                           | TCP Virtual Server
Application Profile | psc-https-profile                              | psc-tcp-profile
Name                | psc-https-vip                                  | psc-tcp-vip
IP Address          | IP address corresponding to the PSC virtual IP | Same PSC virtual IP
Protocol            | HTTPS                                          | TCP
Port                | 443                                            | 389,636,2012,2014,2020*
Default Pool        | pool_psc-01a-http                              | pool_psc-01a-tcp
Application Rules   | psc-failover-apprule-http                      | psc-failover-apprule-tcp

* Although this procedure describes a fresh install, you could target the same architecture when upgrading SSO 5.5 to PSC. If you plan to upgrade from SSO 5.5 in HA mode, you must add the legacy SSO port 7444 to the list of ports in the TCP virtual server.

RDecker PSC Edge 4 (VIP)
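At this point the load balancer configuration is complete. If you want to double-check the application profiles, pools, application rules, and virtual servers outside of the vSphere Web Client, you can pull the Edge load balancer configuration from the NSX Manager REST API. A minimal sketch, assuming admin credentials, an NSX Manager reachable at nsxmgr.sddc.lab (placeholder name), and the PSC load balancer running on edge-1; the path follows the NSX-v API guide, so adjust it if your version differs:

# Returns the full load balancer configuration of the Edge as XML (curl prompts for the password)
curl -k -u admin https://nsxmgr.sddc.lab/api/4.0/edges/edge-1/loadbalancer/config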

Step 4 (both scenarios)

Now it's time to finish the PSC HA configuration (step E of KB 2113315). Update the endpoint URLs on the PSC with the load_balanced_fqdn by running this command on the first PSC node:

python lstoolHA.py --hostname=psc_1_fqdn --lb-fqdn=load_balanced_fqdn --lb-cert-folder=/ha --user=Administrator@vsphere.local

Note: psc_1_fqdn is the FQDN of the first PSC-01a node and load_balanced_fqdn is the FQDN of the load balancer address (or VIP).

For example:

python lstoolHA.py --hostname=psc-01a.sddc.lab --lb-fqdn=psc-vip.sddc.lab --lb-cert-folder=/ha --user=Administrator@vsphere.local
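To confirm that the endpoints were actually rewritten, you can dump the service registrations with the lstool.py script that ships on the PSC appliance and check that the URLs now reference the load-balanced FQDN. The path and options below are the ones commonly shown in VMware KB articles; treat them as an assumption and adjust them for your build:

# List all service registrations and keep only the endpoint URLs
python /usr/lib/vmidentity/tools/scripts/lstool.py list --url http://localhost:7080/lookupservice/sdk | grep URL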

Step 5

A. Scenario A: Deploy any new production vCenter Server or other components (such as vRA) against the PSC Virtual IP and enjoy!

B. Scenario B

The situation is now the following: the vCenter is still pointing to the first external PSC instance (PSC-00a), and the two other PSC instances are configured in HA mode but are not yet used.

RDecker Common SSO Domain vSphere

Starting with vSphere 6.0 Update 1, it is possible to move a vCenter Server between SSO sites within a vSphere domain (see KB 2131191 for more information). In our situation, we have to re-point the existing vCenter, currently connected to the external PSC-00a, to the PSC virtual IP:

  1. Download and replace the cmsso-util file on your vCenter Server by following the actions described in KB 2113911.
  2. Re-point the vCenter Server Appliance to the PSC virtual IP by running this command:
/bin/cmsso-util repoint --repoint-psc load_balanced_fqdn

Note: The load_balanced_fqdn parameter is the FQDN of the load balancer address (or VIP).

For example:

/bin/cmsso-util repoint --repoint-psc psc-vip.sddc.lab

Note: This command will also restart vCenter services.

  3. Move the vCenter services registration to the new SSO site. When a vCenter Server is installed, it creates service registrations that it uses to start the vCenter Server services. These registrations are written to the specific Platform Services Controller (PSC) site that was used during the installation. Use the following command to update the vCenter Server service registrations (you will be prompted for the parameters).
/bin/cmsso-util move-services

After the command, you end up with the following.

RDecker PSC Common SSO Domain vSphere 2

  4. Log in to your vCenter Server instance using the vSphere Web Client to verify that vCenter Server is up and running and can be managed (a command-line check is also sketched below).

RDecker PSC Web Client 2
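If you prefer a command-line check of which lookup service the vCenter Server Appliance now points to, the vmafd-cli utility can report it. A quick sketch, using the path found on the vCenter Server Appliance 6.0 (adjust for other builds):

# Should return the lookup service URL behind the PSC virtual IP (psc-vip.sddc.lab)
/usr/lib/vmware-vmafd/bin/vmafd-cli get-ls-location --server-name localhost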

In scenario B, you can always re-point to the previous PSC-00a if you cannot log in or if you get an error message. Once you have confirmed that everything is working, you can remove the temporary PSC (PSC-00a) from the SSO domain with this command (KB 2106736):

cmsso-util unregister --node-pnid psc-00a.sddc.lab --username administrator@vsphere.local --passwd VMware1!

Finally, you can safely decommission PSC-00a.

RDecker PSC Common SSO Domain vSphere 3
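To verify that PSC-00a has really left the SSO domain before you power it off, you can list the remaining directory service nodes with the vdcrepadmin utility present on the PSC appliances. A hedged sketch, using the syntax documented in VMware's PSC replication KB articles (adjust the node and credentials for your environment):

# Lists all PSC nodes in the SSO domain; psc-00a.sddc.lab should no longer appear
/usr/lib/vmware-vmdir/bin/vdcrepadmin -f showservers -h psc-01a.sddc.lab -u administrator -w 'VMware1!'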

Note: If your NSX Manager was configured with the Lookup Service, you can update it to point to the PSC virtual IP.



Romain Decker is a Senior Solutions Architect member of Professional Services Engineering (PSE) for the Software-Defined Datacenter (SDDC) portfolio – a part of the Global Technical & Professional Solutions (GTPS) team.

User Environment Manager 8.7 working with Horizon 6.2

By Dale Carter

With the release of VMware User Environment Manager 8.7, VMware added a number of new features, all of which you will find in the VMware User Environment Manager Release Notes.

However, in this blog I would like to call out two new features that help when deploying User Environment Manager alongside VMware Horizon 6.2. In my opinion, VMware's EUC teams did a great job getting these two features added, or enhanced, to work with Horizon 6.2 in the latest releases.

Terminal Server Client IP Address or Terminal Server Client Name

The first feature, which has been enhanced to work with Horizon 6.2, is one I think will have a number of benefits: support for detecting the client IP address and client name in Horizon View 6.2 and later. With this feature it is now possible to apply conditions based on the location of the physical device.

For example, if a user connects to a virtual desktop or RDS host from a physical device in the corporate office, an application could be configured to map a drive to corporate data or to configure an office printer. However, if the user connects to the same virtual desktop or RDS host from a physical device at home or on an untrusted network and launches the same application, the drive or printer may not be mapped.

Another example would be to combine the Terminal Server Client IP Address or Terminal Server Client Name condition with a triggered task. This way you could connect or disconnect a different printer at login/logoff, or at disconnect/reconnect, depending on where the user is connecting from.

To configure a mapped drive or printer that will be assigned when on a certain network, you would use the Terminal Server Client IP Address or Terminal Server Client Name condition as shown below.

DCarter Drive Mapping

If you choose to limit access via the physical client name, this can be done using a number of different options.

DCarter Terminal Server Client Name 1

On the other hand, if you choose to limit access via the IP address, you can use a range of addresses.

DCarter Terminal Server Client 2

Detect PCoIP and Blast Connections

The second great new feature is the ability to detect if the user is connecting to the virtual desktop or RDS host via a PCoIP or Blast connection.

The Remote Display Protocol setting was already available in User Environment Manager, but as you can see below, it now includes the Blast and PCoIP protocols.

DCarter Remote Display Protocol

 

This feature has many uses, one of which could be to limit what icons a user sees when using a specific protocol.

For example, perhaps you only allow users to connect to their virtual desktops or RDS hosts remotely using the Blast protocol, while they use PCoIP on the corporate network. You could then limit applications that access sensitive data so that they only appear in the Start menu or on the desktop when the user is connected over PCoIP.

Of course, you could also use the Terminal Server Client IP Address or Terminal Server Client Name condition to prevent the user from seeing an application based on the physical device's IP address or name.

The examples in this blog are just a small number of uses for these great new and enhanced features, and I would encourage everyone to download User Environment Manager 8.7 and Horizon 6.2 to see how they can help in your environment.


Dale is a Senior Solutions Architect and member of the CTO Ambassadors. Dale focuses on the End User Computing space, where he has become a subject matter expert in a number of VMware products. Dale has more than 20 years' experience working in IT, having started his career in Northern England before moving to Spain and finally the USA. Dale currently holds a number of certifications, including VCP-DV, VCP-DT, VCAP-DTD and VCAP-DTA.

For updates, you can follow Dale on Twitter @vDelboy.

Improving Internal Data Center Security with NSX-PANW Integration

Dharma RajanBy Dharma Rajan

Today's data center (DC) typically has one or more firewalls at the perimeter, providing a strong defense that keeps threats out of the DC. However, applications and their associated content can easily bypass a port-based firewall using a variety of techniques. If a threat gets in, the attack surface is large, and low-priority systems are often the target because activity on them may not be monitored. At the same time, more and more workloads inside the DC are being virtualized, so East-West traffic between virtual machines has increased substantially compared to North-South traffic.

Threats such as data-stealing malware, web threats, spam, Trojans, worms, viruses, spyware, and bots can spread fast and cause serious damage once they enter. For example, dormant virtual machines can be a risk when they are powered back up, because they may not have received patches or anti-malware updates, leaving them vulnerable to security threats. When an attack happens, it can move quickly and compromise critical systems, and this needs to be prevented. In many cases the attack goes unnoticed until an event triggers an investigation, by which time valuable data may have been compromised or lost.

It is therefore critical that proper internal controls and security measures are applied at the virtual machine level to reduce the attack surface within the data center. So how do we do that, and evolve the traditional data center into a more secure environment that overcomes today's challenges, without additional costly hardware?

Traditional Model for Data Center Network Security

In the traditional model, the network architecture combines perimeter-level security with Layer 2 VLAN segmentation. This model worked, but as we virtualize more and more workloads and the data center grows, we hit the limits of VLANs: VLAN sprawl, plus an ever-increasing number of firewall rules that need to be created and managed. Because the 802.1Q VLAN ID is a 12-bit field, only 4,094 VLANs can be provisioned (2^12 = 4,096 values, minus the two reserved IDs 0 and 4095). All of this adds complexity to the traditional network architecture of the data center. Other key challenges customers run into in production data centers are too many firewall (FW) rules to create, poor documentation, and the fear of deleting FW rules when a virtual machine is deleted. Flexibility is lost, and holes remain for attackers to use as entry points.

Once security is compromised at the level of one VLAN, the spread across the network, be it the Engineering VLAN, the Finance VLAN, and so on, does not take very long. So the key is not just avoiding attacks, but also, if one occurs, containing its spread.

DRajan Before and After Attack

Reducing Attack Surface Area

The first question that comes to mind is, "How do we prevent an attack, and isolate its spread if one occurs?" We start by looking at the characteristics of today's data centers, which are becoming more and more virtualized. With a high degree of virtualization and increased East-West data center traffic, we need dynamic ways to identify, isolate, and prevent attacks, as well as automated ways to create FW rules and tighten security at the virtual machine level. This is what leads us to VMware NSX, VMware's network virtualization platform, which provides the virtual infrastructure security, by way of micro-segmentation, that today's data center environments need.

Micro-Segmentation Principle

As a first step let’s take a brief look at the NSX platform and its components:

DRajan NSX Switch

The data plane is the NSX vSwitch, which is based on the vSphere Distributed Switch (vDS) together with firewall hypervisor extension modules that run at the kernel level and provide distributed firewalling (DFW) functionality at line-rate speed and performance.

The NSX Edge provides edge firewalling/perimeter firewall functionality on the Internet-facing side. The NSX Controller is the control-plane component and is deployed as a cluster for high availability. The NSX Manager is the management-plane component that communicates with the vCenter infrastructure.

By doing micro-segmentation and applying firewall rules at the virtual machine level, we control the traffic flow on the egress side by validating the rules at the virtual machine itself. This avoids extra hops and hairpinning, because the traffic no longer has to cross the network to a physical firewall to be validated. We also gain good visibility of the traffic, making it easier to monitor and secure the virtual machine.

Micro-segmentation is based on a simple starting principle: assume everything is a threat and act accordingly. This is the "zero trust" model. In other words, entities that need access to resources must prove they are legitimate before gaining access to the identified resource.

With a zero-trust baseline assumption, which can be implemented as "deny by default," we then selectively relax it and apply design principles that let us build a cohesive yet scalable architecture that can be controlled and managed well. We define three key design principles.

1) Isolation and segmentation – Isolation is the foundation of most network security, whether for compliance, containment or simply keeping development, test and production environments from interacting. Segmentation from a firewalling point of view refers to micro-segmentation on a single Layer 2 segment using DFW rules.

2) Unit-level trust/least privilege – Provide access only to the granular entity needed by that user, whether that is a virtual machine or something within the virtual machine.

3) Ubiquity and centralized control – This enables control and monitoring of activity through the NSX Controller, which provides a centralized control plane, the NSX Manager, and the cloud management platforms that provide integrated management.

Using the above principles, we can lay out an architecture for any greenfield or brownfield data center environment that helps us micro-segment the network in a way that is architecturally sound, flexible to adapt, and that enables safe application enablement with the ability to integrate advanced services.

DRajan Micro Segmentation
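To make the "deny by default" baseline concrete, the distributed firewall rule set can be reviewed, and its default rule switched from allow to block once your explicit allow rules are in place, through the NSX Manager REST API. A minimal sketch, assuming an NSX-v manager reachable at nsxmgr.sddc.lab (placeholder name) and admin credentials; the path follows the NSX-v API guide, so adjust it if your version differs:

# Dumps the current distributed firewall configuration, including the default section
# whose default rule action can be changed to block for a zero-trust baseline
curl -k -u admin https://nsxmgr.sddc.lab/api/4.0/firewall/globalroot-0/config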

 

Dynamic Traffic Steering

Network security teams are often challenged to coordinate network security services from multiple vendors in relation to each other. Another powerful benefit of the NSX approach is its ability to build security policies that leverage NSX service insertion, dynamic service chaining, and traffic steering to drive service execution in the logical services pipeline, based on the results of other services. This makes it possible to coordinate otherwise completely unrelated network security services from multiple vendors. For example, we can introduce advanced service chaining where, at a specific layer, we direct specific traffic to a Palo Alto Networks (PANW) VM-series virtual firewall for scanning and threat identification, taking the necessary action, such as quarantining an application, if required.

Palo Alto Networks VM-series Firewalls Integration with NSX

The Palo Alto Networks next-generation firewall integrates with VMware NSX at the ESXi server level to provide comprehensive visibility and safe application enablement of all data center traffic including intra-host virtual machine communications. Panorama is the centralized management tool for the VM-series firewalls. Panorama works with the NSX Manager to deploy the license and centrally administer configuration and policies on the VM-series firewall.

The first step of integration is for Panorama to register the VM-series firewall on the NSX manager. This allows the NSX Manager to deploy the VM-series firewall on each ESXi host in the ESXi cluster. The integration with the NetX API makes it possible to automate the process of installing the VM-series firewall directly on the ESXi hypervisor, and allows the hypervisor to forward traffic to the VM-series firewall without using the vSwitch configuration. It therefore requires no change to the virtual network topology.

DRajan Panorama Registration with NSX

To redirect traffic, the NSX Service Composer is used to create security groups and to define network introspection rules that specify which guest traffic is steered to the VM-series firewall. For traffic that needs to be inspected and secured by the VM-series firewall, the Service Composer policies redirect it to the Palo Alto Networks next-generation firewall (NGFW) service. This traffic is then steered to the VM-series firewall and processed before it reaches the virtual switch.

Traffic that does not need to be inspected by the VM-series firewall (for example, network data backups or traffic to an internal domain controller) does not need to be redirected and can be sent straight to the virtual switch for onward processing.

The NSX Manager sends real-time updates on changes in the virtual environment to Panorama. The firewall rules are centrally defined and managed on Panorama and pushed to the VM-series firewalls. The VM-series firewall enforces security policies by matching source or destination IP addresses. The use of Dynamic Address Groups allows the firewall to populate the members of those groups in real time and to forward the traffic to the filters on the NSX firewall.

Integrated Solution Benefits

Better security – Micro-segmentation reduces the attack surface. It enables safe application enablement and protection against known and unknown threats in virtual and cloud environments. The integration also makes it easier to identify and isolate compromised applications quickly.

Simplified deployment and faster secure service enablement – When a new ESXi host is added to a cluster, a new VM-series firewall is automatically deployed, provisioned and available for immediate policy enforcement without any manual intervention.

Operational flexibility – The automated workflow allows you to keep pace with VM deployments in your data center. The hypervisor mode on the firewall removes the need to reconfigure the ports/vSwitches/network topology; because each ESXi host has an instance of the firewall, traffic does not need to traverse the network for inspection and consistent enforcement of policies.

Selective traffic redirection – Only traffic that needs inspection by VM-series firewall needs redirection.

Dynamic security enforcement – The Dynamic Address Groups maintain awareness of changes in the virtual machines/applications and ensure that security policies stay in tandem with changes in the network.

Accelerated deployments of business-critical applications – Enterprises can provision security services faster and utilize capacity of cloud infrastructures, and this makes it more efficient to deploy, move and scale their applications without worrying about security.

For more information on NSX visit: http://www.vmware.com/products/nsx/

For more information on VMware Professional Services visit: http://www.vmware.com/consulting/


Dharma Rajan is a Solution Architect in the Professional Services Organization specializing in pre-sales for SDDC and driving NSX technology solutions to the field. His experience spans enterprise and carrier networks. He holds an MS degree in Computer Engineering from NCSU and an M.Tech degree in CAD from IIT.