
Tag Archives: Service Providers

Virtualizing perimeter security in the Service Provider world

Perimeter security is one area of the Service Provider world that has not seen the same adoption of virtualization as first servers, then networking, and latterly storage. In this first post in a series, we'll look at some of the hidden challenges, as well as the benefits, of bringing virtualization to the datacenter perimeter.

It should be no surprise that we're big fans of virtualization here at VMware. Over our twenty-year history the question has changed from "what can we virtualize?" to "is there anything we can't virtualize?". During that time there have been big changes in other parts of our industry too. Gone are the custom hardware appliances, replaced by generic x86-based platforms. Take the trusty firewall or load-balancer: once a collection of custom components and code, now powerful x86-based "servers" with generic but still highly performant interface cards, running custom operating systems or specialized Linux-based distributions.

In the Service Provider world, a physical appliance is a necessary evil. Necessary because it performs a crucial role, but evil because its physical nature means inventory, space, power and environmental challenges. Should the Service Provider carry stocks of different sized devices, or order from their supply chain against each customer order? How many units are sufficient, and how should they be kept up to date with changing code releases while they sit on a shelf waiting to be deployed?

Since these appliances are now x86 devices with standard interfaces, they can be delivered as virtual appliances and deployed in much the same way as any other virtual machine. Like other virtual machines, that means they can be deployed as needed, and typically the latest version can be downloaded from the vendor's website whenever it's needed.

So, that’s great, problem solved! Deploy virtual perimeter firewalls, proxies or load-balancers whenever you need one. That was easy…

Except it’s not quite that simple. Let’s look at the traditional, physical, perimeter security model.

Over on the left we have the untrusted Internet connected to our perimeter firewall appliance. In this simple illustration we won't worry about dual-vendor or defense-in-depth designs; we'll treat the single firewall as an assured boundary device. That means that once the traffic leaves the inside of the firewall, we trust it enough to connect it to the compute platform where our virtualized workloads run. There's a clear demarcation or boundary here: the Internet is on the outside, and our workloads are on the inside. We can see the appeal of virtualizing devices like the firewall for Service Providers though, so let's look at the same illustration if we simply virtualize the firewall.

Not much changes: the firewall is still an assured boundary between the untrusted traffic outside on the Internet and the trusted traffic inside. There is one subtle difference though. Now the untrusted Internet traffic is connected, unfiltered, to the virtualization platform.

That little bit of red line is either an acceptable risk or a big deal, depending upon your point of view. I've presented this scenario to Service Providers for several years now, and it's interesting to see how their responses have differed over that time, and in different countries. At first I would present the option as a "journey", where different customers would become more comfortable with the idea over time. The challenge for the Provider, therefore, was how soon they could realize the benefits of virtualizing devices like this without their customers thinking that the security of their solution was somehow being compromised.

About a year or so into my presenting this scenario, I started on my "journey" explanation when the product owner at the Service Provider where I was presenting said, "our customers have reached the end of that journey already!" That Service Provider was already using NSX to virtualize their customer networks and had been explaining the benefits and capabilities of micro-segmentation and the Edge Service Gateway. Up until that point, their policy was to deploy a physical perimeter firewall, just like the one in the first illustration, and use the Edge Gateway as a "services" device, providing only load balancing and dynamic routing to their customer's WAN. They offered the NSX Distributed Firewall as a second security layer in combination with the physical firewall. Their service offering looked like this.

Or at least it had, until their customers started to ask why they were being asked to pay for a physical firewall when the next device behind it was already a capable firewall. Those customers, happy with the idea of virtualizing anything that ran on x86 hardware, saw the service they were being offered as over-engineered, with three firewalls rather than the two the Service Provider described. Is there a way, then, to mitigate the risk of customers' concerns over virtualizing network and security appliances? To a degree it depends upon the type of hardware platform a Service Provider is running, which will make any proposed solution more, or less, costly or complex. It also depends upon whether the Service Provider feels that they need to demonstrate risk mitigation, or whether their customers will accept the new solution without complex re-engineering being necessary.

In our NSX design guides we recommend separate racks/clusters to run Edge Service Gateways, as this constrains the external VLAN-backed networks to those "Edge" racks and simplifies the remaining "compute" racks, which only need to be configured with standard management and NSX VXLAN transport networks. If we look at the last solution with separate Edge compute, it looks like this.

It would be possible to argue, as that Service Provider's customers did, that there is no need for three firewalls, and simply remove the physical firewall, relying instead on the Edge Service Gateway. But what if the security perimeter requirements were more complex? What if the customer required Internet-facing load balancers with features only present in third-party products, or wanted to implement in-line proxies or other protocol analysis services, such as data-loss prevention, only possible through third-party devices? Well, if we extend the scope of the Edge cluster and make it a network and security services cluster, our solution stack could look like this.

Now there's no untrusted traffic reaching our compute clusters, and although the network and security cluster does have an unfiltered Internet connection, all the virtualized workloads in that cluster are appliances specifically designed to operate in that kind of "DMZ" environment. A solution like this is straightforward to implement in a datacenter with a modular rack-server, top-of-rack switched, leaf-and-spine design. Some consideration may be necessary in a hyper-converged infrastructure (HCI) environment, where the balance of compute and storage requirements could be quite different between network and security workloads and compute workloads, but otherwise it shouldn't require major design changes.

In a datacenter based on chassis and blades, the challenge may be in creating a virtualization environment to run network and security workloads which is sufficiently “separate” to mitigate the perceived risk. Solutions which are limited to individual chassis may only require the provision of separate chassis, whereas those whose network fabric spans multiple chassis may require a different approach, possibly using separate rack-mounted servers to create network and security clusters outside of the compute workload chassis environment.

How much effort is necessary depends upon several factors. But in most cases the benefits to the Service Provider, commercially, operationally and, most importantly, in customer satisfaction and time-to-value, should provide a compelling argument for virtualizing those last few physical appliances without necessarily having to change vendors or compromise the services offered.

Service Providers running vCloud Director have a few different options for the deployment, management and operation of third party network and security appliances in their environments, and we’ll look at these in more detail in a follow-up post.

Dedicated Hosted Cloud with vCloud Director for VMware Cloud Providers

When looking for service providers for hosted infrastructure, some customers require dedicated infrastructure for their workloads. Whether the customer is looking for additional separation for security or more predictable performance of hosted workloads, service providers will need tools that enable them to provide dedicated hardware service for customers while reducing their operational overhead. In some scenarios, providers will implement managed vSphere environments for customers to satisfy this type of request and then manage the individual vSphere environments manually or with custom automation and orchestration tools. However, it is also possible to leverage vCloud Director to provide dedicated hardware per customer while also providing a central management platform for service providers to manage multiple tenants. In this post, we will explore how this can be accomplished with ‘out of the box’ functionality in vCloud Director.

Continue reading

NSX Revenue Planning calculator

The NSX revenue planning calculator is designed to show a service provider how to make additional revenue by up-selling component NSX-derived services. Many service providers I speak to ask VMware the age-old question, "How can I make money from your bundles?" Equally, we also hear that the bundles are expensive. My response to this is: are you realizing the value and selling the functionality of the bundles, or just internally operationalizing them?

Most end consumers are after vCAN managed services, but they also desire "cloud-like" self-service from a cloud catalogue. This has been compounded by vendors bringing cloud portals into the private cloud, and by the realization from consumers that this is now a reality. Hence rolling all services into a robust "managed service" may or may not be ideal for your customers; they may desire a mix of both, and certainly, to minimise operational spend, a provider could hand over as much as possible to self-service.

In the upcoming vCloud Director release 8.2, as in the previous release 8.1, VMware has included NSX functionality in the vCD self-service portal. This means that, for the first time, a service provider can offer self-service NSX services (whilst maintaining multi-tenancy and security) to end customers, if they are permitted access. This presents the ideal combination of managed services and self-service controls for customers who want them, and allows providers to become much more granular about their charging and service definitions.

The calculator focuses on the vCAN 7, 9 & 12 point bundles (Advanced, Advanced with Networking and Advanced with Networking & Management). Of course we would like our providers to use the 12-point bundle, and this is what the calculator attempts to show – the additional margin with each vCAN bundle where NSX exposes capabilities & services.
Continue reading

Migrating VMware vCloud Director vApps across Distributed Virtual Switches

An interesting topic that came to our attention is how to migrate VMware vCloud Director® vApps from one distributed virtual switch to another. Recently, drawing on the experience of one of our field consultants, Aleksander Bukowinski, we received a detailed procedure for avoiding the service disruptions such a move can cause. Aleksander has also authored a whitepaper on this topic that will soon be available to our audience in VMware Partner Central. The paper also covers in detail an additional use case with Cisco Nexus 1000V and provides PowerShell and API call samples.

Depending on connectivity mode, we can have five different types of vApps in vCD: directly connected, routed, connected to routed vApp networks, isolated, and fenced. The migration process does not require shutting down the vApps; rather, it can generate brief network outages where the VMs are connected to a vCloud Director Edge Gateway, or no outage at all where the VMs use isolated networks with no dependency on the Edge.
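
The whitepaper's PowerShell samples are the authoritative reference; purely for illustration, a minimal PowerCLI sketch of the core operation, re-pointing a vApp's network adapters at a portgroup on the new distributed switch, might look like this. The vApp and portgroup names are hypothetical placeholders:

Connect-VIServer -Server vcenter.example.com

# Target portgroup on the new distributed switch (illustrative name).
$newPortgroup = Get-VDPortgroup -Name "Customer-PG-NewVDS"

# Re-point every adapter currently attached to the old portgroup.
Get-VM -Location (Get-VApp -Name "Customer-vApp") |
    Get-NetworkAdapter |
    Where-Object { $_.NetworkName -eq "Customer-PG-OldVDS" } |
    Set-NetworkAdapter -Portgroup $newPortgroup -Confirm:$false

Continue reading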

vRealize Automation Configuration with CloudClient for vCloud Air Network

As a number of vCloud Air Network service providers start to enhance their existing hosting offerings, VMware are seeing demand from service providers to offer a dedicated vRealize Automation implementation to their end-customers, enabling those customers to consume application services, heterogeneous cloud management and provisioning in a self-managed model.

This blog post details an implementation option where the vCloud Air Network service provider can offer “vRealize Automation as a Service” hosted in a vCloud Director vApp, with some additional automated configuration. This allows the service provider to offer vRealize Automation to their customers based out of their existing multi-tenancy IaaS platforms and achieve high levels of efficiency and economies of scale.

“vRealize Automation as a Service”

During a recent Proof of Concept demonstrating such a configuration, a vCloud Director Organization vDC was configured for tenant consumption. Within this Org vDC, a vApp containing a simple installation of vRealize Automation was deployed, consisting of a vRealize Automation appliance and one Windows Server hosting the IaaS components and an instance of Microsoft SQL Server. With vRealize Automation successfully deployed, the instance was customized leveraging vRealize CloudClient via Microsoft PowerShell scripts. Using this method for configuring the tenant within vRealize Automation reduced the deployment time for vRealize Automation instances, while ensuring that the tenant configuration was consistent and conformed to the pre-determined naming standards and conventions required by the provider.

[Figure: vRaaS vCAN operations]
Continue reading

Streamlining VMware vCloud Air Network Customer Onboarding with VMware NSX Edge Services

When migrating private cloud workloads to a public or hosted cloud provider, the methods used to facilitate customer onboarding can present some of the most critical challenges. The cloud service provider requires a method for onboarding tenants that reduces the need for additional equipment or contracts, which often create barriers for customers moving enterprise workloads onto a hosting or public cloud offering.

Customer Onboarding Scenarios

When a service provider is preparing for customer onboarding, there are a few options that can be considered. Some of the typical onboarding scenarios are:

  • Migration of live workloads
  • Offline data transfer of workloads
  • Stretching on-premises L2 networks
  • Remote site and user access to workloads

One of the most common scenarios is workload migration. For some implementations, this means migrating private cloud workloads to a public cloud or hosted service provider’s infrastructure. One path to migration leverages VMware vSphere® vMotion® to move live VMs from the private cloud to the designated CSP environment. In situations where this is not feasible, service providers can supply options for the offline migration of on-premises workloads where private cloud workloads that are marked for migration are copied to physical media, shipped to the service provider, and then deployed within the public cloud or hosted infrastructure. In some cases, migration can also mean the ability to move workloads between private cloud and CSP infrastructure on demand.

Continue reading

vCenter Server Scalability for Service Providers

Designing and architecting monster vCloud Air Network service provider environments takes VMware technology to its very limits, in terms of both scalability and complexity. vCenter Server, and its supporting services, such as SSO, are at the heart of the vSphere infrastructure, even in cloud service provider environments where a Cloud Management Platform (CMP) is employed to abstract the service presentation away from vCenter Server.

Meeting service provider scalability requirements with vCenter Server requires optimization at every level of the design, in order to implement a robust technical platform that can scale to its very limits whilst also maintaining operational efficiency and support.

This article outlines design considerations around optimization of Microsoft Windows vCenter Server instances and best practice recommendations, in order to maximize operational performance of your vCenter ecosystem, which is particularly pertinent when scaling over 400 host servers. Each item listed below should be addressed in the context of the target environment, and properly evaluated before implementation, as there is no one solution to optimize all vCenter Server instances.

The following is simply a list of recommendations that should, to some extent, improve performance in large service provider environments. This blog targets the Windows variant of vCenter Server 5.x and 6.x with a Microsoft SQL database, which is still the most commonly deployed configuration.

Warning: Some of the procedures and tasks outlined in this article are potentially destructive to data, and therefore should only be undertaken by experienced personnel once all appropriate safeguards, such as backed up data and a tested recovery procedure, are in place.

 

Part 1 – vCenter Server Operational Optimization

vCenter Server Sizing
vCloud Air Network service providers must ensure that the vCenter virtual system(s) are sized accordingly, based on their inventory size. Where vCenter components are separated and distributed across multiple virtual machines, ensure that all systems meet the sizing recommendations set out in the installation and configuration documentation.

vSphere 5.1: http://kb.vmware.com/kb/2021202
vSphere 5.5: https://www.vmware.com/support/pubs/vsphere-esxi-vcenter-server-pubs.html
vSphere 6.0: https://www.vmware.com/support/pubs/vsphere-esxi-vcenter-server-6-pubs.html

Distribute vCenter Services across multiple virtual machines (vSphere 5.5)
In vSphere 5.5, depending on inventory size, multiple virtual machines can be used to accommodate different vCenter roles. VMware recommends separating VMware vCenter, SSO Server, Update Manager and SQL for flexibility during maintenance and to improve scalability of the vCenter management ecosystem. The new architecture of vCenter 6 simplifies the deployment model, but also reduces design and scaling flexibility, with only two component roles to deploy.

Dedicated Management Cluster
For anything other than the smallest of environments, VMware recommends separating all vSphere management components onto a separate out-of-band management cluster. The primary benefits of management component separation include:

  • Facilitating quicker troubleshooting and problem resolution as management components are strictly contained in a relatively small and manageable cluster.
  • Providing resource isolation between workloads running in the production environment and the actual systems used to manage the infrastructure.
  • Separating the management components from the resources they are managing.

vCenter to Host operational latency
The number of network hops between the vCenter Server and the ESXi host affects operational latency. The ESXi host should reside as few network hops away from the vCenter Server as possible.

vCenter to SQL Server operational latency
The number of network hops between the vCenter Server and the SQL database also affects operational latency. Where possible, vCenter should reside on the same network segment as the supporting database. If appropriate, configure a DRS affinity rule to ensure that the vCenter Server and database server reside on the same ESXi host, reducing latency still further.
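
For illustration, a VM-VM affinity rule of this kind can be created with PowerCLI; this is a minimal sketch, and the cluster and VM names are hypothetical:

Connect-VIServer -Server vcenter.example.com

$cluster = Get-Cluster -Name "Management-Cluster"
$vms     = Get-VM -Name "vcenter01", "vcenter-sql01"

# VM-VM affinity rule: DRS keeps the vCenter Server and its database VM
# on the same ESXi host, minimizing network latency between them.
New-DrsRule -Cluster $cluster -Name "Keep-vCenter-with-SQL" -KeepTogether $true -VM $vms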

Java Max Heap Size 
vCloud Air Network service providers must ensure that the maximum heap size for the Java virtual machine is set correctly, based on the inventory size. Confirm that the JVM heap settings for vCenter Server, the Inventory Service, SSO and the Web Client are correct, and monitor the web services to verify. vSphere 5.1 & 5.5: http://kb.vmware.com/kb/2021302

Concurrent Client Connections
Whilst not always easy, attempt to limit the number of clients connected to vCenter Server, as this affects its performance. This is particularly true of the traditional Windows C# client.

Performance Monitoring
Employ a performance monitoring tool to ensure the health of the vCenter ecosystem and to help troubleshoot problems when they arise. Where appropriate, configure a vROps Custom Dashboard for vCenter/Management components. Also ensure appropriate alerts and notifications on performance monitoring tools exist.

Virtual disk type
All vCenter Server virtual machine VMDKs should be provisioned in the eagerZeroedThick format. This provides approximately a 10-20 percent performance improvement over the other two disk formats.
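
As a sketch, a new disk can be provisioned in this format with PowerCLI; the VM, size and datastore names below are illustrative. Converting an existing thin or lazy-zeroed disk instead requires an offline inflate or a Storage vMotion with a format change.

Connect-VIServer -Server vcenter.example.com

# Add an eager-zeroed thick disk to the vCenter VM.
Get-VM -Name "vcenter01" |
    New-HardDisk -CapacityGB 100 -StorageFormat EagerZeroedThick -Datastore "mgmt-ds01"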

vCenter vNIC type
vCloud Air Network service providers should employ the VMXNET3 paravirtualized network adapter to maximise network throughput and efficiency and to reduce latency.

ODBC Connection
Ensure that the vCenter and VUM ODBC connections are configured with the minimum permissions required for daily operations. Additional permissions are typically required during installation and upgrade activities, but not for day to day operations. Please refer to the Service Account Permissions provided below.

vCenter Logs Clean Up
vCenter Server has no automated way of purging old vCenter log files. These files can grow and consume a significant amount of disk space on the vCenter Server. Consider a scheduled task, run every three to six months, to delete or move log files older than the period defined by business requirements.

For instance, the VBScript below can be used to clean up old log files from vCenter. This script deletes files that are older than a fixed number of days, defined in line 9, from the path set in line 6. The script can be configured to run as a scheduled task using the Windows Task Scheduler.

Dim Fso
Dim Directory
Dim Modified
Dim Files
Set Fso = CreateObject("Scripting.FileSystemObject")
Set Directory = Fso.GetFolder("C:\ProgramData\VMware\VMware VirtualCenter\Logs\") ' Line 6: folder to clean
Set Files = Directory.Files
For Each Modified in Files
If DateDiff("D", Modified.DateLastModified, Now) > 180 Then Modified.Delete ' Line 9: age threshold in days
Next
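
On current Windows Server versions the same housekeeping can be done from a PowerShell scheduled task; a minimal equivalent sketch:

# Delete vCenter log files older than 180 days.
$logPath = "C:\ProgramData\VMware\VMware VirtualCenter\Logs\"
$cutoff  = (Get-Date).AddDays(-180)

Get-ChildItem -Path $logPath -File |
    Where-Object { $_.LastWriteTime -lt $cutoff } |
    Remove-Item -Force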

For more information, refer to KB article: KB1021804 Location of vCenter Server log files.
For additional information on modifying logging levels in vCenter please refer to KB1004795 and KB1001584.

Note: Once a log file reaches its maximum size, it is rotated and numbered in the style of component-nnn.log, and rotated files may be compressed.

Statistics Levels
The statistics collection interval determines the frequency at which statistic queries occur, the length of time statistical data is stored in the database, and the type of statistical data that is collected.

As historical performance statistics can take up to 90% of the vCenter Server database size, they are the primary factor in the performance and scalability of the vCenter Server database. Retaining this performance data allows administrators to view the collected historical statistics, through the performance charts in the vSphere Web Client, the traditional Windows client, or command-line monitoring utilities, for up to 1 year after the data was first ingested into the database.

You must ensure that statistics collection times are set as conservatively as possible so that the system does not become overloaded. For instance, you could set a new DB data retention period of 60 days and configure the DB not to retain performance data beyond 60 days. At the same time, it is equally important to ensure that the retention of this historical data meets the service provider's data compliance requirements.

As this statistics data consumes such a large proportion of the database, proper management of these vCenter Server statistics is an important consideration for overall database health. This is achieved by the processing of this data through a series of rollup jobs, which stop the database server from becoming overloaded. This is a key consideration for vCenter Server performance and is addressed in more detail in Part 2 of this article.
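
To gauge how much of the database the historical statistics actually consume, the space used by the four VPX_HIST_STAT tables can be queried; a sketch using the SqlServer PowerShell module, with illustrative instance and database names:

Import-Module SqlServer   # assumes the SqlServer (or SQLPS) module is installed

# Report the space used by each of the four historical statistics tables.
1..4 | ForEach-Object {
    Invoke-Sqlcmd -ServerInstance "sql01" -Database "VCDB" `
        -Query "EXEC sp_spaceused 'dbo.VPX_HIST_STAT$_';"
}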

Task and Events Retention
Operational teams should ensure that the Task and Events retention levels are set as conservatively as possible, whilst still meeting the service provider's data retention and compliance requirements. Every time a task or event is executed via vCenter, it is stored in the database. For example, a task is created when a user powers a virtual machine on or off, and an event is generated when something occurs, such as the vCPU usage for a VM changing to red.

vCenter Server has a Database Retention Policy setting that allows you to specify how long vCenter Server tasks and events are kept before being deleted. This correlates to a database rollup job that purges the data from the database after the selected period of time. Whilst these tables consume a relatively small amount of database space compared to statistical data, it is good practice to consider this option for further database optimization. For instance, by default vCenter is configured to store tasks and events data for 180 days. However, it might be possible, based on the service provider's compliance requirements, to configure vCenter not to retain event and task data in the database beyond 60 days.

vCenter Server Backup Best Practice
In addition to scheduling regular backups of the vCenter Server database, the backups for the vCenter Server should also include the SSL certificates and license key information.

 

Part 2 – SQL DB Server Operational Optimization (for vCenter Server)

SQL Database Server Disk Configuration
The vCenter Server database data file (mdf) generates mostly random I/O, while the database transaction logs (ldf) generate mostly sequential I/O. The traffic for these files is almost always simultaneous, so it is preferable to keep them on two separate storage resources that don't share disks or I/O. Therefore, where a large service provider inventory demands it, operational teams should ensure that the vCenter Server database uses separate drives for data and logs which, in turn, are backed by different physical disks.

tempDB Separation
For large service provider inventories, place tempDB on a different drive, backed by different physical disks than the vCenter database files or transaction logs.

Reduce Allocation Contention in SQL Server tempDB database
Consider using multiple data files to increase the I/O throughput to tempDB. Configure 1:1 alignment between TempDB files and vCPUs (up to eight) by spreading tempDB across at least as many equal sized files as there are vCPUs.

For instance, where 4 vCPUs exist on the SQL server, create three additional tempDB data files, and make them all equally sized. They should also be configured to grow in equal amounts. After changing the configuration, a restart of the SQL Server instance is required. For more information please refer to: http://support.microsoft.com/kb/2154845
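
For illustration, on a 4-vCPU SQL server the three additional files could be added as below. This is a sketch only; the instance name, file paths and sizes are assumptions to adjust for your environment:

Import-Module SqlServer

# Add three data files so tempDB has four equally sized files (one per vCPU).
$sql = @"
ALTER DATABASE tempdb ADD FILE
    (NAME = tempdev2, FILENAME = 'T:\TempDB\tempdb2.ndf', SIZE = 4GB, FILEGROWTH = 512MB),
    (NAME = tempdev3, FILENAME = 'T:\TempDB\tempdb3.ndf', SIZE = 4GB, FILEGROWTH = 512MB),
    (NAME = tempdev4, FILENAME = 'T:\TempDB\tempdb4.ndf', SIZE = 4GB, FILEGROWTH = 512MB);
"@
Invoke-Sqlcmd -ServerInstance "sql01" -Query $sql

# Restart the SQL Server instance afterwards for the change to take effect.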

Database Connection Pool
vCenter Server starts, by default, with a database connection pool of 50 threads. This pool is then sized dynamically according to the vCenter Server's workload. If high load is expected due to a large inventory, the size of the pool can be increased to 128 threads, although this will increase the memory consumption and load time of the vCenter Server. To change the pool size, edit the vpxd.cfg file, adding the entry below, where '128' is the number of connection threads to be configured.

<vpxd>
  <odbc>
    <maxConnections>128</maxConnections>
  </odbc>
</vpxd>

Table Statistics
Update the statistics of the SQL tables and indexes on a regular basis for better overall performance of the database. Create an SQL Agent job to carry out this task, or make it part of the vSphere database maintenance plan. http://sqlserverplanet.com/dba/update-statistics
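
As a minimal sketch, the built-in sp_updatestats procedure can be run on a schedule to refresh statistics across the vCenter database; the instance and database names are illustrative:

Import-Module SqlServer

# Refresh optimizer statistics for every table in the vCenter database.
Invoke-Sqlcmd -ServerInstance "sql01" -Database "VCDB" -Query "EXEC sp_updatestats;"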

Index Fragmentation (Not Applicable to vCenter 5.1 or newer)
Check for fragmentation of index objects and recreate indexes if needed. This happens with vCenter due to statistics rollups. Defragment once fragmentation exceeds 30%. See KB1003990.

Note: With the new enhancements and design changes made in the vCenter Server 5.1 database and later, this is no longer applicable or required.

Database Recovery Model
Depending on your vCenter database backup methodology, consider setting the database to the SIMPLE recovery model. This model reduces the disk space needed for the transaction logs as well as decreasing I/O load.

Choosing the Recovery Model for a Database: http://msdn.microsoft.com/en-us/library/ms175987(SQL.90).aspx
How to view or Change the Recovery Model of a Database in SQL Server Management Studio: http://msdn.microsoft.com/en-us/library/ms189272(SQL.90).aspx
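
A sketch of the change in T-SQL, run here via the SqlServer PowerShell module; the database name VCDB is an assumption, and this should only be done if your backup strategy does not depend on transaction log backups:

Import-Module SqlServer

# Switch the vCenter database to the SIMPLE recovery model.
Invoke-Sqlcmd -ServerInstance "sql01" -Query "ALTER DATABASE [VCDB] SET RECOVERY SIMPLE;"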

Virtual Disk Type
Where the vCenter Server database server is a virtual machine, ensure that all VMDKs are provisioned in the eagerZeroedThick format. This option provides approximately a 10-20 percent performance improvement over the other two disk formats.

Verify SQL Rollup Jobs
Ensure all the SQL Agent rollup jobs have been created on the SQL server during the vCenter Server Installation. For instance:

  • Past Day stats rollup
  • Past Week stats rollup
  • Past Month stats rollup

For the full set of stored procedures and jobs please refer to the appropriate article below. Where necessary, recreate MSSQL agent rollup jobs. Note that detaching, attaching, importing, and restoring a database to a newer version of MSSQL Server does not automatically recreate these jobs. To recreate these jobs, if missing, please refer to: KB1004382.

KB 2033096 (vSphere 5.1, 5.5 & 6.0): http://kb.vmware.com/kb/2033096
KB 2006097 (vSphere 5.0): http://kb.vmware.com/kb/2006097

Also, ensure that these jobs reference the vCenter Server database, and not the master or some other database. If they reference any other database, you must delete and recreate the jobs.

Ensure database jobs are running correctly
Monitor scheduled database jobs to ensure they are running correctly. For more information, refer to KB article: Checking the status of vCenter Server performance rollup jobs: KB2012226
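
For illustration, the presence and state of the rollup jobs can be checked against msdb; the LIKE pattern below assumes the default job names and may need adjusting for your environment:

Import-Module SqlServer

# List the stats rollup jobs registered with the SQL Agent.
Invoke-Sqlcmd -ServerInstance "sql01" -Database "msdb" -Query @"
SELECT name, enabled, date_modified
FROM dbo.sysjobs
WHERE name LIKE '%stats rollup%';
"@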

Verify MSSQL Permissions
Ensure that the required local SQL and AD permissions are in place and align with the principle of least privilege (see Part 3 below). If necessary, truncate all unrequired performance data from the database (see Truncate all performance data from vCenter Server, below). For more information, refer to KB article: Reducing the size of the vCenter Server database when the rollup scripts take a long time to run KB1007453

Truncate all performance data from vCenter Server
As discussed in Part 1, to truncate all performance data from vCenter Server 5.1 and 5.5:

Warning: This procedure permanently removes all historical performance data. Take a backup of the database/schema before proceeding.

  1. Stop the VMware VirtualCenter Server service. Note: Ensure that you have a recent backup of the vCenter Server database before continuing.
  2. Log in to the vCenter Server database using SQL Management Studio.
  3. Copy and paste the contents of the SQL_truncate_5.x.sql script (available from the link below) into SQL Management Studio.
  4. Execute the script to delete the data.
  5. Restart the vCenter Server services.

For truncating data in vCenter Server and vCenter Server Appliance 5.1, 5.5, and 6.0, see Selective deletion of tasks, events, and historical performance data in vSphere 5.x and 6.x (2110031)

Shrink Database
After purging historical data from the database, optionally shrink the database. This is an online procedure to reduce the database size and free up space on the VMDK; however, this activity will not in itself improve performance. For more information, refer to: Shrinking the size of the VMware vCenter Server SQL database KB1036738

For further information on Shrinking a Database, refer to: http://msdn.microsoft.com/en-us/library/ms189080.aspx
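
As a one-off sketch (database name assumed; note the warning against shrinking as routine maintenance in the maintenance plan section below):

Import-Module SqlServer

# Shrink the vCenter database after purging historical data,
# leaving 10 percent free space. This can take a long time on large databases.
Invoke-Sqlcmd -ServerInstance "sql01" -Query "DBCC SHRINKDATABASE ([VCDB], 10);" -QueryTimeout 3600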

Rebuilding indexes to Optimize the performance of SQL Server
Configure a regular maintenance job to rebuild indexes. See KB2009918.

  1. For a vCenter Server 5.1 or 5.5 database, download and extract the .sql files from the 2009918_rebuild_51.zip file attached to KB2009918.
  2. Back up your vCenter Server database before proceeding. For more information, see Backing up and restoring vCenter Server 4.x and 5.x (1023985).
  3. Connect to the vCenter Server database using SQL Server Management Studio. The following steps must be performed against the vCenter database, not the master database.
  4. Execute the .sql file to create the REBUILD_INDEX stored procedure.
  5. Execute the stored procedure that was created in the previous step: execute REBUILD_INDEX

VPX_HIST_STAT Table Sizes
VMware recommend a fill factor of 70% for the four VPX_HIST_STAT tables. If the fill factor is set too high for the resources available on the database server, SQL Server will spend time splitting pages, which equates to additional I/O.

If you are experiencing high unexplained I/O in the environment, monitor the SQL Server Access Methods object: Page Splits/sec. Page splits are expensive, and cause your table to perform more poorly due to fragmentation. Therefore, the fewer page splits you have the better your system will perform.

Decreasing the fill factor in your indexes increases the amount of empty space on each data page. The more empty space there is, the fewer page splits you will experience. On the other hand, too much unnecessary empty space can also hurt performance, because less data is stored per page: it takes more disk I/O to read tables, and less data can be stored in the buffer cache.

High Page Splits/sec will result in the database being larger than necessary and having more pages to read during normal operations.
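
A sketch of both halves of this recommendation: rebuilding the statistics tables' indexes at the recommended fill factor, then watching the page split rate. The instance and database names are illustrative, and the counter path assumes a default SQL instance:

Import-Module SqlServer

# Rebuild the indexes on the four historical statistics tables at a 70% fill factor.
$sql = (1..4 | ForEach-Object {
    "ALTER INDEX ALL ON dbo.VPX_HIST_STAT$_ REBUILD WITH (FILLFACTOR = 70);"
}) -join "`n"
Invoke-Sqlcmd -ServerInstance "sql01" -Database "VCDB" -Query $sql -QueryTimeout 3600

# Sample Page Splits/sec for one minute under normal load.
Get-Counter -Counter '\SQLServer:Access Methods\Page Splits/sec' -SampleInterval 5 -MaxSamples 12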

To determine where growth is occurring in the VMware vCenter Server database, refer to: http://kb.vmware.com/kb/1028356

For troubleshooting VPX_HIST_STAT table sizes in VMware vCenter Server 5, refer to: KB2038474

To reduce the size of the vCenter Server database when the rollup scripts take a long time to run, refer to: KB1007453

Monitor Database Growth
Service provider operational teams should monitor vCenter Server database growth over a period of time to ensure the database is functioning as expected. For more information, refer to KB article: Determining where growth is occurring in the vCenter Server database KB1028356

Schedule and verify regular database backups
The vCenter, SSO, VUM and SRM servers are by themselves stateless. The databases are far more critical, since they store all the configuration and state information for each of the management components. These databases must be backed up nightly, and the restore process for each database needs to be tested periodically.

Operational teams should ensure that a schedule of regular backups exists for the vCenter database and, based on the requirements of the business, periodically restore and mount databases from backup onto a non-production system to ensure a clean recovery is possible, should database corruption or data loss occur in the production environment.

Create a Maintenance Plan for vSphere databases
Work with the DBAs to create a daily and weekly database maintenance plan. For instance:

  • Check Database Integrity
  • Rebuild Index
  • Update Statistics
  • Back Up Database (Full)
  • Maintenance Cleanup Task

Warning: DO NOT SHRINK DB IN MAINTENANCE PLAN UNLESS THERE IS A SPECIFIC REQUIREMENT TO RECLAIM DISK SPACE: http://msdn.microsoft.com/en-us/library/ms189080.aspx

 

Part 3 – Service Account Permissions (Least Privilege)

vCenter Service Account
Required by the ODBC connection for access to the database, the vCenter service account must be configured with db_owner privileges on the vCenter database for normal operational use. However, the database account used to make the ODBC connection also requires the db_owner role on the MSDB system database during installation or upgrade of vCenter Server. This permission facilitates the installation of the SQL Agent jobs for the vCenter statistics rollups.

Typically, the DBA should only grant the vCenter service account the db_owner role on the MSDB System database when installing or upgrading vCenter, then revoke that role when these activities are complete.
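
For illustration, the grant and subsequent revoke could look like this in T-SQL, run here via the SqlServer PowerShell module. The login name is hypothetical, and SQL Server 2008 would use sp_addrolemember/sp_droprolemember instead:

Import-Module SqlServer

# Before install/upgrade: grant the vCenter service account db_owner on msdb.
Invoke-Sqlcmd -ServerInstance "sql01" -Database "msdb" `
    -Query "ALTER ROLE db_owner ADD MEMBER [DOMAIN\svc-vcenter];"

# After install/upgrade completes: revoke the role again.
Invoke-Sqlcmd -ServerInstance "sql01" -Database "msdb" `
    -Query "ALTER ROLE db_owner DROP MEMBER [DOMAIN\svc-vcenter];"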

RSA_DBA (vSphere 5.1 Only)
Only required for SSO 5.1, the RSA_DBA account is a local SQL account used for creating the schema (DDL) and requires db_owner permissions.

RSA_USER (vSphere 5.1 Only)
Only required for SSO 5.1, the RSA_USER account reads and writes data (DML only).

VUM Service Account
Although it runs on 64-bit Windows, VUM requires a 32-bit ODBC connection, created from "C:\Windows\SysWOW64\odbcad32.exe". The VUM service account must be provided with the db_owner permission on the VUM database. The installation of vCenter Update Manager 5.x and 6.x with a Microsoft SQL back-end database also requires the ODBC connection account to temporarily have db_owner permissions on the MSDB system database. This was a new requirement in vSphere 5.0.

As with the vCenter service account, typically the DBA would only grant the VUM service account the db_owner role on the MSDB system database during an install or upgrade of the VUM component of vCenter. This permission should then be revoked when that task has been completed.

Leveraging Virtual SAN for Highly Available Management Clusters

A pivotal element in each Cloud Service Provider service plan is the class of service being offered to the tenants. The number of moving parts in a data center raises legitimate questions about the reliability of each component and its influence on the overall solution. Cloud infrastructure and services are built on the traditional three pillars: compute, networking and storage, assisted by security and availability technologies and processes.

The Cloud Management Platform (CMP) is the management foundation for VMware vCloud® Air Network™ providers with a critical set of components that deliver a resilient environment for vCloud consumers.

This blog post highlights how a vCloud Air Network provider can leverage VMware Virtual SAN™ as a cost effective, highly available storage solution for cloud services management environments, and how the availability requirements set by the business can be achieved.

Management Cluster

A management cluster is a group of hosts joined together and reserved for powering the components that provide infrastructure management services to the environment, some of which include the following:

  • VMware vCenter Server™ and database, or VMware vCenter Server Appliance™
  • VMware vCloud Director® cells and database
  • VMware vRealize® Orchestrator™
  • VMware NSX® Manager™
  • VMware vRealize Operations Manager™
  • VMware vRealize Automation™
  • Optional infrastructure services to adapt the service provider offering (LDAP, NTP, DNS, DHCP, and so on)

To help guarantee predictable reliability, steady performance, and separation of duties as a best practice, a management cluster should be deployed over an underlying layer of dedicated compute and storage resources without having to compete with business or tenant workloads. This practice also simplifies the approach for data protection, availability, and recoverability of the service components in use on the management cluster.

[Figure 1: Leveraging Virtual SAN for highly available management clusters]

Rationale for a Software-Defined Storage Solution

The use of traditional storage devices in the context of the Cloud Management Platform requires the purchase of dedicated hardware to provide the necessary workload isolation, performance, and high availability.

In the case of a Cloud Service Provider, the cost and management complexity of these assets would most likely be passed on to the consumer through service costs, at the risk of a less competitive solution offering. Virtual SAN can dramatically reduce cost and complexity for this dedicated management environment. Some of the key benefits include the following:

  • Reduced management complexity because of the native integration with VMware vSphere® at the hypervisor level and access to a common management interface
  • Independence from shared or external storage devices, because it abstracts the hosts' locally attached storage and presents it as a uniform datastore to the virtual machines
  • Granular virtual machine-centric policies which allow you to tune performance on a per-workload basis.

Availability as a Top Requirement

Availability is defined as "The degree to which a system or component is operational and accessible when required for use" [IEEE 610]. It is commonly calculated as a percentage, and often measured in terms of the number of 9s.

Availability = Uptime / (Uptime + Downtime)

To calculate the overall availability of a complex system, multiply the availability of each component together:

Overall Availability = Element#1(availability %) * Element#2(availability %) * … * Element#n(availability %)
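
As a worked example, a chain of three dependent components, two at four 9s and one at five 9s, yields about 99.979% overall; a quick sketch of the arithmetic:

# Overall availability of a chain of dependent components (illustrative values).
$components = 0.9999, 0.9999, 0.99999   # two components at four 9s, one at five 9s
$overall = $components | ForEach-Object -Begin { $p = 1.0 } -Process { $p *= $_ } -End { $p }
"{0:P4}" -f $overall   # approximately 99.9790 %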

 

Number of 9s | Availability % | Downtime/year | System/component inaccessible
1            | 90%            | 36.5 days     | Over 5 weeks per year
2            | 99%            | 3.65 days     | Less than 4 days per year
3            | 99.9%          | 8.76 hours    | About 9 hours per year
4            | 99.99%         | 52.56 minutes | About 1 hour per year
5            | 99.999%        | 5.26 minutes  | About 5 minutes per year
6            | 99.9999%       | 31.5 seconds  | About half a minute per year

When defining the level of service for its offering, the Cloud Service Provider will take this data into account and compute the expected availability of the systems provided. In this way, the vCloud consumer is able to correctly plan the positioning of their own workloads depending on their criticality and the business needs.

In a single or multi-tenant scenario, because the management cluster is transparent to the vCloud consumers, the class of service for this set of components is critical for delivering a resilient environment. If any Service Level Agreement is defined between the Cloud Service Provider and the vCloud consumers, the level of availability for the CMP should match, or at least be comparable to, the highest requirement defined across the SLAs, to keep both the management cluster and the resource groups in the same availability zone.

Virtual SAN and High Availability

To support a critical management cluster, the underlying SDS solution must fulfill strict high availability requirements. Some of the key elements of Virtual SAN include the following:

  • Distributed architecture implementing a software-based data redundancy, similar to hardware-based RAID, by mirroring the data, not only across storage devices, but also across server hosts for increased reliability and redundancy
  • Data management based on data containers: logical objects carrying their own data and metadata
  • Intrinsic cost advantage by leveraging commodity hardware (physical servers and locally-attached flash or hard disks) to deliver mission critical availability to the overlying workloads
  • Seamless ability to scale out capacity and performance by adding more nodes to the Virtual SAN cluster, or to scale up by adding new drives to the existing hosts
  • Tiered storage functionality through the combination of storage policies, disk group configurations, and heterogeneous physical storage devices

Virtual SAN allows a storage policy configuration defining the number of failures to tolerate (FTT), which determines the number of copies of the virtual machine objects stored across the cluster (FTT = 1 means two copies). This policy can increase or decrease the level of redundancy of the objects and their degree of tolerance to the loss of one or more nodes of the cluster.
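
Such a policy can be defined programmatically as well as in the Web Client; a minimal PowerCLI SPBM sketch, assuming a connected vCenter and an illustrative policy name:

Connect-VIServer -Server vcenter.example.com

# Build a storage policy that keeps three copies of each object (FTT = 2).
$ftt  = Get-SpbmCapability -Name "VSAN.hostFailuresToTolerate"
$rule = New-SpbmRule -Capability $ftt -Value 2
$set  = New-SpbmRuleSet -AllOfRules $rule
New-SpbmStoragePolicy -Name "Mgmt-FTT2" -AnyOfRuleSets $set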

Virtual SAN also supports and integrates with VMware vSphere® availability features, including the following:

  • In case of a physical system failure, vSphere HA powers up the virtual machines on the remaining hosts
  • VMware vSphere Fault Tolerance (FT) provides continuous availability for virtual machines (applications) up to a limited size of 4 vCPUs and 64 GB RAM
  • VMware vSphere Data Protection™ provides a combination of backup and restore features for both virtual machines and applications

Blog - Leveraging VSAN for HA management clusters_2

Architecture Example

This example provides a conceptual system design for an architecture implementing a CMP in a cloud service provider scenario with basic resiliency, supported by Virtual SAN. The key elements of this design include the following:

  • Management cluster located in a single site
  • Two fault domains identified by the rack placement of the servers
  • A Witness to achieve a quorum in case of a failure, deployed on a dedicated virtual appliance (a Witness Appliance is a customized nested ESXi host designed to store objects and metadata from the cluster, pre-configured and available for download from VMware)
  • Full suite of management products, including optional CSP-related services
  • Virtual SAN general rule for failure to tolerate set to the value of 1 (two copies per object)
  • vSphere High Availability feature enabled for the relevant workloads

This example is a starting point that can provide an overall availability close to four 9’s, or 99.99%. Virtual SAN provides greater availability rates by increasing the number of copies per object (FTT) and the number of fault domains.

Some of the availability metrics for computing overall availability are variable and lie outside the scope of this blog post, but they can be summarized as the following:

  • Rack (power supplies, cabling, top of rack network switches, and so on)
  • Host (physical server and hardware components)
  • Hard disks MTBF (both SSD and spindle)
  • Hard disks capacity and performance (influence rebuild time)
  • Selection of the FTT, which influences the required capacity across the management cluster

Blog - Leveraging VSAN for HA management clusters_3

The complete architecture example will be documented and released as part of the VMware vCloud Architecture Toolkit for Service Providers in Q1 2016.

 

vCloud Director with Virtual SAN Sample Use Case

This brief, high-level implementation example provides a sample use case for VMware Virtual SAN in a vCloud Director for Service Providers environment.

As outlined in the illustration below, each Provider Virtual Data Center / Resource Cluster has been configured with a Virtual SAN datastore that meets the specific capability requirements set out by the Service Level Agreement (SLA) for that tier of service.

In this example, the service provider is deploying three tiers of offerings: Gold, Silver and Bronze. The compute consolidation ratio and the Virtual SAN capability, based on the disk group configuration and storage policy, define how the offering will perform for a consumer. In addition, although not shown in the configuration below, NIOC and QoS are being employed by the service provider to ensure that an appropriate balance of network resources is assigned, based on tier of service. This requires the configuration of three separate tiered VLANs for Virtual SAN traffic (Gold, Silver and Bronze), with traffic priorities configured accordingly, as sketched below.
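
A sketch of the tiered VLAN plumbing with PowerCLI; the switch name and VLAN IDs are illustrative placeholders:

Connect-VIServer -Server vcenter.example.com

# Three VLAN-tagged distributed portgroups, one per Virtual SAN service tier.
$vds = Get-VDSwitch -Name "vds-storage"
New-VDPortgroup -VDSwitch $vds -Name "vsan-gold"   -VlanId 101
New-VDPortgroup -VDSwitch $vds -Name "vsan-silver" -VlanId 102
New-VDPortgroup -VDSwitch $vds -Name "vsan-bronze" -VlanId 103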

The exact disk configuration will vary depending on hardware manufacturer and provider SLAs.

Logical Design Overview

[Figure: Logical design overview]

The full VMware technology solution stack is illustrated below.

[Figure: Virtual SAN with vCloud Director solution stack]

The above figure shows how the solution is constructed on VMware technologies. The core vSphere platform provides the storage capability through Virtual SAN, which in turn is abstracted by vCloud Director. The Virtual SAN disk group configuration across the hosts, along with the storage policy configured at the vSphere level, defines the performance and capacity capabilities of the distributed datastore, which in turn is used to define the SLAs for this tier of the cloud offering.

As illustrated above, the vSphere resources are abstracted by vCloud Director into a Provider Virtual Data Center (PvDC). These resources are then further carved up into individual Virtual Data Centers (vDCs), assigned to Organization tenants. The overall result is that the vApps residing within the Organization vDCs inherit the Virtual SAN storage capability defined by the service provider.

Typically, though outside the scope of this discussion, tiered service offerings are defined by more than just storage capability: vCPU consolidation ratios, levels of guaranteed memory, network resources, backups and so on will all be employed by a service provider to define the SLAs.

As I develop this use case for the service providers I'm working with, I will update this article further.