
Monthly Archives: December 2011

Using both Storage I/O Control & Network I/O Control for NFS

Many of these blog articles arise from conversations I have with folks both internally at VMware & externally in the community. This post is another such example. What I really like about this job is that it gets me thinking about a lot of things I normally take for granted. The question this time was around using both Storage I/O Control (SIOC) & Network I/O Control (NIOC) for NFS traffic & virtual machines residing on NFS datastores, and whether they could possibly step on each other's toes, so to speak.

The answer is no, the technologies are complementary. Let me try to explain how.

First off, let's have a brief overview of what the technologies do.

Intro to Storage I/O Control (SIOC)

SIOC was covered in a previous blog post. Details can be found here – http://blogs.vmware.com/vsphere/2011/09/storage-io-control-enhancements.html. In a nutshell, if SIOC detects that a pre-defined latency threshold for a particular datastore has been exceeded, it will throttle the amount of I/O a VM can queue to that datastore based on a 'shares' mechanism. When the contention is alleviated, SIOC stops throttling and VMs can use the datastore again without restriction. This avoids the 'noisy neighbor' problem, where one VM can hog all the bandwidth to a shared datastore. The point to note here is that SIOC works on a per-VM basis, and deals with datastore objects.
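As a rough illustration of the shares mechanism (this is a toy sketch, not VMware's actual scheduler; all names and numbers are invented), share-based throttling can be modelled as: below the latency threshold every VM may fill the device queue, and above it the queue is divided in proportion to shares.

```python
# Toy sketch of SIOC-style share-based throttling. When datastore latency
# exceeds the threshold, each VM's queue depth is scaled by its shares;
# otherwise every VM may use the full queue.

def throttle(vms, total_queue_depth, latency_ms, threshold_ms=30):
    """vms: dict of name -> shares. Returns per-VM queue depth."""
    if latency_ms <= threshold_ms:
        # No contention: no throttling.
        return {name: total_queue_depth for name in vms}
    total_shares = sum(vms.values())
    return {name: max(1, total_queue_depth * shares // total_shares)
            for name, shares in vms.items()}

allocation = throttle({"vm-high": 2000, "vm-normal": 1000, "vm-low": 500},
                      total_queue_depth=64, latency_ms=45)
print(allocation)  # vm-high gets ~2x vm-normal and ~4x vm-low
```

Note that throttling is conditional on the latency threshold: with `latency_ms` at or below the threshold, every VM keeps the full queue depth.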

SIOC was first introduced in vSphere 4.1, but only for block storage devices (FC, iSCSI, FCoE). In vSphere 5.0, we introduced SIOC support for NFS datastores.

Intro to Network I/O Control (NIOC)

There is a nice overview of NIOC on the networking blog here – http://blogs.vmware.com/networking/2010/07/got-network-io-control.html. Again, in a nutshell, NIOC allows you to define a guaranteed bandwidth for different vSphere network traffic types.

NIOC uses a software approach to partitioning physical network bandwidth among the different types of network traffic flows. For example, you can guarantee a minimum NFS bandwidth/latency when a vMotion operation is initiated on the same network & prevent the vMotion operation from having an impact on the NFS traffic flow. The point to note here is that NIOC is working on a network traffic stream, e.g. NFS, and deals with NIC ports.
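The partitioning idea can be sketched as follows (an illustrative model, not the actual vSphere scheduler; traffic types, shares, and the optional limits are invented for the example):

```python
# Toy sketch of NIOC-style bandwidth partitioning: traffic types receive
# share-proportional bandwidth only under contention; otherwise any type
# may burst to line rate.

def nioc_bandwidth(traffic, link_gbps, contention):
    """traffic: {type: (shares, limit_gbps or None)} -> entitled Gbps per type."""
    if not contention:
        return {t: link_gbps for t in traffic}
    total_shares = sum(shares for shares, _ in traffic.values())
    entitlements = {}
    for t, (shares, limit) in traffic.items():
        entitled = link_gbps * shares / total_shares
        entitlements[t] = min(entitled, limit) if limit is not None else entitled
    return entitlements

# Example: NFS and vMotion each hold 20 shares, VM traffic 10, on a 10Gb uplink.
print(nioc_bandwidth({"nfs": (20, None), "vmotion": (20, None), "vm": (10, None)},
                     10, contention=True))
```

This mirrors the key NIOC property called out above: guarantees only bite under congestion, and the full uplink is available otherwise.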

SIOC & NIOC Together

Let's take a scenario where there are multiple VMs spread across multiple ESXi hosts, all sharing the same NFS datastore.

i) SIOC Use Case

For quite a while, we have been able to provide bandwidth fairness to VMs running on the same host via SFQ, the start-time fair queueing scheduler. This scheduler ensures share-based allocation of I/O resources between VMs on a per-host basis. It is when VMs access the same datastore from different hosts that we need a distributed I/O scheduler. This is PARDA, Proportional Allocation of Resources for Distributed Storage Access. PARDA carves out the array queue amongst all the virtual machines sending I/O to the datastore on the array & adjusts the per-host, per-datastore queue size depending on the sum of the per-virtual-machine shares on the host.

If SIOC is enabled on the datastore, and the latency threshold on the datastore is surpassed because of the amount of disk I/O that the VMs are generating on the datastore, the I/O bandwidth allocated to the VMs sharing the datastores will be adjusted according to the share values assigned to the VMs.
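PARDA's central idea can be sketched in a few lines (a simplified model, not the real algorithm; host names, VM names, shares, and the array queue depth are invented):

```python
# Rough sketch of PARDA's proportional allocation: the array-side queue
# for a datastore is divided among hosts in proportion to the sum of the
# shares of the VMs each host runs against that datastore.

def parda_host_queues(array_queue_depth, hosts):
    """hosts: {hostname: {vm_name: shares}} -> per-host datastore queue depth."""
    host_shares = {h: sum(vms.values()) for h, vms in hosts.items()}
    total = sum(host_shares.values())
    return {h: array_queue_depth * s / total for h, s in host_shares.items()}

# host1 runs VMs totalling 3000 shares, host2 1000 shares, array queue is 256:
print(parda_host_queues(256, {"host1": {"vmA": 2000, "vmB": 1000},
                              "host2": {"vmC": 1000}}))
```

The host carrying three times the shares ends up with three times the queue depth, which is exactly the cross-host fairness SFQ alone cannot provide.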

ii) NIOC Use Case

But what if something impacts the NFS traffic flow? In this case, VM performance may be impacted not because of an over-committed datastore, but because there is not enough network bandwidth for the ESXi host to communicate with the NFS server. For instance, as mentioned at the beginning of the post, what if a vMotion operation was initiated (an operation which could consume up to 8Gbps of the network bandwidth), and impacted the other traffic on the same pipe, such as NFS? Yes, I know a best practice from VMware is to dedicate a NIC to vMotion traffic to avoid this exact situation, but this isn't always practical on 10Gb networks. In the case where vMotion, NFS and other traffic types share the same uplink, NIOC allows us to guarantee a minimum bandwidth on a per-traffic-type basis. The really cool thing is that when there is no congestion, network traffic can use *all* the available bandwidth of the uplink. And just for clarification, the uplink is actually a dvuplink, since NIOC can only be enabled on distributed switches. The feature is not available on stand-alone vSwitches.

Another important point which sometimes causes confusion: NFS traffic on the ESX host caused by a VM's disk I/O does not count towards that VM's portgroup bandwidth allocation should NIOC kick in. These are two distinct and separate network traffic types, the former being NFS and the latter being VM traffic.

Conclusion

There is no reason in my opinion not to use both SIOC and NIOC together. The technologies are complementary.


Get notified of these blog postings and more VMware Storage information by following me on Twitter: @VMwareStorage

Load Balancing with NFS and Round-Robin DNS

Those of you who have been using NFS with vSphere over the past number of years will be aware that VMware currently only supports NFS v3 over TCP. There is no multipathing with this version of NFS, and although NIC teaming can be used on the virtual switch, this is for failover purposes only.

To do some semblance of load balancing, one could mount NFS datastores via different network interfaces. For instance, NFS datastore1 could be mounted via controller1 on subnet A, and NFS datastore2 could be mounted via controller2 of the same NFS server on subnet B. This would allow you to balance the load, but is a very manual process. Could we automate this in any way?

What about using round-robin DNS where each request to resolve a Fully Qualified Domain Name (FQDN) would result in the DNS server supplying the next IP address in a list of IP addresses associated with that FQDN? Interestingly, I had this query twice last week.

First, some background on how NFS behaves in vSphere. If a user specifies the DNS name for an NFS server, we persist that DNS name in the vCenter DB. Once the datastore is instantiated on ESX, we resolve the DNS name once. So even if the datastore is temporarily unmounted and remounted (say via esxcli), we would use the same IP address. If the ESX host is restarted, or if the datastore is removed and re-added later, we would resolve the FQDN again, which may come back with a different IP address if the DNS server was configured to use round-robin.

Also note that DNS resolution is done on a per-datastore basis. We don't have a DNS name lookup cache in NFS that is shared between multiple mount points. Therefore different ESX hosts mounting the same NFS datastore may resolve to different IPs using round-robin. Mounting different datastores by FQDN from the same ESX server will cause each mount to resolve the FQDN, again possibly picking up a different IP under a round-robin DNS configuration.

So overall, DNS round-robin should work just fine if you want to do some automated load balancing with NFS.
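The resolution behaviour above can be mimicked with a toy resolver (a sketch only; the FQDN and controller addresses are invented, and a real DNS server rotates its A-record list per query):

```python
from itertools import cycle

# Toy model of round-robin DNS with NFS mounts: each fresh mount (or host
# reboot) resolves the FQDN once and then sticks with the returned IP, so
# successive mounts spread across the NFS server's interfaces.

class RoundRobinDNS:
    def __init__(self, a_records):
        self._records = cycle(a_records)

    def resolve(self, fqdn):
        # Hand out the next A record on every query.
        return next(self._records)

dns = RoundRobinDNS(["10.0.1.10", "10.0.2.10"])
# Four independent mounts alternate between the two controller addresses:
mounts = [dns.resolve("nfs.example.com") for _ in range(4)]
print(mounts)
```

The key point from the post is captured here: because each mount resolves the name exactly once and persists the result, load spreads across controllers without any per-packet balancing.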


VDS Best Practices – Operational aspects (Part 6 of 6)

Operational best practices

After customers successfully design the virtual network infrastructure, the next challenge is how to deploy the design and keep the network operational. VMware provides various tools, APIs, and procedures to help customers deploy and manage their network infrastructure effectively. The following are some key tools available in the vSphere platform:

  • CLI
  • vCenter API
  • Virtual Network Monitoring and Troubleshooting
    • NetFlow
    • Port Mirroring

In the following section, we will briefly discuss how vSphere and network administrators can utilize these tools to manage their virtual network. For more details on these tools please refer to the vSphere documentation.

Command Line Interface

vSphere administrators have several ways to access vSphere components through interface options that include the vSphere Client, vSphere Web Client, and vSphere Command-Line Interface (vCLI). The vSphere CLI command set allows you to perform configuration tasks using a vCLI package installed on supported platforms, or using the vSphere Management Assistant (vMA). Please refer to the Getting Started with vSphere CLI document for more details on the commands: http://www.vmware.com/support/developer/vcli. The entire networking configuration can be performed through the CLI, which helps administrators automate the deployment process.

 

vCenter API

The networking setup in the virtualized data center involves configuration of virtual and physical switches. To automate this configuration process, VMware has provided APIs that allow network switch vendors to get information about the virtual infrastructure. This information helps network switch vendors automate the configuration of the physical switches. For example, vCenter can trigger an event after the vMotion of a virtual machine is performed. After receiving this event trigger and related information, the network vendor can reconfigure the physical switch port policies such that when the VM moves to another host, the VLAN/Access Control List (ACL) configurations are migrated along with the VM. Multiple networking vendors have provided this automation between physical and virtual infrastructure configuration through integration with the vCenter APIs. Customers should check with their networking vendors to find out if such automation tools exist to bridge the gap between physical and virtual networking and simplify the operational challenges.
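The integration pattern can be sketched abstractly as an event handler (this is entirely hypothetical pseudocode for illustration; it is not a real vCenter or switch-vendor API, and all names are invented):

```python
# Hypothetical sketch of the pattern described above: a vendor tool
# receives a vMotion-completed event and moves the VM's port profile
# (VLAN/ACLs) from the physical port behind the source host to the one
# behind the destination host.

def on_vmotion_completed(event, port_profiles, switch_ports):
    """event: {'vm', 'src_host', 'dst_host'};
    port_profiles: vm -> {'vlan': ..., 'acls': [...]};
    switch_ports: host -> set of VM profiles applied to its uplink port."""
    switch_ports[event["src_host"]].discard(event["vm"])
    switch_ports[event["dst_host"]].add(event["vm"])
    return port_profiles[event["vm"]]  # profile now active on the new port

ports = {"esx01": {"web-vm"}, "esx02": set()}
profiles = {"web-vm": {"vlan": 100, "acls": ["allow-http"]}}
moved = on_vmotion_completed({"vm": "web-vm", "src_host": "esx01",
                              "dst_host": "esx02"}, profiles, ports)
print(ports)
```

The design point is that the network configuration follows the VM automatically, so the physical switch never holds stale VLAN/ACL state for a migrated workload.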

 

Virtual Network Monitoring and Troubleshooting

Monitoring and troubleshooting network traffic in a virtual environment requires tools similar to those available in the physical switch environment. With the release of vSphere 5, VMware provides network administrators the ability to monitor and troubleshoot the virtual infrastructure through features such as NetFlow and Port Mirroring.

NetFlow capability on a Distributed Switch along with a NetFlow collector tool helps monitor application flows and measures flow performance over time. It also helps in capacity planning and ensuring that I/O resources are utilized properly by different applications, based on their needs.

The port mirroring capability on a Distributed Switch is a valuable tool that helps network administrators in debugging network issues in a virtual infrastructure. The granular control over monitoring ingress, egress or all traffic of a port helps administrators fine-tune what traffic is sent for analysis.

Conclusion

vSphere Distributed Switch provides customers the right mix of features, capabilities and operational simplicity for deploying the virtual network infrastructure. As customers move on to build private or public clouds, VDS provides the scalability numbers for such deployments. Advanced capabilities such as NIOC and LBT are key for achieving better utilization of I/O resources and for providing better SLAs for virtualized business critical applications and multi-tenant deployments. Support for standard networking visibility and monitoring features such as Port Mirroring and NetFlow helps administrators manage and troubleshoot the virtual infrastructure through familiar tools. VDS is also an extensible platform that allows integration with other networking vendor products through the open vCenter APIs.

This is the final entry in the VDS best practices blog series. I would love to get your input on all the discussed VDS design options. As I mentioned earlier, customers are not limited to the discussed design options. Depending on their needs and available infrastructure, customers can either tweak these design options or come up with a new design for their deployments. Thanks for reading through these long posts.

 

Storage DRS and Storage Array Feature Interoperability

We've had a number of queries recently about how Storage DRS works with certain array-based features. The purpose of this post is to clarify how Storage DRS behaves when some of these features are enabled on the array.

The first thing to keep in mind is that Storage DRS is not going to recommend a Storage vMotion unless something is wrong on the datastore; either it is running out of space, or its performance is degrading.

Let's now look at the interoperability:

1. Thin Provisioned LUNs

If the array presents a Thin Provisioned LUN of 2TB which is backed by only 300GB physical, is Storage DRS aware of this when we make migration decisions? In other words, could we fill up a Thin Provisioned datastore if we choose it as a destination for a Storage vMotion operation, and it is already quite full?

Although Storage DRS is not aware that the LUN is Thin Provisioned, it still should not fill it up. The reason why is that in vSphere 5.0, a new set of VAAI features for Thin Provisioning were introduced. One of these features was to surface an alarm in vCenter when a Thin Provisioned datastore became 75% full on the back-end. If a datastore has this alarm surfaced, then Storage DRS will no longer consider it as a destination for Storage vMotion operations. This should prevent a Storage vMotion operation from ever filling up a Thin Provisioned datastore. In this case, if the 2TB Thin Provisioned datastore has 225GB of its 300GB already used, the alarm would be surfaced and Storage DRS would not consider placing any additional VMs on it.
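The destination-filtering rule described above can be sketched in a few lines (an illustrative model only; datastore names and sizes are invented, and the real check is driven by the vCenter alarm, not a direct ratio test):

```python
# Sketch of the rule described above: once the vCenter thin-provisioning
# alarm fires at 75% back-end usage, the datastore is no longer a
# candidate for Storage vMotion placement.

ALARM_THRESHOLD = 0.75

def eligible_destinations(datastores):
    """datastores: {name: (backend_used_gb, backend_capacity_gb)}."""
    return [name for name, (used, capacity) in datastores.items()
            if used / capacity < ALARM_THRESHOLD]

# The 2TB LUN backed by 300GB with 225GB used (exactly 75%) is excluded:
print(eligible_destinations({"tp-ds1": (225, 300), "tp-ds2": (100, 300)}))
```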

2. Deduplication & Compression

Many storage arrays use deduplication & compression as a space efficiency mechanism. Storage DRS is not dedupe aware, but this shouldn't be a cause for concern. For instance, if a VM is heavily deduped and Storage DRS recommends it for migration, Storage DRS does not know that the VM is deduped. Therefore the amount of space reclaimed from the source datastore will not be the full size of the VM. Also, when the VM is moved to the destination datastore, the VM will have to be inflated to full size. Later on, when the dedupe process runs (in many cases, this doesn't run in real-time), the array might be able to reclaim some space from dedupe, but the VM will be temporarily inflated to full size first.

But is this really a concern? Let's take the example of a VM that is 40GB in size, but thanks to dedupe is only consuming 15GB of data on disk. When SDRS decides to move this VM, it will find a datastore that can take 40GB (the inflated size of the VM). So that's not too much of an issue. What about the fact that SDRS is only going to gain 15GB of free space on the source datastore, as opposed to the 40GB it thought it was going to get? Well, that's not a concern either, because if the datastore still exceeds the space usage threshold after the VM is migrated, SDRS will migrate another VM from the datastore on the next run, and so on until the datastore space usage is below the threshold. So yes, it may take a few more iterations to handle deduped datastores, but it will still work just fine.
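The iterative behaviour can be sketched as follows (a toy model, not the SDRS algorithm; the capacities, threshold, and per-VM reclaimed amounts are invented for the example):

```python
# Sketch of the iterative rebalancing described above: on each run SDRS
# migrates another VM until the datastore drops below its space threshold.
# For a deduped VM, the space actually freed is less than its logical size.

def rebalance(used_gb, capacity_gb, freed_per_migration, threshold=0.8):
    """Returns (number_of_migrations, final_used_gb)."""
    moves = 0
    for freed in freed_per_migration:
        if used_gb / capacity_gb <= threshold:
            break  # usage is back under the threshold; stop migrating
        used_gb -= freed
        moves += 1
    return moves, used_gb

# 900GB used of 1000GB; two deduped VMs free only 15GB each, two others 40GB:
print(rebalance(900, 1000, [15, 15, 40, 40]))
```

With heavily deduped VMs the per-migration gain is small, so more iterations are needed, which is exactly the trade-off the paragraph above describes.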

And yes, it would be nice if Storage DRS understood that datastores were deduped/compressed, and this is something we are looking at going forward.

3. Tiered Storage

The issue here is that the Storage I/O Control (SIOC) injector (the utility which profiles the capabilities of the datastores for Storage DRS) might not understand the capabilities of tiered storage, i.e. if the injector hits the SSD tier, it might conclude that this is a very high performance datastore, but if it hits the SATA tier, it might conclude that this is a lower performance datastore. At this point in time, we are recommending that SDRS be used for initial placement of VMs and load balancing of VMs based on space usage only, and that the I/O metrics feature is disabled. We are looking into ways of determining the profile of a LUN built on tiered storage going forward, and allowing I/O metrics to be enabled.

I hope this gives you some appreciation of how Storage DRS can happily co-exist with various storage array features, and how in many ways the technologies are complementary. While we would agree that some of the behaviour is sub-optimal, and it would be better if Storage DRS were aware of these array-based features in its decision process, there is nothing that prevents Storage DRS from working with these features. Going forward, we do hope to add even more intelligence to Storage DRS so that it can understand these features and include them in its decision-making algorithms.


VDS Best Practices – Blade Server Deployments (Part 5 of 6)

Blade Server in Example Deployment

Blade servers are server platforms that provide higher server consolidation per rack unit along with the benefits of lower power and cooling costs. Blade chassis that host blade servers have proprietary architectures, and each vendor has its own way of managing resources in the chassis. It is not feasible to cover every blade chassis available in the market and explain its deployment in this document. In this section, we will focus on some generic parameters that customers should consider while deploying VDS in a blade chassis environment.

From the networking point of view all blade chassis provide the following two options:

  • Integrated switches: In this option, the blade chassis allows built-in switches to control traffic flow between blade servers within the chassis and external network.
  • Pass-through technology: This is an alternative method of network connectivity that allows the individual blade servers to communicate directly with the external network.

In this design, the integrated switch option is described where the blade chassis has a built-in Ethernet switch. This Ethernet switch acts as an Access layer switch as shown in Figure 1.

This section discusses the deployment where the ESXi host is running on a blade server. Two types of blade server configuration are described in the following sections:

  • Blade Server with Two 10 Gigabit Ethernet network adapters
  • Blade Server with hardware assisted multiple logical network adapters

For each of the above two configurations, the different VDS design approaches will be discussed.

 

Blade Server with Two 10 Gigabit Ethernet network adapters

This deployment is quite similar to the Rack Server with two 10 Gigabit Ethernet network adapters deployment where each ESXi host was presented with two 10 Gigabit network adapters. As shown in Figure 1, the ESXi host running on a blade server in the blade chassis is also presented with two 10 Gigabit Ethernet network adapters.


Figure 1: Blade Server with two 10 Gigabit NICs

In this section two design options are described: one is a traditional static approach, and the other is a VMware-recommended dynamic configuration with the NIOC and LBT features enabled. These two approaches mirror the deployment described in the Rack Server with two 10 Gigabit network adapters section. Only blade chassis specific design decisions are discussed here. For all other VDS and switch related configuration, readers are encouraged to refer to the Rack Server with two 10 Gigabit network adapters section of this document.

Design Option 1 – Static Configuration

The configuration of this design approach is exactly the same as described in the Design Option 1 section under Rack Server with two 10 Gigabit network adapters. Please refer to Table 1 below for the dvportgroup configuration details. Let's take a look at the blade server specific parameters that need attention during the design.

Table 1 Static design configuration

| Traffic Type | Port Group | Teaming Option | Active Uplink | Standby Uplink | Unused Uplink |
| --- | --- | --- | --- | --- | --- |
| Management | PG-A | Explicit Failover | dvuplink1 | dvuplink2 | None |
| vMotion | PG-B | Explicit Failover | dvuplink2 | dvuplink1 | None |
| FT | PG-C | Explicit Failover | dvuplink2 | dvuplink1 | None |
| iSCSI | PG-D | Explicit Failover | dvuplink1 | dvuplink2 | None |
| Virtual Machine | PG-E | LBT | dvuplink1, dvuplink2 | None | None |

 

The network and hardware reliability considerations should be incorporated during the blade server design as well. In these blade server designs, customers have to focus on the following two areas:

  • High availability of blade switches in the blade chassis
  • Connectivity of the blade server network adapters to internal blade switches.

High availability of blade switches can be achieved by having two Ethernet switching modules in the blade chassis. The two network adapters on the blade server should then be connected such that one network adapter goes to the first Ethernet switch module and the other to the second switch module in the blade chassis.

Another aspect that needs attention in a blade server deployment is the network bandwidth available across the midplane of the blade chassis and between the blade switches and the aggregation layer. If there is an oversubscription scenario in the deployment, customers should consider utilizing the traffic shaping and prioritization (802.1p tagging) features available in the vSphere platform. The prioritization feature allows customers to tag the important traffic coming out of the vSphere platform. These high-priority tagged packets are then treated according to priority by the external switch infrastructure. During congestion scenarios, the switch will drop lower-priority packets first and avoid dropping the important high-priority packets.

This static design option provides customers the flexibility of choosing different network adapters for different traffic types. However, when allocating traffic across only two 10 Gigabit network adapters, administrators end up scheduling multiple traffic types on a single adapter. As multiple traffic types flow through one adapter, the chances of one traffic type dominating the others go up. To avoid the performance impact of noisy neighbors (dominating traffic types), customers should utilize the traffic management tools provided in the vSphere platform. One such feature is NIOC, which is utilized in Design Option 2 described below.

 

Design Option 2 – Dynamic Configuration with NIOC and LBT

This dynamic configuration approach is exactly the same as described in the Design Option 2 section under Rack Server with two 10 Gigabit Ethernet network adapters. Please refer to Table 2 below for the dvportgroup configuration details and NIOC settings. The physical switch related configuration in the blade chassis deployment is the same as described in the rack server deployment. For the blade chassis specific recommendations on reliability and traffic management, please refer to the previous section.

Table 2 Dynamic design configuration

| Traffic Type | Port Group | Teaming Option | Active Uplink | Standby Uplink | NIOC Shares | NIOC Limits |
| --- | --- | --- | --- | --- | --- | --- |
| Management | PG-A | LBT | dvuplink1, dvuplink2 | None | 5 | - |
| vMotion | PG-B | LBT | dvuplink1, dvuplink2 | None | 20 | - |
| FT | PG-C | LBT | dvuplink1, dvuplink2 | None | 10 | - |
| iSCSI | PG-D | LBT | dvuplink1, dvuplink2 | None | 20 | - |
| Virtual Machine | PG-E | LBT | dvuplink1, dvuplink2 | None | 20 | - |

 

VMware recommends this design option, which utilizes the advanced VDS features and provides customers with a dynamic and flexible design approach. In this design, I/O resources are utilized effectively and Service Level Agreements are met based on the shares allocation.

Blade Server with Hardware-assisted Logical Network Adapters (HP Flex-10-like deployment)

Some of the newer blade chassis support traffic management capabilities that allow customers to carve up I/O resources. This is achieved by presenting logical network adapters to the ESXi hosts. Instead of two 10 Gigabit Ethernet network adapters, the ESXi host now sees multiple physical network adapters that operate at different configurable speeds. As shown in Figure 2, each ESXi host is presented with eight Ethernet network adapters that are carved out of two 10 Gigabit Ethernet network adapters.

Figure 2: Multiple logical network adapters

This deployment is quite similar to the Rack Server with eight 1 Gigabit Ethernet network adapters deployment. However, instead of 1 Gigabit network adapters, the capacity of each network adapter is configured at the blade chassis level. In the blade chassis, customers can carve out network adapters of different capacities based on the needs of each traffic type. For example, if iSCSI traffic needs 2.5 Gigabit of bandwidth, a logical network adapter with that amount of I/O resources can be created on the blade chassis and presented to the blade server.

As far as the configuration of the VDS and blade chassis switch infrastructure goes, the configuration described in Design Option 1 under Rack Server with eight 1 Gigabit network adapters is most relevant for this deployment. The static configuration option described in that design can be applied as is in this blade server environment. Please refer to Table 3 for dvportgroup configuration details, and to the switch configurations described in that section for physical switch configuration details.

Table 3 Static Design configuration

| Traffic Type | Port Group | Teaming Option | Active Uplink | Standby Uplink | Unused Uplink |
| --- | --- | --- | --- | --- | --- |
| Management | PG-A | Explicit Failover | dvuplink1 | dvuplink2 | 3,4,5,6,7,8 |
| vMotion | PG-B | Explicit Failover | dvuplink3 | dvuplink4 | 1,2,5,6,7,8 |
| FT | PG-C | Explicit Failover | dvuplink4 | dvuplink3 | 1,2,5,6,7,8 |
| iSCSI | PG-D | Explicit Failover | dvuplink5 | dvuplink6 | 1,2,3,4,7,8 |
| Virtual Machine | PG-E | LBT | dvuplink7, dvuplink8 | None | 1,2,3,4,5,6 |

 

Now the question is whether the NIOC capability adds any value in this specific blade server deployment. NIOC is a traffic management feature that helps in scenarios where multiple traffic types flow through one uplink or network adapter. If in this particular deployment only one traffic type is assigned to a specific Ethernet network adapter, then the NIOC feature will not add any value. However, if multiple traffic types are scheduled over one network adapter, then customers can make use of NIOC to assign appropriate shares to the different traffic types. This NIOC configuration will make sure that bandwidth resources are allocated to the traffic types and SLAs are met.

To illustrate this with an example, let's consider a scenario where vMotion and iSCSI traffic are carried over one 3 Gigabit logical uplink. To protect the iSCSI traffic from network-intensive vMotion traffic, administrators can configure NIOC and allocate shares to each traffic type. If both traffic types are equally important, administrators can configure shares with equal values (10 each). With this configuration, during a contention scenario NIOC will make sure that iSCSI traffic gets half of the 3 Gigabit uplink bandwidth, avoiding any impact from the vMotion process.

VMware recommends that the Network and Server administrators work closely together while deploying the traffic management features of the VDS and Blade Chassis. A lot of co-ordination is required during the configuration of the traffic management features to achieve the best end-to-end QoS result.

This concludes the different design options for the rack and blade server deployments with different network adapter configurations. I would love to get your feedback on these different design options and design guidelines. In the next blog entry I will talk about some operational aspects of VDS. Please stay tuned.

A question about slot sizes…

By Duncan Epping, Principal Architect, VMware

Yesterday I received a question on twitter:

Hi, to settle an argument in the office, if no reserves are in place, does number of vCPU’s affect slot size in vSphere 4? Thx :)

First of all, what is a slot? The availability guide explains it as follows:

A slot is a logical representation of the memory and CPU resources that satisfy the requirements for any powered-on virtual machine in the cluster.

In other words, a slot is the worst-case CPU and memory reservation scenario for any given virtual machine in a cluster. This slot is used when Admission Control is enabled and "Host Failures Cluster Tolerates" has been selected as the admission control policy. The total amount of available resources in the cluster is divided by the slot size, and that dictates how many VMs can be powered on without violating availability constraints. Meaning that it guarantees that every powered-on virtual machine can be failed over.

As said, this slot is dictated by the worst-case reservation for CPU and memory. Prior to vSphere 4.0 we used the number of vCPUs to determine the slot size for CPU as well, but we no longer do. The slot size for CPU is determined by the highest reservation, or 256MHz (vSphere 4.x and prior) / 32MHz (vSphere 5) if no reservation is set.

However, vCPUs can have an impact on your slot… on your memory slot size. If no reservation is set anywhere, HA will use the highest memory overhead in your cluster as the slot size for memory. This is where the number of vCPUs comes into play: the more vCPUs you add to a virtual machine, the higher your memory overhead will be.

I guess the answer to this question is: for CPU the number of vCPUs does not impact your slot size, but for memory it may.
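The rules above can be condensed into a small sketch (illustrative only; the per-VM reservation and overhead numbers are invented, and real HA overhead values depend on the VM's configuration):

```python
# Minimal sketch of the vSphere 5 slot-size rules described above.
# CPU slot: highest CPU reservation, with 32MHz substituted for VMs that
# have no reservation. Memory slot: highest memory reservation plus
# overhead; with no reservations this degenerates to the highest memory
# overhead, which grows with the number of vCPUs.

def slot_size(vms, cpu_default_mhz=32):
    cpu = max((vm["cpu_res_mhz"] or cpu_default_mhz) for vm in vms)
    mem = max(vm["mem_res_mb"] + vm["mem_overhead_mb"] for vm in vms)
    return cpu, mem

cluster = [
    {"cpu_res_mhz": 0, "mem_res_mb": 0, "mem_overhead_mb": 120},  # 1 vCPU VM
    {"cpu_res_mhz": 0, "mem_res_mb": 0, "mem_overhead_mb": 300},  # 8 vCPU VM
]
print(slot_size(cluster))  # CPU slot stays at the 32MHz default
```

Adding vCPUs raises the second VM's overhead and hence the memory slot, while the CPU slot is untouched, which is exactly the answer to the original question.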

VMware’s Software FCoE (Fibre Channel over Ethernet) Adapter

Introduction

In vSphere 5.0, VMware introduced a new software FCoE (Fibre Channel over Ethernet) adapter. This means that if you have a NIC which supports partial FCoE offload, this adapter will allow you to access LUNs over FCoE without needing a dedicated HBA or third party FCoE drivers installed on the ESXi host.

The FCoE protocol encapsulates Fibre Channel frames into Ethernet frames. As a result, your host can use 10 Gbit lossless Ethernet to deliver Fibre Channel traffic. The lossless part is important and I'll return to that later.

Configuration Steps

A Software FCoE Adapter is software code that performs some of the FCoE processing. The adapter can be used with a number of NICs that support partial FCoE offload. Unlike the hardware FCoE adapter, the software adapter needs to be activated on the ESXi 5.0 host, similar to how the Software iSCSI adapter is enabled. Go to Storage Adapters in the vSphere Client and click 'Add':

[Screenshot: adding the Software FCoE Adapter from the Storage Adapters view]

To use the Software FCoE Adapter, the NIC used for FCoE must be bound as an uplink to a vSwitch which contains a VMkernel portgroup (vmk). Since FCoE packets are exchanged in a VLAN, the VLAN ID must be set on the physical switch to which the adapter is connected, not on the adapter itself. The VLAN ID is automatically discovered during the FCoE Initialization Protocol (FIP) VLAN discovery process, so there is no need to set the VLAN ID manually.

[Screenshot: Software FCoE Adapter network configuration]

 

Enhancing Standard Ethernet to handle FCoE

For Fibre Channel to work over Ethernet, there are a number of criteria which must be addressed, namely losslessness/congestion handling and bandwidth management.

1. Losslessness & Congestion: Fibre Channel is a lossless protocol, so no frames can be lost. Since classic Ethernet has no flow control, unlike Fibre Channel, FCoE requires enhancements to the Ethernet standard to support a flow control mechanism that prevents frame loss. One of the problems with Ethernet networks is that when a congestion condition arises, packets will be dropped (lost) if there is no adequate flow control mechanism. A flow control method similar to the buffer-to-buffer credit method in Fibre Channel is needed for FCoE.

FCoE uses a flow control PAUSE mechanism, similar to buffer-to-buffer credits in FC, to ask a transmitting device to hold off sending any more frames until the receiving device signals that it is OK to resume. However, the PAUSE mechanism is not intelligent and could pause all traffic, not just FCoE traffic. To overcome this, the quality of service (QoS) priority bits in the VLAN tag of the Ethernet frame are used to differentiate the traffic types on the network. Ethernet can now be thought of as being divided into 8 virtual lanes based on the priority value in the VLAN tag.

Picture3

Different policies such as losslessness, bandwidth allocation and congestion control can be applied to these virtual lanes individually. If congestion arises and there is a need to 'pause' the Fibre Channel traffic (i.e. the target is busy processing and wants the source to hold off sending any more frames), then there must be a way of pausing the FC traffic without impacting other network traffic on the wire. Let's say that FCoE, VM traffic, vMotion traffic, management traffic & FT traffic are all sharing the same 10Gb pipe. If we have congestion with FCoE, we may want to pause it, but we don't want to pause all traffic, just FCoE. With standard Ethernet this is not possible; you have to pause everything. So we need an enhancement that pauses one class of traffic while allowing the rest to flow. PFC (Priority-based Flow Control), sometimes called Per-Priority PAUSE, is an extension of the current Ethernet pause mechanism that uses per-priority pause frames. This way we can pause traffic with a specific priority while allowing all other traffic to flow (e.g. pause FCoE traffic while other network traffic continues).
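The 'virtual lane' a frame belongs to is determined by the 3-bit priority (PCP) field of the 802.1Q tag. Here is a small Python sketch of how those bits are laid out in the 16-bit Tag Control Information field; the VLAN ID below is an invented example, and while priority 3 is the class commonly used for FCoE, treat that mapping as illustrative:

```python
def vlan_fields(tci: int):
    """Split a 16-bit 802.1Q Tag Control Information field into
    priority (PCP, 3 bits), drop-eligible (DEI, 1 bit) and VLAN ID (12 bits)."""
    pcp = (tci >> 13) & 0x7
    dei = (tci >> 12) & 0x1
    vid = tci & 0xFFF
    return pcp, dei, vid

# Priority 3 carrying FCoE on (for example) VLAN 200 -- both values
# are just an illustration, not a required configuration.
tci = (3 << 13) | 200
pcp, dei, vid = vlan_fields(tci)
assert (pcp, vid) == (3, 200)
```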

2. Bandwidth: there needs to be a mechanism to reduce or increase bandwidth. Again, with a 10Gb pipe, we want to be able to use as much of the pipe as possible when other traffic classes are idle. For instance, if we have allocated 1Gb of the 10Gb pipe for vMotion traffic, we want this to be available to other traffic types when there are no vMotion operations going on, and similarly we want to dedicate it to vMotion traffic when there are vMotions. Again, this is not achievable with standard Ethernet, so we need some way of implementing it. ETS (Enhanced Transmission Selection) provides a means to allocate bandwidth to traffic that has a particular priority, and the protocol supports changing the bandwidth dynamically.
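The idle-bandwidth redistribution behavior of ETS can be illustrated with a toy model in Python. The guarantee percentages below are invented for the example; real ETS is enforced per priority group in the NIC and switch hardware:

```python
def ets_allocate(guarantees: dict, demand: dict, link_gbps: float = 10.0) -> dict:
    """Toy model of ETS: each traffic class is guaranteed a percentage of
    the link, but bandwidth unused by idle classes is redistributed to
    the active ones in proportion to their guarantees."""
    active = [c for c in guarantees if demand.get(c, 0) > 0]
    total = sum(guarantees[c] for c in active)
    return {c: link_gbps * guarantees[c] / total for c in active}

guarantees = {"FCoE": 40, "vMotion": 10, "VM": 50}   # percent of link (example)
# With no vMotion running, its 10% is shared out to FCoE and VM traffic:
alloc = ets_allocate(guarantees, {"FCoE": 1, "VM": 1})
assert round(alloc["FCoE"], 2) == 4.44               # instead of the 4.0 Gbps floor
```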

DCBX – Data Center Bridging Exchange

Data Center Bridging Exchange (DCBX) is a protocol that allows devices to discover & exchange their capabilities with other attached devices. This protocol ensures a consistent configuration across the network. The three purposes of DCBX are:
 

  1. Discover Capabilities: The ability for devices to discover and identify capabilities of other devices.  

  2. Identify misconfigurations: The ability to discover misconfigurations of features between devices.  Some features can be configured differently on each end of a link whilst other features must be configured identically on both sides. This functionality allows detection of configuration errors. 

  3. Configuration of Peers: A capability allowing DCBX to pass configuration information to a peer.

DCBX relies on the Link Layer Discovery Protocol (LLDP) to pass this configuration information. LLDP is an industry standard counterpart of the Cisco Discovery Protocol (CDP) which allows devices to discover one another and exchange information about basic capabilities. This is why we need to bind a VMkernel port to a vSwitch: frames are forwarded via the CDP VMkernel module (which handles both CDP & LLDP) to the userworld dcbd process, which performs the DCBX negotiation. This is an important point to note: the FCoE data traffic does not go through the vSwitch. The vSwitch binding exists only so that frames can reach dcbd in userworld for DCBX negotiation; the vSwitch is NOT used for FCoE data traffic.
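For the curious, an LLDP frame body is just a list of TLVs, each with a 7-bit type and a 9-bit length packed into a 16-bit header; DCBX parameters travel in organizationally-specific TLVs (type 127). A hedged Python sketch of the TLV walk (not the actual dcbd code):

```python
import struct

def parse_lldp_tlvs(payload: bytes):
    """Walk the TLV list of an LLDP frame body. Each TLV header packs a
    7-bit type and a 9-bit length into 16 bits; DCBX rides in
    organizationally-specific TLVs (type 127)."""
    tlvs, offset = [], 0
    while offset + 2 <= len(payload):
        header = struct.unpack_from("!H", payload, offset)[0]
        tlv_type, tlv_len = header >> 9, header & 0x1FF
        if tlv_type == 0:           # End of LLDPDU
            break
        tlvs.append((tlv_type, payload[offset + 2: offset + 2 + tlv_len]))
        offset += 2 + tlv_len
    return tlvs

# A made-up two-TLV payload: Chassis ID (type 1) then End of LLDPDU (type 0).
payload = struct.pack("!H", (1 << 9) | 3) + b"abc" + struct.pack("!H", 0)
assert parse_lldp_tlvs(payload) == [(1, b"abc")]
```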

Once the Software FCoE adapter is enabled, a new adapter is created, and discovery of devices can now take place.

Picture4

 

References:

  • Priority Flow Control (PFC) – IEEE 802.1Qbb – Enable multiple traffic types to share a common Ethernet link without interfering with each other

  • Enhanced Transmission Selection (ETS) – IEEE 802.1Qaz – Enable consistent management of QoS at the network level by providing consistent scheduling

  • Data Center Bridging Exchange Protocol (DCBX) – IEEE 802.1Qaz – Management protocol for enhanced Ethernet capabilities

  • VLAN tag – IEEE 802.1q

  • Priority tag – IEEE 802.1p

 

Troubleshooting

The software FCoE adapter can be troubleshot using a number of techniques available in ESXi 5.0. First, there is the esxcli fcoe command namespace that can be used to get adapter & NIC information:

Esxcli-fcoe
The dcbd daemon on the ESXi host will record any errors with discovery or communication. This daemon logs to /var/log/syslog.log. For increased verbosity, dcbd can be run with the '-v' option. There is also a proc node created when the software FCoE adapter is created, which can be found in /proc/scsi/fcoe/<instance>. It contains interesting information about the FCoE devices discovered, and should probably be the starting point for FCoE troubleshooting. Another useful utility is ethtool. When this command is run with the '-S' option against a physical NIC, it displays statistics including FCoE errors & dropped frames. Very useful.
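As an illustration, filtering the FCoE counters out of `ethtool -S` output is easy to script. The counter names in the sample below are examples only; the exact names depend on the NIC driver:

```python
def fcoe_stats(ethtool_output: str) -> dict:
    """Pull the FCoE-related counters out of `ethtool -S vmnicX` output.
    Counter names vary by NIC driver; the ones below are just examples."""
    stats = {}
    for line in ethtool_output.splitlines():
        if ":" not in line:
            continue
        name, _, value = line.partition(":")
        name = name.strip()
        if "fcoe" in name.lower():
            stats[name] = int(value)
    return stats

sample = """NIC statistics:
     rx_packets: 104523
     fcoe_bad_fcs: 3
     fcoe_dropped_frames: 12
"""
assert fcoe_stats(sample) == {"fcoe_bad_fcs": 3, "fcoe_dropped_frames": 12}
```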

Get notification of these blog postings and more VMware Storage information by following me on Twitter: @VMwareStorage

VDS Best Practices – Rack Server Deployment with Two 10 Gigabit adapters (Part 4 of 6)

Rack Server with Two 10 Gigabit Ethernet network adapters

The two 10 Gigabit Ethernet network adapter deployment model is becoming very common because of the benefits it provides through I/O consolidation. The key benefits include better utilization of I/O resources, simplified management, and reduced CAPEX and OPEX. While this deployment provides these benefits, there are some challenges when it comes to traffic management. Especially in highly consolidated virtualized environments, where more traffic types are carried over fewer 10 Gigabit Ethernet network adapters, it becomes critical to prioritize the important traffic types and provide the required SLA guarantees. The NIOC feature available on the VDS helps in this traffic management activity. In the following sections you will see how to utilize this feature in the different designs.

As shown in Figure 1, the rack servers with two 10 Gigabit Ethernet network adapters are connected to the two access layer switches to avoid any single point of failure. Similar to the Rack server with eight 1 Gigabit Ethernet network adapters section, the different VDS and Physical switch parameter configurations are taken into account during this design. On the physical switch side, the new 10 Gigabit switches might have support for FCoE that allows convergence for SAN and LAN traffic. This document only covers the standard 10 Gigabit deployments that support IP storage traffic (iSCSI/NFS) and not FCoE.

In this section two design options are described; one is a traditional approach and the other is a VMware recommended approach.

2x10gig_deployment
Figure 1 Rack server with 2 – 10 Gig NICs

Design Option 1 – Static Configuration

The static configuration approach for rack server deployment with 10 Gigabit Ethernet network adapters is similar to the one described in design option 1 of the rack server deployment with eight 1 Gigabit Ethernet adapters. There are a few differences in the configuration: the number of dvuplinks changes from eight to two, and the dvportgroup parameters are different. Let’s take a look at the configuration details on the VDS front.

dvuplink configuration

To support the maximum of two Ethernet network adapters per host, the dvuplink port group is configured with 2 dvuplinks (dvuplink1, dvuplink2). On the hosts, dvuplink1 is associated with vmnic0 and dvuplink2 is associated with vmnic1.

 dvportgroups configuration

As described in Table 1, there are five different dvportgroups configured for the five different traffic types. For example, dvportgroup PG-A is created for the management traffic type. Following are the other key configurations of dvportgroup PG-A:

  • Teaming Option: Explicit Failover order provides a deterministic way of directing traffic to a particular uplink. By selecting dvuplink1 as the Active uplink and dvuplink2 as the Standby uplink, management traffic will be carried over dvuplink1 unless dvuplink1 fails. It is also recommended to set the failback option to “No” to avoid traffic flapping between the two NICs. The failback option determines how a physical adapter is returned to active duty after recovering from a failure. If failback is set to “No”, a failed adapter is left inactive even after recovery, until the currently active adapter in turn fails and requires replacement.
  • VMware recommends isolating all traffic types from each other by defining a separate VLAN for each dvportgroup.
  • There are various other parameters that are part of the dvportgroup configuration. Customers can choose to configure these parameters based on their environment needs.
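The failover and failback behavior described above can be sketched as a toy model in Python (purely illustrative, not the vSwitch teaming code):

```python
class FailoverTeam:
    """Toy model of an Explicit Failover order with failback set to "No":
    traffic moves to the standby NIC on failure and, on recovery, the
    original active NIC stays idle until the current one fails."""

    def __init__(self, active: str, standby: str):
        self.order = [active, standby]       # preference order
        self.up = {active: True, standby: True}
        self.current = active

    def link_down(self, nic: str):
        self.up[nic] = False
        if self.current == nic:              # fail over to the next healthy NIC
            self.current = next((n for n in self.order if self.up[n]), None)

    def link_up(self, nic: str):
        self.up[nic] = True                  # failback "No": do not switch back

team = FailoverTeam("dvuplink1", "dvuplink2")
team.link_down("dvuplink1")
assert team.current == "dvuplink2"           # traffic fails over
team.link_up("dvuplink1")
assert team.current == "dvuplink2"           # no flap back after recovery
```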

Table 1 below provides the configuration details for all the dvportgroups. According to the configuration, dvuplink1 carries Management, iSCSI, and Virtual Machine traffic while dvuplink2 handles the vMotion, FT, and Virtual Machine traffic. As you can see, Virtual machine traffic type makes use of two uplinks, and these uplinks are utilized through the load based teaming (LBT) algorithm.

In this deterministic teaming policy, customers can decide to map different traffic types to the available uplink ports depending on the environment needs. For example, if iSCSI traffic needs higher bandwidth and other traffic types have relatively low bandwidth requirements, then customers can decide to keep only iSCSI traffic on dvuplink1 and move all other traffic to dvuplink2. When deciding on these traffic paths, customers should understand the physical network connectivity and the paths’ bandwidth capacity.

Physical switch configuration

The external physical switch ports, to which the rack servers’ network adapters are connected, are configured as trunks with all the appropriate VLANs enabled. As described in the physical network switch parameters sections, the following switch configurations are performed based on the VDS setup described in Table 1.

  • Enable STP with “PortFast” mode and “BPDU guard” on the trunk ports facing the ESXi hosts.
  • The teaming configuration on VDS is static and thus no link aggregation is configured on the physical switches.
  • Because of the mesh topology deployment as shown in Figure 1, the link state-tracking feature is not required on the physical switches.

 Table 1 Static design configuration

Traffic Type    | Port Group | Teaming Option    | Active Uplink        | Standby Uplink | Unused Uplink
Management      | PG-A       | Explicit Failover | dvuplink1            | dvuplink2      | None
vMotion         | PG-B       | Explicit Failover | dvuplink2            | dvuplink1      | None
FT              | PG-C       | Explicit Failover | dvuplink2            | dvuplink1      | None
iSCSI           | PG-D       | Explicit Failover | dvuplink1            | dvuplink2      | None
Virtual Machine | PG-E       | LBT               | dvuplink1, dvuplink2 | None           | None

 

This static design option provides flexibility in the traffic path configuration, but it cannot protect against one traffic type dominating others. For example, a network-intensive vMotion operation could consume most of the network bandwidth and impact virtual machine traffic. Bi-directional traffic shaping parameters at the portgroup and port level can help manage different traffic rates. However, using this approach for traffic management requires customers to limit the traffic on the respective dvportgroups, which puts a hard limit on the traffic types even when bandwidth is available to utilize. This underutilization of I/O resources caused by hard limits is overcome by the NIOC feature, which provides flexible traffic management based on the shares parameter. Design option 2, described below, is based on the NIOC feature.

 

Design Option 2 – Dynamic Configuration with NIOC and LBT

This dynamic design option is the VMware recommended approach that takes advantage of the NIOC and LBT features of the VDS.

The connectivity to the physical network infrastructure remains the same as described in design option 1. However, instead of allocating specific dvuplinks to individual traffic types, the ESXi platform utilizes those dvuplinks dynamically. To illustrate this dynamic design, each virtual infrastructure traffic type’s bandwidth utilization is estimated. In a real deployment, customers should first monitor the virtual infrastructure traffic over a period of time to gauge the bandwidth utilization, and then come up with the bandwidth numbers.

Following are some bandwidth numbers estimated per traffic type:

  • Management Traffic (< 1 Gig)
  • vMotion (2 Gig)
  • FT (1 Gig)
  • iSCSI (2 Gig)
  • Virtual Machine (2 Gig)

These bandwidth estimates are different from the ones considered for the rack server deployment with eight 1 Gig network adapters. Let’s take a look at the VDS parameter configurations for this design. The dvuplink portgroup configuration remains the same, with two dvuplinks created for the two 10 Gigabit Ethernet network adapters. The dvportgroup configuration is as follows.

dvportgroups configuration

In this design all dvuplinks are active and there are no standby or unused uplinks, as shown in Table 2. All dvuplinks are thus available for use by the teaming algorithm. Following are the key configurations of dvportgroup PG-A:

  • Teaming Option: Load based teaming is selected as the teaming algorithm. With the LBT configuration, management traffic is initially scheduled based on the virtual port ID hash, and based on the hash output it is sent out over one of the dvuplinks. Other traffic types in the virtual infrastructure can also be scheduled on the same dvuplink. Subsequently, if the utilization of that uplink goes beyond the 75% threshold, the LBT algorithm is invoked and some of the traffic is moved to other, underutilized dvuplinks. It is possible that management traffic will be moved to another dvuplink when such an event occurs.
  • There are no standby dvuplinks in this configuration so the failback setting is not applicable for this design approach. The default setting for this failback option is “Yes”.
  • VMware recommends isolating all traffic types from each other by defining a separate VLAN for each dvportgroup.
  • There are several other parameters that are part of the dvportgroup configuration. Customers can choose to configure these parameters based on their environment needs.
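To illustrate the LBT behavior described above, here is a toy Python model: flows hashed onto a saturated uplink are moved to the least-loaded uplink once utilization crosses the 75% threshold. The flow names and bandwidth figures are invented, and the real algorithm works on utilization sampled over time, not instantaneous demand:

```python
def lbt_rebalance(flows: dict, assignment: dict, uplinks: list,
                  capacity_gbps: float = 10.0, threshold: float = 0.75) -> dict:
    """Toy model of Load Based Teaming: while an uplink's utilization
    exceeds the 75% threshold, move flows to the least-loaded uplink."""
    def load(uplink):
        return sum(gbps for f, gbps in flows.items() if assignment[f] == uplink)

    for flow in sorted(flows, key=flows.get):      # consider small flows first
        src = assignment[flow]
        if load(src) > threshold * capacity_gbps:
            dst = min(uplinks, key=load)
            if dst != src:
                assignment[flow] = dst
    return assignment

# Everything initially hashed onto dvuplink1 (9 Gbps of demand):
flows = {"mgmt": 0.5, "vMotion": 4.0, "iSCSI": 2.0, "VM": 2.5}
assignment = lbt_rebalance(flows,
                           {f: "dvuplink1" for f in flows},
                           uplinks=["dvuplink1", "dvuplink2"])
assert assignment["mgmt"] == "dvuplink2"           # moved off the saturated uplink
```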

As you follow the dvportgroups configuration in Table 2, you can see that each traffic type has all the dvuplinks as active and these uplinks are utilized through the load based teaming (LBT) algorithm. Let’s take a look at the NIOC configuration.

The Network I/O Control (NIOC) configuration in this design not only helps provide the appropriate I/O resources to the different traffic types but also provides SLA guarantees by protecting from one traffic type dominating others.

Based on the bandwidth assumptions made for the different traffic types, the shares parameters are configured in the NIOC shares column of Table 2. To illustrate how share values translate to bandwidth numbers in this deployment, let’s take the example of a 10 Gigabit capacity dvuplink carrying all five traffic types. This is a worst-case scenario where all traffic types are mapped to one dvuplink. It will never happen when customers enable the LBT feature, because LBT will move traffic types based on uplink utilization. This example shows how much bandwidth each traffic type will be allowed on one dvuplink during a contention or oversubscription scenario when LBT is not enabled.

  • Management: 5 shares;        (5/75) * 10 Gigabit = 667 Mbps
  • vMotion: 20 shares;               (20/75) * 10 Gigabit = 2.67 Gbps
  • FT: 10 shares;                          (10/75) * 10 Gigabit = 1.33 Gbps
  • iSCSI: 20 shares;                      (20/75) * 10 Gigabit = 2.67 Gbps
  • Virtual Machine: 20 shares; (20/75) * 10 Gigabit = 2.67 Gbps
  • Total shares: 5 + 20 + 10 + 20 + 20 = 75

As you can see, for each traffic type the percentage of bandwidth is first calculated by dividing the share value by the total number of shares (75), and then the total bandwidth of the dvuplink (10 Gigabit) is used to calculate the bandwidth share for the traffic type. For example, the 20 shares allocated to vMotion traffic translate to 2.67 Gbps of bandwidth for the vMotion process on a fully utilized 10 Gigabit network adapter.
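The arithmetic above is easy to script. A small Python sketch that reproduces these worst-case numbers:

```python
def nioc_bandwidth(shares: dict, link_gbps: float = 10.0) -> dict:
    """Worst case under contention: with every traffic type on one
    saturated dvuplink, each gets share/total_shares of the link."""
    total = sum(shares.values())
    return {t: link_gbps * s / total for t, s in shares.items()}

shares = {"Management": 5, "vMotion": 20, "FT": 10, "iSCSI": 20, "VM": 20}
bw = nioc_bandwidth(shares)
assert round(bw["vMotion"], 2) == 2.67          # Gbps
assert round(bw["Management"] * 1000) == 667    # Mbps
```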

In this 10 Gigabit Ethernet deployment, customers can provide bigger pipes to individual traffic types without the use of trunking or multipathing technologies. This was not the case with the eight 1 Gigabit Ethernet deployment.

There is no change in the physical switch configuration in this design approach, so please refer to the physical switch settings described in design option 1 in the previous section.

Table 2 Dynamic design configuration

Traffic Type    | Port Group | Teaming Option | Active Uplink | Standby Uplink | NIOC Shares | NIOC Limits
Management      | PG-A       | LBT            | dvuplink1, 2  | None           | 5           | -
vMotion         | PG-B       | LBT            | dvuplink1, 2  | None           | 20          | -
FT              | PG-C       | LBT            | dvuplink1, 2  | None           | 10          | -
iSCSI           | PG-D       | LBT            | dvuplink1, 2  | None           | 20          | -
Virtual Machine | PG-E       | LBT            | dvuplink1, 2  | None           | 20          | -

 

This design option utilizes the advanced VDS features and provides customers with a dynamic and flexible design approach. In this design, I/O resources are utilized effectively and Service Level Agreements are met based on the shares allocation.

In the next blog entry I will talk about the Blade center deployments.

 

IBM’s VASA Implementation for the XIV Storage Array

IBM recently announced their first implementation of vSphere Storage APIs for Storage Awareness (VASA). At its most basic, VASA enables vCenter to see underlying storage capabilities. To learn more about VASA, please take a moment to read my previous posts on the subject here and here.

What follows is a brief description of IBM's VASA implementation for XIV arrays.

Q1. Which IBM array models will support VASA in this first release?
A1. All models of IBM's XIV will be supported.

Q2. How has IBM done the Vendor Provider Implementation?
A2. IBM’s VASA Vendor Provider is a standalone software “application” with a Windows-only installer that can be installed on either a physical or virtual machine. It communicates with the IBM XIV using XIV CLI commands, and the results are surfaced to VMware through an Apache Tomcat webserver. It requires no license and is a free download from IBM's Fix Central software download site.

Q3. Which Storage Capabilities will be surfaced by VASA into vCenter?
A3. No capabilities are surfaced in Version 1.1.0 (the initial version). Instead, the VASA Provider delivers information about storage topology, capabilities, and state which can be displayed in the standard Storage Views menu. In addition, the VASA provider will report relevant XIV events and alerts, such as Thin Provisioning capacity thresholds being exceeded.

Here is a screen-shot taken from the vSphere client showing the storage view, with the information surfaced via VASA. Information includes whether or not the LUN is Thin Provisioned, the Storage Array identifier and the LUN identifier used on the array:

Capabilities

The VASA provider for XIV can be downloaded from IBM Fix Central. You will need an IBM ID to download it.

While IBM has chosen not to implement the full range of capabilities through VASA, it is nice to see them delivering some of the features.

As I have said in previous posts, this is a 1.0 release of the API. As we add more functionality to the API to retrieve additional information, I'm positive that we will see more of the array characteristics surfaced in the vCenter UI via VASA, giving vSphere admins greater insight into the underlying storage used for datastores and making management much easier.


Nice vmkfstools feature for Extents

Troubleshooting issues with extents has never been easy. If one extent member went offline, it was difficult to find which physical LUN corresponded to the offline extent. vSphere 5.0 introduces the ability, via vmkfstools, to check which extent of a volume is offline. For example, here is a VMFS-5 volume I created which spans two iSCSI LUNs:

~ # vmkfstools -Ph /vmfs/volumes/iscsi_datastore/
VMFS-5.54 file system spanning 2 partitions.
File system label (if any): iscsi_datastore
Mode: public
Capacity 17.5 GB, 16.9 GB available, file block size 8 MB
UUID: 4d810817-2d191ddd-0b4e-0050561902c9
Partitions spanned (on “lvm”):
        naa.6006048c7bc7febbf4db26ae0c3263cb:1
        naa.6006048c13e056de156e0f6d8d98cee2:1
Is Native Snapshot Capable: NO
~ #

Now if something happened on the array side to cause one of the LUNs to go offline, previous versions of vmkfstools could not identify which LUN/extent was the problem; if investigating from the array side, you would have to look at all the LUNs making up the volume and try to figure out which one was problematic. Now, in 5.0, we get a notification about which LUN is offline:

~ # vmkfstools -Ph /vmfs/volumes/iscsi_datastore/
VMFS-5.54 file system spanning 2 partitions.
File system label (if any): iscsi_datastore
Mode: public
Capacity 17.5 GB, 7.2 GB available, file block size 8 MB
UUID: 4d810817-2d191ddd-0b4e-0050561902c9
Partitions spanned (on “lvm”):
        naa.6006048c7bc7febbf4db26ae0c3263cb:1
        (device naa.6006048c13e056de156e0f6d8d98cee2:1 might be offline)
        (One or more partitions spanned by this volume may be offline)
Is Native Snapshot Capable: NO
~ #

In this case, we can see the NAA id (SCSI identifier) of the LUN which has the problem and investigate why the LUN is offline from the array side. A nice feature I’m sure you will agree.
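If you have many datastores to check, the offline extents are easy to pick out of the `vmkfstools -Ph` output with a short script. A hedged Python sketch, assuming the message format shown in the output above:

```python
def offline_extents(vmkfstools_output: str) -> list:
    """Scan `vmkfstools -Ph` output for extents flagged as possibly
    offline and return their NAA identifiers."""
    offline = []
    for line in vmkfstools_output.splitlines():
        line = line.strip()
        if line.startswith("(device") and "might be offline" in line:
            offline.append(line.split()[1])   # e.g. naa.xxxx:1
    return offline

sample = """Partitions spanned (on "lvm"):
        naa.6006048c7bc7febbf4db26ae0c3263cb:1
        (device naa.6006048c13e056de156e0f6d8d98cee2:1 might be offline)
        (One or more partitions spanned by this volume may be offline)
"""
assert offline_extents(sample) == ["naa.6006048c13e056de156e0f6d8d98cee2:1"]
```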

