DRS is a very powerful vSphere feature that has been around for years. It’s constantly monitoring host performance to ensure that VM demand is satisfied. When DRS determines that another host in a cluster could better suit a VM, a migration recommendation is generated, and the VM will vMotion to another host. DRS looks at several aspects to VM performance like CPU ready time, CPU utilization, memory active and memory swapped to make intelligent placement decisions.
DRS has historically been a black box though. This decision information can be ascertained by analyzing DrmDump files, but requires engineering support to decode and interpret them…until now.
DRS Doctor is a new fling that is aims to fix this, and allows the vSphere Administrator to diagnose DRS behavior without engineering support. This is great when you just want to dig in a little deeper and understand why DRS made a decision to move a virtual machine.
DRS Doctor records information about the state of the cluster, the advanced settings applied, the workload distribution, the virtual machine entitlements, performance demand, the recommended DRS moves, and more. Even better, DRS Doctor writes all this data into a log file that requires no special tools to read.
If you want to give it a try, check it out at the Flings site here. (opens new window)
Here is the output of a DRS Doctor log from my lab. This was running on a CentOS 6.5 VM running inside the cluster. This should give you an idea of the powerful features of DRS and give you comfort that it’s always doing everything to make VMs happy.
This log series starts with the cluster in an idle state where DRS considered the cluster balanced. I then started up a group of virtual machines, each with varying degree of CPU and MEM resource demand to purposely throw the cluster off balance. I waited for DRS to detect this imbalance, and then correct it. As you read through the log files, please look for my comments in #BOLD. For simplicity and readability, I have concatenated three log files to show the progression. As DRS Doctor runs, it will create a new log file every five minutes which corresponds to each DRS iteration.
#The start of log file 1. DRS Doctor will output any affinity/anti-affinity rules here. It also shows any advanced settings that are applied to the cluster. The target balance value is the same as shown in the UI. This is based on the Migration Threshold value that is configured on the cluster.
1 2 3 4 5 6 7 8 9 |
Advanced Options: AggressiveCPUActive : 1 PercentIdleMBInMemDemand : 100 Cluster Rules: No Cluster rules set Cluster Balance: current balance: 38 target balance: 70 overall status: Balanced |
#This is when I started a group of VMs in a very short period of time.
1 2 3 4 5 6 7 8 |
Task update: Id = 'vim.Task:task-322879', Task = VirtualMachine.powerOn, User = '', Entity = 'router (d55efe8e-5a07-40de-b542-725322904f30)', StartTime = None, EndTime = None, State = queued Task update: Id = 'vim.Task:task-322880', Task = VirtualMachine.powerOn, User = '', Entity = 'controlcenter (435fe052-6e50-494d-9118-2d4574ad2951)', StartTime = None, EndTime = None, State = queued Task update: Id = 'vim.Task:task-322881', Task = VirtualMachine.powerOn, User = '', Entity = 'log-01a (ee545ffe-769f-4851-98f9-7ec9d271e81f)', StartTime = None, EndTime = None, State = queued Task update: Id = 'vim.Task:task-322882', Task = VirtualMachine.powerOn, User = '', Entity = 'stg-01a (86fbce17-1b42-4be7-b6cf-8c1f7c530da0)', StartTime = None, EndTime = None, State = queued |
#DRS detects a change in the cluster balance state and reports that it’s become imbalanced.
1 2 3 4 |
Cluster Balance changed: current balance: 74 target balance: 70 overall status: Imbalanced |
#Powering on even more VMs.
1 2 3 4 5 6 7 8 9 10 |
Task update: Id = 'vim.Task:task-322883', Task = VirtualMachine.powerOn, User = '', Entity = 'esx-01a (8f72c0e8-aca6-45e1-898d-1e033c4f2cf3)', StartTime = None, EndTime = None, State = queued Task update: Id = 'vim.Task:task-322884', Task = VirtualMachine.powerOn, User = '', Entity = 'esx-02a (f0a271d0-18dd-41ff-b19d-e6451d5997bc)', StartTime = None, EndTime = None, State = queued Task update: Id = 'vim.Task:task-322885', Task = VirtualMachine.powerOn, User = '', Entity = 'esx-03a (ab91c603-5f18-441c-92e2-c8d79a8b6065)', StartTime = None, EndTime = None, State = queued Task update: Id = 'vim.Task:task-322886', Task = VirtualMachine.powerOn, User = '', Entity = 'esx-04a (000920b6-0e63-4e49-ba6e-cec57b3a8e84)', StartTime = None, EndTime = None, State = queued Task update: Id = 'vim.Task:task-322887', Task = VirtualMachine.powerOn, User = '', Entity = 'esx-05a (e7a1ea81-deec-44c4-90da-57321e38cd1a)', StartTime = None, EndTime = None, State = queued |
#DRS detects even more imbalance. Things are starting to get crazy, but that’s expected with the number of VMs started on such short order.
1 2 3 4 |
Cluster Balance changed: current balance: 179 target balance: 70 overall status: Imbalanced |
#DRS begins to create and apply recommendations. This contains a lot of great information. We can see that the priority rating for the recommendation is 4 and that the reason for migrating was due to CPU.
1 2 3 4 5 6 7 8 9 10 |
Applying recommendation: <strong>Rating = 4</strong>, Migrating vm 'esx-03a (ab91c603-5f18-441c-92e2-c8d79a8b6065)' from 'hqesx003.corp.local' to 'hqesx002.corp.local' at time 2016-06-16 15:50:23.429789+00:00, <strong>Reason = fairnessCpuAvg</strong> Task update: Id = 'vim.Task:task-322892', Task = Drm.ExecuteVMotionLRO, User = 'System', Entity = 'esx-03a (ab91c603-5f18-441c-92e2-c8d79a8b6065)', StartTime = 2016-06-16 15:50:23.517169+00:00, EndTime = None, State = running Task update: Id = 'vim.Task:task-322893', Task = ClusterComputeResource.applyRecommendation, User = 'CORP\matthew', Entity = 'Corp Cluster', StartTime = None, EndTime = None, State = queued Applying recommendation: Rating = 2, Migrating vm 'router (d55efe8e-5a07-40de-b542-725322904f30)' from 'hqesx004.corp.local' to 'hqesx001.corp.local' at time 2016-06-16 15:50:23.429832+00:00, Reason = fairnessCpuAvg Task update: Id = 'vim.Task:task-322894', Task = Drm.ExecuteVMotionLRO, User = 'System', Entity = 'router (d55efe8e-5a07-40de-b542-725322904f30)', StartTime = 2016-06-16 15:50:23.655664+00:00, EndTime = None, State = running |
1 2 3 4 |
Placement update: VM 'hqdrs001' is placed on host 'hqesx003.corp.local' at time 2016-06-16 15:00:07.984000+00:00 Placement update: VM 'hqdrs001' is placed on host 'hqesx003.corp.local' at time 2016-06-16 12:52:05.498999+00:00 |
#This next section is money. This is the source of main inputs that go into DRS placement/migration recommendations; entitlements and demand. Here you can get a dump of every VM on every host in the cluster. It will show the current entitlement, demand, ready time, active memory, entitled memory, and any swapping. It’s glorious!
(Sidebar: To better understand the data, it probably wouldn’t hurt to revisit the definitions of some of these metrics. Seriously, read this. If you don’t read it here, read it here.)
- CPU Entitled: CPU resources devoted by the ESXi scheduler.
- CPU Demand: The amount of CPU resources a virtual machine would use if there were no CPU contention or CPU limit.
- CPU Used: Amount of actively used virtual CPU. This is the host’s view of the CPU usage, not the guest operating system view.
- CPU Ready Time: Percentage of time that the virtual machine was ready, but could not get scheduled to run on the physical CPU.
- MEM Entitled: Amount of host physical memory the virtual machine is entitled to, as determined by the ESX scheduler.
- MEM Active: Amount of guest “physical” memory actively used.
- MEM Shared: Amount of guest “physical” memory shared with other virtual machines (through the VMkernel’s transparent page-sharing mechanism, a RAM de-duplication technique). Includes amount of zero memory area.
- MEM Swapped: Current amount of guest physical memory swapped out to the virtual machine swap file by the VMkernel.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 |
Dumping host entitlements: Host: hqesx001.corp.local, cpuCapacity(MHz): 18310, cpuEntitled(MHz): 1431, cpuPercentageUsed: 7%, memCapacity(MB): 65501, memEnitled(MB): 37392, memPercentageUsed: 57% Total number of poweredOn vms on the host = 12 VM: stg-01a (86fbce17-1b42-4be7-b6cf-8c1f7c530da0), cpuEntitled(MHz): 95, cpuDemand(MHz): 164, cpuUsed(MHz): 196, cpuReadyTime(%): 0, memEntitled(KB): 1006592, memActive(KB): 786432, memShared(KB): 6352, memSwapped(KB): 0 VM: esx-02a (78d1a2ce-443d-4403-a996-3f5504aba4fb), cpuEntitled(MHz): 263, cpuDemand(MHz): 354, cpuUsed(MHz): 232, cpuReadyTime(%): 1, memEntitled(KB): 3964928, memActive(KB): 1048576, memShared(KB): 1427860, memSwapped(KB): 0 VM: stg-01a (2e2eab53-ddd2-443c-b045-2d541f4ef02b), cpuEntitled(MHz): 143, cpuDemand(MHz): 191, cpuUsed(MHz): 491, cpuReadyTime(%): 0, memEntitled(KB): 1147904, memActive(KB): 188740, memShared(KB): 395480, memSwapped(KB): 0 Host: hqesx002.corp.local, cpuCapacity(MHz): 18310, cpuEntitled(MHz): 3089, cpuPercentageUsed: 16%, memCapacity(MB): 65501, memEnitled(MB): 22031, memPercentageUsed: 33% Total number of poweredOn vms on the host = 7 VM: esx-01a (8f72c0e8-aca6-45e1-898d-1e033c4f2cf3), cpuEntitled(MHz): 1295, cpuDemand(MHz): 1304, cpuUsed(MHz): 2678, cpuReadyTime(%): 1, memEntitled(KB): 3415040, memActive(KB): 3145728, memShared(KB): 226564, memSwapped(KB): 0 VM: vse-vApp Network (1f463d11-1fec-43cc-8225-c37ad87966c8)-0, cpuEntitled(MHz): 47, cpuDemand(MHz): 48, cpuUsed(MHz): 64, cpuReadyTime(%): 0, memEntitled(KB): 484352, memActive(KB): 346028, memShared(KB): 16972, memSwapped(KB): 0 VM: esx-03a (3d08b37e-3e52-497c-9927-b5260805c74f), cpuEntitled(MHz): 239, cpuDemand(MHz): 246, cpuUsed(MHz): 220, cpuReadyTime(%): 1, memEntitled(KB): 3956736, memActive(KB): 796916, memShared(KB): 1127412, memSwapped(KB): 0 Host: hqesx003.corp.local, cpuCapacity(MHz): 18310, cpuEntitled(MHz): 5053, cpuPercentageUsed: 27%, memCapacity(MB): 65501, memEnitled(MB): 46829, memPercentageUsed: 71% Total number of poweredOn vms on the host = 16 VM: esx-05a (e7a1ea81-deec-44c4-90da-57321e38cd1a), cpuEntitled(MHz): 0, cpuDemand(MHz): 0, cpuUsed(MHz): 2877, cpuReadyTime(%): 15, memEntitled(KB): 17183744, memActive(KB): 12582912, memShared(KB): 11324, memSwapped(KB): 0 VM: esx-03a (ab91c603-5f18-441c-92e2-c8d79a8b6065), cpuEntitled(MHz): 2063, cpuDemand(MHz): 1872, cpuUsed(MHz): 1565, cpuReadyTime(%): 20, memEntitled(KB): 3398656, memActive(KB): 3145728, memShared(KB): 11808, memSwapped(KB): 0 VM: log-01a (ee545ffe-769f-4851-98f9-7ec9d271e81f), cpuEntitled(MHz): 1943, cpuDemand(MHz): 2889, cpuUsed(MHz): 3222, cpuReadyTime(%): 1, memEntitled(KB): 3304448, memActive(KB): 3145728, memShared(KB): 74236, memSwapped(KB): 0 Host: hqesx004.corp.local, cpuCapacity(MHz): 18310, cpuEntitled(MHz): 2391, cpuPercentageUsed: 13%, memCapacity(MB): 65501, memEnitled(MB): 43765, memPercentageUsed: 66% Total number of poweredOn vms on the host = 10 VM: esx-06a (1421f70a-5799-43db-95e7-8a7abb1d3915), cpuEntitled(MHz): 0, cpuDemand(MHz): 0, cpuUsed(MHz): 2402, cpuReadyTime(%): 0, memEntitled(KB): 17183744, memActive(KB): 12582912, memShared(KB): 14244, memSwapped(KB): 0 VM: hqplx001, cpuEntitled(MHz): 71, cpuDemand(MHz): 77, cpuUsed(MHz): 82, cpuReadyTime(%): 0, memEntitled(KB): 2425856, memActive(KB): 83884, memShared(KB): 2242540, memSwapped(KB): 0 VM: hqvcd001, cpuEntitled(MHz): 95, cpuDemand(MHz): 114, cpuUsed(MHz): 125, cpuReadyTime(%): 0, memEntitled(KB): 2178048, memActive(KB): 838860, memShared(KB): 66648, memSwapped(KB): 0 VM: esx-05a (4b5d3d73-5a93-4bd7-8c68-bc3ea9a41323), cpuEntitled(MHz): 119, cpuDemand(MHz): 132, cpuUsed(MHz): 98, cpuReadyTime(%): 2, memEntitled(KB): 2205696, memActive(KB): 0, memShared(KB): 1790500, memSwapped(KB): 0 |
#This section provides a nice summary for what DRS did, and more importantly WHY it happened. No more mystery.
1 2 3 4 5 6 7 8 9 |
AUDIT: VMs migrated: 5, VMs placed: 2 Drs recommeneded migrations: 5 VM 'hqdrs001' is recommended to migrate from 'hqesx003.corp.local' to 'hqesx002.corp.local' at time 2016-06-16 15:50:23.429854+00:00, Reason: fairnessMemAvg VM 'esx-04a (58da2d1b-ec6a-4c72-bbd7-a385f8456887)' is recommended to migrate from 'hqesx004.corp.local' to 'hqesx002.corp.local' at time 2016-06-16 15:50:23.429848+00:00, Reason: fairnessMemAvg VM 'esx-01a (2ad6ca89-bb80-4e38-a753-55e75ec08cc4)' is recommended to migrate from 'hqesx004.corp.local' to 'hqesx002.corp.local' at time 2016-06-16 15:50:23.429841+00:00, Reason: fairnessMemAvg VM 'router (d55efe8e-5a07-40de-b542-725322904f30)' is recommended to migrate from 'hqesx004.corp.local' to 'hqesx001.corp.local' at time 2016-06-16 15:50:23.429832+00:00, Reason: fairnessCpuAvg VM 'esx-03a (ab91c603-5f18-441c-92e2-c8d79a8b6065)' is recommended to migrate from 'hqesx003.corp.local' to 'hqesx002.corp.local' at time 2016-06-16 15:50:23.429789+00:00, Reason: fairnessCpuAvg User initiated migrations: 0 |
#Here is the start of the second log file (the next DRS iteration). As you can see the cluster is still imbalanced. At this point all the newly powered on VMs are starting to calm down. DRS will continue to check the state and make recommendations.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 |
This log contains data for 5 minutes starting from VC TimeStamp: 2016-06-16 15:50:38.044131+00:00 Advanced Options: AggressiveCPUActive : 1 PercentIdleMBInMemDemand : 100 Cluster Rules: No Cluster rules set Cluster Balance: current balance: 107 target balance: 70 overall status: Imbalanced Dumping host entitlements: Host: hqesx001.corp.local, cpuCapacity(MHz): 18310, cpuEntitled(MHz): 1526, cpuPercentageUsed: 8%, memCapacity(MB): 65501, memEnitled(MB): 32794, memPercentageUsed: 50% Total number of poweredOn vms on the host = 13 VM: router (d55efe8e-5a07-40de-b542-725322904f30), cpuEntitled(MHz): 23, cpuDemand(MHz): 33, cpuUsed(MHz): 44, cpuReadyTime(%): 0, memEntitled(KB): 558080, memActive(KB): 309328, memShared(KB): 42412, memSwapped(KB): 0 Host: hqesx002.corp.local, cpuCapacity(MHz): 18310, cpuEntitled(MHz): 3638, cpuPercentageUsed: 19%, memCapacity(MB): 65501, memEnitled(MB): 32613, memPercentageUsed: 49% Total number of poweredOn vms on the host = 11 VM: esx-01a (2ad6ca89-bb80-4e38-a753-55e75ec08cc4), cpuEntitled(MHz): 335, cpuDemand(MHz): 356, cpuUsed(MHz): 298, cpuReadyTime(%): 1, memEntitled(KB): 3958784, memActive(KB): 713028, memShared(KB): 258344, memSwapped(KB): 0 Host: hqesx003.corp.local, cpuCapacity(MHz): 18310, cpuEntitled(MHz): 1525, cpuPercentageUsed: 8%, memCapacity(MB): 65501, memEnitled(MB): 37770, memPercentageUsed: 57% Total number of poweredOn vms on the host = 14 VM: esx-05a (e7a1ea81-deec-44c4-90da-57321e38cd1a), cpuEntitled(MHz): 119, cpuDemand(MHz): 135, cpuUsed(MHz): 103, cpuReadyTime(%): 1, memEntitled(KB): 12445696, memActive(KB): 5704252, memShared(KB): 295724, memSwapped(KB): 0 Host: hqesx004.corp.local, cpuCapacity(MHz): 18310, cpuEntitled(MHz): 1000, cpuPercentageUsed: 5%, memCapacity(MB): 65501, memEnitled(MB): 33395, memPercentageUsed: 50% Total number of poweredOn vms on the host = 8 VM: esx-06a (1421f70a-5799-43db-95e7-8a7abb1d3915), cpuEntitled(MHz): 119, cpuDemand(MHz): 135, cpuUsed(MHz): 169, cpuReadyTime(%): 1, memEntitled(KB): 12294144, memActive(KB): 7549744, memShared(KB): 1003156, memSwapped(KB): 0 AUDIT: VMs migrated: 0, VMs placed: 8 Drs recommeneded migrations: 0 User initiated migrations: 0 |
#Start of the final log file.
1 2 3 4 5 6 |
This log contains data for 5 minutes starting from VC TimeStamp: 2016-06-16 15:55:40.403076+00:00 Advanced Options: AggressiveCPUActive : 1 PercentIdleMBInMemDemand : 100 Cluster Rules: No Cluster rules set |
#Balance is getting better. Almost there.
1 2 3 4 |
Cluster Balance: current balance: 90 target balance: 70 overall status: Imbalanced |
#BOOM! After things started settling down, everything came back into balance.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 |
Cluster Balance changed: current balance: 65 target balance: 70 overall status: <strong>Balanced</strong> Dumping host entitlements: Host: hqesx001.corp.local, cpuCapacity(MHz): 18310, cpuEntitled(MHz): 1575, cpuPercentageUsed: 8%, memCapacity(MB): 65501, memEnitled(MB): 26239, memPercentageUsed: 40% Total number of poweredOn vms on the host = 13 VM: router (d55efe8e-5a07-40de-b542-725322904f30), cpuEntitled(MHz): 23, cpuDemand(MHz): 35, cpuUsed(MHz): 26, cpuReadyTime(%): 0, memEntitled(KB): 466944, memActive(KB): 110100, memShared(KB): 48860, memSwapped(KB): 0 Host: hqesx002.corp.local, cpuCapacity(MHz): 18310, cpuEntitled(MHz): 3182, cpuPercentageUsed: 17%, memCapacity(MB): 65501, memEnitled(MB): 32490, memPercentageUsed: 49% Total number of poweredOn vms on the host = 11 VM: esx-01a (2ad6ca89-bb80-4e38-a753-55e75ec08cc4), cpuEntitled(MHz): 335, cpuDemand(MHz): 308, cpuUsed(MHz): 317, cpuReadyTime(%): 1, memEntitled(KB): 3950592, memActive(KB): 1258288, memShared(KB): 270452, memSwapped(KB): 0 Host: hqesx003.corp.local, cpuCapacity(MHz): 18310, cpuEntitled(MHz): 1429, cpuPercentageUsed: 7%, memCapacity(MB): 65501, memEnitled(MB): 31072, memPercentageUsed: 47% Total number of poweredOn vms on the host = 14 VM: esx-05a (e7a1ea81-deec-44c4-90da-57321e38cd1a), cpuEntitled(MHz): 119, cpuDemand(MHz): 170, cpuUsed(MHz): 119, cpuReadyTime(%): 2, memEntitled(KB): 5684224, memActive(KB): 1509948, memShared(KB): 1400880, memSwapped(KB): 0 Host: hqesx004.corp.local, cpuCapacity(MHz): 18310, cpuEntitled(MHz): 833, cpuPercentageUsed: 4%, memCapacity(MB): 65501, memEnitled(MB): 26541, memPercentageUsed: 40% Total number of poweredOn vms on the host = 8 VM: esx-06a (1421f70a-5799-43db-95e7-8a7abb1d3915), cpuEntitled(MHz): 119, cpuDemand(MHz): 129, cpuUsed(MHz): 184, cpuReadyTime(%): 1, memEntitled(KB): 5595136, memActive(KB): 1509948, memShared(KB): 1724720, memSwapped(KB): 0 AUDIT: VMs migrated: 0, VMs placed: 0 Drs recommeneded migrations: 0 User initiated migrations: 0 |
If you ever wanted to peek under the covers of DRS, I encourage you to download and try this fling. If you find this information valuable or have suggestions and other feedback, please make sure to post your thoughts to the Flings page for DRS Doctor.