Introduction
Each member of the Tanzu GemFire DistributedSystem produces a variety of statistics including ones in these categories:
- Operating System
- Java Virtual Machine (JVM)
- JVM heap memory
- JVM garbage collection
- Peer to peer requests
- Client to server requests
- Cache performance
If the statistic-sampling-enabled property is set to true, then the statistics are periodically written to the archive file configured by the statistic-archive-file property. The main way to view the file is to use the Visual Statistics Display (vsd) tool. The product documentation has additional details on producing the statistics file and on vsd.
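As a minimal sketch, the two properties named above could be assembled in Java as follows. The archive file name myStats.gfs is a placeholder, and how the Properties object reaches the member (for example through a properties file or a cache factory) is left out because it depends on the deployment.

```java
import java.util.Properties;

final class StatisticsConfigSketch {

    // Builds the statistics-related settings discussed above.
    // The archive file name passed in is a hypothetical placeholder.
    static Properties statisticsProperties(String archiveFile) {
        Properties props = new Properties();
        props.setProperty("statistic-sampling-enabled", "true");
        props.setProperty("statistic-archive-file", archiveFile);
        return props;
    }

    public static void main(String[] args) {
        Properties props = statisticsProperties("myStats.gfs");
        props.forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```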
Some of these statistics are helpful in troubleshooting most issues; some are more obscure and only apply to narrow situations.
This article describes the statistics that are most useful when troubleshooting issues, and in some cases, relationships between the statistics.
Most Useful Statistics
All of the statistics are grouped into categories. The most useful categories are listed below. The most important statistics in each category are described in the following sections.
- VMStats — statistics for the JVM
- VMMemoryPoolStats — statistics for JVM heap memory
- VMGCStats — statistics for JVM garbage collection
- StatSamplerStats — statistics for the statistics sampler itself
- ResourceManagerStats — statistics for heap monitoring
- PartitionedRegionStats — statistics for partitioned regions
- LinuxSystemStats — statistics for the operating system
- DistributionStats — statistics for peer to peer requests
- CacheServerStats — statistics for client to server requests
- CachePerfStats — statistics for cache performance
VMStats
The VMStats instance groups together all the statistics related to the JVM process including:
- fdsOpen/fdLimit: indicate the current and maximum number of file descriptors in the JVM retrieved from the UnixOperatingSystemMXBean provided by ManagementFactory.getOperatingSystemMXBean(). If the number of open file descriptors reaches the limit, then an exception with ‘Too many open files’ will occur.
- processCpuTime: indicates the CPU time used by the JVM process retrieved from the UnixOperatingSystemMXBean provided by ManagementFactory.getOperatingSystemMXBean(). This statistic shows how much of the total host CPU (see LinuxSystemStats) is accounted for by the JVM.
- threads: indicates the number of threads in the JVM retrieved from the ThreadMXBean provided by ManagementFactory.getThreadMXBean().
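The same MXBeans can be queried directly from any JVM, which can help when cross-checking the values in the archive. The sketch below uses only standard JDK calls. The class name is illustrative, and the Unix-specific bean is only present on Unix-like platforms, so the code checks for it first.

```java
import com.sun.management.UnixOperatingSystemMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;
import java.lang.management.ThreadMXBean;

final class JvmProcessSnapshot {

    // Returns the live thread count reported by the ThreadMXBean.
    static int threadCount() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        return threads.getThreadCount();
    }

    public static void main(String[] args) {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        // The file descriptor counters only exist on the Unix-specific bean.
        if (os instanceof UnixOperatingSystemMXBean) {
            UnixOperatingSystemMXBean unix = (UnixOperatingSystemMXBean) os;
            System.out.println("open fds      = " + unix.getOpenFileDescriptorCount());
            System.out.println("fd limit      = " + unix.getMaxFileDescriptorCount());
            // getProcessCpuTime() reports nanoseconds.
            System.out.println("cpu time (ns) = " + unix.getProcessCpuTime());
        } else {
            System.out.println("UnixOperatingSystemMXBean is not available on this platform");
        }
        System.out.println("threads       = " + threadCount());
    }
}
```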
VMMemoryPoolStats
A VMMemoryPoolStats instance groups together all the statistics related to a Java heap memory space. Examples include CMS Old Gen, Par Eden Space, G1 Eden Space and G1 Old Gen. One is created for each of the MemoryPoolMXBeans provided by ManagementFactory.getMemoryPoolMXBeans().
- currentUsedMemory: indicates the current memory usage of the memory pool
- currentMaxMemory: indicates the maximum size of the memory pool
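To see which memory pools a particular JVM exposes, and therefore which VMMemoryPoolStats instances to expect, the pools can be listed with standard JDK calls. This is a sketch with an illustrative class name. It prints every pool the JVM reports, including non-heap ones.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;
import java.util.ArrayList;
import java.util.List;

final class MemoryPoolSnapshot {

    // Formats one line per memory pool: name, type, used bytes, max bytes.
    static List<String> poolLines() {
        List<String> lines = new ArrayList<>();
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage usage = pool.getUsage();
            if (usage == null) {
                continue; // the pool is no longer valid
            }
            // getMax() returns -1 when the maximum is undefined.
            lines.add(pool.getName() + " (" + pool.getType() + ")"
                + " used=" + usage.getUsed() + " max=" + usage.getMax());
        }
        return lines;
    }

    public static void main(String[] args) {
        poolLines().forEach(System.out::println);
    }
}
```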
VMGCStats
A VMGCStats instance groups together all the statistics related to a Java garbage collector. Examples include ConcurrentMarkSweep, ParNew, G1 Old Generation and G1 Young Generation. One is created for each of the GarbageCollectorMXBeans provided by ManagementFactory.getGarbageCollectorMXBeans().
- collections: indicates the number of garbage collections
- collectionTime: indicates the garbage collection time in milliseconds. Spikes in this statistic may cause members to be disconnected from the DistributedSystem and may require garbage collection tuning or adjustments to the configured heap or region configuration (e.g. add or change heap LRU eviction).
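The collector names and counters behind these statistics can be inspected with the standard GarbageCollectorMXBean API. The following sketch (the class name is illustrative) prints the cumulative count and time for each collector. Note that the MXBean's getCollectionTime() reports milliseconds.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

final class GcSnapshot {

    // Sums the collection counts of all collectors, skipping undefined (-1) values.
    static long totalCollections() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount();
            if (count > 0) {
                total += count;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println(gc.getName()
                + " collections=" + gc.getCollectionCount()
                + " collectionTime(ms)=" + gc.getCollectionTime());
        }
        System.out.println("total collections=" + totalCollections());
    }
}
```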
StatSamplerStats
The StatSamplerStats instance groups together all the statistics related to statistic sampling.
- delayDuration: indicates the delay between samples taken by the statistics sampler thread. The HostStatSampler’s statThread samples statistics periodically based on the statistic-sample-rate property. If the statThread doesn’t sample when it should, the delayDuration will show a spike. This often indicates a resource issue (e.g. GC or CPU) and helps narrow the time frame for investigation.
- jvmPauses: indicates the number of JVM pauses. This statistic is incremented when the delay between statistics samples is greater than three seconds. This time is configurable via the gemfire.statSamplerDelayThreshold Java system property.
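The idea behind these two statistics can be sketched as a loop that sleeps for the sample interval and measures how long it actually took to wake up. This is an illustration of the technique only, not the HostStatSampler implementation. The class and method names are hypothetical, and the 1000 ms interval is an assumed value.

```java
import java.util.concurrent.TimeUnit;

final class SamplerDelaySketch {

    // Elapsed time between two samples, in milliseconds.
    static long elapsedMillis(long previousSampleNanos, long currentSampleNanos) {
        return TimeUnit.NANOSECONDS.toMillis(currentSampleNanos - previousSampleNanos);
    }

    // A delay above the threshold is counted as a pause.
    static boolean isPause(long delayMs, long thresholdMs) {
        return delayMs > thresholdMs;
    }

    public static void main(String[] args) throws InterruptedException {
        long sampleRateMs = 1000; // assumed sample interval
        long thresholdMs = 3000;  // three-second threshold described in the text
        int pauses = 0;
        long previous = System.nanoTime();
        for (int i = 0; i < 3; i++) {
            Thread.sleep(sampleRateMs);
            long now = System.nanoTime();
            long delay = elapsedMillis(previous, now);
            if (isPause(delay, thresholdMs)) {
                pauses++;
            }
            System.out.println("delayDuration=" + delay + "ms jvmPauses=" + pauses);
            previous = now;
        }
    }
}
```

On a healthy JVM the printed delay stays close to the sample interval. A long GC pause or CPU starvation between two samples would show up as a much larger value.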
ResourceManagerStats
The ResourceManagerStats instance groups together all the statistics related to the monitoring of heap usage.
- heapCriticalEvents: indicates the number of times the heap usage exceeded the critical heap percentage. The critical heap percentage is the percentage at which the member will accept no more Cache operations. It is configured via the ResourceManager's critical-heap-percentage property.
- evictionStartEvents: indicates the number of times the heap usage exceeded the eviction heap percentage. The eviction heap percentage is the percentage at which eviction will begin for regions defined with heap LRU eviction. It is configured via the ResourceManager's eviction-heap-percentage property.
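To illustrate how the two counters relate to a series of heap readings, the sketch below counts each upward crossing of a threshold. It assumes an event is recorded every time usage rises above the threshold. The real heap monitor is more involved, and the sample values, thresholds and names here are all hypothetical.

```java
final class HeapThresholdSketch {

    // Counts how many times the series rises from at-or-below the threshold to above it.
    static int countCrossings(double[] usedPercentSamples, double thresholdPercent) {
        int events = 0;
        boolean above = false;
        for (double sample : usedPercentSamples) {
            boolean nowAbove = sample > thresholdPercent;
            if (nowAbove && !above) {
                events++;
            }
            above = nowAbove;
        }
        return events;
    }

    public static void main(String[] args) {
        double[] samples = {55, 72, 81, 78, 92, 88, 60, 83, 95}; // made-up heap usage percentages
        double evictionPercent = 80; // hypothetical eviction-heap-percentage
        double criticalPercent = 90; // hypothetical critical-heap-percentage
        System.out.println("evictionStartEvents=" + countCrossings(samples, evictionPercent));
        System.out.println("heapCriticalEvents=" + countCrossings(samples, criticalPercent));
    }
}
```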
PartitionedRegionStats
A PartitionedRegionStats instance groups together all the statistics related to a partitioned region.
- bucketCount: indicates the number of buckets defined in the member
- primaryBucketCount: indicates the number of primary buckets defined in the member
- dataStoreBytesInUse: indicates the number of entry bytes across all the buckets including primaries and secondaries
- dataStoreEntryCount: indicates the number of entries across all the buckets including primaries and secondaries
LinuxSystemStats
The LinuxSystemStats instance groups together all the statistics related to Linux system performance.
- cachedMemory: indicates the amount of memory cached in RAM retrieved from /proc/meminfo
- cpuActive: indicates the active CPU percentage retrieved from /proc/stat
- freeMemory: indicates the amount of free memory available on the host machine retrieved from /proc/meminfo. This statistic helps determine if the amount of available memory is adequate for the JVM heap plus native threads.
- loadAverage1, loadAverage5, loadAverage15: indicate the number of running and waiting processes, averaged over the last 1, 5 and 15 minutes, retrieved from /proc/loadavg. These statistics help determine if the load on the system is too high for the number of CPUs.
- physicalMemory: indicates the amount of physical memory on the host retrieved from /proc/meminfo
- recvBytes: indicates the number of bytes received over the network from other members retrieved from /proc/net/dev
- recvDrops: indicates the number of received packets dropped retrieved from /proc/net/dev. Non-zero values for this statistic indicate possible network issues.
- xmitBytes: indicates the number of bytes transmitted over the network to other members retrieved from /proc/net/dev
- xmitDrops: indicates the number of transmitted packets dropped retrieved from /proc/net/dev. Non-zero values for this statistic indicate possible network issues.
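As a rough illustration of comparing the load averages with the CPU count, the sketch below parses a /proc/loadavg line and applies a simple rule of thumb. The rule (load above the number of CPUs is considered high) is an assumption made for the example, not a GemFire threshold, and the class and method names are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

final class LoadAverageCheck {

    // Parses the first three fields of a /proc/loadavg line: 1, 5 and 15 minute averages.
    static double[] parseLoadAverages(String line) {
        String[] fields = line.trim().split("\\s+");
        return new double[] {
            Double.parseDouble(fields[0]),
            Double.parseDouble(fields[1]),
            Double.parseDouble(fields[2])
        };
    }

    // Rule of thumb for this example only: load above the CPU count is treated as high.
    static boolean isHigh(double loadAverage, int cpus) {
        return loadAverage > cpus;
    }

    public static void main(String[] args) throws IOException {
        Path loadavg = Paths.get("/proc/loadavg");
        if (!Files.isReadable(loadavg)) {
            System.out.println("/proc/loadavg is not available on this platform");
            return;
        }
        double[] averages = parseLoadAverages(Files.readAllLines(loadavg).get(0));
        int cpus = Runtime.getRuntime().availableProcessors();
        String[] labels = {"loadAverage1", "loadAverage5", "loadAverage15"};
        for (int i = 0; i < averages.length; i++) {
            System.out.println(labels[i] + "=" + averages[i]
                + " cpus=" + cpus + " high=" + isHigh(averages[i], cpus));
        }
    }
}
```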
DistributionStats
The DistributionStats instance groups all the statistics related to peer to peer communication and processing.
- nodes: indicates the number of members of the DistributedSystem
- functionExecutionThreads/functionExecutionQueueSize: indicate the number of threads in the ExecutorService called functionExecutionPool used to process Function execution requests and the queue for excess requests when all the threads are in use. The functionExecutionThreads statistic corresponds to the number of Function Execution Processor threads (default maximum is the maximum of processors*16 and 100). If the functionExecutionQueueSize is consistently greater than zero, then the functionExecutionPool’s maximum number of threads can be increased by setting the DistributionManager.MAX_FE_THREADS Java system property. See my blog for additional information on when and how Function execution threads are used.
- highPriorityThreads/highPriorityQueueSize: indicate the number of threads in the ExecutorService called highPriorityPool used to process high priority messages (e.g. CreateRegionMessage, RequestImageMessage) and the queue for excess requests when all the threads are in use. The highPriorityThreads statistic corresponds to the number of Pooled High Priority Message Processor threads (default maximum is 1000). If the highPriorityQueueSize is consistently greater than zero, then the highPriorityPool’s maximum number of threads can be increased by setting the DistributionManager.MAX_THREADS Java system property.
- partitionedRegionThreads/partitionedRegionQueueSize: indicate the number of threads in the ExecutorService called partitionedRegionPool used to process partitioned region messages (e.g. PutMessage, DestroyMessage) and the queue for excess requests when all the threads are in use. The partitionedRegionThreads statistic corresponds to the number of PartitionedRegion Message Processor threads (default maximum is the maximum of processors*32 and 200). If the partitionedRegionQueueSize is consistently greater than zero, then the partitionedRegionPool’s maximum number of threads can be increased by setting the DistributionManager.MAX_PR_THREADS Java system property.
- processingThreads/overflowQueueSize: indicate the number of threads in the ExecutorService called threadPool used to process normal messages (e.g. TXCommitMessage, ManagerStartupMessage) and the queue for excess requests when all the threads are in use. The processingThreads statistic corresponds to the number of Pooled Message Processor threads (default maximum is 1000). If the overflowQueueSize is consistently greater than zero, then the threadPool’s maximum number of threads can be increased by setting the DistributionManager.MAX_THREADS Java system property.
- sendersTO: indicates the number of outgoing thread-owned (TO) connections to other members. This statistic will only be set with the conserve-sockets property set to false. In that case, when a thread processing a request in one member needs to send a message to another member, it will create and use a dedicated connection to that member. An example is when a ServerConnection thread processing a client put request needs to replicate the value to a secondary member. This will cause the remote member to create a dedicated P2P message reader thread to handle this message and any future messages from the local member and thread. This will increment the sendersTO statistic in the local member and the receiversTO statistic in the remote member.
- receiversTO: indicates the number of incoming thread-owned (TO) connections from remote members. A corresponding sendersTO will be incremented in the remote member. This statistic corresponds to the number of P2P message reader threads and will only be set with the conserve-sockets property set to false.
- senderTimeouts: indicates the number of outgoing thread-owned (TO) connections that have been idle for the socket-lease-time property (default is 60000 ms) and have been closed. When a thread-owned connection is closed, its corresponding remote P2P message reader thread will also be closed. The local sendersTO and the remote receiversTO statistics will be decremented. In addition, the local senderTimeouts will be incremented. The thread-owned connections between members are created on demand and can be costly to create (especially with SSL). Once they are established, they should be maintained as long as the thread that established them exists. Increasing socket-lease-time (maximum is 600000 ms) or deactivating it by setting it to zero will help ensure that connections are not closed prematurely.
- replyTimeouts: indicates the number of times a thread in one member waited for at least ack-wait-threshold seconds (default=15) for a reply from another member. The thread will continue to wait even though the timeout has occurred until either the reply is received or the remote member leaves the DistributedSystem. This statistic corresponds to a 15 second warning message in the log.
- replyWaitsInProgress: indicates the number of threads in one member waiting for a reply from a remote member. This statistic flatlined above zero indicates a permanently stuck thread.
- suspectsReceived: indicates the number of suspect messages received from other members whenever a member departs unexpectedly or there are network issues such that a specific member cannot be contacted
- suspectsSent: indicates the number of suspect messages sent to other members whenever a member departs unexpectedly or there are network issues such that a specific member cannot be contacted
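The relationship between a pool's thread count and its queue size can be illustrated with a plain java.util.concurrent.ThreadPoolExecutor. This is a sketch of the general pattern, not GemFire's pool implementation, and the names are hypothetical. When every thread is busy, additional requests wait in the queue, which is the situation the queue size statistics above reveal.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

final class PoolQueueSketch {

    // Submits blocking tasks to a pool capped at maxThreads and returns
    // {threads in the pool, tasks waiting in the queue} while all tasks are blocked.
    static int[] threadsAndQueueSize(int maxThreads, int tasks) {
        CountDownLatch release = new CountDownLatch(1);
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
            maxThreads, maxThreads, 30, TimeUnit.SECONDS, new LinkedBlockingQueue<>());
        try {
            for (int i = 0; i < tasks; i++) {
                pool.execute(() -> {
                    try {
                        release.await(); // simulates a slow message handler
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            return new int[] {pool.getPoolSize(), pool.getQueue().size()};
        } finally {
            release.countDown(); // let the blocked tasks finish
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        int[] result = threadsAndQueueSize(2, 5);
        System.out.println("threads=" + result[0] + " queueSize=" + result[1]);
    }
}
```

With two threads and five slow tasks, three tasks are left waiting. Raising the maximum thread count, as the MAX_*_THREADS properties do for the real pools, shrinks that backlog.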
CacheServerStats
The CacheServerStats instance groups all the statistics related to client to server communication and processing.
- currentClients: indicates the number of unique clients that currently have a connection to this server. For long-lived clients, this statistic should be relatively flat.
- currentClientConnections: indicates the total number of client connections to this server. This statistic indicates the number of client threads performing Cache operations.
- closeConnectionRequests: indicates the number of close connection requests from clients. For long-lived clients, this statistic is an indicator of how often idle client connections are timed out and closed. This statistic also has a relationship with sendersTO and receiversTO. Churn in this statistic also means churn in those statistics. Churn in this case means socket connections from the client to the server and from that server to its members being closed and reopened. Since creating socket connections can be expensive (especially for SSL), this statistic should be as close to zero as possible. If there is a lot of churn in this statistic, then the client Pool's idle-timeout property should be increased or deactivated. The default is five seconds, which is often too low.
- connectionsTimedOut: indicates the number of connections that the server determines have timed out on the client based on the Pool's read-timeout property. Even though the statistic is incremented, the ServerConnection thread processing the client request continues processing that request. This statistic should be as close to zero as possible. If not, then the read-timeout property should be increased.
- threadQueueSize: indicates the number of client requests waiting for a ServerConnection thread to process them. It is only applicable if the CacheServer's max-threads property is set greater than zero. This property causes an ExecutorService called pool to be created. If the threadQueueSize is consistently greater than zero, then the max-threads property should be increased.
CachePerfStats
The CachePerfStats instance groups all the statistics related to Cache usage.
- cacheListenerCallsInProgress: indicates the number of CacheListener callbacks in progress. This statistic flatlined above zero indicates a permanently stuck CacheListener.
- cacheWriterCallsInProgress: indicates the number of CacheWriter callbacks in progress. This statistic flatlined above zero indicates a permanently stuck CacheWriter.
- loadsInProgress: indicates the number of CacheLoader callbacks in progress. This statistic flatlined above zero indicates a permanently stuck CacheLoader.
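The reason a flatline above zero points to a stuck callback follows from how an in-progress gauge is typically maintained: it is incremented before the callback runs and decremented when the callback returns. The sketch below shows that general pattern with hypothetical names. It is not taken from the GemFire source.

```java
import java.util.concurrent.atomic.AtomicInteger;

final class InProgressGauge {

    private final AtomicInteger inProgress = new AtomicInteger();

    // Runs the callback while it is counted as in progress.
    void invoke(Runnable callback) {
        inProgress.incrementAndGet();
        try {
            callback.run();
        } finally {
            // Only reached when the callback returns or throws.
            inProgress.decrementAndGet();
        }
    }

    int current() {
        return inProgress.get();
    }

    public static void main(String[] args) {
        InProgressGauge gauge = new InProgressGauge();
        gauge.invoke(() -> System.out.println("during callback: " + gauge.current()));
        System.out.println("after callback: " + gauge.current());
    }
}
```

A callback that never returns never reaches the decrement, so the gauge stays above zero for as long as the thread is stuck.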
Conclusion
This article has shown some of the more useful statistics used when troubleshooting issues.