vFabric GemFire is a sophisticated product in a complex problem space: data management in distributed systems. In order to help our users get the most out of GemFire, we are starting a “cookbook” series, which will provide tried and tested recipes that we hope every GemFire user will find useful.
Our first topic is the Visual Statistics Display (VSD). VSD is a visual tool for analyzing GemFire statistics. It reads GemFire statistics from special statistics archive files created by GemFire, and renders them as graphs for analysis. It is not a real-time online monitoring tool such as vFabric Hyperic, so it does not have the real-time monitoring and alerting capabilities that such tools have. On the other hand, it is the most powerful tool for examining the state of a vFabric GemFire system, as it provides access to all the statistics collected by GemFire. No real-time monitoring tool can do that, as the volume of statistics that GemFire collects is prohibitive for real-time collection in a distributed system.
Having a complete view into the state of a GemFire process is what makes VSD an indispensable forensic tool for performance analysis and for tracking down problems through offline analysis of the statistics gathered by the cluster. It is also helpful any time we need to verify the runtime state of a distributed system, for example upon startup or data loading: to make sure that all the nodes are present and see one another, that all the entries are loaded and well balanced across all the nodes, or that JVM heaps have enough headroom.
At the same time, all this “power” comes at a price. The number of statistics available for viewing in VSD can be overwhelming. In this article, I will point out some of the most important statistics that are useful in verifying the state of a distributed system, including its configuration, resource usage, and throughput for different operations.
With that in mind, let’s take a deeper look at VSD.
Getting Started with VSD
An important prerequisite for VSD is that the collection of GemFire statistics be enabled at runtime. That is accomplished by setting the configuration properties:
statistic-archive-file=myStats.gfs
As the collection of statistics at the default sampling rate of 1s does not affect performance, it should always be turned on: during development, testing, and in production.
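As a minimal sketch, the statistics settings could be generated as lines for a gemfire.properties file. Only statistic-archive-file comes from the text above. The property names statistic-sampling-enabled and statistic-sample-rate (in milliseconds) are assumptions here, as is the helper function itself, so check them against the documentation for your GemFire version.

```python
def render_statistics_properties(archive_file="myStats.gfs", sample_rate_ms=1000):
    """Return gemfire.properties lines that turn on statistics archiving."""
    settings = {
        # Assumed property name: switches statistics sampling on.
        "statistic-sampling-enabled": "true",
        # Assumed property name: sampling interval in milliseconds (1s).
        "statistic-sample-rate": str(sample_rate_ms),
        # From the text above: the archive file that VSD will read.
        "statistic-archive-file": archive_file,
    }
    return "\n".join(f"{key}={value}" for key, value in settings.items()) + "\n"


if __name__ == "__main__":
    print(render_statistics_properties(), end="")
```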
There is a special category of statistics called time-based statistics that can be very useful in troubleshooting and assessing performance of some GemFire operations, but they should be used with caution because their collection can affect performance. They can be enabled using the property
Once a distributed system is up and running, every GemFire instance will have its statistics file created. I usually copy all the stat files into one directory so that I can easily load them into VSD.
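Gathering the archives into one directory can be scripted. The sketch below assumes each member writes its *.gfs file into its own working directory; the function name and the directory names in the usage note are hypothetical. Each copy is prefixed with the member directory name, since every member may use the same archive file name.

```python
import shutil
from pathlib import Path


def collect_stat_files(member_dirs, target_dir):
    """Copy every *.gfs archive from the member directories into target_dir.

    Each copy is prefixed with its member directory name so that archives
    sharing a file name (e.g. myStats.gfs) do not overwrite one another.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    copied = []
    for member_dir in map(Path, member_dirs):
        for archive in sorted(member_dir.glob("*.gfs")):
            destination = target / f"{member_dir.name}-{archive.name}"
            shutil.copy2(archive, destination)  # copy2 keeps file timestamps
            copied.append(destination)
    return copied
```

For example, collect_stat_files(["server1", "server2"], "all-stats") would leave server1-myStats.gfs and server2-myStats.gfs in all-stats, ready to be loaded into VSD together.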
Important note: Starting with GemFire version 7.0, VSD is included in the product, and is located in the tools directory of the product directory tree. In previous releases, VSD was a separate, free download. For step-by-step instructions on setting up and using VSD, check out the documentation.
Analyzing the Data
Once you have VSD running and statistics archives loaded, it will be populated with lots of interesting data, as shown in the screen shot below.
But what does all this data mean? How do I know what statistics to look at? A large number of statistics are intended only for GemFire Tech Support and Engineering, and finding the ones that are also interesting to the rest of us can be overwhelming. Here is a quick guide to some of the most important categories and statistics they contain:
The first group of statistics helps verify the runtime configuration of a GemFire system:
- The number of peer nodes (i.e. servers or peer accessors) in the system: DistributionStats:nodes. This value should be the same for every node in the system.
- The number of clients and client connections for each server: CacheServerStats: currentClients and currentClientConnections.
- The number of data entries:
  - CachePerfStats:entries. Each region has its own CachePerfStats instance per JVM named RegionStats-<region name>, or RegionStats-partition-<region name> for partitioned regions, and its entries statistic is the number of entries for that region in the JVM.
  - DiskRegionStatistics (a per-region disk statistic category about the region’s disk use): entriesInVM and entriesOnlyOnDisk show the number of entries in the JVM (which may also be on disk), and the number of entries that are only on disk, respectively.
- Partitioned Region Configuration: One of the main parameters of Partitioned Region (PR) configuration is the primary bucket distribution. To make sure that primary buckets for a PR are evenly distributed, check the PartitionedRegionStats.primaryBucketCount statistic for each partition. This statistic shows the number of primary buckets in a partition.
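The configuration checks above come down to simple comparisons across members. The sketch below assumes the values have been read off VSD by hand into dictionaries keyed by member name. The member names, numbers, and helper functions are all hypothetical, and both helpers expect at least one member.

```python
def inconsistent_node_counts(nodes_by_member):
    """Return the members whose DistributionStats:nodes value differs
    from the most common value across all members."""
    values = list(nodes_by_member.values())
    # Most frequent value; ties resolve to the smallest value.
    expected = max(sorted(set(values)), key=values.count)
    return {m: v for m, v in nodes_by_member.items() if v != expected}


def primary_bucket_spread(primaries_by_member):
    """Return max - min of primaryBucketCount across members.

    A spread of 0 or 1 is as even as the bucket count allows.
    """
    counts = primaries_by_member.values()
    return max(counts) - min(counts)


if __name__ == "__main__":
    nodes = {"server1": 3, "server2": 3, "server3": 2}
    primaries = {"server1": 38, "server2": 37, "server3": 38}
    print(inconsistent_node_counts(nodes))   # members to investigate
    print(primary_bucket_spread(primaries))  # spread of primary buckets
```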
The resources that are vital for normal operation and performance are memory, file descriptors (most importantly sockets, then files), CPU, network, and disk (when disk operations, such as overflow and persistence, are involved). The following stats cover all those:
- Memory: There are several stats categories that show memory usage, for different types and granularity of memory.
  - Heap: VMMemoryUsageStats:vmHeapMemoryStats are all about heap usage, as are the memory stats under VMStats:vmStats: freeMemory, totalMemory, maxMemory.
  - Non-heap memory: VMMemoryUsageStats:vmNonHeapMemoryStats.
  - System-wide memory stats as reported by the OS: The OS statistic category (e.g. LinuxSystemStats on Linux) includes various system-level memory statistics, such as freeMemory, which shows the free memory on the host (rather than within the JVM process), physicalMemory (total physical memory on the host), and paging-related statistics (pagesSwappedIn, pagesSwappedOut, unallocatedSwap).
  - Client and gateway queue sizes: while not actual resources, these queues may be responsible for increased memory usage, so it’s good to keep them in mind when investigating memory issues. The client queue stats are in the ClientSubscriptionStats category: eventsQueued and eventsRemoved. The difference between the two is the current queue size. The gateway queue stats are in the GatewayStatistics (GatewaySenderStatistics as of GemFire 7.0) category: eventQueueSize is the size of the queue.
- File Descriptors: file descriptor related statistics are captured in the category VMStats: fdsOpen and fdLimit show the number of open file descriptors, and the limit on file descriptors for the host, respectively.
- CPU: The CPU usage is captured in the OS statistic category, e.g. LinuxSystemStats. The statistic cpuActive shows the percentage of the total available CPU time that has been used in a non-idle state.
- System load: The OS statistic category (e.g. LinuxSystemStats) includes the loadAverage1, loadAverage5, and loadAverage15 statistics, which show the average system load for 1, 5, and 15 minutes.
- Network: OS stats also include network-related stats for received (recv) and transmitted (xmit) traffic: recvBytes, xmitBytes, recvErrors, xmitErrors. Note that some of these statistics may be incorrect in GemFire versions prior to 6.6.2 due to a bug that was fixed in that release.
- Disk: DiskDirStatistics:diskSpace shows the amount of disk space used for GemFire disk storage on a given disk. The above-mentioned entriesOnlyOnDisk and entriesInVM from DiskRegionStatistics are useful for determining the distribution of data between memory and disk, for regions that use disk overflow/persistence.
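A few of the resource statistics above are most useful in combination. The sketch below shows the arithmetic, with hypothetical helper names and sample values. It assumes that the VMStats values freeMemory, totalMemory, and maxMemory carry the same meaning as the corresponding java.lang.Runtime methods (free space within the allocated heap, allocated heap, and heap limit), which is worth confirming for your version.

```python
def heap_headroom_fraction(free_memory, total_memory, max_memory):
    """Fraction of the maximum heap that is still available.

    Assumes Runtime-style semantics: used heap = totalMemory - freeMemory.
    """
    used = total_memory - free_memory
    return (max_memory - used) / max_memory


def client_queue_size(events_queued, events_removed):
    """Current client queue size: eventsQueued minus eventsRemoved."""
    return events_queued - events_removed


def fd_usage_fraction(fds_open, fd_limit):
    """Share of the file descriptor limit currently in use."""
    return fds_open / fd_limit


if __name__ == "__main__":
    # Hypothetical readings taken from VSD at one point in time (heap in MB).
    print(heap_headroom_fraction(256, 1024, 2048))
    print(client_queue_size(1500, 1420))
    print(fd_usage_fraction(512, 1024))
```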
Throughput for Different Operations
There are several stat categories that capture the throughput for GemFire operations: CachePerfStats (non-PR, and PR specific), and CacheServerStats, which capture throughput statistics with respect to clients. Note that the PR specific instances of CachePerfStats cover only the specific partitioned regions, while the cachePerfStats instance includes aggregate stats for all non-PR regions.
- The CachePerfStats category includes the following stats (all measured in the number of operations per second):
  - gets: the number of successful gets
  - puts: the number of times an entry has been added or replaced as a result of a local operation (put, create, or get which results in a load, netsearch, or netload of a value)
  - updates: the number of updates originating remotely
  - putalls: the number of putAll operations
  - destroys: the number of destroys
- Function execution: FunctionService
- Queries: queryExecutions: the number of query executions
- Transactions: txCommits, txFailures, txRollbacks: the number of successful, failed, and rolled back transactions, respectively
- The CacheServerStats category includes the following throughput stats for client operations on the cache server:
  - getRequests, getResponses,
  - getAllRequests, getAllResponses,
  - putRequests, putResponses,
  - putAllRequests, putAllResponses,
  - queryRequests, queryResponses.
- Disk operations: If any disk-related statistic categories are present in VSD, that means that there is disk activity (some entries are on disk). The presence of disk operations may explain a drop in throughput, as disk use slows things down.
  - DiskRegionStatistics (statistics about a region’s disk use): writes, writeTime, writtenBytes, reads, readTime, readBytes.
  - DiskStoreStatistics are statistics about a specific disk store’s use of disk. In addition to write/read statistics like those in DiskRegionStatistics, this category includes the queueSize statistic, which shows the current number of entries in the asynchronous queue waiting to be flushed to disk.
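When working with exported values rather than VSD graphs, a per-second rate has to be derived by hand. The sketch below assumes, and this is an assumption rather than something stated above, that the raw values of a statistic such as gets are cumulative counters sampled at a fixed interval. The function name and sample values are hypothetical.

```python
def per_second_rates(samples, interval_seconds=1.0):
    """Convert consecutive cumulative counter samples into operations per second.

    Each rate is the increase between two neighbouring samples divided by
    the sampling interval, so n samples yield n - 1 rates.
    """
    return [
        (later - earlier) / interval_seconds
        for earlier, later in zip(samples, samples[1:])
    ]


if __name__ == "__main__":
    # Hypothetical cumulative "gets" values sampled once per second.
    print(per_second_rates([100, 250, 400, 400]))
```

A flat stretch in the counter shows up as a rate of zero, which makes pauses in throughput easy to spot next to disk or memory statistics for the same time window.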