At VMworld-2012 San Francisco, the session APP-CAP1426 – The Benefits of Virtualization for Middleware was well attended, and we want to thank all of the attendees who helped us score 4.5 out of 5 in our survey results. Because of this, the session is going to be presented at VMworld-2012 Barcelona, and we are posting related information here in this article. Before reading this article, you might want to take a look at the related blog post we released before VMworld-2012 San Francisco.
NOTE: Just as at VMworld-2012 San Francisco, in Barcelona we will raffle copies of my book, Enterprise Java Applications Architecture on VMware.
This year at the session, we went deep into tuning large-scale middleware and the discussion around JVM tuning was well received. Hence as a follow-up, I wanted to share some of my more recent research, which will be discussed at VMworld-2012 Barcelona. This focuses on tuning in-memory data management systems such as vFabric SQLFire.
The article below covers:
- An Overview of GC Tuning
- Parallel GC in Young Generation and CMS in Old Generation
- GC Tuning Recipe
- Step A – Young Generation Tuning
- Step B – Old Generation Tuning
- Step C – Survivor Spaces Tuning
- JVM and GC Best Practices for vFabric SQLFire Members
Overview of GC Tuning
There are two main garbage collection (GC) policies: throughput/parallel GC and CMS. Discussion of others is omitted because they do not currently apply to latency sensitive workloads.
The throughput/parallel GC policy is called the throughput GC because it focuses on improved memory throughput as opposed to better response time. It is also called the parallel GC because it uses multiple worker threads (configured with -XX:ParallelGCThreads=<nThreads>) to collect garbage. The throughput/parallel GC is essentially a stop-the-world type of collector, and therefore the application is paused when GC activity occurs. To minimize this effect and create a more scalable system, multiple parallel GC threads can be configured to help parallelize the minor GC activity.
Although the throughput GC uses multiple worker threads to collect garbage in the young generation, those threads pause application threads when they run, which could be problematic for latency sensitive workloads. The number of parallel GC threads (-XX:ParallelGCThreads) and the young generation size (-Xmn) are two key tuning options to consider adjusting up or down. GC activity might not be as frequent in the old generation, but when it does take place there, the application experiences a garbage collection time that is significantly longer than that of a young generation collection. This is especially true if the parallel/throughput collector is used in the old generation. To mitigate this pause problem in the old generation, it is possible to use the CMS GC in the old generation while the young generation is still collected by the throughput/parallel collector.
CMS stands for concurrent mark and sweep. For most of the cycle, the CMS GC threads do not pause the application threads; they run concurrently alongside them. CMS has multiple phases, so it is also sometimes referred to as the multipass collector. The phases are: initial mark, concurrent marking and pre-cleaning, remark, and sweeping. Although the CMS collector is named the concurrent collector, it is more accurately referred to as the “mostly concurrent collector” because there are two short pausing phases, first in the initial mark and then later in the remark phase. These pauses are short relative to the overall CMS cycle and are therefore mostly ignored from a practical concurrency versus amount of pause perspective.
The CMS phases operate as follows:
- Initial Mark Phase (short pause) – The beginning of tenured generation collection within the overall phases of CMS. In this phase all the objects directly reachable from roots are marked. This is done with all the mutator threads stopped.
- Concurrent Marking Phase (no pause) – Threads stopped in the Initial Mark phase are started again, and all the objects reachable from the objects marked in the first phase are marked here.
- Concurrent Pre-cleaning Phase (no pause) – Looks at the objects in the heap that were updated by promotions from the young generation or by mutator threads during the preceding concurrent marking phase. Rescanning these objects concurrently in the pre-cleaning phase helps to reduce the work in the next “remark” pausing phase.
- Remark Phase (short pause) – This phase rescans any residual updated objects in the heap, retracing them from the object graph.
- Concurrent Sweeping Phase (no pause) – Dead objects are swept concurrently while all other threads keep running.
Parallel GC in Young Generation and CMS in Old Generation
Figure 1 illustrates a JVM configuration that combines parallel GC in the young generation with CMS in the old generation. It shows the young generation in the blue box, sized by -Xmn and configured with -XX:ParallelGCThreads. The Minor GC threads run as dotted blue arrows between application threads, which are depicted as green arrows. Multiple worker threads collect garbage in the young generation because of the -XX:ParallelGCThreads configuration. Each time these worker threads run to collect garbage, they pause the green arrow application threads. However, multiple worker threads help to shorten that pause. Naturally, the size of the young generation plays a role in this. As the size of the young generation increases, the duration of the Minor GC increases, but it is less frequent. The smaller the young generation, the more frequent and the shorter the Minor GC.
The old generation cannot be directly sized, but instead is implicitly sized as the difference between -Xmx and -Xmn. In the old generation, the GC policy is configured with -XX:+UseConcMarkSweepGC. This GC runs concurrently alongside application threads without pausing them, as depicted by the red arrow that denotes CMS activity.
Figure 1 – Parallel GC in Young Generation and CMS in Old Generation
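To confirm which collectors a running JVM actually uses, you can ask the platform MXBeans. The following is a minimal sketch; the class name ActiveCollectors is only illustrative. On a HotSpot JVM that still ships CMS and is started with -XX:+UseParNewGC and -XX:+UseConcMarkSweepGC, the reported names are expected to be "ParNew" and "ConcurrentMarkSweep", but other JVMs and versions report other names.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

// Illustrative helper (not part of any product): lists the garbage
// collectors registered in the JVM that runs this code.
final class ActiveCollectors {

    /** Returns the names of the collectors reported by the platform MXBeans. */
    static List<String> names() {
        List<String> names = new ArrayList<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            names.add(gc.getName());
        }
        return names;
    }

    public static void main(String[] args) {
        // One line per collector, typically one for the young generation
        // and one for the old generation.
        for (String name : names()) {
            System.out.println(name);
        }
    }
}
```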
The remaining JVM options used in Figure 1 are described in Table 1. The complete configuration appears later in this post; see BP 19 – Use Parallel and CMS GC Policy Combination.
Table 1 – JVM Configuration Options Used for Parallel Young Generation and CMS in Old Generation
|-Xmn21g||Fixed size young generation.|
|-XX:+UseConcMarkSweepGC||The concurrent collector is used to collect the tenured generation and does most of the collection concurrently with the execution of the application. The application is paused for short periods during the collection. A parallel version of the young generation copying collector is used with the concurrent collector.|
|-XX:+UseParNewGC||Sets whether to use multiple threads in the young generation (with CMS only). By default, this is enabled in Java 6u13, and probably any Java 6, when the machine has multiple processor cores.|
|-XX:CMSInitiatingOccupancyFraction=75||Sets the percentage of the tenured generation that must be full before the JVM starts a concurrent collection there. The default is approximately 92 in Java 6, but that can lead to significant problems. Setting this lower allows CMS to run more often (sometimes all the time), but it often clears more quickly to avoid fragmentation.|
|-XX:+UseCMSInitiatingOccupancyOnly||Indicates that all concurrent CMS cycles should start based on -XX:CMSInitiatingOccupancyFraction=75.|
|-XX:+ScavengeBeforeFullGC||Do young generation GC prior to a full GC.|
|-XX:TargetSurvivorRatio=80||Desired percentage of survivor space used after scavenge.|
|-XX:SurvivorRatio=8||Ratio of eden/survivor space size.|
|-XX:+UseBiasedLocking||Enables a technique for improving the performance of uncontended synchronization. An object is “biased” toward the thread which first acquires its monitor using a monitorenter bytecode or synchronized method invocation; subsequent monitor-related operations performed by that thread are relatively much faster on multiprocessor machines. Some applications with significant amounts of uncontended synchronization can attain significant speedups with this flag enabled. Some applications with certain patterns of locking can see slowdowns, although attempts have been made to minimize the negative impact.|
|-XX:MaxTenuringThreshold=15||The maximum age that a young object is allowed to reach in the young generation before it is tenured to the old generation. The age of an object is incremented by 1 each time the object survives a Minor GC and is copied to a Survivor space. The maximum for the HotSpot JVM in Java 6 is 15. A smaller value causes premature promotions to the old generation, which can lead to more frequent old generation activity that can hurt response times.|
|-XX:ParallelGCThreads=4||Sets the number of garbage collection worker threads in the young generation. The default value varies with the JVM platform. This value should not be higher than 50% of the cores available to the JVM. There is an assumption that a single JVM is running on one virtual machine, and that no other JVM is contending for the cores available to the virtual machine on which the JVM runs. For example, suppose a vSphere cluster has 16 virtual machines and therefore 16 vFabric SQLFire members. Each virtual machine is configured to have 68GB RAM and 8 vCPUs. One vFabric SQLFire member JVM virtual machine runs on one 8-core socket within the vSphere host. This implies that 8 cores are available to service the 8 vCPUs allocated to the virtual machine. Because -XX:ParallelGCThreads=4, four vCPUs are consumed by the ParallelGCThreads and the remaining four are available to service application threads, concurrent old generation activity, off the heap activity, and any other workloads that might be running on the virtual machine, such as a monitoring agent. One minor caveat: of the very short pausing phases of CMS, the initial mark is single threaded but finishes rather quickly, and the re-mark is multithreaded. The initial mark, being single threaded, does not use any of the -XX:ParallelGCThreads allocated, but the re-mark phase, being multithreaded, uses some of the parallel threads allocated. Because re-mark is a very short phase, it uses negligible parallel thread cycles. There is enough variance from workload to workload that these assumptions should be verified for your own application with a load test.|
|-XX:+UseCompressedOops||Enables the use of compressed pointers (object references represented as 32-bit offsets instead of 64-bit pointers) for optimized 64-bit performance with Java heap sizes less than 32GB.|
|-XX:+OptimizeStringConcat||Optimize string concatenation operations where possible. (Introduced in Java 6 update 20.)|
|-XX:+UseCompressedStrings||Use with caution. Use a byte for strings that can be represented as pure ASCII. (Introduced in Java 6 update 21 Performance Release.) In certain versions of Java 6 this option may have been deprecated.|
|-XX:+UseStringCache||Enables caching of commonly allocated strings.|
|-XX:+DisableExplicitGC||Disables all calls to System.gc() that might still be embedded erroneously in the application code.|
|-XX:+UseNUMA||Do not use. This JVM option should not be used because it is not compatible with the CMS garbage collector. vFabric SQLFire server JVMs deployed on virtual machines running on the vSphere ESXi 5 hypervisor do not need this flag because VMware provides many NUMA optimizations that have been proven to provide great locality for vFabric SQLFire types of workloads.|
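The Table 1 options can be assembled into a single option list. The sketch below is only an illustration: the class name Table1Flags and its parameters are hypothetical, the 64GB heap, 21GB young generation, and 4 GC threads are taken from the example used in this post, and -Xms is set equal to -Xmx on the assumption that the initial heap should match the maximum heap. It leaves out -XX:+UseNUMA (do not use) and -XX:+UseCompressedStrings (use with caution), and adds -XX:+UseCompressedOops only for heaps below 32GB, as the table describes. These options target the Java 6 HotSpot JVM discussed here; newer JVMs may reject some of them.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative helper (hypothetical name): builds the Table 1 option list.
final class Table1Flags {

    /** Returns JVM options for the given heap size, young generation size (GB), and GC thread count. */
    static List<String> build(int heapGb, int youngGb, int gcThreads) {
        List<String> flags = new ArrayList<>();
        flags.add("-Xms" + heapGb + "g"); // assumption: initial heap equals maximum heap
        flags.add("-Xmx" + heapGb + "g");
        flags.add("-Xmn" + youngGb + "g");
        flags.add("-XX:+UseConcMarkSweepGC");
        flags.add("-XX:+UseParNewGC");
        flags.add("-XX:CMSInitiatingOccupancyFraction=75");
        flags.add("-XX:+UseCMSInitiatingOccupancyOnly");
        flags.add("-XX:+ScavengeBeforeFullGC");
        flags.add("-XX:TargetSurvivorRatio=80");
        flags.add("-XX:SurvivorRatio=8");
        flags.add("-XX:+UseBiasedLocking");
        flags.add("-XX:MaxTenuringThreshold=15");
        flags.add("-XX:ParallelGCThreads=" + gcThreads);
        flags.add("-XX:+OptimizeStringConcat");
        flags.add("-XX:+UseStringCache");
        flags.add("-XX:+DisableExplicitGC");
        if (heapGb < 32) {
            // Compressed pointers only apply to heaps smaller than 32GB.
            flags.add("-XX:+UseCompressedOops");
        }
        return flags;
    }

    public static void main(String[] args) {
        // Example values from this post: 64GB heap, 21GB young generation, 4 GC threads.
        System.out.println(String.join(" ", build(64, 21, 4)));
    }
}
```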
The GC tuning presented so far is typically adequate for most workloads, with some caveats in terms of adjusting -Xmn, ParallelGCThreads, SurvivorRatio, and the stack size -Xss. The following section presents a GC tuning recipe in more detail to help guide your decisions when tuning GC cycles further in the young and old generations. Many recipes can be followed; the one discussed in the next section is deliberately prudent.
GC Tuning Recipe
There are many other recipes that can be applied, and JVM tuning can take many weeks of discussion on theory and practice. This section describes what is applicable to JVM tuning for latency sensitive workloads.
In Figure 2, a three-step tuning recipe is outlined as follows:
Step A: Young Generation Tuning. This involves measuring the frequency and duration of Minor GC, then adjusting -Xmn and -XX:ParallelGCThreads to meet application response time SLAs.
Step B: Old Generation Tuning. This involves measuring the frequency and duration of the CMS GC cycles and adjusting -Xmx and -XX:CMSInitiatingOccupancyFraction to meet SLA application workload response time requirements.
Step C: Survivor Spaces Tuning. This is a refinement step that tunes the survivor spaces either to delay promotion from the young generation by increasing the survivor space size, or to reduce Minor GC duration and speed up the onset of promotion from the young generation to the old generation by reducing the survivor space size.
Figure 2. GC Tuning Recipe for Parallel GC in Young Generation and CMS in Old Generation
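To make the Step B setting more concrete, it can help to translate -XX:CMSInitiatingOccupancyFraction into an absolute amount of memory. The following sketch assumes the fraction is measured against the old generation, which is -Xmx minus -Xmn; the class name CmsTrigger is hypothetical and the sizes come from the 64GB example in this post.

```java
// Illustrative arithmetic only (hypothetical class name).
final class CmsTrigger {

    /** Old generation size implied by the heap and young generation sizes. */
    static double oldGenGb(double xmxGb, double xmnGb) {
        return xmxGb - xmnGb;
    }

    /** Old generation occupancy, in GB, at which a concurrent CMS cycle would start. */
    static double triggerGb(double xmxGb, double xmnGb, int occupancyFraction) {
        return oldGenGb(xmxGb, xmnGb) * occupancyFraction / 100.0;
    }

    public static void main(String[] args) {
        // 64GB heap, 21GB young generation, CMSInitiatingOccupancyFraction=75
        System.out.printf("old generation: %.2f GB, CMS starts at about %.2f GB%n",
                oldGenGb(64, 21), triggerGb(64, 21, 75));
    }
}
```

With these example values the old generation is 43GB, so a lower fraction starts CMS earlier and leaves more headroom while the concurrent phases run.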
Step A – Young Generation Tuning
In this step (Step A from Figure 2), the frequency (how often GC runs) and duration (how long GC runs for) of the Minor GC are first measured, then compared with the GC pause and response time requirements, to determine whether the GC cycle must be tuned. Understanding the internals of the young generation is critical to fine tuning the Minor GC cycle, and therefore the diagram from Figure 1 appears in Figure 3 with a slight modification to further detail the young generation cycle. The main objective of this tuning is to measure the frequency and duration of Minor GC and determine whether sufficient time is made available for application threads to run in between Minor GC activity. In Figure 3, the application threads are shown as green arrows running between the Minor GC activity.
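One possible way to take these measurements from inside the JVM is to sample the garbage collector MXBeans twice and compare the cumulative counters. This is a minimal sketch, not the only approach (GC logs work as well); the class name GcSampler and the 10-second window are assumptions. Note that the reported collection time is an approximate elapsed time, which corresponds to pause time for a stop-the-world young generation collector but not for the concurrent phases of CMS.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.List;

// Illustrative sampler (hypothetical name): derives GC frequency and
// average duration from two readings of the cumulative MXBean counters.
final class GcSampler {

    /** Average time per collection (ms) between two cumulative readings. */
    static double averagePauseMs(long countBefore, long timeMsBefore,
                                 long countAfter, long timeMsAfter) {
        long collections = countAfter - countBefore;
        if (collections <= 0) {
            return 0.0; // no collections ran in the window
        }
        return (double) (timeMsAfter - timeMsBefore) / collections;
    }

    /** Collections per minute over a sampling window of the given length. */
    static double collectionsPerMinute(long countBefore, long countAfter, long windowMs) {
        if (windowMs <= 0) {
            return 0.0;
        }
        return (countAfter - countBefore) * 60_000.0 / windowMs;
    }

    public static void main(String[] args) throws InterruptedException {
        long windowMs = 10_000L; // hypothetical sampling window
        List<GarbageCollectorMXBean> collectors = ManagementFactory.getGarbageCollectorMXBeans();
        long[] counts = new long[collectors.size()];
        long[] timesMs = new long[collectors.size()];
        for (int i = 0; i < collectors.size(); i++) {
            counts[i] = collectors.get(i).getCollectionCount();
            timesMs[i] = collectors.get(i).getCollectionTime();
        }
        Thread.sleep(windowMs); // the workload under test runs during this window
        for (int i = 0; i < collectors.size(); i++) {
            GarbageCollectorMXBean gc = collectors.get(i);
            System.out.printf("%s: %.1f collections/min, %.1f ms average%n",
                    gc.getName(),
                    collectionsPerMinute(counts[i], gc.getCollectionCount(), windowMs),
                    averagePauseMs(counts[i], timesMs[i],
                            gc.getCollectionCount(), gc.getCollectionTime()));
        }
    }
}
```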
The young generation is sized by -Xmn as shown in Figure 3, and configured with -XX:+UseParNewGC as indicated previously, with multiple worker threads for the GC cycles configured through -XX:ParallelGCThreads=<nThreads>. The young generation also contains two Survivor Spaces (dark blue boxes), indicated as S0 and S1 on the diagram. Each of these spaces is sized as SurvivorSpaceSize = -Xmn / (-XX:SurvivorRatio + 2). The other space of significance, and one of the most important spaces within the young generation, is the Eden Space (orange box in Figure 3). Eden Space is implicitly sized as the difference between -Xmn and SurvivorSpaceSize*2. A more complete tuning discussion of survivor spaces is given with Step C of this tuning recipe, but in brief, starting with a SurvivorRatio of 8 makes SurvivorSpaceSize 10% of -Xmn. Therefore S0 is 10% of -Xmn, and S1 is also 10%. The Eden Space size is 80% of -Xmn when SurvivorRatio is set to 8.
Figure 3. Measuring Minor GC Duration and Frequency in the Young Generation
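The sizing arithmetic for S0, S1, and Eden can be checked with a few lines of code. This is a minimal sketch with a hypothetical class name, YoungGenLayout; it applies only the formula SurvivorSpaceSize = -Xmn / (SurvivorRatio + 2), so the sizes a real JVM reports can differ slightly because of internal alignment.

```java
// Illustrative arithmetic only (hypothetical class name).
final class YoungGenLayout {

    /** Size of each survivor space (S0 and S1) in MB. */
    static long survivorSpaceMb(long xmnMb, int survivorRatio) {
        return xmnMb / (survivorRatio + 2);
    }

    /** Eden is what remains of the young generation after S0 and S1. */
    static long edenMb(long xmnMb, int survivorRatio) {
        return xmnMb - 2 * survivorSpaceMb(xmnMb, survivorRatio);
    }

    public static void main(String[] args) {
        long xmnMb = 21L * 1024; // -Xmn21g from the example in this post
        int survivorRatio = 8;   // -XX:SurvivorRatio=8
        System.out.println("S0 = S1 = " + survivorSpaceMb(xmnMb, survivorRatio) + " MB");
        System.out.println("Eden    = " + edenMb(xmnMb, survivorRatio) + " MB");
    }
}
```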
At this point, we assume the frequency and duration of Minor GC have been measured. The next section describes the impact of adjusting some of these parameters.
Impact of Adjusting -Xmn
The single most critical JVM option by far is -Xmn, the young generation size. Consider the impact of reducing or increasing -Xmn. This section describes the young generation internal cycle and the impact of adjusting its size.
Understanding the young generation GC cycle: All objects are created in the Eden Space. When Minor GC occurs, the Eden Space is completely cleaned and all objects that survive are moved to the first Survivor Space S0. After some time when another Minor GC takes place, Eden Space is cleaned again and more survivors are moved to the Survivor Space (with some copying between Survivor Space S0 and S1). Therefore, it is critical to have ample space for Eden and sufficient space for Survivor Space. As mentioned previously, make the preliminary assumption that a SurvivorRatio of 8 is adequate for SurvivorSpaceSize.
Understanding the impact of reducing or increasing young generation size -Xmn: If you determine that the duration of the Minor GC is too long and it pauses longer than the application threads can tolerate (seen as long application response times), it is appropriate to reduce the value of -Xmn. A long Minor GC duration or pause is an indication that the young generation is sized too large for your application.
In most examples we assume -Xmn is about 33% of -Xmx, which is a good starting point, but it depends on the scale of the heap. For smaller JVM heap sizes of less than 8GB, 33% might make sense. However, as you get to larger sizes (like the example), 33% of 64GB implies -Xmn is 21GB, which is a significant amount of space.
If response time requirements are met, you do not have to adjust the assumption of -Xmn = 33% of -Xmx. However, if the duration of the pause is too long, adjust -Xmn down and observe the impact on the response time of the application. Typically, as you reduce -Xmn, you reduce the pause time of Minor GC and at the same time increase its frequency. This is because a reduction in -Xmn implies a reduced Eden Space size, which causes Minor GC to run more frequently. This might not be a bad compromise: application thread execution is spread more uniformly across the life cycle of the young generation, with many shorter pauses instead of abrupt long ones.
However, if Minor GC runs too frequently, meaning the application threads hardly get a chance to execute, then you have sized -Xmn too low. Increasing -Xmn causes the duration of the pause to increase. You can iteratively adjust -Xmn, first downwards to the point where you start to see too many Minor GCs, and then slightly higher in the next iteration, until you find the best compromise. If, after many iterations, you are satisfied with the frequency of Minor GC but the duration is still slightly problematic, you can increase -XX:ParallelGCThreads so that more worker threads collect garbage in parallel.
When adjusting -XX:ParallelGCThreads, do not size it at more than 50% of the available number of underlying vCPUs or CPU cores. In the example used in BP 19 – Use Parallel and CMS GC Policy Combination, one JVM is configured to run on one virtual machine that resides on one socket that has 8 underlying CPU cores, and therefore 50% of the CPU compute resource is allocated to potentially be consumed by -XX:ParallelGCThreads. The other 50% on the socket remains for regular application transactions. That is, four vCPUs are consumed by the ParallelGCThreads and the remaining four are available to service application threads, concurrent old generation activity, off the heap activity, and any other workload that might be running on the virtual machine, such as a monitoring agent.
One minor caveat: of the very short pausing phases of CMS, the initial mark is single threaded and finishes rather quickly, whereas the re-mark is multithreaded. The initial mark, being single threaded, does not use any of the -XX:ParallelGCThreads allocated, but the re-mark phase, being multithreaded, uses some of the parallel threads allocated. Because re-mark is a very short phase, it uses negligible parallel thread cycles.
You can tune -XX:ParallelGCThreads to below the 50% allocation to give more threads back to your applications. If you attempt this and it does not hurt overall response time, then it might be prudent to reduce -XX:ParallelGCThreads. Conversely, if you have exhausted young generation size tuning, -Xmn, and have ample CPU cycles, consider increasing beyond the 50% mark progressively in one-thread increments. Load test and measure response times for the application.
When considering reducing -XX:ParallelGCThreads, the minimum should be two. Any lower than this can negatively impact the behavior of the parallel collector. Sizing large-scale JVMs for vFabric SQLFire types of workloads, for example 8GB and greater, requires at least a 4 vCPU virtual machine configuration, because two vCPUs are taken by -XX:ParallelGCThreads and the other two vCPUs are taken by the application threads. Further, when using a CMS type of configuration, you should always use virtual machines with four vCPUs or more. As previously described, assume a starting SurvivorRatio of 8 and defer any survivor space tuning until Step C.
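The thread-count guidance (start at no more than 50% of the cores, never fewer than two threads, and at least four vCPUs per virtual machine) can be sketched as follows. The class and method names are hypothetical, and the result is only a starting value to validate with a load test.

```java
// Illustrative starting-point calculation (hypothetical class name).
final class GcThreadSizing {

    static final int MIN_GC_THREADS = 2;
    static final int MIN_VCPUS = 4;

    /** Half of the available cores, but never fewer than two threads. */
    static int startingParallelGcThreads(int availableCores) {
        return Math.max(MIN_GC_THREADS, availableCores / 2);
    }

    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors();
        if (cores < MIN_VCPUS) {
            // Below four vCPUs, two GC threads would exceed the 50% guideline.
            System.out.println("warning: fewer than " + MIN_VCPUS + " vCPUs available");
        }
        System.out.println("-XX:ParallelGCThreads=" + startingParallelGcThreads(cores));
    }
}
```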
Figure 4 demonstrates the impact of reducing -Xmn as described in the preceding discussion on “Understanding the impact of reducing or increasing young generation size -Xmn.” The diagram shows the original frequency of Minor GC as solid blue triangles with a larger duration/pause. When -Xmn is reduced, the frequency of Minor GC increases, as depicted by the dashed triangles.
Figure 4. Impact of Reducing -Xmn
Figure 5 demonstrates the impact of increasing -Xmn, which has the benefit of reducing the frequency of Minor GC, but the drawback of increasing its duration or pause. You can use the iterative approach to balance how far to increase -Xmn versus how much to decrease it. The effective range for large-scale JVMs starts at a few gigabytes and is never more than approximately 33% of -Xmx.
Figure 5. Impact of Increasing -Xmn
You can mitigate the increase in Minor GC duration by increasing -XX:ParallelGCThreads. When increasing -XX:ParallelGCThreads, you should not increase it to more than 50% of the available CPU cores dedicated to the vFabric SQLFire member JVM residing on a virtual machine. This should be done in concert with measuring the core CPU utilization to determine if there is ample CPU left to allocate even more threads.
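One way to read the Step A iteration is as a small decision procedure. The sketch below is a simplification under stated assumptions: the class name StepADecision, the action names, and the pause and frequency limits are all hypothetical inputs that you would derive from your own SLAs, and each call suggests only the next adjustment to load test.

```java
// Illustrative decision sketch for one Step A iteration (hypothetical names).
final class StepADecision {

    enum Action {
        KEEP,
        DECREASE_XMN,
        INCREASE_XMN,
        INCREASE_PARALLEL_GC_THREADS,
        REFINE_SURVIVOR_SPACES
    }

    static Action next(double avgPauseMs, double maxPauseMs,
                       double gcsPerMinute, double maxGcsPerMinute,
                       int gcThreads, int cores) {
        boolean pauseTooLong = avgPauseMs > maxPauseMs;
        boolean tooFrequent = gcsPerMinute > maxGcsPerMinute;
        if (!pauseTooLong && !tooFrequent) {
            return Action.KEEP;
        }
        if (pauseTooLong && !tooFrequent) {
            return Action.DECREASE_XMN; // shorter pauses at the cost of more frequent Minor GC
        }
        if (!pauseTooLong) {
            return Action.INCREASE_XMN; // fewer Minor GCs at the cost of longer pauses
        }
        // Both limits are exceeded, so -Xmn alone cannot fix it: add GC worker
        // threads while staying within 50% of the cores, otherwise fall back
        // to survivor space refinement.
        return gcThreads < cores / 2
                ? Action.INCREASE_PARALLEL_GC_THREADS
                : Action.REFINE_SURVIVOR_SPACES;
    }

    public static void main(String[] args) {
        // Hypothetical measurements: 200 ms average pause against a 100 ms limit,
        // 5 Minor GCs per minute against a limit of 10, 4 GC threads on 8 cores.
        System.out.println(next(200, 100, 5, 10, 4, 8));
    }
}
```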
Step B – Old Generation Tuning
This step (Step B from Figure 2) is concerned with tuning the old generation after measuring the frequency and duration of major/full GC. The single most important JVM option that influences old generation tuning is often the total heap size, -Xmx, and adjusting it up or down has a bearing on old generation full GC behavior. If you increase -Xmx, full GC takes longer but occurs less frequently, and vice versa. The decision as to when to adjust -Xmx depends directly on the adjustments made in Step A, where we adjusted -Xmn. When you increase -Xmn you reduce the old generation space, because the old generation is sized implicitly (not through a direct JVM option) as “-Xmx minus -Xmn”. The tuning decision to offset the increased -Xmn size is to proportionally increase -Xmx to accommodate the change. If you increased -Xmn by 5% of -Xmx, then -Xmx also needs to be increased by 5%. If you don’t, then the impact of increasing -Xmn on the old generation is an increased full GC frequency, because the old generation space has been reduced.
The inverse argument also holds: where -Xmn is reduced, -Xmx has to be proportionally reduced as well. If you don’t adjust -Xmx as a result of reducing -Xmn, then the old generation space is proportionally larger and the Full GC duration will be longer.
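The offsetting adjustment amounts to keeping the old generation (-Xmx minus -Xmn) unchanged while -Xmn moves. A minimal sketch of that arithmetic follows; the class name HeapRebalance and the new young generation size of 16GB are hypothetical, and the 64GB and 21GB starting values come from the example in this post.

```java
// Illustrative arithmetic only (hypothetical class name).
final class HeapRebalance {

    /** The old generation is whatever the young generation leaves of the heap. */
    static long oldGenGb(long xmxGb, long xmnGb) {
        return xmxGb - xmnGb;
    }

    /** The -Xmx value that keeps the old generation unchanged after -Xmn changes. */
    static long adjustedXmxGb(long xmxGb, long oldXmnGb, long newXmnGb) {
        return xmxGb + (newXmnGb - oldXmnGb);
    }

    public static void main(String[] args) {
        long xmx = 64;
        long xmn = 21;
        long newXmn = 16; // hypothetical value after reducing -Xmn in Step A
        long newXmx = adjustedXmxGb(xmx, xmn, newXmn);
        System.out.println("old generation before: " + oldGenGb(xmx, xmn) + " GB");
        System.out.println("new -Xmx: " + newXmx + " GB, old generation after: "
                + oldGenGb(newXmx, newXmn) + " GB");
    }
}
```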
Figure 6 shows that decreasing the young generation size, -Xmn, has the effect of increasing the Full GC duration in the old generation. The increase in old generation Full GC duration can be offset by decreasing -Xmx in proportion to the amount by which -Xmn was reduced.
Figure 6. Impact of Decreasing Young Generation on Old Generation
Figure 7 shows the impact on the old generation of increasing the young generation size, -Xmn. When you increase the young generation size, the old generation implicitly becomes smaller, which causes the frequency of full GC to increase. One way to offset this is to increase -Xmx in proportion to the amount by which -Xmn was increased.
Figure 7. Impact on Old Generation of Increasing Young Generation
Step C – Survivor Spaces Tuning
Step C from Figure 2 attempts to refine the Survivor Spaces. The assumption thus far in the discussion is that SurvivorRatio is 8, which is one of the best starting point choices. When SurvivorRatio is 8, the Survivor Space sizes for S0 and S1 are 10% of -Xmn each. If at the end of the iterative Step A and Step B you are close to your response time objectives but would still like to refine things in either the young generation or the old generation, you can attempt to adjust the Survivor Space sizes.
Before tuning the SurvivorRatio, note that SurvivorSpaceSize = -Xmn / (SurvivorRatio + 2).
Refinement impact of Survivor Space sizing on young generation: If you still have a problem with the duration of Minor GC in the young generation, but you have exhausted the sizing adjustments of -Xmn and -XX:ParallelGCThreads, then you can potentially increase the SurvivorSpaceSize by decreasing the SurvivorRatio. You can try setting the SurvivorRatio to 6 instead of 8, which implies that the survivor spaces will be 12.5% of -Xmn for each of S0 and S1. The resulting increase in SurvivorSpaceSize causes Eden Space to be proportionally reduced, mitigating the long duration/pause problem. Conversely, if Minor GC is too frequent, then you can choose to increase the Eden Space size by increasing SurvivorRatio to no more than 15, which reduces the SurvivorSpaceSize.
Refinement impact of Survivor Space sizing on old generation: If, after iteratively following Step A and Step B, you still would like to refine the old generation, but have exhausted all of the preceding recommendations, you can adjust the SurvivorSpaceSize to delay the tenure or promotion of surviving objects from S0 and S1 to the old generation. A high frequency of Full GC in the old generation is an indication of excessive promotion from the young generation. If -Xmn was adjusted as in Step A and you need only a refinement, then you can increase the size of the Survivor Spaces to delay the promotion of surviving objects to the old generation. When you increase the Survivor Space size, you reduce the Eden Space, which causes more frequent Minor GC. Because each survivor space stays in the range of roughly 5–10% of -Xmn depending on the SurvivorRatio, this small adjustment should not cause a large spike in Minor GC frequency.
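To see how much room the SurvivorRatio range discussed in Step C actually gives you, the following sketch prints the share of -Xmn taken by each survivor space and by Eden for the ratios 6, 8, and 15. The class name SurvivorRatioTable is hypothetical, and the percentages follow only from the formula SurvivorSpaceSize = -Xmn / (SurvivorRatio + 2).

```java
// Illustrative arithmetic only (hypothetical class name).
final class SurvivorRatioTable {

    /** Share of -Xmn taken by each survivor space, in percent. */
    static double survivorPercentOfXmn(int survivorRatio) {
        return 100.0 / (survivorRatio + 2);
    }

    /** Share of -Xmn left for Eden, in percent. */
    static double edenPercentOfXmn(int survivorRatio) {
        return 100.0 * survivorRatio / (survivorRatio + 2);
    }

    public static void main(String[] args) {
        int[] ratios = {6, 8, 15};
        for (int ratio : ratios) {
            System.out.printf("SurvivorRatio=%d: each survivor space %.1f%%, Eden %.1f%% of -Xmn%n",
                    ratio, survivorPercentOfXmn(ratio), edenPercentOfXmn(ratio));
        }
    }
}
```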
JVM and GC Best Practices for vFabric SQLFire Members
The following best practices for JVMs are taken from the previous tuning discussions.
|BP 18 – JVM Version||Use JDK 1.6.0_29 or the latest JDK 1.6.0_XX. As of the date of this document, JDK 1.6.0_33 is also available and can be used.|
|BP 19 – Use Parallel and CMS GC Policy Combination||
|BP 20 – Set Initial Heap Equal to Maximum Heap||Set
|BP 21 – Disable Calls to System.gc()||Set
|BP 22 – New Generation Size||Set the
|BP 23 – Using 32-Bit Addressing in a 64-Bit JVM||When memory is constrained, set the
|BP 24 – Stack Size||In most cases the default
|BP 25 – Perm Size||It is a common best practice to set
Hope to see you at my talk APP-CAP1426 – The Benefits of Virtualization for Middleware in Barcelona.
Thank you for reading!