This is a continuation of the previous post titled (Part 2 of 3): Synopsis of articles & videos on Performance tuning, JVM, GC in Java, Mechanical Sympathy, et al.
In our first review, The Atlassian guide to GC tuning, an extensive post covering the methodology and things to keep in mind when tuning GC, practical examples are given and references to important resources are made along the way. In the next one, How NOT to measure latency, Gil Tene discusses some common pitfalls encountered in measuring and characterising latency, demonstrates some false assumptions and measurement techniques that lead to dramatically incorrect reporting of results, and covers simple ways to sanity check and correct these situations. Finally, Kirk Pepperdine, in his post Poorly chosen Java HotSpot Garbage Collection Flags and how to fix them!, throws light on JVM flags – he starts with some 700 flags and boils them down to merely 7. He also cautions you not to draw conclusions or take action on a whim, but to consult and examine – i.e. measure, don't guess!
Garbage Collection (GC) Tuning Guide by Atlassian
Background
By default, JVM tunings attempt to provide acceptable performance in the majority of cases, but they may not always give the desired results, depending on application behaviour. Benchmark your application before applying any GC or JVM related settings. There are a number of environment, OS and hardware related factors to take into consideration; often a JVM tuning is tied to these factors, and if they change, the tuning may need to be revisited. Accept the fact that any tuning has its upper limit in a given environment; if your goals are still not met, you need to improve the environment itself. Always monitor your application to see whether your tuning goals are still being met.
Choosing Performance Goals – GC tuning for the Oracle JVM targets three goals, i.e. latency (mean & maximum), throughput, and footprint. To reach the tuning goals, there are principles that guide GC tuning: minor GC reclamation, GC maximise memory, and two of three (pick two of these three goals, and sacrifice the other).
Preparing your environment for GC –
Some of the items of interest are: load the application with work (apply load to the application as it would be in production), turn on GC logging, set data sampling windows, determine the memory footprint, apply rules of thumb for generation sizes, and determine systemic requirements.
Iteratively change the JVM parameters, keeping the environment and its settings intact. An example of turning on GC logging is:
java -XX:+PrintGC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:<file> …
or
java -XX:+PrintGC -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:<file> …
Sampling windows are important to help diagnose an application's steady-state runtime specifically, rather than the load-time or the time the JVM takes to stabilise before it can execute anything. Set the heap sizes either by guessing (give it the maximum and then pull back) or by using results accumulated in the GC logs. Also check for OOMEs (OutOfMemoryErrors), which can be an indicator that the heap or other memory pools need increasing. There are a number of rules of thumb for generation sizes (a number of them depend on the sizes of other pools).
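For example (the heap values here are purely illustrative, and should be derived from your own GC logs and footprint measurements), fixing the initial and maximum heap sizes while logging GC activity might look like:
java -Xms2g -Xmx2g -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:gc.log MyApp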
Understanding the Throughput collector and Behaviour based Tuning
HotSpot uses behaviour-based tuning with the parallel collectors, including the parallel mark-and-sweep (old generation) collector; they are designed with three goals in mind, i.e. maximum pause time, throughput and footprint. The -XX:+PrintAdaptiveSizePolicy JVM flag prints details of this behaviour for analysis. Even though there are options to set the maximum pause time for any GC event, there is no guarantee it will be honoured. The collector also works on the best two of the three goals – i.e. only two goals are focussed on at a time; there is interaction between the goals, and sometimes there is oscillation between them.
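As a sketch (the pause-time and throughput goals below are illustrative, not recommendations), the throughput collector's goals and its adaptive sizing decisions can be observed with:
java -XX:+UseParallelGC -XX:MaxGCPauseMillis=200 -XX:GCTimeRatio=19 -XX:+PrintAdaptiveSizePolicy -verbose:gc MyApp
Here -XX:MaxGCPauseMillis sets the pause-time goal and -XX:GCTimeRatio the throughput goal (19 meaning no more than 1/(1+19) = 5% of total time in GC); as noted above, these are goals, not guarantees.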
Time to tune
There is a cycle to follow when doing this, i.e.
- Determine desired behaviour.
- Measure behaviour before change.
- Determine indicated change.
- Make change.
- Measure behaviour after change.
- If behaviour not met, try again.
Repeat the above until the desired results are achieved (or nearly so).
Tools
Java VisualVM with the Visual GC plugin for live telemetry (note there's an observer effect when using VisualVM, which can have an impact on your readings). GCViewer by tagtraum industries is useful for post-hoc review of GC performance based on the GC log.
Conclusions: a lot of factors need to be taken into consideration when tuning GC. A meticulous plan is needed; keep an eye on your goal throughout the process, and once it is achieved, continue to monitor your application to see whether your performance goals are still intact. Adjust your requirements after studying the results of your actions, so that the two meet at some point.
— The authors cover the methodology in good detail and give practical advice and steps to follow in order to tune your application – the hands-on and practical aspects of the blog should definitely be read. —
How NOT to measure Latency by Gil Tene
The author starts by saying he will cover “some of the mistakes others, including the author, have made” – plenty of wrong ways to do things and what to avoid doing.
WHY and HOW you measure latency – the HOW only makes sense if you understand the WHY. There are public domain code / tools available that can be used to measure latency.
Don’t use statistics to measure latency!
The classic way of looking at latency: response time (latency) is a function of load! It's important to know which response time you are talking about – the average, the worst case, the 99th percentile, etc. – as these capture different types of behaviour.
Hiccups – some sort of accumulated work that gets performed in bursts (spikes). It is important to look at these behaviours when looking at response time and average response time. They are not evenly spread around; they may look like periodic freezes, or shifts from one mode/behaviour to another (with no fixed sequence).
Common fallacies:
– computers run application code continuously
– response time can be measured as work units over time
– response time exhibits a normal distribution
– “glitches” or “semi-random omissions” in measurement don’t have a big effect
Hiccups are hidden, knowingly or unknowingly – averages and standard deviations conveniently help hide them!
Many compute the 99th percentile or another figure from averages and assumed distributions. A better way is to measure it using the actual data – e.g. plot a histogram. You need to have an SLA for the entire spread, and plot the distribution across the 0 to 100th percentile.
Latency tells us how long something should take! True real-time is being slow and steady!
The author gives examples of types of applications with different latency needs, e.g. the Olympics, pacemakers, low latency trading, and interactive applications.
A very important way of establishing requirements is an extensive interview process, to help the client come up with realistic figures for the SLA requirements. We need to find out what is the worst the client is willing to accept, and for how long or how often that is acceptable, to help construct a percentile table.
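For instance, a (purely hypothetical) percentile table emerging from such an interview might read: 90% of responses within 20 ms, 99% within 50 ms, 99.9% within 500 ms, and nothing ever worse than 2 seconds.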
The question is how fast the system can run smoothly, without hitting obstacles or crashing! The author further compares the performance of the HotSpot JVM and C4.
The coordinated omission problem – what is it? Data that is omitted not in a random way but in a coordinated way, which skews the actual results. Delays in request times do not get measured due to the manner in which response time is recorded from within the system – the omissions are coordinated rather than random. The maximum value (time) in the data is usually the key value to start with, to compute the other percentiles and check whether they match those computed with coordinated omission in them. Percentiles are useful – compute them.
Graphs with sudden bumps followed by smooth lines tend to have coordinated omission in them.
HdrHistogram helps plot histograms – a configurable precision system, where you tell it the range and precision to cover. It has built-in compensation for coordinated omission – it's open source and available on GitHub under the CC0 license.
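A minimal sketch of recording latencies with HdrHistogram (assuming the HdrHistogram jar is on the classpath; the range, precision, sample latency and expected-interval values are purely illustrative):
import org.HdrHistogram.Histogram;

public class LatencyRecorder {
    public static void main(String[] args) {
        // Track latencies from 1 ns up to 1 hour, with 3 significant decimal digits.
        Histogram histogram = new Histogram(3_600_000_000_000L, 3);

        // A hypothetical measured latency of 25 ms, in nanoseconds.
        long latencyNanos = 25_000_000L;

        // Recording with the expected interval between samples (10 ms here) lets
        // HdrHistogram back-fill the values a stalled measurement loop would have
        // missed, compensating for coordinated omission.
        histogram.recordValueWithExpectedInterval(latencyNanos, 10_000_000L);

        // Print the full 0-100 percentile distribution, scaled to milliseconds.
        histogram.outputPercentileDistribution(System.out, 1_000_000.0);
    }
}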
jHiccup also plots histograms to show how your computer system has been behaving (run-time and freeze time), and records various percentiles, maximum and minimum values, distributions, good modes, bad modes, etc. Also open source under the CC0 license.
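One common way to run it (the jar path is illustrative) is to attach it as a Java agent alongside your application, after which it writes hiccup logs that can be turned into the run-time and freeze-time charts described above:
java -javaagent:jHiccup.jar MyApp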
Measuring throughput without a latency requirement yields meaningless information. Mistakes in measurement/analysis can lead to orders-of-magnitude errors and bad business decisions.
— The author has shown some cool graphs and illustrations of how we measure things wrong and what to look out for when assessing data or even graphs. He gives a lot of tips on how NOT to measure latency. —
Poorly chosen Java HotSpot Garbage Collection Flags and how to fix them! by Kirk Pepperdine
There are more than 700 product flags defined in the HotSpot JVM, and few people have a good idea of what effect any one of them might have at runtime, let alone the combination of two or more together.
Whenever you think you know a flag, it often pays to check the source code underlying that flag to get a more accurate idea.
Identifying redundant flags
The author enlists a huge list of flags for us to determine whether we need them or not. To see the list of flags available with your version of Java, try this command, which prints a long list of options:
$ java -XX:+PrintFlagsFinal -version
You can narrow the list down by eliminating deprecated flags and ones still at their default value, leaving a much shorter list.
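In the PrintFlagsFinal output, flags whose value differs from the default are marked with ':=', so a quick (Unix-flavoured) way to spot what has actually been changed by your settings is:
$ java <your flags here> -XX:+PrintFlagsFinal -version | grep ':='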
DisableExplicitGC, ExplicitGCInvokesConcurrent, and RMI Properties
Calling System.gc() isn't a great idea, and one way to disable that functionality for code that uses it is to add the -XX:+DisableExplicitGC flag to your VM options. Alternatively, use the -XX:+ExplicitGCInvokesConcurrent flag to have System.gc() invoke a Concurrent Mark and Sweep cycle instead of a Full GC cycle. You can also set the interval between the complete GC cycles that RMI triggers using the sun.rmi.dgc.client.gcInterval and sun.rmi.dgc.server.gcInterval properties.
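Put together (the one-hour interval is illustrative; the values are in milliseconds), that might look like:
java -XX:+ExplicitGCInvokesConcurrent -Dsun.rmi.dgc.client.gcInterval=3600000 -Dsun.rmi.dgc.server.gcInterval=3600000 MyApp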
Xmx, Xms, NewRatio, SurvivorRatio and TargetSurvivorRatio
The -Xmx flag sets the maximum heap size, and all JVMs are meant to support this flag. Setting the minimum (-Xms) and maximum heap sizes to the same value prevents the JVM from resizing the heap at runtime, which disables adaptive sizing.
Use GC logs to help make memory pool size decisions. There is a ratio to maintain between Young Gen and Old Gen, and within Young Gen between Eden and the Survivor spaces. The SurvivorRatio flag is one such flag: it decides how much space will be used by the survivor spaces in Young Gen, with the rest left for Eden. These ratios are key in determining the frequency and efficiency of collections in Young Gen. TargetSurvivorRatio is another ratio to examine and cross-check before using.
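As an illustration only (the right values must come from your own GC logs), the ratios discussed above are set like this:
java -Xms2g -Xmx2g -XX:NewRatio=2 -XX:SurvivorRatio=4 -XX:TargetSurvivorRatio=50 MyApp
Here NewRatio=2 makes Old Gen twice the size of Young Gen, SurvivorRatio=4 makes Eden four times the size of each survivor space, and TargetSurvivorRatio=50 targets 50% survivor-space occupancy after a collection.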
PermSize, MaxPermSize
These options were removed starting with JDK 8 in favour of Metaspace, but in previous versions they controlled PermGen's sizing. Restricting PermGen's adaptive resizing might lead to longer pauses for the old generational spaces.
UseConcMarkSweepGC, CMSParallelRemarkEnabled, UseCMSInitiatingOccupancyOnly, CMSInitiatingOccupancyFraction
UseConcMarkSweepGC makes the JVM select the CMS collector; CMS works with ParNew instead of the PSYoung and PSOld collectors for the respective generational spaces.
CMSInitiatingOccupancyFraction (IOF) is a flag that helps the JVM determine how to maintain the occupancy of data in the Old Gen space (by examining the current rate of promotion); getting this wrong can lead to stop-the-world Full GCs or CMF (concurrent mode failure).
UseCMSInitiatingOccupancyOnly tells the JVM to use the IOF without taking the rate of promotion into account.
CMSParallelRemarkEnabled helps parallelise the fourth phase of CMS, the remark phase (single threaded by default), but use this option only after benchmarking your current production systems.
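Combining the flags from this section (the occupancy fraction of 70 is illustrative, not a recommendation), a CMS configuration might look like:
java -XX:+UseConcMarkSweepGC -XX:+UseCMSInitiatingOccupancyOnly -XX:CMSInitiatingOccupancyFraction=70 -XX:+CMSParallelRemarkEnabled MyApp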
CMSClassUnloadingEnabled
CMSClassUnloadingEnabled ensures that classes loaded in the PermGen space are regularly unloaded, reducing out-of-PermGen-space situations. This could, however, lead to longer concurrent and remark (pause) collection times.
AggressiveOpts
AggressiveOpts, as the name suggests, helps improve performance by switching certain flags on and off, but this strategy hasn't changed across various builds of Java, and one must examine benchmarks of their system before using this flag in production.
Next Steps
Of the numerous flags examined, the ones below are the more interesting flags to look at:
-ea
-mx4G
-XX:MaxPermSize=300M
-XX:+UseConcMarkSweepGC
-XX:+CMSClassUnloadingEnabled
-XX:SurvivorRatio=4
-XX:+DisableExplicitGC
Again, examine the GC logs to see whether any of the above should be used, and if so, what values are appropriate to set. Quite often the JVM gets it right, right out of the box, so getting it wrong can be detrimental to your application's performance. It's easier to mess up than to get it right when you have so many flags with dependent and overlapping functionality.
Conclusion: Refer to benchmarks and GC logs when deciding on enabling or disabling flags and on the values to set when enabled. There are lots of flags, and it is easy to get them wrong; only use them if they are absolutely needed. Remember the HotSpot JVM has built-in machinery to adapt (adaptive policy), and consult an expert if you still want to proceed.
— The author has done justice by giving us a gist of the large number of JVM flags; I recommend reading the article to learn about the specifics of certain flags. —
As it is not practical to review all such videos and articles, a number of them have been provided in the links below for further study. In many cases I have paraphrased or directly quoted what the authors have to say to preserve the message and meaning they wished to convey.
A few other authors have written articles related to these subjects for the Java Advent Calendar; see below:
Using Intel Performance Counters To Tune Garbage Collection
How NOT to measure latency
Feel free to post your comments below or tweet at @theNeomatrix369!
Useful resources
- Are your GC logs speaking to you, the G1GC edition by Kirk Pepperdine – Slides – Video
- Performance Special Interest Group discussion – moderated by Richard Warburton (video)
- Caching in: understand, measure and use your CPU Cache more effectively by @RichardWarburto – (video & slides)
- Article on Atomic I/O operations (Linux) by Jonathan Corbet
- Articles and Presentations about Azul Zing, Low Latency GC & OpenJDK by Gil Tene (videos & slides)
- Lock-Free Algorithms For Ultimate Performance by Martin Thompson
- Performance Java User’s Group – “For expert Java developers who want to push their systems to the next level”
- Tuning the Size of your thread pool by Kirk Pepperdine
- How NOT to measure Latency by Gil Tene
- Understanding Java Garbage Collection and What You Can Do about It by Gil Tene
- Vanilla #Java Understanding how Core Java really works can help you write simpler, faster applications by Peter Lawrey
- Profiling Java In Production – by Kaushik Srenevasan
- HotSpot JVM garbage collection options cheat sheet (v3) by Alexey Ragozin
- Optimizing Google’s Warehouse Scale Computers: The NUMA Experience – authors from Univ. of Cal (SD) & Google!
- MegaPipe: A New Programming Interface for Scalable Network I/O by several authors!
- What Every Programmer Should Know About Memory by Ulrich Drepper
- Memory Barriers: a Hardware View for Software Hackers – Paul E. McKenney (Linux Technology Center – IBM Beaverton)
This post is part of the Java Advent Calendar and is licensed under the Creative Commons 3.0 Attribution license. If you like it, please spread the word by sharing, tweeting, FB, G+ and so on!