How (NOT TO) measure latency

Latency is defined as

time interval between the stimulation and response

and is a value which is of importance in many computer systems (financial systems, games, websites, etc.). Hence we – as computer engineers – want to specify upper bounds / worst-case scenarios for the systems we build. How can we do this? The days of counting cycles for assembly instructions are long gone (unless you work on embedded systems) – there are just too many additional factors to consider (the operating system – mainly the task scheduler – other running processes, the JIT, the GC, etc.). The remaining alternative is empirical (hands-on) testing.

Use percentiles

So we whip out JMeter, configure a load test, take the mean (average) value ± 3 × standard deviation and proudly declare that 99.73% of the users will experience latency which is in this interval. We are especially proud because (a) we considered a realistic set of calls (URLs if we are testing a website) and (b) we allowed for JIT warmup.

But we are still very wrong! (which can be sad if our company writes SLAs based on our numbers – we can bankrupt the company single-handedly!)

Let's see where the problem is and how we can fix it before we cause damage. Consider the dataset depicted below (you can get the actual values here to do your own calculations).

For simplicity there are exactly 100 values in this example. Let's say that they represent the latency of fetching a particular URL. You can immediately tell that the values can be grouped into three distinct categories: very small (perhaps the data was already in the cache?), medium (this is what most users will see) and poor (probably there are some corner cases). This is typical for systems of medium-to-large complexity (i.e. “real life” systems) composed of many moving parts, and is called a multimodal distribution. More on this shortly.

If we quickly drop these values into LibreOffice Calc and do the number crunching, we’ll come to the conclusion that the average (mean) of the values is 40 and, according to the three-sigma rule, 99.73% of the users should experience latencies of less than 137. If you look at the chart carefully you’ll see that the average (marked in red) is slightly left of the middle. You can also do a simple calculation (because there are exactly 100 values represented) and see that the 99th percentile is actually 148, not 137. Now this might not seem like a big difference, but it can be the difference between profit and bankruptcy (if you’ve written an SLA based on this value, for example).
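
To make the comparison concrete, below is a minimal sketch of both calculations. The loadSamples() method and its placeholder values are stand-ins for the actual dataset linked above:

import java.util.Arrays;

public class LatencyStats {

    public static void main(String[] args) {
        double[] latencies = loadSamples();

        double mean = Arrays.stream(latencies).average().orElse(0);
        double variance = Arrays.stream(latencies)
                .map(v -> (v - mean) * (v - mean))
                .average().orElse(0);
        double stdDev = Math.sqrt(variance);

        // The three-sigma estimate: only meaningful for a normal distribution
        System.out.println("mean + 3 * stddev = " + (mean + 3 * stdDev));

        // The distribution-independent way: sort and read off the percentile
        double[] sorted = latencies.clone();
        Arrays.sort(sorted);
        int idx = (int) Math.ceil(0.99 * sorted.length) - 1;
        System.out.println("99th percentile   = " + sorted[idx]);
    }

    // Stand-in for the dataset linked above; replace with the real 100 values
    private static double[] loadSamples() {
        return new double[] { 2, 3, 3, 35, 38, 40, 42, 45, 120, 148 };
    }
}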

Where did we go wrong? Let’s look again carefully at the three sigma rule (emphasis added):

nearly all values lie within three standard deviations of the mean in a normal distribution.

Our problem is that we don’t have a normal distribution. We probably have a multimodal distribution (as mentioned earlier), but to be safe we should use ways of interpreting the results which are independent of the nature of the distribution.

From this example we can derive a couple of recommendations:

  1. Make sure that your test framework / load generator / benchmark isn’t the bottleneck – run it against a “null endpoint” (one which doesn’t do anything) and ensure that you can get an order of magnitude better numbers
  2. Take into account things like JITing (warmup periods) and GC if you’re testing a JVM based system (or other systems which are based on the same principles – .NET, luajit, etc).
  3. Use percentiles. Saying things like “the median (50th percentile) response time of our system is…”, “the 99.99th percentile latency is…”, “the maximum (100th percentile) latency is…” is OK
  4. Don’t calculate the average (mean). Don’t use standard deviation. In fact if you see that value in a test report you can assume that the people who put together the report (a) don’t know what they’re talking about or (b) are intentionally trying to mislead you (I would bet on the first, but that’s just my optimism speaking).

Look out for coordinated omission

Coordinated omission (a phrase coined by Gil Tene of Azul fame) is a problem which can occur if the test loop looks something like this:


start:
    t = time()
    do_request()                  # while the server is stalled we sit here...
    record_time(time() - t)       # ...so only this one slow request gets recorded
    wait_until_next_second()      # requests we should have sent meanwhile are never issued
    jump start

That is, we’re trying to do one request every second (perhaps every 100ms would be more realistic, but the point stands). Many test systems (including JMeter and YCSB) have inner loops like this.

We run the test and (learning from the previous discussion) report: 85% of the requests will be served in under 0.5 seconds if there is 1 request per second. And we can still be wrong! Let us look at the diagram below to see why:

On the first line we have our test run (the horizontal axis being time). Let's say that between seconds 3 and 6 the system (and hence all requests to it) is blocked (maybe we have a long GC pause). If you calculate the 85th percentile, you'll get 0.5 (hence the claim in the previous paragraph). However, you can see 10 independent clients below, each doing their request in a different second (so our criterion of one request per second is fulfilled). But if we crunch the numbers, we'll see that the actual 85th percentile in this case is 1.5 (three times worse than the original calculation).

Where did we go wrong? The problem is that the test loop and the system under test worked together (“coordinated” – hence the name) to hide (omit) the additional requests which happen during the time the server is blocked. This leads to underestimating the delays (as shown in the example).
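
One way to avoid the coordination – a sketch of the idea rather than something the tools above do for you – is to measure each request against its intended start time, so that a server stall is charged to every request that should have run during it. doRequest() and recordTime() below are illustrative placeholders:

import java.util.concurrent.locks.LockSupport;

public class UncoordinatedLoadLoop {

    static final long INTERVAL_NANOS = 1_000_000_000L; // one request per second

    public static void main(String[] args) {
        long intendedStart = System.nanoTime();
        while (true) {
            doRequest();
            // Charge the time since the request *should* have started, so a
            // server stall counts against every request scheduled during it
            recordTime(System.nanoTime() - intendedStart);

            intendedStart += INTERVAL_NANOS;   // next scheduled slot
            long sleep = intendedStart - System.nanoTime();
            if (sleep > 0) {
                LockSupport.parkNanos(sleep);
            }
            // if sleep <= 0 we are behind schedule: fire the next request now
        }
    }

    private static void doRequest()            { /* call the system under test */ }

    private static void recordTime(long nanos) { /* feed a histogram */ }
}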

  1. Make sure every request takes less than the sampling interval, or use a better benchmarking tool (I don’t know of any which correct for this), or post-process the data with Gil’s HdrHistogram library, which contains built-in facilities to account for coordinated omission (see the sketch below)
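
For the post-processing route, here is a minimal sketch using HdrHistogram's recordValueWithExpectedInterval(), which back-fills the samples a blocked client never issued; the latency values and the one-second interval below are illustrative:

import org.HdrHistogram.Histogram;

public class CorrectedHistogram {

    public static void main(String[] args) {
        // Track latencies from 1 ns up to 1 minute with 3 significant digits
        Histogram histogram = new Histogram(60_000_000_000L, 3);

        long expectedIntervalNanos = 1_000_000_000L; // one request per second

        // For each measured latency, HdrHistogram also back-fills the samples
        // a blocked client never issued: a 3 s stall at a 1 s interval is
        // recorded as 3 s, 2 s and 1 s, not just as a single 3 s sample
        for (long latencyNanos : new long[] { 500_000_000L, 3_000_000_000L }) {
            histogram.recordValueWithExpectedInterval(latencyNanos, expectedIntervalNanos);
        }

        System.out.println("85th percentile: "
                + histogram.getValueAtPercentile(85.0) + " ns");
    }
}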


2013 – The Java/JVM Year in Review and a Look to the Future!

Hi all! It’s been an amazing year for Java/JVM developers!

For me it’s been a year of fine-tuning the Adopt OpenJDK and Adopt a JSR programmes with the LJC and working hard at jClarity – the Java/JVM performance analysis company that I’ve now been demoted to CEO of ;-).

In terms of the overall ecosystem there have been some great steps forward:

  1. The language and platform continues to evolve (e.g. Java 8 is looking in great shape!).
  2. The standards body continues to become more inclusive and open (e.g. More Java User Groups have joined the Executive Committee of the JCP).
  3. The community continues to thrive (record numbers of conferences, Java User Groups and GitHub commits).

As is traditional, I’ll try and take you through some of the trends of the year that was, and take a small peek into the future as well.

2013 Trends

There were some major trends in the Java/JVM and software development arena in 2013. Mobile apps have become the norm, virtualised infrastructure is everywhere and of course Java is still growing faster (in absolute terms) than any other language out there except for JavaScript. Here are some other trends worth noting.

Functional Programming

Mainstream Java developers have adopted Scala, Groovy, Clojure and other JVM languages to write safer, more concise code for callback handlers, filters, reduces and a host of other operations on collections and event handlers. A massive wave of developers will of course embrace Java 8 and lambdas when it comes out in early 2014.
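
As a small taste of that wave, here is an illustrative example of the collection operations mentioned above, written with Java 8 streams and lambdas:

import java.util.Arrays;
import java.util.List;

public class LambdaTaster {

    public static void main(String[] args) {
        List<String> names = Arrays.asList("Ada", "Linus", "Grace", "James");

        // Filter + map + reduce in a few declarative lines, where Java 7
        // would have needed loops or anonymous inner classes
        int totalLength = names.stream()
                .filter(name -> name.length() > 3)
                .mapToInt(String::length)
                .sum();

        System.out.println("Total length of the longer names: " + totalLength);
    }
}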

Java moving beyond traditional web apps

In 2013 Java firmly entrenched itself in areas of software development outside of the traditional 3-tier web application. It is used in the realm of NoSQL (e.g. Hadoop), high-performance messaging systems (e.g. LMAX Disruptor), polyglot actor-like application frameworks (e.g. Vert.x) and much, much more.

HTML 5

2013 started to see the slow decline of Java/JVM server-side templating systems, as the rise of JavaScript libraries and micro-frameworks (e.g. AngularJS, Twitter Bootstrap) means that developers can finally have a client-side templating framework that doesn’t suck and has real data binding to JSON/XML messages going to and from the browser. WebSockets and asynchronous messaging are also nicely taken care of by the likes of SockJS.

2014+ – The Future

Java 8 – Lambdas

More than enough has been written about these, so I’ll offer up a practical tutorial and the pragmatic Java 8 Lambdas Book for 2014 that you should be buying!

Java 9/10 – VM improvements

Java 9 and 10 are unlikely to bring many changes to the language itself, apart from applying Java 7 and 8 features to the internal APIs, making the overall Java API a much more pleasant experience to use.

So what is coming? Well, the first likely major feature will be some sort of packed object or value type representation. This is where you will be able to define a Java ‘object’ as being purely a data type. What do I mean by this? Well, it means that the ‘object’ will be able to be laid out in memory very efficiently and not have the overhead of hashCode, equals or other regalia that comes with being a Java Object. Think of it a little bit like a C/C++ struct. Here’s an example of what it might look like in source code:


@Packed
final class PackedPoint extends PackedObject {
    int x, y;
}

And the native representation:


// Native representation
struct Point {
    int x, y;
};

There are loads of performance advantages to this, especially for immutable, well-defined data structures such as points on a graph or within a 3D model. The JVM will be able to use far more efficient housekeeping mechanisms with respect to memory layout, lookups, JIT and garbage collection.

Some sort of reified generics will also come into the picture, but it’s more likely to be an extension of the work done on the ‘packed object’ idea, removing the unnecessary overhead of maintaining and manipulating collections that use the Object representation of primitives (e.g. a List of boxed Integers).

Java 9/10 – ME/SE convergence

It was very clear at JavaOne this year that Oracle is going to try and make Java a player in the IoT space. The opportunities are endless, but the JVM has some way to go to find the right balance between being able to work on really small devices (which is where Java ME and CLDC sit today) and providing rich language features and a full stack for Java developers (which is where Java SE and EE sit today).

The merger between ME and SE begins in earnest with Java 8 and will hopefully be completed by Java 9. I encourage all of you to try out early versions of this on your Raspberry Pi!

But wait there’s more!

The way Java itself is built and maintained is changing, with a move towards more open development at OpenJDK. Features such as tail-call recursion and co-routines are being discussed, and there is a port on the way to have Java work natively on graphics cards.

There’s plenty to challenge yourself with and also a lot of fun to have in 2014! Me? I’m going to try and get Java running on my Pi and maybe build that AngularJS front end for controlling the hardware in my house…

Happy Holidays everyone!

Martijn (@karianna) Java Champion, Speaker, Author, Cat Herder etc

Meta: this post is part of the Java Advent Calendar and is licensed under the Creative Commons 3.0 Attribution license. If you like it, please spread the word by sharing, tweeting, FB, G+ and so on!