Problem Statement
Even the most basic problems become interesting when you dig deep enough. Today, I would like to do a deep dive into measuring time in Java. We will start with the very basics of the Java API and gradually move lower in our stack: through the OpenJDK source code, glibc, all the way down to the Linux kernel. We will look at the performance overhead in various environments and try to reason about the results.
We are going to explore the measurement of elapsed time: the amount of time that passes between the start event and the end event of some activity. This is useful for performance tuning, operational monitoring and timeout enforcement.
The following pseudo-code is the common idiom we can see in pretty much any codebase:
```
START_TIME = getCurrentTime()
executeAction()
ELAPSED_TIME = getCurrentTime() - START_TIME
```
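In Java, this idiom most often shows up with `System.currentTimeMillis()`. Here is a minimal, self-contained sketch – the `Thread.sleep()` call is just a stand-in for `executeAction()`, and as we will see below, `nanoTime()` is usually the better choice:

```java
public class ElapsedTimeExample {
    public static void main(String[] args) throws InterruptedException {
        long startTime = System.currentTimeMillis();   // START_TIME
        Thread.sleep(100);                             // executeAction()
        long elapsedMillis = System.currentTimeMillis() - startTime;
        System.out.println("elapsed: " + elapsedMillis + " ms");
    }
}
```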
Sometimes it’s less explicit. We could use Aspect-Oriented Programming principles to avoid polluting our business code with what’s essentially an operational concern, but it’s still there in one form or another.
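For illustration only – not any particular AOP framework – the measurement can be factored out into a tiny helper so the business code never mentions clocks. The `Timed` class and the reporting callback below are hypothetical:

```java
import java.util.function.LongConsumer;

public final class Timed {
    // Runs the action and hands the elapsed nanoseconds to a reporting callback,
    // keeping the timing concern out of the business code itself.
    public static void run(Runnable action, LongConsumer reportElapsedNanos) {
        long start = System.nanoTime();
        try {
            action.run();
        } finally {
            reportElapsedNanos.accept(System.nanoTime() - start);
        }
    }
}
```

A caller would then write something like `Timed.run(this::executeAction, nanos -> System.out.println(nanos + " ns"))`, so the timing logic lives in one place.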
Elapsed Time in Java
Java offers two basic primitives for measuring time: `System.currentTimeMillis()` and `System.nanoTime()`. There are several differences between these two calls, let’s break them down.
1. Stability of the Point of Origin
`System.currentTimeMillis()` returns the number of milliseconds since the start of the Unix epoch – January 1, 1970 UTC. On the other hand, `System.nanoTime()` returns the number of nanoseconds since some arbitrary point in the past.
This immediately tells us the finest possible granularity of `currentTimeMillis()` is 1 millisecond. It makes measuring anything shorter than 1 ms impossible. The fact that `currentTimeMillis()` uses January 1, 1970 UTC as a reference point is both good and bad.
Why is it good? We can compare `currentTimeMillis()` values returned by 2 different JVMs and even by 2 different computers.
Why is it bad? The comparison won’t be very useful when our computers don’t have synchronized time. The clocks in typical server farms aren’t perfectly synchronized and there will always be some gap. This can still be acceptable if I’m comparing log files from 2 different systems: it’s OK if timestamps aren’t perfectly synchronized. However, sometimes the gap can lead to disastrous results, for example when it’s used for conflict resolution in distributed systems.
2. Clock Monotonicity
The other problem is that the returned values are not guaranteed to be monotonically increasing. What does that mean? When you make 2 consecutive calls to `currentTimeMillis()`, the 2nd call can return a lower value than the first one. This is counterintuitive and can lead to nonsensical results such as elapsed time being a negative number. It’s clear that `currentTimeMillis()` is not a good choice for measuring elapsed time inside an application. What about `nanoTime()`?
`System.nanoTime()` does not use the Unix epoch as a reference point, but some unspecified point in the past. The point stays fixed during a single JVM execution, but that’s it. Thus it’s meaningless to even compare `nanoTime()` values returned by two different JVMs running on the same computer, let alone on separate computers. The reference point is usually related to the last computer start, but this is purely an implementation detail and we cannot rely on it at all. The advantage is that even when the wall-clock time of a computer goes backwards for some reason, it won’t have any impact on `nanoTime()`. This is why `nanoTime()` is a great tool for measuring the elapsed time between 2 events on a single JVM, but we cannot compare timestamps from 2 different JVMs.
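Here is a hedged sketch of using `nanoTime()` this way – the remote-call stand-in and the 500 ms threshold are made up. Note the Javadoc’s advice to compare two values by subtraction (`t1 - t0 < 0`) rather than directly (`t1 < t0`), because the absolute values are meaningless and may overflow:

```java
import java.util.concurrent.TimeUnit;

public class TimeoutCheck {
    static void callRemoteServer() throws InterruptedException {
        Thread.sleep(50);   // hypothetical expensive operation
    }

    public static void main(String[] args) throws InterruptedException {
        long t0 = System.nanoTime();
        callRemoteServer();
        long t1 = System.nanoTime();

        // Always work with the difference (t1 - t0); the individual values
        // have no meaning outside this JVM and may even be negative.
        long elapsedNanos = t1 - t0;
        if (elapsedNanos > TimeUnit.MILLISECONDS.toNanos(500)) {
            System.out.println("slow call: " + elapsedNanos + " ns");
        }
    }
}
```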
Implementation in Java
Let’s explore how `currentTimeMillis()` and `nanoTime()` are implemented in Java. I am going to use sources from the current head of OpenJDK 14. `System.currentTimeMillis()` is a native method, so our Java IDE won’t tell us how it’s implemented.
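All that System.java shows us is the bare declaration (plus a HotSpot intrinsic-candidate annotation):

```java
public static native long currentTimeMillis();
```

The native code looks a bit better: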
```cpp
JVM_LEAF(jlong, JVM_CurrentTimeMillis(JNIEnv *env, jclass ignored))
  JVMWrapper("JVM_CurrentTimeMillis");
  return os::javaTimeMillis();
JVM_END
```
We can see it just delegates to `os::javaTimeMillis()`, as the implementation varies by operating system. This is the Linux implementation:
```cpp
jlong os::javaTimeMillis() {
  timeval time;
  int status = gettimeofday(&time, NULL);
  assert(status != -1, "linux error");
  return jlong(time.tv_sec) * 1000 + jlong(time.tv_usec / 1000);
}
```
This code delegates to the POSIX function `gettimeofday()`. This function returns a simple struct:
```c
struct timeval {
    time_t      tv_sec;     /* seconds */
    suseconds_t tv_usec;    /* microseconds */
};
```
The struct contains the number of seconds since the epoch and the number of microseconds within a given second. The contract of `currentTimeMillis()` is to return the number of milliseconds since the epoch, so it has to do a simple conversion: `jlong(time.tv_sec) * 1000 + jlong(time.tv_usec / 1000)`.
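The same arithmetic, restated as a tiny Java sketch just to make the units explicit (the names are illustrative, not taken from the JDK sources):

```java
public class TimevalToMillis {
    // Mirrors jlong(time.tv_sec) * 1000 + jlong(time.tv_usec / 1000)
    static long toMillis(long tvSec, long tvUsec) {
        return tvSec * 1000 + tvUsec / 1000;   // sub-millisecond part of tv_usec is dropped
    }

    public static void main(String[] args) {
        // 1_575_000_000 s + 123_456 us since the epoch -> 1_575_000_000_123 ms
        System.out.println(toMillis(1_575_000_000L, 123_456L));
    }
}
```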
The function `gettimeofday()` is implemented by glibc, which ultimately calls the Linux kernel. We will take a deeper look at this later.
Let’s see how `nanoTime()` is implemented. It turns out it’s not too different – `System.nanoTime()` is also a native method: `public static native long nanoTime();` and `jvm.cpp` delegates to the OS-specific implementation:
```cpp
JVM_LEAF(jlong, JVM_NanoTime(JNIEnv *env, jclass ignored))
  JVMWrapper("JVM_NanoTime");
  return os::javaTimeNanos();
JVM_END
```
The Linux implementation of `javaTimeNanos()` is quite interesting:
```cpp
jlong os::javaTimeNanos() {
  if (os::supports_monotonic_clock()) {
    struct timespec tp;
    int status = os::Posix::clock_gettime(CLOCK_MONOTONIC, &tp);
    assert(status == 0, "gettime error");
    jlong result = jlong(tp.tv_sec) * (1000 * 1000 * 1000) + jlong(tp.tv_nsec);
    return result;
  } else {
    timeval time;
    int status = gettimeofday(&time, NULL);
    assert(status != -1, "linux error");
    jlong usecs = jlong(time.tv_sec) * (1000 * 1000) + jlong(time.tv_usec);
    return 1000 * usecs;
  }
}
```
There are 2 branches: if the OS supports the monotonic clock, it will use it, otherwise it delegates to our old friend `gettimeofday()`. `gettimeofday()` is the same POSIX call `System.currentTimeMillis()` used! Obviously the conversion looks a bit different as the granularity of `nanoTime()` is higher, but it’s the same POSIX call! This implies that under some circumstances `System.nanoTime()` uses the Unix epoch as a reference, thus it can go back in time! In other words: it’s not guaranteed to be monotonic!
The good news is that, to my knowledge, all modern Linux distributions do support the monotonic clock. I assume this branch is there for the sake of compatibility with ancient versions of the kernel/glibc. If you are interested in the details of how HotSpot detects whether the OS supports the monotonic clock, see this code. For most of us it’s important to know that OpenJDK practically always calls the POSIX function `clock_gettime()`, which is implemented in glibc, and glibc delegates to the Linux kernel.
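If you want to convince yourself that `nanoTime()` never goes backwards on your own setup, a quick (and entirely unscientific) sanity check might look like this – the iteration count is arbitrary:

```java
public class MonotonicityCheck {
    public static void main(String[] args) {
        long previous = System.nanoTime();
        for (long i = 0; i < 100_000_000L; i++) {
            long current = System.nanoTime();
            if (current - previous < 0) {   // subtraction-based comparison, as the Javadoc recommends
                System.out.println("nanoTime() went backwards by " + (previous - current) + " ns");
            }
            previous = current;
        }
        System.out.println("done");
    }
}
```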
Benchmark I – Local Laptop
At this point we have some intuition about how `nanoTime()` and `currentTimeMillis()` are implemented. Let’s have a look at whether they are blazing fast or dog slow. Here’s a simple JMH benchmark:
```java
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public class Bench {

  @Benchmark
  public long nano() {
    return System.nanoTime();
  }

  @Benchmark
  public long millis() {
    return System.currentTimeMillis();
  }
}
```
When I run this benchmark on my laptop with Ubuntu 19.10, I get these results:
| Benchmark    | Mode | Cnt | Score  | Error   | Units |
|--------------|------|-----|--------|---------|-------|
| Bench.millis | avgt | 25  | 29.625 | ± 2.172 | ns/op |
| Bench.nano   | avgt | 25  | 25.368 | ± 0.643 | ns/op |
`System.currentTimeMillis()` takes about 29 nanoseconds per call while `System.nanoTime()` takes about 25 nanoseconds. Not great, not terrible. It implies it’s probably unwise to use `System.nanoTime()` to measure anything taking less than a few dozen nanoseconds, as the overhead of our instrumentation would be higher than the measured interval. We should also avoid using `nanoTime()` in tight loops because the latency adds up quickly. On the other hand, it appears sensible to use `nanoTime()` to measure e.g. the response time from a remote server or the duration of an expensive calculation.
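One hedged way to keep timing out of a hot loop is to read the clock only once per batch of iterations rather than on every pass – the batch size and names below are made up for illustration:

```java
public class BatchedTiming {
    private static final int SAMPLE_EVERY = 1024;   // read the clock once per batch

    public static void main(String[] args) {
        long worstPerIteration = 0;
        long batchStart = System.nanoTime();
        for (int i = 1; i <= 1_000_000; i++) {
            doWork(i);
            if (i % SAMPLE_EVERY == 0) {
                long elapsed = System.nanoTime() - batchStart;
                // average cost per iteration over the last batch, amortizing the nanoTime() overhead
                worstPerIteration = Math.max(worstPerIteration, elapsed / SAMPLE_EVERY);
                batchStart = System.nanoTime();
            }
        }
        System.out.println("worst average cost per iteration: " + worstPerIteration + " ns");
    }

    private static void doWork(int i) {
        // stand-in for the real per-iteration work
    }
}
```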
Benchmark II – AWS
Running benchmarks on a laptop is convenient, but not very practical unless you are willing to give up your laptop and use it as a production environment for your applications. Instead let’s run the same benchmark in AWS EC2.
Let’s spin up a c5.xlarge machine with Ubuntu 16.04 LTS and use the wonderful SDKMAN tool to install Java 13 built by the awesome people on the AdoptOpenJDK project:
```
$ sudo apt install zip unzip
$ curl -s "https://get.sdkman.io" | bash
# Yes, I know a pipe from curl to shell is not cool. I'm lazy.
$ source "/home/ubuntu/.sdkman/bin/sdkman-init.sh"
$ sdk install java 13.0.1.hs-adpt
```
```
$ java -version
openjdk version "13.0.1" 2019-10-15
OpenJDK Runtime Environment AdoptOpenJDK (build 13.0.1+9)
OpenJDK 64-Bit Server VM AdoptOpenJDK (build 13.0.1+9, mixed mode, sharing)
```
Here are the results:
| Benchmark    | Mode | Cnt | Score  | Error   | Units |
|--------------|------|-----|--------|---------|-------|
| Bench.millis | avgt | 25  | 28.467 | ± 0.034 | ns/op |
| Bench.nano   | avgt | 25  | 27.331 | ± 0.003 | ns/op |
That’s pretty much the same as on the laptop, not too bad. Now let’s try the c3.large instance. It’s an older generation, but still frequently used:
| Benchmark    | Mode | Cnt | Score   | Error   | Units |
|--------------|------|-----|---------|---------|-------|
| Bench.millis | avgt | 25  | 362.491 | ± 0.072 | ns/op |
| Bench.nano   | avgt | 25  | 367.348 | ± 6.100 | ns/op |
This does not look good at all! c3.large is an older and smaller instance, so some slowdown is expected, but this is way too much! Both `currentTimeMillis()` and `nanoTime()` are an order of magnitude slower. 360 ns may not sound too bad at first, but think about this: to measure the elapsed time just once, you need two invocations, so each measurement costs you roughly 0.7 μs. If you have 10 probes measuring various execution stages, then you are at 7 μs. To put it into perspective: 40 Gbit network cards have a round-trip of around 10 μs. It means adding a bunch of probes to our hot path could have a very significant latency impact!
A Bit of Kernel Investigation
Why is a c3 instance so much slower than my laptop or a c5 instance? It turns out it’s related to Linux clock sources and, more importantly, to the glibc–kernel interface. We already know that each call of `nanoTime()` or `currentTimeMillis()` invokes native code in the OpenJDK, which calls glibc, which calls the Linux kernel.
The interesting part is the glibc–Linux kernel transition: normally when a process calls a Linux kernel function – also known as a syscall – it involves a switch from user mode to kernel mode and back. This transition is a relatively expensive operation involving many steps:
- store the CPU registers in a kernel stack
- run the kernel code with the actual function
- copy the result from the kernel space to the user space
- restore the CPU registers from the kernel stack
- jump back to the user code
This has never been a cheap operation and it’s getting more expensive with the advent of side-channel security attacks and related mitigation techniques.
Performance-sensitive applications usually work hard to avoid the user–kernel transition. The Linux kernel itself offers a shortcut for some very frequent system calls, called vDSO – virtual Dynamic Shared Object. It essentially exports some of its functions and maps them into a process’s address space, and a user process can invoke these functions as if they were regular functions from an ordinary shared library. It turns out both `clock_gettime()` and `gettimeofday()` have such a shortcut implemented, so when glibc invokes `clock_gettime()`, it effectively just jumps to a memory address without doing the expensive user–kernel transition.
All this sounds like an interesting theory, but it doesn’t explain why `System.nanoTime()` is so much slower on the c3 instance.
Experiment Time
We’ll use another amazing Linux tool to monitor the number of system calls: `perf`. The simplest test we can do is to start our benchmark and count all system calls in our operating system. The `perf` syntax for this is pretty easy:
```
sudo perf stat -e raw_syscalls:sys_enter -I 1000 -a
```
This will give us the total number of system calls per second. One important detail: it will give us only the real system calls, with the full user mode–kernel mode transition. The vDSO calls don’t count. This is how it looks when running on the c5 instance:
```
$ sudo perf stat -e raw_syscalls:sys_enter -I 1000 -a
#           time             counts unit events
     1.000082884                138      raw_syscalls:sys_enter
     2.000222087                252      raw_syscalls:sys_enter
     3.000343414                127      raw_syscalls:sys_enter
     4.000455977                136      raw_syscalls:sys_enter
     5.000569565                140      raw_syscalls:sys_enter
     6.000678534                136      raw_syscalls:sys_enter
```
You can see there are about 130 system calls every second. Given that each iteration of our benchmark takes less than 30 ns, it’s obvious the application uses vDSO to bypass system calls.
This is how it looks on the c3 instance:
```
$ sudo perf stat -e raw_syscalls:sys_enter -I 1000 -a
#           time             counts unit events
     1.000159646            1390979      raw_syscalls:sys_enter
     2.000942549            1391734      raw_syscalls:sys_enter
     3.001555643            1392785      raw_syscalls:sys_enter
     4.002148930            1393389      raw_syscalls:sys_enter
     5.002734293            1392383      raw_syscalls:sys_enter
     6.003311904            1338051      raw_syscalls:sys_enter
```
That’s over 1,300,000 system calls per second! Also, the latency of both `nanoTime()` and `currentTimeMillis()` roughly doubled to 700 ns per operation. This is a rather strong indication that each benchmark iteration invokes a real system call!
Let’s use another `perf` command to gather additional evidence. This command will count all system calls invoked in 5 seconds and group them by name:
```
sudo perf stat -e 'syscalls:sys_enter_*' -a sleep 5
```
When running on the c5 instance there is nothing extraordinary going on. However when running on the c3 instance we can see this:
```
15695416      syscalls:sys_enter_gettimeofday        (99.98%)
```
This is our smoking gun! Very strong evidence that when the benchmark is running on the c3 box, it makes the real `gettimeofday()` system call! But why?
This is the relevant part of the 4.4 kernel (used in Ubuntu 16.04):
```c
notrace int __vdso_gettimeofday(struct timeval *tv, struct timezone *tz)
{
	if (likely(tv != NULL)) {
		if (unlikely(do_realtime((struct timespec *)tv) == VCLOCK_NONE))
			return vdso_fallback_gtod(tv, tz);
		tv->tv_usec /= 1000;
	}
	if (unlikely(tz != NULL)) {
		tz->tz_minuteswest = gtod->tz_minuteswest;
		tz->tz_dsttime = gtod->tz_dsttime;
	}
	return 0;
}
```
This is the function mapped into user memory and invoked by glibc when Java calls `System.currentTimeMillis()`. It invokes `do_realtime()`, which populates the struct `tv` with the current time, and then it returns back to the caller. What’s important is that all this is executed in user mode, without any slow system call. Well, unless `do_realtime()` returns `VCLOCK_NONE` – in that case it calls `vdso_fallback_gtod()`, which will do a slow system call.
Why does the c3 instance fall back to a system call while c5 does not? Well, it’s related to a change in virtualization technique! AWS had been using Xen virtualization since its inception. About 2 years ago they announced a transition from Xen to KVM virtualization. C3 instances use Xen virtualization, the newer c5 instances use KVM. What’s relevant for us is that each technique uses a different implementation of the Linux clock. Linux shows the current clock source in `/sys/devices/system/clocksource/clocksource0/current_clocksource`.
This is c3:
```
$ cat /sys/devices/system/clocksource/clocksource0/current_clocksource
Xen
```
And this is c5:
```
$ cat /sys/devices/system/clocksource/clocksource0/current_clocksource
kvm-clock
```
It turns out the kvm-clock implementation sets the `vclock_mode` to `VCLOCK_PVCLOCK`, which means the slow fallback branch above is not taken. The Xen clock source doesn’t set this mode at all and it stays at `VCLOCK_NONE`. This results in jumping into the `vdso_fallback_gtod()` function, which will eventually initiate a real system call!
```c
#define VCLOCK_NONE    0  /* No vDSO clock available.       */
#define VCLOCK_TSC     1  /* vDSO should use vread_tsc.     */
#define VCLOCK_HPET    2  /* vDSO should use vread_hpet.    */
#define VCLOCK_PVCLOCK 3  /* vDSO should use vread_pvclock. */

struct arch_clocksource_data {
    int vclock_mode;
};
```
The nice thing about Linux is that it’s highly configurable and often gives us enough rope to hang ourselves. We can try to change the clock source on c3 and re-run the benchmark. The available sources are visible via:

```
$ cat /sys/devices/system/clocksource/clocksource0/available_clocksource
xen tsc hpet acpi_pm
```
TSC stands for the Time Stamp Counter, a pretty fast clock source and, importantly for us, one with a proper vDSO implementation. Let’s switch the clock source on the c3 instance from Xen to TSC:
```
# echo tsc > /sys/devices/system/clocksource/clocksource0/current_clocksource
```
Check it’s really switched:
```
$ cat /sys/devices/system/clocksource/clocksource0/current_clocksource
tsc
```
Looks good! Now we can re-run our benchmark:
| Benchmark    | Mode | Cnt | Score  | Error   | Units |
|--------------|------|-----|--------|---------|-------|
| Bench.millis | avgt | 25  | 25.558 | ± 0.070 | ns/op |
| Bench.nano   | avgt | 25  | 24.101 | ± 0.037 | ns/op |
The numbers look pretty good! Actually better than the c5 instance with kvm-clock. The number of system calls per second stays at the same level as on the c5 instance:
```
$ sudo perf stat -e raw_syscalls:sys_enter -I 1000 -a
#           time             counts unit events
     1.000172233                144      raw_syscalls:sys_enter
     2.000726439                203      raw_syscalls:sys_enter
     3.001163704                180      raw_syscalls:sys_enter
     4.001595597                177      raw_syscalls:sys_enter
     5.002030641                192      raw_syscalls:sys_enter
```
Some people recommend switching the clock source to TSC even when using the Xen virtualization. I don’t know enough about its possible side-effects, but apparently even some big companies do/did this in production. Obviously this does not prove it’s safe, but it’s an indication it works for some.
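If you want your application to at least notice which clock source it is running on, here is a minimal Linux-only sketch – the file path is the one shown above, and the warning logic is just an illustration, not a recommendation to change anything automatically:

```java
import java.nio.file.Files;
import java.nio.file.Path;

public class ClockSourceCheck {
    public static void main(String[] args) throws Exception {
        Path path = Path.of("/sys/devices/system/clocksource/clocksource0/current_clocksource");
        String clockSource = Files.readString(path).trim();
        System.out.println("current clock source: " + clockSource);
        if ("xen".equals(clockSource)) {
            System.out.println("warning: xen clocksource has no vDSO fast path - expect slow System.nanoTime()");
        }
    }
}
```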
Final Words
We have seen how a low-level implementation detail can have a significant impact on the performance of common Java calls. It’s not a theoretical problem visible only in microbenchmarks; real systems are affected too. You can read a bit more about vDSO directly in the Linux kernel source tree.
I wouldn’t have been able to conduct this investigation without my amazing colleagues at Hazelcast. It’s a world-class team and I learned a lot from them! I would also like to thank Brendan Gregg for his collection of perf tricks; my memory is rather weakly consistent and Brendan created a great cheatsheet.
And last but not least: if you are interested in performance, runtimes or distributed systems, follow me!
Author: Jaromir Hamala
I am a Jack of all trades at Hazelcast. I build distributed teams and systems by day and poke various runtimes by night. I like exploring how things work under the hood.