How and Why is Unsafe used in Java?

Overview

sun.misc.Unsafe has been in Java from at least as far back as Java 1.4 (2004).  In Java 9, Unsafe will be hidden along with many other, for-internal-use classes. to improve the maintainability of the JVM.  While it is still unclear exactly what will replace Unsafe, and I suspect it will be more than one thing which replaces it, it raises the question, why is it used at all?

Doing things which the Java language doesn’t allow but are still useful.

Java doesn’t allow many of the tricks which are available to lower level languages.  For most developers this is very good thing, and it not only saves you from yourself, it also saves you from your co-workers.  It also makes it easier to import open source code because you know there is limits to how much damage they can do.  Or at least there is limits to how much you can do accidentally. If you try hard enough you can still do damage.

But why would you even try, you might wonder?  When building libraries many (but not all) of the methods in Unsafe are useful and in some cases, there is no other way to do the same thing without using JNI, which is even more dangerous and you lose the “compile once, run anywhere”

Deserialization of objects

When deserializing or building an object using a framework, you make the assumption you want to reconstitute an object which existed before.  You expect that you will use reflection to either call the setters of the class, or more likely set the internal fields directly, even the final fields.  The problem is you want to create an instance of an object, but you don’t really need a constructor as this is likely to only make things more difficult and have side effects.
public class A implements Serializable {
private final int num;
public A(int num) {
System.out.println("Hello Mum");
this.num = num;
}

public int getNum() {
return num;
}
}

In this class, you should be able to rebuild and set the final field, but if you have to call a constructor and it might do things which don’t have anything to do with deserialization.  For these reasons many libraries use Unsafe to create instances without calling a constructor

Unsafe unsafe = getUnsafe();
Class aClass = A.class;
A a = (A) unsafe.allocateInstance(aClass);

Calling allocateInstance avoids the need to call the appropriate constructor, when we don’t need one.

Thread safe access to direct memory

Another use for Unsafe is thread safe access to off heap memory.  ByteBuffer gives you safe access to off heap or direct memory, however it doesn’t have any thread safe operations.  This is particularly useful if you want to share data between processes.

import sun.misc.Unsafe;
import sun.nio.ch.DirectBuffer;

import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.lang.reflect.Field;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class PingPongMapMain {
public static void main(String... args) throws IOException {
boolean odd;
switch (args.length < 1 ? "usage" : args[0].toLowerCase()) {
case "odd":
odd = true;
break;
case "even":
odd = false;
break;
default:
System.err.println("Usage: java PingPongMain [odd|even]");
return; }
int runs = 10000000;
long start = 0;
System.out.println("Waiting for the other odd/even");
File counters = new File(System.getProperty("java.io.tmpdir"), "counters.deleteme"); counters.deleteOnExit();

try (FileChannel fc = new RandomAccessFile(counters, "rw").getChannel()) {
MappedByteBuffer mbb = fc.map(FileChannel.MapMode.READ_WRITE, 0, 1024);
long address = ((DirectBuffer) mbb).address();
for (int i = -1; i < runs; i++) {
for (; ; ) {
long value = UNSAFE.getLongVolatile(null, address);
boolean isOdd = (value & 1) != 0;
if (isOdd != odd)
// wait for the other side.
continue;
// make the change atomic, just in case there is more than one odd/even process
if (UNSAFE.compareAndSwapLong(null, address, value, value + 1))
break;
}
if (i == 0) {
System.out.println("Started");
start = System.nanoTime();
}
}
}
System.out.printf("... Finished, average ping/pong took %,d ns%n",
(System.nanoTime() - start) / runs);
}

static final Unsafe UNSAFE;

static {
try {
Field theUnsafe = Unsafe.class.getDeclaredField("theUnsafe");
theUnsafe.setAccessible(true);
UNSAFE = (Unsafe) theUnsafe.get(null);
} catch (Exception e) {
throw new AssertionError(e);
}
}
}

When you run this in two programs, one with odd and the other with even. You can see that each process is changing data via  persisted shared memory.

In each program it maps the same are of the disks cache into the process.  There is actually only one copy of the file in memory.  This means the memory can be shared, provided you use thread safe operations such as the volatile and CAS operations.

The output on an i7-3970X is

Waiting for the other odd/even
Started
… Finished, average ping/pong took 83 ns

That is 83 ns round trip time between two processes. When you consider System V IPC takes around 2,500 ns and IPC volatile instead of persisted, that is pretty quick.

Is using Unsafe suitable for work?

I wouldn’t recommend you use Unsafe directly.  It requires far more testing than natural Java development.  For this reason I suggest you use a library where it’s usage has been tested already.  If you wan to use Unsafe yourself, I suggest you thoughly test it’s usage in a stand alone library.  This limits how Unsafe is used in your application and give syou a safer, Unsafe.

Conclusion

It is interesting that Unsafe exists in Java, and you might to play with it at home.  It has some work applications especially in writing low level libraries, but in general it is better to use a library which uses Unsafe which has been tested than use it directly yourself.

About the Author.

Peter Lawrey has the most Java answers on StackOverflow. He is the founder of the Performance Java User’s Group, and lead developer of Chronicle Queue and Chronicle Map, two libraries which use Unsafe to share persisted data between processes.

Using Intel Performance Counters To Tune Garbage Collection

<

div dir=”ltr” style=”text-align: left;”>

Introduction

I have to admit that I was shocked. Indeed, quite shaken when I realised this advent calendar post would be about garbage collection. The topic of GC has raised such passion amongst the advocates of Java and those who believe memory management should be manual. Many an article has been written regarding tiny subtle changes in strange looking command line arguments which have fractions of a percentage point performance impact on Java applications. How could I add to this huge body work? I hope this post will not add to the GC hot air but rather be a breath of fresh air instead. Let us not look at the CPU time consumed by the garbage collector or pause times; how about looking at a hidden yet potentially critical aspect of memory management in general and garbage collection in particular: Data caching is one of the major challenges of modern computer software design (the others being instruction caching and multi-core work distribution). Modern CPUs run so fast that main memory has no hope of keeping up. The way that some of this catastrophic performance penalty can be clawed back is caching. Memory is pulled in parallel into high speed cache memory and then the CPU accesses this cache. If we are lucky, and the code causes the CPU to read and write the same memory a few times (inside a loop for example) then the CPU can happily access the cache and be saved from waiting for loads and stores to and from main memory. “How does garbage collection impact on the performance of caches” one might ask? There are many ways, some of them very subtle, however, here is a grab bag of some important ones: The garbage collector traverses references in memory. This causes cache lines (blocks of memory in the cache) to contain the memory surrounding the reference and hence no longer hold other data which the program is using. Whilst we call it a garbage collector, it is actually an allocator, mover and collector. This really matters when we think about data caching:

  • Allocation: Based on hardware rules, memory addresses are matched to cache lines. If pieces of memory share a cache line but are actually accessed from different threads we get an effect called false sharing. However, if little bits of data are spread out but accessed from the same thread we get poor cache utilisation.
  • Moving: Objects are not left in one place throughout their life times. The garbage collector avoids memory fragmentation by moving objects around. This has the interesting effect of guaranteeing that the cache lines associated with the object will no longer be associated with it after a move.
  • Collecting: The funny thing is that collecting is the easy bit. It can be as simple as just marking the memory as available for reuse. It is the traversal of the object graphs (multiple roots) to find out what can be collected which is going to cause data cache line loads and thus evict lines from the cache which were being read from or written to by user code.

So we can now see that the design of the garbage collector is critical to the operation of the data cache. Swapping which collector we use will not only have a impact on GC pauses and other obvious issues, it will also effect, at a low level and in a fundamental way, all of user code.

An Example

I am not going to present an exhaustive scientific paper on this concept. The purpose of this post is to show an alternative way of approaching JVM tuning. So, I ran a simple, short, multi-threaded patch in my personal synthesiser program Sonic-Field. The patch uses feedback, resonance and a bunch of other concepts to synthesise string instruments and then convolution to place the sounds in an acoustic environment.The reason for picking sonic field is not just because it is of reasonable complexity, highly threaded and uses Spring but because I recently found I could get better performance from it using the CMS garbage collector. Latency with Sonic-Field is of no interest because it is a batch processor. However, the standard Java 7 garbage collector interacted badly with the way Sonic Field writes swap files out to disk when running low on memory. I tried CMS because it keeps the memory down the whole time (in theory – don’t flame me) because it constantly attempts to do small garbage collections along side the user threads. If we put all this together we might well come up with a reasonable theory “The CMS garbage collector might give fewer pauses and might be able to keep memory use down but in so doing it will almost certainly cause more data cache misses”. Constantly traversing the reference graph in memory to try and collect dead objects is going to cause cache loads and those loads will cause other data to be flushed from the cache (it has finite size). Thus, when user threads come to read again they will cause more cache misses and so on. Does it matter? That answer will be entirely down to the application and the hardware and the load on the application. I am not, I repeat not, advocating one garbage collector over another! Nevertheless, it is a question I would like to answer so let’s answer it for my little test patch. These data cache effects of the garbage collector are not visible from the normal VM profiling tools. This means that they do not get discussed much in the JVM community and they get considered in JVM tuning even less. However, there is a tool (actually several – but I am going to talk about the easiest to use) which can shed some light on the topic. I am talking about Intel’s PCM (Performance Counter Monitor). It can be used for code tuning as well, but I thought talking about the GC would be more fun today.

A Worked Example

pcm is just a command line tool. We pass the command line to run Java to it in quotes and it does its measurements. With other tooling, the performance counters can be used to get all sorts of other detail about an application. The benefit of the pcm command line tool is its simplicity and lack of intrusion into the over all application run. The disadvantage is that it will measure the JVM and application warm up phases. However, for server style applications or batch processors (like Sonic Field) these overheads are usually trivial compared to the actual application run. I ran my patch on my personal Macbook Pro Retina (2012) with 16Gig of RAM. The JVM was:

java version "1.8.0-ea"

Java(TM) SE Runtime Environment (build 1.8.0-ea-b61)

Java HotSpot(TM) 64-Bit Server VM (build 25.0-b05, mixed mode)

Readings from pcm are simply written to standard out when the application exits. I compared runs with no settings for the garbage collector (the default therefore) and with my currently preferred set of tweaks. To be honest, I am not sure if the tweaks are optimal; I kind of lifted them from a bunch of online articles… Here is the launch script for Java:

/Users/alexanderturner/x/IntelPerformanceCounterMonitorV2.5.1 2/pcm.x "java -Xmx12G -Xms12G  -DsonicFieldTemp=/Users/alexanderturner/temp -DsonicFieldThreads=12 -DsonicFieldSwapLimit=4.0  -XX:+UseConcMarkSweepGC -XX:+UseCompressedOops -XX:ParallelGCThreads=8 -XX:+CMSParallelRemarkEnabled -XX:CMSInitiatingOccupancyFraction=60 -XX:+UseCMSInitiatingOccupancyOnly -classpath bin:spring-framework-3.1.2.RELEASE/dist/org.springframework.asm-3.1.2.RELEASE.jar:spring/spring-framework-3.1.2.RELEASE/dist/org.springframework.beans-3.1.2.RELEASE.jar:spring-framework-3.1.2.RELEASE/dist/org.springframework.core-3.1.2.RELEASE.jar:spring/spring/spring-framework-3.1.2.RELEASE/dist/org.springframework.context-3.1.2.RELEASE.jar:
spring/spring-framework-3.1.2.RELEASE/dist/org.springframework.context-support-3.1.2.RELEASE.jar:
spring/spring-framework-3.1.2.RELEASE/dist/org.springframework.expression-3.1.2.RELEASE.jar:
spring/spring-framework-3.1.2.RELEASE/dist/org.springframework.test-3.1.2.RELEASE.jar:
spring/otherJars/commons-logging-1.1.1.jar com.nerdscentral.sfpl.RenderRunner $1"

Hopefully it is clear just how simple running Java under Intel Performance Counter Monitor v2 really is. So, here is the output:

Standard GC

Core (SKT) EXEC IPC FREQ AFREQ L3MISS L2MISS L3HIT L2HIT L3CLK L2CLK READ WRITE TEMP
0 0.53 0.75 0.70 1.31 422 M 621 M 0.32 0.32 0.14 0.01 N/A N/A 32
2 0.56 0.77 0.73 1.31 346 M 466 M 0.26 0.31 0.11 0.01 N/A N/A 28
1 0.22 0.69 0.32 1.31 144 M 192 M 0.25 0.28 0.11 0.01 N/A N/A 32
3 0.21 0.68 0.31 1.31 135 M 171 M 0.21 0.28 0.10 0.01 N/A N/A 28
4 0.55 0.77 0.71 1.31 332 M 410 M 0.19 0.38 0.11 0.01 N/A N/A 22
7 0.18 0.68 0.26 1.30 124 M 134 M 0.08 0.30 0.11 0.00 N/A N/A 27
5 0.19 0.68 0.29 1.31 133 M 155 M 0.14 0.30 0.11 0.00 N/A N/A 22
6 0.61 0.79 0.78 1.32 343 M 382 M 0.10 0.35 0.10 0.00 N/A N/A 27

—————————————————————————————————-

SKT 0 0.38 0.75 0.51 1.31 1982 M 2533 M 0.22 0.33 0.11 0.01 N/A N/A 22

Instructions retired: 2366 G ; Active cycles: 3166 G ; Time (TSC): 773 Gticks ; C0 (active,non-halted) core residency: 39.04 % —————————————————————————————————

TOTAL * 0.38 0.75 0.51 1.31 1982 M 2533 M 0.22 0.33 0.11 0.01 N/A N/A N/A

C: 1.49 => corresponds to 37.36 % utilization for cores in active state. Instructions per no C1 core residency: 23.92 %; C3 core residency: 0.01 %; C6 core residency: 0.00 %; C7 core residency: 37.02 % C2 package residency: 0.00 %; C3 package residency: 0.00 %; C6 package residency: 0.00 %; C7 package residency: 0.00 % PHYSICAL CORE I Pminal CPU cycle: 0.76 => corresponds to 19.12 % core utilization over time interval

Concurrent Mark Sweep

Rather than

-XX:+UseConcMarkSweepGC
-XX:+UseCompressedOops
-XX:+CMSParallelRemark
-XX:ParallelGCThreads=Enabled
-XX:CMSInitiatingOccupancyFraction=60
-XX:+UseCMSInitiatingOccupancyOnly

Core (SKT) EXEC IPC FREQ AFREQ L3MISS L2MISS L3HIT L2HIT L3CLK L2CLK READ WRITE TEMP
0 0.53 0.69 0.76 1.31 511 M 781 M 0.35 0.35 0.17 0.02 N/A N/A 26
2 0.54 0.71 0.75 1.31 418 M 586 M 0.29 0.40 0.14 0.01 N/A N/A 29
1 0.31 0.66 0.47 1.30 207 M 285 M 0.27 0.26 0.11 0.01 N/A N/A 26
3 0.21 0.68 0.31 1.31 135 M 171 M 0.21 0.28 0.10 0.01 N/A N/A 28
4 0.55 0.77 0.71 1.31 332 M 410 M 0.19 0.38 0.11 0.01 N/A N/A 22
7 0.18 0.68 0.26 1.30 124 M 134 M 0.08 0.30 0.11 0.00 N/A N/A 27
5 0.19 0.68 0.29 1.31 133 M 155 M 0.14 0.30 0.11 0.00 N/A N/A 22
6 0.61 0.79 0.78 1.32 343 M 382 M 0.10 0.35 0.10 0.00 N/A N/A 27
3 0.30 0.66 0.46 1.30 198 M 258 M 0.23 0.27 0.11 0.01 N/A N/A 29
4 0.59 0.73 0.81 1.31 397 M 504 M 0.21 0.46 0.12 0.01 N/A N/A 29
7 0.30 0.66 0.45 1.30 186 M 204 M 0.09 0.29 0.11 0.00 N/A N/A 30
7 0.30 0.66 0.45 1.30 186 M 204 M 0.09 0.29 0.11 0.00 N/A N/A 30
5 0.30 0.66 0.45 1.30 188 M 225 M 0.16 0.28 0.11 0.01 N/A N/A 29
6 0.58 0.73 0.79 1.31 414 M 466 M 0.11 0.49 0.13 0.00 N/A N/A 30

—————————————————————————————————-

SKT 0 0.43 0.70 0.62 1.31 2523 M 3313 M 0.24 0.38 0.13 0.01 N/A N/A 25
Instructions retired: 2438 G ; Active cycles: 3501 G ; Time (TSC): 708 Gticks ; C0 (active,non-halted) core residency: 47.22 %

—————————————————————————————————- TOTAL * 0.43 0.70 0.62 1.31 2523 M 3313 M 0.24 0.38 0.13 0.01 N/A N/A N/A

C: 1.39 => corresponds to 34.83 % utilization for cores in active state. Instructions per no C1 core residency: 17.84 %; C3 core residency: 0.01 %; C6 core residency: 0.01 %; C7 core residency: 34.92 % C2 package residency: 0.00 %; C3 package residency: 0.00 %; C6 package residency: 0.00 %; C7 package residency: 0.00 %

PHYSICAL CORE I
Pminal CPU cycle: 0.86 => corresponds to 21.51 % core utilization over time interval. All the information given here is of interest, however, there is so much of it I figure the best thing to do is cut the the case and test my assertion about the CMS collector. To do that we can look at just two lines form the output for each run:

Default:
SKT 0 0.38 0.75 0.51 1.31 1982 M 2533 M 0.22 0.33 0.11 0.01 N/A N/A 22
Instructions retired: 2366 G ; Active cycles: 3166 G ; Time (TSC): 773 Gticks ; C0 (active,non-halted) core residency: 39.04 %
CMS:
0 0.43 0.70 0.62 1.31 2523 M 3313 M 0.24 0.38 0.13 0.01 N/A N/A 25
SKT:
Instructions retired: 2438 G ; Active cycles: 3501 G ; Time (TSC): 708 Gticks ; C0 (active,non-halted) core residency: 47.22 %

Discussion

We can see that under the CMS collector there were substantially more cache misses. The L2 misses were 30% greater and L2 were up 27% over the default collector. Nevertheless, the total time taken in giga ticks (708CMS/773Default) shows that all this extra data missing has not negatively impacted over all performance at all. I guess this means that a lot more research could and should be done before drawing any conclusions as to the correct approach for this application!
If you leave this post thinking that I did not fully discuss the subject, you are correct. My intention here has been to get the reader interested in thinking about this aspect of Java performance and opening the door to a new approach.

Meta: this post is part of the Java Advent Calendar and is licensed under the Creative Commons 3.0 Attribution license. If you like it, please spread the word by sharing, tweeting, FB, G+ and so on!