A persistent KeyValue Server in 40 lines and a sad fact

Advent time again .. picking up Peter's well-written overview of the uses of Unsafe, I'll take a short fly-by on how low-level techniques in Java can save development effort by enabling a higher level of abstraction, or allow for Java performance levels probably unknown to many.

My major point is to show that the conversion of objects to bytes and vice versa is an important fundamental, affecting virtually any modern Java application.

Hardware likes to process streams of bytes, not object graphs connected by pointers, as "All memory is tape" (M. Thompson, if I remember correctly ..).

Many basic technologies are therefore hard to use with vanilla Java heap objects:

  • Memory mapped files – a great and simple technology to persist application data safely, quickly and easily.
  • Network communication is based on sending packets of bytes
  • Interprocess communication (shared memory)
  • The large main memory of today's servers (64GB to 256GB), which is hard to exploit with on-heap data due to GC issues
  • CPU caches work best on data stored as a continuous stream of bytes in memory

So in most cases, use of the Unsafe class boils down to helping transform a Java object graph into a continuous memory region and vice versa, using either

  • [performance enhanced] object serialization or
  • wrapper classes to ease access to data stored in a continuous memory region.

(source of examples used in this post can be found here, messaging latency test here)


    Serialization based Off-Heap

    Consider a retail web application where there might be millions of registered users. We are actually not interested in representing the data in a relational database, as all that's needed is quick retrieval of user-related data once a user logs in. Additionally, one would like to traverse the social graph quickly.

    Let’s take a simple user class holding some attributes and a list of ‘friends’ making up a social graph.
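    A minimal sketch of what such a class might look like (the field names here are illustrative, not the original listing):

    import java.io.Serializable;
    import java.util.ArrayList;
    import java.util.List;

    public class User implements Serializable {
        String userId;
        String name;
        String email;
        // keys of befriended users, making up the social graph
        List<String> friendIds = new ArrayList<String>();
    }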

    The easiest way to store this on the heap is a simple huge HashMap.

    Alternatively, one can use off-heap maps to store large amounts of data. An off-heap map stores its keys and values inside the native heap, so garbage collection does not need to track this memory. In addition, the native heap can be told to automagically get synchronized to disk (memory mapped files). This even works in case your application crashes, as the OS manages the write-back of changed memory regions.

    There are some open source off-heap map implementations out there with various feature sets (e.g. ChronicleMap); for this example I'll use a plain and simple implementation featuring fast iteration (optional full-scan search) and ease of use.

    Serialization is used to store objects; deserialization is used to pull them back onto the Java heap. Pleasantly, I have written the (afaik) fastest fully JDK-compliant object serialization on the planet, so I'll make use of that.
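    A rough sketch of the round trip, using the FSTConfiguration API of the fast-serialization project (package and method names from memory, they may differ between versions):

    import org.nustaq.serialization.FSTConfiguration;

    // an FSTConfiguration is expensive to create, so create it once and reuse it
    FSTConfiguration conf = FSTConfiguration.createDefaultConfiguration();

    byte[] bytes = conf.asByteArray(user);   // object graph -> bytes, ready for off-heap storage
    User copy = (User) conf.asObject(bytes); // bytes -> object graph, back on the Java heap

    where user is an instance of the User class sketched above.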

     Done:

    • persistence by memory mapping a file (map will reload upon creation). 
    • The Java heap stays nearly empty and can serve real application processing, with Full GC < 100ms. 
    • Significantly less overall memory consumption. A serialized user record is ~60 bytes, so in theory 300 million records fit into ~18GB of server memory. No need to raise the big data flag and run 4096 hadoop nodes on AWS ;).
    Comparing a regular in-memory Java HashMap and a fast-serialization based persistent off-heap map holding 15 million user records shows the following results (on an older 3GHz 2×6-core XEON):

                                      consumed Java   Full GC   Native       get/put     required VM
                                      Heap (MB)       (s)       Heap (MB)    ops per s   size (MB)
    HashMap                           6,865.00        26.039    0            3,800,000   12,000.00
    OffheapMap (Serialization based)  63.00           0.026     3,050        750,000     500.00

    [test source / blog project] Note: You’ll need at least 16GB of RAM to execute them.

    As one can see, even with fast serialization there is a heavy penalty (~factor 5) in access performance. Anyway: compared to other persistence alternatives, it's still superior (1-3 microseconds per "get" operation, "put()" very similar).

    Use of JDK serialization would perform at least 5 to 10 times slower (direct comparison below) and therefore render this approach useless.

    Trading performance gains for a higher level of abstraction: "Serverize me"

    A single server won’t be able to serve (hundreds of) thousands of users, so we somehow need to share data amongst processes, even better: across machines.

    Using a fast implementation, it's possible to generously use (fast-) serialization for over-the-network messaging. Again: if this ran 5 to 10 times slower, it just wouldn't be viable. Alternative approaches require an order of magnitude more work to achieve similar results.

    By wrapping the persistent off-heap hash map with an Actor implementation (async ftw!), a few lines of code make up a persistent KeyValue server with a TCP-based and an HTTP interface (uses kontraktor actors). Of course, the Actor can still be used in-process if one decides so later on.
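    To illustrate the shape of the idea without reproducing the kontraktor specifics (see the linked sources for the real thing), here is a sketch using a plain single-threaded executor to stand in for the actor's mailbox:

    import java.util.HashMap;
    import java.util.Map;
    import java.util.concurrent.CompletableFuture;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    // actor-style wrapper: all map access is funneled through a single thread,
    // so the underlying (off-heap) map needs no locking; callers get async results
    public class KVActorSketch {
        private final Map<String, Object> store = new HashMap<String, Object>(); // stand-in for the off-heap map
        private final ExecutorService mailbox = Executors.newSingleThreadExecutor();

        public CompletableFuture<Object> get(final String key) {
            return CompletableFuture.supplyAsync(() -> store.get(key), mailbox);
        }

        public void put(final String key, final Object value) {
            mailbox.execute(() -> store.put(key, value));
        }
    }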

    Now that's a microservice. Given that it lacks any attempt at optimization and is single-threaded, it's reasonably fast [same XEON machine as above]:

    • 280_000 successful remote lookups per second 
    • 800_000 in case of failed lookups (key not found)
    • serialization based TCP interface (1 liner)
    • a stringy webservice for the REST-of-us (1 liner).

    [source: KVServer, KVClient] Note: You’ll need at least 16GB of RAM to execute the test.

    A real-world implementation might want to double performance by directly putting the received serialized object byte[] into the map, instead of encoding it twice (encode/decode once for transmission over the wire, then decode/encode again for the off-heap map).

    "RestActorServer.Publish(..);" is a one-liner to also expose the KVActor as a webservice in addition to raw TCP.

    C-like performance using flyweight wrappers / structs

    With serialization, regular Java objects are transformed into a byte sequence. One can do the opposite: create wrapper classes which read data from fixed or computed positions of an underlying byte array or native memory address. (E.g. see this blog post.)

    By moving the base pointer, it's possible to access different records by just moving the wrapper's offset. Copying such a "packed object" boils down to a memory copy. In addition, it's pretty easy to write allocation-free code this way. One downside is that reading/writing single fields has a performance penalty compared to regular Java objects. This can be made up for by using the Unsafe class.
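    A hand-rolled flyweight along those lines might look like this minimal sketch (using a ByteBuffer rather than Unsafe; the field layout is made up):

    import java.nio.ByteBuffer;

    // reads/writes fields at fixed positions relative to a movable base offset
    public class UserRecordFlyweight {
        static final int RECORD_SIZE = 64; // hypothetical fixed record length

        private ByteBuffer memory;
        private int base;

        // point the wrapper at record n; no allocation happens here
        public UserRecordFlyweight wrap(ByteBuffer memory, int recordIndex) {
            this.memory = memory;
            this.base = recordIndex * RECORD_SIZE;
            return this;
        }

        public long getUserId()           { return memory.getLong(base); }
        public int getFriendCount()       { return memory.getInt(base + 8); }
        public void setFriendCount(int n) { memory.putInt(base + 8, n); }
    }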

    "Flyweight" wrapper classes can be implemented manually as shown in the blog post cited; however, as code grows, this starts getting unmaintainable.
    Fast-serialization provides a byproduct, "struct emulation", supporting the creation of flyweight wrapper classes from regular Java classes at runtime. Low-level byte fiddling in application code can be avoided for the most part this way.

    How a regular Java class can be mapped to flat memory (fst-structs):

    Of course there are simpler tools out there to help reduce manual programming of encoding (e.g. Slab), which might be more appropriate for many cases and use less "magic".

    What kind of performance can be expected using the different approaches (sad fact incoming)?

    Let's take the following struct class, consisting of a price update and an embedded struct denoting a tradable instrument (e.g. a stock), and encode it using the various methods:

    A 'struct' in code:
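    A hypothetical sketch of its shape (if I recall correctly, real fst-structs classes additionally extend a common struct base class):

    // embedded struct denoting a tradable instrument, laid out inline in flat memory
    class InstrumentStruct {
        protected long instrumentId;
        protected int exchangeCode;
    }

    // the price update carrying the embedded instrument
    public class PriceUpdateStruct {
        protected InstrumentStruct instrument;
        protected long timestamp;
        protected int bidPrice;
        protected int askPrice;
    }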
    Pure encoding performance (objects encoded per second):

    Structs                    26,315,000
    fast-Ser (no shared refs)   7,757,000
    fast-Ser                    5,102,000
    JDK Ser (no shared)           649,000
    JDK Ser                       644,000




    Real world test with messaging throughput:

    In order to get a basic estimate of the differences in a real application, I ran an experiment on how the different encodings perform when used to send and receive messages at a high rate via reliable UDP messaging:

    The Test:
    A sender encodes messages as fast as possible and publishes them using reliable multicast, a subscriber receives and decodes them.

    Throughput (messages per second):

    Structs                    6,644,107
    fast-Ser (no shared refs)  4,385,118
    fast-Ser                   3,615,584
    JDK Ser (no shared)           81,582
    JDK Ser                       79,073

    (Tests done on I7/Win8; XEON/Linux scores slightly higher; message size ~70 bytes for structs, ~60 bytes for serialization.)

    Slowest compared to fastest: a factor of 82. The test highlights an issue not covered by micro-benchmarking: encoding and decoding should perform similarly, as factual throughput is determined by Min(encoding performance, decoding performance). For unknown reasons, JDK serialization manages to encode the message tested at roughly 500_000 times per second, but its decoding performance is only 80_000 per second, so in the test the receiver gets dropped quickly:



    ***** Stats for receive rate:   80351   per second *********
    ***** Stats for receive rate:   78769   per second *********
    SUB-ud4q has been dropped by PUB-9afs on service 1
    fatal, could not keep up. exiting

    (Creating backpressure here probably isn’t the right way to address the issue 😉  )

    Conclusion:

    • a fast serialization allows for a level of abstraction in distributed applications that is impossible if the serialization implementation is either
      – too slow
      – incomplete, e.g. cannot handle any serializable object graph
      – requires manual coding/adaptations (this would put many restrictions on actor message types, Futures, Spores; a maintenance nightmare)
    • Low-level utilities like Unsafe enable different representations of data, resulting in extraordinary throughput or guaranteed latency boundaries (allocation-free main path) for particular workloads. These are impossible to achieve by a large margin with the JDK's public tool set.
    • In distributed systems, communication performance is of fundamental importance. Removing Unsafe is not the biggest fish to fry looking at the numbers above .. JSON or XML won't fix this ;-).
    • While the HotSpot VM has reached an extraordinary level of performance and reliability, CPU is wasted in some parts of the JDK like there's no tomorrow. Given that we are living in the age of distributed applications and data, moving stuff over the wire should be easy to achieve (not manually coded) and as fast as possible. 
    Addendum: bounded latency

    A quick ping-pong RTT latency benchmark showing that Java can compete with C solutions easily, as long as the main path is allocation-free and techniques like those described above are employed:

    [credits: charts+measurement done with HdrHistogram]

    This is an "experiment" rather than a benchmark (so do not read: 'Proven: Java faster than C'); it shows that low-level Java can compete with C in at least this low-level domain.
    Of course it's not exactly idiomatic Java code; however, it's still easier to handle, port and maintain compared to a JNI or pure C(++) solution. Low-latency C(++) code won't be that idiomatic either 😉

    About me: I am a solution architect freelancing at an exchange company in the area of realtime GUIs, middleware, and low latency CEP (Complex Event Processing).
    I am blogging at http://java-is-the-new-c.blogspot.de/,
    hacking at https://github.com/RuedigerMoeller.

    This post is part of the Java Advent Calendar and is licensed under the Creative Commons 3.0 Attribution license. If you like it, please spread the word by sharing, tweeting, FB, G+ and so on!

    How and Why is Unsafe used in Java?

    Overview

    sun.misc.Unsafe has been in Java from at least as far back as Java 1.4 (2002).  In Java 9, Unsafe will be hidden, along with many other for-internal-use classes, to improve the maintainability of the JVM.  While it is still unclear exactly what will replace Unsafe, and I suspect it will be more than one thing which replaces it, it raises the question: why is it used at all?

    Doing things which the Java language doesn’t allow but are still useful.

    Java doesn't allow many of the tricks which are available to lower-level languages.  For most developers this is a very good thing, and it not only saves you from yourself, it also saves you from your co-workers.  It also makes it easier to import open source code, because you know there are limits to how much damage it can do.  Or at least there are limits to how much you can do accidentally.  If you try hard enough, you can still do damage.

    But why would you even try, you might wonder?  When building libraries, many (but not all) of the methods in Unsafe are useful, and in some cases there is no other way to do the same thing without using JNI, which is even more dangerous and loses the "compile once, run anywhere" promise.

    Deserialization of objects

    When deserializing or building an object using a framework, you make the assumption that you want to reconstitute an object which existed before.  You expect that you will use reflection to either call the setters of the class, or more likely set the internal fields directly, even the final fields.  The problem is you want to create an instance of an object, but you don't really need a constructor, as it is likely to only make things more difficult and have side effects.
    public class A implements Serializable {
        private final int num;

        public A(int num) {
            System.out.println("Hello Mum");
            this.num = num;
        }

        public int getNum() {
            return num;
        }
    }

    In this class, you should be able to rebuild it and set the final field, but if you have to call a constructor, it might do things which have nothing to do with deserialization.  For these reasons, many libraries use Unsafe to create instances without calling a constructor:

    Unsafe unsafe = getUnsafe(); // obtained via reflection, as in the PingPongMapMain example below
    Class aClass = A.class;
    A a = (A) unsafe.allocateInstance(aClass);

    Calling allocateInstance avoids the need to call the appropriate constructor, when we don’t need one.

    Thread safe access to direct memory

    Another use for Unsafe is thread-safe access to off-heap memory.  ByteBuffer gives you safe access to off-heap or direct memory; however, it doesn't have any thread-safe operations.  This is particularly useful if you want to share data between processes.

    import sun.misc.Unsafe;
    import sun.nio.ch.DirectBuffer;

    import java.io.File;
    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.lang.reflect.Field;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;

    public class PingPongMapMain {
        public static void main(String... args) throws IOException {
            boolean odd;
            switch (args.length < 1 ? "usage" : args[0].toLowerCase()) {
                case "odd":
                    odd = true;
                    break;
                case "even":
                    odd = false;
                    break;
                default:
                    System.err.println("Usage: java PingPongMapMain [odd|even]");
                    return;
            }
            int runs = 10000000;
            long start = 0;
            System.out.println("Waiting for the other odd/even");
            File counters = new File(System.getProperty("java.io.tmpdir"), "counters.deleteme");
            counters.deleteOnExit();

            try (FileChannel fc = new RandomAccessFile(counters, "rw").getChannel()) {
                MappedByteBuffer mbb = fc.map(FileChannel.MapMode.READ_WRITE, 0, 1024);
                long address = ((DirectBuffer) mbb).address();
                for (int i = -1; i < runs; i++) {
                    for (; ; ) {
                        long value = UNSAFE.getLongVolatile(null, address);
                        boolean isOdd = (value & 1) != 0;
                        if (isOdd != odd)
                            // wait for the other side.
                            continue;
                        // make the change atomic, just in case there is more than one odd/even process
                        if (UNSAFE.compareAndSwapLong(null, address, value, value + 1))
                            break;
                    }
                    if (i == 0) {
                        System.out.println("Started");
                        start = System.nanoTime();
                    }
                }
            }
            System.out.printf("... Finished, average ping/pong took %,d ns%n",
                    (System.nanoTime() - start) / runs);
        }

        static final Unsafe UNSAFE;

        static {
            try {
                Field theUnsafe = Unsafe.class.getDeclaredField("theUnsafe");
                theUnsafe.setAccessible(true);
                UNSAFE = (Unsafe) theUnsafe.get(null);
            } catch (Exception e) {
                throw new AssertionError(e);
            }
        }
    }

    When you run this as two programs, one with "odd" and the other with "even", you can see that each process is changing data via persisted shared memory.

    Each program maps the same area of the disk's cache into its process.  There is actually only one copy of the file in memory.  This means the memory can be shared, provided you use thread-safe operations such as the volatile and CAS operations.

    The output on an i7-3970X is

    Waiting for the other odd/even
    Started
    … Finished, average ping/pong took 83 ns

    That is an 83 ns round trip time between two processes. When you consider that System V IPC takes around 2,500 ns, and is volatile rather than persisted, this is pretty quick.

    Is using Unsafe suitable for work?

    I wouldn't recommend you use Unsafe directly.  It requires far more testing than natural Java development.  For this reason, I suggest you use a library where its usage has been tested already.  If you want to use Unsafe yourself, I suggest you thoroughly test its usage in a standalone library.  This limits how Unsafe is used in your application and gives you a safer Unsafe.

    Conclusion

    It is interesting that Unsafe exists in Java, and you might want to play with it at home.  It has some work applications, especially in writing low-level libraries, but in general it is better to use a library which uses Unsafe and has been tested, than to use it directly yourself.

    About the Author.

    Peter Lawrey has the most Java answers on StackOverflow. He is the founder of the Performance Java User’s Group, and lead developer of Chronicle Queue and Chronicle Map, two libraries which use Unsafe to share persisted data between processes.

    Escaping from the JVM heap for memory intensive applications

    If you’ve ever allocated large Java heaps, you know that at some point – typically starting at around 4 GiB – you will start having issues with your garbage collection pauses.
    I won't go into detail about why pauses happen in the JVM, but in short, they happen when the JVM does full collections on a large heap. As the heap grows, those collections can take longer.

    The simplest way to overcome this is to tune your JVM garbage collection parameters to match the memory allocation and deallocation behaviour of your particular application. It is a bit of a dark art and requires careful measurement, but it's possible to have very large heaps while mostly avoiding old-generation garbage collections. If you want to learn more about garbage collection tuning, check out this JVM GC tuning guide.
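    For illustration, a first pass at such tuning might look like the following launch flags (the values are placeholders to measure against, not recommendations; MyCacheServer is a hypothetical main class):

    java -Xms24g -Xmx24g \
         -XX:NewSize=6g -XX:MaxNewSize=6g \
         -XX:+UseConcMarkSweepGC \
         -XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly \
         -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
         MyCacheServer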
    If you get really interested about GC in general, this is an excellent book: The Garbage Collection Handbook.

    There are JVM implementations that guarantee much lower pause times than the Sun VM, such as the Zing JVM – but normally at other costs in your system, such as increased memory usage and lower single-threaded performance. Still, the ease of configuration and low GC guarantees are very appealing.

    For the purpose of this article, I will use the example of an in-memory cache or store in Java, mainly because I’ve built a couple in the past while using some of these techniques.

    We’ll assume we have a basic cache interface definition like so:


    import java.io.Externalizable;

    public interface Cache<K extends Externalizable, V extends Externalizable> {
        public void put(K key, V value);
        public V get(K key);
    }

    We're requiring that keys and values are Externalizable just for this simple example; it wouldn't be like this in real life.
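    For completeness, a value class satisfying that bound might look like this minimal sketch:

    import java.io.Externalizable;
    import java.io.IOException;
    import java.io.ObjectInput;
    import java.io.ObjectOutput;

    public class UserValue implements Externalizable {
        private long id;
        private String name;

        public UserValue() {
            // Externalizable requires a public no-arg constructor
        }

        public void writeExternal(ObjectOutput out) throws IOException {
            out.writeLong(id);
            out.writeUTF(name);
        }

        public void readExternal(ObjectInput in) throws IOException, ClassNotFoundException {
            id = in.readLong();
            name = in.readUTF();
        }
    }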

    We will show how to have different implementations of this cache that store data in memory in different ways. The simplest way to implement this cache would be using Java collections:


    import java.io.Externalizable;
    import java.util.HashMap;
    import java.util.Map;

    public class CollectionCache<K extends Externalizable, V extends Externalizable> implements Cache<K, V> {
        private final Map<K, V> backingMap = new HashMap<K, V>();

        public void put(K key, V value) {
            backingMap.put(key, value);
        }

        public V get(K key) {
            return backingMap.get(key);
        }
    }

    This implementation is straightforward. However, as the map size increases, we will be allocating (and deallocating) a large number of objects, we are using boxed primitives, which take more space in memory than primitives, and the map needs to be resized from time to time.
    We could certainly improve this implementation simply by using a primitive-based map, for example as sketched below. It would use less memory and fewer objects, but the data would still take space in the heap and possibly fragment it, leading to longer pauses if for other reasons we do full GCs.
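    For example, with a primitive-keyed map from a library such as Trove (class and package names from memory; adjust to whichever library and version you use):

    import gnu.trove.map.hash.TLongObjectHashMap;

    // long keys are stored unboxed in primitive arrays internally,
    // avoiding one wrapper object per entry
    TLongObjectHashMap<byte[]> records = new TLongObjectHashMap<byte[]>();
    records.put(42L, new byte[64]); // e.g. a serialized record
    byte[] record = records.get(42L);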

    Let’s look at other ways to store similar data without using the heap:

    • Use a separate process to store the data. Could be something like a Redis or Memcached instance that you connect through sockets or unix sockets. It’s fairly straightforward to implement.
    • Offload data to disk, using memory mapped files. The OS is your friend and will do a lot of heavy work predicting what you’ll read next from the file and your interface to it is just like a big blob of data.
    • Use native code and access it through JNI or JNA. You’ll get better performance with JNI and ease of use with JNA. Requires you to write native code.
    • Use direct allocated buffers from the NIO package.
    • Use the Sun specific Unsafe class to access memory directly from your Java code.

    I will focus on the solutions that use exclusively Java for this article, direct allocated buffers and the Unsafe class.

    Direct Allocated Buffers

    Direct Allocated Buffers are extremely useful and used extensively when developing high-performance network applications in Java NIO. By allocating data directly outside the heap, in a number of cases you can write software where that data actually never touches the heap.

    Creating a new direct allocated buffer is as simple as it gets:


    int numBytes = 1000;
    ByteBuffer buffer = ByteBuffer.allocateDirect(numBytes);

    After creating a new buffer, you can manipulate the buffer in a few different ways. If you’ve never used Java NIO buffers you should definitely take a look as they are really cool.

    Besides ways to fill, drain, and mark different points in the buffer, you can opt to have a different view on the buffer instead of a ByteBuffer – e.g. buffer.asLongBuffer() gives you a view of the ByteBuffer where you manipulate elements as longs.
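    For example, a minimal sketch of manipulating the same memory through a long view:

    import java.nio.ByteBuffer;
    import java.nio.LongBuffer;

    ByteBuffer buffer = ByteBuffer.allocateDirect(1024);
    LongBuffer longs = buffer.asLongBuffer(); // same memory, viewed as longs
    longs.put(0, 42L);                        // writes bytes 0..7 of the underlying buffer
    long value = longs.get(0);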

    So how could these be used in our Cache example? There are a number of ways, the most straightforward way would be to store the serialized/externalized form of the value record in a big array along with a map of keys to offsets and sizes of the record in that array.

    It could look like this (a very liberal approach, with missing implementations, assuming fixed-size records):


    import java.io.Externalizable;
    import java.nio.ByteBuffer;
    import java.util.HashMap;
    import java.util.Map;

    public class DirectAllocatedCache<K extends Externalizable, V extends Externalizable> implements Cache<K, V> {
        private final ByteBuffer backingMap;
        private final Map<K, Integer> keyToOffset;
        private final int recordSize;

        public DirectAllocatedCache(int recordSize, int maxRecords) {
            this.recordSize = recordSize;
            this.backingMap = ByteBuffer.allocateDirect(recordSize * maxRecords);
            this.keyToOffset = new HashMap<K, Integer>();
        }

        public void put(K key, V value) {
            // silently drops the entry when the buffer is full
            if (backingMap.position() + recordSize < backingMap.capacity()) {
                keyToOffset.put(key, backingMap.position());
                store(value);
            }
        }

        public V get(K key) {
            Integer offset = keyToOffset.get(key); // Integer, to avoid an unboxing NPE on a missing key
            if (offset != null)
                return retrieve(offset);

            throw new KeyNotFoundException();
        }

        public V retrieve(int offset) {
            byte[] record = new byte[recordSize];
            int oldPosition = backingMap.position();
            backingMap.position(offset);
            backingMap.get(record);
            backingMap.position(oldPosition);

            // implementation left as an exercise
            return internalize(record);
        }

        public void store(V value) {
            byte[] record = externalize(value);
            backingMap.put(record);
        }
    }

    As you can see, this code has a number of limitations: fixed record size, fixed backing map size, a limited way in which externalization is done, difficulty deleting and reusing space, etc. While some of these can be overcome with clever ways of representing records in byte arrays (and representing the keyToOffset map in direct allocated buffers also) or by dealing with deletions (we could implement our own slab allocator), others, such as resizing the backing map, are difficult to overcome.
    An interesting improvement is to implement records as offsets to records and fields, thus reducing the amount of data we copy, and copying only on demand.

    Be aware that the JVM imposes a limit on the amount of memory used by direct allocated buffers. You can tune this with the -XX:MaxDirectMemorySize option. Check out the ByteBuffer javadocs.

    Unsafe

    Another way to manage memory directly from Java is using the hidden Unsafe class. Technically we're not supposed to use it, and it is implementation specific, as it lives in a sun package, but the possibilities it offers are endless.
    What Unsafe gives us is the ability to allocate, deallocate and manage memory directly from Java code. We can also get the actual pointers and pass them between native and Java code interchangeably.

    In order to get an Unsafe instance, we need to cut a few corners:


    private Unsafe getUnsafeBackingMap() {
        try {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            return (Unsafe) f.get(null);
        } catch (Exception e) {
            // ignored: we return null if Unsafe is unavailable
        }
        return null;
    }
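    With the instance in hand, raw memory management is direct (a minimal sketch; memory allocated this way is invisible to the GC and must be freed explicitly):

    Unsafe unsafe = getUnsafeBackingMap();
    long address = unsafe.allocateMemory(1024); // native allocation, not tracked by the GC
    unsafe.putLong(address, 42L);
    long value = unsafe.getLong(address);
    unsafe.freeMemory(address);                 // this leaks if you forget it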

    Once we have the unsafe, we can apply this to our previous Cache example:


    import java.io.Externalizable;
    import java.lang.reflect.Field;
    import java.util.HashMap;
    import java.util.Map;

    import sun.misc.Unsafe;

    public class UnsafeCache<K extends Externalizable, V extends Externalizable> implements Cache<K, V> {
        private final int recordSize;
        private final Unsafe backingMap;
        private final Map<K, Integer> keyToOffset;
        private long address;
        private int capacity;
        private int currentOffset;

        public UnsafeCache(int recordSize, int maxRecords) {
            this.recordSize = recordSize;
            this.backingMap = getUnsafeBackingMap();
            this.capacity = recordSize * maxRecords;
            this.address = backingMap.allocateMemory(capacity);
            this.keyToOffset = new HashMap<K, Integer>();
        }

        public void put(K key, V value) {
            // silently drops the entry when the region is full
            if (currentOffset + recordSize < capacity) {
                store(currentOffset, value);
                keyToOffset.put(key, currentOffset);
                currentOffset += recordSize;
            }
        }

        public V get(K key) {
            Integer offset = keyToOffset.get(key); // Integer, to avoid an unboxing NPE on a missing key
            if (offset != null)
                return retrieve(offset);

            throw new KeyNotFoundException();
        }

        public V retrieve(int offset) {
            byte[] record = new byte[recordSize];

            // inefficient: byte-by-byte copy
            for (int i = 0; i < record.length; i++) {
                record[i] = backingMap.getByte(address + offset + i);
            }

            // implementation left as an exercise
            return internalize(record);
        }

        public void store(int offset, V value) {
            byte[] record = externalize(value);

            // inefficient: byte-by-byte copy
            for (int i = 0; i < record.length; i++) {
                backingMap.putByte(address + offset + i, record[i]);
            }
        }

        private Unsafe getUnsafeBackingMap() {
            try {
                Field f = Unsafe.class.getDeclaredField("theUnsafe");
                f.setAccessible(true);
                return (Unsafe) f.get(null);
            } catch (Exception e) {
                // ignored: we return null if Unsafe is unavailable
            }
            return null;
        }
    }

    There's a lot of room for improvement, and you need to do a number of things manually, but it's very powerful. You can also explicitly free and reallocate memory that you've allocated this way, which allows you to write some code the same way you would in C.

    Check out the javadocs for Unsafe

    Conclusion

    There are a number of ways to avoid using the heap in Java and, in this way, use a lot more memory. You don't need to do this, and I've personally seen properly tuned JVMs with 20GiB-30GiB heaps running with no long garbage collection pauses, but it is fairly interesting.

    If you want to check out how some projects use this approach beyond the basic (and honestly untested, almost written on a napkin) cache code I wrote here, have a look at EHCache's BigMemory or Apache Cassandra, which also uses Unsafe for this type of approach.

     

     

    Meta: this post is part of the Java Advent Calendar and is licensed under the Creative Commons 3.0 Attribution license. If you like it, please spread the word by sharing, tweeting, FB, G+ and so on! Want to write for the blog? We are looking for contributors to fill all 24 slots and would love to have your contribution! Contact Attila Balazs to contribute!