Site icon JVM Advent

Project Valhalla: fast and furious java

Exploring the performance improvements of new Inline Types


Project Valhalla is one of the most interesting projects regarding the future of Java and of all JVM. It is still in an experimental state, but in this post we will look at how to try it, and we will implement a small program to verify the kind of performance improvements are possible with it.

Valhalla’s aim is to bring a new Inline Types (aka Value Types) that will: “Codes like a class works like an int”.

Currently, in the JVM there are 8 kinds of primitive types, each one is associated with a “signature letter” of the alphabet:

B byte signed byte
C char Unicode character encoded with UTF-16
D double double-precision floating-point value
F float single-precision floating-point value
I int integer
J long long integer
S short signed short
Z boolean true or false

On top of these, we have objects:

L ClassName reference an instance of class ClassName

You may have noted the signature letter if you printed out the result of toString() method for a generic object.

Project Valhalla introduces a new letter for a new kind of types:

Q ClassName inline types

These new types will eventually replace the primitive types in our applications, removing current boxed types (Integer and such) and bringing a new seamless world based on Reference and Inline types only.

This is still in the future, for the moment let’s enjoy what is already available in the current Early-Access build.


Getting Started

To simplify the set-up, the Project Valhalla team released an Early-Access Build on August the 30th. 

You can download it from here and configure it in JAVA_HOME as a normal JDK (version 14): http://jdk.java.net/valhalla/

Unfortunately at the moment (December 2019) IntelliJ and Gradle are not able to compile a source with the new features of the language. We need to use the JDK command-line tools or ant.

I shared on Github a repository with several examples and the ant build scripts needed to run them.

A Valhalla POINT

Let’s start from something very simple, let’s define a Point type with two fields for the coordinates.

inline public class Point {
    public int x;
    public int y;

    public Point(int x, int y) {
        this.x = x;
        this.y = y;
    }
}

The only change is the inline keyword on line 1. Point is an Inline Type and now we will look at the differences with a normal class.

Code like a Class

Work like an int

Work in progress

Some characteristics are still unfinished and they may not work in the current build or change how they work in future releases. In particular:

For a more comprehensive look at Project Valhalla features, see the articles linked at the end. Here we will concentrate on the performance aspect of it.

Compact Arrays

One of the most exciting features of the Inline Classes is the way in which arrays are created.

In the case of primitives, each position of the array has the direct representation of the type; for example, the 64 bits of a long in case of an array of long.

In the case of the referenced object instead, the array will only contain the reference to the object allocated memory on the heap.

This means that reading an object from an array involves first reading the reference and then fetching the memory from its actual location. If the object has referenced fields, those would also cause data to be fetched from a remote memory location.

Inline Types on the other side work like primitives.

To have a taste of how much this matters, let’s make a simple test program to sum all trades from a given account in a big array.

Here is our trade’s simple representation, with an amount, an account and a traded security.

public class TradeRef {
    public final double amount;
    public final String account;
    public final String security;

    public TradeRef(double amount, String account, String security) {
        this.amount = amount;
        this.account = account;
        this.security = security;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (o == null || getClass() != o.getClass()) return false;
        TradeRefEncoded that = (TradeRefEncoded) o;
        return Double.compare(that.amount, amount) == 0 &&
                account == that.account &&
                security == that.security;
    }

    @Override
    public int hashCode() {
        return Objects.hash(amount, account, security);
    }
}

Similarly, we will define a TradeInline class, identical with the inline modifier:

inline public class TradeInline {
    final double amount;
    final String account;
    final String security;

    public TradeInline(double amount, String account, String security) {
        this.amount = amount;
        this.account = account;
        this.security = security;
    }
}

After the compilation, we can use javap from the command line to print the generated ByteCode:

javap -s antbuild/com/ubertob/ministring/TradeInline.class 

And this is the output:

public final value class com.ubertob.ministring.TradeInline {
  final double amount;
    descriptor: D
  final java.lang.String account;
    descriptor: Ljava/lang/String;
  final java.lang.String security;
    descriptor: Ljava/lang/String;
  public final int hashCode();
    descriptor: ()I

  public final boolean equals(java.lang.Object);
    descriptor: (Ljava/lang/Object;)Z

  public final java.lang.String toString();
    descriptor: ()Ljava/lang/String;

  public static com.ubertob.ministring.TradeInline com.ubertob.ministring.TradeInline(double, java.lang.String, java.lang.String);
    descriptor: (DLjava/lang/String;Ljava/lang/String;)Qcom/ubertob/ministring/TradeInline;
}

Note the value modifier in the first line and the methods hashCode, equals, and toString that are not present in our class.

encoding a string into a long

Our TradeInline class has still two String inside as fields. Can we inline them?

Unfortunately at the moment is not possible to flatten String and Arrays inside an Inline type. There are future plans for something called Array 2.0 that would allow for that.

What we can do now instead, is to squeeze a String inside a long type, although we need to accept some limitations.

The long type has 64 bits, so using a base 64 encoding (6 bits) we can store up to 10 characters inside.

In our case, we assume that the Security and the Account have a maximum length of 10 characters and they are composed only of uppercase letters, numbers and a limited number of special characters.

This is the code of our MiniString inline type, with the encoding and decoding static functions:

inline public class MiniString {

    long raw;

    public MiniString(String str) {
        raw = encode(str);
    }

    public String get() {
        return decode(raw);
    }

    public static final int MAX_MINISTR_LEN = 10;
    public static final int MINI_STR_BASE = 64;
    public static final String letters = "=ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789 _-!?.$&%@#:[]{}()*<>:;',/^";

    public static long encode(String str) {
        String prepared = prepareString(str);
        long encoded = 0;
        for (char c : prepared.toCharArray()) {
            int x = letters.indexOf(c);
            encoded = encoded * MINI_STR_BASE + x;
        }
        return encoded;
    }

    private static String prepareString(String str) {
        StringBuilder prepared = new StringBuilder();
        for (char c : str.toCharArray()) {
            if (letters.indexOf(c) >= 0)
                prepared.append(c);
            if (prepared.length() > MAX_MINISTR_LEN)
                break;
        }
        return prepared.toString();
    }

    public static String decode(long number) {
        StringBuilder decoded = new StringBuilder();
        long remaining = number;

        while (true) {
            int mod = (int) ( remaining % MINI_STR_BASE);
            char c = letters.charAt(mod);
            decoded.insert(0, c);
            if ( remaining < MINI_STR_BASE)
                break;
            remaining = remaining / MINI_STR_BASE;
        }
        return decoded.toString();
    }
}

We will define a TradeMiniString inline type using our new MiniString for Account and Security.

This is the generated ByteCode:

public final value class com.ubertob.ministring.TradeMiniString {
  final double amount;
    descriptor: D
  final com.ubertob.ministring.MiniString account;
    descriptor: Qcom/ubertob/ministring/MiniString;
  final com.ubertob.ministring.MiniString security;
    descriptor: Qcom/ubertob/ministring/MiniString;
  public final int hashCode();
    descriptor: ()I

  public final boolean equals(java.lang.Object);
    descriptor: (Ljava/lang/Object;)Z

  public final java.lang.String toString();
    descriptor: ()Ljava/lang/String;

  public static com.ubertob.ministring.TradeMiniString com.ubertob.ministring.TradeMiniString(double, java.lang.String, java.lang.String);
    descriptor: (DLjava/lang/String;Ljava/lang/String;)Qcom/ubertob/ministring/TradeMiniString;
}

Note how the descriptor for the MiniString account starts with the Q of inline types and not with the L of the reference types.

Finally, to have a fair comparison, we also define a TradeRefEncoded class using the same trick to encode an Account and Security in a field of type long.

You can find the complete code in my GitHub repository.

Before looking at the performance, let’s look how our Trade objects are using the memory.

First the standard TradeRef object:

On the left, there is our big array. Then a pointer to the TradeRef object, and then two other pointers to the strings.

Depending on the order of creation, these can be quite far apart in our heap.

This is a performance problem because fetching memory from different regions is one of the slowest operations for a CPU.

This classic post from Martin Thomson is one of the better explanations:
https://mechanical-sympathy.blogspot.com/2012/08/memory-access-patterns-are-important.html

Let’s look at a simple diagram of the InlineTrade value memory representation now:

Here the whole value is directly on the array. For this reason, the array is actually bigger than in the first case. If you don’t allocate all objects, in other words, if your array mostly contains null values, this can be a drawback.

Finally, let’s see how the TradeMiniString is allocated in the memory:

You can see that it stays completely inside the array element, with no external pointers.

This is possible only because we accept big limitations on what can be contained in those strings; still, it’s an acceptable compromise when using strings only for storing ID from a database or the stock ticker symbols.

Performance Comparison

To compare the performance we can create four arrays of 5 million elements, one for each of our trade types:

    static final int arraySize = 5_000_000;

    public TradeRef[] tradeRefs = new TradeRef[arraySize];
    public TradeRefEncoded[] tradesRefEncoded = new TradeRefEncoded[arraySize];
    public TradeInline[] tradesInline = new TradeInline[arraySize];
    public TradeMiniString[] tradesMiniString = new TradeMiniString[arraySize];

Then we fill them with identical random values:

    public static void generatingTradesAndBrowing() {

        var tr = new TradeRepository();
        tr.fillWithRandomData();

        var searcherRef = new TradeRefBrowser(tr.tradeRefs);
        var searcherInline = new InlineTradeBrowser(tr.tradesInline);
        var searcherRefEncoded = new TradeRefEncodedBrowser(tr.tradesRefEncoded);
        var searcherMiniString = new MiniStringTradeBrowser(tr.tradesMiniString);

        var account = tr.tradeRefs[1000].account;

        while (true) {
            benchmarks(searcherRef, searcherInline, searcherRefEncoded, searcherMiniString, account);
        }
    }

And finally, we do searches on each one repeatedly, printing out the time elapsed.

    private static void benchmarks(TradeRefBrowser searcherRef, InlineTradeBrowser searcherInline, TradeRefEncodedBrowser searcherRefEncoded, MiniStringTradeBrowser searcherMiniString, String account) {
        cronoSum(() -> searcherRef.sumByAccountFor(account), "Ref with for");
        cronoSum(() -> searcherRef.sumByAccountStream(account), "Ref with stream");

        cronoSum(() -> searcherRefEncoded.sumByAccountFor(account), "RefEncoded with for");
        cronoSum(() -> searcherRefEncoded.sumByAccountStream(account), "RefEncoded with stream");

        cronoSum(() -> searcherInline.sumByAccountFor(account), "Inline with for");
        cronoSum(() -> searcherMiniString.sumByAccountFor(account), "MiniString with for");
    }

We don’t care much about specific numbers here, so we just keep looping and print the results. The timings tend to stabilize after a few minutes when the Java Hotspot compiler optimizes and inlines most of the methods.

We can filter and sum using two methods: a for loop or a nicer stream map and reduce:

    public double sumByAccountStream(String account){
        return Arrays.stream(repo)
                .filter( trade -> trade.account.equals(account))
                .map(trade -> trade.amount)
                .reduce(0.0, (a, b) -> a+b);
    }

    public double sumByAccountFor(String account){
        double res = 0;
        for (int i = 0; i < repo.length; i++) {
            TradeRef tradeRef = repo[i];
            if (tradeRef.account.equals(account))
                res = res + tradeRef.amount;
        }
        return res;
    }

Unfortunately, streams are not (yet) working with Inline Classes:

Exception in thread "main" java.lang.ClassFormatError: Illegal class name "Qcom/ubertob/ministring/TradeMiniString;" in class file <Unknown>
        at java.base/jdk.internal.misc.Unsafe.defineAnonymousClass0(Native Method)
        at java.base/jdk.internal.misc.Unsafe.defineAnonymousClass(Unsafe.java:1345)
        at java.base/java.lang.invoke.InnerClassLambdaMetafactory.spinInnerClass(InnerClassLambdaMetafactory.java:324)
        at java.base/java.lang.invoke.InnerClassLambdaMetafactory.buildCallSite(InnerClassLambdaMetafactory.java:192)
        at java.base/java.lang.invoke.LambdaMetafactory.metafactory(LambdaMetafactory.java:329)
        at java.base/java.lang.invoke.BootstrapMethodInvoker.invoke(BootstrapMethodInvoker.java:127)
        at java.base/java.lang.invoke.CallSite.makeSite(CallSite.java:307)
        at java.base/java.lang.invoke.MethodHandleNatives.linkCallSiteImpl(MethodHandleNatives.java:259)
        at java.base/java.lang.invoke.MethodHandleNatives.linkCallSite(MethodHandleNatives.java:249)
        at com.ubertob.ministring.MiniStringTradeBrowser.sumByAccountStream(Unknown Source)

So for the moment we can only measure performance using for loops.

Fancy Graphs (finally)

The actual numbers are not very important here, so I just grabbed a significative sample running the application on my Linux laptop.

Rather than the numbers, we are interested in how fast they are relative to each other.

time in millisec, small are better

We can see here how in this case the improvement is really huge: inline TradeMiniString is more than 20x faster!

Event compared with the TradeRefEncoded is still 6x faster!

This seems almost too good to be true, so we can use a better method to measure the actual performance.

Brendan Gregg wrote some very useful tools to use Linux kernel perf tool to profile Java programs.

There are several blog posts on how to produce and how to read flame graphs:
https://medium.com/netflix-techblog/java-in-flames-e763b3d32166

And here are our flames:

You can find the actual svg file in the GitHub repository here.

We can see four green blocks (including the one with the arrow) that are a representation of how much time these four methods used the CPU. In this graph, the green blocks are Java calls, while the red ones are kernel system calls.

The spikes are other methods, related to the print out of results and garbage collection. We can ignore them.

From the left, the first block is the InlineTrade, using normal strings. It is clearly smaller than the last two blocks which represent the TradeRef and TradeRefEncoded.

As you may have guessed, the block pointed by the arrow is the TradeMiniString duration, it is so brief that you can read only a few characters of the name.

The relative duration of each method from the perf profile is reported in this graph:

number of perf samples, smaller are better

The relative timings are similar, Inline is 6 times faster than Ref and Inline Encoded is 5.5 times faster than Ref Encoded.

Conclusions

It is still too early for precise measurements, but it is already clear that Valhalla inline types have the potential to bring a massive performance boost for a certain type of critical applications.

There are still many rough edges and painful decisions to make but the overall shape of the Valhalla project is very exciting.

Since the last Early-Access build there have already been many improvements in the source code, let’s hope to have another stable build to test soon with more features.


resources to learn more

The full code of this and other examples:
https://github.com/uberto/testValueTypes

A Recent video about Valhalla at Devoxx BE
https://devoxx.be/talk/?id=41660

Brian Goetz on the State of Valhalla
http://cr.openjdk.java.net/~briangoetz/valhalla/sov/02-object-model.html

An in-depth explanation of Inline Types from Ben Evans:
https://www.infoq.com/articles/inline-classes-java

My post on the previous early-access build:
https://medium.com/@ramtop/a-taste-of-value-types-1a8a136fcfe2

Author: Uberto Barbini

Uberto is a very passionate and opinionated programmer, he helps big and small organizations in delivering value. Trainer, public speaker, and blogger. Currently working on a book on Kotlin and Functional Programming. Google Developer Expert.
Exit mobile version