
JVM in the Age of AI: A Bird’s-Eye View for the Mechanical Sympathizers

Alright, hold on tight, because in today’s edition of the advent calendar we’re going to talk about AI! After all, it’s 2024 – do we really need any other reason?


Let’s start by discussing what this article will cover. Each time the topic of AI arises, everyone tends to think of something slightly different.

There’s a joke about what makes a good AI conference: paradoxically, it’s one where nobody actually uses the term “AI.” So, as a kind of establishing shot, let’s clarify what this article will really be about. We won’t be talking about training models, or about Data Engineering more broadly. We’ll also skip over LLM application frameworks (sorry, Langchain4j). Instead, we’ll focus on model inference—and more specifically, on all the peculiar things that need to happen inside the virtual machine and the JDK to handle this topic effectively.

I promise – it will be worth it!

But first, what is inference?

Imagine you have a magical gizmo capable of predicting “things”—with the precision of the Oracle of Delphi—but instead of a crystal ball, it uses math. In the world of artificial intelligence, this gizmo is a trained ML model. The model doesn’t pull predictions out of thin air; its “knowledge” comes from the data it was trained on. Inference is the process of applying this “knowledge” to new data to generate predictions.

For example:

  • When you ask a voice assistant about the weather, an ML model analyzes your query and predicts which weather data you need.
  • When you upload a photo to an image recognition app, the model identifies what’s in the picture—whether it’s a cat, a dog, or something more exotic.

To understand the challenges of inference, let’s break down an ML model. Think of it as a series of filter layers through which data flows. Each layer analyzes the data in its own way, with the output of one layer becoming the input for the next. These layers contain numerous learned weights—also called parameters—which are essentially mathematical variables adjusted during training. For models like GPT-4, the number of parameters reaches hundreds of billions.

At every stage, the input data is processed through various calculations, such as matrix multiplications and additions. These operations are the essence of a model’s function—like sifting data through a series of increasingly fine filters to extract the most relevant information.
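To make this concrete, here is a deliberately tiny, framework-free sketch of a single dense layer—essentially the matrix-multiply-plus-activation step described above. The class and method names (DenseLayer, forward) are purely illustrative and not taken from any real library.

public class DenseLayer {

    // output = relu(weights * input + bias) — one "filter layer" of the model
    static float[] forward(float[][] weights, float[] bias, float[] input) {
        float[] output = new float[weights.length];
        for (int row = 0; row < weights.length; row++) {
            float sum = bias[row];
            for (int col = 0; col < input.length; col++) {
                sum += weights[row][col] * input[col]; // the matrix multiplication at the heart of inference
            }
            output[row] = Math.max(0.0f, sum); // ReLU activation
        }
        return output;
    }

    public static void main(String[] args) {
        float[][] weights = {{0.2f, -0.5f}, {0.7f, 0.1f}};
        float[] bias = {0.1f, -0.2f};
        float[] input = {1.0f, 2.0f};
        // The output of this layer becomes the input of the next one.
        System.out.println(java.util.Arrays.toString(forward(weights, bias, input)));
    }
}

Multiply this by hundreds of layers and billions of weights, and the scale of the performance problem becomes obvious.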

All of this would be straightforward if the data could be processed layer by layer with unlimited time. But in reality, inference needs to be extremely fast:

  • A chatbot with a multi-second delay? Unacceptable.
  • An image recognition app taking 30 seconds? The user has already closed the app.

So, running modern ML models is like packing ever-larger luggage into a shrinking car. Each new model demands more memory and processing power. But where do these challenges originate?

Memory Bandwidth
Even the fastest machines have limits when transferring data between memory and the processor. Imagine trying to push water through a narrow faucet—no matter how large your water tank is, the flow is restricted. Similarly, in inference, models demand more data than the memory can deliver in real-time.

Computation Time
Every layer of the model requires processing enormous amounts of data. For instance, a language model like GPT-4 must compute the probability of the next word (or token) based on billions of parameters. Imagine assembling a puzzle where every piece needs to be checked against every other piece—and you have to do this in fractions of a second.

Model Scaling
Model sizes keep growing. Just a few years ago, a model with 100 million parameters was considered massive. Today, GPT-4 has hundreds of billions of parameters. It’s like comparing carry-on luggage to moving an entire house. Larger models mean:

  • More operations to perform.
  • More data to store and process.
  • Higher hardware requirements, which aren’t easy to meet.

In summary, inference is a balancing act between model accuracy and computational efficiency. The larger the model, the more precise its predictions can be – but the computational costs also increase. This is why optimization – both at the level of model architecture and hardware infrastructure – is critical for practical AI applications, particularly in the Java ecosystem.

Let’s address the elephant (or snake) in the room.

Reusing Python Solutions with GraalPython

Python has long been the undisputed leader in the field of artificial intelligence. Its syntax is simple, clear, and the availability of powerful libraries like TensorFlow, PyTorch, and scikit-learn makes it the first choice for researchers and engineers working on machine learning. In research labs and during rapid prototyping, Python reigns supreme.

However, behind this dominance lies a paradox: Python itself is not particularly efficient. Its success is driven by libraries written in more performant languages, such as C or C++, which handle the bulk of computational workloads.

In this ecosystem, Python acts more as an abstraction layer, enabling developers to easily tap into these capabilities. The challenge arises when transitioning from experimental phases to production environments, where scalability and performance are paramount. In such cases, Python’s limitations, such as the Global Interpreter Lock (GIL), become increasingly evident.

Enter GraalPython—a technology that aims to bridge the flexibility and versatility of Python with the performance and stability of the JVM. GraalPython is an implementation of Python running on the GraalVM platform, itself built on Truffle, an advanced framework for creating programming language interpreters. With Truffle, GraalPython leverages just-in-time (JIT) compilation, enabling dynamic code optimizations that improve runtime speed. Moreover, its integration with the JVM allows Python to become part of larger, scalable systems historically dominated by languages like Java.

GraalPython offers something unique: the ability to use Python libraries in JVM applications without complex technological bridging. Python code can execute seamlessly in the same environment as Java, Kotlin, or Scala, simplifying integration processes and reducing technical barriers. Take a look at this example of using NumPy:

try (Context context = Context.newBuilder("python")
        .allowAllAccess(true)
        .build()) {

    String pythonCode = """
            import numpy as np

            # Creating a 3x3 matrix
            matrix = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
            result = matrix.T
            result.tolist()
            """;

    Value result = context.eval("python", pythonCode);
    System.out.println("Transposed matrix: " + result);
} catch (Exception e) {
    e.printStackTrace();
}

While GraalPython opens doors to new possibilities, it is not without its limitations. Its compatibility with CPython—the standard implementation of Python—is restricted. This means that some advanced features or extensions may require adaptation or may not work at all. A particularly challenging area is native extensions like NumPy or TensorFlow, which rely on libraries written in C. Supporting such extensions requires additional mechanisms, such as LLVM, which can pose technical challenges (though it must be said that the GraalVM team is making progress with every release).

Another challenge is the project’s relative youth. GraalPython is still maturing, and its community and technical support are not as robust as CPython’s. This can be a barrier for large-scale production projects that demand full reliability and broad compatibility.

GraalPython will not replace Python in its traditional role as a tool for research and prototyping. Nor does it aspire to become a new standard. Instead, it offers an alternative path—especially for teams that need seamless Python-JVM integration and want to leverage the strengths of both ecosystems. It represents a step toward combining Python’s flexibility with Java’s robustness, which is an impressive achievement in itself. I believe this will become a highly viable approach – perhaps we’ll need to wait for another two GraalVM releases (which isn’t that long, realistically) before I can confidently recommend it for production use to less adventurous teams.

So let’s get back to the JDK and the JVM, because now it’s time to take a closer look.

Float16 – Precision in Computation is not always a good thing

Precision is fundamental in computing – we want our programs to be precise – but do we always need it in abundance? Sometimes we can afford to take a shortcut. It’s time to understand the nuances of precision, especially the role of the mantissa.

Let’s introduce some math theory!

Floating-point numbers represent real numbers in computer memory, enabling operations on extremely large and small values. These numbers consist of three parts: the sign, indicating whether the number is positive or negative; the exponent, which determines the scale of the number; and the mantissa, which defines its precision.

For instance, the number 6.75 in floating-point format can be represented as:

6.75 = 1.6875 × 2^2

  • Mantissa: 1.6875, storing the number’s fractional details (here: 1 + 0.6875).
  • Exponent: 2, indicating how far the decimal point is “shifted.”

In the float32 format, the mantissa is 23 bits long, allowing for highly precise representations, such as 1.6875, expressed as 1.1011… in binary. By comparison, float16 reduces the mantissa to 10 bits, capturing fewer details. This results in rounding and less precise values. For example:

  • float32: 6.75 remains exactly 6.75.
  • float16: 6.75 might be approximated as 6.7421875, reflecting mantissa rounding.

In AI applications, such minor differences are negligible since models are trained on data with inherent noise. Would the difference between 6.75 and 6.7421875 alter an image recognition or translation model’s performance? Usually, the answer is no.

Reducing precision brings immediate benefits. Each number takes up less memory – critical for models with hundreds of billions of parameters, like GPT-4, where switching from float32 to float16 allows storing twice as much data in the same space. Computations are also faster since processors and GPUs can handle more numbers simultaneously when each is smaller. Finally, reduced data size lowers energy consumption and operational costs, making processing both economically and environmentally sustainable.

Until now, the JVM lacked native support for float16, forcing developers to rely on float32 even when unnecessary. Project Valhalla addresses this gap, introducing float16 to the Java Virtual Machine (JVM) ecosystem. This enables Java applications to fully leverage modern hardware, such as NVIDIA Tensor Cores, optimized for float16 operations. By embracing float16, the JVM becomes more competitive in AI computations and unlocks pathways to scalable, efficient solutions.
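Strictly speaking, float16 as a first-class type is still Valhalla’s job, but the JDK already ships the conversion helpers Float.floatToFloat16 and Float.float16ToFloat (since Java 20). The minimal sketch below uses them to show the precision loss discussed above; the sample value is arbitrary.

public class Float16Demo {
    public static void main(String[] args) {
        float original = 3.14159265f;

        short asFloat16 = Float.floatToFloat16(original);     // IEEE 754 binary16, stored in a short
        float roundTripped = Float.float16ToFloat(asFloat16); // back to float32

        System.out.println("float32:     " + original);
        System.out.println("via float16: " + roundTripped);
        System.out.println("lost detail: " + (original - roundTripped));
    }
}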

However, memory size is just one part of the story. There is another, even more crucial component that we mentioned in the introduction – memory bandwidth.

Project Panama and jextract – a modern replacement for JNI

Off-heap memory plays an important role in systems designed for large model inference, where performance and precise resource management are crucial. In Java’s traditional memory model, data is allocated on the heap, and the garbage collector automatically manages its lifecycle. This approach works well in many applications, greatly simplifying memory management (as evidenced by the fact that many of you, readers, have likely never managed memory manually). However, in inference processes requiring operations on large datasets and precise resource utilization, this approach can lead to challenges. Heap memory limitations, particularly in handling large objects, become especially apparent in environments where models need to operate in real time or on massive input matrices.

Using off-heap memory helps overcome these limitations. It provides full control over memory allocation and deallocation, eliminating the risk of unpredictable pauses in system operation, such as during critical inference tasks. This is especially useful for operations on large data blocks, such as weight matrices or tensor buffers, which in traditional heap memory can lead to fragmentation or exceed permissible object size limits. It’s worth noting that popular computation accelerators, such as GPUs with CUDA technology or other hardware accelerators, provide their libraries exclusively in native languages like C or C++. Utilizing off-heap memory enables direct integration with these libraries, eliminating the need for data copying between the JVM and native environments, significantly speeding up the inference process – a topic we will explore further.

Traditional approaches using the Java Native Interface (JNI), while effective, are complex and error-prone (as anyone who has ever compiled C headers for use with Java knows). This is where Project Panama, particularly the jextract tool, comes into play. Jextract automates the generation of bindings to native libraries from C/C++ header files. This allows developers to seamlessly integrate native libraries with Java code without manually writing JNI, reducing the risk of errors and shortening implementation time.

Jextract also supports new mechanisms such as the Foreign Function & Memory API, which enable safe and efficient off-heap memory management. In the context of large model inference, these tools open new possibilities for resource optimization, overcoming the limitations of traditional methods such as memory fragmentation or object size constraints on the heap. As a result, these solutions enhance the efficiency of inference systems, providing integration with modern computation acceleration technologies. The Internet is full of great Project Panama tutorials; the finer details, including the design docs, can be found in the official project repository on GitHub.
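To make the off-heap idea concrete, here is a minimal sketch using the Foreign Function & Memory API (java.lang.foreign, finalized in JDK 22). It allocates a native buffer for model weights outside the heap and frees it deterministically when the arena closes; the buffer size and contents are purely illustrative. The same segment could be handed straight to a native library through a jextract-generated binding, with no copying between the JVM and native code.

import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

public class OffHeapWeights {
    public static void main(String[] args) {
        int numWeights = 1_000_000; // illustrative size

        // A confined arena: memory is released deterministically on close(),
        // with no garbage-collector involvement and no heap object size limits.
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment weights = arena.allocate(ValueLayout.JAVA_FLOAT, numWeights);

            // Fill the native buffer with some dummy weights.
            for (int i = 0; i < numWeights; i++) {
                weights.setAtIndex(ValueLayout.JAVA_FLOAT, i, 0.01f * (i % 100));
            }

            System.out.println("first weight: " + weights.getAtIndex(ValueLayout.JAVA_FLOAT, 0));
        } // off-heap memory freed here
    }
}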


Alchemical Marriage of Valhalla and Panama – Vector API

It’s time to discuss a project that bridges key elements of both Valhalla and Panama while unlocking new possibilities for high-performance computing. The Vector API, incubating in Java for several releases, merges Valhalla’s philosophy—enhancing performance through new primitive data types—with Panama’s approach of simplifying access to low-level hardware capabilities. This tool takes the language to a new level, offering explicit support for vector operations on data, which is critical in applications like machine learning inference, graphics processing, and numerical simulations.

Autovectorization, a feature of traditional Just-In-Time (JIT) compilers, has long been used to optimize Java by transforming certain loops into vectorized operations. However, its capabilities have been limited to simple, easily recognizable patterns in the code, making it unpredictable for developers. The Vector API, developed within the scope of the Panama and Valhalla projects, elevates vectorization to a new level by providing Java developers with an explicit and precise tool for designing efficient SIMD (Single Instruction, Multiple Data) operations.

In data processing, SIMD fundamentally differs from the traditional SISD (Single Instruction, Single Data) model, where each operation processes a single data element at a time. In SISD, for instance, adding two arrays involves the processor iterating through each element and performing the operation sequentially. SIMD, on the other hand, allows the same operation to be executed on multiple data elements simultaneously. This means processors can handle entire blocks of data in parallel, leading to significant time savings when working with large datasets—a common scenario in ML model inference.

Traditional autovectorization in JIT worked behind the scenes and was constrained by what the JVM could infer from the code. The Vector API changes this paradigm. Now, developers can explicitly define vector operations, signaling that data should be processed as groups rather than individual elements. The JVM, using the Vector API, automatically maps these operations to the best available SIMD instructions, such as AVX-512 on modern processors, or older ones if the hardware has other constraints.


In practice, instead of relying on hidden autovectorization mechanisms, the Vector API gives developers full control over the process. This allows for designing operations that are optimized for modern hardware right from the start, without the need for writing manual C or assembly code.
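Below is a minimal sketch of what this explicit control looks like with the incubating jdk.incubator.vector API (run with --add-modules jdk.incubator.vector): element-wise addition of two float arrays, processed one hardware lane-width at a time, with a scalar tail loop for the leftovers.

import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorSpecies;

public class VectorAdd {
    // The preferred species matches the widest SIMD registers on the host CPU
    // (e.g. 16 floats per vector on AVX-512 hardware).
    private static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

    static void add(float[] a, float[] b, float[] c) {
        int i = 0;
        int upperBound = SPECIES.loopBound(a.length);
        // Main loop: one vector operation handles SPECIES.length() elements at once.
        for (; i < upperBound; i += SPECIES.length()) {
            FloatVector va = FloatVector.fromArray(SPECIES, a, i);
            FloatVector vb = FloatVector.fromArray(SPECIES, b, i);
            va.add(vb).intoArray(c, i);
        }
        // Scalar tail for the remaining elements.
        for (; i < a.length; i++) {
            c[i] = a[i] + b[i];
        }
    }

    public static void main(String[] args) {
        float[] a = new float[1_000], b = new float[1_000], c = new float[1_000];
        java.util.Arrays.fill(a, 1.0f);
        java.util.Arrays.fill(b, 2.0f);
        add(a, b, c);
        System.out.println(c[0] + " ... " + c[c.length - 1]);
    }
}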

The Vector API, although still in incubation, is being actively developed, with its full implementation tied to the completion of Valhalla. Valhalla introduces new primitive data types and optimized handling of data structures, enabling the Vector API to reach its full potential within the Java ecosystem. Even now, several projects like Llama3.Java, JLama, and JVector are leveraging the Vector API, showcasing its immense potential.

  • Llama3.Java implements language models like LLaMA in the JVM ecosystem, using the Vector API to accelerate inference.
  • JLama is a modern inference engine for large language models (LLMs) written entirely in Java. It utilizes the Vector API and Project Panama to speed up inference processes, enabling Java developers to efficiently leverage models available on platforms like Hugging Face.
  • JVector is a fully Java-based vector search engine, employing modern graph algorithms inspired by DiskANN. Used by DataStax Astra DB and planned for integration with Apache Cassandra, JVector uses Project Panama to accelerate index building and querying, making it a cutting-edge tool for vector search in the Java ecosystem.

With the Vector API, Java is becoming a serious contender in domains previously dominated by C++.

However…


Step up – Accelerators

Although vector operations on CPUs, supported by technologies like the Vector API, significantly accelerate computations, the true revolution in computational performance came with the development of GPUs. Graphics cards, originally designed for rendering 3D graphics, have evolved into powerful computational machines capable of taking over tasks where CPUs, even with advanced SIMD instructions, couldn’t match their performance.

The origins of GPUs (Graphics Processing Units) trace back to the 1990s, when graphics cards began incorporating dedicated processors for advanced graphical functions like geometric transformations and lighting. A major breakthrough came in 1999 with the release of the NVIDIA GeForce 256—the first chip marketed as a “GPU.” Equipped with a built-in T&L (Transform and Lighting) engine, this card could handle complex graphical computations, relieving the CPU of these tasks.

Over the following years, the introduction of programmable shaders marked a transformation of GPUs from purely graphics-focused chips to more versatile computational platforms. This allowed developers to leverage GPUs for tasks beyond graphics—such as physics simulations, data analytics, and image processing.

A pivotal moment in this evolution was the launch of NVIDIA’s CUDA (Compute Unified Device Architecture) platform in 2006. CUDA opened GPUs to a wide range of general-purpose computing applications (GPGPU—General-Purpose Computing on Graphics Processing Units). GPUs began accelerating machine learning, numerical computations, and scientific simulations, offering an advantage through thousands of parallel cores capable of processing large datasets simultaneously.

Around the same time, open standards like OpenCL emerged, enabling the use of GPUs from various manufacturers, including AMD and Intel. As GPU applications grew, their architecture was adapted to meet computational demands, with modern NVIDIA GPUs featuring Tensor Cores specifically designed for matrix computations in deep learning.

The fundamental difference between a CPU and a GPU lies in their architecture. CPUs are optimized for tasks requiring high logical complexity and sequential processing, which is why they feature a few powerful cores. GPUs, on the other hand, are designed for massive parallel processing, making them ideal for operations on large data matrices—crucial in machine learning and graphics.

CPU vector operations, supported by SIMD and technologies like AVX-512, accelerate data processing but have their limitations. GPUs, with thousands of cores, can process thousands of threads simultaneously, offering a significant advantage for tasks like AI model training, scientific computations, or physical simulations. The advent of libraries such as TensorFlow and PyTorch, with built-in GPU support, has made graphics cards a standard in AI processing.

Java has traditionally been a CPU-focused language. While the Vector API and SIMD operations on CPUs offer considerable speed-ups, GPUs remain irreplaceable for the most demanding computational tasks. However, our language is evolving (as you might have guessed from this lengthy introduction) towards effectively leveraging both technologies, creating a versatile ecosystem for building advanced applications.


GPUs in Java – Initial approach

Using GPUs in Java relies on efficient parallel processing, which requires structures like kernels and ndgrid. A kernel is a small piece of code that runs at the same time across hundreds or thousands of GPU threads, speeding up tasks like matrix calculations, simulations, or scientific computations. Ndgrid organizes these threads into blocks and grids, helping to manage hardware resources and coordinate data sharing. Without these structures, using GPUs effectively would be very difficult, as programmers would need to handle the complexity of parallel processing manually. JCuda provides tools in Java to define kernels in CUDA and manage their setup, but this requires a deep understanding of GPU architecture and can be error-prone.

Aparapi and Rootbeer tried to make GPU programming easier by letting developers write kernels directly in Java. Aparapi translated parts of Java code into OpenCL, handling GPU configuration automatically but limiting it to operations that could be converted into OpenCL. Rootbeer went further by allowing code written entirely in Java to be analyzed and turned into parallel GPU tasks. While these approaches made it easier to use GPUs, they also restricted advanced control over GPU resources, which could be a problem for complex applications.

Project Sumatra, created by Oracle, aimed to integrate GPUs into Java’s Stream API, allowing GPU execution without the need for developers to understand kernels or ndgrid. It promised to make GPU processing simple and accessible for many Java users. However, the project was stopped because of challenges like the lack of GPU standardization (CUDA vs. OpenCL) and difficulties managing communication between user software and hardware. These issues, combined with the rise of popular tools like TensorFlow and PyTorch, led Oracle to end the project.

TornadoVM – Engine for Heterogeneous Programming

It’s finally time to dive into “modern” projects. TornadoVM is a tool that allows Java developers to speed up their applications by offloading them to hardware like GPUs, FPGAs, or multi-core CPUs, without the need to learn complex low-level code. It can be compared to a “translator” that takes Java code and converts it into instructions that such devices can understand. With TornadoVM, developers remain in the familiar Java ecosystem, while the tool ensures that their applications run faster by leveraging modern hardware.

The Task-Graph in TornadoVM acts as an abstract computation flow model, enabling developers to define tasks and their interdependencies. This structure allows parts of an application to be identified for parallel processing, while also specifying the input and output data for each task. As a result, the Task-Graph becomes the foundation for mapping computational logic onto heterogeneous hardware platforms like GPUs or FPGAs, while maintaining application flexibility and scalability.

int size = 512;

Float2D a = new Float2D(size, size); // Matrix A
Float2D b = new Float2D(size, size); // Matrix B
Float2D c = new Float2D(size, size); // Resulting matrix C
Float2D d = new Float2D(size, size); // Resulting matrix D

for (int i = 0; i < size; i++) {
    for (int j = 0; j < size; j++) {
        a.set(i, j, i + j);
        b.set(i, j, i - j);
    }
}

TaskGraph taskGraph = new TaskGraph()
        .task("matrixAddition", MultiTaskGraph::add, a, b, c)
        .task("matrixScaling", MultiTaskGraph::scale, c, d, 2.0f);

TornadoRuntime.getTornadoRuntime().submit(taskGraph);

System.out.println(c.get(0, 0));

This is a crucial tool for simplifying the management of complex data flows, allowing developers to focus on application logic instead of low-level implementation details.

Execution Plans form a control layer over the Task-Graph, ensuring optimization and precise management of task execution. They enable dynamic task assignment to appropriate computing devices like GPUs or CPUs, depending on available resources and application requirements. This structure also allows for profiling and debugging mechanisms, making the entire process transparent to the developer. Execution Plans not only improve performance but also allow applications to adapt in real-time to changing environmental conditions, such as system load or hardware availability. The concept is somewhat analogous to how database engines optimize SQL queries.
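As a sketch only—the exact class and method names may differ between TornadoVM releases—an execution plan wrapping the task graph defined above could look roughly like this:

// Freeze the task graph and hand it to an execution plan,
// which picks (or lets us pin) the target device and can enable profiling.
ImmutableTaskGraph immutableTaskGraph = taskGraph.snapshot();
TornadoExecutionPlan executionPlan = new TornadoExecutionPlan(immutableTaskGraph);
executionPlan.execute();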

With these mechanisms, TornadoVM introduces a high-level parallel programming model that harmoniously combines the simplicity of the Java ecosystem with the capabilities of modern hardware accelerators. If you want to learn more (and with the proper level of detail), please check The TornadoVM Programming Model Explained. It is currently one of the most exciting projects for GPU (and other “accelerator”) programming on the JVM. One of the most exciting, because there’s still another cherry on our cake.


La Grande Finale – HAT: Heterogeneous Accelerator Toolkit

The Heterogeneous Accelerator Toolkit (HAT) is an official JDK project announced at the JVM Language Summit 2023. Its goal is to create a unified ecosystem enabling seamless collaboration between diverse hardware technologies, such as the aforementioned CPUs, GPUs, and FPGAs.

 

A key feature of HAT is the automation of translating Java code into optimized forms for various hardware accelerators. Through advanced mechanisms like code reflection and integration with Project Babylon, applications can be analyzed in real-time and adjusted to the specific hardware they are running on. This means that even complex operations, such as matrix calculations or machine learning models, can be executed with maximum efficiency regardless of the platform.

Project Babylon is an advanced extension of Java’s traditional reflection, enabling deep runtime analysis and transformation of code logic to optimize performance for specific hardware. Unlike standard reflection, which focuses on metadata like class names or method signatures, Babylon introduces code reflection, allowing the JVM to inspect and modify the actual computational structure of code, such as loops and operations, in real time. This enables dynamic translation of high-level Java code into hardware-specific implementations.

HAT operates on the concept of Code Models, which allow for the abstract representation of program logic, enabling the JVM to dynamically map that logic onto different platforms. For example, a Java method performing matrix operations can automatically be transformed into CUDA code optimized for GPUs. If a GPU is unavailable, HAT switches the application to a high-performance version running on the CPU, leveraging SIMD instructions like AVX. All of this happens without developer intervention, eliminating the complexity of manually tailoring code to specific hardware requirements.

One unique aspect of HAT is its seamless integration with heterogeneous computational environments. Developers write Java code focusing solely on application logic, while the JVM handles translation and optimization automatically. This enables performance comparable to low-level programming languages such as C++ or CUDA, without sacrificing the convenience and safety typical of Java.

HAT’s potential extends far beyond traditional applications, unlocking new possibilities in fields like AI, High-Performance Computing (HPC), and large-scale data processing. Its automated memory management, dynamic code adaptation, and integration with modern accelerators position HAT as a main player in Java’s future.

Summary: Where are we now, and where are we heading?

Java may not yet be ready to replace Python in AI, but it’s making tremendous strides. Thanks to projects like GraalPy, Valhalla, TornadoVM, and HAT, the JVM could become a viable alternative for production environments looking to combine AI’s power with Java’s reliability.

However, a drop of bitterness spoils the picture—the entire ecosystem has already bet on slightly different horses. As we know, an alternative must be significantly better if it wants to replace the leading solution. Otherwise, we’ll spend the rest of our lives calling FastAPI through, ironically, an API.

So, while Java may not yet be ready to fully replace Python in AI, projects like HAT demonstrate that in the future the JVM can be a viable alternative for production environments seeking flexibility, reliability, and performance in a modern, heterogeneous computing landscape. We are simply racing against the clock.

PS: I know it was long… but I hope you all had as much fun on this journey as I did!

Author: Artur Skowronski

Head of Java & Kotlin Engineering @ VirtusLab • Editor of JVM Weekly Newsletter
