Introduction
Some time ago I came across a life-cycle management tool (or cloud service) called Valohai and was quite impressed by its user interface and the simplicity of its design and layout. I had a good chat about the service with one of the members of the Valohai team at the time and was given a demo. Before that I had written a simple pipeline using GNU Parallel, JavaScript, Python and Bash, and another one purely using GNU Parallel and Bash. I had also thought about replacing the moving parts with ready-to-use task/workflow management tools like Jenkins X, Jenkins Pipeline, Concourse or Airflow but, for various reasons, I did not proceed with the idea.
Coming back to our original conversation, I noticed a lot of the examples and docs on Valohai were based on Python, R and the respective frameworks and libraries. There was a lack of Java/JVM based examples or docs. So I took this opportunity to do something about that.
I was encouraged by Valohai to implement something using the famous Java library called DL4J — Deep Learning for Java.
My initial experience with Valohai had already left a good impression once I understood its design, layout and workflow. It is developer-friendly, and its makers have taken into consideration various facets of both developer and infrastructure workflows. In our world, the latter is mostly run by DevOps or SysOps teams, and we know the nuances and pain points attached to it. You can find out more about its features from the Features section of the site.
Achtung! Just to let you know that from here onwards the post will be a bit more technical and may contain code snippets and mention of deep learning/machine learning and infrastructure-related terminologies.
What do we need and how?
For any machine learning or deep learning project or initiative, these days the two important components (from a high-level perspective) are code that will create and serve the model and infrastructure where this whole life-cycle will be executed.
Of course, there are going to be steps and components needed before, during and after the above, but to keep things simple let’s say we need code and infrastructure.
Code
For code I have chosen a modified example using DL4J: an MNIST project with a training set of 60,000 images and a test set of 10,000 images of hand-written digits. This dataset is available via the DL4J library (just like Keras provides a stock of them). Look for the MnistDataSetIterator under DatasetIterators in the DL4J Cheatsheet for further details on this particular dataset.
Have a look at the source code we will be using before getting started. The main Java class is called org.deeplearning4j.feedforward.mnist.MLPMnistSingleLayerRunner.
Infrastructure
As is obvious by now, we have decided to try out the Java example using Valohai as our infrastructure to run our experiments (training and evaluation of the model). Valohai recognizes git repositories, hooks directly into them and allows execution of our code, irrespective of platform or language (we will see how this works). This also means that if you are a strong supporter of GitOps or Infrastructure-as-Code you will appreciate the workflow.
For this we just need an account on Valohai. We can sign up for a Free-tier account and get access to several instances of various configurations. See Free-tier under Plans and Pricing and the comparison chart for more details. For what we would like to do, the Free-tier is more than enough for now.
Deep Learning for Java and Valohai
As we agreed, we’re going to use these two technologies to achieve our goal of training a single layer model and evaluating it, as well as seeing what the end-to-end experience is like on Valohai.
We will bundle the necessary build and run-time dependencies into the Docker image and use it to build our Java app, train a model and evaluate it on the Valohai platform via a simple valohai.yaml file which is placed in the root folder of the project repository.
Deep Learning for Java: DL4J
The easy part is that we won’t need to do much here: just build the jar and download the dataset into the Docker container. We have a pre-built Docker image that contains all the dependencies needed to build a Java app. We have pushed this image to Docker Hub, and you can find it by searching for dl4j-mnist-single-layer (we will be using a specific tag as defined in the YAML file). We have chosen to use GraalVM 19.1.1 as our Java build and runtime for this project, so it is embedded into the Docker image (see the Dockerfile for the definition of the Docker image). To learn more about GraalVM check out the resources at the official site, graalvm.org, and Awesome Graal.
Orchestration
When the uber jar is invoked from the command-line, we land in the MLPMnistSingleLayerRunner class, which directs us to the intended action depending on the parameters passed in:
    public static void main(String[] args) throws Exception {
        MLPMnistSingleLayerRunner mlpMnistRunner = new MLPMnistSingleLayerRunner();
        JCommander.newBuilder()
                .addObject(mlpMnistRunner)
                .build()
                .parse(args);
        mlpMnistRunner.execute();
    }
The parameters passed into the uber jar are received by this class and handled by the execute() method.
We can create a model via the --action train parameter and evaluate the created model via the --action evaluate parameter, each passed to the Java app (uber jar).
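To make the dispatch idea concrete, here is a minimal sketch of how a value such as --action could be mapped to one of the two actions. It uses only the Java standard library instead of JCommander, and the names ActionDispatchSketch, flagValue and dispatch are made up for illustration; they are not taken from the project, whose execute() method may well be structured differently.

```java
import java.util.Locale;

// Hypothetical sketch: the class and method names are illustrative only and
// do not come from the mlpmnist-dl4j-example repository.
class ActionDispatchSketch {

    /** Returns the value that follows the given flag, or null if the flag is absent. */
    static String flagValue(String flag, String[] args) {
        for (int i = 0; i < args.length - 1; i++) {
            if (flag.equals(args[i])) {
                return args[i + 1];
            }
        }
        return null;
    }

    /** Maps the value of --action to a description of the work that would run. */
    static String dispatch(String action) {
        if (action == null) {
            throw new IllegalArgumentException("--action is required (train or evaluate)");
        }
        switch (action.toLowerCase(Locale.ROOT)) {
            case "train":
                return "training";
            case "evaluate":
                return "evaluation";
            default:
                throw new IllegalArgumentException("Unknown --action: " + action);
        }
    }

    public static void main(String[] args) {
        // Fall back to a sample command line so the sketch runs without arguments.
        String[] effective = args.length > 0 ? args : new String[] {"--action", "train"};
        System.out.println("Selected: " + dispatch(flagValue("--action", effective)));
    }
}
```

Failing fast on an unknown action keeps a mistyped parameter from silently doing nothing, which is handy when the parameters are filled in through a web form.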
The main parts of the Java app that do this work can be found in the two Java classes mentioned in the sections below.
Train a model
Can be invoked from the command-line via:
./runMLPMnist.sh --action train --output-dir ${VH_OUTPUTS_DIR}
or
java -Djava.library.path="" \
-jar target/MLPMnist-1.0.0-bin.jar \
--action train --output-dir ${VH_OUTPUTS_DIR}
This creates the model (when successful, at the end of the execution) by the name mlpmnist-single-layer.pb in the folder specified by the --output-dir parameter passed in at the beginning of the execution. From the perspective of Valohai, it should be placed into the ${VH_OUTPUTS_DIR}, which is what we do (see the valohai.yaml file).
For source code, see class MLPMNistSingleLayerTrain.java.
Evaluate a model
Can be invoked from the command-line via:
./runMLPMnist.sh --action evaluate --input-dir ${VH_INPUTS_DIR}/model
or
java -Djava.library.path="" \
-jar target/MLPMnist-1.0.0-bin.jar \
--action evaluate --input-dir ${VH_INPUTS_DIR}/model
This expects a model (created by the training step) by the name mlpmnist-single-layer.pb to be present in the folder specified by the --input-dir parameter passed in when the app is called.
For source code, see class MLPMNistSingleLayerEvaluate.java.
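As an illustration of that expectation, the sketch below resolves the model file inside a given input directory and fails early if it is missing. Only the file name mlpmnist-single-layer.pb comes from this post; ModelLocatorSketch and locateModel are hypothetical names, and the real evaluate class may handle a missing model differently.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical sketch: not the project's code, just one way to check the input.
class ModelLocatorSketch {

    // File name produced by the training step, as described in this post.
    static final String MODEL_FILE_NAME = "mlpmnist-single-layer.pb";

    /** Returns the path of the model inside inputDir, or throws if it is not a regular file. */
    static Path locateModel(String inputDir) throws NoSuchFileException {
        Path model = Paths.get(inputDir).resolve(MODEL_FILE_NAME);
        if (!Files.isRegularFile(model)) {
            throw new NoSuchFileException(model.toString());
        }
        return model;
    }

    public static void main(String[] args) throws IOException {
        // Demonstrate with a temporary folder standing in for ${VH_INPUTS_DIR}/model.
        Path dir = Files.createTempDirectory("model");
        Files.createFile(dir.resolve(MODEL_FILE_NAME));
        System.out.println("Found model at " + locateModel(dir.toString()));
    }
}
```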
I hope this short illustration makes it clear how the Java app that trains and evaluates the model works in general.
That’s all that is needed of us, but feel free to play with the rest of the source (along with the README.md and bash scripts) to satisfy your curiosity and understand how this is done! Further resources on how to use DL4J have been provided in the Resources section at the end of the post.
Valohai
Valohai as a platform allows us to loosely couple our runtime environment, our code, and our dataset, as you can see from the structure of the YAML file below. That way the different components can evolve independently without impeding or being dependent on one another. Hence our Docker container only has the build and runtime components packed into it. At execution time we build the uber jar in the Docker container, upload it to some internal or external storage, and then via another execution step download the uber jar and dataset from storage (or another location) to run the training. This way the two execution steps are decoupled; we can, for example, build the jar once and run hundreds of training steps on the same jar. As the build and runtime environments should not change that often, we can cache them, while the code, dataset and model sources can be made dynamically available at execution time.
The heart of integrating our Java project with the Valohai infrastructure is defining the steps of execution in the valohai.yaml file placed in the root of your project folder. Our valohai.yaml looks like this:
---

- step:
    name: Build-dl4j-mnist-single-layer-java-app
    image: neomatrix369/dl4j-mnist-single-layer:v0.5
    command:
      - cd ${VH_REPOSITORY_DIR}
      - ./buildUberJar.sh
      - echo "~~~ Copying the build jar file into ${VH_OUTPUTS_DIR}"
      - cp target/MLPMnist-1.0.0-bin.jar ${VH_OUTPUTS_DIR}/MLPMnist-1.0.0.jar
      - ls -lash ${VH_OUTPUTS_DIR}
    environment: aws-eu-west-1-g2-2xlarge

- step:
    name: Run-dl4j-mnist-single-layer-train-model
    image: neomatrix369/dl4j-mnist-single-layer:v0.5
    command:
      - echo "~~~ Unpack the MNist dataset into ${HOME} folder"
      - tar xvzf ${VH_INPUTS_DIR}/dataset/mlp-mnist-dataset.tgz -C ${HOME}
      - cd ${VH_REPOSITORY_DIR}
      - echo "~~~ Copying the build jar file from ${VH_INPUTS_DIR} to current location"
      - cp ${VH_INPUTS_DIR}/dl4j-java-app/MLPMnist-1.0.0.jar .
      - echo "~~~ Run the DL4J app to train model based on the the MNist dataset"
      - ./runMLPMnist.sh {parameters}
    inputs:
      - name: dl4j-java-app
        description: DL4J Java app file (jar) generated in the previous step 'Build-dl4j-mnist-single-layer-java-app'
      - name: dataset
        default: https://github.com/neomatrix369/awesome-ai-ml-dl/releases/download/mnist-dataset-v0.1/mlp-mnist-dataset.tgz
        description: MNist dataset needed to train the model
    parameters:
      - name: --action
        pass-as: '--action {v}'
        type: string
        default: train
        description: Action to perform i.e. train or evaluate
      - name: --output-dir
        pass-as: '--output-dir {v}'
        type: string
        default: /valohai/outputs/
        description: Output directory where the model will be created, best to pick the Valohai output directory
    environment: aws-eu-west-1-g2-2xlarge

- step:
    name: Run-dl4j-mnist-single-layer-evaluate-model
    image: neomatrix369/dl4j-mnist-single-layer:v0.5
    command:
      - cd ${VH_REPOSITORY_DIR}
      - echo "~~~ Copying the build jar file from ${VH_INPUTS_DIR} to current location"
      - cp ${VH_INPUTS_DIR}/dl4j-java-app/MLPMnist-1.0.0.jar .
      - echo "~~~ Run the DL4J app to evaluate the trained MNist model"
      - ./runMLPMnist.sh {parameters}
    inputs:
      - name: dl4j-java-app
        description: DL4J Java app file (jar) generated in the previous step 'Build-dl4j-mnist-single-layer-java-app'
      - name: model
        description: Model file generated in the previous step 'Run-dl4j-mnist-single-layer-train-model'
    parameters:
      - name: --action
        pass-as: '--action {v}'
        type: string
        default: evaluate
        description: Action to perform i.e. train or evaluate
      - name: --input-dir
        pass-as: '--input-dir {v}'
        type: string
        default: /valohai/inputs/model
        description: Input directory where the model created by the previous step can be found created
    environment: aws-eu-west-1-g2-2xlarge
Explanation of the step Build-dl4j-mnist-single-layer-java-app
From the YAML file, we can see that we define this step by first using the Docker image and then running the build script to build the uber jar. Our Docker image has the build environment dependencies set up (i.e. GraalVM JDK, Maven, etc.) to build a Java app. We do not specify any inputs or parameters as this is the build step. Once the build succeeds, we copy the uber jar called MLPMnist-1.0.0-bin.jar (original name) to the /valohai/outputs folder (represented by ${VH_OUTPUTS_DIR}). Everything within this folder automatically gets persisted within your project’s storage, e.g. an AWS S3 bucket. Finally, we define our job to run in the AWS environment.
Note: The Valohai free tier does not have network access from inside the Docker container (this is disabled by default). Please contact support to enable this option (I had to do the same), or else we cannot download our Maven and other dependencies during build time.
Explanation of the step Run-dl4j-mnist-single-layer-train-model
The semantics of the definition are similar to the previous step, except that we specify two inputs: one for the uber jar (MLPMnist-1.0.0.jar) and the other for the dataset (to be unpacked into the ${HOME}/.deeplearning4j folder). We will be passing the two parameters --action train and --output-dir /valohai/outputs. The model created from this step is collected into the /valohai/outputs/model folder (represented by ${VH_OUTPUTS_DIR}/model).
Note: In the Input fields in the Executions tab of the Valohai Web UI, we can select the outputs from previous executions by using the execution number, i.e. #1 or #2, in addition to using datum:// or http:// URLs. Typing in a few letters of the name of the file also helps search through the whole list.
Explanation of the step Run-dl4j-mnist-single-layer-evaluate-model
Again, this step is similar to the previous step, except that we will be passing in the two parameters --action evaluate and --input-dir /valohai/inputs/model. Also, we have again specified two inputs in the inputs: section of the YAML file, called dl4j-java-app and model, with no default set for either of them. This allows us to select, using the web interface, the uber jar and the model we wish to evaluate (the one created by the step Run-dl4j-mnist-single-layer-train-model).
Hope this explains the steps in the above definition file but if you require further help, please do not hesitate to look at the docs and tutorials.
Once we have an account, we can sign in and create a project by the name mlpmnist-single-layer, link the git repo https://github.com/valohai/mlpmnist-dl4j-example to the project and save it. Have a quick look at the tutorials to see how to create a project using the web interface.
Now you can execute a step and see how it pans out!
Building the DL4J Java app step
Go to the Executions tab in the web interface and either copy an existing execution or create a new one using the [Create execution] button. All the necessary default options will be populated. Select the step Build-dl4j-mnist-single-layer-java-app.
For Environment I would select AWS eu-west-1 g2.2xlarge and click on the [Create execution] button at the bottom of the page to see the execution kick off.
Training the model step
Go to the Executions tab in the web interface, do the same as in the previous step and select the step Run-dl4j-mnist-single-layer-train-model. You will have to select the Java app built in the previous step (just type jar in the field); the dataset has already been pre-populated via the valohai.yaml file.
Click on [Create execution] to kick off this step.
You will see the model summary fly by in the log console:
[<--- snipped --->]
11:17:05 =========================================================================
11:17:05 LayerName (LayerType) nIn,nOut TotalParams ParamsShape
11:17:05 =========================================================================
11:17:05 layer0 (DenseLayer) 784,1000 785000 W:{784,1000}, b:{1,1000}
11:17:05 layer1 (OutputLayer) 1000,10 10010 W:{1000,10}, b:{1,10}
11:17:05 -------------------------------------------------------------------------
11:17:05 Total Parameters: 795010
11:17:05 Trainable Parameters: 795010
11:17:05 Frozen Parameters: 0
11:17:05 =========================================================================
[<--- snipped --->]
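The parameter counts in the summary above can be checked by hand: a fully connected layer has nIn × nOut weights plus one bias per output unit. The small sketch below does that arithmetic for the two layers shown in the log; DenseParamCountSketch and denseParams are illustrative names, not part of DL4J.

```java
// Hypothetical helper for checking the numbers in the model summary by hand.
class DenseParamCountSketch {

    /** Weights (nIn * nOut) plus one bias per output unit. */
    static long denseParams(long nIn, long nOut) {
        return nIn * nOut + nOut;
    }

    public static void main(String[] args) {
        long layer0 = denseParams(784, 1000); // 784 * 1000 + 1000 = 785000
        long layer1 = denseParams(1000, 10);  // 1000 * 10 + 10   = 10010
        System.out.println("layer0: " + layer0);
        System.out.println("layer1: " + layer1);
        System.out.println("total:  " + (layer0 + layer1)); // 795010, as in the log
    }
}
```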
The models created can be found under the Outputs sub-tab in the Executions main tab, during and at the end of the execution.
You might have noticed several artifacts in the Outputs sub-tab. That’s because we save a checkpoint at the end of each epoch! Look out for these in the execution logs:
[<--- snipped --->]
11:17:14 o.d.o.l.CheckpointListener - Model checkpoint saved: epoch 0, iteration 469, path: /valohai/outputs/checkpoint_0_MultiLayerNetwork.zip
[<--- snipped --->]
The checkpoint zip contains the state of the model training at that point, saved in these three files:
configuration.json
coefficients.bin
updaterState.bin
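If you download one of these checkpoints from the Outputs sub-tab, you can peek inside it with nothing more than the JDK. The sketch below lists the entry names of a zip file; it treats the checkpoint as an ordinary zip archive, which is what the description above suggests. CheckpointZipSketch and entryNames are made-up names for this illustration.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical sketch: inspects a checkpoint archive without needing DL4J.
class CheckpointZipSketch {

    /** Returns the names of all entries in the zip file, sorted alphabetically. */
    static List<String> entryNames(Path zipPath) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipFile zip = new ZipFile(zipPath.toFile())) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                names.add(entries.nextElement().getName());
            }
        }
        Collections.sort(names);
        return names;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("Usage: java CheckpointZipSketch <path-to-checkpoint.zip>");
            return;
        }
        for (String name : entryNames(Paths.get(args[0]))) {
            System.out.println(name);
        }
    }
}
```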
Training the model > Metadata
You might have noticed these notations fly by in the execution logs:
[<--- snipped --->]
11:17:05 {"epoch": 0, "iteration": 0, "score (loss function)": 2.410047}
11:17:07 {"epoch": 0, "iteration": 100, "score (loss function)": 0.613774}
11:17:09 {"epoch": 0, "iteration": 200, "score (loss function)": 0.528494}
11:17:11 {"epoch": 0, "iteration": 300, "score (loss function)": 0.400291}
11:17:13 {"epoch": 0, "iteration": 400, "score (loss function)": 0.357800}
11:17:14 o.d.o.l.CheckpointListener - Model checkpoint saved: epoch 0, iteration 469, path: /valohai/outputs/checkpoint_0_MultiLayerNetwork.zip
[<--- snipped --->]
These notations trigger Valohai to pick up these values (in JSON format) and use them to plot execution metrics, which can be seen during and after the execution under the Metadata sub-tab in the Executions main tab.
We were able to do this by hooking a listener class (called ValohaiMetadataCreator) into the model, such that during training control is passed to this listener at the end of each iteration. In this class we print the epoch count, iteration count and the score (the loss function value). Here is a code snippet from the class:
    public void iterationDone(Model model, int iteration, int epoch) {
        if (printIterations <= 0)
            printIterations = 1;
        if (iteration % printIterations == 0) {
            double score = model.score();
            System.out.println(String.format(
                    "{\"epoch\": %d, \"iteration\": %d, \"score (loss function)\": %f}",
                    epoch,
                    iteration,
                    score)
            );
        }
    }
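One detail worth knowing about this approach: String.format without an explicit locale uses the JVM's default locale, so on a machine configured with a comma as the decimal separator %f would print something like 0,613774, which is not valid JSON. The sketch below shows a variant that pins the locale. MetadataLineSketch and metadataLine are hypothetical names, and this is a suggestion rather than what the project's listener does.

```java
import java.util.Locale;

// Hypothetical variant of the metadata line formatting, independent of DL4J.
class MetadataLineSketch {

    /** Builds one JSON metadata line; Locale.ROOT keeps '.' as the decimal separator. */
    static String metadataLine(int epoch, int iteration, double score) {
        return String.format(Locale.ROOT,
                "{\"epoch\": %d, \"iteration\": %d, \"score (loss function)\": %f}",
                epoch, iteration, score);
    }

    public static void main(String[] args) {
        // Same values as one of the log lines shown earlier in this section.
        System.out.println(metadataLine(0, 100, 0.613774));
    }
}
```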
Evaluating the model step
Once the model has been successfully created via the previous step, we are ready to evaluate it. We create a new execution just like we did previously but this time select the Run-dl4j-mnist-single-layer-evaluate-model step. We will need to select the Java app (MLPMnist-1.0.0.jar) again and the created model (mlpmnist-single-layer.pb) before kicking off the execution.
After selecting the desired model as input, click on the [Create execution] button. It is a quicker execution step than the previous one, and once the model has been analysed the evaluation metrics and the confusion matrix are displayed in the console logs.
We can see our training activity has resulted in a model that is nearly 97% accurate on the test dataset. The confusion matrix helps point out the instances where a digit has been incorrectly predicted as another digit. This may be good feedback for the creator of the model and the maintainer of the dataset to do some further investigation.
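To show how the accuracy figure relates to the confusion matrix, here is a small sketch that computes accuracy as the sum of the diagonal (correct predictions) divided by the sum of all cells. The 3×3 matrix in main is made up purely for illustration and has nothing to do with the actual MNIST results; ConfusionMatrixSketch is a hypothetical name, and the code assumes rows are actual classes and columns are predicted classes.

```java
// Hypothetical sketch: accuracy from a confusion matrix, independent of DL4J.
class ConfusionMatrixSketch {

    /** Rows are actual classes, columns are predicted classes. */
    static double accuracy(int[][] matrix) {
        long correct = 0;
        long total = 0;
        for (int actual = 0; actual < matrix.length; actual++) {
            for (int predicted = 0; predicted < matrix[actual].length; predicted++) {
                total += matrix[actual][predicted];
                if (actual == predicted) {
                    correct += matrix[actual][predicted];
                }
            }
        }
        if (total == 0) {
            throw new IllegalArgumentException("Confusion matrix contains no samples");
        }
        return (double) correct / total;
    }

    public static void main(String[] args) {
        // Made-up counts for three classes, 50 samples each.
        int[][] toy = {
            {48, 1, 1},
            {2, 47, 1},
            {0, 3, 47}
        };
        // 142 of the 150 samples sit on the diagonal.
        System.out.println("accuracy = " + accuracy(toy));
    }
}
```

The off-diagonal cells are the interesting ones when investigating a model: each tells you how often one digit was mistaken for a specific other digit.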
The question remains (and is outside the scope of this post) — how good is the model when faced with real-world data?
It’s easy to install and get started with the CLI tool, see Command-line Usage.
If you haven’t yet cloned the git repository then here’s what to do:
$ git clone https://github.com/valohai/mlpmnist-dl4j-example
We then need to link our Valohai project created via the web interface in the above section to the project stored on our local machine (the one we just cloned). Run the below commands to do that:
$ cd mlpmnist-dl4j-example
$ vh project --help ### to see all the project-specific options we have for Valohai
$ vh project link
You will be shown something like this:
[ 1] mlpmnist-single-layer
...
Which project would you like to link with /path/to/mlpmnist-dl4j-example?
Enter [n] to create a new project.:
Select 1 (or the selection appropriate for you) and you should see this message:
😁 Success! Linked /path/to/mlpmnist-dl4j-example to mlpmnist-single-layer.
The quickest way to know of all the CLI options with the CLI tool is:
$ vh --help
One more thing, before going any further ensure that your Valohai project is in sync with the latest git project, by doing this:
$ vh project fetch
The same can be done from the web interface, using the icon with two arrows pointing at each other on the top right side.
Now we can execute the steps from the CLI with:
$ vh exec run Build-dl4j-mnist-single-layer-java-app
Once the execution is on, we can inspect and monitor it via:
$ vh exec info
$ vh exec logs
$ vh exec watch
We can also see the above updates via the web interface at the same time.
Further resources on how to go about using Valohai have been provided in the Resources section at the end of the post. There are also a couple of blog posts on how to use the CLI tool, see [1] | [2].
Conclusion
As you have seen, both DL4J and Valohai, individually or combined, are fairly easy to get started with. Further, we can develop the different components that make up our experiments (i.e. build/runtime environment, code, and dataset) and integrate them into an execution in a loosely coupled manner.
The template examples used in this post are a good starting point for building more complex projects. You can use either the web interface or the CLI to get your job done with Valohai! With the CLI you can also integrate it with your own setup and scripts (or even with CRON or CI/CD jobs).
Also, it’s clear that if I’m working on an AI/ML/DL related project I don’t need to concern myself with creating and maintaining an end-to-end pipeline (which many others and I have had to do in the past) — thanks to the good work by the folks at Valohai.
Thanks to both Skymind (the startup behind DL4J) for creating, maintaining and keeping DL4J free, and Valohai for making this tool and cloud service available for both free and commercial use.
Please do let me know if this is helpful by dropping a line in the comments below or by tweeting at @theNeomatrix369, and I would also welcome feedback.
Resources
- mlpmnist-dl4j-examples project on GitHub
- Awesome AI/ML/DL resources
- Java AI/ML/DL resources
- Deep Learning and DL4J Resources
Additional DL4J resources
Loss functions
Evaluation
Valohai resources
- valohai | docs | blogs | GitHub | Videos | Showcase | About Valohai | Slack | @valohaiai
- Blog posts on how to use the CLI tool: [1] | [2]
Other resources