Native Speed File Backed Large Data Storage In ‘Pure’ Java

Motivation:

All this started with the realisation that I could not afford a big enough computer. Audio processing requires huge amounts of memory. Audacity, an amazing free audio processor, manages this using a file backed storage system. This is a common approach for such issues where we store a huge amount of information and want random access to it. So, I wanted to develop a system for Sonic Field (my pet audio processing/synthesis project) which gave the same powerful disk based memory approach but in pure Java.

I got this to work late last year and discussed it (briefly) in the Java Advent Calendar (http://www.javaadvent.com/2014/12/a-serpentine-path-to-music.html) overview of Sonic Field. Disk based memory allows Sonic Field to process audio systems which require huge amounts of memory on my humble 16 gigabyte laptop. For example, this recent piece took over 50 gigabytes of memory to create:

Whilst this was a breakthrough, it was also inefficient. Memory intensive operations like mixing were a bottle neck in this system. Here I turn Java into a memory power house by implementing the same system but very much more efficiently. I suspect I am getting near the limit at which Java is no longer at a performance disadvantage to C++.

Last year I gave a high level overview of the method; this year I am deep diving to implementation of performance details. In so doing I will explain how we can remove the overhead of traditional Java memory access techniques and then expand the ideas for a more general approach to sharing and persisting large memory systems in JVM programming.

What Is Segmented Storage?

I admit there are a lot of concepts here. The first one to get our heads around is just how inefficient normal memory management of large memory systems is in Java. Let me be very clear indeed, I am not talking about garbage collection. Years of experience with both Java and C++ has taught me that neither collected nor explicit heap management is efficient or easy to get right. I am not discussing this at all. The issues with the JVM’s management of large memory systems are because of its bounds checking and object model. This is thrown into sharp focus when working with memory pools.

As latency or throughput performance become more critical than memory use there comes a point where one has to break out memory pools. Rather than a memory system which mixes everything together in one great glorious heap, we have pools of same sized objects. This requires more memory than a pure heap if the pool is not fully used or if the elements being mapped into pool chunks are smaller than the chunks themselves. However, pools are very fast indeed to manage.

In this post I am going to discuss pool backed segmented storage. Segmented storage is based on a pool but allows allocation of larger storage containers than a single pool chunk. The idea is that a storage container (say 1 gigabyte) can be made up of a selection of chunks (say 1 megabyte each). The segmented storage region is not necessarily made up of contiguous chunks. Indeed, this is its most important feature. It is made up of equal sized chunks from a backing pool but the chunks are scatters across virtual address space and might not even be in order. With this we have something with the request and release efficiency of a pool but but closer to the memory use efficiency of a heap and without any concerns over fragmentation.

Let us look at what a pool looks like first; then we can come back to segmentation.

A pool, in this discussion, consists of these parts:

  1. A pool (not necessarily all in one data structure) of chunks of equal sized memory.
  2. One or more lists of used chunks.
  3. One list of free chunks.

To create a segmented memory allocation from a pool we have a loop:

  1. Create a container (array or some such) of memory chunks. Call this the segment list for the allocation.
  2. Take a chunk of memory off the free list and add it to the segment list.
  3. See if the segment list contains equal or more total memory than required.
  4. If not repeat from 2.

Now we have an allocation segment list which has at least enough memory for the requirement. When we free this memory we simply put the chunks back on the free list. We can see from this that very quickly the chunks on the free list will no longer be in order and even if we were to sort them by address, they still would not be contiguous. Thus any allocation will have enough memory but not in any contiguous order.

Here is a worked example:

We will consider 10 chunks of 1 megabyte which we can call 1,2…10 which are initial in order.

Start:
  Free List: 1 2 3 4 5 6 7 8 9 10
Allocate a 2.5 megabyte store:
  Free List: 1 2 3 4 5 6 7
  Allocated Store A: 8 9 10
Allocate a 6 megabyte store:
  Free List: 1 
  Allocated Store A: 8 9 10
  Allocated Store A: 7 6 5 4 3 2
Free Allocated Store A:
  Free List: 10 9 8 1
  Allocated Store A: 7 6 5 4 3 2
Allocate a 3.1 megabyte store:
  Free List: 
  Allocated Store A: 7 6 5 4 3 2
  Allocated Store C:10 9 8 1

One can note that such an approach is good for some situations for systems like 64bit C++ but its true power is for Java. In current JVMs the maximum addressable array or ByteBuffer contains only 2**31 elements segmented storage offers an efficient way of addressing much greater quantities of memory and backing that memory with memory mapped files if required.. Consider that we need 20 billion doubles, we cannot allocate them into an array or a ByteBuffer; but we can use segmented memory so that we can achieve our goal.

Using anonymous virtual memory in Java for very large memory objects can be inefficient. In use cases where we want to handle very much more memory than the RAM on the machine, we are better off using memory mapped files than just using anonymous swap space. This means that the JVM is not competing with other programs for swap space (to an extent) but what is more important is that garbage collected memory distributes object access which is particularly poor for anonymous virtual memory. We want to concentrate access to particular pages in the time domain so that we attract as few hard page faults as possible. I have discuss other concepts in this area here: https://jaxenter.com/high-speed-multi-threaded-virtual-memory-in-java-105629.html.

Given this. if we narrow our requirement to 20 billion doubles as a memory mapped file then we are not even going to be able to use magic in sun.misc.Unsafe (see later) to help. Without JNI the largest memory mapped file ‘chunk’ we can manage in Java is just 2^31 bytes. It is this requirement for memory mapped files and the inherent allocation/freeing efficiency of segmented storage approaches that lead to me using it for Sonic Field (where I often need to manage over 100G of memory on a 16G machine).

Drilling Into The Implementation:

We now have a clear set of ideas to implement. We need mapped byte buffers. Each buffer is a chunk in a pool for free chunks. When we want to allocate a storage container we need to take some of these mapped byte buffer chunks out of the free pool and into our container. When the container is freed we return our chunks to the free pool. Simple, efficient and clean.

Also, one important thing is that the mapped byte buffers are actually java.nio.DirectByteBuffer objects with file back memory. We will use this concept later; for now we can just think of them as ByteBuffers.

On Sonic Field (which is the code for which I developed the technique of segmented storage using mapped byte buffers. – see https://github.com/nerds-central/SonicFieldRepo). In that code base I have defined the following:

    private static final long  CHUNK_LEN        = 1024 * 1024;

To get the sample we can consider each chunk as a CHUNK_LEN ByteBuffer. The code for accessing an element from an allocated memory chunk was before my speedup work:

   private static final long  CHUNK_SHIFT      = 20;
   private static final long  CHUNK_MASK       = CHUNK_LEN - 1;
...
   public final double getSample(int index)
   {
       long bytePos = index << 3;
       long pos = bytePos & CHUNK_MASK;
       long bufPos = (bytePos - pos) >> CHUNK_SHIFT;
       return chunks[(int) bufPos].getDouble((int) pos);
   }

So the allocated segment list in this case is an array of ByteBuffers:

  1. Find the index into the list by dividing the index required by the chunk size (use shift for efficiency).
  2. Find the index into the found chunk by taking the modulus (use binary and for efficiency).
  3. Look up the actual value using the getDouble intrinsic method (looks like a method but the compiler knows about it an elides the method call).

All this looks fine, but it does not work out all that well because there are some fundamental issues with the way Java lays out objects in memory which prevent segmented access being properly optimised. On the face of it, accessing a segmented memory area should be a few very fast shift and logic operations and an indirect lookup but that does not work out so for Java; all the problems happen in this line:

 return chunks[(int) bufPos].getDouble((int) pos);

This is what this line has to do:

  1. Look up the chunks object from its handle.
  2. Bounds check.
  3. Get the data from its data area.
  4. From that object handle for the ByteBuffer look up actual object.
  5. Look up its length dynamically (it can change so this is a safe point and an object field lookup).
  6. Bounds check.
  7. Retrieve the data.

Really? Yes, the JVM does all of that which is quite painful. Not only is it a lot of instructions it also requires jumping around in memory will all the consequent cache line flushing and memory pauses.

How can we improve on this? Remember that our ByteBuffers are DirectByteBuffers, this means that their data is not stored on the Java heap; it is located in the same virtual address location throughout the object lifetime. I bet you have guessed that the key here is using sun.misc.Unsafe. Yes, it is; we can bypass all this object lookup by using offheap memory. To do so means bending a few Java and JVM rules but the dividends are worth it.

From now on everything I discuss is relevant to Java 1.8 x86_64. Future versions might break this approach as it is not standards compliant.

Consider this:

   private static class ByteBufferWrapper
   {
       public long       address;
       public ByteBuffer buffer;
       public ByteBufferWrapper(ByteBuffer b) throws
                      NoSuchMethodException,
                      SecurityException,
                      IllegalAccessException,
                      IllegalArgumentException,
                      InvocationTargetException
       {
           Method addM = b.getClass().getMethod("address");
           addM.setAccessible(true);
           address = (long) addM.invoke(b);
           buffer = b;
       }
   }

What we are doing is getting the address in memory of the data stored in a DirectByteBuffer. To do this I use reflection as DirectByteBuffer is package private. DirectByteBuffer has a method on it called address() which returns a long. On x86_64 the size of an address (64 bits) is the same as long. Whilst the value of long is signed, we can just used long as binary data and ignore its numerical value. So the long returned from address() is actually the virtual address of the start of the buffer’s storage area.

Unlike ‘normal’ JVM storage (e.g. arrays) the storage of a DirectByteBuffer is ‘off heap’. It is virtual memory just like any other, but it is not owned by the garbage collector and cannot be moved by the garbage collector; this makes a huge difference to how quickly and by which techniques we can access it. Remember, the address returned by address() never changes for a given DirectByteBuffer object; consequently, we can use this address ‘forever’ and avoid object lookups.

Introducing sun.misc.Unsafe:

Whilst it would be lovely to believe that calling getDouble(int) on a DirectByteBuffer is super efficient, it does not appear that it is so. The bounds check slows it down despite the method being intrinsic [a magic function which the JVM JIT compiler knows about and can replace with machine code rather than compiling in a normal fashion]. However, with our address we can now use sun.misc.Unsafe to access the storage.

Rather than:

b.getDouble(pos);

We can:

unsafe.getDouble(address+pos);

The unsafe version is also intrinsic and compiles down to pretty much the same machine code as a C compiler (like gcc) would produce. In other words, it is as fast as it can get; there are no object dereferences or bounds checks, it just loads a double from an address.

The store equivalent is:

unsafe.putDouble(address+pos,value);

What is this ‘unsafe’ thing? We get that with another reflection hack around:

   private static Unsafe getUnsafe()
   {
       try
       {
           Field f = Unsafe.class.getDeclaredField("theUnsafe");
           f.setAccessible(true);
           return (Unsafe) f.get(null);
       }
       catch (Exception e)
       {
           throw new RuntimeException(e);
       }
   }
   private static final Unsafe unsafe = getUnsafe();

It is important to load the unsafe singleton into a final static field. This allows the compiler to assume the object reference never changes and so the very most optimal code is generated.

Now we have very fast acquisition of data from a DirectByteBuffer but we have a segmented storage model so we need to get the address for the correct byte buffer very quickly. If we store these in an array we risk the array bounds check and the array object dereference steps. We can get rid of these by further use of unsafe and offheap memory.

   private final long  chunkIndex;
...
   try
   {
       // Allocate the memory for the index - final so do it here
       long size = (1 + ((l << 3) >> CHUNK_SHIFT)) << 3;
       allocked = chunkIndex = unsafe.allocateMemory(size);
       if (allocked == 0)
       {
           throw new RuntimeException("Out of memory allocating " + size);
      }
      makeMap(l << 3l);
   }
   catch (Exception e)
   {
       throw new RuntimeException(e);
   }

Again we use the ‘final’ trick to let the compiler make the very best optimisations. The final here is a long which is just an address. We can directly allocate offheap memory using unsafe. The imaginatively called function to do this is allocateMemory(long). This returns a long which we store in chunkIndex. allocateMemory(long) actually allocates bytes but we want to store what is effectively an array of longs (addresses); this is what the bit of bit twiddling logic is doing when it computes size.

Now that we have a chunk of offheap memory large enough to store the addresses for the DirectByteBuffer segments for our storage container we can put the addresses in and retrieve them using unsafe.

During storage construction we:

    // now we have the chunks we get the address of the underlying memory
   // of each and place that in the off heap lookup so we no longer
   // reference them via objects but purely as raw memory
   long offSet = 0;
   for (ByteBufferWrapper chunk : chunks)
   {
       unsafe.putAddress(chunkIndex + offSet, chunk.address);
       offSet += 8;
   }

Which means our new code for getting and setting data can be very simple indeed:

    private long getAddress(long index)
   {
       long bytePos = index << 3;
       long pos = bytePos & CHUNK_MASK;
       long bufPos = (bytePos - pos) >> CHUNK_SHIFT;
       long address = chunkIndex + (bufPos << 3);
       return unsafe.getAddress(address) + pos;
   }

   /* (non-Javadoc)
    * @see com.nerdscentral.audio.SFSignal#getSample(int)
    */
  [email protected]
   public final double getSample(int index)
   {
       return unsafe.getDouble(getAddress(index));
   }

   /* (non-Javadoc)
    * @see com.nerdscentral.audio.SFSignal#setSample(int, double)
    */
  [email protected]
   public final double setSample(int index, double value)
   {
       unsafe.putDouble(getAddress(index), value);
       return value;
   }

The wonderful thing about this is the complete lack of object manipulation or bounds checking. OK, if someone asks for at sample which is out of bounds, the JVM will crash. That might not be a good thing. This sort of programming is very alien to many Java coders and we need to take its dangers very seriously. However, it is really quite fast compared to the original.

In my experiments, I have found that the default JVM inline settings are a little too conservative to get the best out of this approach. I have seen large speedups (up to a two times performance improvement) with the following command line tweaks.

-XX:MaxInlineSize=128 -XX:InlineSmallCode=1024

These just let the JVM make a better job of utilising the extra performance available through not being forced to perform bounds checks and object lookups. In general, I would not advise fiddling with JVM inline settings, but in this case I have real benchmark experience to show a benefit for complex offheap access work.

Testing – How Much Faster Is It?

I wrote the following piece of Jython to test:

import math
from java.lang import System

sf.SetSampleRate(192000)
count=1000
ncount=100

def test():
   t1=System.nanoTime()
   for i in range(1,ncount):
       signal=sf.Mix(+signal1,+signal2)
       signal=sf.Realise(signal)
       -signal
   t2=System.nanoTime()
   d=(t2-t1)/1000000.0
   print "Done: " + str(d)
   return d

signal1=sf.Realise(sf.WhiteNoise(count))
signal2=sf.Realise(sf.WhiteNoise(count))
print "WARM"
for i in range(1,100):
   test()
   
print "Real"
total=0.0
for i in range(1,10):
   total+=test()

print "Mean " + str(total/9.0)

-signal1
-signal2

What this does is create some stored doubles and then create new ones and reading from the old into the new over and over. Remember that we are using segmented storage backed by a pool; consequently, we only truly allocate that storage initially and after that the ‘chunks’ are just recycled. This architecture means that our execution time is dominated by executing getSample and setSample, not allocation or any other paraphernalia.

How much faster is our off heap system? On my Macbook Pro Retina I7 machine with Java 1.8.0 I got these figures for the ‘Real’ (i.e. post warm up) operations (smaller is better):

For the unsafe memory model:

  • Done: 187.124
  • Done: 175.007
  • Done: 181.124
  • Done: 175.384
  • Done: 180.497
  • Done: 180.688
  • Done: 183.309
  • Done: 178.901
  • Done: 181.746
  • Mean 180.42

For the traditional memory model:

  • Done: 303.008
  • Done: 328.763
  • Done: 299.701
  • Done: 315.083
  • Done: 306.809
  • Done: 302.515
  • Done: 304.606
  • Done: 300.291
  • Done: 342.436
  • Mean 311.468

So our unsafe memory model is 1.73 times faster than the traditional Java approach!

Why Is It 1.73 Times Faster

We can see why.

If we look back at the list of things required to just read a double from the traditional DirectByteBuffer and array approach:

  1. Look up the chunks object from its handle.
  2. Bounds check.
  3. Get the data from its data area.
  4. From that object handle for the ByteBuffer look up actual object.
  5. Look up its length dynamically (it can change so this is a safe point and an object field lookup).
  6. Bounds check.
  7. Retrieve the data.

With the new approach we have:

  1. Retrieve the address of the chunk
  2. Retrieve the data from that chunk

Not only are there very many fewer machine instructions being issued, the memory access is much more localised which almost certainly enhances the cache usage during data processing.

The source code for the fast version of the storage system as described here is: https://github.com/nerds-central/SonicFieldRepo/blob/cf6a1b67fb8dd07126b0b1274978bd850ba76931/SonicField/src/com/nerdscentral/audio/SFData.java

I am hoping that you, the reader, have spotted one big problem I have not addressed! My code is allocating offheap memory when ever it creates a segmented storage container. However, this memory will not be freed by the garbage collector. We could try to free with with finalizers but there are many reasons why this is not such a great idea.

My solution is to use explicit resource management. Sonic Field uses try with resources to manage its memory via reference counts. When the reference count for a particular storage container hits zero, the container is freed which places it storage chunks back in the free list and uses unsafe to free the address lookup memory.

Other Uses And New Ideas

Nearly a year ago now I posted ‘Java Power Features To Stay Relevant‘; I guess it was a controversial post and not everyone I have spoken to about my ideas finds them agreeable (to say the least). Nevertheless, I still believe the JVM has a challenge on its hands. The complex multi-threaded model of Java and the JVM its self is not necessarily the huge benefit people think it should be in the world of multi-core computing. There is still a lot of interest in using multiple small processes which communicate via shared memory or sockets. With the slow but inevitable increase in RDMA based networking, these approaches will seem more and more natural to people.

Java and JVM languages seem to have managed to make themselves uniquely unable to take advantage of these shifts in thinking. By developing a ‘walled garden’ approach the JVM has become very efficient at working internally but not great at working with other processes. This is a performance issue and also a stability issue; no matter how hard we try, there is always a chance the JVM will crash or enter an unstable state (OutOfMemoryError anyone?). In production systems this often necessitates several small JVM instances working together so if one goes away the production system stays up. Memory mapped files are a great way to help with persisting data even when a JVM process goes away.

All these issues lead me to another reason I am very interested in efficient offheap, mapped file architectures for the JVM. This technology sits at the overlap of shared memory and mapped file technologies which are now driving forces behind high speed, stable production environments. Whilst the system I discussed here is for a single JVM, using offheap atomics (see here: http://nerds-central.blogspot.co.uk/2015/05/synchronising-sunmiscunsafe-with-c.html) we can put the  free list offheap and share it between processes. Shared memory queues can then also give interprocess arbitration of segmented storage allocation and utilisation. Suddenly, the segmented storage model becomes an efficient way for multiple processes, both JVM and other technologies (Python, C++ etc) to share large, file persisted memory systems.

Right now there are some issues. The biggest of which is that whilst Java supports shared memory via memory mapped files it does not support that via pure shared memory. File mapping is an advantage if we are interested in large areas of memory (as in this example) but it is an unnecessary performance issue for small areas of rapidly changing memory which do not require persistence. I would like to see a true shared memory library in the JDK; this is unlikely to happen any time soon (see my point about a walled garden). JNI offers a route but then JNI has many disadvantages we well. Maybe project Panama will give the required functionality and finally break down the JVM’s walls.

To bring all this together the next trick I want to try is mapping files to a ramdisk (there is an interesting write up on this here: http://www.jamescoyle.net/knowledge/951-the-difference-between-a-tmpfs-and-ramfs-ram-disk). This should be quite easy on Linux and would let us place interprocess queues in a pure RAM shared memory areas without using JNI. With this piece done, a pure Java high speed interprocess shared memory model would be insight. Maybe that will have to wait for next year’s calendar?

A Musical Finale

What could be more fitting than Christmas music for Christmas Eve?

In this post I want to discuss the joy of making music with Java and why/how I have come to use Python…

But first, let us celebrate the season!

We are all human and irrespective of our beliefs, it seems we all enjoy music of some form. For me some of the most beautiful music of all was written by Johan Sebastian Bach. Between 1708 and 1717 he wrote a set of pieces which are collectively called Orgelbüchlein (Little Organ Book). For this post and to celebrate the Java Advent Calendar I tasked Sonic Field to play this piece of music modelling the sounds of a 18th century pipe organ. If you did not know, yes some German organs of about that time really were able to produce huge sounds with reed pipes (for example, Passacaglia And Fugue the Trost Organ). The piece here is a ‘Choral Prelude’ which is based on what we would in English commonly call a Carol to be sung by an ensemble.

BWV 610 Jesu, meine Freude [Jesus, my joy]
This performance dedicated to the Java Advent Calendar
and created exclusively on the JVM using pure
mathematics.
How was this piece created?
Step one is to transcribe the score into midi. Fortunately, someone else already did this for me using automated score reading software. Not so fortunately, this software makes all sorts of mistakes which have to be fixed. The biggest issue with automatically generated midi files is that they end up with overlapped notes on the same channel; that is strictly impossible in midi and ends up with an ambiguous interpretation of what the sound should be. Midi considers audio as note on, note off. So Note On, Note On, Note Off, Note Off is ambiguous; does it mean:

One note overlapping the next or:
—————–
             —————

One note entirely contained in a longer note?
—————————-
             —-

Fortunately, tricks can be used to try and figure this out based on note length etc. The Java decoder always treats notes as fully contained. The Python method looks for very short notes which are contained in long ones and guesses the real intention was two long notes which ended up overlapped slightly. Here is the python (the Java is here on github).


def repareOverlapMidi(midi,blip=5):
print "Interpretation Pass"
mute=True
while mute:
endAt=len(midi)-1
mute=False
index=0
midiOut=[]
this=[]
next=[]
print "Demerge pass:",endAt
midi=sorted(midi, key=lambda tup: tup[0])
midi=sorted(midi, key=lambda tup: tup[3])
while index<endAt:
this=midi[index]
next=midi[index+1]
ttickOn,ttickOff,tnote,tkey,tvelocity=this
ntickOn,ntickOff,nnote,nkey,nvelocity=next

# Merge interpretation
finished=False
dif=(ttickOff-ttickOn)
if dif<blip and tkey==nkey and ttickOff>=ntickOn and ttickOff<=ntickOff:
print "Separating: ",this,next," Diff: ",(ttickOff-ntickOn)
midiOut.append([ttickOn ,ntickOn ,tnote,tkey,tvelocity])
midiOut.append([ttickOff,ntickOff,nnote,nkey,nvelocity])
index+=1
mute=True
elif dif<blip:
print "Removing blip: ",(ttickOff-ttickOn)
index+=1
mute=True
continue
else:
midiOut.append(this)
# iterate the loop
index+=1
if index==endAt:
midiOut.append(next)
if not mute:
return midiOut
midi=midiOut

[This AGPL code is on Github]

Then comes some real fun. If you know the original piece, you might have noticed that the introduction is not original. I added that in the midi editing software Aria Maestosa. It does not need to be done this way; we do not even need to use midi files. A lot of the music I have created in Sonic Field is just coded directly in Python. However, from midi is how it was done here.

Once we have a clean set of notes they need to be converted into sounds. That is done with ‘voicing’. I will talk a little about that to set the scene then we can get back into more Java code oriented discussion. After all, this is the Java advent calendar!

Voicing is exactly the sort of activity which brings Python to the fore. Java is a wordy language which has a large degree of strictness. It favours well constructed, stable structures. Python relies on its clean syntax rules and layout and the principle of least astonishment. For me, this Pythonic approach really helps with the very human process of making a sound:


def chA():
global midi,index
print "##### Channel A #####"
index+=1
midi=shorterThanMidi(midis[index],beat,512)
midi=dampVelocity(midi,80,0.75)
doMidi(voice=leadDiapason,vCorrect=1.0)
postProcess()

midi=longAsMidi(midis[index],beat,512)
midi=legatoMidi(midi,beat,128)
midi=dampVelocity(midi,68,0.75)
midi=dampVelocity(midi,80,0.75)
doMidi(voice=orchestralOboe,vCorrect=0.35,flatEnv=True,pan=0.2)
postProcessTremolate(rate=3.5)
doMidi(voice=orchestralOboe,vCorrect=0.35,flatEnv=True,pan=0.8)
postProcessTremolate(rate=4.5)

Above is a ‘voice’. Contrary to what one might think, a synthesised sound does not often consist of just one sound source. It consists of many. A piece of music might have many ‘voices’ and each voice will be a composite of several sounds. To create just the one voice above I have split the notes into long notes and short notes. Then the actual notes are created by a call to doMidi. This takes advantage of Python’s ‘named arguments with default values’ feature. Here is the signature for doMidi:


def doMidi(voice,vCorrect,pitchShift=1.0,qFactor=1.0,subBass=False,flatEnv=False,pure=False,pan=-1,rawBass=False,pitchAdd=0.0,decay=False,bend=True):

The most complex (unsurprisingly) voice to create is that
of a human singing. I have been working on this for
a long time and there is a long way to go; however, its
is a spectrogram of a piece of music which does
a passable job of sounding like someone singing.

The first argument is actually a reference to a function which will create the basic tone. The rest of the arguments describe how that tone will be manipulated in the note formation. Whilst an approach like this can be mimicked using a builder pattern in Java; this latter language does not lend it self to the ‘playing around’ nature of Python (at least for me).

For example, I could just run the script and add flatEvn=True to the arguments and run it again and compare the two sounds. It is an intuitive way of working.

Anyhow, once each voice has been composited from many tones and tweaked into the shape and texture we want, they turn up as a huge list of lists of notes which are all mixed together and written out to disk as a flat file format which is basically just a dump of the underlying double data. At this point it sounds terrible! Making the notes is often only half the story.

Voice Synthesis by Sonic Field
played specifically for this post.

You see, real sounds happen in a space. Our Choral is expected to be performed in a church. Notes played without a space around them sound completely artificial and lack any interest. To solve this we use impulse response reverberation. The mathematics behind this is rather complex and so I will not go into it in detail. However in the next section I will start to look at this as a perfect example of why Java is not only necessary but ideal as the back end to Python/Jython.

You seem to like Python Alex – Why Bother With Java?

My post might seem a bit like a Python sales job so far. What has been happening is simply a justification of using Python when Java is so good as a language (especially when written in a great IDE like Eclipse for Java). Let us look at something Python would be very bad indeed at. Here is the code for performing the Fast Fourier Transform, which is a the heart of putting sounds into a space.


package com.nerdscentral.audio.pitch.algorithm;

public class CacheableFFT
{

private final int n, m;

// Lookup tables. Only need to recompute when size of FFT changes.
private final double[] cos;
private final double[] sin;
private final boolean forward;

public boolean isForward()
{
return forward;
}

public int size()
{
return n;
}

public CacheableFFT(int n1, boolean isForward)
{
this.forward = isForward;
this.n = n1;
this.m = (int) (Math.log(n1) / Math.log(2));

// Make sure n is a power of 2
if (n1 != (1 << m)) throw new RuntimeException(Messages.getString("CacheableFFT.0")); //$NON-NLS-1$

cos = new double[n1 / 2];
sin = new double[n1 / 2];
double dir = isForward ? -2 * Math.PI : 2 * Math.PI;

for (int i = 0; i < n1 / 2; i++)
{
cos[i] = Math.cos(dir * i / n1);
sin[i] = Math.sin(dir * i / n1);
}

}

public void fft(double[] x, double[] y)
{
int i, j, k, n1, n2, a;
double c, s, t1, t2;

// Bit-reverse
j = 0;
n2 = n / 2;
for (i = 1; i < n - 1; i++)
{
n1 = n2;
while (j >= n1)
{
j = j - n1;
n1 = n1 / 2;
}
j = j + n1;

if (i < j)
{
t1 = x[i];
x[i] = x[j];
x[j] = t1;
t1 = y[i];
y[i] = y[j];
y[j] = t1;
}
}

// FFT
n1 = 0;
n2 = 1;

for (i = 0; i < m; i++)
{
n1 = n2;
n2 = n2 + n2;
a = 0;

for (j = 0; j < n1; j++)
{
c = cos[a];
s = sin[a];
a += 1 << (m - i - 1);

for (k = j; k < n; k = k + n2)
{
t1 = c * x[k + n1] - s * y[k + n1];
t2 = s * x[k + n1] + c * y[k + n1];
x[k + n1] = x[k] - t1;
y[k + n1] = y[k] - t2;
x[k] = x[k] + t1;
y[k] = y[k] + t2;
}
}
}
}
}

[This AGPL code is on Github]

It would be complete lunacy to implement this methematics in JPython (dynamic late binding would give unusably bad performance). Java does a great job of running it quickly and efficiently. In Java this runs just about as fast as it could in any language plus the clean, simple object structure of Java means that using the ‘caching’ system as straight forward. The caching comes from the fact that the cos and sin multipliers of the FFT can be re-used when the transform is the same length. Now, in the creation of reverberation effects (those effects which put sound into a space) FFT lengths are the same over and over again due to windowing. So the speed and object oriented power of Java have both fed into creating a clean, high performance implementation.

But we can go further and make the FFT parallelised:


def reverbInner(signal,convol,grainLength):
def rii():
mag=sf.Magnitude(+signal)
if mag>0:
signal_=sf.Concatenate(signal,sf.Silence(grainLength))
signal_=sf.FrequencyDomain(signal_)
signal_=sf.CrossMultiply(convol,signal_)
signal_=sf.TimeDomain(signal_)
newMag=sf.Magnitude(+signal_)
if newMag>0:
signal_=sf.NumericVolume(signal_,mag/newMag)
# tail out clicks due to amplitude at end of signal
return sf.Realise(signal_)
else:
return sf.Silence(sf.Length(signal_))
else:
-convol
return signal
return sf_do(rii)

def reverberate(signal,convol):
def revi():
grainLength = sf.Length(+convol)
convol_=sf.FrequencyDomain(sf.Concatenate(convol,sf.Silence(grainLength)))
signal_=sf.Concatenate(signal,sf.Silence(grainLength))
out=[]
for grain in sf.Granulate(signal_,grainLength):
(signal_i,at)=grain
out.append((reverbInner(signal_i,+convol_,grainLength),at))
-convol_
return sf.Clean(sf.FixSize(sf.MixAt(out)))
return sf_do(revi)

Here we have the Python which performs the FFT to produce impulse response reverberation (convolution reverb is another name for this approach). The second function breaks the sound into grains. Each grain is then processes individually and they all have the same length. This performs that windowing effect I talked about earlier (I use a triangular window which is not ideal but works well enough due to the long window size). If the grains are long enough, the impact of lots of little FFT calculation basically the same as the effect of one huge one. However, FFT is a nLog(n) process, so lots of little calculations is a lot faster than one big one. In effect, windowing make FFT become a linear scaling calculation.

Note that the granulation process is performed in a future. We define a closure called revi and pass it to sf_do() which is executed it at some point in the future base on demand and the number of threads available.  Next we can look at the code which performs the FFT on each grain – rii. That again is performed in a future. In other words, the individual windowed FFT calculations are all performed in futures. The expression of a parallel windowed FFT engine in C or FORTRAN ends up very complex and rather intractable. I have not personally come across one which is integrated into the generalised, thread pooled, future based schedular. Nevertheless, the combination of Jython and Java makes such a thing very easy to create.

How are the two meshed?

Now that I hope I have put a good argument for hybrid programming between a great dynamic language (in this case Python) and a powerful mid level static language (in this case Java) it is time to look at how the two are fused together. There are many ways of doing this but Sonic Field picks a very distinct approach. It does not offer a general interface between the two where lots of intermediate code is generated and each method in Java is exposed separately into Python; rather it uses a uniform single interface with virtual dispatch.

Sonic Field defines a very (aggressively) simple calling convention from Python into Java which initially might look like a major pain in the behind but works out to create a very flexible and powerful approach.

Sonic Field defines ‘operators’ which all implement the following interface:


/* For Copyright and License see LICENSE.txt and COPYING.txt in the root directory */
package com.nerdscentral.sython;

import java.io.Serializable;

/**
* @author AlexTu
*
*/
public interface SFPL_Operator extends Serializable
{

/**
* <b>Gives the key word which the parser will use for this operator</b>
*
* @return the key word
*/
public String Word();

/**
* <b>Operate</b> What ever this operator does when SFPL is running is done by this method. The execution loop all this
* method with the current execution context and the passed forward operand.
*
* @param input
* the operand passed into this operator
* @param context
* the current execution context
* @return the operand passed forward from this operator
* @throws SFPL_RuntimeException
*/
public Object Interpret(Object input, SFPL_Context context) throws SFPL_RuntimeException;
}
The word() method returns the name of the operator as it will be expressed in Python. The Interpret() method processes arguments passed to it from Python. As Sonic Field comes up it creates a Jython interpreter and then adds the operators to it. The mechanism for doing this is a little involved so rather than go into detail here, I will simply give links to the code on github:
The result is that every operator is exposed in Python as sf.xxx where xxx is the return from the word() method. With clever operator overloading and other syntactical tricks in Python I am sure that the approach could be refined. Right now, there are a lot of sf.xxx calls in Sonic Field Python ( I call it Synthon ) but I have not gotten around to improving on this simple and effective approach.

You might have noticed that everything passed into Java from Python is just ‘object’. This seems a bit crude at first take. However, as we touched on in the section of futures in the previous post, it offers many advantages because the translation from Jython to Java is orchestrated via the Caster object and a layer of Python which transparently perform many useful translations. For example, the code automatically translates multiple arguments in Jython to a list of objects in Java:


def run(self,word,input,args):
if len(args)!=0:
args=list(args)
args.insert(0,input)
input=args
trace=''.join(traceback.format_stack())
SFSignal.setPythonStack(trace)
ret=self.processors.get(word).Interpret(input,self.context)
return ret

Here we can see how the arguments are processed into a list  (which is Jython is implemented as an ArrayList) if there are more than one but are passed as a single object is there is only one. We can also see how the Python stack trace is passed into a thread local in  the Java SFSignal object. Should an SFSignal not be freed or be double collected, this Python stack is displayed to help debug the program.

Is this interface approach a generally good idea for Jython/Java Communication?

Definitely not! It works here because of the nature of the Sonic Field audio progressing architecture. We have processors which can be chained. Each processor has a simple input and output. The semantic content passed between Python and Java is quite limited. In more general purpose programming, this simple architecture, rather than being flexible and powerful, would be highly restrictive. In this case, the normal Jython interface with Java would be much more effective. Again, we can see a great example of this simplicity in the previous post when talking about threading (where Python access Java Future objects). Another example is the direct interaction of Python with SFData objects in this post on modelling oscillators in Python.


from com.nerdscentral.audio import SFData
...
data=SFData.build(len)
for x in range(0,length):
s,t,xs=osc.next()
data.setSample(x,s)

Which violated the programming model of Sonic Field by creating audio samples directly from Jython, but at the same time illustrates the power of Jython! It also created one of the most unusual soundscapes I have so far achieved with the technology:

Engines of war, sound modelling
from oscillators in Python.

Wrapping It Up

Well, that is all folks. I could ramble on for ever, but I think I have answered most if not all of the questions I set out in the first post. The key ones that really interest me are about creativity and hybrid programming. Naturally, I am obsessed with performance as I am by profession an optimisation consultant, but moving away from my day job, can Jython and Java be a creating environment and do they offer more creativity than pure Java?

Transition State Analysis using
hybrid programming

Too many years ago I worked on a similar hybrid approach in scientific computing. The GRACE software which I helped develop as part of the team at Bath was able to break new ground because it was easier to explore ideas in the hybrid approach than writing raw FORTRAN constantly. I cannot present in deterministic, reductionist language a consistent argument for why this applied then to science or now to music; nevertheless, experience from myself and others has show this to be a strong argument.

Whether you agree or disagree; irrespective of if you like the music or detest it; I wish you a very merry Christmas indeed.

This post is part of the Java Advent Calendar and is licensed under the Creative Commons 3.0 Attribution license. If you like it, please spread the word by sharing, tweeting, FB, G+ and so on!

A Serpentine Path To Music

A snapshot of the ffmpeg visualisation for the
Goldberg Variations as synthesised by Sonic Field.

Art And Programming Collide: An Introduction To The Post

Whilst there are a number of ways of making music with Java, as far as I can tell, Sonic Field is the only fully fledged, end to end synthesis system written in pure Java (no JNI and no call out to native servers etc.) which does not use the Javax Synthesiser extensions (so does not rely on the operating system to supply the actual synthesis engine).

Also, in many other ways, its heavy weight, uncompromisingly mathematical approach to music is unique. This project has consumed far too much of my time for the better part of three years; what can be said about this ab-initio synthesis system in just one post and what can we learn about Java programming from the project?

The answer is why?

  • Why write a mathematical engine in Java?
  • Why control it from a dynamic language (Python)?
  • Why make music using programs at all?
  • Why use ab-initio synthesis (and what is it)?

Maybe,  most interesting of all: 
  • Why is large scale musical mathematics so computationally demanding?
I will attempt to answer all these questions and shed some light into very ill frequented corners of Java/JVM programming. My approach will not be to answer each in turn, but these questions will form the armature upon which what follows will be constructed.

Sonic Field and how to achieve complex audio synthesis on the JVM proved to be too large a subject for one post. I have am extremely lucky to have been given chance to make two. This post is the hard work; we will cover the difficult subjects like threading, memory management and some digital audio concepts. The next post is a lot more fun; it will look at why I chose Python and some thoughts on creativity.

Now let us look at how digital sounds is represented and manipulated. 

Believe it or not, there are two ways right from the start. However, everyone (including myself with Sonic Field) picks the same one. Sound is made of waves. Waves can be represented mathematically using trigonometrical functions like sin and cos. We can then apply non linear differential equations to those mathematical forms to produce more complex equations. Keep adding more and more transformations and eventually we end up with one equation which represents an entire piece of music. This never happens because it is far, far too hard! Such an approach is referred to as ‘analytic’ and is practical only for the simplest of sounds.

The spectrogram of the piece of music
I rendered with Sonic Field for the
Java Advent Calendar
More tractable are numerical approaches. Here we consider sounds as a series of samples. Just like the samples that make up a video, one frame at a time, sound is generated and manipulated one sample at a time. We must understand this is an compromise and it brings with it a whole series of computational challenges of its own.

A quick note on licensing: Sonic Field is free as freedom:


Sonic Field Audio Processing And Synthesis Program
Copyright (C) 2012-2014 Dr Alexander J Turner

This program is free software: you can redistribute it and/or modify
it under the terms of the GNU Affero General Public License as
published by the Free Software Foundation, either version 3 of the
License, or (at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU Affero General Public License for more details.

You should have received a copy of the GNU Affero General Public License
along with this program. If not, see <http://www.gnu.org/licenses/>.
All the source code, including any Sonic Field patch scripts in this post are covered by the above licence. It only seems fair that anyone reading this should respect this as I am giving it away for free on Github!

Why samples are a challenge:

Consider that we want audio to be modelled up to 20kHz, i.e. we want to generate sounds which have frequencies up to 20 000 cycles per second. To actually generate that all using a sampling, digital system we need 40 000 samples per second. This is intuitively easy to understand as a sound goes ‘up and down’ in a wave so we need an up sample and a down sample for each cycle of the wave.
Now let us consider making a sound. We can start with a sine wave. This is a pure sound, and it is very boring indeed. How about making a more interesting sound? Well, real sounds consist of many different sine waves mixed together. A note at concert pitch A is 440Hz, it this were played on a trumpet we might get a sequence of frequencies like this:
440,880,1320,1760,2200,2640,3080,3520,3960,4400 and so on.
At the beginning of our piece we can see
that even a complex note is made from
bands of frequencies. 

Each frequency will be slightly less intense then the one beneath (usually) but they are all there. Now let us just consider 4400 or 8800; what happens when we pass that through a distortion filter to make it sound cooler? Well, we will end up adding frequencies which are multiples of the frequencies which made up the original trumpet sound. All this multiples are call harmonics.

Bored? we need some code! Sonic Field is controlled using Jython, the Python interpreter which runs on the JVM. I hope the reason for this will become obvious, but I will explicitly discuss this int he second post. Here is a piece of code which makes the basic sound like a trumpet (well, at least the trumpet rank from an organ):


def trumpetBase(length,freq,z=1.0):
voxA=[]
hc=1.0
freq=float(freq)
while hc*freq<20000:
hf=hc*freq
voxA.append(sf.NumericVolume(sf.PhasedSineWave(length,hf,random.random()),(1.0/hc)**z))
hc+=1

vox=sf.Mix(voxA)
vox=sf.Clean(vox)
vox=polish(sf.FixSize(vox),freq)
return sf.FixSize(vox)

OK, so you might not know Python (this being a Java Advent Calendar and all) so I will explain a little of what is happening here. Simply out their is a while loop which creates samples of different frequencies of sine waves to create the harmonics of the trumpet sound. The phase of the sine waves is randomised. It really does not matter if you do not know what that means. Anyhow, the loop which accumulates sine waves, then we mix them together and finally fix the volume (FixSize) to deviate +-1 from zero.
Not long now and we will be deep diving into Java, but a few more sentences to set the scene. But just back to that distortion. Consider I want the trumpet to sound more fragile and bright?

sig=sf.Power(trumpetBase(440,10000),1.25)
Now I have created a new ‘signal’ (i.e. a sequence of samples) which is a distortion of the original. In so doing multiples of the original harmonics will have been introduced. 5×4400 is over 20000. Our 40000 samples per second will no longer be enough. What we need to do is sample at a higher rate. But eventually we are defeated by this approach as we add distortion on distortion. The solution is to sample at a higher rate and then filter out the high frequencies periodically. We perform synthesis by repeating two basic steps, process then filter. The process adds frequencies and the filter takes out everything above when we need.
OK, so what rate? It seems that for most of my work 96 000 samples per second is about right. More makes near no difference to the sound quality but as I go below this number a horrid harsh effect called ‘aliasing’ starts to be unavoidable. This is caused by trying to model a frequency higher than twice the sample rate.

Here is a video I made talking about, explaining and illustrating aliasing.
Please note that the Sonic Field patch shown is pre the use of Python
and so is in a bespoke language I wrote called SFPL which is 
no longer used.
Finally we have started to see why making music with Java is so hard. 96 000 samples per second is a lot! In my research anything other than double precision floating point mathematics adds noise and reduces the over all quality of the sound. So, 96 000 * 8 bytes for 1 second of music. Basically, a single note can take many megabytes of memory to store.

How Sonic Field Represents Sounds:

The simplest way would be to just use a Java array of doubles. Sadly this is not anything like enough. I use a 16Gig machine and it would run out of memory very quickly if I used simple arrays. The reason being that sounds are made of sounds (as we saw before). Each of those sine waves in the trumpet sound might be 10 seconds long (about 5 megabytes) and there could be 40 harmonics, so that is 400 mega bytes. Then we want to generate notes in multiple threads, for 8 threads that might be 8 notes at once which is 1.6 Gigabytes. That is just to great a 10 second note which is relatively simple. More complex and longer notes (especially in ambient music where notes can last a very long time) quickly exhaust the capacity of my machine. I could use a bigger machine and address say 128 Gig; that would do for most music, but again, such a machine is likely to have more cores and the parallel memory use issue appears again. Basically, sound processing is memory lop sided compared to most other forms of processing.
Sonic Field address this issue with two techniques: Memory mapped files and generators. Of these, generators are probably the most interesting programmatically because they are surprisingly hard to implement and do not work as well as we might hope.


package com.nerdscentral.audio;

import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

import com.nerdscentral.audio.pitch.CubicInterpolator;
import com.nerdscentral.sython.SFMaths;
import com.nerdscentral.sython.SFPL_RuntimeException;

public abstract class SFSignal implements AutoCloseable
{

private final static AtomicLong uniqueId = new AtomicLong(0);
private final long myId;
protected final AtomicInteger referenceCount = new AtomicInteger(1);

public abstract boolean isKilled();

public abstract SFSignal replicate();

private static ThreadLocal<String> pythonStack = new ThreadLocal<>();

public final SFData replicateEmpty()
{
return SFData.build(this.getLength());
}

public abstract double getSample(int index);

protected SFSignal()
{
myId = uniqueId.incrementAndGet();
}

/**
* Returns a linearly interpolated sample based on the samples either side of the the passed index. This is used for super
* sampling or pitch effects.
*
* @param index
* @return
*/
public final double getSampleLinear(double index)
{
double s = SFMaths.floor(index);
double e = SFMaths.ceil(index);
if (s < 0 || e >= getLength())
{
return 0;
}
if (s == e) return getSample((int) s);
double a = getSample((int) s);
double b = getSample((int) e);
return ((index - s) * b + (e - index) * a);
}

/**
* Returns a cubic interpolated sample based on the samples either side of the the passed index. This is used for super
* sampling or pitch effects. Cubic interpolation uses two samples either side of the required point and so at the ends of
* the sample this will fall back to linear interpolation.
*
* @param index
* @return
*/
public final double getSampleCubic(double index)
{
int s = (int) SFMaths.floor(index);
int e = (int) SFMaths.ceil(index);
if (s < 0 || e >= getLength())
{
return 0;
}
if (s > getLength() - 3 || index < 1)
{
if (s == e) return getSample(s);
double a = getSample(s);
double b = getSample(e);
return ((index - s) * b + (e - index) * a);
}
return CubicInterpolator.getValue(getSample(s - 1), getSample(s), getSample(s + 1), getSample(s + 2), index - s);
}

public abstract double setSample(int index, double value);

public abstract int getLength();

public abstract void setAt(int pos, SFSignal data) throws SFPL_RuntimeException;

public abstract void setFrom(int pos, SFSignal dataIn) throws SFPL_RuntimeException;

public abstract double[] getDataInternalOnly();

/**
* Get a globally unique identifier for this SFSingal
*
* @return
*/
public final long getUniqueId()
{
return myId;
}

public abstract void release();

@Override
public void close() throws RuntimeException
{
int c = referenceCount.decrementAndGet();
if (c == 0) release();
else if (c < 0) throw new RuntimeException(Messages.getString("SFSignal.1")); //$NON-NLS-1$
}

public void incrReference()
{
referenceCount.incrementAndGet();

}

public SFSignal __pos__()
{
incrReference();
return this;
}

public SFSignal __neg__()
{
close();
return this;
}

public int getReference()
{
return referenceCount.get();
}

public void decrReference()
{
referenceCount.decrementAndGet();
}

public static String getPythonStack()
{
return pythonStack.get();
}

public static void setPythonStack(String ps)
{
SFSignal.pythonStack.set(ps);
}

public void clear()
{
// NOOP on root class
}
}

[This AGPL code is on Github]

This is the abstract root class for all signal data in Sonic Field. It is a bit of a mess to be honest, the perfect example of a one person generated class which has ended up a little unfocused. However, it works and concrete sub-classes of this make up generators and memory mapped file handling. We can also see a bunch of code for integrating with Python in a nice way and for explicit resource management; normal Java garbage collection is no use for memory mapped file based storage!

So what is a generator?

The basic architecture of synthesis ends up being (you can argue theoretically about this, but in reality this is what happens):

  1. Make a signal
  2. Manipulate that signal making a new signal
  3. Repeat until the wanted sound appears

The obvious architecture is then to:

  1. Signals are samples.
  2. Read samples.
  3. Apply and algorithm.
  4. Write samples.

We represent samples as indexed sequences of double . Therefore, for the above model, each sound manipulator algorithm reads from a sequence and writes out to a sequence. However, this involves a huge amount of memory access and memory access is very slow. When we run out of memory the operating system starts flushing data to the memory mapped files to help cope; reading and writing then becomes even slower. Consider a manipulation like this:

  1. Distort a signal using the ‘Saturate’ algorithm
  2. Halve the resulting volume
  3. Mix with another signal

We do not need to create intermediate data storage for these steps, they can be chained together. This is what generators do; they chain with other generators overriding the getSample method. Here is the saturate implementation:

 
public static class Translator extends SFSingleTranslator
{

protected Translator(SFSignal input)
{
super(input);
}

@Override
public double getSample(int index)
{
double x = getInputSample(index);
double y = x >= 0 ? x / (x + 1) : x / (1 - x);
return y;
}
}
Here the base class SFSingleTranslator implements the ability to contain and access the signal source. Mix is a little more complex as it has multiple source signals, but again, it is hardly expensive:

static class Translator extends SFMultipleTranslator
{

protected Translator(List<SFSignal> input)
{
super(input);
}

@Override
public double getSample(int index)
{
double d = 0;
for (int mindex = 0; mindex < getNMembers(); ++mindex)
{
d += getInputSample(mindex, index);
}
return d;
}
}
I hope it is clear how each of these can use the previous as a source and so no storage is required.

Why generators are not the solution to everything:

Generators seem so cool that we might think signal processing is a solved problem; just make all manipulations a generator and the reading/writing issue only comes at the end of signal generation when we write our finished music to disk. Sadly, the real world is more complex. Whilst this ideal concept might work for a restricted notion of real time signal manipulation, for the complex world of non realtime manipulation in Sonic Field we have non sequential access and multi-threading to contend with.
First, let us consider the simplest form of non sequential access; this is multiple access to the same signal. Whilst the previous algorithms were simple, some are not. If we consider a 6 pole infinite impulse response filter, then we are looking at something which will take hundreds or thousands of cycles to compute one signal sample. We do not want to accidentally do this twice! However, generators can cause that to happen. Consider that we split a signal into two (for left and right).

# Read the signal
signal=sf.ReadFile("mono.wav")
# Filter it
signal=sf.ButterworthLowPass(signal,1000,6)
# Split it
left=+signal
right=signal
# Add Haas effect
left  = sf.Concatenate(sf.Silence(30),left)
# Write it out
sf.WriteFile32((left,right,"stereo,wav")
If the Butterworth Low Pass filter is a generator then it will get run twice, once for the left and once for  the right. This would be insanely expensive. In the case of expensive algorithms we want to write and the result not keep recomputing it. So, for low cost transformations Sonic Field uses generators and for high cost it does indeed translate to a SFData; i.e. a memory mapped file.
Left get even more complex than that; we also have some processing which can take place in many, many non sequential subsections. A processor might cut a signal into pieces, for granular synthesis for example. In granular synthesis we take sound and chop it into tiny pieces and then manipulate each individually. Here is a video I did describing the technique:

In cases like this we also do not want the signal being non sequentially processes to be a generator. To cope with such situations we have the ‘realise’ method which converts generators to ‘real’ signals but has no effect on those which are already storage backed.

Similarly, consider white noise. This is ‘random’ so one might think it could easily be created from a generator. However, we want the same point in a signal (say sample 112121) to always have the same value each time we read it. So, in effect, white noise must be ‘recorded’ into a real data store. Finally we have the filters which have an internal state. Things like low pass filters are implemented as a state machine. That state is based on the order in which samples are passed through the filter. As such, they cannot be implemented as a generator as samples in Sonic Field are random access exactly because we might want to do things like granulate!

How Java Is Terrible At Memory Management (and approaching a solution):

Now we have some background to why managing audio data is such a challenge we are ready to see why Java is a bit of a catastrophe for managing memory. Java has garbage collection which means we do not know how much memory is actually being used. There are some forms of instrumentation which are supposed to tell us the amount of memory being used, but they are not very ‘real time’. In a C++ based system (for example) we could intercept malloc and know, to the byte, the amount of memory being used from the heap. This luxury is not given to us in Java. I tried, oh I tried so hard, to get a reliable way of estimating when the JVM was about to throw an out of memory exception; but as far as I can tell, there is no fail-safe way of doing so.

However, all is not lost as there is a usable work around. Indeed, the work around is always evolving and I would not be shocked it it became almost as effective as a hand crafted memory management system in C++ but with the platform independence of Java. It is not there yet, but it is getting there.

Memory Mapped Signal Pages

Memory mapped files are somewhat supported in Java. It is frustrating as anything that we can only memory map byte buffers and not double arrays; nevertheless, we can use them with the appropriate accessor methods. So, what Sonic Field does is have 2 megabyte ‘chunks’ which are ByteBuffers created by memory mapping a file. Using explicit resource management (thanks to Java 7) we know when a chunk is being used and when it is free. When one is freed it is added to a ConcurrentLinkedDeque of free chunks. This means our chunks are recyclable. How about we take a look at some of the code which does this:


static
{
tempDir = new File(System.getProperty(SONIC_FIELD_TEMP));
String swapLimitraw = System.getProperty(SONIC_FIELD_SWAP_LIMIT);
if (swapLimitraw == null)
{
swapLimit = MAX_IN_RAM; // 64 megabytes or 87 seconds at 96000
}
else
{
swapLimit = Long.parseLong(swapLimitraw);
}
chunkLen = swapLimit / 2;

String pid = ManagementFactory.getRuntimeMXBean().getName();
try
{
if (tempDir != null)
{
coreFile = File.createTempFile("SonicFieldSwap" + pid, ".mem", tempDir); //$NON-NLS-1$ //$NON-NLS-2$
}
else
{
coreFile = File.createTempFile("SonicFieldSwap" + pid, ".mem"); //$NON-NLS-1$//$NON-NLS-2$
}
coreFile.deleteOnExit();
// Now create the actual file
coreFileAccessor = new RandomAccessFile(coreFile, "rw"); //$NON-NLS-1$
}
catch (IOException e)
{
throw new RuntimeException(e);
}
channelMapper = coreFileAccessor.getChannel();
}

}

[This AGPL code is on Github]

Here we can see the creation of the swap file (the name I use for the memory mapped file). The most important thing is that we get the channel for the swap file. This is the object we can then use for memory mapping. Now we can see the code which creates the mapping for a new SFData object:


private void makeMap(long size) throws IOException
{
long countDown = size;
int chunkCount = 0;
while (countDown > 0)
{
++chunkCount;
countDown -= chunkLen;
}

countDown = size;
chunks = new ByteBuffer[chunkCount];
chunkCount = 0;
while (countDown > 0)
{
ByteBuffer chunk = freeChunks.poll();
if (chunk == null) break;
chunks[chunkCount] = chunk;
++chunkCount;
countDown -= chunkLen;
}
if (countDown > 0)
{
synchronized (coreFile)
{
long from = coreFile.length();
while (countDown > 0)
{
ByteBuffer chunk = channelMapper.map(MapMode.READ_WRITE, from, chunkLen);
chunk.order(ByteOrder.nativeOrder());
chunks[chunkCount] = chunk;
++chunkCount;
from += chunkLen;
countDown -= chunkLen;
}
}
}
}

[This AGPL code is on Github]

It is lovely and simple! We just use up chunks on the free ‘stack’ and if there are not enough to represent all the signal we create a few more until we have enough.  I use a deque as a stack because the most recently added chunk is also most likely not to be ‘swapped out’ by the operating system; optimum performance is achieve by reusing the same chunks over and over. There is no concurrent stack in the standard JDK, so the deque made a good choice.

It is worth noting that recycling the last used chunk rather and using a first in first out approach gave a huge performance improvement on my machine. When I tried a first in first out approach the system would run fast until swapping started and then slow right down as it constantly read back in swapped out chunks. This was a complete waste of time because the swapped out chunks contained data that was not going to be used again. By recycling last in first out, this performance issue disappeared and Sonic Field achieved very efficient use of the operating system file cache and virtual memory systems.

Using explicit resource management, we can ensure that chunks which are no longer required get reused:


@Override
public void close() throws RuntimeException
{
int c = referenceCount.decrementAndGet();
if (c == 0) release();
else if (c < 0) throw new RuntimeException(Messages.getString("SFSignal.1")); //$NON-NLS-1$
}

Yes, Sonic Field uses reference counting. This is tide into the explicit resource management in that SFSingal implements AutoCloseable. I will discuss the mechanism behind this a little more when we discuss the way the Java interacts with Python. However, what matters for now is that when the reference count hits zero ‘release()’ is called:


@Override
public void release()
{
for (ByteBuffer chunk : chunks)
{
freeChunks.push(chunk);
}
chunks = null;
resourceTracker.remove(this);
}

Release is responsible for putting the now unused chunks back onto the free deque. As the free deque is concurrent, we do not have to use synchronization blocks anywhere other than when creating new chunks.

Heavy disk IO as the OS starts serialising
blocks of the memory mapped file to disk

Now I can hear you screaming at me. My ears are ringing and it has nothing todo with this noisy train I am on. “But you should use a direct double buffer” you say. Yes, nio has these double buffers so we would use put() and get() rather than putDouble() and getDouble(). Surely these are ‘blessed’ in some magical way to make them perform. Nope, that approach is very slightly slower than using ByteBuffers directly. In other words, getDouble and putDouble being used ‘under the covers’ or something very similar. Here is a simple test which creates a memory mapped signal and reads from it then creates another etc:

from java.lang import System
sig=sf.WhiteNoise(10000)
t0=System.nanoTime()
for x in range(1,10):
    sig=sf.Swap(sig)
    for z in range(1,5):
        sig1=sf.Realise(sf.Pcnt10(+sig))
    sig=sig1
        
t1=System.nanoTime()
for y in range(1,5):
    t0=System.nanoTime()
    for x in range(1,10):
        sig=sf.Swap(sig)
        for z in range(1,5):
            sig1=sf.Realise(sf.Pcnt10(+sig))
        sig=sig1
    t1=System.nanoTime()
    print t1-t0

Times using a double buffer:
1497275000
1474564000
1410076000
1474493000
Sonic Field using 48 Gig of swap file
on a machine with 16 Gig of RAM

Times using a byte buffer only:

1375515000
1376532000
1478389000
1438900000
On my machine the memory mapped files would not have had to actually resort to disk (SSD actually) as there was plenty of RAM available. Making sure the byte order is correct for the current platform (in my case LITTLE_ENDIAN, which one can get from ByteOrder.nativeOrder()) does help a little, which is interesting as it shows some form of optimisation going on.

A Quick Note On Caching

To anyone who has worked in software performance tuning, seeing a system with serialisation and deserialisation and writing to memory mapped files will immediately cause the word ‘cache’ to appear in bright neon in their mind’s eye. This is certainly the case for me, and until recently Sonic Field did have a caching scheme which simultaneously helped a bit, caused immense complexity and produced constant out of memory errors. As of now I have given up on that approach. Basically, the serialisation costs are painful but acceptable, especially considering that they can be reduced by ever more  aggressive use of generators. Caching systems add an overhead in themselves and so they offer less benefit than one might expect. One day I might dream up a better solution, but for now, this seems to be reliable if not exactly stella performance wise.

To finish this section on memory management, here is a piece of music which required a 48G swap file. This entire 32.5 minute sound scape, including all the reverberation, harmonic excitation and delay effects were all done in Sonic Field using double precision floating point mathematics at 96 000 samples per second. When I say it like that, I guess it is not such a shock that it took a lot of resources!

What Is Arbitrary Complexity Ab-initio Synthesis?

A quick lesson in synthesis. The pioneers of electronic music made notes and sounds. Later came the work of the legendary Dr Robert Moog. He created the concept of what we now call a synthesiser. As always, we could include many others in this creation, some working with him and some in parallel work. But let us just think about his work. He realised that very flexible sound creation could be achieved by oscillators, which create tones, and filters, which remove unwanted tones. These were ‘analogue’ electronic circuits which produced a limited number of tones and could filter them in a limited number of ways to produce sounds in ‘real time’.  The complexity of the sound was controlled by the complexity of the synthesiser. However, limitations in analogue technology set an upper limit to complexity; every circuit added some distortion and some noise, eventually these artefacts would overwhelm the desired sound and so the upper complexity limit was set.

These early pioneers created what we now think of as the ‘principles of synthesis’ and by so doing formed a wonderful working model for people to learn but also restricted creativity. Computational and digital synthesis tends to continue to follow the same path as the original analogue work. It is real time (press a key and out comes sound) and restricted (there are a fixed number of digital sound sources and filters etc). “Arbitrary complexity” simply means that there is no upper limit to the number of filters, oscillators and such that can be used. One key concept behind Sonic Field (which I have not yet quite gotten around to implementing) is that it can function in a cloud environment an coordinate an arbitrarily large number of nodes to create an arbitrarily complex piece of music or soundscape. Whilst the implementation is not quite there yet, the current architecture is very much suited to this approach.

There exists a completely different sort of digital sound creation which some people call synthesis (but I do not). This is sample looped based. Here sounds are recorded or created digital and stored as a ‘sample’ of that sound. These samples are then played back in a loop to create the notes heard. This is not ‘from first principles’ because the sounds already existed. Creating sounds by the direct manipulation of mathematical equations or the direct application of electronic circuits creates the sounds from first principles. This is where ‘ab-initio’ comes from.

So Why Java?

I get asked this quite a bit when I discuss Sonic Field with people. The first question often is why not do this in C++? There are many answers and as this post is in a Java advent calendar I do not intend to defend Java. I think the question comes from an incorrect view that Java is somehow non performant. I often hear people saying Java is interpreted, even now! My view is that Java has some huge specific benefits for this work. They centre around multi-threading, clustering and portability:
  1. I originally intended to run Sonic Field on top of hadoop so processing could be distributed in a cloud. I might well still do something like this. Java’s serialisation system and the maturity of the tooling around cloud computing in Java makes it an ideal candidate.
  2. Heavy multi-threading is more mature in Java than in C++. At the time is started the project C++11 was only just coming on line, so the contrast was even more stark.
  3. Being able to run Sonic Field anywhere and not having to recompile it for different platforms has been a huge benefit. I have run it on Windows, Linux and (almost exclusively for a couple of years now) on the Mac. I have never had to change a line of code to do these things.
  4. I like doing strange things which people do not expect; proving Java can handle large scale mathematical calculations has a perverse fun streak running through it!
On personal reflection, I do not think the project would have survived this long if it had been in C++. The effort of maintaining the build and updating from one version of gcc to the next and across to Visual Studio would have caused me to give up. Seriously, the memory management issues might one day start me porting off the JVM but I still hold hope I will keep coming up with a fixes for them rather than give up on this amazing platform; current evidence is that the memory mapped files approach is ‘good enough’ and I live in hope of coming up with either more tweaks to that or an even more effective approach.

An Aside On Threading

I said Java was good for threading. If you know anything about Python in C you will know it is single threaded (well, the interpreter is through a thing called the global interpreter lock). Jython is very multi-threaded. One of the driving concerns in Sonic Field has always been parallel execution be that via clustering or local threading. So I think it is time we took a look at the current threading model in Sonic Field (which changes almost as often as the memory management).

Now, this is a bit complex for a single post like this, so I will skim over a lot of the details. Nevertheless, I think this mechanism is a good demonstration of how Jython and Java can work well together.

import threading
import time
from java.util.concurrent import Executors, TimeUnit
from java.util.concurrent import Callable, Future
from java.lang import System
from java.lang import ThreadLocal
from java.lang import Thread
from java.util.concurrent import TimeUnit
from java.util.concurrent.locks import ReentrantLock

SF_MAX_CONCURRENT = int(System.getProperty("synthon.threads"))
SF_MAX_CONQUEUE = SF_MAX_CONCURRENT

print "Concurrent Threads: " + SF_MAX_CONCURRENT.__str__()
SF_POOL = Executors.newCachedThreadPool()

class sf_callable(Callable):
def __init__(self,toDo):
self.toDo=toDo

def call(self):
return self.toDo()

class sf_futureWrapper(Future):
def __init__(self,toDo):
self.toDo=toDo
self.gotten=False

def __iter__(self):
return iter(self.get())

def get(self):
if self.gotten:
return self.result
else:
self.result=self.toDo.get()
self.gotten=True
return self.result

def __pos__(self):
obj=self.get()
return +obj

def __neg__(self):
obj=self.get()
return -obj

class sf_getter(Future):
def __init__(self,toDo):
self.toDo=toDo
self.result=self.toDo()

def get(self):
return self.result

def __iter__(self):
return iter(self.get())

def __pos__(self):
obj=self.get()
return +obj

def __neg__(self):
obj=self.get()
return -obj

class sf_taskQueue(ThreadLocal):
def initialValue(self):
return []

SF_TASK_QUEUE=sf_taskQueue()

class sf_superFuture(Future):

def __init__(self,toDo):
self.toDo=toDo
queue=SF_TASK_QUEUE.get()
queue.append(self)
if len(queue)>SF_MAX_CONQUEUE:
self.submitAll()

def submit(self):
count=SF_POOL.getActiveCount()
if count<SF_MAX_CONCURRENT:
task=sf_callable(self.toDo)
self.future=sf_futureWrapper(SF_POOL.submit(task))
else:
self.future=sf_getter(self.toDo)

def submitAll(self):
queue=SF_TASK_QUEUE.get()
while(len(queue)):
queue.pop().submit()

def get(self):
self.submitAll()
while not hasattr(self,'future'):
Thread.yield()
r = self.future.get()
return r

def __iter__(self):
return iter(self.get())

def __pos__(self):
obj=self.get()
return +obj


def __neg__(self):
obj=self.get()
return -obj

def sf_do(toDo):
return sf_superFuture(toDo)

def shutdown_and_await_termination(pool, timeout):
pool.shutdown()
try:
if not pool.awaitTermination(timeout, TimeUnit.SECONDS):
pool.shutdownNow()
if not pool.awaitTermination(timeout, TimeUnit.SECONDS):
print >> sys.stderr, "Pool did not terminate"
except InterruptedException, ex:
pool.shutdownNow()
Thread.currentThread().interrupt()

def shutdownConcurrnt():
shutdown_and_await_termination(SF_POOL, 5)
This mechanism launches futures to process threads. It does this using closures which are really easy to do in Python. It gets much more complex than one might think though because of the issue of recursive launching (which I will touch on here in a bit but is covered in more detail in my post ‘The Embrace Of Meh‘). Sonic Field aims to have a simple approach to threading. The very idea that such a thing exists is a little naive! However, it does get some way towards this goal. Here is an example. Consider that we have function sf.FrequencyDomain() which is very expensive and a function xmulti() which cross multiplies the output of this expensive processor.


# Form A
def mixer(siga,sigb):
    return sf.CrossMultiply(
        sf.FrequencyDomain(siga),
        sf.FrequencyDomain(sigb)
    )

We can convert that into running in parallel simply by making the bulk of the work happen in a closure:


# Form B
def mixer(siga,sigb):
    def mixerInner():
        return sf.CrossMultiply(
            sf.FrequencyDomain(siga),
            sf.FrequencyDomain(sigb)
       )
    return sf_do(mixerInner)

I love this because it requires so little syntax and just works. sf_do creates a future for the execution of the closure. The snag comes when we want to use that future. Because in the Java world we need the result (a signal) not a future. How can we  fix this type safety issue? Despite the parallel submission code being written in Python it uses java.util.concurrent.Future from the JDK. This means we can dynamically dispatch of the instance type of arguments. All Sonic Field processors convert their argument dynamically from Object to what ever they require. They do this via the static methods on the com.nerdscentral.synthon.Caster class. The key one of those methods being this:


public static Object checkAutoTranslation(Object o) throws SFPL_RuntimeException
{

if (o instanceof Future)
{
Future doer = (Future) o;
try
{
Object d = doer.get();
return checkAutoTranslation(d);
}
catch (Throwable t)
{
throw new SFPL_RuntimeException(t);
}
}

if (o == null)
{
throw new SFPL_RuntimeException(Messages.getString("Caster.12")); //$NON-NLS-1$
}

return o;
}

I have highlighted in bold where the conversion occurs.

Naturally, tasks can submit tasks, so the system must be recursive. It will keep calling get() on Futures until it find a non Future result. However, you might have noticed that the code for creating futures is not as simple as it might be. What is a ‘SuperFuture’? This is a huge challenge. There is a finite number of threads available on any computer; when we have tasks submitting new tasks and each is in a new thread we can exhaust the maximum number easily. The classic solution is to use a thread pool. Sadly, if say we have 16 threads in a pool then one task submits 2 and then those submit 2 each and then those submit 2 each we have 1+2+4+8=15 threads. If any of the new leaf tasks then submit tasks we will see the pool exhausted. This means they the new tasks queue up for the pool to empty. But the pool never empties because it is full of blocked threads which are waiting for the completion of tasks which are queued up.

SuperFutures work around this because they can either result in a true Future which is asynchronously executed or they can, if the pool is getting dangerously full, be executed serially. i.e. a SuperFuture converts into either a Future (parallel) or Getter (serial) when it is executed. Its execution is delayed until an input queue hits a particular size of the result of the SuperFuture is requested from another Future or Getter. This delay effect helps fine grained tasks to end up as Getters and their parents to be the Futures.

I have not implemented a full work stealing system because the complexity of working out the dependencies in the Python puts me off. I want the system to be easy to use and I have not worked out an easy to use work stealing system yet; so for now the super future appraoch appears to work OK.

Wrapping It Up?

There is so much I wanted to discuss and did not! For example, I wanted to discuss explicit resource management in much more detail. I would love to have discussed caching in FFT. How about sub-normal floating point performance? Immutable data structures and threading for large scale mathematics?

What I hope I have shown is how some of the ‘heavy engineering’ in Sonic Field has allowed something one would often not associated with Java – huge scale mathematics – to work rather cleanly in this language. Sonic Field has taught me so much about practical threading issues, real memory management challenges and effective flexible processing architecture. I hope that at least some of these concepts will be of use to you the reader.

This post is part of the Java Advent Calendar and is licensed under the Creative Commons 3.0 Attribution license. If you like it, please spread the word by sharing, tweeting, FB, G+ and so on!

Using Intel Performance Counters To Tune Garbage Collection

<

div dir=”ltr” style=”text-align: left;”>

Introduction

I have to admit that I was shocked. Indeed, quite shaken when I realised this advent calendar post would be about garbage collection. The topic of GC has raised such passion amongst the advocates of Java and those who believe memory management should be manual. Many an article has been written regarding tiny subtle changes in strange looking command line arguments which have fractions of a percentage point performance impact on Java applications. How could I add to this huge body work? I hope this post will not add to the GC hot air but rather be a breath of fresh air instead. Let us not look at the CPU time consumed by the garbage collector or pause times; how about looking at a hidden yet potentially critical aspect of memory management in general and garbage collection in particular: Data caching is one of the major challenges of modern computer software design (the others being instruction caching and multi-core work distribution). Modern CPUs run so fast that main memory has no hope of keeping up. The way that some of this catastrophic performance penalty can be clawed back is caching. Memory is pulled in parallel into high speed cache memory and then the CPU accesses this cache. If we are lucky, and the code causes the CPU to read and write the same memory a few times (inside a loop for example) then the CPU can happily access the cache and be saved from waiting for loads and stores to and from main memory. “How does garbage collection impact on the performance of caches” one might ask? There are many ways, some of them very subtle, however, here is a grab bag of some important ones: The garbage collector traverses references in memory. This causes cache lines (blocks of memory in the cache) to contain the memory surrounding the reference and hence no longer hold other data which the program is using. Whilst we call it a garbage collector, it is actually an allocator, mover and collector. This really matters when we think about data caching:

  • Allocation: Based on hardware rules, memory addresses are matched to cache lines. If pieces of memory share a cache line but are actually accessed from different threads we get an effect called false sharing. However, if little bits of data are spread out but accessed from the same thread we get poor cache utilisation.
  • Moving: Objects are not left in one place throughout their life times. The garbage collector avoids memory fragmentation by moving objects around. This has the interesting effect of guaranteeing that the cache lines associated with the object will no longer be associated with it after a move.
  • Collecting: The funny thing is that collecting is the easy bit. It can be as simple as just marking the memory as available for reuse. It is the traversal of the object graphs (multiple roots) to find out what can be collected which is going to cause data cache line loads and thus evict lines from the cache which were being read from or written to by user code.

So we can now see that the design of the garbage collector is critical to the operation of the data cache. Swapping which collector we use will not only have a impact on GC pauses and other obvious issues, it will also effect, at a low level and in a fundamental way, all of user code.

An Example

I am not going to present an exhaustive scientific paper on this concept. The purpose of this post is to show an alternative way of approaching JVM tuning. So, I ran a simple, short, multi-threaded patch in my personal synthesiser program Sonic-Field. The patch uses feedback, resonance and a bunch of other concepts to synthesise string instruments and then convolution to place the sounds in an acoustic environment.The reason for picking sonic field is not just because it is of reasonable complexity, highly threaded and uses Spring but because I recently found I could get better performance from it using the CMS garbage collector. Latency with Sonic-Field is of no interest because it is a batch processor. However, the standard Java 7 garbage collector interacted badly with the way Sonic Field writes swap files out to disk when running low on memory. I tried CMS because it keeps the memory down the whole time (in theory – don’t flame me) because it constantly attempts to do small garbage collections along side the user threads. If we put all this together we might well come up with a reasonable theory “The CMS garbage collector might give fewer pauses and might be able to keep memory use down but in so doing it will almost certainly cause more data cache misses”. Constantly traversing the reference graph in memory to try and collect dead objects is going to cause cache loads and those loads will cause other data to be flushed from the cache (it has finite size). Thus, when user threads come to read again they will cause more cache misses and so on. Does it matter? That answer will be entirely down to the application and the hardware and the load on the application. I am not, I repeat not, advocating one garbage collector over another! Nevertheless, it is a question I would like to answer so let’s answer it for my little test patch. These data cache effects of the garbage collector are not visible from the normal VM profiling tools. This means that they do not get discussed much in the JVM community and they get considered in JVM tuning even less. However, there is a tool (actually several – but I am going to talk about the easiest to use) which can shed some light on the topic. I am talking about Intel’s PCM (Performance Counter Monitor). It can be used for code tuning as well, but I thought talking about the GC would be more fun today.

A Worked Example

pcm is just a command line tool. We pass the command line to run Java to it in quotes and it does its measurements. With other tooling, the performance counters can be used to get all sorts of other detail about an application. The benefit of the pcm command line tool is its simplicity and lack of intrusion into the over all application run. The disadvantage is that it will measure the JVM and application warm up phases. However, for server style applications or batch processors (like Sonic Field) these overheads are usually trivial compared to the actual application run. I ran my patch on my personal Macbook Pro Retina (2012) with 16Gig of RAM. The JVM was:

java version "1.8.0-ea"

Java(TM) SE Runtime Environment (build 1.8.0-ea-b61)

Java HotSpot(TM) 64-Bit Server VM (build 25.0-b05, mixed mode)

Readings from pcm are simply written to standard out when the application exits. I compared runs with no settings for the garbage collector (the default therefore) and with my currently preferred set of tweaks. To be honest, I am not sure if the tweaks are optimal; I kind of lifted them from a bunch of online articles… Here is the launch script for Java:

/Users/alexanderturner/x/IntelPerformanceCounterMonitorV2.5.1 2/pcm.x "java -Xmx12G -Xms12G  -DsonicFieldTemp=/Users/alexanderturner/temp -DsonicFieldThreads=12 -DsonicFieldSwapLimit=4.0  -XX:+UseConcMarkSweepGC -XX:+UseCompressedOops -XX:ParallelGCThreads=8 -XX:+CMSParallelRemarkEnabled -XX:CMSInitiatingOccupancyFraction=60 -XX:+UseCMSInitiatingOccupancyOnly -classpath bin:spring-framework-3.1.2.RELEASE/dist/org.springframework.asm-3.1.2.RELEASE.jar:spring/spring-framework-3.1.2.RELEASE/dist/org.springframework.beans-3.1.2.RELEASE.jar:spring-framework-3.1.2.RELEASE/dist/org.springframework.core-3.1.2.RELEASE.jar:spring/spring/spring-framework-3.1.2.RELEASE/dist/org.springframework.context-3.1.2.RELEASE.jar:
spring/spring-framework-3.1.2.RELEASE/dist/org.springframework.context-support-3.1.2.RELEASE.jar:
spring/spring-framework-3.1.2.RELEASE/dist/org.springframework.expression-3.1.2.RELEASE.jar:
spring/spring-framework-3.1.2.RELEASE/dist/org.springframework.test-3.1.2.RELEASE.jar:
spring/otherJars/commons-logging-1.1.1.jar com.nerdscentral.sfpl.RenderRunner $1"

Hopefully it is clear just how simple running Java under Intel Performance Counter Monitor v2 really is. So, here is the output:

Standard GC

Core (SKT) EXEC IPC FREQ AFREQ L3MISS L2MISS L3HIT L2HIT L3CLK L2CLK READ WRITE TEMP
0 0.53 0.75 0.70 1.31 422 M 621 M 0.32 0.32 0.14 0.01 N/A N/A 32
2 0.56 0.77 0.73 1.31 346 M 466 M 0.26 0.31 0.11 0.01 N/A N/A 28
1 0.22 0.69 0.32 1.31 144 M 192 M 0.25 0.28 0.11 0.01 N/A N/A 32
3 0.21 0.68 0.31 1.31 135 M 171 M 0.21 0.28 0.10 0.01 N/A N/A 28
4 0.55 0.77 0.71 1.31 332 M 410 M 0.19 0.38 0.11 0.01 N/A N/A 22
7 0.18 0.68 0.26 1.30 124 M 134 M 0.08 0.30 0.11 0.00 N/A N/A 27
5 0.19 0.68 0.29 1.31 133 M 155 M 0.14 0.30 0.11 0.00 N/A N/A 22
6 0.61 0.79 0.78 1.32 343 M 382 M 0.10 0.35 0.10 0.00 N/A N/A 27

—————————————————————————————————-

SKT 0 0.38 0.75 0.51 1.31 1982 M 2533 M 0.22 0.33 0.11 0.01 N/A N/A 22

Instructions retired: 2366 G ; Active cycles: 3166 G ; Time (TSC): 773 Gticks ; C0 (active,non-halted) core residency: 39.04 % —————————————————————————————————

TOTAL * 0.38 0.75 0.51 1.31 1982 M 2533 M 0.22 0.33 0.11 0.01 N/A N/A N/A

C: 1.49 => corresponds to 37.36 % utilization for cores in active state. Instructions per no C1 core residency: 23.92 %; C3 core residency: 0.01 %; C6 core residency: 0.00 %; C7 core residency: 37.02 % C2 package residency: 0.00 %; C3 package residency: 0.00 %; C6 package residency: 0.00 %; C7 package residency: 0.00 % PHYSICAL CORE I Pminal CPU cycle: 0.76 => corresponds to 19.12 % core utilization over time interval

Concurrent Mark Sweep

Rather than

-XX:+UseConcMarkSweepGC
-XX:+UseCompressedOops
-XX:+CMSParallelRemark
-XX:ParallelGCThreads=Enabled
-XX:CMSInitiatingOccupancyFraction=60
-XX:+UseCMSInitiatingOccupancyOnly

Core (SKT) EXEC IPC FREQ AFREQ L3MISS L2MISS L3HIT L2HIT L3CLK L2CLK READ WRITE TEMP
0 0.53 0.69 0.76 1.31 511 M 781 M 0.35 0.35 0.17 0.02 N/A N/A 26
2 0.54 0.71 0.75 1.31 418 M 586 M 0.29 0.40 0.14 0.01 N/A N/A 29
1 0.31 0.66 0.47 1.30 207 M 285 M 0.27 0.26 0.11 0.01 N/A N/A 26
3 0.21 0.68 0.31 1.31 135 M 171 M 0.21 0.28 0.10 0.01 N/A N/A 28
4 0.55 0.77 0.71 1.31 332 M 410 M 0.19 0.38 0.11 0.01 N/A N/A 22
7 0.18 0.68 0.26 1.30 124 M 134 M 0.08 0.30 0.11 0.00 N/A N/A 27
5 0.19 0.68 0.29 1.31 133 M 155 M 0.14 0.30 0.11 0.00 N/A N/A 22
6 0.61 0.79 0.78 1.32 343 M 382 M 0.10 0.35 0.10 0.00 N/A N/A 27
3 0.30 0.66 0.46 1.30 198 M 258 M 0.23 0.27 0.11 0.01 N/A N/A 29
4 0.59 0.73 0.81 1.31 397 M 504 M 0.21 0.46 0.12 0.01 N/A N/A 29
7 0.30 0.66 0.45 1.30 186 M 204 M 0.09 0.29 0.11 0.00 N/A N/A 30
7 0.30 0.66 0.45 1.30 186 M 204 M 0.09 0.29 0.11 0.00 N/A N/A 30
5 0.30 0.66 0.45 1.30 188 M 225 M 0.16 0.28 0.11 0.01 N/A N/A 29
6 0.58 0.73 0.79 1.31 414 M 466 M 0.11 0.49 0.13 0.00 N/A N/A 30

—————————————————————————————————-

SKT 0 0.43 0.70 0.62 1.31 2523 M 3313 M 0.24 0.38 0.13 0.01 N/A N/A 25
Instructions retired: 2438 G ; Active cycles: 3501 G ; Time (TSC): 708 Gticks ; C0 (active,non-halted) core residency: 47.22 %

—————————————————————————————————- TOTAL * 0.43 0.70 0.62 1.31 2523 M 3313 M 0.24 0.38 0.13 0.01 N/A N/A N/A

C: 1.39 => corresponds to 34.83 % utilization for cores in active state. Instructions per no C1 core residency: 17.84 %; C3 core residency: 0.01 %; C6 core residency: 0.01 %; C7 core residency: 34.92 % C2 package residency: 0.00 %; C3 package residency: 0.00 %; C6 package residency: 0.00 %; C7 package residency: 0.00 %

PHYSICAL CORE I
Pminal CPU cycle: 0.86 => corresponds to 21.51 % core utilization over time interval. All the information given here is of interest, however, there is so much of it I figure the best thing to do is cut the the case and test my assertion about the CMS collector. To do that we can look at just two lines form the output for each run:

Default:
SKT 0 0.38 0.75 0.51 1.31 1982 M 2533 M 0.22 0.33 0.11 0.01 N/A N/A 22
Instructions retired: 2366 G ; Active cycles: 3166 G ; Time (TSC): 773 Gticks ; C0 (active,non-halted) core residency: 39.04 %
CMS:
0 0.43 0.70 0.62 1.31 2523 M 3313 M 0.24 0.38 0.13 0.01 N/A N/A 25
SKT:
Instructions retired: 2438 G ; Active cycles: 3501 G ; Time (TSC): 708 Gticks ; C0 (active,non-halted) core residency: 47.22 %

Discussion

We can see that under the CMS collector there were substantially more cache misses. The L2 misses were 30% greater and L2 were up 27% over the default collector. Nevertheless, the total time taken in giga ticks (708CMS/773Default) shows that all this extra data missing has not negatively impacted over all performance at all. I guess this means that a lot more research could and should be done before drawing any conclusions as to the correct approach for this application!
If you leave this post thinking that I did not fully discuss the subject, you are correct. My intention here has been to get the reader interested in thinking about this aspect of Java performance and opening the door to a new approach.

Meta: this post is part of the Java Advent Calendar and is licensed under the Creative Commons 3.0 Attribution license. If you like it, please spread the word by sharing, tweeting, FB, G+ and so on!

Weak, Weaker, Weakest, Harnessing The Garbage Collector With Specialist References

When and when not to use specialist references in Java

Weak, Soft and Phantom references are dangerous and powerful. If they are used the wrong way they can destroy JVM performance; however, when used the correct way they can substantially enhance performance and program clarity.

Weak and Soft references are the more obvious of the three. They are pretty much the same thing actually! The idea is simply that they be used to access an object but will not prevent that object being reclaimed by the garbage collector:


Object y=new Object();
// y is a hard reference to the object
// and so that object cannot be reclaimed.

Obejct x=WeakReference<Object>(y);
// now x is a weak reference to the object
// (not to y - as y is just a variable).
// The object still cannot be reclaimed
// because y is still a hard reference to it.

y=null;
// now there is only a weak reference to
//the object, it is eligible for garbage collection.

if(x.get()==null){
System.out.println("The object has gone away");
}else{
System.out.println("The object is " + x.get().toString());
}

Have you spotted the deliberate mistake? It is an easy one to miss and it will probably not show up in unit testing. It is exactly the sort of issue which makes me say:

Only Use Weak/Soft References If You Absolutely Have To
And Probably Not Even Then.
When the JVM is under memory pressure it might reclaim the object between the first and second invocations of the get method in the weak reference. This will result in the program throwing a null pointer exception when the toString method is invoked on null. The correct form for the code is:


Object x=x.get();
// Now we have null xor a hard reference to
// the object
if(z==null){
System.out.println("The object has gone away");
}else{
System.out.println("The object is " + z.toString());
}

So they are mad, bad and dangerous to be with; why do we want them?

We have not fully touched on why they are really, really dangerous yet. To do that we need to see why we might want them and why we might need them. There are two common situations in which weak and soft references might seem like a good idea (we will look at the difference between soft and weak in a little bit). The first of these is in some form of RAM cache.

It works like this: We have some data, for example customer details, which is stored in a database. We keep looking it up and that is slow. What we can do is cache that data in RAM. However, eventually the RAM will fill up with names and addresses and the JVM throw an OutOfMemoryError. The solution is to store the names and addresses in objects which are only weakly reachable. Something like this:


ConcurrentHasMap>String,WeakReference>CustomerInfo<< cache=new ConcurrentHashMap><();
...
CustomerInfo currentCustomer=cache.get(customerName);
if(currentCustomer==null){
currentCustomer=reloadCachesEntry(customerName);
}

This innocent little pattern is quite capable of bringing a monster sized JVM to its knees. The pattern is using the JVM’s garbage collector to manage an in-memory cache. The garbage collector was never designed to do that. The pattern abuses the garbage collector by filling up the memory with weakly reachable objects which run the JVM out of heap space. When the JVM gets low in memory, it has to traverse all the reference, weak, soft and otherwise, in its heap and reclaim RAM. This is expensive and shows up as a processing cost. It is even worse on very big JVM instances with a lot of processor cores because the garbage collector may well end up having to perform a ‘stop the world’ full cycle and hence reduce performance down to single core levels!

I am not saying in memory cache technology is a bad idea. Oh no – it is a great idea. However, just throwing it against the garbage collector and not expecting trouble is a very poor design choice.

Weak vs Soft what is the difference? Well, there is much really. On some JVMs (the client hostspot JVM for example – but that might change at any time) weak reference are marked for preferential garbage collection. In other words, the garbage collector should make more effort to reclaim memory from the object graph to which they refer (and no soft or hard references refer) than for other memory. Soft references do not have this idea to them. However, this is just an optional idea on some JVMs and cannot be relied upon, and it is a bad idea anyway. I would suggest using either soft or weak references all the time and stick with it. Pick which ever you like the sound of. I prefer the name WeakReference, so tend to use that.

There is one other difference; an object which is referenced to by a soft reference and a weak reference, but not a hard reference, can have the situation where it can still be acquired from the .get() method of the weak reference but not that of the soft reference. The reverse is not possible not the other way around. Code that relies on this behaviour is probably wrong headed.

Good uses for weak references do exist. What weak references are great for it keeping track of objects which are being used else where. An example is from Sonic Field (an audio processing package). In this example, ‘slots’ in files contain audio data and are associated with objects in memory. This model does not use the weak references to refer to in-memory copies of the data. In memory objects use the slots. Weak references are used to allow the file management system to reuse slots.

The code using slots does not need (and should not need to) be concerned with the management of disk space. It is the concern of the file manager to do that. The file manager has weak references to the objects using the slots. When a new slot is requested, the file manager checks for any existing slots referred to via weak references which have been reclaimed (and hence return null from the get method). If it finds such a reference, it can reuse the slot.

Automatic notification of reclamation

Sometimes we might want to be told when a weak or soft (or the other sort – phantom) reference has been reclaimed. This can be done via the en-queuing system. We can do this using a reference queue:


WeakReference(T referent, ReferenceQueue<? super T> q)

We do something like this:


ReferenceQueue junkQ = new ReferenceQueue<>();
....
WeakReference<FileSlot> mySlot=new WeakReference<>(aSlot);
....
// In a different thread - make sure it is daemon!
WeakReference<FileSlot> isDead;
while(true){
isDead = junkQ.remove();
// Take some action based on the fact it is dead
// But - it might not be dead - see end of post :(
...
}

But, remember, by the time weak reference ends on the junkQ calling .get() on it will return null. If you will have to store information to allow what ever action you are interesting it to happen somewhere else (like a ConcurrentHashMap where the reference is the key),

So What Is A Phantom Reference?

Phantom references are the one sort which, when you need them, you really need them. But on the face of it, they seem utterly useless. You see, whenever you invoke .get() on a phantom reference, you always get null back. It is not possible to use a phantom reference to get to the object to which it refers – ever. Well – that is not quite true. We can achieve this via JNI sometimes but we should never do so.

Consider the situation where you allocate native memory in JNI associated with a Java object. This is the sort of model which the DirectBuffers in the noi package of the JDK use. It is something I have used repeatedly in large commercial projects.

So, how do we reclaim that native memory? In the case of file like systems, it is possible to say that the memory is not reclaimed until the file is closed. This places the responsibility of resource management on the shoulders of the programmer; which is exactly what the programmer expects for things like files. However, for lighter weight objects, we programmers do not like to have to think about resource management – the garbage collector is there to do it for us.

We could place code in a finalizer which calls into the JNI code to reclaim the memory. This is bad (as in lethal) because JVMs make almost guarantee that they will call finalizers. So, don’t do that! But, phantom references come to the rescue! First we need to understand ‘phantom reachable’: A phantom reference will only become enqueued if the thing to which it refers cannot be reach via any other sort of reference (hard, weak or soft). At this point the phantom reference can be enqueued. If the object had a finalizer, then it will either have been ignored or run; but it will not have ‘brought the object back to life’. Phantom reachable objects are ‘safe’ for JNI native code (or any other code) to reclaim resources against.

So our code with phantom references can look like this:


ReferenceQueue<FileSlot> junkQ = new ReferenceQueue<>();
....
Phantom<FileSlot> mySlot=new Phantom<>(aSlot);
....
// In a different thread - make sure it is daemon!
Phantom<FileSlot> isDead;
while(true){
isDead=junkQ.remove();
long handle=lookUpHandle(isDead);
cleanNativeMemory(handle);
}

In this pattern we keep a handle which the native code can use to find and reclaim resources in a structure (another hashmap probably) in Java. When we are absolutely sure that Java object cannot be brought back to life – it is phantom reachable (i.e. a ghost – we can then safely reclaim the native resource. If your code does other ‘naughty’ things using sun.misc.unsafe (for example) this trick might be of use as well. For a full example which uses this technique – check out this post.

One final thought about phantom references. It is technically possible to implement the same pattern as above using weak references. However, that is not the purpose of weak references and such a pattern would be abusing them. Phantom references makes an absolute guarantee that an object really is dead and so resource can be reclaimed. For just one example, it is theoretically possible for a weak reference to be enqueued and then the object be brought back to life by its finalizer because the finalization queue is running slower than the weak reference queue. This sort of edge case horror story cannot happen with phantom references.

There is one little problem, which is a weakness of the JVM design. That is that the JNI global weak reference type has an undefined relationship with phantom references. Some people suggest that you can use a global weak reference even to get to am object even when it is enqueued as a phantom reference. This is a quirk of one particular implementation of the JVM and should never be used.

By: Dr Alexander J Turner: I would like to thank Attila for contacting me and organising this interesting project. Feel free to pop over to my blog at Nerds-Central at any time 🙂

Meta: this post is part of the Java Advent Calendar and is licensed under the Creative Commons 3.0 Attribution license. If you like it, please spread the word by sharing, tweeting, FB, G+ and so on! Want to write for the blog? We are looking for contributors to fill all 24 slot and would love to have your contribution! Contact Attila Balazs to contribute!

Java Sound – Make A Noise

The JDK has a audio integration as standard – but for some reason it is ill used and poorly documented.

When I started coding Sonic Field it was silent. Yes, it could synthesis audio signals and processes recordings; however, playing an arbitrary piece of audio with javax.sound was beyond me and even Google did not seem able to help. Well, I did eventually figure it out and so I hope this post can help others avoid my pain to come to bring Java to audio and the other way around.
The first step is to get the javax.sound api imported. For the example here, the imports are:


import javax.sound.sampled.AudioFormat;
import javax.sound.sampled.AudioSystem;
import javax.sound.sampled.DataLine;
import javax.sound.sampled.SourceDataLine;

That was the easy bit. The hard bit is figuring out what to do with it. The api was designed around using pre-cooked audio sources and performing effects and mixes on them. What is really hard to figure out is how to play audio you have made programmatically. However, it is possible using the AudioFormat class. This is able to understand some simple formats and make audio sources from byte arrays holding them. For audio output 16 bit signed linear encoding is really simple and good enough for listening to. So, in the example below I show how to code up 16 bit signal and play it on the default output device.


// Please note that this code is under AGPL 3.0
// within Sonic Field.

// I have relaxed this snippet and this snippet only
// to the CC license for the Java Advent Calendar project.

// The key thing to understand is the AudioFormat.
// This is actually the
// class which understands the binary format of
// audio which the computer cna
// use to make sound. Here we are specifying
// sixteen bit signed audio
AudioFormat af = new AudioFormat((float) SFConstants.SAMPLE_RATE, 16, 1, true, true);
// SFConstants,SAMPLE_RATE is in
// samples per second - it is part of Sonic Field
// here it is 96000, which is a good choice.
// 44100 is CD rate as an alternative.

// These two lines link the AudioFormat class
// to the default output device for
// the computer. It is possible choose between
// different audio output devices
// but that is for another post.
DataLine.Info info = new DataLine.Info(SourceDataLine.class, af);
SourceDataLine source = (SourceDataLine) AudioSystem.getLine(info);
source.open(af);
source.start();
// OK dataIn is a Sonic Field type.
// However, it is just a wrapped float array.
// what we need to do is convert a series of
// floats into the 16 bit signed format
// we specified. javax.sound does not really
// help us much here. We have to
// do the conversion out selved. Fortunately, it is easy.
byte[] buf = new byte[dataIn.getLength() * 2];
for (int i = 0; i < buf.length; ++i)
{
short sample = (short) (dataIn.getSample(i / 2) * 32767.0);
buf[i] = (byte) (sample >> 8);
buf[++i] = (byte) (sample & 0xFF);
}
// Now we have a byte array with
// 16 bit signed audio, we can just play it.
// This is done via the 'write' method on our SourceDataLine
source.write(buf, 0, buf.length);
// The source plays asynchronously using buffering.
// If we want to know when it
// has finished we need to call the drain method
// which only returns when all current audio as played
source.drain();
// Finally we can shut everything down
source.stop();
source.close();

I hope that the comments above help explain what is happening. I noticed that they do make reading the code a little tricky, so here is the code without the comments.

AudioFormat af = new AudioFormat((float) SFConstants.SAMPLE_RATE, 16, 1, true, true);
DataLine.Info info = new DataLine.Info(SourceDataLine.class, af);
SourceDataLine source = (SourceDataLine) AudioSystem.getLine(info);
source.open(af);
source.start();
byte[] buf = new byte[dataIn.getLength() * 2];
for (int i = 0; i < buf.length; ++i)
{
short sample = (short) (dataIn.getSample(i / 2) * 32767.0);
buf[i] = (byte) (sample >> 8);
buf[++i] = (byte) (sample & 0xFF);
}
source.write(buf, 0, buf.length);
source.drain();
source.stop();
source.close();

Sonic Field is an open source (AGPL) audio processing and synthesis system written in pure Java (no JNI, only standard Java). For more information on the project please feel free to check out the links below. The source code and associated sites will provide a large amount of useful information on audio and Java.

By: Dr Alexander J Turner: I would like to thank Attila for contacting me and organising this interesting project. Feel free to pop over to my blog at Nerds-Central at any time 🙂
Meta: this post is part of the Java Advent Calendar and is licensed under the Creative Commons 3.0 Attribution license. If you like it, please spread the word by sharing, tweeting, FB, G+ and so on! Want to write for the blog? We are looking for contributors to fill all 24 slot and would love to have your contribution! Contact Attila Balazs to contribute!

Java 7 – The Language Grows Up

Java was side stepped by C# some years ago – but with the newly invigorated roadmap, Java is back in the game!

Java 6 had a few very (very) annoying features which have been fixed in Java 7. It is easy not to realise how powerful these new yet small aspects of the language are. They point to a bright future where Java grows up slowly into a ever easier language to use.

My first insight into the annoying features in Java did not come form writing Java but from creating a COBOL compiler which could compile to the JVM. Some features were asked for by fellow developers which I would not have thought about otherwise:

  1. Some way to make variable definitions easier in the face of generics syntax.
  2. Something like the using syntax in C#.
  3. Lambdas and delegates (these are coming to Java 8).

In this post I am going to take a quick look at 1 and 2. These two features make my every day life much easier as a Java programmer, so I am happy to talk about them!

Diamond Syntax:

Even with excellent tools like Eclipse, writing the following is a pain:

ArrayList<string> list = new ArrayList<string>();

But it gets worse, how about:

ArrayList<ArrayList<string>> list = new ArrayList<ArrayList<string>>();

The compiler know that the type on the right is the same as the type on the left; that is how it does the type reconciliation. So, why not allow programmers to let the compiler do a bit of work for them:

ArrayList<ArrayList<string>> list = new ArrayList<>();

This exactly what the diamond operator does. In simple terms <> means “Take the generic type of the l-value to which the r-value is being assigned.

It is easy to say this sort of change does not matter. Using auto-complete on an IDE developers do not need to do all that typing and what does a few characters matter anyhow? The difference comes in readability as much as in the rate of typing. Code with frequent redefinitions of the same type requires more effort to read and is easier to misunderstand. Misunderstandings cause bugs, need I say more?

Try With Resources:

On the subject of bugs, here is a construct which can cause rather a lot of them:

static String readFirstLineFromFile(String path) throws IOException {
BufferedReader br = new BufferedReader(new FileReader(path));
try {
return br.readLine();
} finally {
if (br != null) br.close();
}
}

Three things tend to happen here:

  1. Someone forgets to put in the finally clause and the system runs out of file handles.
  2. Someone forgets to put in the null check and null pointer exceptions get thrown when the FileReader fails.
  3. The code gets edited for some change and the above two things happen in the new version.

This is another case of the Java language being complete and competent but prone to bugs by relying on the developer a little too much. Java 1.7 brings in the new Try-With-Resources syntax. It looks a little odd to start with but it is straight forward and works really well.

static String readFirstLineFromFile(String path) throws IOException  {
try (
BufferedReader br = new BufferedReader(new FileReader(path));
) {
return br.readLine();
}
}

Now, what ever objects are defined in the () clause after the try will be closed automatically. This is a much better approach in that, just like the the diamond operator, we only need to think about the issue once, in one place. For more information on try-with-resources see the Oracle Page.

By: Dr Alexander J Turner: I would like to thank Attila for contacting me and organising this interesting project. Feel free to pop over to my blog at Nerds-Central at any time 🙂

Meta: this post is part of the Java Advent Calendar and is licensed under the Creative Commons 3.0 Attribution license. If you like it, please spread the word by sharing, tweeting, FB, G+ and so on! Want to write for the blog? We are looking for contributors to fill all 24 slot and would love to have your contribution! Contact Attila Balazs to contribute!