Build OpenJDK for a Nice Speedup :: Nov 2, 2019

Just-in-time compilation (JIT) is a beautiful thing, and Java leads the industry. But unlike C2, which compiles your bytecode with optimizations for a particular OS (macOS), ISA (x86_64), and microarchitecture (Intel Broadwell), JVM distributions themselves are compiled only for OS and ISA. This enables portability while reducing the number of build combinations vendors must maintain.

The native code running alongside your Java app, like C1 and the garbage collector, is significant. The question is: if we compile OpenJDK for our own architecture, will app throughput improve in a meaningful way? The answer appears to be yes.

Building the JDK

We’ll be comparing AdoptOpenJDK’s macOS OpenJDK 13 and a custom-built OpenJDK 13. Both are the same build, 33. These are my build instructions:

bash configure --with-jvm-variants=server --with-jvm-features=link-time-opt \
--with-extra-cflags='-Ofast -march=native -mtune=broadwell -funroll-loops -fomit-frame-pointer' \
--with-extra-cxxflags='-Ofast -march=native -mtune=broadwell -funroll-loops -fomit-frame-pointer'

We configure with the standard server variant, which includes jvm-features like g1gc and c2. Additionally, link-time-opt is added to the build. I could not find any documentation on this feature, but presumably it performs link-time optimization, which should improve performance. Now let’s consider the C and C++ compiler flags.

-Ofast is the highest optimization level available in clang and gcc, and it comes with a few drawbacks: binary size may grow, and floating-point semantics are relaxed. You can replace it with -O3, which keeps strict floating-point conformance, or -O2, which also avoids the binary-size increase. I have tested with both -Ofast and -O3 and haven’t experienced problems with either, but your mileage may vary.
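For example, keeping strict floating-point semantics just means swapping -O3 in for -Ofast in the configure line above:

bash configure --with-jvm-variants=server --with-jvm-features=link-time-opt \
--with-extra-cflags='-O3 -march=native -mtune=broadwell -funroll-loops -fomit-frame-pointer' \
--with-extra-cxxflags='-O3 -march=native -mtune=broadwell -funroll-loops -fomit-frame-pointer'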

-march=native and -mtune=broadwell tell the compiler to optimize for your particular architecture. One would think, given the compiler documentation, that -march implies -mtune, but this is apparently not the case.
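Broadwell happens to be my laptop’s CPU; if you’re not sure what to pass to -mtune, something like the following can tell you (the first line is macOS-specific, the second assumes a real gcc rather than Apple’s clang shim):

# macOS: print the CPU model string (5th-gen Core, e.g. i5-5xxx, is Broadwell)
sysctl -n machdep.cpu.brand_string
# gcc: show what -march=native resolves to
gcc -march=native -Q --help=target | grep -- '-march='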

-funroll-loops ensures that loops are unrolled. Loop unrolling should be especially effective here, since -march is specified. It is included with clang’s -O3 and up, but must be set manually for gcc.

-fomit-frame-pointer allows the compiler to omit frame pointers when possible, freeing a register. This could make debugging the JVM’s native code painful.

make images CONF=macosx-x86_64-server-release

The build takes only 15 minutes or so on my early-2015 MacBook, even with optimizations enabled.
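Assuming the default output layout, the fresh image lands under build/, and you can sanity-check it straight away:

./build/macosx-x86_64-server-release/images/jdk/bin/java -version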

Benchmarks

DaCapo was the first benchmark suite I ran.

DaCapo Results

The optimized JDKs outperform the stock build in every case, with the avrora and fop benchmarks getting big speedups from -Ofast. I’m really curious as to why this is the case!
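For reference, a DaCapo run is just the suite’s jar plus a benchmark name, roughly like this (adjust the jar name to whichever release you downloaded):

java -jar dacapo-9.12-MR1-bach.jar avrora
java -jar dacapo-9.12-MR1-bach.jar fop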

Next up I ran some benchmarks from the Computer Language Benchmarks Game.

Benchmarks Game Results

I calculated these with time for i in {1..10}; do java <class>; done, and divided by 10. There was no significant difference between the JDKs. Unlike DaCapo, these benchmarks are not representative of normal workloads and should be given less weight. For example, none of them generate garbage or exercise JVM features other than C2. But at least we know that C2 and startup are not the beneficiaries of the optimizations!
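Spelled out, the measurement looked roughly like this for each JDK (with <class> standing in for the benchmark’s main class, and the java on the PATH swapped between runs):

# ten back-to-back runs; total wall time divided by 10
time (for i in {1..10}; do java <class>; done)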

Finally I executed some JMH microbenchmarks for Netty’s HttpObjectEncoder:

Netty JMH Results

I chose this microbenchmark somewhat at random after looking through Netty’s extensive collection. The speedup is massive in the case where allocation is not pooled and void promises are not used, and still substantial otherwise. Of course, these methods are unlikely to dominate your application’s performance. There also appears to be a glitch in the second chunked bench.
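If you want to reproduce this, the standalone JMH runner can be pointed at a specific JVM with its -jvm option; the line below is only a rough sketch (the jar path and benchmark name follow the usual JMH conventions, and Netty’s microbench module may wrap things differently):

# run benchmarks matching HttpObjectEncoder on the custom-built JDK
java -jar target/benchmarks.jar HttpObjectEncoder -jvm /path/to/custom-jdk/bin/java -f 2 -wi 5 -i 5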

Note: Netty offers a Native Transport… it would be interesting to compile that with these flags as well!

Other Compilers and OSes

I also tried building with Intel’s compiler, and patched the configure scripts to allow it. However, the build failed with cryptic errors. I’d also be interested to see Linux results.

Summary

So, while you need to test with your own applications, it is clear that targeting the JDK at a specific architecture can provide significant throughput improvements. Coupled with how easy it is to build a new JDK (Project Skara is awesome), developers of performance-critical Java applications should seriously consider building an optimized JDK, just as C/C++ developers build optimized binaries.

Tables for Those Inclined

DaCapo Table

Benchmarks Game Table

Netty Table

Modesty

I doubt my methodology is perfect; if you spot problems, let me know!