String deduplication

String deduplication is one of the newest performance features and this post is relevant only to the newest Oracle HotSpot JDK (1.8u20 or newer).
Enhancement Proposal #192 is here.
In a few words, String deduplication is a part of garbage collection, when G1 GC reassigns value-field (array of chars) in String objects if there are any duplicates of this array. In comparison to String.intern(), after which reference to String objects would be the same, after String deduplication String objects will be different but they will reference to the same array of chars.
As I wrote in my previous post String interning and number of buckets in Java String Pool, I have simple program that reads 1 GB file, splits each line into tokens, and creates object structure for each document (each document is represented as array of tokens with additional information for document).
My Oracle HotSpot JDK is 1.8u31:

kirill@ubuntu:~$ java -server -version
java version "1.8.0_31"
Java(TM) SE Runtime Environment (build 1.8.0_31-b13)
Java HotSpot(TM) Server VM (build 25.31-b07, mixed mode)

Each line split into tokens with the following code:

            StringTokenizer tokens = new StringTokenizer(text, " \t\n\r\f\",.;:-+*()[]|/–?=_&·");
            List<String> list = new ArrayList<>();
            while (tokens.hasMoreTokens()) {
                list.add(tokens.nextToken().toLowerCase().intern());
            }

This works fine with Java String Pool size = 1000003 as I described in my previous post.

Lets remove invocation of intern():

            StringTokenizer tokens = new StringTokenizer(text, " \t\n\r\f\",.;:-+*()[]|/–?=_&·");
            List<String> list = new ArrayList<>();
            while (tokens.hasMoreTokens()) {
                list.add(tokens.nextToken().toLowerCase());
            }

This code leads to OutOfMemoryError because there are too many String objects created during parsing string lines (many of those tokens are equal). Java String Pool is not used here.
But String deduplication may be effective here. Let’s find all JVM options relevant to String deduplication:

kirill@ubuntu:~$ java -server -XX:+PrintFlagsFinal -version 2>/dev/null | grep -i dedup
     bool PrintStringDeduplicationStatistics        = false                               {product}
    uintx StringDeduplicationAgeThreshold           = 3                                   {product}
     bool UseStringDeduplication                    = false                               {product}

According to UseStringDeduplication=false, String deduplication is disabled by default.
Let’s enable string deduplication with -XX:+UseStringDeduplication -XX:+PrintStringDeduplicationStatistics. The full line to run my program is:

java -server -Xms768m -Xmx768m -XX:+UseG1GC -XX:+UseStringDeduplication -XX:+PrintStringDeduplicationStatistics -Xloggc:gc.log -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+PrintGCDetails -XX:+PrintHeapAtGC -XX:+PrintTenuringDistribution -XX:+PrintAdaptiveSizePolicy -XX:+PrintStringTableStatistics -XX:StringTableSize=1000003 -jar InformationRetrieval.jar data.log 1>out.log 2>err.log

Done! No OutOfMemoryError after adding these 2 JVM options.
There are lines relevant to String deduplication in gc.log:

2015-03-09T01:00:32.250+0200: 195.245: [GC concurrent-string-deduplication, 2161.7K->132.1K(2029.5K), avg 96.7%, 0.0271479 secs]
   [Last Exec: 0.0271479 secs, Idle: 0.6144211 secs, Blocked: 1/0.0003469 secs]
      [Inspected:           87239]
         [Skipped:              0(  0.0%)]
         [Hashed:           86554( 99.2%)]
         [Known:              570(  0.7%)]
         [New:              86669( 99.3%)   2161.7K]
      [Deduplicated:        84583( 97.6%)   2029.5K( 93.9%)]
         [Young:                0(  0.0%)      0.0B(  0.0%)]
         [Old:              84583(100.0%)   2029.5K(100.0%)]
   [Total Exec: 125/20.4248935 secs, Idle: 125/163.6405303 secs, Blocked: 17/10.6674584 secs]
      [Inspected:        27041547]
         [Skipped:              0(  0.0%)]
         [Hashed:        27039265(100.0%)]
         [Known:             1672(  0.0%)]
         [New:           27039875(100.0%)    636.8M]
      [Deduplicated:     26585580( 98.3%)    615.6M( 96.7%)]
         [Young:               14(  0.0%)    344.0B(  0.0%)]
         [Old:           26585566(100.0%)    615.6M(100.0%)]
   [Table]
      [Memory Usage: 6364.8K]
      [Size: 262144, Min: 1024, Max: 16777216]
      [Entries: 455749, Load: 173.9%, Cached: 0, Added: 455883, Removed: 134]
      [Resize Count: 8, Shrink Threshold: 174762(66.7%), Grow Threshold: 524288(200.0%)]
      [Rehash Count: 0, Rehash Threshold: 120, Hash Seed: 0x0]
      [Age Threshold: 3]
   [Queue]
      [Dropped: 0]

We see that this is 125th String deduplication. This deduplication took 0.0271479 secs and inspected 87239 String objects. 86669 of them are new String objects (not analyzed before) and 570 of them are known. 84583 of 87239 were deduplicated (note that deduplication also depends on -XX:StringDeduplicationAgeThreshold JVM option).
In general, deduplication took 20.4248935 secs (so long) and inspected 27041547 String objects. 26585580 of them were deduplicated and it saved about 615M. This is why there is no OutOfMemoryError but I found performance penalties comparing to String interning. I didn’t test my code with JMH but the reading from file (and parsing string line) became slower. The work time was 49 secs with String interning and become 58 secs with String deduplication.
To summarize – String deduplication may be useful if String interning cannot be used (the code creates many String objects, but it is not recommended to intern them), but it requires G1 garbage collector and some performance penalties may be observed.

External process blocking

External native processes created in Java are very useful. The best way to create new process is to use ProcessBuilder class and its ProcessBuilder.start() method. After that new child process starts its execution in parallel with Java application.
The following code pattern used sometimes for waiting until child process has terminated:

        Process process = new ProcessBuilder(cmd)
                .environment(envp)
                .directory(dir)
                .start();
        process.waitFor();
        if (process.exitValue() == 0) {
            // Getting out and error streams of this process
            InputStream processOutputStream = process.getInputStream();
            InputStream processErrorStream = process.getErrorStream();
            ...
        } else {
            // External process returned non-zero code
            ...
        }

Let’s start with simple commands that print 10 and 20 symbols to output:

kirill@ubuntu:~$ s=$(printf "%-10s" '~') ; echo "${s// /'~'}"
~~~~~~~~~~
kirill@ubuntu:~$ s=$(printf "%-20s" '~') ; echo "${s// /'~'}"
~~~~~~~~~~~~~~~~~~~~

Nothing prevents us to print as many symbols as we want:

kirill@ubuntu:~$ s=$(printf "%-1000s" '~') ; echo "${s// /'~'}"
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Let’s move this command to Java:

	    int n = 10;
        String[] cmd = {
                "/bin/bash",
                "-c",
                "s=$(printf \"%-"+ n + "s\" '~') ; echo \"${s// /'~'}\""
        };
        Process process = new ProcessBuilder(cmd).start();
        process.waitFor();
        if (process.exitValue() == 0) {
            BufferedReader in = new BufferedReader(new InputStreamReader(process.getInputStream()));
            String line = in.readLine();
            System.out.println(line.length());
        } else {
            throw new IllegalStateException("Program has terminated but returned non-zero code: " + process.exitValue());
        }

It’s easy to see that this external process writes to its stdout initially given number of ‘~’-symbols. Java thread waits until that process ends. After that Java thread reads pipe buffer and writes to stdout the total number of symbols written by external process. It is not necessary to close stream here.
Seems to be OK, but let’s take n = 100000 or n = 1000000. Something went wrong and seems to be Java thread waits forever!
Let’s find is external child process still alive:

kirill@ubuntu:~$ ps -ef | grep bash
kirill    2966  2959  0 01:23 pts/1    00:00:00 bash
kirill    3084  2959  0 01:24 pts/7    00:00:00 bash
kirill    3868  3857 73 01:55 pts/1    00:00:29 /bin/bash -c s=$(printf "%-1000000s" '~') ; echo "${s// /'~'}"
kirill    3878  3084  0 01:56 pts/7    00:00:00 grep --color=auto bash

Child process still alive. Even more it consumes CPU (sample of top-command):

Tasks: 186 total,   2 running, 184 sleeping,   0 stopped,   0 zombie
%Cpu(s): 99.3 us,  0.7 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem:   2063792 total,  1850116 used,   213676 free,   124244 buffers
KiB Swap:  2094076 total,        0 used,  2094076 free.   919536 cached Mem

  PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND                                                                     
 3868 kirill    20   0   12472   8032   1152 R 97.0  0.4   4:11.74 /bin/bash -c s=$(printf "%-1000000s" '~') ; echo "${s// /'~'}"

(but I suppose that high CPU consuming related to printing 1000000 symbols, for lower values of n CPU is not used since some time)
What max value of n should I set for preventing hanging of this child process? Binary search gave me n=65535. Process usually hangs if I set n = 65536. Process usually does not hang if I set n = 65535.
The solution comes from documentation:

Because some native platforms only provide limited buffer size for standard input and output streams, failure to promptly write the input stream or read the output stream of the subprocess may cause the subprocess to block, or even deadlock.

So, we should read data from pipe buffers as often as possible. The code below helps if we don’t want to use separated threads for reading data from child process:

    public static void main(String[] args) throws IOException, InterruptedException {
	    int n = 65536;
        String[] cmd = {
                "/bin/bash",
                "-c",
                "s=$(printf \"%-"+ n + "s\" '~') ; echo \"${s// /'~'}\""
        };
        Process process = new ProcessBuilder(cmd).start();
        ByteArrayOutputStream outputStreamFromProcess = new ByteArrayOutputStream();
        ByteArrayOutputStream errorStreamFromProcess = new ByteArrayOutputStream();
        final long THREAD_SLEEP_MS = 10;
        final byte[] buffer = new byte[1 << 15];
        while (process.isAlive()) {
            while (Math.max(moveBytesBetweenStreams(process.getInputStream(), outputStreamFromProcess, buffer),
                    moveBytesBetweenStreams(process.getErrorStream(), errorStreamFromProcess, buffer)) > 0) {
            }
            Thread.sleep(THREAD_SLEEP_MS);
        }
        // Process has terminated, let's read remain data from process:
        while (moveBytesBetweenStreams(process.getInputStream(), outputStreamFromProcess, buffer) > 0) {
        }
        while (moveBytesBetweenStreams(process.getErrorStream(), errorStreamFromProcess, buffer) > 0) {
        }
        if (process.exitValue() == 0) {
            System.out.println(outputStreamFromProcess.size());
        } else {
            throw new IllegalStateException("Program has terminated and returned non-zero code: " + process.exitValue());
        }
    }

    private static int moveBytesBetweenStreams(InputStream from, OutputStream to, byte[] buffer) throws IOException {
        if (from.available() > 0) {
            int read = from.read(buffer);
            if (read > 0) {
                to.write(buffer, 0, read);
            }
            return read;
        } else {
            return 0;
        }
    }

The main loop of the code above works until process alive: it simply reads outputs from child process. If both out-stream and error-stream are empty, main thread goes to sleep for 10 ms and then checks streams again. After child process has terminated java thread reads remain data from streams. Note that this code prints 65537 since the last byte of output is ‘\n’. In this code child process does not hang.

String interning and number of buckets in Java String Pool

I have a collection of documents (texts) stored in file. The file contains almost 110000 documents. For each document the document data is placed in one line, so I have almost 110000 lines there.
My simple non-commercial program reads this file, splits each line (document) into words, and creates object structure for each document (each document is represented as array of words with additional information for document).
File has almost 1g of data, so placing each word as new object in heap leads to OutOfMemoryError very quickly (there are memory limitations on machine for running my program with high -Xmx value).
So, seems to be good to intern String objects that represent words. There are many words in all documents, but many of them are repeated. The list of words for each document may be created and modified like this:

            StringTokenizer tokens = new StringTokenizer(text, " \t\n\r\f\",.;:-+*()[]|/–?=_&·");
            List<String> list = new ArrayList<String>();
            while (tokens.hasMoreTokens()) {
                list.add(tokens.nextToken().toLowerCase().intern());
            }

I was surprised that reading and parsing my file with documents took 317 seconds. It’s obviously unexpected amount of time. Note that reading from file was wrapped into BufferedReader with input buffer size = 1m (not 8k that is default). The program was running on Oracle HotSpot 1.8:
kirill@ubuntu:~$ java -server -version
java version "1.8.0_05"
Java(TM) SE Runtime Environment (build 1.8.0_05-b13)
Java HotSpot(TM) Server VM (build 25.5-b02, mixed mode)

If you want to profile your Java application without any external profilers, you can use -Xprof JVM option. But note that this option must be disabled in production systems (it slows JVM). So, I ran my InformationRetrieval application with -Xprof option:
java -server -Xms768m -Xmx768m -Xprof -Xloggc:gc.log -XX:+PrintGCDateStamps -XX:+PrintGCDetails -XX:+PrintTenuringDistribution -XX:+PrintAdaptiveSizePolicy -jar InformationRetrieval.jar data.log 1>out.log 2>err.log &

Parsing time of file was increased to 572 seconds (thanks to -Xprof). At the end of stdout (I redirected stdout to out.log) there are profiling data lines (the lines that are not interesting were removed):

Flat profile of 814.87 secs (47159 total ticks): main

  Interpreted + native   Method                        
  0.0%    20  +     0    InformationRetrieval.readPages
  0.0%    13  +     0    InformationRetrieval.writeGroupsToFile
  ...............................................................
  0.2%    94  +     8    Total interpreted (including elided)

     Compiled + native   Method                        
 28.6% 13492  +     1    InformationRetrieval.getFuzzyGroups
  6.0%  2815  +     0    InformationRetrieval$Page.
  0.9%   416  +     1    java.lang.String.toLowerCase
  0.4%   201  +     0    java.util.HashMap.put
  0.4%   190  +     1    java.util.HashMap.getOrDefault
  0.2%   103  +     2    sun.nio.cs.UTF_8$Decoder.decodeArrayLoop
  0.1%    57  +     0    java.io.BufferedReader.readLine
  0.1%    57  +     0    InformationRetrieval.readPages
  ...............................................................
 37.0% 17420  +    39    Total compiled (including elided)

         Stub + native   Method                        
 62.4%     0  + 29415    java.lang.String.intern
  0.3%     0  +   165    java.io.FileInputStream.readBytes
  0.0%     0  +     7    java.io.FileOutputStream.writeBytes
  0.0%     0  +     6    java.io.FileInputStream.available
  0.0%     0  +     3    java.lang.System.arraycopy
 62.8%     0  + 29596    Total stub

  Thread-local ticks:
  0.0%     2             Class loader


Global summary of 814.88 seconds:
100.0% 47464             Received ticks
  0.6%   304             Received GC ticks
  0.6%   278             Compilation
  0.0%     2             Class loader

Easy to find this line:
62.4% 0 + 29415 java.lang.String.intern
String interning took 62.4% of all received ticks.

To understand this problem, we need to add -XX:+PrintStringTableStatistics JVM option (-Xprof may be removed):
java -server -Xms768m -Xmx768m -Xloggc:gc.log -XX:+PrintGCDateStamps -XX:+PrintGCDetails -XX:+PrintTenuringDistribution -XX:+PrintAdaptiveSizePolicy -XX:+PrintStringTableStatistics -jar InformationRetrieval.jar data.log 1>out.log 2>err.log &

At the end of out.log file:

SymbolTable statistics:
Number of buckets       :     20011 =     80044 bytes, avg   4.000
Number of entries       :     15224 =    182688 bytes, avg  12.000
Number of literals      :     15224 =    627352 bytes, avg  41.208
Total footprint         :           =    890084 bytes
Average bucket size     :     0.761
Variance of bucket size :     0.759
Std. dev. of bucket size:     0.871
Maximum bucket size     :         6
StringTable statistics:
Number of buckets       :      1009 =      4036 bytes, avg   4.000
Number of entries       :    348580 =   4182960 bytes, avg  12.000
Number of literals      :    348580 =  15964728 bytes, avg  45.799
Total footprint         :           =  20151724 bytes
Average bucket size     :   345.471
Variance of bucket size :   327.286
Std. dev. of bucket size:    18.091
Maximum bucket size     :       396

As we can see in String Table statistics section, there are only 1009 buckets in Java String Pool (they occupied 4036 bytes) and 348580 interned strings (they occupied 15964728 bytes). The average bucket size is 345.471: after each invocation of String.intern() CPU may compare given string with other 345 strings that are already placed at right bucket. So, 1009 is inappropriate value for Java String Pool and should be changed.

Let’s find where 1009 comes from:

kirill@ubuntu:~$ java -server -XX:+PrintFlagsFinal -version 2>/dev/null | grep 1009
    uintx StringTableSize                           = 1009            {product}

So, we need to change -XX:StringTableSize. Let’s set it to 1000003 (please note that it’s desirable to set number of buckets in Java String Pool to prime number, 1009 and 1000003 are prime numbers):
java -server -Xms768m -Xmx768m -Xloggc:gc.log -XX:+PrintGCDateStamps -XX:+PrintGCDetails -XX:+PrintTenuringDistribution -XX:+PrintAdaptiveSizePolicy -XX:+PrintStringTableStatistics -XX:StringTableSize=1000003 -jar InformationRetrieval.jar data.log 1>out.log 2>err.log &

New String Table statistics section is:

StringTable statistics:
Number of buckets       :   1000003 =   4000012 bytes, avg   4.000
Number of entries       :    348580 =   4182960 bytes, avg  12.000
Number of literals      :    348580 =  15964728 bytes, avg  45.799
Total footprint         :           =  24147700 bytes
Average bucket size     :     0.349
Variance of bucket size :     0.354
Std. dev. of bucket size:     0.595
Maximum bucket size     :         7

Reading and parsing collection of documents took 49 seconds instead of 317 seconds. The number of buckets in Java String Pool is 1000003 and they occupied 4000012 bytes (instead of 4036 bytes for 1009) and Java String Pool feels OK. Note that it’s OK to set number of buckets to [(Number of literals) / 0.75], but I decided to increase this value even more.
So, sometimes it’s important to increase number of buckets in Java String Pool.

Concurrent Mode Failure in practice, part 2

This is the second post about concurrent mode failure (please see concurrent mode failure, part 1)

If CMS cannot free tenured generation it may continue with Full GC. Good explanation about concurrent mode failure in Java Performance by Charlie Hunt (see here) on page 117 and in Garbage Collection Tuning (see Garbage Collection Tuning).
In case with CMS, concurrent mode failure initiates Full GC that is non-parallel stop-the-world fallback. If Full GC happens too often it is a signal that heap size of JVM is too low or GC tuning should be performed.

The first thing that should be done is to add -XX:+PrintGCDetails -XX:+PrintGCDateStamps as JVM options if it wasn’t here before. The first option enables GC logging to stdout (in addition, add -Xloggc: for logging to a file instead of stdout). The second option enables times in GC logging in YYYY-MM-DDThh:mm:ssZ format. Note that if you use this option then -XX:+PrintGCTimeStamps should be removed (or set -XX:-PrintGCDateStamps): in another case you will have both times (time in YYYY-MM-DDThh:mm:ssZ format and total amount of seconds from the start of this JVM) in GC log that may be undesirable. The typical line in output is:
2014-01-26T20:00:54.056+0400: [GC [ParNew: 307167K->586K(345024K), 0.0132780 secs] 6916030K->6609567K(12544576K) icms_dc=72 , 0.0134680 secs] [Times: user=0.19 sys=0.00, real=0.01 secs]

Next JVM option is -XX:+PrintTenuringDistribution. This option enables extended logging for GC and information about objects ages in young generation will be printed to log. The typical line in output is:
2014-01-28T14:40:09.851+0400: [GC [ParNew
Desired survivor size 26836992 bytes, new threshold 4 (max 4)
- age 1: 6662512 bytes, 6662512 total
- age 2: 262160 bytes, 6924672 total
: 419456K->6870K(471872K), 0.0114110 secs] 419456K->6870K(477248K), 0.0115030 secs] [Times: user=0.08 sys=0.00, real=0.02 secs]

This indicates that 26836992 bytes are threshold for survivor space, new threshold 4 (max 4) means that objects may stay in young generation during 4 GC invocations before they will be moved to tenured generation. Next 2 lines indicate that after this GC objects with age 1 occupied 6662512 bytes and objects with age 2 occupied 262160 bytes. The last column in these 2 lines is cumulative sum of bytes.

So, we have these JVM options that are useful for getting better understanding of GC behaviour:
-XX:+PrintGCDetails
-XX:+PrintGCDateStamps
-Xloggc:gc.log (optional, GC log will be stored to gc.log file instead of stdout)
-XX:+PrintTenuringDistribution

Next step is reproducing the workload for JVM with options above. The difficulty of reproducing stress workload (or even only production workload) on test environment depends on your application. It’s another history and I suppose you have already got GC log and problem already reproduced.

The next step is GC log analysis. There are few questions you should really know how to answer about each invocation in GC log:
1. About young generation:
What is the difference in young generation objects distribution between 2 GC invocations? How many objects were survived? How many of them were copied into an empty survivor space and how many of them were copied into tenured generation? Why was this GC invocation so long? What is the pause between 2 GC invocations?
2. About tenured generation:
Why tenured generation cleanup was started at this point of time? After analyzing GC log you may be surprised that sometimes tenured generation cleanup is started when tenured generation is almost full, and sometimes it is started when tenured generation is almost empty (especially in Concurrent Mark-Sweep collector). What is the trend of space usage in tenured generation before garbage collection? What is the trend of space usage after garbage collection? Why concurrent mode failure happened here?

Let’s take 2 GC invocations in young generation (the lines below are copied from GC log):

2014-01-28T14:42:44.102+0400: [GC [ParNew
Desired survivor size 26836992 bytes, new threshold 1 (max 4)
- age 1: 29692568 bytes, 29692568 total
- age 2: 12514000 bytes, 42206568 total
: 461639K->52416K(471872K), 0.0553180 secs] 6447587K->6050819K(12530496K), 0.0554520 secs] [Times: user=0.81 sys=0.01, real=0.06 secs]
2014-01-28T14:42:45.794+0400: [GC [ParNew
Desired survivor size 26836992 bytes, new threshold 1 (max 4)
- age 1: 32051328 bytes, 32051328 total
: 471872K->42579K(471872K), 0.0656970 secs] 6470275K->6067036K(12530496K), 0.0658310 secs] [Times: user=0.84 sys=0.01, real=0.07 secs]

These lines indicate that survivor threshold is 26 mb, max age threshold is 4, but there are no objects in second invocation that reached age 2. Total size of objects in young generation after first GC invocation is 42 mb (which is greater than 26 mb), so copying to tenured generation is occurred. The total used space in heap is increased from 6050m to 6067m. The time interval between these 2 invocations is about 1.7 seconds but GC stop-the-world pauses are short.
Sometimes time intervals between GC invocations are too low and young generation almost full of live objects after GC collection.
So, we need to decrease migrations to tenured generation. In general, it may be done by several things:
1. Increasing the size of young generation.
2. Increasing the age threshold.
3. Changing the ratio between eden and survivor spaces (increasing the size of survivor).
4. Tuning you application.
Despite the fact that producing production workload is difficult, I had all possibilities for that. After testing many configurations, I stayed with this one (the options below are tuned options for young generation for my application):
1. -XX:+UseParNewGC – this option enables parallel collector in young generation that may run concurrently with CMS collector (parallel scavenge collector). This option is default if CMS collector is enabled (-XX:+UseConcMarkSweepGC).
2. -XX:NewSize=3g – the size of young generation is 3072 mb.
3. -XX:InitialTenuringThreshold=15 – initial age threshold for objects in young generation is 15.
4. -XX:MaxTenuringThreshold=15 – age threshold for objects in young generation is 15.
5. -XX:SurvivorRatio=2 – the ratio between eden space and survivor space is 2. This means that there are 756 mb per each survivor space and 1536 mb per eden space.
6. -XX:TargetSurvivorRatio=90 – the threshold of survivor space is equal to this value (divided by 100) multiplied by survivor space. The larger values prevent more objects from tenuring prematurely, but there may be problems if graph of allocation rates contains spikes (please see http://www.techpaste.com/2012/02/22/java-command-line-options-jvm-performance-improvement). The default value is 50.
7. -XX:+UseAdaptiveSizePolicy – disables dynamic configuration of previous options for parallel scavenge collector.
Please note that I didn’t change the total heap size (14 gb).

After setting those parameters my application started to work. Migrations to tenured generation were reduced, but concurrent mode failure still happen each hour. I found that sometimes CMS starts to clean tenured generation too late. I made some calculations and found that 5.0 gb of all objects are internal caches of my application and they are changed or cleaned very rarely. The CMS in almost all cases collects the data in tenured generation and leaves these 5.0 gb. These options were helpful for making CMS more predictable:
1. -XX:+UseConcMarkSweepGC – enables CMS. It was enabled before because this post and previous post about CMS.
2. -XX:+UseCMSInitiatingOccupancyOnly – CMS will run if objects in tenured generation occupy constant amount of space rather than using adaptive strategy for this. As I mentioned above, sometimes CMS starts to clean tenured generation too late and concurrent mode failure occurs.
3. -XX:CMSInitiatingOccupancyFraction=50 – in my case CMS will run if tenured generation is filled with (14-3)*50/100=5.5 gb of objects. This option indicates constant amount of space for previous parameter.
4. -XX:+CMSScavengeBeforeRemark – this option enables collection in young generation before Remark phase in CMS. Remark phase pause may be shorter after collecting young generation.

After setting options above collection in tenured generation was normal and this saved my application.

So, JVM options for my application related to GC are -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+UseCMSInitiatingOccupancyOnly -XX:CMSInitiatingOccupancyFraction=50 -XX:-UseAdaptiveSizePolicy -XX:+CMSScavengeBeforeRemark -XX:NewSize=3g -XX:+PrintTenuringDistribution -XX:InitialTenuringThreshold=15 -XX:MaxTenuringThreshold=15 -XX:SurvivorRatio=2 -XX:TargetSurvivorRatio=90

Concurrent Mode Failure in practice, part 1

Recently I found that my application sometimes “does not work”.
It handles http-requests (and many other synchronous requests) and responses to caller very slow sometimes. In this post I will not give any information about failover policies and machine recovery in cluster, it doesn’t matter. We only have that sometimes it produces responses up to 100x slower than it should be.
In log4j logs the situation was like this: it started the execution of the request, proceeded a few steps, then it “hung” for a long time, and after “hunging” it ended it’s execution. Of course the first reason of Java application “hanging” (suspension) is excessive GC. Since my application has it’s high-concurrency internal workload that is not related to external requests, excessive GC looks real. So, it was a time to look at GC logs.
The typical line in GC log was something like this:
283280.428: [GC 283280.428: [ParNew: 1258304K->139776K(1258304K), 0.1880170 secs] 10825965K->9977370K(12443136K), 0.1881800 secs] [Times: user=2.74 sys=0.00, real=0.19 secs]
This means that garbage collection in young generation occurred after 283280.428 seconds of JVM start. It reduced memory usage in young generation from 1258304K to 139776K and spent 0.19 seconds. The total memory usage in heap was reduced from 10825965K to 9977370K. Actually, the total memory in heap is 1258304K but be careful: this value actually may not be equal to JVM -Xmx parameter and should be between -Xms and -Xmx values. The ParNew indicates that parallel collector works in young generation and stop-the-world pauses occur during collections. Actually there was no problem with minor collections. I mean there was no visible problems with Eden or both Survivor spaces.
What about tenured generation? In this application the collector in tenured generation is Concurrent Mark-Sweep (CMS). This actually works concurrently with application threads (with short stop-the-world pauses at 2 of 6 GC phases). The phases are:

Initial mark
This is stop-the-world pause. GC marks all object reachable from application threads, static variables and others. The line related to this phases looks like this:
12993.795: [GC [1 CMS-initial-mark: 8498366K(11184832K)] 8647061K(12443136K), 0.1610800 secs] [Times: user=0.16 sys=0.00, real=0.16 secs]

Mark
In this phase GC marks all live objects concurrently with application threads. GC searches all objects rechable from objects that were found in initial mark phase. The lines related to this phase look like:
12993.957: [CMS-concurrent-mark-start]
12997.311: [CMS-concurrent-mark: 3.352/3.354 secs] [Times: user=19.88 sys=0.20, real=3.36 secs]

Preclean
Since previous phase is executed concurrently with application threads, there may be objects that were modified during previous phase. Tenured generation is splitted into chunks called cards. If we update reference field of some object then appropriate card that contains this reference is marked as dirty. All dirty cards (they include references to newly created objects during mark phase) are found here. You can find more about this phase at Jon Masamitsu’s blog (https://blogs.oracle.com/jonthecollector/entry/did_you_know). The lines related to this phase look like:
12997.311: [CMS-concurrent-preclean-start]
12997.617: [CMS-concurrent-preclean: 0.298/0.306 secs] [Times: user=0.64 sys=0.01, real=0.30 secs]
12997.617: [CMS-concurrent-abortable-preclean-start]
12997.747: [GC 12997.747: [ParNew: 1252211K->139776K(1258304K), 0.5655490 secs] 9750577K->9040052K(12443136K), 0.5657350 secs] [Times: user=7.90 sys=0.02, real=0.56 secs]
CMS: abort preclean due to time 13003.750: [CMS-concurrent-abortable-preclean: 5.549/6.133 secs] [Times: user=17.06 sys=0.17, real=6.14 secs]

Remark (rescan)
This is stop-the-world phase. In this phase CMS catches all changes that were made during previous concurrent phases. The line is:
13003.752: [GC[YG occupancy: 677602 K (1258304 K)]13003.752: [Rescan (parallel) , 0.3685750 secs]13004.649: [weak refs processing, 0.0001350 secs] [1 CMS-remark: 8900276K(11184832K)] 9577878K(12443136K), 0.8974050 secs] [Times: user=5.93 sys=0.04, real=0.90 secs]

Sweep
In this concurrent phase CMS adds updates freelists for being used for future allocations. Here are the lines:
13004.650: [CMS-concurrent-sweep-start]
13014.364: [CMS-concurrent-sweep: 9.714/9.714 secs] [Times: user=10.08 sys=0.30, real=9.71 secs]

Reset
In this concurrent phase CMS resets all it’s internal structures for next collections. The lines related to this phase are:
13014.364: [CMS-concurrent-reset-start]
13014.391: [CMS-concurrent-reset: 0.028/0.028 secs] [Times: user=0.04 sys=0.00, real=0.03 secs]

It’s all about CMS phases. While searching through GC log file I found interesting lines:
13277.489: [GC 13277.489: [ParNew: 1258304K->1258304K(1258304K), 0.0000310 secs]13277.489: [CMS13288.998: [CMS-concurrent-sweep: 13.054/13.058 secs] [Times: user=15.55 sys=0.09, real=13.05 secs]
(concurrent mode failure): 10772025K->5233535K(11184832K), 35.9192040 secs] 12030329K->5233535K(12443136K), [CMS Perm : 42832K->42821K(71392K)], 35.9194630 secs] [Times: user=35.86 sys=0.03, real=35.93 secs]

As you know, concurrent mode failure retreats to non-parallel STW collector. In these lines we see that this fallback reduced memory usage from 12030329K to 5233535K but it took 35.93 seconds. So long! Looks like this is the real problem and root cause of application “hanging”. This fallback occurred twice per each 10 minutes.
This problem requires deep analysis of internal life of your application. This problem needs GC tuning. I will write in next post how to deal with this problem and how I resolved this issue.

Java concurrency: CountDownLatch

Suppose you are given N threads that should wait until M tasks finish their execution (in general, these tasks are executed in other M threads). There are several approaches to solve this. One of them is to use CountDownLatch object from java.util.concurrent package. CountDownLatch maintains internal counter (called “latch”) that represents total number of tasks that should be completed. When task completes its execution, it should call countDown() method to decrease internal counter by 1. If thread calls await() it will be blocked until CountDownLatch counter reaches 0.
So, N waiter-threads should call await() and N task-threads should call countDown() when they complete their work. The constructor of CountDownLatch takes one argument – initial value of counter. Do not forget to call countDown() inside finally-block: in cases when exception occurs and countDown() will not be executed, the counter never be decreased and waiter-threads may wait forever.
There is also await(long timeout, TimeUnit unit) method: causes current thread to wait until counter reaches 0 or until specified time elapsed.
Please note about happens-before relationship between actions before countDown() and after await(). From http://docs.oracle.com/javase/7/docs/api/java/util/concurrent/CountDownLatch.html: Memory consistency effects: Until the count reaches zero, actions in a thread prior to calling countDown() happen-before actions following a successful return from a corresponding await() in another thread.
Suppose we have method long calculate(long[] array, int fromInclusive, int toExclusive) that calculates partial sum of array from fromInclusive index (inclusive) to toExclusive index (exlusive). We want to calculate the total sum of all elements of this array. We can call calculate(array, 0, array.length) but another approach is to split calculations between M different threads and then join all results together. For example, in case M = 2 first thread may calculate sum of first half of array and second thread may calculate sum of second half of array. After that both partial results will be summed to result.
We have one main thread (N = 1) and M other threads that perform calculations. The main thread should call await() and worker threads should call countDown(). Please find a code below for an example.



import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/**
 *
 * @author Kyrylo Holodnov
 */
public class Main {

    /**
     * @param args the command line arguments
     */
    public static void main(String[] args) {
        // Setting number of working threads
        final int WORKER_THREADS_COUNT = 1;
        // Setting total array size
        final int ARRAY_SIZE = 10000000;
        // Initialization of array with array[0] = FIRST_VALUE and array[i] = array[i - 1] + 1
        final long[] array = new long[ARRAY_SIZE];
        final long FIRST_VALUE = Integer.MAX_VALUE;
        for (int i = 0; i < ARRAY_SIZE; i++) {
            array[i] = FIRST_VALUE + i;
        }
        // Expected sum is a sum of arithmetic progression
        long EXPECTED_SUM = (FIRST_VALUE + (FIRST_VALUE + ARRAY_SIZE - 1)) * ARRAY_SIZE / 2;
        // Creating CountDownLatch
        final CountDownLatch gate = new CountDownLatch(WORKER_THREADS_COUNT);
        // Creating empty array of partial sums: this is a small array
        final long[] partialSums = new long[WORKER_THREADS_COUNT];
        // Creating executor for submitting tasks
        final ExecutorService executor = Executors.newFixedThreadPool(WORKER_THREADS_COUNT);
        // Setting interval length which should be used by each worker for calculations
        int step = ARRAY_SIZE / WORKER_THREADS_COUNT;
        if ((ARRAY_SIZE % WORKER_THREADS_COUNT) > 0) {
            step++;
        }
        for (int i = 0; i < WORKER_THREADS_COUNT; i++) {
            final int index = i;
            final int fromInclusive = index * step;
            final int toExclusive = Math.min(fromInclusive + step, ARRAY_SIZE);
            executor.execute(new Runnable() {
                @Override
                public void run() {
                    try {
                        partialSums[index] = calculate(array, fromInclusive, toExclusive);
                    } finally {
                        gate.countDown();
                    }
                }
            });
        }
        try {
            gate.await();
        } catch (InterruptedException ignored) {
        }
        executor.shutdown();
        // Calculating actual sum after all workers completed their work
        long actualSum = 0;
        for (int i = 0; i < WORKER_THREADS_COUNT; i++) {
            actualSum += partialSums[i];
        }
        // If actual sum is not equal to expected sum then some errors occurred before
        if (actualSum != EXPECTED_SUM) {
            System.err.println("Error while executing: expected final result = " + EXPECTED_SUM
                    + ", but actual final result = " + actualSum);
        } else {
            System.out.println("Actual = expected = " + actualSum);
        }
    }

    private static long calculate(long[] array, int fromInclusive, int toExclusive) {
        long result = 0;
        for (int i = fromInclusive; i < toExclusive; i++) {
            result += array[i];
        }
        return result;
    }
}

ThreadPoolExecutor and Throwable

Suppose we are using ThreadPoolExecutor via java.util.concurrent.Executors.newFixedThreadPool(int) (or newCachedThreadPool()) invocation. When a new Callable<V> is submitted and executed, any Throwable that occurred during execution of Callable<V>.call(), will be caught and stored inside FutureTask<V> returned by submitting this Callable<V>. This field is inside FutureTask<V>.Sync-class and standard invocation of FutureTask<V>.get() delegates the call to Sync-object that checks if this field with Throwable is not null. If it is not null then this Throwable is wrapped to ExecutionException and finally ExecutionException is thrown in FutureTask<V>.get() invocation. If you submit a task to executor and never retrieve result from FutureTask, you will never know that some Throwable occurred.

The code below creates ThreadPoolExecutor with one thread, and one task is submitted to execution. This task creates StringBuilder and appends more and more numbers to it (0, 1, 2, 3, …). After several seconds OutOfMemoryError occur in thread that executes this task and the execution is stopped (the message “Finally-block for thread with id = %” will be printed to standard output stream: this indicates that the execution “jumped” from infinite loop to “finally” block). The line “return “Success”;” is never reachable. In last loop we retrieve the result from Future-object and see that ExecutionException occur during FutureTask<V>.get() invocation. If we comment/remove the last loop, we will never know that OutOfMemoryError occurred.


import static java.lang.Thread.currentThread;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/**
 *
 * @author Kyrylo Holodnov
 */
public class Main {    /**
     * @param args the command line arguments
     */
    public static void main(String[] args) {
        final int OOME_THREADS_COUNT = 1;
        ExecutorService executor = Executors.newFixedThreadPool(OOME_THREADS_COUNT);
        Future<string>[] future = new Future[OOME_THREADS_COUNT];
        for (int i = 0; i &lt; OOME_THREADS_COUNT; i++) {
            future[i] = executor.submit(new Callable<string>() {
                @Override
                public String call() throws Exception {
                    try {
                        StringBuilder builder = new StringBuilder();
                        int i = 0;
                        while (!currentThread().isInterrupted()) {
                            builder.append(i++);
                        }
                        return "Success";
                    } finally {
                        System.out.println("Finally-block for thread with id = " + currentThread().getId());
                    }
                }
            });
        }
        executor.shutdown();
        System.out.println("Executor: shutdown completed");
        for (int i = 0; i &lt; OOME_THREADS_COUNT; i++) {
            try {
                System.out.println(future[i].get());
            } catch (ExecutionException | InterruptedException ex) {
                ex.printStackTrace();
            }
        }
    }
}