Profile Image
Shimon vasserman

Concurrent mode failure

Hi,

 

During running performance tests, “Concurrent mode failure” issue occurred twice.

 

From what I read online this could occur because :
1.  An object is promoted that is too large to fit into any contiguous space in memory.

2.  There isn't enough space in tenured to account for the rate of live objects being promoted.

 

Is it possible to know the specific reason from the logs :

 

2023-05-15T09:44:28.796+0000: 72404.340: [GC (Allocation Failure) 2023-05-15T09:44:28.797+0000: 72404.340: [ParNew (promotion failed): 1989926K->1978441K(2108608K), 1.2122562 secs]2023-05-15T09:44:30.009+0000: 72405.552: [CMS2023-05-15T09:44:32.507+0000: 72408.051: [CMS-concurrent-mark: 22.913/34.947 secs] [Times: user=783.96 sys=58.50, real=34.94 secs]

 (concurrent mode failure): 67102203K->66480666K(83958208K), 316.5658479 secs] 69078226K->66480666K(86066816K), [Metaspace: 86903K->86903K(94208K)], 317.7785441 secs] [Times: user=273.91 sys=71.50, real=317.78 secs]

 

2023-05-15T11:49:54.945+0000: 79930.488: [GC (Allocation Failure) 2023-05-15T11:49:54.945+0000: 79930.488: [ParNew (promotion failed): 2058684K->2071112K(2108608K), 0.7569158 secs]2023-05-15T11:49:55.702+0000: 79931.245: [CMS2023-05-15T11:49:56.994+0000: 79932.538: [CMS-concurrent-preclean: 1.750/2.579 secs] [Times: user=9.34 sys=0.68, real=2.58 secs]

 (concurrent mode failure): 64427482K->64792085K(83958208K), 270.7814328 secs] 66389686K->64792085K(86066816K), [Metaspace: 87069K->87069K(94208K)], 271.5387835 secs] [Times: user=227.52 sys=48.46, real=271.54 secs]

 

Thanks in advance,

  • concurrentmodefailure

  • gc-logs

  • cms-concurrent-mark

Please Sign In or to post your comment or answer

Profile Image

Ram Lakshmanan

Hello Shimon!

 

 Greetings.

 

 Oh ouch. Both the 'concurrent mode failure' GC events are causing very high pause times. First event took 317.78 seconds (i.e. 5.2 minutes). Second GC event took 271.54 seconds (i.e. 4.5 minutes). They are very high!!

 

 a. Can you share the JVM arguments that you are passing to your application?

 

 b. Also if possible can you share the GCeasy report of this GC log file? 

 

 Thanks.

Profile Image

Shimon vasserman

Hello Ram,

Thanks for your response,

 

Gceasy report - 

 

https://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMjMvMDUvMjMvcGNjX2djLmxvZy4wLS0xNi0zMy0yNw==&channel=WEB

 

JVM arguments - 

 

-XX:MaxMetaspaceSize=384m -Xms41856m -Xmx41856m -XX:-OmitStackTraceInFastThrow -XX:+ExplicitGCInvokesConcurrent -XX:-UseParallelGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70

 

-XX:SurvivorRatio=2 -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=10M

 

After checking online I understood that I also need to add -XX:+UseCMSInitiatingOccupancyOnly and indeed after I added this parameter the issue seems to be solved.

But I still would like to know the root cause if possible.

 

BR,
Shimon V.

Profile Image

Shimon vasserman

Quick update,
After I added XX:+UseCMSInitiatingOccupancyOnly 

'concurrent mode failure' happened again today, so issue wasn't solved.
I suspect that it's a 
fragmentation issue, but I don't know how to prove it.
Does anyone has an idea how to prove it?

Another question I have is how can I reslove this issue, do I have to switch to some other collector to prevent the long gc pauses or there is some other solution?

 

GcEasy anylsis recommends to change 'CMSInitiatingOccupancyFraction' to a lower value, but how will a lower value will help if the issue if fragmentation?


Thanks in advance,
Shimon V. 

Profile Image

Ram Lakshmanan

Hello Shimon!

 

 Greetings. I reviewed your GC Log. I would like to share following observation:

 

1. Your machine is overloaded: 

 

 There are 369 GC events in which 'sys' time is greater than 'usr' time. This is a very high number given that JVM itself was only running for 22 hours. This type of pattern happens when the machine(or VM or container) in which JVM is running is overloaded. Most likely there is a CPU/memory or I/O starvation happening on this device. You might want to fix it.

 

2. Very high Allocation Failures:

You are having very number of GC events triggered due to high allocation failures. Due to that 1:32 hours is paused. Currently your 

 

 

 To address this we can do couple of things. First one is to eliminate these two Flags from your JVM arguments. If it doesn't work, I have another suggestion, which I will let you once you try this one.

 

3. Concurrent mode failure

In order to reduce concurrent mode failure time reduce the '-XX:CMSInitiatingOccupancyFraction=40'. So that GC event will get triggered earlier and we don't see such long pause time.

 

Profile Image

Shimon vasserman

Hello Ram,

Thanks a lot for the detailed response.

1. Read the following post - 

https://blog.gceasy.io/2016/12/11/sys-time-greater-than-user-time/ - regarding 'sys' time is greater than 'usr' time - I suspect that there is a "Memory constraint" issue, but how can I be sure?

2. Very high Allocation Failures - Can you please exaplian how the removal of those flags

(-XX:+UseParallelGC, -XX:+UseParNewGC) will lower the high number of allocation failures? 

3. Concurrent mode failure - 
There is something basic that I probalbly don't understnad, from analysing gc logs, I can tell that when concurrent mode failure occurs, there is enough free heap space.

From what I read, CMS concurrent collection cycle does not perform compaction, so basically fragmentation is handled only during "Stop the world" event.

So even if I reduce the -XX:CMSInitiatingOccupancyFraction to 40, or to lower number it will only delay "stop the world" event as eventially the heap will be fragmented, is this correct?

 

Thanks again for you help,
Shimon V

Profile Image

Ram Lakshmanan

Hello Shimon!

 

> I suspect that there is a "Memory constraint" issue, but how can I be sure?

 

Look at the memory metrics in your monitoring tool on this device. If there are memory constraints it might be reported. OR if you would have run the yCrash in micrometrics monitoring mode https://docs.ycrash.io/ycrash-agent/micrometrics.html, it could have spotted the problem for you.

 

>Can you please exaplian how the removal of those flags(-XX:+UseParallelGC, -XX:+UseParNewGC) will lower the high number of allocation failures? 


First of all if you are configuring '-XX:+UseParallelGC', it means you are asking the JVM to use 'Paralle GC algorithm' and then you are asking the JVM to use 'CMS GC algorithm' as well by passing '-XX:+UseConcMarkSweepGC' argument. It's conflicting configuration. It's a misconfiguration

 

By configuring '-XX:+UseParNewGC' you are asking JVM to run parallel GC in Young Generation and CMS in old generation. I found that doesn't yield better performance in most of the cases where heap size is large like yours

Load More Comments

Got something else on mind? Post Your Question

Not the answer you're looking for? Browse other questions tagged
  • concurrentmodefailure

  • gc-logs

  • cms-concurrent-mark