Tim OShea

G1 GC crashes after extended period of time, what steps can I take?

Hello all, thanks for reading. I have two reports: the one attached at the bottom of this question shows what eventually happens to my app after several hours (sometimes days), and the other, linked below as "Stable GC log", shows how the app behaves leading up to the crash.

 

As you can see, the promotion rate is nothing crazy compared to the creation rate, which is why I'm surprised the app eventually crashes.

 

Avg creation rate - 953.64 MB/sec
Avg promotion rate - 382 KB/sec

Stable GC log

 

Some things I'm starting to tune for my next iteration:

  • Increase G1ReservePercent to ~20%
  • Decrease InitiatingHeapOccupancyPercent to ~30% 

I figure that if I start the mixed collections sooner and reserve more space, the collections will be more evenly distributed over time and avoid to-space exhaustion. However, I'm still unsure what the root cause of the sudden behavior that crashes the app is. Any ideas?
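For concreteness, those two changes amount to editing just these flags (everything else stays as in the current command line below; the values are my proposed starting points, not verified fixes):

```shell
# Next-iteration tuning: more to-space headroom, earlier concurrent marking
-XX:G1ReservePercent=20                 # was 10
-XX:InitiatingHeapOccupancyPercent=30   # was 45
```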


Here are the CLI flags since they're not included in the reports.

Memory: 4k page, physical 72031152k(561148k free), swap 0k(0k free)

CommandLine flags:
-XX:+AlwaysPreTouch
-XX:CICompilerCount=15
-XX:ConcGCThreads=6
-XX:G1HeapRegionSize=8388608
-XX:G1ReservePercent=10
-XX:GCLogFileSize=5242880
-XX:HeapDumpPath=/dev/null
-XX:InitialHeapSize=25769803776
-XX:InitiatingHeapOccupancyPercent=45
-XX:MarkStackSize=4194304
-XX:MaxGCPauseMillis=1000
-XX:MaxHeapSize=25769803776
-XX:MaxNewSize=15460204544
-XX:MinHeapDeltaBytes=8388608
-XX:NumberOfGCLogFiles=5
-XX:OnOutOfMemoryError=/bin/kill -9 %p
-XX:+ParallelRefProcEnabled
-XX:+PerfDisableSharedMem
-XX:+PrintGC
-XX:+PrintGCDateStamps
-XX:+PrintGCDetails
-XX:+PrintGCTimeStamps
-XX:+PrintTenuringDistribution
-XX:+UseCompressedClassPointers
-XX:+UseCompressedOops
-XX:+UseFastUnorderedTimeStamps
-XX:+UseG1GC
-XX:+UseGCLogFileRotation
-XX:+UseStringDeduplication 



Report URL - https://gceasy.io/my-gc-report.jsp?p=YXJjaGl2ZWQvMjAyMS8wNC8xNC8tLWFwaS03NDU2YzEzZC1jNmM4LTQ5YTEtOTZkOS04NjI2Nzk5N2U5OTdjNmQ1ZTcwZi1iM2I0LTQ3NjktODEyMC1kMDAwNDk5NmFmYmUudHh0&channel=WEB

  • g1gccrash

  • stablegclog

  • spaceexhaustion

  • rootcause

  • cliflags

  • g1reservepercent

  • initiatingheapoccupancypercent



Ram Lakshmanan

Hello Tim!

 

 Greetings.
 
 From 10:05 pm to 10:40 pm your application is stable. Starting at 10:40 pm, your heap size begins to grow significantly: the object creation rate spikes and garbage collection can't keep up, so Full GCs kick in, resulting in very long pauses and ultimately the crash at 10:50 pm.
 
 This sort of behavior happens when the application starts to process a lot more work. Is it possible that:
 
 a. Some sort of batch process was launched around 10:40 pm?
 b. Starting at 10:40 pm, certain customer transactions loaded a heavy number of records from the DB, causing a memory spike?
 c. Traffic volume started to increase at 10:40 pm? Is there a metric or stat that will tell you whether it did?
 
 Based on your data, GC doesn't look like the problem. GC is only a reaction, not the cause, so tuning it will not solve the problem.

 

 You should focus on what is triggering the heap to spike. When the problem resurfaces, capture thread dumps and heap dumps from the application and analyze them. You can use the 14-day trial version of yCrash, which captures 360-degree data (thread dump, heap dump, GC log, netstat, iostat, vmstat, top, top -H, dmesg, disk usage...) and analyzes all those artifacts to generate a root cause analysis report. Or you can capture a thread dump and heap dump manually and analyze them with tools like fastThread and Eclipse MAT.
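For the manual route, the standard JDK tools can capture both artifacts on demand (the PID lookup and file names below are placeholders; adjust them to your process):

```shell
PID=$(pgrep -f my-app)   # placeholder: match your own Java process

# Thread dumps: take a few, ~10 seconds apart, to spot stuck patterns
jstack -l "$PID" > threaddump_$(date +%s).txt

# Heap dump of live objects (note: this triggers a full GC, and the
# file can be roughly as large as the used heap)
jmap -dump:live,format=b,file=heap_$(date +%s).hprof "$PID"

# Equivalent jcmd invocations:
jcmd "$PID" Thread.print > threaddump.txt
jcmd "$PID" GC.heap_dump "$PWD/heap.hprof"
```

One thing worth noting from your flag list: `-XX:HeapDumpPath=/dev/null` discards any automatically generated heap dump, so even if you later enable `-XX:+HeapDumpOnOutOfMemoryError`, the dump would be thrown away unless you also change that path. On-demand capture like the above sidesteps that.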
 


Tim OShea

Hey Ram, thanks a bunch for the analysis!! Without going too in-depth about the app: it essentially polls a constant number of requests from a work queue, which means there is no burst in traffic (good), but the queue can grow very large. Since the app works at a constant rate, I assumed the GC would be able to adapt to that rate, which led me to think that GC is the cause rather than a reaction. The amount of work the app has to do at 10:00 is the same as at 10:40.

 

If we assume it's acceptable for the GC to spend as much time as it needs between requests, is it possible to configure the GC to behave this way? The only other thing I can think of is calling System.gc() between requests.

 

Thanks for pointing out the 14 day trial, I will definitely take advantage of that.

 


Ram Lakshmanan

Hello Tim!

 

 Here are two possibilities I'm thinking of:

 

 a. I used to work for a financial institution. Certain (rare) customers have 250+ accounts. When those customers sign in, our application has to make 250+ backend calls to SORs (one call per account), read the transaction history of each account, and process it. Just one such customer request tends to chew up the entire memory and cause severe memory problems. From a traffic-volume perspective it still counts as one inbound request, but the intensity of the workload it creates in the application is incomparable with other transactions. Is it possible such heavyweight transactions were being processed at 10:40 pm?

 

b. From your description, your app polls a constant number of requests from a work queue. What if there aren't sufficient worker threads or resources in your application to keep up with the inbound workload? Then the heap size will start to grow.

 

Is it possible to pause the inbound workload? If so, is your application able to recover?

 

These questions can be answered by analyzing thread dumps/heap dumps.

 

I would not recommend calling System.gc() explicitly. The JVM itself knows when to run GC, and explicit calls add significant unnecessary overhead. Besides, if there is a memory leak or a heavy workload, calling System.gc() will not help us.
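As a guardrail, HotSpot also has standard flags to neutralize stray explicit GC calls (worth checking first that no library in the app relies on explicit GC, e.g. for freeing direct buffers):

```shell
# Ignore System.gc() calls entirely
-XX:+DisableExplicitGC

# Or, with G1: turn System.gc() into a concurrent cycle instead of a
# stop-the-world Full GC
-XX:+ExplicitGCInvokesConcurrent
```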


Tim OShea

Great points Ram, thanks for providing an example. My application has similar behavior. I'll deep dive into thread/heap dumps to see what I can tune.
