Hello Tim!
From 10:05 pm to 10:40 pm your application was stable. Starting from 10:40 pm your heap size began to grow significantly. After 10:40 pm the object creation rate spiked and garbage collection couldn't keep up, so Full GCs kicked in, causing very long pauses and ultimately the crash at 10:50 pm.
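To make that failure mode concrete, here is a minimal sketch (the class name and heap size are my own illustration, not from your application): objects stay reachable, so allocation outruns GC, every Full GC frees less, and the JVM eventually dies with an OutOfMemoryError.

```java
import java.util.ArrayList;
import java.util.List;

// Run with a small heap (e.g. java -Xmx64m AllocationSpike) and GC logging
// enabled to watch Full GC pauses grow until the process crashes.
public class AllocationSpike {
    public static void main(String[] args) {
        List<byte[]> retained = new ArrayList<>();
        while (true) {
            // Each iteration retains 1 MB; because the list keeps every
            // array reachable, no GC (including a Full GC) can reclaim it.
            retained.add(new byte[1024 * 1024]);
        }
    }
}
```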
This sort of behavior happens when the application starts to process many more transactions. Is it possible that:
a. Some sort of batch process launched around 10:40 pm?
b. Starting from 10:40 pm, a certain customer transaction loaded a heavy number of records from the DB? That can cause a memory spike (see the sketch after this list).
c. Traffic volume started to increase from 10:40 pm? Is there a metric/stat that will tell you whether traffic volume started to increase at 10:40 pm?
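For possibility (b), the usual shape of the problem looks like the hypothetical sketch below (the table, column, and class names are invented for illustration): a query with no row limit materializes the entire result set on the heap, so one unusually large request around 10:40 pm could explain the spike.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

public class OrderLoader {
    // Anti-pattern: one large customer can pull millions of rows onto the heap.
    static List<String> loadAllOrders(Connection conn, long customerId) throws SQLException {
        List<String> orders = new ArrayList<>();
        try (PreparedStatement ps =
                 conn.prepareStatement("SELECT payload FROM orders WHERE customer_id = ?")) {
            ps.setLong(1, customerId);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    // Every row is retained in the list -> heap spike for big customers.
                    orders.add(rs.getString("payload"));
                }
            }
        }
        return orders;
    }
}
```

The usual fix is to paginate the query or stream the rows and process them incrementally instead of retaining them all.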
Based on your data, GC doesn't look like the problem. GC is only a reaction, not the cause, so tuning GC will not solve the problem.
You should focus on what is triggering the heap to spike. When the problem resurfaces, capture thread dumps and heap dumps from the application and analyze them. You can use the 14-day trial version of yCrash, which captures 360-degree data (thread dump, heap dump, GC log, netstat, iostat, vmstat, top, top -H, dmesg, disk usage...) and analyzes all those artifacts to generate a root cause analysis report. Or you can manually capture a thread dump and a heap dump and analyze them through tools like fastThread and Eclipse MAT.
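If you want to capture those artifacts by hand, here is a minimal sketch using the standard HotSpot JMX beans (the output path is just an example); running jcmd <pid> Thread.print and jcmd <pid> GC.heap_dump <file> against the live process works equally well.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class DumpCapture {
    public static void main(String[] args) throws Exception {
        // Thread dump: print the stack of every live thread to stdout.
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        for (ThreadInfo info : threads.dumpAllThreads(true, true)) {
            System.out.print(info);
        }

        // Heap dump: write live objects to an .hprof file that
        // Eclipse MAT (or yCrash) can open.
        HotSpotDiagnosticMXBean diag =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        diag.dumpHeap("/tmp/app-heap.hprof", true /* live objects only */);
    }
}
```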