Hello Bill!
I reviewed your GC log analysis report. Below is the GC pause time graph from it:
I see only two occurrences of Full GC: one at 05:38pm and another at 06:30pm. Even those paused the JVM for only 1.2 seconds and 1.7 seconds, respectively. Pauses that short are unlikely to trigger HTTP 503 errors (unless you have attached a different GC log). You may want to confirm whether the HTTP 503 errors actually occurred around those time frames.
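If you want to cross-check those timestamps quickly, a small program that scans the access log for 503 responses around the two Full GC events can help. This is only a minimal sketch: it assumes a common-log-format access log at a hypothetical path (/var/log/app/access.log) and uses the 17:38 / 18:30 pause times read off the graph, so adjust both to your environment.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Duration;
import java.time.LocalTime;
import java.util.List;

public class Http503Correlator {

    // Full GC pause times taken from the GC report (17:38 and 18:30).
    private static final List<LocalTime> FULL_GC_TIMES =
            List.of(LocalTime.of(17, 38), LocalTime.of(18, 30));

    // How far around a Full GC a 503 is still considered "related" (assumption).
    private static final Duration WINDOW = Duration.ofMinutes(2);

    public static void main(String[] args) throws IOException {
        // Hypothetical access-log path; point this at your server's access log.
        Path accessLog = Path.of("/var/log/app/access.log");

        // Assumes common-log-style lines such as:
        // 10.0.0.1 - - [12/Oct/2023:17:38:41 +0000] "GET /api HTTP/1.1" 503 512
        try (var lines = Files.lines(accessLog)) {
            lines.filter(line -> line.contains("\" 503 "))   // 503 responses only
                 .filter(Http503Correlator::nearFullGc)      // within the GC window
                 .forEach(System.out::println);
        }
    }

    // Extracts the HH:mm:ss part of the bracketed timestamp and checks whether
    // it falls within WINDOW of either Full GC pause.
    private static boolean nearFullGc(String line) {
        int open = line.indexOf('[');
        if (open < 0) return false;
        // In dd/MMM/yyyy:HH:mm:ss the time starts 13 characters after '['.
        LocalTime t = LocalTime.parse(line.substring(open + 13, open + 21));
        return FULL_GC_TIMES.stream()
                .anyMatch(gc -> Duration.between(gc, t).abs().compareTo(WINDOW) <= 0);
    }
}
```

If no 503s show up in those two windows, the Full GC pauses are unlikely to be the cause.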
An HTTP 503 error indicates that the application is not ready to handle the request. There could be several reasons for this:
- Garbage collection pauses
- Threads getting BLOCKED
- Network connectivity issues
- Load balancer routing issues
- Heavy CPU consumption by threads
- Operating system running with outdated patches
- Memory leak
- Database not responding properly
...
So a thread dump alone is not enough to diagnose the problem. Moreover, you have captured only one snapshot of it. It's always a good practice to capture 3 thread dumps, with a gap of 10 seconds between each one. Besides thread dumps, you may have to capture other logs/artifacts to do a thorough analysis.
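If you don't have a capture script handy, here is a minimal sketch of that practice using the JDK's ThreadMXBean, taking 3 snapshots 10 seconds apart from inside the JVM. (The more common approach is simply running `jstack <pid>` three times from outside the process; the file names and interval below are only illustrative defaults.)

```java
import java.io.IOException;
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.LocalDateTime;
import java.util.concurrent.TimeUnit;

public class ThreadDumpCapture {

    public static void main(String[] args) throws IOException, InterruptedException {
        ThreadMXBean threadBean = ManagementFactory.getThreadMXBean();

        // Capture 3 snapshots, 10 seconds apart, as recommended above.
        for (int snapshot = 1; snapshot <= 3; snapshot++) {
            StringBuilder dump =
                    new StringBuilder("Thread dump taken at " + LocalDateTime.now() + "\n\n");

            // lockedMonitors/lockedSynchronizers = true so you can see what BLOCKED threads wait on.
            // Note: ThreadInfo.toString() truncates deep stacks; jstack gives full traces.
            for (ThreadInfo info : threadBean.dumpAllThreads(true, true)) {
                dump.append(info.toString());
            }

            // Hypothetical output location; adjust to a writable directory.
            Path out = Path.of("thread-dump-" + snapshot + ".txt");
            Files.writeString(out, dump.toString());
            System.out.println("Wrote " + out.toAbsolutePath());

            if (snapshot < 3) {
                TimeUnit.SECONDS.sleep(10);
            }
        }
    }
}
```

Comparing the three snapshots tells you whether the same threads stay BLOCKED (or stuck on the same stack frame) across the 20-second span, which a single snapshot cannot show.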
You can use the open source yCrash script, which captures 360-degree application-level artifacts (GC logs, 3 snapshots of thread dumps, heap dumps) and system-level artifacts (top, top -H, netstat, vmstat, iostat, dmesg, disk usage, kernel parameters...). Once you have this data, you can either analyze it manually or upload it to the yCrash tool, which analyzes all these artifacts together and generates one unified root cause analysis. That should point you to the root cause of the problem.