Hello Nathan!
Good question.
A Garbage Collection event spends a certain amount of time in the JVM layer and a certain amount of time in the kernel layer. The time it spends in the JVM layer is reported as 'user' time, and the time it spends in the kernel layer is reported as 'sys' time.
If 'sys' time is higher than 'user' time, it indicates that an environmental or kernel-level issue is driving the 'sys' time up. This is quite evident in your case, since the problem surfaces only in the GCP environment and not in your on-premise data center. There is some problem in your GCP deployment. Glad to see that you are looking at the right places:
- Memory on the VM
- Memory on the physical host/hypervisor
- Disk I/O
- CPU
Can you also check the kernel parameters?
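If it helps, here is a minimal sketch for comparing kernel settings between a GCP instance and an on-premise host. The settings listed are only my assumption of common candidates (swap and page-cache tunables, transparent huge pages), not a confirmed cause, and the names `CANDIDATE_SETTINGS`, `read_settings` and `diff_settings` are made up for illustration. Run it on both machines and compare the output.

```python
from pathlib import Path

# Assumed candidates only: adjust this list for your own environment.
CANDIDATE_SETTINGS = {
    "vm.swappiness": "/proc/sys/vm/swappiness",
    "vm.dirty_ratio": "/proc/sys/vm/dirty_ratio",
    "vm.dirty_background_ratio": "/proc/sys/vm/dirty_background_ratio",
    "transparent_hugepage": "/sys/kernel/mm/transparent_hugepage/enabled",
}


def read_settings(paths=CANDIDATE_SETTINGS):
    """Read each setting from its file; None if the file cannot be read."""
    result = {}
    for name, path in paths.items():
        try:
            result[name] = Path(path).read_text().strip()
        except OSError:
            result[name] = None
    return result


def diff_settings(first, second):
    """Return {name: (first_value, second_value)} for settings that differ."""
    names = sorted(set(first) | set(second))
    return {
        name: (first.get(name), second.get(name))
        for name in names
        if first.get(name) != second.get(name)
    }


if __name__ == "__main__":
    # Print the local values so they can be compared across hosts.
    for name, value in read_settings().items():
        print(f"{name} = {value}")
```

Once you have the output from both hosts, `diff_settings` can be fed the two dictionaries to list only the settings that differ.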
GCeasy reports the exact timestamps at which 'sys' time was greater than 'user' time. You might want to check how the environment was behaving at those exact times.
a. Do you have any APM tools with which you can go back in time and look at the system behaviour? Note that not all APM tools capture all the environmental data that we are looking for.
b. You can consider configuring the yCrash open source script on one of the GCP instances to run every 5 minutes (without capturing a heap dump). It will capture 360-degree data and not add any overhead. You can then analyze the captured output and see which environmental issue is triggering this problem.
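If you want to cross-check the 'sys' greater than 'user' events yourself, here is a rough sketch that scans raw GC log lines for them. It assumes the classic (pre-Java 9) `-XX:+PrintGCDetails` format, where each event ends with `[Times: user=... sys=..., real=... secs]`. Unified logging in newer JVMs prints these values differently, so the pattern would need adjusting. The function name `find_sys_heavy_events` is just for illustration.

```python
import re
import sys

# Assumes the classic HotSpot suffix, for example:
#   [Times: user=0.02 sys=0.10, real=0.05 secs]
TIMES_RE = re.compile(
    r"\[Times: user=(\d+\.\d+) sys=(\d+\.\d+), real=(\d+\.\d+) secs\]"
)


def find_sys_heavy_events(lines):
    """Return (line_number, user, sys, real) for events where sys > user."""
    hits = []
    for number, line in enumerate(lines, start=1):
        match = TIMES_RE.search(line)
        if match is None:
            continue  # not a GC event line, or a different log format
        user, sys_time, real = (float(group) for group in match.groups())
        if sys_time > user:
            hits.append((number, user, sys_time, real))
    return hits


if __name__ == "__main__":
    # Usage: python this_script.py /path/to/gc.log
    if len(sys.argv) > 1:
        with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
            for number, user, sys_time, real in find_sys_heavy_events(handle):
                print(f"line {number}: user={user} sys={sys_time} real={real}")
```

The line numbers it prints let you look up the timestamp at the start of each flagged log line and match it against your system metrics for the same moment.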