By any chance, is the timeout used in LinkedBlockingQueue's poll methods in TimeUnit.NANOSECONDS?
Hello,
We have a Java Spring Boot API that shows a sudden increase in CPU consumption, from less than 10% to 60-70%, a few minutes into the load test. We captured some memory dumps when the CPU increase shows up, and analyzing the threads in them, we see a lot of them WAITING at jdk.internal.misc.Unsafe.park(Native Method). I understand that these threads are simply waiting for a task to be assigned, but could they be the cause? Could there be a blocking/locking issue with the queue? If so, how do I go from here?
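For context, here is a minimal sketch of the pattern I believe produces these stacks (class and thread names are illustrative, not from our code): idle pool workers block on the work queue and sit parked at Unsafe.park while consuming no CPU. From what I can see in the JDK, the keep-alive poll in ThreadPoolExecutor is issued with TimeUnit.NANOSECONDS, which seems to be where the nanoseconds unit in the title comes from.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class IdlePoolDemo {
    public static void main(String[] args) throws Exception {
        // A fixed pool whose workers block in LinkedBlockingQueue.take() while
        // waiting for work (or poll(keepAliveTime, TimeUnit.NANOSECONDS) when a
        // keep-alive timeout applies to the worker).
        ExecutorService pool = Executors.newFixedThreadPool(4, r -> {
            Thread t = new Thread(r, "demo-worker");
            t.setDaemon(true);
            return t;
        });
        pool.submit(() -> { });      // run one task so a worker thread is started
        TimeUnit.SECONDS.sleep(1);   // let the worker go idle again

        // An idle worker's stack typically reads Unsafe.park -> LockSupport.park
        // -> ... -> LinkedBlockingQueue.take, in WAITING state, and the thread
        // consumes no CPU while parked.
        Thread.getAllStackTraces().forEach((t, stack) -> {
            if (t.getName().startsWith("demo-worker")) {
                System.out.println(t.getName() + " state=" + t.getState());
                for (StackTraceElement e : stack) System.out.println("  at " + e);
            }
        });
    }
}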
Attaching some snapshots from Dynatrace that show SSLHandshake as the top consumer -
Regards,
Report URL - https://fastthread.io/my-thread-report.jsp?p=c2hhcmVkLzIwMjEvMDkvMjcvLS0yLnR4dC0tMTMtNTItNTI7Oy0tMS50eHQtLTEzLTUyLTUy
Hello Pooja!
Greetings. Here are my initial thoughts:
1. CPU spikes are always caused by threads in the RUNNABLE state. Threads in other states (WAITING, TIMED_WAITING, BLOCKED, NEW, TERMINATED) will not cause a CPU spike.
2. It looks like the data was captured while a heap dump capture was in progress through the Dynatrace OneAgent. You should capture thread dumps before capturing heap dumps. Capturing a heap dump freezes the JVM, so we wouldn't know which threads are actually causing the problem.
3. Sometimes garbage collection running in the background can also cause the CPU to spike.
In order to accurately pinpoint the lines of code causing the CPU spike, you need to analyze not only thread dumps but also the output of 'top -H -p {PID}', where {PID} is the process id of the Java application experiencing the CPU spike. When you issue the 'top' command with these arguments, it lists all the threads running in the application and the amount of CPU each thread consumes. Once you have both sets of data, you can identify the high-CPU-consuming threads and the lines of code they are executing. You can either capture these artifacts and analyze them manually, or use the 14-day trial version of the yCrash tool. yCrash automatically captures application-level data (thread dump, heap dump, garbage collection log) and system-level data (netstat, vmstat, iostat, top, top -H, dmesg, ...). It marries these two datasets and generates an instant root cause analysis report pointing out the exact lines of code causing the CPU spike. Here is more information on how to diagnose a high CPU spike.
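One small aid when matching the two datasets: top -H prints the OS thread id in decimal, while the thread dump prints the same id as the hex 'nid' value. A tiny converter (the class name here is just illustrative) lines them up:

public class NidConverter {
    public static void main(String[] args) {
        // Decimal thread id as shown in the PID column of 'top -H -p {PID}'
        long topThreadId = Long.parseLong(args[0]);
        // Prints the value to look for in the thread dump, e.g. nid=0x6f13
        System.out.println("nid=0x" + Long.toHexString(topThreadId));
    }
}

For example, 'java NidConverter 28435' prints nid=0x6f13, so the thread dump entry carrying nid=0x6f13 is the thread burning that CPU.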
Hello Pooja!
I see. If that is the case, can you capture thread dumps separately and, a few minutes later, capture heap dumps and analyze them? Don't capture them together as was done earlier.
a. You can analyze the captured thread dumps in fastThread.
b. You can use Eclipse MAT or HeapHero to analyze heap dumps. Here is a video clip that may assist you in analyzing heap dumps in Eclipse MAT.
You can also check with Amazon support whether there is a possibility to trigger a shell script on demand in the Fargate service. Let me also check whether there are any such options.
Hi Ram,
Thanks a lot for the pointers, I will check those today.
Meanwhile, what stands out as a hotspot with respect to CPU consumption is sun.security.rsa, as shown in the snip below -
When the CPU spikes to 60%, 20 minutes into the test, the https-jsse-nio-*-exec threads consume the most CPU, and their stack traces show the above methods.
Do you have any pointers around this?
Regards,
Hello Pooja!
The screenshot you shared indicates that the CPU consumption is originating from the SSL handshake. But SSL handshakes shouldn't consume this much CPU. Here is an excellent article which notes that when Google switched their Gmail service from HTTP to HTTPS, it added an overhead of only 1% CPU. So I can think of only 2 reasons why SSL (i.e. HTTPS) would add so much CPU overhead:
a. Your connections are not persistent. Either your load-testing tool, your application, or the AWS Fargate service is not keeping the HTTPS connections persistent; it is tearing down the connections after every transaction.
b. I have seen monitoring tools report the wrong methods as CPU suspects.
Most likely it is #a in your case. Check your HTTPS connection settings.
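For illustration, here is the client-side pattern that keeps HTTPS connections persistent, assuming a plain Java 11+ client (the URL is a placeholder). java.net.http.HttpClient pools and reuses connections, so the expensive handshake is paid once per connection rather than once per request; creating a new client (or a new connection) per request is the pattern to avoid.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class PersistentHttpsDemo {
    // One shared client: it maintains an internal connection pool, so repeated
    // requests to the same host reuse the established TLS connection.
    private static final HttpClient CLIENT = HttpClient.newHttpClient();

    public static void main(String[] args) throws Exception {
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://example.com/api/health")) // placeholder URL
                .GET()
                .build();
        for (int i = 0; i < 3; i++) {
            // Creating a new HttpClient inside this loop would defeat reuse and
            // force a fresh handshake every time.
            HttpResponse<String> response =
                    CLIENT.send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println("status=" + response.statusCode());
        }
    }
}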
Thanks.
Hi Ram,
These are TCP connections to the database. We do have connection pooling enabled, using HikariCP:
hikari.connectionTimeout=2000
hikari.initializationFailTimeout=0
hikari.maximumPoolSize=5
We have tried running the test with different pool sizes, but the CPU issue still persists.
Can it be related to any other configuration that is missing?
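For reference, the same settings expressed directly against HikariConfig (a minimal sketch; the JDBC URL is a placeholder, not our actual one, and the values simply mirror the properties above):

import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

public class PoolConfig {
    public static HikariDataSource dataSource() {
        HikariConfig config = new HikariConfig();
        config.setJdbcUrl("jdbc:postgresql://db-host:5432/appdb"); // placeholder URL
        config.setConnectionTimeout(2000);      // hikari.connectionTimeout=2000 (ms)
        config.setInitializationFailTimeout(0); // hikari.initializationFailTimeout=0
        config.setMaximumPoolSize(5);           // hikari.maximumPoolSize=5
        return new HikariDataSource(config);
    }
}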
Hi Ram,
Thanks for the reply. The application server is Tomcat and the DB is PostgreSQL.
I was referring to DB connections: while looking at the hotspots, I found this stack where we can see that SocketInputStream.socketRead0 takes considerable network time, which doesn't look normal. Mostly I have seen this time under 1%.
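In case it helps narrow things down, here is a minimal sketch of the driver-side knobs we could use to bound these reads (credentials and URL are placeholders; socketTimeout and tcpKeepAlive are standard PgJDBC connection options, and the values are only illustrative):

import java.sql.Connection;
import java.sql.DriverManager;
import java.util.Properties;

public class PgConnectionDemo {
    public static Connection open() throws Exception {
        Properties props = new Properties();
        props.setProperty("user", "app_user");         // placeholder credentials
        props.setProperty("password", "app_password");
        // Fail a statement if the driver waits on the socket for more than 30 s,
        // and enable TCP keep-alive probes on the connection.
        props.setProperty("socketTimeout", "30");
        props.setProperty("tcpKeepAlive", "true");
        return DriverManager.getConnection(
                "jdbc:postgresql://db-host:5432/appdb", props); // placeholder URL
    }
}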
Hi Ram,
Sorry for the confusion. Both of these issues are from the same test run, so I wanted to highlight it.
We were able to solve the CPU issue to some extent by changing the logging levels, increasing the connection pool size, and making some changes at the code level where a lot of mapper objects were getting created.
However, the performance is still not as per expectations, and I think the long time spent in the Socket.read() operation, as well as in SSL, as highlighted in the snip below, is the cause.
At the moment, we are not able to find any reason why it could be taking so long, when it's not a big query that returns a huge chunk of data.
We also see similar behavior with commitTransaction, where drilling down shows the same trace with socketRead0 consuming significant time.
What is more peculiar is that this is not seen with all the SQLs.
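As a way to separate query time from network/TLS/driver time, here is a minimal sketch of a client-side timing helper (the method name is hypothetical); comparing this elapsed time with the server-side time reported by EXPLAIN ANALYZE for the same statement shows whether the gap is in the database or on the wire:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class QueryTimingDemo {
    // Times a single query end to end on the client, including driver and
    // network time, so it can be compared with the server-side execution time.
    static long timeQueryMillis(Connection conn, String sql) throws Exception {
        long start = System.nanoTime();
        try (PreparedStatement ps = conn.prepareStatement(sql);
             ResultSet rs = ps.executeQuery()) {
            while (rs.next()) { /* drain the result set */ }
        }
        return (System.nanoTime() - start) / 1_000_000; // elapsed milliseconds
    }
}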