Hello Janarthanan, greetings!
I looked into your thread dump, and I see the following sources of problems:
a. You have a thread pool named 'NetworkUpdateProcessorPool', and it keeps growing without bound. In the thread dump you captured at 10:20:19, this pool had 7,398 threads, of which 6,651 were in WAITING state (i.e. not doing anything). Can you confirm whether a maximum pool size is set on this thread pool? It looks like no maximum size is configured. A sketch of what a bounded pool configuration looks like follows below.
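Since I can't see how 'NetworkUpdateProcessorPool' is actually constructed in your code, the snippet below is only an illustrative sketch (class name and all numeric values are example placeholders). It shows a pool with an explicit maximum size and a bounded queue, which is the kind of cap that appears to be missing:

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class NetworkUpdatePoolConfig {   // hypothetical class, for illustration only

    // A bounded setup caps both the worker count and the backlog, so the
    // thread count can never climb to thousands the way your dump shows.
    static ThreadPoolExecutor boundedPool() {
        return new ThreadPoolExecutor(
                10,                                        // corePoolSize (example value)
                50,                                        // maximumPoolSize - the hard cap
                60, TimeUnit.SECONDS,                      // idle threads above core are reclaimed
                new ArrayBlockingQueue<>(1_000),           // bounded backlog instead of an unbounded queue
                new ThreadPoolExecutor.CallerRunsPolicy()); // back-pressure when the queue is full
    }
}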
b. I also see that java.util.TimeZone is blocking 305 threads. Below is the stack of the 'NetworkUpdateProcessorPool' thread that is holding the lock on the 'java.util.TimeZone' class and not releasing it:
java.lang.Thread.State: BLOCKED (on object monitor)
at java.util.TimeZone.getTimeZone(java.base@11.0.7/TimeZone.java:517)
- locked <0x00000006c015fb50> (a java.lang.Class for java.util.TimeZone)
at ch.qos.logback.contrib.json.JsonLayoutBase.formatTimestamp(Unknown Source)
at ch.qos.logback.contrib.json.JsonLayoutBase.addTimestamp(Unknown Source)
at ch.qos.logback.contrib.json.classic.JsonLayout.toJsonMap(Unknown Source)
at ch.qos.logback.contrib.json.classic.JsonLayout.toJsonMap(Unknown Source)
at ch.qos.logback.contrib.json.JsonLayoutBase.doLayout(Unknown Source)
at ch.qos.logback.core.encoder.LayoutWrappingEncoder.encode(LayoutWrappingEncoder.java:115)
at ch.qos.logback.core.OutputStreamAppender.subAppend(OutputStreamAppender.java:230)
at ch.qos.logback.core.OutputStreamAppender.append(OutputStreamAppender.java:102)
at ch.qos.logback.core.UnsynchronizedAppenderBase.doAppend(UnsynchronizedAppenderBase.java:84)
at ch.qos.logback.core.spi.AppenderAttachableImpl.appendLoopOnAppenders(AppenderAttachableImpl.java:51)
at ch.qos.logback.classic.Logger.appendLoopOnAppenders(Logger.java:270)
at ch.qos.logback.classic.Logger.callAppenders(Logger.java:257)
at ch.qos.logback.classic.Logger.buildLoggingEventAndAppend(Logger.java:421)
at ch.qos.logback.classic.Logger.filterAndLog_0_Or3Plus(Logger.java:383)
at ch.qos.logback.classic.Logger.info(Logger.java:591)
at com.xxxxx.tms.processor.tasks.TopologyOMSLinkProcessorTask.run(TopologyOMSLinkProcessorTask.java:39)
at java.util.concurrent.Executors$RunnableAdapter.call(java.base@11.0.7/Executors.java:515)
at java.util.concurrent.FutureTask.run(java.base@11.0.7/FutureTask.java:264)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(java.base@11.0.7/ScheduledThreadPoolExecutor.java:304)
at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.7/ThreadPoolExecutor.java:1128)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.7/ThreadPoolExecutor.java:628)
at java.lang.Thread.run(java.base@11.0.7/Thread.java:834)
Notice that 'java.util.TimeZone' is being invoked by the Logback (qos.ch) logging framework. Can you investigate whether you are running an older version of the Logback framework, and whether a newer Logback release fixes this problem?
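To illustrate the contention pattern (this is a hedged sketch of the general idea, not Logback's actual code): TimeZone.getTimeZone(String) synchronizes on the TimeZone class, which is the same class monitor your dump shows as locked, so every thread that resolves the time zone per log event queues behind that one lock. Resolving the zone and formatter once removes the per-event lock:

import java.text.SimpleDateFormat;
import java.time.Instant;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;
import java.util.Date;
import java.util.TimeZone;

public class TimestampFormatting {   // illustrative only

    // Per-event pattern: the TimeZone class monitor is taken on every call,
    // so 305 threads doing this concurrently queue behind each other.
    static String formatPerEvent(long millis) {
        SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS");
        sdf.setTimeZone(TimeZone.getTimeZone("UTC"));   // class-level lock taken here
        return sdf.format(new Date(millis));
    }

    // Mitigation idea: resolve the zone and formatter once; DateTimeFormatter is thread-safe.
    private static final DateTimeFormatter ISO_UTC =
            DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss.SSS").withZone(ZoneId.of("UTC"));

    static String formatCached(long millis) {
        return ISO_UTC.format(Instant.ofEpochMilli(millis));  // no shared monitor involved
    }
}

If a newer Logback (or logback-contrib JSON layout) release caches the formatter along these lines, upgrading would sidestep the contention without code changes on your side.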
c. Janarthanan, here is another strong suspicion: repeated garbage collection runs can also stall (i.e. halt) the application. To confirm that, we need to capture and analyze the GC logs. Did you capture a GC log?
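Since your stack frames show java.base@11.0.7, GC logging can be enabled with JDK 11's unified logging switch. The file path, rotation settings, and jar name below are just example placeholders:

java -Xlog:gc*:file=/tmp/gc.log:time,uptime,level,tags:filecount=5,filesize=20m -jar your-application.jar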
You can use the open source yCrash script, which captures 360-degree application-level artifacts (GC logs, 3 snapshots of thread dumps, heap dumps) and system-level artifacts (top, top -H, netstat, vmstat, iostat, dmesg, disk usage, kernel parameters...). Once you have this data, you can either analyze it manually or upload it to the yCrash tool, which analyzes all these artifacts together and generates one unified root cause analysis marrying all of them. It may indicate the root cause of the problem.