A network call is made within a synchronized block.
We have run into an issue concerning the Admin API. From the outside, the
Admin API was unresponsive and the JVM used up a lot of memory. In our case we
try to count the instances of a specific object type via
SpaceInstance.getRuntimeDetails().getCountPerClassName(). We do this every 30s
as part of our system monitoring. The block we encountered happened as part of
Admin.getSpaces().waitFor(String, long, TimeUnit). According to the API it
should only block for a limited amount of time, but it actually blocked forever.
The actual cause, however, lies elsewhere.
We created a thread dump and a heap dump and found that most of the JVM's used
memory is taken up by instances of the com.gigaspaces.lrmi.DynamicSmartStub
class. In total there are over 80,000 of them, resulting in about 800MB
allocated. The JVM itself did not throw any exceptions. It just blocked Admin
API calls, and the allocated memory kept on rising.
It turns out the allocated memory is all part of a single
com.sun.jini.thread.TaskManager task list. Checking the implementation and the
thread dump, I can see that there are 10 "GS-SDM Cache Task" instances that
should be working on this task list. Unfortunately they are blocked by a single
cache task worker with thread Id=105. This one is itself blocked by a synchronized
method call of org.openspaces.admin.internal.vm.DefaultVirtualMachines.isMonitoring().
In fact there is a whole chain of threads waiting for locks to be released, e.g.:
"GS-SDM Cache Task" Id=105 BLOCKED on org.openspaces.admin.internal.vm.DefaultVirtualMachines@2045aff9 owned by "GS-admin-scheduled-executor-thread" Id=92
"GS-admin-scheduled-executor-thread" Id=92 BLOCKED on org.openspaces.admin.internal.vm.DefaultVirtualMachine@7f52d7e2 owned by "GS-admin-scheduled-executor-thread" Id=88
"GS-admin-scheduled-executor-thread" Id=88 RUNNABLE (in native)
The thread that all the others are waiting for (Id=88) is blocking them as part
of DefaultVirtualMachine.getStatistics(). This getStatistics() call blocks because
it initiates a DefaultGridServiceContainer.getJVMStatistics() call, which
further down the line results in a blocking read within com.gigaspaces.lrmi.nio.Reader.readBytesFromChannelBlocking().
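
To illustrate the pattern behind this lock chain, here is a minimal,
self-contained sketch. It is not GigaSpaces code: Resource, slowCall(),
quickCheck() and the thread names "holder" and "waiter" are hypothetical
stand-ins. One thread holds an object's monitor while it waits on a simulated
network read, a second thread blocks on a cheap synchronized method of the same
object, and the standard ThreadMXBean reports who owns the lock, much like the
"BLOCKED on ... owned by ..." lines in the thread dump.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.util.concurrent.CountDownLatch;

class LockChainDemo {
    /** Hypothetical stand-in for an object that does I/O while holding its monitor. */
    static class Resource {
        // The monitor is held for the whole duration of the simulated network read.
        synchronized void slowCall(CountDownLatch entered, CountDownLatch release) {
            entered.countDown();
            try {
                release.await(); // stands in for a blocking read that is not answered
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }

        // A cheap call, but it needs the same monitor.
        synchronized boolean quickCheck() {
            return true;
        }
    }

    /** Returns the name of the thread owning the monitor that "waiter" is blocked on. */
    static String findLockOwner() {
        Resource resource = new Resource();
        CountDownLatch entered = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);
        Thread holder = new Thread(() -> resource.slowCall(entered, release), "holder");
        Thread waiter = new Thread(resource::quickCheck, "waiter");
        try {
            holder.start();
            entered.await();     // the holder now owns the monitor
            waiter.start();
            while (waiter.getState() != Thread.State.BLOCKED) {
                Thread.sleep(5); // wait until the waiter is blocked on the monitor
            }
            ThreadInfo info =
                ManagementFactory.getThreadMXBean().getThreadInfo(waiter.getId());
            return info.getLockOwnerName();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            release.countDown(); // let the simulated read finish so both threads exit
        }
    }

    public static void main(String[] args) {
        System.out.println("waiter is blocked by: " + findLockOwner());
    }
}
```

In this sketch the waiter can only proceed once the simulated read returns. If
the read never returns, as in our case, every caller of any synchronized method
on the same object is stuck behind it.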
My conclusion is that the Admin API initiated a network read that, for
whatever reason, is never answered. Unfortunately this network access is done
within the synchronized block of a DefaultVirtualMachine instance without any
timeout. As a result, a lot of Admin API calls start blocking in various places.
Our expectation would be that the Admin API reports an error. Any sort of error
is fine, such as throwing an exception or returning null.
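
As a rough sketch of the behaviour we would expect, the remote call could run
outside the monitor with a bounded wait, so that the lock is only taken to store
the result. This is an assumption about one possible approach, not the actual
Admin API implementation: BoundedStatsFetcher, fetch() and lastKnown() are
hypothetical names, and the remote call is passed in as a plain Callable.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class BoundedStatsFetcher {
    // Daemon threads, so a stuck read cannot keep the JVM alive.
    private final ExecutorService ioPool = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r, "stats-io");
        t.setDaemon(true);
        return t;
    });
    private final Object lock = new Object();
    private String lastStats; // guarded by lock

    /** Runs the (hypothetical) remote call with a bounded wait; fails instead of hanging. */
    String fetch(Callable<String> remoteCall, long timeout, TimeUnit unit) {
        Future<String> pending = ioPool.submit(remoteCall); // no monitor held here
        try {
            String fresh = pending.get(timeout, unit);
            synchronized (lock) {
                lastStats = fresh; // the lock is held only for the field update
            }
            return fresh;
        } catch (TimeoutException e) {
            pending.cancel(true); // best effort: interrupts the I/O thread
            throw new IllegalStateException("statistics not available within timeout", e);
        } catch (InterruptedException e) {
            pending.cancel(true);
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for statistics", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("statistics call failed", e.getCause());
        }
    }

    /** Last successfully fetched value, or null if there is none yet. */
    String lastKnown() {
        synchronized (lock) {
            return lastStats;
        }
    }

    public static void main(String[] args) {
        BoundedStatsFetcher fetcher = new BoundedStatsFetcher();
        System.out.println(fetcher.fetch(() -> "ok", 1, TimeUnit.SECONDS));
        try {
            // Simulates a remote side that does not answer in time.
            fetcher.fetch(() -> { Thread.sleep(5_000); return "late"; },
                          100, TimeUnit.MILLISECONDS);
        } catch (IllegalStateException e) {
            System.out.println("reported error: " + e.getMessage());
        }
    }
}
```

With this shape, a read that is never answered costs the caller at most the
timeout, and other threads never queue up behind a monitor held during I/O.
Whether the underlying read actually stops when its thread is interrupted
depends on the transport, so the cancel call here is only a best-effort cleanup.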