The Admin API blocks itself, becoming unresponsive and exploding memory usage

Description

A network call is made inside a synchronized block.

We have run into an issue concerning the AdminApi. From the outside, the
AdminApi was unresponsive and the JVM used up a lot of memory. In our case we
try to count the instances of a specific object type via
SpaceInstance.getRuntimeDetails().getCountPerClassName(). We do this every 30s
as part of our system monitoring. The block we encountered happened as part of
Admin.getSpaces().waitFor(String, long, TimeUnit). According to the API it
should block only for a limited amount of time, but it actually blocked forever.
The root cause, however, lies elsewhere.

We created a thread dump and a heap dump and found that most of the JVM's used
memory is taken up by instances of the com.gigaspaces.lrmi.DynamicSmartStub
class. In total there are over 80,000 of them, resulting in about 800MB
allocated. The JVM itself did not throw any exceptions. AdminAPI calls simply
blocked and the allocated memory kept on rising.

It turns out the allocated memory is all part of a single
com.sun.jini.thread.TaskManager task list. Checking the implementation and the
thread dump, I can see that there are 10 "GS-SDM Cache Task" instances that
should be working on this task list. Unfortunately they are blocked by a single
cache task worker with thread Id=105. This one is itself blocked on a synchronized
method call of org.openspaces.admin.internal.vm.DefaultVirtualMachines.isMonitoring().
In fact there is a whole chain of threads waiting for locks to be released, e.g.:

"GS-SDM Cache Task" Id=105 BLOCKED on org.openspaces.admin.internal.vm.DefaultVirtualMachines@2045aff9 owned by "GS-admin-scheduled-executor-thread" Id=92
--> DefaultVirtualMachines.isMonitoring()
"GS-admin-scheduled-executor-thread" Id=92 BLOCKED on org.openspaces.admin.internal.vm.DefaultVirtualMachine@7f52d7e2 owned by "GS-admin-scheduled-executor-thread" Id=88
--> DefaultVirtualMachine.getStatistics()
"GS-admin-scheduled-executor-thread" Id=88 RUNNABLE (in native)

The thread that all the others are waiting for blocks them from inside
DefaultVirtualMachine.getStatistics(). This getStatistics() call blocks because
it initiates a DefaultGridServiceContainer.getJVMStatistics() call, which
further down the line results in a blocking read within com.gigaspaces.lrmi.nio.Reader.readBytesFromChannelBlocking().
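
The lock chain above was read off the thread dump by hand. As a rough sketch, the same "blocked on a lock owned by" links can be followed programmatically with the standard java.lang.management API. The class name LockChain and the method trace are hypothetical names chosen for this illustration and are not part of the Admin API.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the Admin API.
class LockChain {

    /**
     * Starting at the given thread id, follows the owner of the lock each
     * thread is blocked on until a thread is reached that is not blocked
     * on an owned lock. Returns "name STATE" for every thread visited.
     */
    static List<String> trace(long threadId) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        List<String> chain = new ArrayList<>();
        long current = threadId;
        // Bound the walk so that a deadlock cycle cannot loop forever.
        for (int hops = 0; hops < 64 && current != -1; hops++) {
            ThreadInfo info = mx.getThreadInfo(current);
            if (info == null) {
                break; // the thread has terminated in the meantime
            }
            chain.add(info.getThreadName() + " " + info.getThreadState());
            // -1 when the thread is not blocked or the lock has no owner.
            current = info.getLockOwnerId();
        }
        return chain;
    }
}
```

Run inside the affected JVM, LockChain.trace(105) would be expected to walk from the cache task worker to the two scheduled executor threads shown in the dump excerpt, ending at the thread that is RUNNABLE in the native read.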

My conclusion is that the AdminAPI initiated a network read that, for
whatever reason, was never answered. Unfortunately, this network access is done
within the synchronized block of a DefaultVirtualMachine instance without any
timeout. As a result, a lot of AdminAPI calls start blocking in various
places.
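
To illustrate the locking problem, here is a minimal sketch of the alternative arrangement: the slow remote call runs with no lock held, and the lock is taken only briefly to publish the result. This is not the actual DefaultVirtualMachine code. VmStatsCache, remoteFetch and getLastStatistics are hypothetical names, and the Callable merely stands in for the remote statistics call.

```java
import java.util.concurrent.Callable;

// Hypothetical stand-in for an object that caches remote JVM statistics.
class VmStatsCache {
    // Assumption: represents the remote call that may never return.
    private final Callable<String> remoteFetch;
    private final Object lock = new Object();
    private String lastStats = "N/A";
    private boolean monitoring = true;

    VmStatsCache(Callable<String> remoteFetch) {
        this.remoteFetch = remoteFetch;
    }

    /** Cheap state query; only ever contends with other short critical sections. */
    boolean isMonitoring() {
        synchronized (lock) {
            return monitoring;
        }
    }

    /** Fetches fresh statistics; the network call happens outside the lock. */
    String getStatistics() throws Exception {
        // Slow and possibly hanging I/O: no lock is held here, so a hung
        // read stalls only the calling thread.
        String fresh = remoteFetch.call();
        synchronized (lock) {
            // Short critical section: just publish the new value.
            lastStats = fresh;
            return lastStats;
        }
    }

    /** Returns the most recently published statistics without any I/O. */
    String getLastStatistics() {
        synchronized (lock) {
            return lastStats;
        }
    }
}
```

With this arrangement a hung remote read still stalls the thread that called getStatistics(), but isMonitoring() and getLastStatistics() keep answering, so the blocking does not spread along a chain of monitors.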

Our expectation would be that the Admin API reports an error. Any sort of error
is fine, such as throwing an exception or returning null.
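
As a sketch of the behaviour we would expect, the following caller-side wrapper bounds any call by a timeout and returns null when no answer arrives in time. It uses only java.util.concurrent. BoundedCall and callOrNull are hypothetical names, and this is a possible workaround on the monitoring side, not the fix inside the Admin API.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not part of the Admin API.
class BoundedCall {
    private static final ExecutorService POOL = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r, "bounded-call");
        t.setDaemon(true); // a hung call must not keep the JVM alive
        return t;
    });

    /** Returns the call's result, or null if it timed out, failed or was interrupted. */
    static <T> T callOrNull(Callable<T> call, long timeout, TimeUnit unit) {
        Future<T> future = POOL.submit(call);
        try {
            return future.get(timeout, unit);
        } catch (TimeoutException e) {
            // Interrupts the worker; a read that ignores interrupts stays stuck.
            future.cancel(true);
            return null;
        } catch (ExecutionException e) {
            return null; // the call itself threw
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt status
            future.cancel(true);
            return null;
        }
    }
}
```

A monitoring job could wrap its 30s poll in callOrNull and treat null as "no data this round". Worker threads stuck in a read that does not react to interruption are not reclaimed, so this limits the damage for the caller but does not remove the underlying problem.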

Attachments

00011160_threadDump

26 Apr, 2017

Linked work items

is related by

GS-13452

Improve responsiveness of remote statistics gathering in Admin API

Fixed

Details

Assignee

Moran Avigdor

Reporter

Yuval Dori (Deactivated)

Participants of an issue

Moran Avigdor

Yuval Dori

Priority

Medium

SalesForce Case ID

00011160

Fix versions

12.3

Edition

Open Source

Platform

All

Acceptance Test

see GS-13452

Created April 26, 2017 at 12:29 PM

Updated December 17, 2017 at 9:30 AM

Resolved December 14, 2017 at 3:17 PM
