InsightEdge Platform
GS-13236

The Admin API blocks itself, becoming unresponsive and exploding memory usage

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Medium
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 12.3
    • Labels:
      None
    • Platform:
      All
    • SalesForce Case ID:
      00011160
    • Acceptance Test:
      see GS-13452
    • Sprint:
      12.3-M11
    • Edition:
      Open Source

      Description

      Within a synchronized block there is a network call.

      We have run into an issue concerning the Admin API. From the outside, the
      Admin API was unresponsive and the JVM used up a lot of memory. In our case we
      try to count the instances of a specific class via
      SpaceInstance.getRuntimeDetails().getCountPerClassName(). We do this every 30s
      as part of our system monitoring. The block we encountered happened as part of
      Admin.getSpaces().waitFor(String, long, TimeUnit). From an API point of view it
      should only block for a limited amount of time, but it actually blocked
      forever. The reason for this actually lies elsewhere.
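
      For illustration, here is a minimal sketch of the kind of monitoring loop
      described above, assuming the standard OpenSpaces Admin API; the space name
      "mySpace" and the lookup group "my-lookup-group" are placeholders, not values
      taken from this report.

      import java.util.Map;
      import java.util.concurrent.TimeUnit;

      import org.openspaces.admin.Admin;
      import org.openspaces.admin.AdminFactory;
      import org.openspaces.admin.space.Space;
      import org.openspaces.admin.space.SpaceInstance;

      public class SpaceCountMonitor {

          public static void main(String[] args) throws InterruptedException {
              Admin admin = new AdminFactory().addGroup("my-lookup-group").createAdmin();
              try {
                  while (true) {
                      // Should return after at most 10 seconds -- in our case it blocked forever.
                      Space space = admin.getSpaces().waitFor("mySpace", 10, TimeUnit.SECONDS);
                      if (space != null) {
                          for (SpaceInstance instance : space.getInstances()) {
                              Map<String, Integer> counts =
                                      instance.getRuntimeDetails().getCountPerClassName();
                              System.out.println("instance " + instance.getInstanceId() + ": " + counts);
                          }
                      }
                      TimeUnit.SECONDS.sleep(30); // poll every 30s as part of system monitoring
                  }
              } finally {
                  admin.close();
              }
          }
      }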

      We created a thread dump and a heap dump and found that most of the JVM's used
      memory consists of instances of the com.gigaspaces.lrmi.DynamicSmartStub class.
      In total there are over 80,000 of them, accounting for about 800 MB of
      allocated memory. The JVM itself did not throw any exceptions; it just blocked
      Admin API calls while the allocated memory kept rising.

      It turns out the allocated memory is all part of a single
      com.sun.jini.thread.TaskManager task list. Checking the implementation and the
      thread dump, I can see that there are 10 "GS-SDM Cache Task" threads that
      should be working on this task list. Unfortunately, they are all blocked by a
      single cache task worker with thread Id=105. This one in turn is blocked by a
      synchronized method call,
      org.openspaces.admin.internal.vm.DefaultVirtualMachines.isMonitoring(). In
      fact, there is a whole chain of threads waiting for locks to be released, e.g.:

      "GS-SDM Cache Task" Id=105 BLOCKED on org.openspaces.admin.internal.vm.DefaultVirtualMachines@2045aff9 owned by "GS-admin-scheduled-executor-thread" Id=92
      --> DefaultVirtualMachines.isMonitoring()
      "GS-admin-scheduled-executor-thread" Id=92 BLOCKED on org.openspaces.admin.internal.vm.DefaultVirtualMachine@7f52d7e2 owned by "GS-admin-scheduled-executor-thread" Id=88
      --> DefaultVirtualMachine.getStatistics()
      "GS-admin-scheduled-executor-thread" Id=88 RUNNABLE (in native)

      The thread they are all waiting for is blocking the others as part of
      DefaultVirtualMachine.getStatistics(). This getStatistics() call blocks because
      it initiates a DefaultGridServiceContainer.getJVMStatistics() call, which
      further down the line results in a blocking read within
      com.gigaspaces.lrmi.nio.Reader.readBytesFromChannelBlocking().
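
      To make the chain easier to follow, here is a simplified sketch of the locking
      pattern described above. It is not the actual GigaSpaces source, and the class
      and method names are illustrative only; it just shows why a single stuck
      network read inside a synchronized method is enough to stall every thread that
      needs one of these monitors.

      import java.util.concurrent.locks.LockSupport;

      class VirtualMachineLike {                 // stands in for DefaultVirtualMachine

          // Thread Id=88 holds this instance's monitor while performing a network
          // read with no timeout -- it stays RUNNABLE "in native" forever.
          synchronized Object getStatistics() {
              return blockingReadWithoutTimeout();
          }

          private Object blockingReadWithoutTimeout() {
              // stand-in for getJVMStatistics() ending up in
              // Reader.readBytesFromChannelBlocking(): a read that never completes
              while (true) {
                  LockSupport.park();
              }
          }
      }

      class VirtualMachinesLike {                // stands in for DefaultVirtualMachines
          private final VirtualMachineLike vm = new VirtualMachineLike();

          // Thread Id=92 enters here, takes this collection's monitor and then waits
          // for the VirtualMachineLike monitor still held by Id=88.
          synchronized void refreshStatistics() {
              vm.getStatistics();
          }

          // Thread Id=105 ("GS-SDM Cache Task") blocks here because Id=92 still owns
          // this collection's monitor; the TaskManager queue behind it keeps growing.
          synchronized boolean isMonitoring() {
              return true;
          }
      }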

      My conclusion is that the Admin API initiated a network read that, for whatever
      reason, is never answered. Unfortunately, this network access is done within
      the synchronized block of a DefaultVirtualMachine instance without any timeout.
      As a result, a lot of Admin API calls start blocking in various places.

      Our expectation would be that the Admin API reports an error. Any sort of error
      is fine, such as throwing an exception or returning null.
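
      As a hypothetical client-side mitigation, independent of the actual fix
      delivered in 12.3, a monitoring application could bound every Admin API call
      with its own timeout using plain java.util.concurrent, so that at least the
      monitoring thread never hangs; BoundedAdminCall and callOrNull are names
      invented for this sketch.

      import java.util.concurrent.Callable;
      import java.util.concurrent.ExecutionException;
      import java.util.concurrent.ExecutorService;
      import java.util.concurrent.Executors;
      import java.util.concurrent.Future;
      import java.util.concurrent.TimeUnit;
      import java.util.concurrent.TimeoutException;

      class BoundedAdminCall {

          private final ExecutorService executor = Executors.newSingleThreadExecutor();

          // Runs an Admin API call on a worker thread and gives up after the timeout.
          // The worker may still be stuck inside the Admin API, so this protects the
          // caller only -- it is a mitigation, not a fix for the underlying block.
          <T> T callOrNull(Callable<T> adminCall, long timeout, TimeUnit unit) {
              Future<T> future = executor.submit(adminCall);
              try {
                  return future.get(timeout, unit);
              } catch (TimeoutException e) {
                  future.cancel(true); // best effort; a blocked channel read may ignore the interrupt
                  return null;
              } catch (InterruptedException e) {
                  Thread.currentThread().interrupt();
                  return null;
              } catch (ExecutionException e) {
                  return null;
              }
          }
      }

      A call such as callOrNull(() -> instance.getRuntimeDetails().getCountPerClassName(),
      10, TimeUnit.SECONDS) would then return null instead of freezing the monitoring
      loop when the Admin API stops answering.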

    People

    • Votes: 0
    • Watchers: 2
