The ProcessingUnitInstance.relocationAndWait() may not abort after timeout, causing a permanent blocking state

Description

The following reproduces at the customer site once in ~100 retries.

Client:
After the full deployment, the customer started a rebalancing script. This script uses the Admin and invokes public ProcessingUnitInstance relocateAndWait(GridServiceContainer gridServiceContainerToRelocateTo, long timeout, TimeUnit timeUnit) method. The result of this invocation was the client blocks forever.

Thread dumps are taken 18h after the method call, it is still in a deadlock.

Thread which calls relocationAndWait
"pool-20-thread-1" #547 prio=5 os_prio=0 tid=0x00000000200f2000 nid=0x20a18 runnable [0x0000000026bdd000]
java.lang.Thread.State: RUNNABLE
at sun.nio.ch.SocketDispatcher.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:43)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:192)
at sun.nio.ch.SocketChannelImpl.read(Redefined)

  • locked <0x00000000ca17ad90> (a java.lang.Object)
    at com.gigaspaces.lrmi.nio.Reader.readBytesFromChannelBlocking(Reader.java:239)
    at com.gigaspaces.lrmi.nio.Reader.readBytesBlocking(Reader.java:628)
    at com.gigaspaces.lrmi.nio.Reader.bytesToPacket(Reader.java:566)
    at com.gigaspaces.lrmi.nio.Reader.readReply(Reader.java:158)
    at com.gigaspaces.lrmi.nio.CPeer.invoke(CPeer.java:711)
    at com.gigaspaces.lrmi.ConnPoolInvocationHandler.invoke(ConnPoolInvocationHandler.java:79)
    at com.gigaspaces.lrmi.MethodCachedInvocationHandler.invoke(MethodCachedInvocationHandler.java:76)
    at com.gigaspaces.lrmi.DynamicSmartStub.invokeRemote(DynamicSmartStub.java:449)
    at com.gigaspaces.lrmi.DynamicSmartStub.invoke(DynamicSmartStub.java:429)
    at com.gigaspaces.reflect.$GSProxy7.relocate(Unknown Source)
    at com.gigaspaces.grid.gsm.GSMProxy.relocate(GSMProxy.java:247)
    at org.openspaces.admin.internal.gsm.DefaultGridServiceManager.relocate(DefaultGridServiceManager.java:366)
    at org.openspaces.admin.internal.pu.DefaultProcessingUnitInstance.relocate(DefaultProcessingUnitInstance.java:435)
    at org.openspaces.admin.internal.pu.DefaultProcessingUnitInstance.relocateAndWait(DefaultProcessingUnitInstance.java:462)
    at com.frequentis.ps.maintenance.service.simulation.operation.RelocationOperation.lambda$null$3(RelocationOperation.java:85)
    at com.frequentis.ps.maintenance.service.simulation.operation.RelocationOperation$$Lambda$589/791011469.apply(Unknown Source)
    at java.util.Optional.map(Optional.java:215)
    at com.frequentis.ps.maintenance.service.simulation.operation.RelocationOperation.lambda$relocationAction$4(RelocationOperation.java:85)
    at com.frequentis.ps.maintenance.service.simulation.operation.RelocationOperation$$Lambda$588/1007713063.apply(Unknown Source)
    at java.util.Optional.flatMap(Optional.java:241)
    at com.frequentis.ps.maintenance.service.simulation.operation.RelocationOperation.relocationAction(RelocationOperation.java:83)
    at com.frequentis.ps.maintenance.service.simulation.operation.RelocationOperation.lambda$executeAction$2(RelocationOperation.java:75)
    at com.frequentis.ps.maintenance.service.simulation.operation.RelocationOperation$$Lambda$580/793351903.accept(Unknown Source)
    at com.frequentis.ps.maintenance.service.simulation.operation.RebalanceOperation.rebalanceAction(RebalanceOperation.java:123)
    at com.frequentis.ps.maintenance.service.simulation.operation.RebalanceOperation.lambda$executeAction$8(RebalanceOperation.java:109)
    at com.frequentis.ps.maintenance.service.simulation.operation.RebalanceOperation$$Lambda$577/508898348.accept(Unknown Source)
    at com.frequentis.ps.maintenance.service.rebalance.Rebalancing.performRebalancing(Rebalancing.java:194)
    at com.frequentis.ps.maintenance.service.rebalance.Rebalancing.rebalance(Rebalancing.java:156)
    at com.frequentis.ps.maintenance.service.rebalance.RebalanceService.lambda$rebalance$0(RebalanceService.java:117)
    at com.frequentis.ps.maintenance.service.rebalance.RebalanceService$$Lambda$264/1768124127.run(Unknown Source)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

GSM Thread which handles the relocation call
"GS-LRMI Connection-pool-1-thread-15" #124 daemon prio=5 os_prio=0 tid=0x0000000016426000 nid=0x1b46c runnable [0x0000000022dfc000]
java.lang.Thread.State: RUNNABLE
at sun.nio.ch.SocketDispatcher.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:43)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:192)
at sun.nio.ch.SocketChannelImpl.read(Redefined)

  • locked <0x00000000ec070680> (a java.lang.Object)
    at com.gigaspaces.lrmi.nio.Reader.readBytesFromChannelBlocking(Reader.java:239)
    at com.gigaspaces.lrmi.nio.Reader.readBytesBlocking(Reader.java:628)
    at com.gigaspaces.lrmi.nio.Reader.bytesToPacket(Reader.java:566)
    at com.gigaspaces.lrmi.nio.Reader.readReply(Reader.java:158)
    at com.gigaspaces.lrmi.nio.CPeer.invoke(CPeer.java:711)
    at com.gigaspaces.lrmi.ConnPoolInvocationHandler.invoke(ConnPoolInvocationHandler.java:79)
    at com.gigaspaces.lrmi.MethodCachedInvocationHandler.invoke(MethodCachedInvocationHandler.java:76)
    at com.gigaspaces.lrmi.DynamicSmartStub.invokeRemote(DynamicSmartStub.java:449)
    at com.gigaspaces.lrmi.DynamicSmartStub.invoke(DynamicSmartStub.java:429)
    at com.gigaspaces.reflect.$GSProxy12.destroy(Unknown Source)
    at org.jini.rio.resources.servicecore.ServiceAdminProxy.destroy(ServiceAdminProxy.java:229)
    at org.jini.rio.monitor.ServiceElementManager.doDestroy(ServiceElementManager.java:1034)
    at org.jini.rio.monitor.ServiceElementManager.destroyService(ServiceElementManager.java:996)
    at org.jini.rio.monitor.ServiceElementManager.decrement(ServiceElementManager.java:1521)

  • locked <0x00000000eabca480> (a java.lang.Object)
    at org.jini.rio.monitor.ProvisionMonitorImpl$OpStringManager.decrement(ProvisionMonitorImpl.java:2648)
    at org.jini.rio.core.OperationalStringManagerGigaspacesMethodinternalInvoke16.internalInvoke(Unknown Source)
    at com.gigaspaces.internal.reflection.fast.AbstractMethod.invoke(AbstractMethod.java:45)
    at com.gigaspaces.lrmi.DynamicSmartStub._invoke(DynamicSmartStub.java:276)
    at com.gigaspaces.lrmi.DynamicSmartStub.invokeDirect(DynamicSmartStub.java:347)
    at com.gigaspaces.lrmi.DynamicSmartStub.invoke(DynamicSmartStub.java:432)
    at com.gigaspaces.reflect.$GSProxy10.decrement(Unknown Source)
    at org.jini.rio.monitor.ServiceElementManager.relocate(ServiceElementManager.java:1121)
    at org.jini.rio.monitor.ProvisionMonitorImpl$OpStringManager.relocate(ProvisionMonitorImpl.java:2477)
    at org.jini.rio.core.OperationalStringManagerGigaspacesMethodinternalInvoke14.internalInvoke(Unknown Source)
    at com.gigaspaces.internal.reflection.fast.AbstractMethod.invoke(AbstractMethod.java:45)
    at com.gigaspaces.lrmi.DynamicSmartStub._invoke(DynamicSmartStub.java:276)
    at com.gigaspaces.lrmi.DynamicSmartStub.invokeDirect(DynamicSmartStub.java:347)
    at com.gigaspaces.lrmi.DynamicSmartStub.invoke(DynamicSmartStub.java:432)
    at com.gigaspaces.reflect.$GSProxy10.relocate(Unknown Source)
    at com.gigaspaces.grid.gsm.GSMImpl.relocate(GSMImpl.java:384)
    at com.gigaspaces.grid.gsm.GSMGigaspacesMethodinternalInvoke10.internalInvoke(Unknown Source)
    at com.gigaspaces.internal.reflection.fast.AbstractMethod.invoke(AbstractMethod.java:45)
    at com.gigaspaces.lrmi.LRMIRuntime.invoked(LRMIRuntime.java:441)
    at com.gigaspaces.lrmi.nio.Pivot.consumeAndHandleRequest(Pivot.java:494)
    at com.gigaspaces.lrmi.nio.Pivot.handleRequest(Pivot.java:572)
    at com.gigaspaces.lrmi.nio.Pivot$ChannelEntryTask.run(Pivot.java:173)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

Expectation is that after 60 seconds a get a timeout exception or just return with null, but not a block forever.

Server:
We can see that the timeout that the client pass is being used just after the relocate (DefaultProcessingUnitInstance line 471). This seems like a second problem. It is okay that the server will try to relocate even after the client timeout but is seems that for the server there is no timeout at all.

Activity

Fixed

Details

Assignee

Reporter

Participants of an issue

EvgenyF
Yuval Dori

Priority

SalesForce Case ID

Fix versions

Edition

Platform

All

Acceptance Test

added test: RelocateProcessingUnitInstanceTest

Freshdesk Support

Open Freshdesk Support
Created March 23, 2017 at 9:38 AM
Updated February 8, 2018 at 11:34 AM
Resolved November 16, 2017 at 9:14 AM
Freshdesk Support