We're updating the issue view to help you get more done. 

The ProcessingUnitInstance.relocationAndWait() may not abort after timeout, causing a permanent blocking state

Description

The following reproduces at the customer site once in ~100 retries.

Client:
After the full deployment, the customer started a rebalancing script. This script uses the Admin and invokes public ProcessingUnitInstance relocateAndWait(GridServiceContainer gridServiceContainerToRelocateTo, long timeout, TimeUnit timeUnit) method. The result of this invocation was the client blocks forever.

Thread dumps are taken 18h after the method call, it is still in a deadlock.

Thread which calls relocationAndWait
"pool-20-thread-1" #547 prio=5 os_prio=0 tid=0x00000000200f2000 nid=0x20a18 runnable [0x0000000026bdd000]
java.lang.Thread.State: RUNNABLE
at sun.nio.ch.SocketDispatcher.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:43)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:192)
at sun.nio.ch.SocketChannelImpl.read(Redefined)

  • locked <0x00000000ca17ad90> (a java.lang.Object)
    at com.gigaspaces.lrmi.nio.Reader.readBytesFromChannelBlocking(Reader.java:239)
    at com.gigaspaces.lrmi.nio.Reader.readBytesBlocking(Reader.java:628)
    at com.gigaspaces.lrmi.nio.Reader.bytesToPacket(Reader.java:566)
    at com.gigaspaces.lrmi.nio.Reader.readReply(Reader.java:158)
    at com.gigaspaces.lrmi.nio.CPeer.invoke(CPeer.java:711)
    at com.gigaspaces.lrmi.ConnPoolInvocationHandler.invoke(ConnPoolInvocationHandler.java:79)
    at com.gigaspaces.lrmi.MethodCachedInvocationHandler.invoke(MethodCachedInvocationHandler.java:76)
    at com.gigaspaces.lrmi.DynamicSmartStub.invokeRemote(DynamicSmartStub.java:449)
    at com.gigaspaces.lrmi.DynamicSmartStub.invoke(DynamicSmartStub.java:429)
    at com.gigaspaces.reflect.$GSProxy7.relocate(Unknown Source)
    at com.gigaspaces.grid.gsm.GSMProxy.relocate(GSMProxy.java:247)
    at org.openspaces.admin.internal.gsm.DefaultGridServiceManager.relocate(DefaultGridServiceManager.java:366)
    at org.openspaces.admin.internal.pu.DefaultProcessingUnitInstance.relocate(DefaultProcessingUnitInstance.java:435)
    at org.openspaces.admin.internal.pu.DefaultProcessingUnitInstance.relocateAndWait(DefaultProcessingUnitInstance.java:462)
    at com.frequentis.ps.maintenance.service.simulation.operation.RelocationOperation.lambda$null$3(RelocationOperation.java:85)
    at com.frequentis.ps.maintenance.service.simulation.operation.RelocationOperation$$Lambda$589/791011469.apply(Unknown Source)
    at java.util.Optional.map(Optional.java:215)
    at com.frequentis.ps.maintenance.service.simulation.operation.RelocationOperation.lambda$relocationAction$4(RelocationOperation.java:85)
    at com.frequentis.ps.maintenance.service.simulation.operation.RelocationOperation$$Lambda$588/1007713063.apply(Unknown Source)
    at java.util.Optional.flatMap(Optional.java:241)
    at com.frequentis.ps.maintenance.service.simulation.operation.RelocationOperation.relocationAction(RelocationOperation.java:83)
    at com.frequentis.ps.maintenance.service.simulation.operation.RelocationOperation.lambda$executeAction$2(RelocationOperation.java:75)
    at com.frequentis.ps.maintenance.service.simulation.operation.RelocationOperation$$Lambda$580/793351903.accept(Unknown Source)
    at com.frequentis.ps.maintenance.service.simulation.operation.RebalanceOperation.rebalanceAction(RebalanceOperation.java:123)
    at com.frequentis.ps.maintenance.service.simulation.operation.RebalanceOperation.lambda$executeAction$8(RebalanceOperation.java:109)
    at com.frequentis.ps.maintenance.service.simulation.operation.RebalanceOperation$$Lambda$577/508898348.accept(Unknown Source)
    at com.frequentis.ps.maintenance.service.rebalance.Rebalancing.performRebalancing(Rebalancing.java:194)
    at com.frequentis.ps.maintenance.service.rebalance.Rebalancing.rebalance(Rebalancing.java:156)
    at com.frequentis.ps.maintenance.service.rebalance.RebalanceService.lambda$rebalance$0(RebalanceService.java:117)
    at com.frequentis.ps.maintenance.service.rebalance.RebalanceService$$Lambda$264/1768124127.run(Unknown Source)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

GSM Thread which handles the relocation call
"GS-LRMI Connection-pool-1-thread-15" #124 daemon prio=5 os_prio=0 tid=0x0000000016426000 nid=0x1b46c runnable [0x0000000022dfc000]
java.lang.Thread.State: RUNNABLE
at sun.nio.ch.SocketDispatcher.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:43)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:192)
at sun.nio.ch.SocketChannelImpl.read(Redefined)

  • locked <0x00000000ec070680> (a java.lang.Object)
    at com.gigaspaces.lrmi.nio.Reader.readBytesFromChannelBlocking(Reader.java:239)
    at com.gigaspaces.lrmi.nio.Reader.readBytesBlocking(Reader.java:628)
    at com.gigaspaces.lrmi.nio.Reader.bytesToPacket(Reader.java:566)
    at com.gigaspaces.lrmi.nio.Reader.readReply(Reader.java:158)
    at com.gigaspaces.lrmi.nio.CPeer.invoke(CPeer.java:711)
    at com.gigaspaces.lrmi.ConnPoolInvocationHandler.invoke(ConnPoolInvocationHandler.java:79)
    at com.gigaspaces.lrmi.MethodCachedInvocationHandler.invoke(MethodCachedInvocationHandler.java:76)
    at com.gigaspaces.lrmi.DynamicSmartStub.invokeRemote(DynamicSmartStub.java:449)
    at com.gigaspaces.lrmi.DynamicSmartStub.invoke(DynamicSmartStub.java:429)
    at com.gigaspaces.reflect.$GSProxy12.destroy(Unknown Source)
    at org.jini.rio.resources.servicecore.ServiceAdminProxy.destroy(ServiceAdminProxy.java:229)
    at org.jini.rio.monitor.ServiceElementManager.doDestroy(ServiceElementManager.java:1034)
    at org.jini.rio.monitor.ServiceElementManager.destroyService(ServiceElementManager.java:996)
    at org.jini.rio.monitor.ServiceElementManager.decrement(ServiceElementManager.java:1521)

  • locked <0x00000000eabca480> (a java.lang.Object)
    at org.jini.rio.monitor.ProvisionMonitorImpl$OpStringManager.decrement(ProvisionMonitorImpl.java:2648)
    at org.jini.rio.core.OperationalStringManagerGigaspacesMethodinternalInvoke16.internalInvoke(Unknown Source)
    at com.gigaspaces.internal.reflection.fast.AbstractMethod.invoke(AbstractMethod.java:45)
    at com.gigaspaces.lrmi.DynamicSmartStub._invoke(DynamicSmartStub.java:276)
    at com.gigaspaces.lrmi.DynamicSmartStub.invokeDirect(DynamicSmartStub.java:347)
    at com.gigaspaces.lrmi.DynamicSmartStub.invoke(DynamicSmartStub.java:432)
    at com.gigaspaces.reflect.$GSProxy10.decrement(Unknown Source)
    at org.jini.rio.monitor.ServiceElementManager.relocate(ServiceElementManager.java:1121)
    at org.jini.rio.monitor.ProvisionMonitorImpl$OpStringManager.relocate(ProvisionMonitorImpl.java:2477)
    at org.jini.rio.core.OperationalStringManagerGigaspacesMethodinternalInvoke14.internalInvoke(Unknown Source)
    at com.gigaspaces.internal.reflection.fast.AbstractMethod.invoke(AbstractMethod.java:45)
    at com.gigaspaces.lrmi.DynamicSmartStub._invoke(DynamicSmartStub.java:276)
    at com.gigaspaces.lrmi.DynamicSmartStub.invokeDirect(DynamicSmartStub.java:347)
    at com.gigaspaces.lrmi.DynamicSmartStub.invoke(DynamicSmartStub.java:432)
    at com.gigaspaces.reflect.$GSProxy10.relocate(Unknown Source)
    at com.gigaspaces.grid.gsm.GSMImpl.relocate(GSMImpl.java:384)
    at com.gigaspaces.grid.gsm.GSMGigaspacesMethodinternalInvoke10.internalInvoke(Unknown Source)
    at com.gigaspaces.internal.reflection.fast.AbstractMethod.invoke(AbstractMethod.java:45)
    at com.gigaspaces.lrmi.LRMIRuntime.invoked(LRMIRuntime.java:441)
    at com.gigaspaces.lrmi.nio.Pivot.consumeAndHandleRequest(Pivot.java:494)
    at com.gigaspaces.lrmi.nio.Pivot.handleRequest(Pivot.java:572)
    at com.gigaspaces.lrmi.nio.Pivot$ChannelEntryTask.run(Pivot.java:173)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

Expectation is that after 60 seconds a get a timeout exception or just return with null, but not a block forever.

Server:
We can see that the timeout that the client pass is being used just after the relocate (DefaultProcessingUnitInstance line 471). This seems like a second problem. It is okay that the server will try to relocate even after the client timeout but is seems that for the server there is no timeout at all.

Workaround

None

Acceptance Test

added test: RelocateProcessingUnitInstanceTest

Status

Assignee

EvgenyF

Reporter

Yuval Dori

Labels

None

Priority

Critical

SalesForce Case ID

00011037

Fix versions

Commitment Version/s

None

Due date

None

Product

None

Edition

Open Source

Platform

All