Unhealthy Space remains in Stopped state indefinitely after a network failure

Description

The Space instance detects that another instance has been elected as primary and changes its own state to stopped.

2018-03-21 13:42:30,280 data-processor.72 [1] INFO [com.gigaspaces.space.active-election.space.72] - terminating - another space has been elected as primary [72_1]
...
2018-03-21 13:42:30,363 data-processor.72 [1] INFO [com.gigaspaces.space.space.72] - Space Stopped successfully

The Grid Service Manager (GSM) suspects that the instance has failed, but because it is not yet actively managing the instance, it does not forcefully terminate it.

2018-03-21 13:42:34,440 GSM WARNING [org.openspaces.pu.container.servicegrid.PUFaultDetectionHandler] - Suspecting failure of service: [data-processor.72 [1]] pid[8179] host[ip-172-30-0-227.eu-west-1.compute.internal/172.30.0.227] - RTT[7.7 ms]. Retrying to reach service.; Caused by: com.j_spaces.core.SpaceUnhealthyException: Space is in unhealthy state: Space [space_container72:space] is in stopped state.
terminating - another space has been elected as primary [72_1]

We noticed that the interval between the GSM being granted leadership and the GSM becoming active was longer than the time we wait before forcefully terminating an unhealthy instance (3 min vs. 1 min wait).

2018-03-21 13:42:29,118 GSM INFO [com.gigaspaces.grid.gsm.leader] - Granted leadership
...
2018-03-21 13:45:51,311 GSM INFO [com.gigaspaces.grid.gsm.leader] - Actively managing: [data-processor, data-feeder]
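
The gap can be checked directly from the two log lines above. The following is a minimal sketch, not part of XAP: the helper name `parse_ts` and the variable names are hypothetical, and the 1 minute wait is taken from the description rather than read from any configuration.

```python
from datetime import datetime, timedelta

# Timestamp layout used at the start of each log line: "YYYY-MM-DD HH:MM:SS,mmm"
LOG_TS_FORMAT = "%Y-%m-%d %H:%M:%S,%f"

def parse_ts(line):
    """Parse the leading 23-character timestamp of a log line (hypothetical helper)."""
    return datetime.strptime(line[:23], LOG_TS_FORMAT)

granted = ("2018-03-21 13:42:29,118 GSM INFO "
           "[com.gigaspaces.grid.gsm.leader] - Granted leadership")
active = ("2018-03-21 13:45:51,311 GSM INFO "
          "[com.gigaspaces.grid.gsm.leader] - Actively managing: "
          "[data-processor, data-feeder]")

# Time the GSM needed to go from "granted leadership" to "actively managing"
takeover_gap = parse_ts(active) - parse_ts(granted)

# Wait before forceful termination, as stated in the description (assumption:
# the actual value is configured elsewhere and is not read here)
force_terminate_wait = timedelta(minutes=1)

print("takeover gap:", takeover_gap)
print("gap exceeds wait:", takeover_gap > force_terminate_wait)
```

With these two lines the gap comes out at 3 minutes 22.193 seconds, which is well past the 1 minute wait.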

We expected to see the following message for this instance:
GSM INFO [com.gigaspaces.grid.gsm.services] - Forcefully destroy unhealthy service [data-processor.72 [1]]

Instances that the GSM began actively managing before the timeout expired were terminated.

Workaround

None

Acceptance Test

regression tests (disconnect, manager suite)
large cluster

Status

Assignee

Meron Avigdor

Reporter

Meron Avigdor

Labels

None

Priority

Major

SalesForce Case ID

12076

Fix versions

Commitment Version/s

None

Due date

None

Product

XAP

Edition

Premium

Platform

All

Sprint

None