Project: InsightEdge Platform
Issue: GS-13509

Unhealthy Space remains in Stopped state indefinitely after a network failure

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects versions: None
    • Fix versions: 12.3.1
    • Labels:
      None

      Description

      The Space instance detects that another instance has been elected as primary and changes its own state to stopped.

      2018-03-21 13:42:30,280 data-processor.72 [1] INFO [com.gigaspaces.space.active-election.space.72] - terminating - another space has been elected as primary [72_1]
      ...
      2018-03-21 13:42:30,363 data-processor.72 [1] INFO [com.gigaspaces.space.space.72] - Space Stopped successfully

      The Grid Service Manager (GSM) suspects that the instance has failed, but because it is not yet actively managing the instance, it does not forcefully terminate it.

      2018-03-21 13:42:34,440 GSM WARNING [org.openspaces.pu.container.servicegrid.PUFaultDetectionHandler] - Suspecting failure of service: [data-processor.72 [1]] pid[8179] host[ip-172-30-0-227.eu-west-1.compute.internal/172.30.0.227] - RTT[7.7 ms]. Retrying to reach service.; Caused by: com.j_spaces.core.SpaceUnhealthyException: Space is in unhealthy state: Space [space_container72:space] is in stopped state.
      terminating - another space has been elected as primary [72_1]

      We noticed that the interval between the GSM being granted leadership and the GSM actively managing the processing units was longer than the time we wait before forcefully terminating an unhealthy instance (over 3 minutes vs. a 1-minute wait).

      2018-03-21 13:42:29,118 GSM INFO [com.gigaspaces.grid.gsm.leader] - Granted leadership
      ...
      2018-03-21 13:45:51,311 GSM INFO [com.gigaspaces.grid.gsm.leader] - Actively managing: [data-processor, data-feeder]
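
      The gap can be checked directly from the two log timestamps above. The following is a minimal Python sketch, assuming the timestamp format shown in the log excerpts and the 1-minute wait mentioned in the description. The names LOG_TIME_FORMAT, FORCE_TERMINATE_WAIT and parse_log_time are hypothetical and are not part of the product.

```python
from datetime import datetime, timedelta

# Timestamp format used in the log excerpts (comma before the milliseconds).
LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S,%f"

# Assumption taken from the description: 1-minute wait before forceful termination.
FORCE_TERMINATE_WAIT = timedelta(minutes=1)


def parse_log_time(text: str) -> datetime:
    """Parse a log timestamp such as '2018-03-21 13:42:29,118'."""
    return datetime.strptime(text, LOG_TIME_FORMAT)


granted = parse_log_time("2018-03-21 13:42:29,118")  # "Granted leadership"
active = parse_log_time("2018-03-21 13:45:51,311")   # "Actively managing"

# Time the GSM held leadership without actively managing the services.
activation_delay = active - granted

print(activation_delay)                         # 0:03:22.193000
print(activation_delay > FORCE_TERMINATE_WAIT)  # True
```

      With these two log lines the delay is 3 minutes 22 seconds, well beyond the 1-minute wait.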

      We should be seeing the following message:
      GSM INFO [com.gigaspaces.grid.gsm.services] - Forcefully destroy unhealthy service [data-processor.72 [1]]

      Instances that the GSM began actively managing before the timeout expired were terminated.
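
      The observed outcome per instance can be summarised with a toy model. This is only a sketch of the race as described in this issue, not the GSM implementation. The function name is_force_terminated is hypothetical, and it is an assumption that the 1-minute wait is measured from the first "Suspecting failure" message.

```python
from datetime import datetime, timedelta

# Timestamp format used in the log excerpts (comma before the milliseconds).
LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def is_force_terminated(suspected_at: datetime,
                        managing_at: datetime,
                        wait: timedelta = timedelta(minutes=1)) -> bool:
    """Toy model of the reported behaviour.

    An unhealthy instance is forcefully terminated only if the GSM is
    already actively managing it by the time the wait expires.
    Assumption: the wait starts at the first failure suspicion.
    """
    return managing_at <= suspected_at + wait


# Timestamps taken from the log excerpts for data-processor.72 [1].
suspected = datetime.strptime("2018-03-21 13:42:34,440", LOG_TIME_FORMAT)
managing = datetime.strptime("2018-03-21 13:45:51,311", LOG_TIME_FORMAT)

print(is_force_terminated(suspected, managing))  # False
```

      Under this model the instance is left in the stopped state, which matches the missing "Forcefully destroy unhealthy service" message.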

            People

            • Assignee: Moran Avigdor
            • Reporter: Moran Avigdor