Running with 3 PUs where one of those PUs is deployed with 14 partitions.
4 physical hosts in total. 2 LUS running on the hosts which were not killed.
Customer was able to reproduce the long failover with 3 spaces: 2 spaces have only one partition each and one space has 14 partitions. With backups this makes 32 stateful PUIs (16 primaries plus 16 backups). Failover sometimes takes more than 60 sec.
The EventExpireThread holds a given lock for an extended period of time (> 60 sec) and prevents another thread from signaling the waiting connection threads.
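
The contention pattern described above can be illustrated with a minimal, generic Java sketch. It is not the product code: the class, method and thread names are hypothetical, the lock and condition are plain java.util.concurrent primitives, and the hold time is cut to 200 ms so the sketch finishes quickly. One thread waits on a condition (standing in for a waiting connection thread), a second thread holds the shared lock while it sleeps (standing in for EventExpireThread), and the signaling thread cannot deliver its signal until that lock is released.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical illustration only; none of these names come from the product.
class LockHoldSketch {

    // Returns how long (in ms) the signaling thread was blocked on the shared lock.
    static long measureSignalDelayMillis(long holdMillis) {
        final ReentrantLock lock = new ReentrantLock();
        final Condition ready = lock.newCondition();
        final boolean[] signaled = {false};
        final CountDownLatch waiterParked = new CountDownLatch(1);
        final CountDownLatch holderHasLock = new CountDownLatch(1);

        // Stand-in for a connection thread waiting to be signaled.
        Thread waiter = new Thread(() -> {
            lock.lock();
            try {
                waiterParked.countDown();
                while (!signaled[0]) {
                    ready.awaitUninterruptibly(); // releases the lock while waiting
                }
            } finally {
                lock.unlock();
            }
        }, "connection-waiter");

        // Stand-in for the expiry thread: takes the lock and keeps it for holdMillis.
        Thread holder = new Thread(() -> {
            try {
                waiterParked.await();
                lock.lock();
                try {
                    holderHasLock.countDown();
                    Thread.sleep(holdMillis); // long-running work done under the lock
                } finally {
                    lock.unlock();
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "expire-thread-stand-in");

        waiter.start();
        holder.start();
        try {
            holderHasLock.await();
            long start = System.nanoTime();
            lock.lock(); // the signaler blocks here until the holder releases the lock
            long blockedMillis = (System.nanoTime() - start) / 1_000_000;
            try {
                signaled[0] = true;
                ready.signalAll();
            } finally {
                lock.unlock();
            }
            waiter.join();
            holder.join();
            return blockedMillis;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("signaler blocked for ~"
                + measureSignalDelayMillis(200) + " ms");
    }
}
```

In this sketch the delay seen by the waiter tracks the time the lock is held, so a hold of more than 60 sec would stall the waiting threads for a comparable period.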
reproduction test case