OACI HA: some VEs did not start after failover with PRL_ERR_TRY_AGAIN

Modified on: Thu, 23 Jan, 2020 at 5:20 AM

Operations Automation

Symptoms

After a Virtuozzo node outage, a large number of VEs (more than 100) were relocated to a single node, and some of them failed to start. The following error appears in /var/log/pa/vps.log on the OACI IM node:

    2017-08-31 14:27:14,343 (bb598b68-b195-4439-8ff4-e7ec92e81ad5) INFO  GenericVm2VfTask [Shared executor thread #11 @1 @INTERACTIVE] - done_with_message(-2147482544, PRL_ERR_TRY_AGAIN)
    2017-08-31 14:27:14,344 (bb598b68-b195-4439-8ff4-e7ec92e81ad5) WARN  GenericVm2VfTask [Shared executor thread #11 @1 @INTERACTIVE] - VM2VF operation [START] (reqId=113193) finished with rc=-2147482544
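
To find every start request affected in this way, the log can be scanned for the WARN line shown above. The following is a minimal sketch, assuming the log lines keep exactly the format of the excerpt; the function name and the regular expression are illustrative and not part of OACI. The identifier in parentheses is reported as-is, since the excerpt does not say whether it is a VE UUID or a task ID.

```python
import re

# Pattern derived from the WARN line in the excerpt above (an assumption:
# other OACI versions may format this line differently).
FAILED_OP = re.compile(
    r"^(?P<ts>\S+ \S+) \((?P<ctx>[0-9a-f-]+)\) WARN\s+GenericVm2VfTask .*"
    r"VM2VF operation \[(?P<op>\w+)\] \(reqId=(?P<req>\d+)\) "
    r"finished with rc=(?P<rc>-?\d+)"
)

# Return code logged together with PRL_ERR_TRY_AGAIN in the excerpt.
PRL_ERR_TRY_AGAIN_RC = -2147482544


def find_try_again_failures(lines):
    """Return (timestamp, context id, reqId) for each START operation
    that finished with the PRL_ERR_TRY_AGAIN return code."""
    hits = []
    for line in lines:
        m = FAILED_OP.search(line.strip())
        if m and m["op"] == "START" and int(m["rc"]) == PRL_ERR_TRY_AGAIN_RC:
            hits.append((m["ts"], m["ctx"], int(m["req"])))
    return hits
```

A possible use is `find_try_again_failures(open("/var/log/pa/vps.log"))`, run on the OACI IM node.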

Cause

When many VEs are started simultaneously on a single Virtuozzo node, they hit a bottleneck on parallel operations. Some start requests then fail with PRL_ERR_TRY_AGAIN, and no further attempt is made to start those VEs. An improvement to this behavior is planned under internal request CCU-17188.

Resolution

To prevent this, distribute VEs more evenly across the cluster so that, in case of a failover, the VEs from a failed node are relocated across many nodes rather than a large portion of them landing on a single node.
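
One way to spot an uneven distribution is to compare each node's VE count with the cluster average. The sketch below is illustrative only: the function name, the input dictionary, and the 1.5 ratio are assumptions, and the per-node counts have to be collected by whatever means is available in the environment.

```python
def overloaded_nodes(ve_counts, max_ratio=1.5):
    """Return the names of nodes hosting more than max_ratio times the
    average number of VEs per node.

    ve_counts maps a node name to the number of VEs it hosts.
    max_ratio is a hypothetical threshold, not a Virtuozzo or OACI setting.
    """
    if not ve_counts:
        return []
    average = sum(ve_counts.values()) / len(ve_counts)
    return sorted(node for node, count in ve_counts.items()
                  if count > max_ratio * average)
```

Nodes returned by this check are candidates for moving VEs elsewhere before an outage forces all of them onto one failover target.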

VEs that failed to start in this way should be started manually.
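
Because the error is caused by too many parallel start operations, it may help to start the failed VEs one at a time and retry after a pause instead of starting them all at once. The article does not name the tool used for the manual start, so the sketch below takes it as a parameter: `start_ve` is a hypothetical callable that returns True on success, and the retry count and delay are assumed values.

```python
import time


def start_with_retry(ve_ids, start_ve, retries=3, delay_seconds=30,
                     sleep=time.sleep):
    """Start VEs sequentially, retrying each failed start.

    start_ve: hypothetical callable taking a VE id, returning True on success.
    Returns the list of VE ids that still failed after all attempts.
    """
    still_failed = []
    for ve_id in ve_ids:
        for attempt in range(1, retries + 1):
            if start_ve(ve_id):
                break  # started; move on to the next VE
            if attempt < retries:
                sleep(delay_seconds)  # give the node time before retrying
        else:
            # The inner loop finished without a successful start.
            still_failed.append(ve_id)
    return still_failed
```

Any VE ids returned by the function need further investigation, since repeated failures suggest a cause other than temporary overload.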
