Symptoms

A sales order fails with the error:

Execution Failed: Operations Automation is not available. Connection to 192.168.90.11:8440 has failed: Connection timeout. Please check network settings.

The corresponding pem.activateSubscription API request sent from Billing to OA does not receive a response within 10 minutes. In /var/log/pa/core.log, the request hangs after provisioning of resources at the SaaS level has completed (a sketch for reproducing the call manually follows the log excerpts below):

Nov  7 20:45:01.615 : DBG [openapi:1605 openapi-task-95:939 pau]: c.p.p.s.p.e.SubscriptionManagerOpenAPI Entering activateSubscription, accountId: 1000010, subscriptionId: 1000050, subscriptionName: null, stId: 3, parentNotificationId: null
...
Nov  7 20:45:01.861 : DBG [openapi:1605 1:14928:7f7e757fb700 SAAS ]: [ {anonymous}::provideResourcesFromParamsImpl] <=== EXIT [0.000004]
Nov  7 20:45:01.861 : DBG [openapi:1605 1:14928:7f7e757fb700 SAAS ]: [ SaaS::SaaSManager_impl::doChangeSubscriptionLimits] <=== EXIT [0.036235]

whereas a successfully processed request normally ends with a line like:

Nov  7 03:31:43.085 : DBG [openapi:10 openapi-task-10 pau]: c.p.p.s.x.c.DynamicMethodHandler XML RPC invocation of 'activateSubscription' took 4594 ms
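
To reproduce the hang outside of Billing, the same call can be issued directly against the OA OpenAPI endpoint with a short client-side timeout. The sketch below is illustrative only: the host and port are taken from the error above, while the endpoint URL and the subscription_id parameter name are assumptions to be verified against the OpenAPI reference for your OA version. The call really activates the subscription, so use it only against the affected order in a controlled environment.

# Minimal sketch: call pem.activateSubscription directly against the OA OpenAPI
# endpoint and fail fast instead of waiting for Billing's 10-minute timeout.
# The URL and the parameter name are assumptions -- verify them against the
# OpenAPI reference for your OA version.
import socket
import xmlrpc.client

OA_OPENAPI_URL = "http://192.168.90.11:8440"  # host:port from the error message above
CALL_TIMEOUT = 60                             # seconds, well below Billing's 10 minutes

socket.setdefaulttimeout(CALL_TIMEOUT)
server = xmlrpc.client.ServerProxy(OA_OPENAPI_URL)

try:
    # OpenAPI methods take a single struct argument; 1000050 is the subscription
    # from the core.log excerpt above.
    result = server.pem.activateSubscription({"subscription_id": 1000050})
    print("activateSubscription returned:", result)
except socket.timeout:
    print("No response within", CALL_TIMEOUT, "seconds - the OpenAPI request hangs")

If this call hangs while plain TCP connections to port 8440 still succeed, the problem is inside OA itself (in this case the database), not in the network between Billing and OA.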

The following error can be found in /var/log/pa/console.log some time before the incident:

171107 10:32:58 WARN  [org.jboss.jca.core.connectionmanager.pool.strategy.OnePool] (EJB default - 7) IJ000604: Throwable while attempting to get a new connection: null: javax.resource.ResourceException: IJ031084: Unable to create connection
...
Caused by: org.postgresql.util.PSQLException: FATAL: the database system is in recovery mode
        at org.postgresql.core.v3.ConnectionFactoryImpl.doAuthentication(ConnectionFactoryImpl.java:443)
        at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:217)
        at org.postgresql.core.ConnectionFactory.openConnection(ConnectionFactory.java:52)
        at org.postgresql.jdbc.PgConnection.<init>(PgConnection.java:216)
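
The FATAL error above means JBoss could not obtain a database connection because PostgreSQL was performing crash recovery at that moment. Whether the database accepts connections again can be checked with a direct probe; the snippet below is a minimal sketch assuming the default oss database name, while the host and credentials are environment-specific placeholders.

# Quick probe: does the OA database accept connections again, or does it still
# answer with "the database system is in recovery mode"? Host, database name,
# user and password are placeholders -- use the values from your OA setup.
import psycopg2

try:
    conn = psycopg2.connect(host="oa-db-host", dbname="oss",
                            user="oss", password="***", connect_timeout=5)
    with conn.cursor() as cur:
        cur.execute("SELECT version()")
        print("Database is up:", cur.fetchone()[0])
    conn.close()
except psycopg2.OperationalError as exc:
    # While recovery is in progress, this fails with the same FATAL as in console.log.
    print("Database is not available:", exc)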

At the same time in /var/lib/pgsql/9.6/data/pg_log/postgresql-Tue.log:

[2017-11-07 10:41:12.178 EET] p=19835:303@0/0 c=oss@192.168.10.10/oss:[unknown] PANIC:  hash table "Shared Buffer Lookup Table" corrupted

The Virtuozzo container hosting the OA Management Node was migrated online exactly at the time of the PostgreSQL error. This can be seen in /var/log/messages on the Virtuozzo hardware node:

Nov  7 10:40:00 pcs01 vzmdest[181412]: Start of CT 101 migration (private /vz/private/101, root /vz/root/101, opt=24)
Nov  7 10:40:58 pcs01 vzmdest[181412]: OfflineManagement CT#101 ...
Nov  7 10:40:58 pcs01 vzmdest[181412]: done
Nov  7 10:40:58 pcs01 vzmdest[181412]: Undumping CT#101 ...
Nov  7 10:41:11 pcs01 kernel: [78571813.682698] CT: 101: restored
Nov  7 10:41:11 pcs01 vzmdest[181412]: done
Nov  7 10:41:11 pcs01 vzmdest[181412]: Resuming CT#101 ...
Nov  7 10:41:11 pcs01 vzmdest[181412]: done
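
Correlating the three logs quoted above by timestamp is what ties the incident together: the JBoss pool failure and the PostgreSQL PANIC fall inside the vzmdest migration window. The helper below is a rough sketch for pulling out the relevant marker lines; the paths are the defaults used in this article, and note that /var/log/messages lives on the Virtuozzo hardware node while the other two logs are inside the OA container.

# Rough helper: print the key marker lines from the logs quoted above so their
# timestamps can be compared side by side. Paths are the defaults from this
# article and may differ in your environment.
MARKERS = {
    "/var/log/pa/console.log": ["IJ000604", "database system is in recovery mode"],
    "/var/lib/pgsql/9.6/data/pg_log/postgresql-Tue.log": ["PANIC"],
    "/var/log/messages": ["vzmdest"],
}

for path, patterns in MARKERS.items():
    print("==", path)
    try:
        with open(path, errors="replace") as log:
            for line in log:
                if any(p in line for p in patterns):
                    print("  " + line.rstrip())
    except OSError as exc:
        print("  cannot read:", exc)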

Cause

A Virtuozzo online migration may corrupt the shared memory segments that the PostgreSQL server relies on.

Reference: checkpointing shared memory fails

Resolution

To fix the issue, restart the OA services on the Management Node.
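
A rough sketch of the restart step is below. The service names are assumptions (pa-agent and pau are common on an OA 7.x Management Node) and systemd is assumed; verify the exact service names and init system for your OA version before running anything like this.

# Sketch only: restart the OA services on the Management Node and show their
# status afterwards. Service names and the use of systemctl are assumptions --
# adjust to your OA version and init system.
import subprocess

OA_SERVICES = ["pa-agent", "pau"]   # assumed names; check your installation

for svc in OA_SERVICES:
    print("Restarting", svc)
    subprocess.run(["systemctl", "restart", svc], check=True)

subprocess.run(["systemctl", "--no-pager", "status"] + OA_SERVICES, check=False)

After the restart, the pem.activateSubscription call from the Symptoms section should complete normally.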

Avoid using Virtuozzo online migration for the management containers (OA and BA) if they are located on Virtuozzo nodes of version 6.0 or earlier.
