Symptoms

NOTE: The issue only affects OA 6.0. If any similar symtoms are observed on higher versions, investigate them separately.

Various VE-related operations fail in CCP. Examples of events:

  • VE start / stop operations
  • VE password reset
  • Firewall configuration tasks
  • Pressing 'Log in to Plesk' button returns 'Service Unavailable'
  • Initial provisining of a VE, an example of a failed task output:

    Provisioning "Panel_6_0" for APS application Parallels Plesk Panel
    APS Application Error:
    406 Not Acceptable
    [ApplicationUnknownError] Provisioning: resource <user_uuid> of type 'environments' (http://www.parallels.com/infrastructure/virtual-environment/1.0) for APS application 'Cloud Infrastructure-1.4-1951': P100001: Operation failed; details: [Failed to create VE].
    

The same behavior can be confirmed for any VEs located on a specific hardware node.

A timeout message can be found in /var/log/IM/PACI-vm2vf.log for the corresponding operations:

2015-12-14 20:01:22,850 (1930bc61-172e-4fec-af29-b296f9825cc2) INFO  LoggingProxy [RequestProcessor-1284] - invoked: create_firewall_rule(
2015-12-14 20:01:23,200 (1930bc61-172e-4fec-af29-b296f9825cc2) INFO  NativeVm2VfCode [RequestProcessor-1284] - [78861:819410] INF ==== found mutex for 10.250.0.116 @[common/common_lib.c][575][establish_connection_ex][29040])
...
2015-12-14 20:21:22,851 (1930bc61-172e-4fec-af29-b296f9825cc2) INFO  NativeVm2VfCode [Thread-894374] - [78861:819410] INF pthread_cond_timedwait [110]: cbp 0x7fd4c79852e0 job 0x2e28cf [__PrlSrv_GetVmConfig] index [78853] (__configure_firewall_cb, firewall.c, 248) @[common/generic_sdk_cb_tools.c][139][cbps_timeout_sleep][29540])
2015-12-14 20:21:22,851 (1930bc61-172e-4fec-af29-b296f9825cc2) INFO  NativeVm2VfCode [Thread-894375] - [78861:819410] INF PrlJob_GetStatus : job 0x2e28cf PJS_FINISHED @[common/generic_sdk_cb_tools.c][101][cbps_job_check_status_ex][29540])
2015-12-14 20:21:22,852 (1930bc61-172e-4fec-af29-b296f9825cc2) INFO  NativeVm2VfCode [Thread-894376] - [78861:819410] INF TIMEOUT: job [0x2e28cf] inactive, unreg callback @[common/generic_sdk_cb_tools.c][149][cbps_timeout_sleep][29540])

Other tasks, invoking connection to the same hardware node, fail with the same timeout, e.g.:

2015-12-14 15:01:12,132 (d9279218-7b4c-4316-90f7-c449c4e003b6:44) INFO  NativeVm2VfCode [RequestProcessor-1223] - [74528:774720] INF creating: cbp 0x7fd4c4ecaf10 job 0x2c0230 [__PrlSrv_GetVmConfig] index [74520] (get_netstat_cb, traffic.c, 144) @[common/generic_sdk_cb.c][656][reg_next_generic_sdk_cb][30309])
2015-12-14 15:21:10,825 (d9279218-7b4c-4316-90f7-c449c4e003b6:44) INFO  NativeVm2VfCode [Thread-862931] - [74528:774720] INF pthread_cond_timedwait [110]: cbp 0x7fd4c4ecaf10 job 0x2c0230 [__PrlSrv_GetVmConfig] index [74520] (get_netstat_cb, traffic.c, 144) @[common/generic_sdk_cb_tools.c][139][cbps_timeout_sleep][30725])

To see the affected hardware nodes, the following commands can be used:

# awk -F'[():]' '/pthread_cond_timedwait.*GetVmConfig/{print$4}' /var/log/IM/PACI-vm2vf.log | while read u; do grep "$u.*found mutex" /var/log/IM/PACI-vm2vf.log |awk '{print$14}' ; done | sort | uniq

In order to check all calls to a specific hardware node, the following command can be used (replace the IP address with the needed one):

# awk -F'[():]' '/found mutex for 172.16.0.234/{print$4}' /var/log/IM/PACI-vm2vf.log | sort | uniq | while read u; do grep -m 1 "$u.*done" /var/log/IM/PACI-vm2vf.log; done

Output example:

2016-02-23 08:27:36,075 (03849efe-a101-4f6d-9c76-f7bdbde47258) DEBUG NativeVm2VfCode [Thread-342851] - [32457:4907494] Callback invocation: done(-2147483627, 5880159666763202560)

"done(-2147483627" pattern points to a failure of the operation due to a callback timeout.

In all cases, the hung call is __PrlSrv_GetVmConfig. It's received on the hardware node side and logged to /var/log/parallels.log:

12-14 20:01:23.205 F /disp:4581:467827/ Processing command 'DspCmdGetHostCommonInfo' 2048 (PJOC_SRV_GET_COMMON_PREFS)
12-14 20:01:23.311 F /disp:4581:467827/ Processing command 'DspCmdGetVmConfigById' 2173 (PJOC_SRV_GET_VM_CONFIG)

Note: the call is always accompanied by GetHostCommonInfo -- this request checks if connection to HW node is alive from the IM side.

Cause

The issue is recognized under internal ID CCU-12320.

Resolution

To make the occurrences less frequent, configure tcp_retries* sysctl parameters:

# echo "net.ipv4.tcp_retries1=10" >> /etc/sysctl.conf
# echo "net.ipv4.tcp_retries2=50" >> /etc/sysctl.conf
# sysctl -p

As an immediate workaround, kill vm2vf Java process, having assured there are no pending callbacks in queue:

  • checking the callbacks:

    [root@poaman ~]# curl -u'admin:***' http://<paci_im_ip>:4465/paci/v1.0/vm2vf
    <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
    <pending-callbacks/>
    
  • checking the process list:

    [root@paciapi 2035849]# ps aux | grep vm2vf
    PACI      6307  0.0  0.0 106332  1008 ?        S    Dec14   0:00 /bin/bash /usr/local/share/PACI-im/scripts/vm2vf-daemon
    PACI      6310  0.0  0.0 106336  1096 ?        S    Dec14   0:00 /bin/bash /usr/local/share/PACI-im/scripts/vm2vf-daemon
    PACI      6314  1.4  2.1 11365968 368176 ?     Sl   Dec14  20:30 /usr/java/default/bin/java -Djava.library.path=/usr/local/share/PACI-im/lib -Dorg.omg.CORBA.ORBClass=org.jacorb.orb.ORB -Dorg.omg.CORBA.ORBSingletonClass=org.jacorb.orb.ORBSingleton -cp .:./vm2vf.jar:./lib/activemq-client-5.9.0.jar:./lib/activemq-jms-pool-5.9.0.jar:./lib/aopalliance.jar:./lib/commons-collections-3.2.1.jar:./lib/commons-configuration-1.6.jar:./lib/commons-lang-2.6.jar:./lib/commons-lang3-3.1.jar:./lib/commons-pool-1.6.jar:./lib/geronimo-j2ee-management_1.1_spec-1.0.1.jar:./lib/geronimo-jms_1.1_spec-1.1.1.jar:./lib/gluegen-rt.jar:./lib/gluegen-rt-natives-linux-amd64.jar:./lib/gluegen-rt-natives-windows-amd64.jar:./lib/guava-17.0.jar:./lib/guice-3.0.jar:./lib/guice-assistedinject-3.0.jar:./lib/guice-jmx-3.0.jar:./lib/guice-jndi-3.0.jar:./lib/guice-multibindings-3.0.jar:./lib/guice-servlet-3.0.jar:./lib/guice-throwingproviders-3.0.jar:./lib/hawtbuf-1.9.jar:./lib/jacorb.jar:./lib/javax.inject.jar:./lib/jaxb-api.jar:./lib/jaxb-impl.jar:./lib/jcl-over-slf4j-1.7.5.jar:./lib/jul-to-slf4j-1.7.5.jar:./lib/kxml2-2.3.0.jar:./lib/log4j-over-slf4j-1.7.5.jar:./lib/logback-classic-1.0.13.jar:./lib/logback-core-1.0.13.jar:./lib/ri-0.1-SNAPSHOT.jar:./lib/slf4j-api-1.7.5.jar:./lib/slf4j-ext-1.7.5.jar:./lib/xstream-1.4.5.jar:. -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode -XX:+CMSIncrementalPacing -XX:CMSIncrementalDutyCycleMin=0 -XX:CMSIncrementalDutyCycle=10 -XX:CMSInitiatingOccupancyFraction=70 -Xmx512m -Xms256m -XX:+HeapDumpOnOutOfMemoryError com.parallels.c2u.im.vm2vf.remote.Vm2VfDaemon
    
  • Kill the Java PID:

    # kill -9 6314
    

Internal content

Link on internal Article