Symptoms

A Linux Provisioning Gateway Host (LINPGH) has become unmanageable from OA. The following details help to identify this particular scenario:

  • Server state is Off in PCP
  • APS 1.2 tasks fail in OA Task Manager when they attempt to perform an operation on LINPGH
  • pa-agent is running on LINPGH, but it does not accept new connections from the OA Management Node
  • A large number of long-running PHP scripts are observed on LINPGH:

    linpgh01 ~ # ps -eo pid,stime,etime,cmd --sort=start_time |grep php |grep -v grep
     6082 Jan24 13-23:25:47 php -q some_script_name.php
    28948 Jan25 12-18:38:52 php -q some_script_name.php
    26316 Jan25 12-17:08:50 php -q some_script_name.php
    28255 Jan25 12-12:08:47 php -q some_script_name.php
    28257 Jan25 12-12:08:47 php -q some_script_name.php
    10661 Jan25 12-08:07:49 php -q some_script_name.php
    ..
    12586 Feb03  3-12:49:53 php -q some_script_name.php
    31736 Feb05  1-06:19:47 php -q some_script_name.php
     4462 Feb06  1-02:56:53 php -q some_script_name.php
     4463 Feb06  1-02:56:53 php -q some_script_name.php
    29342 Feb06    06:09:55 php -q some_script_name.php
    

    APS PHP scripts are started by pa-agent. This can be verified in the output of ps auxww:

    root     10108  0.0  0.2  38828  4772 ?        S    Feb05   0:00 /usr/local/pem/sbin/pa-agent --props-file /usr/local/pem/etc/pleskd.props --send-signal
    root      4460  0.0  0.0  65948  1344 ?        S    Feb06   0:00  \_ /bin/sh /usr/local/pem/APS/scripts/143/1.0-60/r26565_3399482224.sh
    root      4462  0.0  0.4 161564  8320 ?        S    Feb06   0:00      \_ php -q some_script_name.php
    

    As soon as all pa-agent workers (10 by default) are occupied executing such scripts, LINPGH becomes unmanageable from OA.

  • strace shows that the PHP process waits for input indefinitely (example for the process with PID 29342):

    linpgh01 ~ # strace -tTfFs10000 -p 29342
    Process 29342 attached - interrupt to quit
    05:38:29 read(3,  <unfinished ...>
    
  • There is a TCP connection created by the same process:

    linpgh01 ~ # netstat -antpl | egrep 'Local Address|29342'
    Proto Recv-Q Send-Q Local Address               Foreign Address             State       PID/Program name
    tcp        0      0 LOCAL_IP:39018          REMOTE_IP:443            ESTABLISHED 29342/php
    linpgh01 ~ # 
    
  • On the remote server (the server with REMOTE_IP), no such connection exists, i.e. there is no TCP connection from LINPGH port 39018 to local port 443
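
The process-related symptoms above can be checked quickly with a small helper. The following is a minimal sketch, not part of the original procedure: the function names and the WORKERS variable are hypothetical, and the value 10 is simply the default worker count mentioned above. It relies on the fact that ps prints etime as [[dd-]hh:]mm:ss, so a "-" in that field means the process is older than one day.

```shell
#!/bin/sh
# Hypothetical helper: count php processes that have been running for at
# least one day. Reads "pid etime comm" lines on stdin.
count_day_old_php() {
    awk '$3 == "php" && $2 ~ /-/ { n++ } END { print n + 0 }'
}

# Assumption: 10 is the default number of pa-agent workers named in this article.
WORKERS=${WORKERS:-10}

# Compare the number of day-old php processes with the worker limit.
report_stuck_php() {
    stuck=$(ps -eo pid=,etime=,comm= | count_day_old_php)
    echo "php processes older than 1 day: $stuck (pa-agent workers: $WORKERS)"
    if [ "$stuck" -ge "$WORKERS" ]; then
        echo "all pa-agent workers are probably occupied"
    fi
}

report_stuck_php
```

A count at or above the worker limit only suggests this scenario; the strace and netstat checks are still needed to confirm that the processes are blocked on a dead connection.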

Cause

The PHP script did not receive a response from the remote server due to a network issue. The corresponding TCP connection is still open on LINPGH, but it no longer exists on the remote server side.

PHP settings that limit script execution time are ignored in this situation (values are in seconds):

    linpgh01 ~ # egrep '^max_execution_time|^max_input_time' /etc/php.ini
    max_execution_time = 30
    max_input_time = 60

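The PHP script therefore runs indefinitely: on Linux, max_execution_time does not count time that a script spends waiting in system calls or stream operations, such as a blocked read() on a socket.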

Resolution

Configure a cron task on LINPGH that checks the execution time of PHP scripts. If a script has been running longer than the specified timeout, it is forcibly aborted. The default timeout is 36000 seconds (10 hours).

  1. Download the script kill_stuck_php.sh into the folder /root/scripts/ on LINPGH

  2. Add the following lines to /etc/crontab:

    linpgh01 ~ # grep -i php /etc/crontab
    # Forcibly stop PHP-scripts that got stuck for more than 10 hours
    0 */1 * * * root /root/scripts/kill_stuck_php.sh >> /var/log/kill_stuck_php.log 2>&1
    
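  3. Restart the cron service: execute service crontab restart (the service name may differ by distribution, for example crond on RHEL-based systems)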

The script is started every hour and appends its results to /var/log/kill_stuck_php.log. Example:

    linpgh01 ~ # cat /var/log/kill_stuck_php.log
    Wed Feb  7 06:36:01 CET 2018
            killing php process pid=4462 executable=php state=S user=root etime=1-03:54:56
            killing php process pid=4463 executable=php state=S user=root etime=1-03:54:56
            killing php process pid=31736 executable=php state=S user=root etime=1-07:17:50
    # 3 process(es) killed
    Wed Feb  7 07:00:01 CET 2018
    Wed Feb  7 08:00:01 CET 2018
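
The check performed by the cron task can be sketched as follows. This is an illustration only and not the actual kill_stuck_php.sh, whose contents are not shown in this article: the function names, the TIMEOUT and DRY_RUN variables, the simplified output format and the use of SIGKILL are all assumptions. The sketch converts the ps etime value ([[dd-]hh:]mm:ss) to seconds and compares it with the timeout.

```shell
#!/bin/sh
# Minimal sketch only: NOT the actual kill_stuck_php.sh.
# Names, variables and output format are illustrative assumptions.

TIMEOUT=${TIMEOUT:-36000}   # seconds; 36000 = 10 hours
DRY_RUN=${DRY_RUN:-1}       # 1 = report only (default here), 0 = really kill

# Convert a ps etime value ([[dd-]hh:]mm:ss) to seconds.
# awk is used so that fields with leading zeros are treated as decimal.
etime_to_seconds() {
    echo "$1" | awk '{
        days = 0
        if (split($0, d, "-") == 2) { days = d[1]; rest = d[2] } else { rest = $0 }
        n = split(rest, t, ":")
        if (n == 3) secs = t[1] * 3600 + t[2] * 60 + t[3]
        else        secs = t[1] * 60 + t[2]
        print days * 86400 + secs
    }'
}

# Read "pid etime comm" lines on stdin and print "pid etime" for every
# php process whose elapsed time exceeds TIMEOUT.
find_stuck_php() {
    while read -r pid etime comm; do
        [ "$comm" = "php" ] || continue
        if [ "$(etime_to_seconds "$etime")" -gt "$TIMEOUT" ]; then
            echo "$pid $etime"
        fi
    done
}

date
ps -eo pid=,etime=,comm= | find_stuck_php | while read -r pid etime; do
    echo "        killing php process pid=$pid etime=$etime"
    if [ "$DRY_RUN" = "0" ]; then
        kill -9 "$pid"   # assumption: "aborted forcibly" is taken to mean SIGKILL
    fi
done
```

In this sketch DRY_RUN defaults to 1, so a first run only lists the candidates; setting DRY_RUN=0 enables the kill. With the default timeout, a process with etime 06:09:55 (22195 seconds) is left alone, while one with etime 1-03:54:56 (100496 seconds) is selected.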
