This article provides general troubleshooting steps for the most common issues with Linux Shared Hosting NG for Parallels Operations Automation (POA) 5.4. For POA 5.5, please refer to the 5.5 LSH deployment guide page 40: "Troubleshooting: Webcluster Issues".

General problems with an NG cluster

Symptoms - Websites hosted in an NG cluster do not work. In general, when a problem with NG hosting arises, there is a critical failure in one of the NG components, namely:

  • Load balancer (LB)
  • NG Caching Service (shstg)
  • Apache server (httpd)
  • NG Configuration Database (CDB)
  • NFS Storage

The most common problems occur with shstg/httpd services.

The Apache server, through the NG module mod_vhost, uses the <hostname, IP address> pair to obtain virtual host configuration data from the CDB server via the NG Caching Service (shstg).

Occasionally, shstg may stop providing data to Apache. First, check that:

  1. The shstg service is running on web servers in the NG cluster and there is only one instance of it:

    ~# ps aux | grep shstg
    root        5055  0.0  0.2 157028  3020 ?        Ssl  Jul02  18:56
    /usr/sbin/shstg_srv /etc/h2e_shstg.conf
    

    If it is not running or there are two or more instances of shstg in the process list (see KB article #113154 for more details), restart shstg:

    ~# /etc/init.d/shstg restart
    
  2. The Apache server is running.

    Check the process list in the same way as above and, if Apache is not running, start it:

    ~# /etc/init.d/httpd start
    
  3. The NG Caching Service has access to the PostgreSQL database on the CDB server. In the default, correct configuration you will see a network connection from the NG Caching Service (shstg_srv) on the web server to port 5432 on the CDB server:

    ~# netstat -antp | grep :5432
    tcp        0      0 10.39.84.43:43269           10.39.94.114:5432
    ESTABLISHED 5055/shstg_srv
    
  4. NFS shared storage is mounted on all web servers in the cluster. Verify this with the df and mount utilities.

    Remember that NFS shared storage is mounted on web servers using automount technology, so if there are no open requests to websites, NFS storage may be automatically unmounted after an idle timeout.

    To check if automount works correctly, try to list the contents of any webspace on the web server, or list the mount point where NFS storage is mounted.

    For example, if the NFS volume with ID #1 is configured in the web cluster properties, then in the default installation, you might try to list the folder /var/www/vhosts/1. Even if it is not mounted already, automount will mount it automatically.
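
    For example (assuming the default mount point mentioned above; adjust the path to your installation):

    ~# ls /var/www/vhosts/1
    ~# mount | grep vhosts
    ~# df -h /var/www/vhosts/1

    If the listing hangs, or the mount point does not show up in the mount/df output even after listing it, check the NFS server and the automount service on the web server.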

  5. If both the shstg and httpd services are running and there are no suspicious errors in the logs, there may be a problem on the Load Balancer. Read KB article #114327 for more details about NG Load Balancer configuration and functionality.

    pulse is the controlling daemon that spawns the lvsd daemon and performs heartbeating and monitoring of services on the real web servers in the NG cluster.

Make sure pulse and its child processes lvsd and nanny are running on the LB server:

 1462 ?        Ss     1:30 pulse
 1470 ?        Ss     0:34  \_ /usr/sbin/lvsd --nofork -c /etc/sysconfig/ha/lvs.cf
 1488 ?        Ss     3:46      \_ /usr/sbin/nanny -c -h 10.39.84.43 --server-name 10.39.94.43 -p 80 -r 80 -f 100 -s GET / HTTP/1.0\r\n\r\n -x HTTP -a 10 -I /sbin/ipvsadm -t 10 -w 32 -V 0.0.0.0 -M g -U /usr/sbin/h2e_get_cluster_load.sh --lvs
 1489 ?        Ss     3:43      \_ /usr/sbin/nanny -c -h 10.39.84.44 --server-name 10.39.94.44 -p 80 -r 80 -f 100 -s GET / HTTP/1.0\r\n\r\n -x HTTP -a 10 -I /sbin/ipvsadm -t 10 -w 32 -V 0.0.0.0 -M g -U /usr/sbin/h2e_get_cluster_load.sh --lvs
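
The process tree above can be obtained, for example, with the following command (any tree-style ps listing will do; the option set below is just one possibility):

~# ps axfww | egrep 'pulse|lvsd|nanny'

If pulse is not running at all, start it with /etc/init.d/pulse start (the pulse service is restarted the same way after changing /etc/sysconfig/ha/lvs.cf).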

If there are problems with Apache on the web server, you will see the error message [inactive] shutting down in the /var/log/messages file on the LB server:

Jul 16 04:41:14 nglb nanny[1488]: [inactive] shutting down  10.39.84.43:80 due to connection failure

When the Apache server is back on the web server again, another entry will be put into the /var/log/messages file on the LB server:

Jul 16 04:41:54 nglb nanny[1488]: [ active ] making 10.39.84.43:80 available

So, check the system log file on the Load Balancer to detect problems with Apache on the web servers in the NG cluster.
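
For example, a quick way to review the latest nanny events on the LB server (the log path is the default one shown above):

~# grep nanny /var/log/messages | tail -n 20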

Use the ipvsadm utility to check the current load balancing statistics and rules (including the web servers' weights). At the very least, check that all web servers in the cluster are listed by ipvsadm:

~# ipvsadm --list
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
FWM  100 lblc
  -> 10.39.84.43:0                Route   32     0          0
  -> 10.39.84.44:0                Route   32     0          0

~# ipvsadm -L --stats
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port               Conns   InPkts  OutPkts  InBytes OutBytes
  -> RemoteAddress:Port
FWM  100                             62337   825314        0 49918802        0
  -> 10.39.84.43:0                      31      243        0    20959        0
  -> 10.39.84.44:0                       1        1        0       60        0

If the steps above do not rectify the problem, use tcpdump on the LB and web servers to monitor traffic.
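
For example, a minimal capture of HTTP traffic between the LB and one web server might look like this (the interface name is a placeholder; the IP address is taken from the examples above):

~# tcpdump -ni eth0 host 10.39.84.43 and port 80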

NG cluster performance problems

In these cases, a high load (>20-30) will be observed on the web server, and you will usually notice a lot of php-cgi processes.

This is a rare situation. It means that all processes are limited by the same resource. Currently, it is suspected that NFS could be a limiting factor under certain circumstances.

The general way to troubleshoot the performance issues would be to:

  1. Set up monitoring
  2. Locate the bottleneck
  3. Fix the bottleneck

The most likely bottlenecks are as follows:

  1. NFS server performance problems

  2. A slow or limited network connection between web servers and the CDB server or MySQL server(s) where customers' databases are working

  3. The load balancer algorithm needs tuning

An additional note about point two above: customer websites are usually not static; they have web applications installed that work with a database, most often MySQL. The web application runs on the NG web server and connects to a MySQL database hosted on a remote server.

If many websites connect to the same MySQL server simultaneously, the connection between the NG web servers and the MySQL server must have enough bandwidth. Otherwise, the web application will wait for data from the MySQL server, increasing the load on the NG web server.
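
One rough way to see how many connections a web server currently keeps open to each MySQL server is to count established connections to the default MySQL port, 3306 (adjust the port if a non-default one is used):

~# netstat -ant | awk '$5 ~ /:3306$/ && $6 == "ESTABLISHED" {print $5}' | sort | uniq -c | sort -rn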

If the MySQL server is deployed inside a Parallels Virtuozzo Containers (PVC) container, then PVC may limit outgoing traffic for such containers using the traffic shaper. This is standard PVC functionality, and for the connection between the NG web servers and MySQL servers it may become a bottleneck. The same applies to the connection between the NG web servers (the caching service, shstg) and the NG Configuration Database server, which may also be deployed inside a PVC container.
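
As a quick check (the paths below are the standard Virtuozzo/OpenVZ defaults and may differ in your environment), you can see whether traffic shaping is enabled on the hardware node and what rate is configured for a particular container:

~# grep TRAFFIC_SHAPING /etc/vz/vz.conf
~# grep -E 'RATE|RATEBOUND' /etc/vz/conf/CTID.conf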

To solve the problem with traffic limiting, find which MySQL servers are being used by NG websites and see if they are deployed inside a PVC container. Then, consider doing the following on the corresponding PVC server where MySQL or the NG Configuration Database servers are running:

  • Stop the traffic shaper on the PVC server (confirm with the Provider, since this will stop traffic shaping for all containers on the server):

    ~# /etc/init.d/vz shaperoff
    
  • Increase the bandwidth limit for containers with MySQL and NG CDB servers, e.g., to 100 Mbps:

    ~# vzctl set CTID --rate 102400 --save
    

If traffic control is enabled on the MySQL server itself and it is not a container, verify it as follows:

tc -s qdisc ls

qdisc pfifo_fast 0: dev eth1 root refcnt 2 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
 Sent 529157756 bytes 3104367 pkt (dropped 0, overlimits 0 requeues 8)
 rate 0bit 0pps backlog 0b 0p requeues 8
qdisc pfifo_fast 0: dev eth0 root refcnt 2 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
 Sent 915336246 bytes 8559455 pkt (dropped 0, overlimits 0 requeues 8)
 rate 0bit 0pps backlog 0b 0p requeues 8

To delete the pfifo_fast qdisc, run the following commands on the MySQL server:

tc qdisc del dev eth0 root

tc qdisc del dev eth1 root

and restart the server.

General troubleshooting tips:

  1. Try to increase the number of worker processes on the NFS server (the parameter RPCNFSDCOUNT in the /etc/sysconfig/nfs file). By default, it is set to "32"; change it to "64" or even "128" and restart the "nfs" service:

    # service nfs restart
    
  2. If the NFS server is using NFS v3 and supports NFS v4, consider switching to NFS v4.
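
    For example, a client-side mount forcing NFS v4 might look like this (the server name, export path and mount point are placeholders; test on one web server before changing the cluster-wide configuration):

    ~# mount -t nfs4 nfs-server:/export/vhosts /mnt/nfs-test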

  3. In applications where data loss is not a big concern, the NFS volume may be exported with the async option. This makes NFS work faster because the server replies to requests before any changes made by that request have been committed to stable storage. However, a loss of power or network connectivity can result in a loss of data. As a result, this option is not recommended for production, though it may increase performance.
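
    For illustration only (the export path and client network are placeholders), an async export in /etc/exports might look like this, applied with exportfs:

    /export/vhosts 10.39.84.0/24(rw,async)

    ~# exportfs -ra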

  4. Mount the NFS partition on the NFS server with the noatime, nodiratime options:

    1. Add the noatime, nodiratime options to the partition's entry in the /etc/fstab file on the NFS server.

    2. Remount the NFS partition on the fly:

      ~# mount -o remount,noatime,nodiratime /PATH/TO/NFS/PARTITION
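
    For reference, a sample /etc/fstab entry with these options might look like this (the device, mount point and filesystem type are placeholders):

    /dev/sdb1  /export/vhosts  ext4  defaults,noatime,nodiratime  1 2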
      
  5. Consider changing the load balancing algorithm to "wrr" in the /etc/sysconfig/ha/lvs.cf file on the LB server and then restarting the pulse service.

  6. Consider enabling persistent connections on the LB server (add the parameter persistent = 300 to /etc/sysconfig/ha/lvs.cf and restart the pulse service).
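
    For reference, the directives mentioned in the two items above live inside the virtual server block of /etc/sysconfig/ha/lvs.cf and might look roughly like this (the block name is a placeholder; only the two relevant lines are shown):

    virtual webcluster {
        ...
        scheduler = wrr
        persistent = 300
        ...
    }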

Try to strace php-cgi processes to check what they are waiting for.
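
For example (the PID selection below is just one convenient way to attach to the most recently started php-cgi process; the output file is arbitrary):

~# strace -f -tt -p $(pgrep -n php-cgi) -o /tmp/php-cgi.strace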

You may also use the following script to gather information about the server's load:

#!/bin/bash

# Integer part of the 1-minute load average
LA=`cat /proc/loadavg | awk -F"." '{print $1}'`
dir=/root/swstrace

# Collect diagnostics only when the load average exceeds the threshold
if (( $LA > 4 )); then

    # One snapshot directory per run: day-hour-minute plus the load average value
    newdir=$dir/`date +%d-%H-%M`-LA$LA
    mkdir -p "$newdir"

    # Snapshots of network connections and the full threaded process list
    netstatls=$(netstat -ntup)
    procslist=$(ps -eo pid,user,ppid,rtprio,ni,pri,psr,pcpu,stat,wchan:14,start_time,time,command -L)

    # HTTP(S) connections and their states
    echo "Number of http connections: $(echo "$netstatls" | egrep -c ':80|:443')" >> $newdir/averageinfo
    echo "$netstatls" | egrep ':80|:443' | awk '{print $6}' | sort | uniq -c >> $newdir/averageinfo

    # Connections that belong to PHP processes and their states
    echo -e "Number of PHP connections: $(echo "$netstatls" | egrep -c php) \n $(echo "$netstatls" | egrep php | awk '{print $6}' | sort | uniq -c)" >> $newdir/averageinfo

    # Running processes overall and php-cgi processes by state
    echo "Number of running procs: $(grep running /proc/stat | awk '{print $2}')" >> $newdir/averageinfo
    echo "Number of php-cgi procs: `echo "$procslist" | grep -c 'cgi.*php'`" >> $newdir/averageinfo
    echo "php-cgi procs in R state: `echo "$procslist" | grep -c 'R.*cgi.*php'`" >> $newdir/averageinfo
    echo "php-cgi procs in D state: `echo "$procslist" | grep -c 'D.*cgi.*php'`" >> $newdir/averageinfo

    # Disk I/O and memory activity
    echo -e "Disks IO activity:\n `iotop -bon1`" >> $newdir/averageinfo
    echo -e "Mem info:\n `vmstat 1 3`" >> $newdir/averageinfo

    # Full process list, LVE statistics, network connections and NFS client statistics
    echo "$procslist" >> $newdir/processlist
    /usr/sbin/lveps -p > $newdir/lveps
    #/usr/sbin/lsof > $newdir/lsof &
    netstat -ntpu > $newdir/netstat
    /usr/sbin/nfsstat -c > $newdir/nfsstat ; sleep 1 ; /usr/sbin/nfsstat -c >> $newdir/nfsstat &

fi
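
The script does not schedule itself; one way to use it (the file name and schedule are arbitrary) is to save it, for example, as /root/loadcheck.sh, make it executable, and run it from cron every minute:

~# chmod +x /root/loadcheck.sh
~# echo '* * * * * root /root/loadcheck.sh' >> /etc/crontab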

See the main Knowledgebase article #114326 Linux Shared Hosting NG: General Information, Best Practices and Troubleshooting for more information about NG hosting in Parallels Automation.
