This article provides general troubleshooting steps for the most common issues with Linux Shared Hosting NG for Parallels Operations Automation (POA) 5.4. For POA 5.5, please refer to the 5.5 LSH deployment guide page 40: "Troubleshooting: Webcluster Issues".
General problems with an NG cluster
Symptoms - Websites hosted in an NG cluster do not work. In general, when a problem with NG hosting arises, there is a critical failure on one of the NG components, namely:
- Load balancer (LB)
- NG Caching Service (shstg)
- Apache server (httpd)
- NG Configuration Database (CDB)
- NFS Storage
The most common problems occur with the shstg and httpd services.
The Apache server, through the NG module mod_vhost, uses the pair <hostname, IP address> to obtain virtual host configuration data from the CDB server via the NG Caching Service, shstg.
Occasionally, shstg may stop providing data to Apache. First, check that:

The shstg service is running on the web servers in the NG cluster and there is only one instance of it:

~# ps aux | grep shstg
root 5055 0.0 0.2 157028 3020 ? Ssl Jul02 18:56 /usr/sbin/shstg_srv /etc/h2e_shstg.conf

If it is not running, or there are two or more instances of shstg in the process list (see KB article #113154 for more details), start or restart the service:

~# /etc/init.d/shstg restart
The Apache server is running.
Follow the same steps as above and start Apache:
~# /etc/init.d/httpd start
The NG Caching Service has access to the PostgreSQL database on the CDB server. In the default correct configuration, you will see a network connection from the NG Caching Service (shstg_srv) on the web server to port 5432 on the CDB server:

~# netstat -antp | grep :5432
tcp 0 0 10.39.84.43:43269 10.39.94.114:5432 ESTABLISHED 5055/shstg_srv
NFS shared storage is mounted on all web servers in the cluster; verify this with the df and mount utilities.

Remember that NFS shared storage is mounted on web servers using automount technology, so if there are no open requests to websites, NFS storage may be automatically unmounted after an idle timeout. To check whether automount works correctly, try to list the contents of any webspace on the web server, or list the mount point where NFS storage is mounted.
For example, if the NFS volume with ID #1 is configured in the web cluster properties, then in the default installation, you might try to list the folder /var/www/vhosts/1. Even if it is not mounted already, automount will mount it automatically.

If both the shstg and httpd services are running and there are no suspicious errors in the logs, there may be a problem on the Load Balancer. Read KB article #114327 for more details about NG Load Balancer configuration and functionality.

pulse is the controlling daemon that spawns the lvsd daemon and performs heartbeating and monitoring of services on the real web servers in the NG cluster.
Make sure pulse and its child processes lvsd and nanny are running on the LB server:
1462 ? Ss 1:30 pulse
1470 ? Ss 0:34 \_ /usr/sbin/lvsd --nofork -c /etc/sysconfig/ha/lvs.cf
1488 ? Ss 3:46 \_ /usr/sbin/nanny -c -h 10.39.84.43 --server-name 10.39.94.43 -p 80 -r 80 -f 100 -s GET / HTTP/1.0\r\n\r\n -x HTTP -a 10 -I /sbin/ipvsadm -t 10 -w 32 -V 0.0.0.0 -M g -U /usr/sbin/h2e_get_cluster_load.sh --lvs
1489 ? Ss 3:43 \_ /usr/sbin/nanny -c -h 10.39.84.44 --server-name 10.39.94.44 -p 80 -r 80 -f 100 -s GET / HTTP/1.0\r\n\r\n -x HTTP -a 10 -I /sbin/ipvsadm -t 10 -w 32 -V 0.0.0.0 -M g -U /usr/sbin/h2e_get_cluster_load.sh --lvs
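A quick sketch of this check, which simply greps the process list for the three daemons (sample ps output is embedded for illustration; on the LB server replace it with live output):

```shell
# Check that pulse, lvsd and nanny are all present in the process list.
# On the LB server, use real output instead: ps_out=$(ps ax)
ps_out='1462 ?  Ss  1:30 pulse
1470 ?  Ss  0:34  \_ /usr/sbin/lvsd --nofork -c /etc/sysconfig/ha/lvs.cf
1488 ?  Ss  3:46  \_ /usr/sbin/nanny -c -h 10.39.84.43 --server-name 10.39.94.43 -p 80'

for daemon in pulse lvsd nanny; do
    if echo "$ps_out" | grep -q "$daemon"; then
        echo "$daemon: running"
    else
        echo "$daemon: MISSING"
    fi
done
```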
If there are problems with Apache on the web server, you will see the error message [inactive] shutting down in the /var/log/messages file on the LB server:
Jul 16 04:41:14 nglb nanny[1488]: [inactive] shutting down 10.39.84.43:80 due to connection failure
When the Apache server is back up on the web server, another entry will be put into the /var/log/messages file on the LB server:
Jul 16 04:41:54 nglb nanny[1488]:[ active ] making 10.39.84.43:80 available
So, check the system log file on the Load Balancer to spot problems with Apache on the web servers in the NG cluster.
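The relevant nanny entries can be pulled out of the system log with a simple grep (a sketch; /var/log/messages is the default syslog target on RHEL-based systems, adjust the path if your distribution logs elsewhere):

```shell
# Show the most recent nanny state transitions ([inactive]/[ active ])
# recorded on the load balancer.
grep -E 'nanny\[[0-9]+\]: ?\[ ?(in)?active ?\]' /var/log/messages 2>/dev/null | tail -n 20
```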
Use the ipvsadm utility to check the current load balancing statistics and rules (including the web servers' weights). At the very least, check that all web servers in the cluster are listed by ipvsadm:
~# ipvsadm --list
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
-> RemoteAddress:Port Forward Weight ActiveConn InActConn
FWM 100 lblc
-> 10.39.84.43:0 Route 32 0 0
-> 10.39.84.44:0 Route 32 0 0
~# ipvsadm -L --stats
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Conns InPkts OutPkts InBytes OutBytes
-> RemoteAddress:Port
FWM 100 62337 825314 0 49918802 0
-> 10.39.84.43:0 31 243 0 20959 0
-> 10.39.84.44:0 1 1 0 60 0
If the steps above do not rectify the problem, use tcpdump on the LB and web servers to monitor traffic.
NG cluster performance problems
In these cases, a high load (>20-30) will be observed on the web server, and you will usually notice a lot of php-cgi processes.
This is a rare situation. It means that all processes are limited by the same resource. Currently, it is suspected that NFS could be a limiting factor under certain circumstances.
The general way to troubleshoot the performance issues would be to:
- Set up monitoring
- Locate the bottleneck
- Fix the bottleneck
The most likely bottlenecks are as follows:
- NFS server performance problems
- A slow or limited network connection between web servers and the CDB server or the MySQL server(s) hosting customers' databases
- A load balancer algorithm that needs tuning
An additional note about point two above: In most cases, customer websites are not static: they have different web applications installed that work with the database. In most cases, they use MySQL databases. The web application works on the NG web server and connects to the MySQL database working on a remote server.
If many websites connect to the same MySQL server simultaneously, the connection between the NG web servers and the MySQL server must have enough bandwidth. Otherwise, the web application will wait for data from the MySQL server, increasing the load on the NG web server.
If the MySQL server is deployed inside a Parallels Virtuozzo Containers (PVC) container, then PVC may limit outgoing traffic for such containers using the traffic shaper. This is standard functionality of PVC and, in the event of a connection between the NG web server and MySQL servers, this may cause a bottleneck. The same goes for the connection between the NG web servers (caching service, shstg) and the NG Configuration Database server, which may also be deployed inside a PVC container.
To solve the problem with traffic limiting, find which MySQL servers are being used by NG websites and see if they are deployed inside a PVC container. Then, consider doing the following on the corresponding PVC server where MySQL or the NG Configuration Database servers are running:
Stop the traffic shaper on the PVC server (confirm with the Provider, since this will stop traffic shaping for all containers on the server):
~# /etc/init.d/vz shaperoff
Increase the bandwidth limit for containers with MySQL and NG CDB servers, e.g., to 100 Mbps:
~# vzctl set CTID --rate 102400 --save
If traffic control is enabled on the MySQL server itself (that is, it is not a container), verify it as follows:
tc -s qdisc ls
qdisc pfifo_fast 0: dev eth1 root refcnt 2 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
 Sent 529157756 bytes 3104367 pkt (dropped 0, overlimits 0 requeues 8)
 rate 0bit 0pps backlog 0b 0p requeues 8
qdisc pfifo_fast 0: dev eth0 root refcnt 2 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
 Sent 915336246 bytes 8559455 pkt (dropped 0, overlimits 0 requeues 8)
 rate 0bit 0pps backlog 0b 0p requeues 8
To delete the pfifo_fast qdisc, run the following commands on the MySQL server:
tc qdisc del dev eth0 root
tc qdisc del dev eth1 root
and restart the server.
General troubleshooting tips:
Try to increase the number of worker processes on the NFS server (the RPCNFSDCOUNT parameter in the /etc/sysconfig/nfs file). By default, it is set to 32; change it to 64 or even 128 and restart the nfs service:

# service nfs restart
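The change can be scripted; the sketch below edits a scratch copy so it is safe to run anywhere. On the real NFS server, point it at /etc/sysconfig/nfs and restart the nfs service afterwards:

```shell
# Raise RPCNFSDCOUNT from the default 32 to 64 (demonstrated on a temp
# copy; set conf=/etc/sysconfig/nfs on the real NFS server).
conf=$(mktemp)
echo 'RPCNFSDCOUNT=32' > "$conf"
sed -i 's/^RPCNFSDCOUNT=.*/RPCNFSDCOUNT=64/' "$conf"
cat "$conf"
# service nfs restart   # apply the change on the real server
```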
If the NFS server is using NFS v3 and supports NFS v4, consider switching to NFS v4.
In applications where data loss is not a big concern, the NFS volume may be exported with the async option. This makes NFS work faster because the server replies to requests before any changes made by that request have been committed to stable storage. However, a loss of power or network connectivity can result in a loss of data. As a result, this option is not recommended for production, though it may increase performance.
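For illustration, an /etc/exports entry using the async option might look like the following (the export path, client subnet, and remaining options are hypothetical examples, not values from your installation):

```
/var/www/vhosts  10.39.84.0/24(rw,async,no_root_squash)
```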
Mount the NFS partition on the NFS server with the noatime and nodiratime options:

Add the options to the /etc/fstab file for the NFS partition, then remount the NFS partition on the fly:

~# mount -o remount,noatime,nodiratime /PATH/TO/NFS/PARTITION
- Consider changing the load balancing algorithm to "wrr" in the /etc/sysconfig/ha/lvs.cf file on the LB server and then restarting the pulse service.
- Consider enabling persistent connections on the LB server (add the parameter persistent = 300 to /etc/sysconfig/ha/lvs.cf and restart the pulse service).
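For reference, the relevant part of /etc/sysconfig/ha/lvs.cf might look like the following hypothetical fragment (only the scheduler and persistent directives come from this article; the section name is illustrative and other directives are omitted):

```
virtual ngcluster {
     active = 1
     scheduler = wrr
     persistent = 300
     ...
}
```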
Try to strace php-cgi processes to check what they are waiting for.
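One convenient way to interpret an strace capture is to count which syscalls dominate. The sketch below post-processes a small embedded sample; on the server, capture real output first (e.g. strace -f -o /tmp/php-cgi.trace -p <php-cgi PID>) and point the awk at that file:

```shell
# Count syscall frequency in strace output. Sample lines are embedded
# here for illustration only.
cat > /tmp/php-cgi.trace.sample <<'EOF'
read(3, "x"..., 4096) = 1
poll([{fd=4, events=POLLIN}], 1, 500) = 0
read(3, "y"..., 4096) = 1
connect(5, {sa_family=AF_INET}, 16) = 0
EOF

# Strip everything after the opening parenthesis, then tally.
awk -F'[(]' '{print $1}' /tmp/php-cgi.trace.sample | sort | uniq -c | sort -rn
```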
You may also use the following script to gather information about the server's load:
#!/bin/bash
# Collect a diagnostic snapshot when the 1-minute load average exceeds 4.
LA=`awk -F. '{print $1}' /proc/loadavg`
dir=/root/swstrace
if (( $LA > 4 )); then
newdir=$dir/`date +%d-%H-%M`-LA$LA
mkdir -p "$newdir"
netstatls=$(netstat -ntup)
procslist=$(ps -eo pid,user,ppid,rtprio,ni,pri,psr,pcpu,stat,wchan:14,start_time,time,command -L)
echo "Number of http connections: $(echo "$netstatls" | egrep -c ':80|:443')" >> $newdir/averageinfo
echo "$netstatls" | egrep ':80|:443' | awk '{print $6}' | sort | uniq -c >> $newdir/averageinfo
echo -e "Number of PHP connections: $(echo "$netstatls" | egrep -c php) \n $(echo "$netstatls" | egrep php | awk '{print $6}' | sort | uniq -c)" >> $newdir/averageinfo
echo "Number of running procs: $(grep running /proc/stat | awk '{print $2}')" >> $newdir/averageinfo
echo "Number of php-cgi procs: `echo "$procslist" | grep -c cgi.*php`" >> $newdir/averageinfo
echo "php-cgi procs in R state: `echo "$procslist" | grep -c R.*cgi.*php`" >> $newdir/averageinfo
echo "php-cgi procs in D state: `echo "$procslist" | grep -c D.*cgi.*php`" >> $newdir/averageinfo
echo -e "Disks IO activity:\n `iotop -bon1`" >> $newdir/averageinfo
echo -e "Mem info:\n `vmstat 1 3`" >> $newdir/averageinfo
echo "$procslist" >> $newdir/processlist
/usr/sbin/lveps -p > $newdir/lveps
#/usr/sbin/lsof > $newdir/lsof &
netstat -ntpu > $newdir/netstat
/usr/sbin/nfsstat -c > $newdir/nfsstat ; sleep 1 ; /usr/sbin/nfsstat -c >> $newdir/nfsstat &
fi
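To gather such snapshots continuously, the script could be run from cron (the script path below is hypothetical; adjust it to wherever you saved the script):

```
# /etc/cron.d/swstrace (hypothetical entry, runs every minute)
* * * * * root /root/swstrace.sh >/dev/null 2>&1
```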
See the main Knowledgebase article #114326 Linux Shared Hosting NG: General Information, Best Practices and Troubleshooting for more information about NG hosting in Parallels Automation.