Monday, February 23, 2015

HP INSIGHT CMU MONITORING and ALERTS of HP SMART ARRAY HARD DRIVES

Setting up monitoring, alerts, and alert_reactions for hard drives attached to an HP Smart Array Controller.

By David Holton, HP

Background
A customer wanted HP Insight CMU to warn them of disk problems in their new Hadoop cluster. The request was for HP Insight CMU to send an email warning of possible disk issues. During the initial integration of the cluster, other challenges consumed any time that could have been dedicated to this request. During a subsequent visit, there was enough time to work out a solution to meet this need.

Requirements

  1. Use the iLO4's AMS capability to get the data from sensors.
  2. Show the status in the GUI for each node.
  3. Show an alert if the status changes.
  4. Send an email when an alert is raised.


I decided that the solution should rely on the CPU and the OS as little as possible for two reasons:
  1. Some instances of disk problems cause disk-based OS processes to appear to be hung.
  2. Minimize any impact to Hadoop processes.

Limitations:
  1. This solution does not monitor or alert on disk drives controlled by the AHCI controller, or by the AHCI driver in the OS.
  2. While setting up the alert, I realized that the CPU and OS on each node would have to be involved to some degree. I was not able to find a way around this in the short time I had for working out the solution. See the ALERT section of this paper for details.
  3. I was not able to find out what other possible values SNMP would report back for a non-OK disk. Therefore this solution only reports that disk status has changed. There is no attempt to indicate in what way it has changed. The SysAdmin will have to check on the system to determine if action needs to be taken.

MONITORING
Setting up the monitoring of disks controlled by HP Smart Array controllers was a relatively straightforward process following the instructions and guidelines found in the HP Insight CMU User's Guide for version 7.2, section 6.5.9.1. I don't think it necessary to reproduce all the steps from the manual in this paper, but an overview of the tasks is:

  1. Enable iLO4 AMS extended metric support in HP Insight CMU's GUI interface.
  2. Configure the iLO4's SNMP port using HP Insight CMU's AMS menu made available in step 1.
  3. Check that SNMP access is working using snmpwalk and snmpget. This also verifies that the system has the correct SNMP related RPMs installed. Use the example commands listed in the User Guide.
  4. Use the “Get/Refresh SNMP data” menu item to gather initial data from all the configured iLOs in the cluster.
  5. Configure HP Insight CMU to pick out the required metrics using the data gathered in step 4.
  6. Configure the ActionsandAlertsFile.txt to show the metrics from step 4.
  7. Restart the Monitoring Engine.
  8. Restart the HP Insight CMU GUI.
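As a sketch of the step 3 check, a single snmpget against one of the drive-status OIDs used later in this paper can confirm SNMP access; the iLO hostname "node01-ilo" below is a placeholder, and the sample line stands in for a live query so the parsing can be seen without a cluster. A healthy drive returns INTEGER: 2 in the standard net-snmp output format:

```shell
# Hypothetical verification of SNMP access (step 3); "node01-ilo" is a
# placeholder iLO hostname:
#   snmpget -v1 -c public node01-ilo SNMPv2-SMI::enterprises.232.3.2.3.1.1.11.1.1
# A healthy drive returns a line in the standard net-snmp format:
sample='SNMPv2-SMI::enterprises.232.3.2.3.1.1.11.1.1 = INTEGER: 2'
# Splitting on ':' puts the status value in the fourth field, which is how
# the alert script later in this paper extracts it:
echo "${sample}" | awk -F: '{print $4}'
```

The same awk -F: extraction is reused by the per-node alert script in the ALERT section.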

FYI: In step 2, the requirements for configuring the iLO are very simple. It sets the SNMP port number, which is normally already set. It also changes the community setting located in the iLO4's /map1/snmp1 to “public.”

Step 5 is accomplished by listing the needed metrics in the cmu_ams_metrics file located in /opt/cmu/etc. I tested this by listing only two of the disk metrics until I had the format settled. Here is the version of the file I came up with:

#
# This file is part of the CMU AMS support.
# This file maps SNMP OIDs to CMU metric names.
#
# First column is the SNMP OID.
# Second column is the CMU metric name.
# The optional 'SUM' keyword in the third column
# is used to add the values of multiple SNMP OIDs
# into a single CMU metric.
#
SNMPv2-SMI::enterprises.232.6.2.6.8.1.4.0.1 amb1_temp
SNMPv2-SMI::enterprises.232.6.2.6.8.1.4.0.2 cpu1_temp
SNMPv2-SMI::enterprises.232.6.2.6.8.1.4.0.3 cpu2_temp
SNMPv2-SMI::enterprises.232.6.2.9.3.1.7.0.1 power1 SUM power
SNMPv2-SMI::enterprises.232.6.2.9.3.1.7.0.2 power2 SUM power
SNMPv2-SMI::enterprises.232.6.2.9.3.1.7.0.3 power3 SUM power
SNMPv2-SMI::enterprises.232.6.2.9.3.1.7.0.4 power4 SUM power
SNMPv2-SMI::enterprises.232.3.2.3.1.1.11.1.1 sata_drv1_log_status
SNMPv2-SMI::enterprises.232.3.2.3.1.1.11.1.2 sata_drv2_log_status
SNMPv2-SMI::enterprises.232.3.2.3.1.1.11.1.3 sata_drv3_log_status
SNMPv2-SMI::enterprises.232.3.2.3.1.1.11.1.4 sata_drv4_log_status
SNMPv2-SMI::enterprises.232.3.2.3.1.1.11.1.5 sata_drv5_log_status
SNMPv2-SMI::enterprises.232.3.2.3.1.1.11.1.6 sata_drv6_log_status
SNMPv2-SMI::enterprises.232.3.2.3.1.1.11.1.7 sata_drv7_log_status
SNMPv2-SMI::enterprises.232.3.2.3.1.1.11.1.8 sata_drv8_log_status
SNMPv2-SMI::enterprises.232.3.2.3.1.1.11.1.9 sata_drv9_log_status
SNMPv2-SMI::enterprises.232.3.2.3.1.1.11.1.10 sata_drv10_log_status
SNMPv2-SMI::enterprises.232.3.2.3.1.1.11.1.11 sata_drv11_log_status
SNMPv2-SMI::enterprises.232.3.2.3.1.1.11.1.12 sata_drv12_log_status
SNMPv2-SMI::enterprises.232.3.2.5.1.1.37.1.11 sata_drv1_phy_status
SNMPv2-SMI::enterprises.232.3.2.5.1.1.37.1.12 sata_drv2_phy_status
SNMPv2-SMI::enterprises.232.3.2.5.1.1.37.1.13 sata_drv3_phy_status
SNMPv2-SMI::enterprises.232.3.2.5.1.1.37.1.14 sata_drv4_phy_status
SNMPv2-SMI::enterprises.232.3.2.5.1.1.37.1.15 sata_drv5_phy_status
SNMPv2-SMI::enterprises.232.3.2.5.1.1.37.1.16 sata_drv6_phy_status
SNMPv2-SMI::enterprises.232.3.2.5.1.1.37.1.17 sata_drv7_phy_status
SNMPv2-SMI::enterprises.232.3.2.5.1.1.37.1.18 sata_drv8_phy_status
SNMPv2-SMI::enterprises.232.3.2.5.1.1.37.1.19 sata_drv9_phy_status
SNMPv2-SMI::enterprises.232.3.2.5.1.1.37.1.20 sata_drv10_phy_status
SNMPv2-SMI::enterprises.232.3.2.5.1.1.37.1.21 sata_drv11_phy_status
SNMPv2-SMI::enterprises.232.3.2.5.1.1.37.1.22 sata_drv12_phy_status

The entries in cmu_ams_metrics are used by the /opt/cmu/bin/cmu_get_ams_metrics script to query the iLO and submit the results to HP Insight CMU monitoring.

Next are the entries added to the ActionsandAlertsFile.txt to show the results gathered by cmu_get_ams_metrics. Again, while setting this up, I only used a couple of disk entries.

#-------------HP iLO4 AMS------------------------------------#
#
amb1_temp "ambient temp" 4 numerical Instantaneous 60 Celsius EXTENDED /opt/cmu/bin/cmu_get_ams_metrics -a
cpu1_temp "CPU 1 temp" 4 numerical Instantaneous 60 Celsius EXTENDED
cpu2_temp "CPU 2 temp" 4 numerical Instantaneous 60 Celsius EXTENDED
power "Power Usage" 4 numerical Instantaneous 100 watts EXTENDED
sata_drv1_log_status "Drive 1 Logical Status" 6 string Instantaneous 2 2=OK EXTENDED
sata_drv2_log_status "Drive 2 Logical Status" 6 string Instantaneous 2 2=OK EXTENDED
sata_drv3_log_status "Drive 3 Logical Status" 6 string Instantaneous 2 2=OK EXTENDED
sata_drv4_log_status "Drive 4 Logical Status" 6 string Instantaneous 2 2=OK EXTENDED
sata_drv5_log_status "Drive 5 Logical Status" 6 string Instantaneous 2 2=OK EXTENDED
sata_drv6_log_status "Drive 6 Logical Status" 6 string Instantaneous 2 2=OK EXTENDED
sata_drv7_log_status "Drive 7 Logical Status" 6 string Instantaneous 2 2=OK EXTENDED
sata_drv8_log_status "Drive 8 Logical Status" 6 string Instantaneous 2 2=OK EXTENDED
sata_drv9_log_status "Drive 9 Logical Status" 6 string Instantaneous 2 2=OK EXTENDED
sata_drv10_log_status "Drive 10 Logical Status" 6 string Instantaneous 2 2=OK EXTENDED
sata_drv11_log_status "Drive 11 Logical Status" 6 string Instantaneous 2 2=OK EXTENDED
sata_drv12_log_status "Drive 12 Logical Status" 6 string Instantaneous 2 2=OK EXTENDED
sata_drv1_phy_status "Drive 1 Physical Status" 6 string Instantaneous 2 2=OK EXTENDED
sata_drv2_phy_status "Drive 2 Physical Status" 6 string Instantaneous 2 2=OK EXTENDED
sata_drv3_phy_status "Drive 3 Physical Status" 6 string Instantaneous 2 2=OK EXTENDED
sata_drv4_phy_status "Drive 4 Physical Status" 6 string Instantaneous 2 2=OK EXTENDED
sata_drv5_phy_status "Drive 5 Physical Status" 6 string Instantaneous 2 2=OK EXTENDED
sata_drv6_phy_status "Drive 6 Physical Status" 6 string Instantaneous 2 2=OK EXTENDED
sata_drv7_phy_status "Drive 7 Physical Status" 6 string Instantaneous 2 2=OK EXTENDED
sata_drv8_phy_status "Drive 8 Physical Status" 6 string Instantaneous 2 2=OK EXTENDED
sata_drv9_phy_status "Drive 9 Physical Status" 6 string Instantaneous 2 2=OK EXTENDED
sata_drv10_phy_status "Drive 10 Physical Status" 6 string Instantaneous 2 2=OK EXTENDED
sata_drv11_phy_status "Drive 11 Physical Status" 6 string Instantaneous 2 2=OK EXTENDED
sata_drv12_phy_status "Drive 12 Physical Status" 6 string Instantaneous 2 2=OK EXTENDED

The “OK” SNMP result returned for a disk's logical and physical status is 2. Since this status should not change, I did not think it needed to be part of the graph displays, so I set the metric type to “string.” I also set the action-to-perform entry to 2=OK, since that position will display any string placed in it.

After restarting the Monitoring Engine and restarting the GUI, I was able to see these changes.

ALERT

Setting up the alert was a bit less straightforward for me. I had ideas and scripting fragments worked out for having the head node query each individual node's iLO. I failed to realize that the script triggered by the ActionsandAlertsFile.txt would have to be executed on each node individually. I had wanted to avoid using the OS and CPU on each node.

This caused another problem. Because of how the networking was designed for this cluster, the Compute Nodes cannot communicate directly with their own iLO. Only the Head Node had the ability to communicate with iLOs in this cluster.

The solution I came up with was to have a small script on each node that queried its disk's status by executing another script located on the head node. If I had more time, I would have looked for a better solution.

The alert line in the ActionsandAlertsFile.txt for monitoring the disks is:

check_diskdrives "Drive status changed" 4 1 0 > status "/opt/cmu/tools/cmu_hw_monitoring.sh"

Only the indication that the drive status has changed is reported. The line evaluates the returned value to determine if it is greater than 0.

The alert line in the ActionsandAlertsFile.txt executes the cmu_hw_monitoring.sh script found in /opt/cmu/tools on each node. The script checks the Logical and Physical status of the drives. I left it with a generic name, because it could be expanded to monitor more than disks.

cmu_hw_monitoring.sh
#!/bin/bash
# name: cmu_hw_monitoring.sh
# Checks the logical and physical status of up to 12 drives by asking the
# head node to query this node's iLO over SNMP. Exits 0 if every drive
# reports OK (2), 1 otherwise.

HEADNODE=headnode01-cmu
MY_HOSTNAME=$(hostname)-cmu
MY_ILONAME=$(hostname)-ilo
CMUTOP=/opt/cmu
CMUCONTRIB=${CMUTOP}/contrib

RETURN=0

# Logical drive status: enterprises.232.3.2.3.1.1.11.1.1 through .12
for NUM1 in $(seq 1 12)
do
    DRVSTATUS=$(ssh ${HEADNODE} "${CMUCONTRIB}/snmp-hw-alert ${MY_ILONAME} 3 11 ${NUM1}" \
        | awk -F: '{print $4}')
    # An empty result (failed ssh or snmpget) is treated as not-OK.
    if [ "${DRVSTATUS:-0}" -ne 2 ]
    then
        RETURN=1
        break
    fi
done

# Physical drive status: enterprises.232.3.2.5.1.1.37.1.11 through .22
for NUM2 in $(seq 11 22)
do
    DRVSTATUS=$(ssh ${HEADNODE} "${CMUCONTRIB}/snmp-hw-alert ${MY_ILONAME} 5 37 ${NUM2}" \
        | awk -F: '{print $4}')
    if [ "${DRVSTATUS:-0}" -ne 2 ]
    then
        RETURN=1
        break
    fi
done

echo ${RETURN}
exit ${RETURN}


This script passes arguments to the snmp-hw-alert script in the /opt/cmu/contrib directory on the Head Node. That script runs the SNMP command that communicates with the node's iLO to query AMS data.
#!/bin/bash
# name: snmp-hw-alert (runs on the Head Node)
# Queries one drive-status OID on the given iLO:
#   $1 = iLO hostname, $2 and $3 = OID subtree components, $4 = drive index

NODE_ILO=$1
SUBSYS1=$2
SUBSYS2=$3
DRVNUM=$4

snmpget -v1 -c public ${NODE_ILO} \
    SNMPv2-SMI::enterprises.232.3.2.${SUBSYS1}.1.1.${SUBSYS2}.1.${DRVNUM}

The result of this script is a digit that goes back to the cmu_hw_monitoring.sh script, where it is evaluated to see whether it is not equal to 2.

If the result does not equal 2, the value of $RETURN is changed from 0 to 1, and that loop stops processing further disks. The Alert that the disk status has changed will be raised.

ALERT_REACTION
Once the alert is raised that the disk status has changed, an email is sent to the appropriate email addresses by this line from the ActionsandAlertsFile.txt:

check_diskdrives "Sending mail to root" ReactOnRaise echo -e "Alert 'CMU_ALERT_NAME' raised on node(s) CMU_ALERT_NODES. \n\nDetails:\n`/opt/cmu/bin/pdsh -w CMU_ALERT_NODES 'parted -l | grep Disk'`" | mailx -s "CMU: Alert 'CMU_ALERT_NAME' raised." root

The alert_reaction will attempt to get a list of disks from the OS to include as the body of the message.
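As a sketch of what that reaction line produces, the fragment below builds the same message body with the CMU_ALERT_* substitutions and the pdsh/parted disk listing replaced by fixed placeholder values (the node name and disk line are hypothetical), and omits the mailx stage so it runs without a cluster or mail setup:

```shell
# Placeholder values standing in for CMU's CMU_ALERT_NAME / CMU_ALERT_NODES
# substitutions and for the pdsh/parted disk listing:
ALERT_NAME="Drive status changed"
ALERT_NODES="node01"
DETAILS="node01: Disk /dev/sda: 1000GB"
# Build the message body the same way the alert_reaction line does
# (the trailing "| mailx ..." stage is omitted here):
echo -e "Alert '${ALERT_NAME}' raised on node(s) ${ALERT_NODES}. \n\nDetails:\n${DETAILS}"
```

In the real reaction line, mailx then sends this body to root with the alert name in the subject.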

TESTING
I tested the solution by changing the evaluation value in the cmu_hw_monitoring.sh script from 2 to 1 or 3, and alerts were raised as expected. I also had the alert_reaction send email to the local root user account, and all emails were received.

CONCLUSION

Had I more time in this environment, I would have experimented with the scripting to reduce the amount of time the script on each node has to run. One possibility is using variable arrays instead of for loops.
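A hedged sketch of that idea: build all 24 drive-status OIDs in a bash array and issue a single snmpget per node (snmpget accepts multiple OIDs on one command line), instead of 24 ssh round-trips. The iLO hostname is a placeholder, and the evaluation below runs on sample output standing in for a live query so the sketch is self-contained:

```shell
# Collect all 24 OIDs in an array: logical status .11.1.1-12 and
# physical status .37.1.11-22, matching the cmu_ams_metrics file.
OIDS=()
for n in $(seq 1 12); do
    OIDS+=("SNMPv2-SMI::enterprises.232.3.2.3.1.1.11.1.${n}")
done
for n in $(seq 11 22); do
    OIDS+=("SNMPv2-SMI::enterprises.232.3.2.5.1.1.37.1.${n}")
done
# One query instead of 24 (hypothetical iLO hostname):
#   snmpget -v1 -c public node01-ilo "${OIDS[@]}"
# Sample output (two healthy drives) standing in for the live query:
RESULTS='SNMPv2-SMI::enterprises.232.3.2.3.1.1.11.1.1 = INTEGER: 2
SNMPv2-SMI::enterprises.232.3.2.5.1.1.37.1.11 = INTEGER: 2'
# Any line whose trailing value is not 2 means a drive status changed.
BAD=$(echo "${RESULTS}" | awk -F': ' '$NF != 2' | wc -l)
echo "${BAD}"
```

This would cut the per-node runtime to a single ssh and a single SNMP transaction, at the cost of slightly more parsing on the result.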