Setting up monitoring, alerts, and alert_reactions for hard drives attached to an HP Smart Array Controller

By David Holton, HP
Background
A customer wanted HP Insight CMU to warn them of disk problems in their new Hadoop cluster. The request was for HP Insight CMU to send an email warning of possible disk issues. During the initial integration of the cluster, other challenges consumed any time that could be dedicated to this request. During a subsequent visit, there was enough time to work out a solution to meet this need.
Requirements
- Use the iLO4's AMS capability to get the data from sensors.
- Show the status in the GUI for each node.
- Show an alert if the status changes.
- Send an email when an alert is raised.
I decided that the solution should rely on the CPU and the OS as little as possible, for two reasons:
- Some instances of disk problems cause disk-based OS processes to appear to be hung.
- Minimize any impact to Hadoop processes.
Limitations:
- This solution does not monitor or alert on disk drives controlled by the AHCI controller, or by the AHCI driver in the OS.
- While setting up the alert, I realized that the CPU and OS on each node would have to be involved to some degree. I was not able to find a way around this in the short time I had for working out this solution. See the Alert section of this paper for details.
- I was not able to find out what other values SNMP would report for a non-OK disk. Therefore this solution only reports that the disk status has changed; there is no attempt to indicate in what way it has changed. The SysAdmin will have to check the system to determine whether action needs to be taken.
MONITORING
Setting up the monitoring of disks controlled by HP Smart Array controllers was a relatively straightforward process following the instructions and guidelines found in section 6.5.9.1 of the HP Insight CMU User's Guide for version 7.2. I don't think it necessary to reproduce all the steps in the manual in this paper, but an overview of the tasks is:
1. Enable iLO4 AMS extended metric support in HP Insight CMU's GUI interface.
2. Configure the iLO4's SNMP port using HP Insight CMU's AMS menu made available in step 1.
3. Check that SNMP access is working using snmpwalk and snmpget. This also verifies that the system has the correct SNMP-related RPMs installed. Use the example commands listed in the User's Guide.
4. Use the "Get/Refresh SNMP data" menu item to gather initial data from all the configured iLOs in the cluster.
5. Configure HP Insight CMU to pick out the required metrics using the data gathered in step 4.
6. Configure the ActionsandAlertsFile.txt to show the metrics from step 5.
7. Restart the Monitoring Engine.
8. Restart the HP Insight CMU GUI.
FYI: In step 2, the requirements for configuring the iLO are very simple. It sets the SNMP port number, which is normally already set. It also changes the community setting located in the iLO4's /map1/snmp1 to "public."
Step 5 is accomplished by listing the needed metrics in the cmu_ams_metrics file located in /opt/cmu/etc. I tested this by listing only two of the disk metrics until I had the format settled. Here is a listing of the version of the file I came up with:
#
# This file is part of the CMU AMS support.
# This file maps SNMP OIDs to CMU metric names.
#
# First column is the SNMP OID.
# Second column is the CMU metric name.
# The optional 'SUM' keyword in the third column
# is used to add the values of multiple SNMP OIDs
# into a single CMU metric.
#
SNMPv2-SMI::enterprises.232.6.2.6.8.1.4.0.1    amb1_temp
SNMPv2-SMI::enterprises.232.6.2.6.8.1.4.0.2    cpu1_temp
SNMPv2-SMI::enterprises.232.6.2.6.8.1.4.0.3    cpu2_temp
SNMPv2-SMI::enterprises.232.6.2.9.3.1.7.0.1    power1    SUM power
SNMPv2-SMI::enterprises.232.6.2.9.3.1.7.0.2    power2    SUM power
SNMPv2-SMI::enterprises.232.6.2.9.3.1.7.0.3    power3    SUM power
SNMPv2-SMI::enterprises.232.6.2.9.3.1.7.0.4    power4    SUM power
SNMPv2-SMI::enterprises.232.3.2.3.1.1.11.1.1   sata_drv1_log_status
SNMPv2-SMI::enterprises.232.3.2.3.1.1.11.1.2   sata_drv2_log_status
SNMPv2-SMI::enterprises.232.3.2.3.1.1.11.1.3   sata_drv3_log_status
SNMPv2-SMI::enterprises.232.3.2.3.1.1.11.1.4   sata_drv4_log_status
SNMPv2-SMI::enterprises.232.3.2.3.1.1.11.1.5   sata_drv5_log_status
SNMPv2-SMI::enterprises.232.3.2.3.1.1.11.1.6   sata_drv6_log_status
SNMPv2-SMI::enterprises.232.3.2.3.1.1.11.1.7   sata_drv7_log_status
SNMPv2-SMI::enterprises.232.3.2.3.1.1.11.1.8   sata_drv8_log_status
SNMPv2-SMI::enterprises.232.3.2.3.1.1.11.1.9   sata_drv9_log_status
SNMPv2-SMI::enterprises.232.3.2.3.1.1.11.1.10  sata_drv10_log_status
SNMPv2-SMI::enterprises.232.3.2.3.1.1.11.1.11  sata_drv11_log_status
SNMPv2-SMI::enterprises.232.3.2.3.1.1.11.1.12  sata_drv12_log_status
SNMPv2-SMI::enterprises.232.3.2.5.1.1.37.1.11  sata_drv1_phy_status
SNMPv2-SMI::enterprises.232.3.2.5.1.1.37.1.12  sata_drv2_phy_status
SNMPv2-SMI::enterprises.232.3.2.5.1.1.37.1.13  sata_drv3_phy_status
SNMPv2-SMI::enterprises.232.3.2.5.1.1.37.1.14  sata_drv4_phy_status
SNMPv2-SMI::enterprises.232.3.2.5.1.1.37.1.15  sata_drv5_phy_status
SNMPv2-SMI::enterprises.232.3.2.5.1.1.37.1.16  sata_drv6_phy_status
SNMPv2-SMI::enterprises.232.3.2.5.1.1.37.1.17  sata_drv7_phy_status
SNMPv2-SMI::enterprises.232.3.2.5.1.1.37.1.18  sata_drv8_phy_status
SNMPv2-SMI::enterprises.232.3.2.5.1.1.37.1.19  sata_drv9_phy_status
SNMPv2-SMI::enterprises.232.3.2.5.1.1.37.1.20  sata_drv10_phy_status
SNMPv2-SMI::enterprises.232.3.2.5.1.1.37.1.21  sata_drv11_phy_status
SNMPv2-SMI::enterprises.232.3.2.5.1.1.37.1.22  sata_drv12_phy_status
The entries in cmu_ams_metrics are used by the /opt/cmu/bin/cmu_get_ams_metrics script to query the iLO and submit the results to HP Insight CMU monitoring.
Next are the entries added to the ActionsandAlertsFile.txt to show the results gathered by cmu_get_ams_metrics. Again, while setting this up, I only used a couple of disk entries.
#-------------HP iLO4 AMS------------------------------------#
#
amb1_temp              "ambient temp" 4 numerical Instantaneous 60 Celsius EXTENDED /opt/cmu/bin/cmu_get_ams_metrics -a
cpu1_temp              "CPU 1 temp" 4 numerical Instantaneous 60 Celsius EXTENDED
cpu2_temp              "CPU 2 temp" 4 numerical Instantaneous 60 Celsius EXTENDED
power                  "Power Usage" 4 numerical Instantaneous 100 watts EXTENDED
sata_drv1_log_status   "Drive 1 Logical Status" 6 string Instantaneous 2 2=OK EXTENDED
sata_drv2_log_status   "Drive 2 Logical Status" 6 string Instantaneous 2 2=OK EXTENDED
sata_drv3_log_status   "Drive 3 Logical Status" 6 string Instantaneous 2 2=OK EXTENDED
sata_drv4_log_status   "Drive 4 Logical Status" 6 string Instantaneous 2 2=OK EXTENDED
sata_drv5_log_status   "Drive 5 Logical Status" 6 string Instantaneous 2 2=OK EXTENDED
sata_drv6_log_status   "Drive 6 Logical Status" 6 string Instantaneous 2 2=OK EXTENDED
sata_drv7_log_status   "Drive 7 Logical Status" 6 string Instantaneous 2 2=OK EXTENDED
sata_drv8_log_status   "Drive 8 Logical Status" 6 string Instantaneous 2 2=OK EXTENDED
sata_drv9_log_status   "Drive 9 Logical Status" 6 string Instantaneous 2 2=OK EXTENDED
sata_drv10_log_status  "Drive 10 Logical Status" 6 string Instantaneous 2 2=OK EXTENDED
sata_drv11_log_status  "Drive 11 Logical Status" 6 string Instantaneous 2 2=OK EXTENDED
sata_drv12_log_status  "Drive 12 Logical Status" 6 string Instantaneous 2 2=OK EXTENDED
sata_drv1_phy_status   "Drive 1 Physical Status" 6 string Instantaneous 2 2=OK EXTENDED
sata_drv2_phy_status   "Drive 2 Physical Status" 6 string Instantaneous 2 2=OK EXTENDED
sata_drv3_phy_status   "Drive 3 Physical Status" 6 string Instantaneous 2 2=OK EXTENDED
sata_drv4_phy_status   "Drive 4 Physical Status" 6 string Instantaneous 2 2=OK EXTENDED
sata_drv5_phy_status   "Drive 5 Physical Status" 6 string Instantaneous 2 2=OK EXTENDED
sata_drv6_phy_status   "Drive 6 Physical Status" 6 string Instantaneous 2 2=OK EXTENDED
sata_drv7_phy_status   "Drive 7 Physical Status" 6 string Instantaneous 2 2=OK EXTENDED
sata_drv8_phy_status   "Drive 8 Physical Status" 6 string Instantaneous 2 2=OK EXTENDED
sata_drv9_phy_status   "Drive 9 Physical Status" 6 string Instantaneous 2 2=OK EXTENDED
sata_drv10_phy_status  "Drive 10 Physical Status" 6 string Instantaneous 2 2=OK EXTENDED
sata_drv11_phy_status  "Drive 11 Physical Status" 6 string Instantaneous 2 2=OK EXTENDED
sata_drv12_phy_status  "Drive 12 Physical Status" 6 string Instantaneous 2 2=OK EXTENDED
The "OK" SNMP result returned for a disk's logical and physical status is a 2. Since this status should not change, I did not think it needed to be part of the graph displays, so I set the metric type to "string." I also set the action-to-perform entry to 2=OK, since that position displayed any string I put in it.
After restarting the Monitoring Engine and restarting the GUI, I was able to see these changes.
ALERT
Setting up the alert was a bit less straightforward for me. I had ideas and scripting fragments worked out for having the head node query the individual nodes' iLOs. I failed to realize that the script triggered by the ActionsandAlertsFile.txt would have to be executed on each node individually. I had wanted to avoid using the OS and CPU on each node.
This caused another problem. Because of how the networking was designed for this cluster, the Compute Nodes cannot communicate directly with their own iLOs; only the Head Node can communicate with the iLOs in this cluster.
The solution I came up with was to have a small script on each node query its disks' status by executing another script located on the head node. If I had more time, I would have looked for a better solution.
The alert line in the ActionsandAlertsFile.txt for monitoring the disks is:

check_diskdrives "Drive status changed" 4 1 0 > status "/opt/cmu/tools/cmu_hw_monitoring.sh"
Only the indication that the drive status has changed is reported. The line evaluates the returned value to determine if it is greater than 0.
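The comparison can be sketched as a plain shell test, with hypothetical stand-in values (in the real setup, the status comes from cmu_hw_monitoring.sh and the comparison is performed by HP Insight CMU, not by a shell):

```shell
THRESHOLD=0
STATUS=1        # what the monitoring script echoes when any drive is not "2"

# The "> status" evaluation, sketched: raise the alert when the
# returned value is greater than 0.
if [ "${STATUS}" -gt "${THRESHOLD}" ]; then
    echo "alert raised"
else
    echo "no alert"
fi
```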
The alert line in the ActionsandAlertsFile.txt executes the cmu_hw_monitoring.sh script found in /opt/cmu/tools on each node. The script checks the Logical and Physical status of the drives. I left it with a generic name because it could be expanded to monitor more than disks.
cmu_hw_monitoring.sh:

#!/bin/bash
# name: cmu_hw_monitoring.sh
HEADNODE=headnode01-cmu
MY_HOSTNAME=$(hostname)-cmu
MY_ILONAME=$(hostname)-ilo
CMUTOP=/opt/cmu
CMUCONTRIB=${CMUTOP}/contrib
RETURN=0

# Logical drive status: enterprises.232.3.2.3.1.1.11.1.<1..12>
for NUM1 in $(seq 1 12)
do
    DRVSTATUS=$(ssh ${HEADNODE} "${CMUCONTRIB}/snmp-hw-alert ${MY_ILONAME} 3 11 ${NUM1}" | awk -F: '{print $4}')
    if [ ${DRVSTATUS} -ne 2 ]
    then
        RETURN=1
        break
    fi
done

# Physical drive status: enterprises.232.3.2.5.1.1.37.1.<11..22>
for NUM2 in $(seq 11 22)
do
    DRVSTATUS=$(ssh ${HEADNODE} "${CMUCONTRIB}/snmp-hw-alert ${MY_ILONAME} 5 37 ${NUM2}" | awk -F: '{print $4}')
    if [ ${DRVSTATUS} -ne 2 ]
    then
        RETURN=1
        break
    fi
done

echo ${RETURN}
exit ${RETURN}
This script passes arguments to the snmp-hw-alert script in the /opt/cmu/contrib directory on the Head Node. That script runs the SNMP command that communicates with the node's iLO to query AMS data.
#!/bin/bash
NODE_ILO=$1     # iLO hostname of the node
SUBSYS1=$2      # 3 = logical drives, 5 = physical drives
SUBSYS2=$3      # 11 = logical status, 37 = physical status
DRVNUM=$4       # drive index
snmpget -v1 -cpublic ${NODE_ILO} \
    SNMPv2-SMI::enterprises.232.3.2.${SUBSYS1}.1.1.${SUBSYS2}.1.${DRVNUM}
The result of this script is a digit, which goes back to the cmu_hw_monitoring.sh script where it is evaluated to see whether it equals 2.

If the result does not equal 2, the value of $RETURN is changed from 0 to 1, and that loop stops processing further disks. The alert that the disk status has changed will then be raised.
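For reference, here is how the awk pipeline in cmu_hw_monitoring.sh isolates that digit. The sample line is an assumption about what snmpget prints for a healthy drive (exact formatting can vary with the net-snmp version); splitting on ":" and keeping the fourth field leaves just the status value:

```shell
# Assumed sample of snmpget output for a healthy drive:
SAMPLE='SNMPv2-SMI::enterprises.232.3.2.3.1.1.11.1.1 = INTEGER: 2'

# The same extraction cmu_hw_monitoring.sh performs:
DRVSTATUS=$(echo "${SAMPLE}" | awk -F: '{print $4}')
echo ${DRVSTATUS}    # prints 2
```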
ALERT_REACTION
Once the alert that the disk status has changed is raised, an email will be sent to the appropriate email addresses by this line from the ActionsandAlertsFile.txt:

check_diskdrives "Sending mail to root" ReactOnRaise echo -e "Alert 'CMU_ALERT_NAME' raised on node(s) CMU_ALERT_NODES. \n\nDetails:\n`/opt/cmu/bin/pdsh -w CMU_ALERT_NODES 'parted -l | grep Disk'`" | mailx -s "CMU: Alert 'CMU_ALERT_NAME' raised." root
The alert_reaction will attempt to get a list of disks from the OS to include as the body of the message.
TESTING
I tested the
solution by changing the evaluation value in the cmu_hw_monitoring.sh
script from 2 to 1 or 3, and alerts were raised as expected. I also
had the alert_reaction send email to the local root user account, and
all emails were received.
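The effect of that test can be reproduced without a cluster. This sketch assumes a healthy drive status of 2 and shows why changing the comparison value from 2 to 3 forces the alert path:

```shell
DRVSTATUS=2          # healthy value the iLO normally returns
RETURN=0
# Comparison value deliberately changed from 2 to 3, as in the test:
if [ ${DRVSTATUS} -ne 3 ]; then
    RETURN=1         # alert path taken even though the drive is healthy
fi
echo ${RETURN}       # prints 1, which raises the alert
```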
CONCLUSION
Had I more time in this environment, I would have experimented with the scripting to reduce the amount of time the script on each node has to run. One possible approach is using variable arrays instead of for loops.
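As a sketch of that idea (untested in this environment, and relying on the fact that snmpget accepts multiple OIDs in one invocation): the 24 status OIDs could be collected into a bash array and fetched with a single ssh round-trip instead of 24.

```shell
# Build all 24 drive-status OIDs into an array first.
OIDS=()
for N in $(seq 1 12); do        # logical drive status
    OIDS+=("SNMPv2-SMI::enterprises.232.3.2.3.1.1.11.1.${N}")
done
for N in $(seq 11 22); do       # physical drive status
    OIDS+=("SNMPv2-SMI::enterprises.232.3.2.5.1.1.37.1.${N}")
done
echo "${#OIDS[@]}"              # prints 24: one batched query instead of 24

# One round-trip instead of 24 (not run here; variables as in
# cmu_hw_monitoring.sh):
# ssh ${HEADNODE} "snmpget -v1 -cpublic ${MY_ILONAME} ${OIDS[*]}"
```

The output would then be parsed once, checking each line for a status of 2.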