Monday, March 17, 2014

HP CMU Diskless Nodes on Cisco Switches - PXE boot ARP Timeout

I recently had a new problem booting diskless nodes in an HP CMU cluster.  CMU stands for Cluster Management Utility, which is developed by HP.  I love CMU, and would use it if I were the permanent SysAdmin of some cluster.

Unlike other cluster management applications, CMU started its life as a utility written by some HP engineers for use in their own lab.  It is very solid, and stays out of the SysAdmin's way during any other system tasks.

I normally work only on HP-branded ProCurve switches, but this network was all Cisco 3850 48-port switches.  It should also be noted that I am not a Network Administrator by profession.

The primary Admin network switch had two connections to each of the Admin switches in the other cabinets.

NOTE: Spanning Tree was already turned off on all Admin switches.

I.                  Cisco setup attempt 1 used essentially all default settings, plus instructions straight from the User Guide.
a.      This setup caused a lot of packet loss and apparent route confusion.  From my laptop, connected to the same switch as the head node, I could SSH to the head node on only about 1 attempt in 4.
b.      I had to physically disconnect one of the two ports in each channel-group just to have a stable network. 
c.      The booting of diskless nodes was not tested in the field with this setup.
On the primary Admin switch, using ports 1 & 2 as an example:
1.      configure terminal
2.      interface range gigabitethernet 1/0/1 - 2
3.      switchport mode access
4.      switchport access vlan 1
5.      channel-group 1 mode auto
On cabinet 1 Admin switch using ports 47 & 48 as the example:
1.      configure terminal
2.      interface range gigabitethernet 1/0/47 - 48
3.      switchport mode access
4.      switchport access vlan 1
5.      channel-group 1 mode auto
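
One possible explanation for the instability, offered as an assumption rather than something I verified on this cluster: on Cisco IOS, "channel-group ... mode auto" is the passive PAgP mode, so when both ends are set to auto each side waits for the other and the channel may never form.  With Spanning Tree off, two unbundled parallel links between the same pair of switches would behave like a loop.  A quick check of whether the channel actually bundled could look like the sketch below.  The channel number 1 matches the example above, and no output is shown because none was captured at the time.

```
! Run from privileged EXEC mode on either switch.
! Lists each channel-group, its protocol, and per-port bundle flags.
show etherchannel summary
! Shows whether a PAgP partner was learned on the member ports of group 1.
show pagp 1 neighbor
```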

------------------
II.                Cisco setup attempt 2 was made with the following steps:
a.      This configuration fixed the unstable network performance, so I was able to leave it in place while I worked on other integration tasks. 
b.      However, booting diskless nodes resulted in the PXE Boot Error 11: ARP Timeout. 
c.      Differences from setup 1: switchport mode is now dynamic auto instead of access, and the channel-group mode is now on instead of auto.
On the primary Admin switch, using ports 1 & 2 as an example:
1.      configure terminal
2.      interface range gigabitethernet 1/0/1 - 2
3.      switchport mode dynamic auto
4.      switchport access vlan 1
5.      channel-group 1 mode on
On cabinet 1 Admin switch using ports 47 & 48 as the example:
1.      configure terminal
2.      interface range gigabitethernet 1/0/47 - 48
3.      switchport mode dynamic auto
4.      switchport access vlan 1
5.      channel-group 1 mode on
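
As a hedged aside: "switchport mode dynamic auto" only becomes a trunk if the other end actively asks for one, so with both ends set to dynamic auto the link would normally stay an access port.  I did not capture this at the time, but the administrative and operational modes of a member port can be compared with something like the sketch below.  The interface name is taken from the primary-switch example above.

```
! Run from privileged EXEC mode on the primary Admin switch.
! Compare the "Administrative Mode" and "Operational Mode" lines.
show interfaces gigabitethernet 1/0/1 switchport
! Lists only the ports that are currently trunking.
show interfaces trunk
```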

------------------
III.               Cisco setup attempt 3 was made with the following steps:
a.      This configuration was decided on after Greg and I both looked over the various options available for each step.
b.      First we reset the switches back to default, and started fresh.
c.      Adding IP addresses was really the only other step we had to do.
d.      Differences from setup 2: switchport mode is now trunk instead of dynamic auto.
On the primary Admin switch, using ports 1 & 2 as an example:
1.      configure terminal
2.      interface range gigabitethernet 1/0/1 - 2
3.      switchport mode trunk
4.      switchport access vlan 1
5.      channel-group 1 mode on
On cabinet 1 Admin switch using ports 47 & 48 as the example:
1.      configure terminal
2.      interface range gigabitethernet 1/0/47 - 48
3.      switchport mode trunk
4.      switchport access vlan 1
5.      channel-group 1 mode on
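
For convenience, the attempt 3 steps for the primary Admin switch can be written as one paste-ready block, as sketched below.  It uses the "interface range" form of the command.  The trailing "end" is my addition to leave configuration mode and is not part of the original steps.  The cabinet switch is the same apart from using ports 1/0/47 - 48.

```
! Primary Admin switch, ports 1 and 2 (attempt 3)
configure terminal
interface range gigabitethernet 1/0/1 - 2
 switchport mode trunk
 switchport access vlan 1
 channel-group 1 mode on
end
```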


Attempt 3 worked.  It allowed the network to operate normally, and removed the ARP Timeout error when booting CMU diskless nodes.