Wednesday, September 17, 2014

Udev Incorrectly Renaming NICs in an HP CMU Cluster Running SLES 11.3


Documenting the Udev System Solution
for _______'s Cloudera Cluster.

By David Holton

Introduction:

The problem and solution described here are for a Cloudera Hadoop cluster using HP hardware running SLES 11.3.  This cluster utilizes HP's Insight CMU software to manage provisioning, monitoring, and alerting on worker/compute nodes.

NOTE: Names of systems, devices, networks, VLANs, etc., have been changed from the system where this challenge was encountered.  A small quirk in the device naming order was replicated because it made this problem a bit more interesting, and it makes it clearer how the solution was implemented.

Definitions:

A typical, reference architecture, Cloudera cluster from HP has the same base components as a High Performance cluster from HP.  Because of that, some basic HP definitions are needed to keep track of what components perform what function.  I find that a lot of confusion can be avoided if everyone uses the same terminology in the same way. Consistency is the key.

There are typically three or four networks involved:

Admin network: this is just a basic ethernet network, internal to the cluster, over which basic system admin commands and functions are performed.  This is the network that CMU uses to communicate with the OS running on each node.  AKA: CMU network, Management network

Console network: this is another internal cluster network to which all node iLOs are connected.  The Virtual Serial Port of each node's iLO is very important to troubleshooting, and it is made available on this network.  This network also gives CMU control of power on each node.  Additionally, the Admin and Console networks are usually connected together; many times they share the same IP subnet.  AKA: iLO network, BMC network

Enterprise network: this is the corporate network that connects to the cluster, and allows remote access.  There are many ways this network is attached to clusters.  AKA: User network, Data  Network, Company network, Campus network,

Sometimes there is a separate, High-Speed network.  The names and uses of this network vary greatly. It is typically used to load data quickly and efficiently on the cluster, and allow message passing between applications running on compute nodes.  Normally, neither users nor administrative functions are allowed on this network. In an HPC system this is normally an InfiniBand network.  In a Cloudera cluster it is usually a 10 Gb Ethernet network.  AKA: HSI, MPI network, IB network, 10G network

There are various types of nodes found in clusters.

Head Node: this is typically the main server from which all CMU and administrative functions originate.  AKA: management server, CMU server,

Compute Nodes: on these nodes the actual work of a cluster (HPC or Hadoop) takes place.  AKA: worker nodes, computes

Utility Nodes: These can perform many different functions: Database, Application, raw data store, Job Scheduler/Control, Resource Manager, Hadoop Name Nodes, Login nodes, Edge nodes

The Problem:

In the particular cluster where this problem was encountered, the internal networks were a bit unusual.

Admin network - is a separate, flat, internal, CMU-only network.

Cloudera network - is a high-speed, routed, internal, Cloudera-only network.

These two networks had to be kept separate.  By default, data traffic tended to use the flat admin network; we wanted default traffic to use the Cloudera-only network.

Console network - is an externally connected network. One interface on the Head Node had to be configured for this network for CMU to have iLO access to the nodes.

Enterprise Network - is a high-speed, external network that attaches to data as well as allows remote access for users and administrators.

When the ProLiant servers boot using SLES 11.3, the device naming of the ethernet cards will not necessarily be the same as the previous times the system was booted.

Device naming of network interfaces is controlled by use of the 70-persistent-net.rules file in the /etc/udev/rules.d directory.  It is a trivial thing to edit this file, on individual systems, to assign the ethX device names as desired. Every system boot after that should maintain consistent device names.

In a CMU environment we capture an image of a Golden Node and clone any number of systems in the same CMU Logical Group with that image.  Once the cloned nodes come up, the interface device names will most likely come up differently.  Whether or not CMU cleans the persistent rules file, new rules will be generated on each freshly cloned node.  This is because Udev will encounter missing MAC addresses, or MAC addresses that are different from what it finds in the rules file.  So if eth5 & eth6 are connected to separate networks, and the rules reverse them, they may show as UP in the output of 'ifconfig' or 'ip addr show', but they will not have the proper IP addresses.  Therefore, they will not be able to communicate on the networks to which they physically attach.
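
One way to see this mismatch on a freshly cloned node is to compare the MACs named in the rules file with the MACs the kernel actually reports.  The following is only a sketch: the function name stale_rule_macs is made up for illustration, and it assumes rules that use the ATTR{address}=="..." element and a sysfs layout of <dir>/<iface>/address.

```shell
# Hypothetical helper: print every MAC named in a udev rules file that is
# NOT present on this system.  Both paths are parameters so the check can
# be run against copies of the files.
stale_rule_macs() {
    rules_file=$1
    sysfs_net=${2:-/sys/class/net}

    # MACs the running kernel currently reports, one per line, lower case.
    present=$(cat "$sysfs_net"/*/address 2>/dev/null | tr 'A-F' 'a-f')

    # Pull each ATTR{address}=="..." value out of the non-comment rules.
    grep -v '^[[:space:]]*#' "$rules_file" |
        sed -n 's/.*ATTR{address}=="\([0-9A-Fa-f:]*\)".*/\1/p' |
        tr 'A-F' 'a-f' |
        while read -r mac; do
            # -x matches the whole line, -F treats the MAC as a literal.
            printf '%s\n' "$present" | grep -qxF "$mac" || printf '%s\n' "$mac"
        done
}

# Example (on a cloned node):
#   stale_rule_macs /etc/udev/rules.d/70-persistent-net.rules
```

Any MAC it prints belongs to the Golden Node (or some other system), which is the situation that causes Udev to generate new rules and new names.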

A solution is needed to configure network rules in udev, on the fly, that name the network interfaces in a consistent manner, to match the device names on the Golden Node.  This is the only way that bonded and tagged vlan interfaces will remain consistent.

(NOTE: each Linux distribution may interpret Udev rules in slightly different ways, so the same solution for one distribution, may not work for another. Some rule syntax experimentation may be needed.)

The Desired Configuration:

To summarize the names of NICs and bus locations we want to see on the systems with the most active ethernet interfaces:

Bus Pos.  EQ5   EQ6   EQ7
--------  ----  ----  ----
03.00.0   eth0  eth0  eth0
03.00.1   eth1  eth1  eth1
03.00.2   eth3  eth3  eth3
03.00.3   eth4  eth4  eth4
-
04.00.0   eth2  eth2  eth2
04.00.1   eth5  eth5  eth5
-
24.00.0   eth6  eth6  eth6
24.00.1   eth7  eth7  eth7
-
27.00.0   eth8  eth8  eth8
27.00.1   eth9  eth9  eth9

This is the desired IP configuration:

eth0 configured to the 192.168.11.0/24 network.
bond0 is eth2 & eth5 configured for the 192.168.20.0/22 network.
bond1 is eth6, eth7, eth8, & eth9 configured for 3 VLANs.
VLAN1 on 192.168.23.0/19 network.
VLAN2 on 192.168.64.0/19 network.
VLAN3 on 192.168.96.0/19 network.
Unused: eth1, eth3, & eth4
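
The desired layout above can be checked mechanically on a disk-booted node.  This is a minimal sketch, not part of the original solution: the function names bus_pos_of and check_layout are invented here, and the bus position is assumed to be the PCI address component just before /net/ in the sysfs symlink, written in the same 03.00.0 form as the table.

```shell
# Print the bus position (e.g. 03.00.0) behind an interface name by
# reading the sysfs symlink and dropping the PCI domain prefix.
bus_pos_of() {
    bp_iface=$1
    bp_sysfs=${2:-/sys/class/net}
    readlink "$bp_sysfs/$bp_iface" |
        sed -n 's|.*/[0-9a-f]*:\([0-9a-f]*\):\([0-9a-f]*\.[0-9a-f]*\)/net/[^/]*$|\1.\2|p'
}

# Read "bus-position name" pairs on stdin and report every interface that
# is not sitting on the expected bus position.  Returns 1 on any mismatch.
check_layout() {
    cl_sysfs=${1:-/sys/class/net}
    cl_status=0
    while read -r want_bus want_name; do
        [ -n "$want_name" ] || continue
        have_bus=$(bus_pos_of "$want_name" "$cl_sysfs")
        if [ "$have_bus" != "$want_bus" ]; then
            echo "MISMATCH $want_name: expected $want_bus, found ${have_bus:-none}"
            cl_status=1
        fi
    done
    return $cl_status
}

# Example: feed it the first two columns of the table above, e.g.
#   printf '03.00.2 eth3\n04.00.0 eth2\n' | check_layout
```

A node that prints nothing and returns 0 has its NICs named the same way as the Golden Node.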

Needed Solution:

Using CMU's post-cloning script, reconf.sh, capture the MAC addresses of the NICs while the system is still net booted.  The Debian-based net boot image seems to initialize and name the NICs based on the order they are found on the system bus.  This will give us consistent locations of the NICs on other systems configured exactly like the Golden Node.  Once the MAC addresses are captured, they are inserted into a rule set, one per NIC, in a newly constructed, persistent rules file.

When the system boots, as long as the rule is properly constructed and contains the proper elements, udev will name the NICs as instructed.
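
The capture-and-write step just described can be sketched as a small table-driven function.  This is an illustration under assumptions, not the script that was deployed: the function name emit_net_rules and the "net-boot name, desired name" input format are invented here, while the rule elements follow the structure SLES 11.3 itself generates.

```shell
# Rule pieces in the style of the SLES-generated persistent net rules.
RULE_START='SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="'
RULE_MIDDLE='", ATTR{dev_id}=="0x0", ATTR{type}=="1", KERNEL=="eth*", '

# Read "netboot_name desired_name" pairs on stdin, look up the MAC of the
# net-booted interface, and print one complete single-line rule per pair.
emit_net_rules() {
    en_sysfs=${1:-/sys/class/net}
    while read -r netboot_name desired_name; do
        [ -n "$desired_name" ] || continue
        en_mac=$(cat "$en_sysfs/$netboot_name/address") || return 1
        printf '%s%s%sNAME="%s"\n' \
            "$RULE_START" "$en_mac" "$RULE_MIDDLE" "$desired_name"
    done
}

# Example: the net-booted eth2 (bus 03.00.2) should become eth3 on disk,
# and the net-booted eth4 (bus 04.00.0) should become eth2:
#   printf 'eth2 eth3\neth4 eth2\n' | emit_net_rules > 70-persistent-net.rules
```

Keeping the mapping in a two-column list makes it easy to compare against the desired bus position table, which is where a swapped name is most likely to creep in.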

Detailed Example:

First I want to show how three of the systems booted up without any rules predefined in the 70-persistent-net.rules file.  All three systems have identical hardware, are running SLES 11.3, and have the same BIOS firmware.

 Manufacturer: HP
        Product Name: ProLiant DL380p Gen8

EQ5-Node-diskboot-NOrules-Dmidecode.out.txt-        Version: P70
EQ5-Node-diskboot-NOrules-Dmidecode.out.txt-        Release Date: 02/10/2014
--
EQ6-Node-diskboot-NOrules-Dmidecode.out.txt-        Version: P70
EQ6-Node-diskboot-NOrules-Dmidecode.out.txt-        Release Date: 02/10/2014
--
EQ7-Node-diskboot-NOrules-Dmidecode.out.txt-        Version: P70
EQ7-Node-diskboot-NOrules-Dmidecode.out.txt-        Release Date: 02/10/2014

All three nodes initialized the Broadcom and Intel NICs in a different order.  This made some interfaces unusable, and some bonded interfaces were either running at reduced capacity or were completely non-functional.

Nodes Booted Without Rules:

From EQ5:
From the Console messages during boot we can see:
Setting up (localfs) network interfaces:
    lo
    lo        IP address: 127.0.0.1/8
               IP address: 127.0.0.2/8                                done
    eth0      device: Broadcom Corporation NetXtreme BCM5719 Gigabi
    eth0      IP address: 192.168.11.15/24                           done

    eth1      device: Broadcom Corporation NetXtreme BCM5719 Gigabi
              No configuration found for eth1                        unused

    eth2      device: Intel Corporation 82599EB 10-Gigabit SFI/SFP+  done

    eth3      device: Intel Corporation 82599EB 10-Gigabit SFI/SFP+
              No configuration found for eth3                        unused

    eth4      device: Broadcom Corporation NetXtreme BCM5719 Gigabi
              No configuration found for eth4                        unused

    eth5      device: Intel Corporation 82599EB 10-Gigabit SFI/SFP+  done

    eth6      device: Intel Corporation 82599EB 10-Gigabit SFI/SFP+  done

    eth7      device: Intel Corporation 82599EB 10-Gigabit SFI/SFP+  done

    eth8      device: Broadcom Corporation NetXtreme BCM5719 Gigabi  done

    eth9      device: Intel Corporation 82599EB 10-Gigabit SFI/SFP+  done

    bond0
    bond0     enslaved interface: eth2
    bond0     enslaved interface: eth5
    bond0     IP address: 192.168.20.15/22                           done

    bond1
    bond1     enslaved interface: eth8
    bond1     enslaved interface: eth9
    bond1     enslaved interface: eth6
    bond1     enslaved interface: eth7                               done

    vlan1
    vlan1   IP address: 192.168.23.30/19                            done

    vlan2
    vlan2   IP address: 192.168.64.35/19                           done

    vlan3
    vlan3   IP address: 192.168.93.12/19                            done
Setting up service (localfs) network  .  .  .  .  .  .  .  .  .  .   done

The NICs came up in the following order:
Broadcom
Broadcom
Intel
Intel
Broadcom
Intel
Intel
Intel
Broadcom
Intel

From the booted system we can see the bus paths are not in order when we
sort by NIC name.

ls -l /sys/class/net/eth*
lrwxrwxrwx 1 root root 0 Sep 11 08:46 /sys/class/net/eth0 -> ../../devices/pci0000:00/0000:00:02.0/0000:03:00.0/net/eth0
lrwxrwxrwx 1 root root 0 Sep 11 08:46 /sys/class/net/eth1 -> ../../devices/pci0000:00/0000:00:02.0/0000:03:00.1/net/eth1
lrwxrwxrwx 1 root root 0 Sep 11 08:46 /sys/class/net/eth2 -> ../../devices/pci0000:00/0000:00:03.0/0000:04:00.0/net/eth2
lrwxrwxrwx 1 root root 0 Sep 11 08:46 /sys/class/net/eth3 -> ../../devices/pci0000:00/0000:00:03.0/0000:04:00.1/net/eth3
lrwxrwxrwx 1 root root 0 Sep 11 08:46 /sys/class/net/eth4 -> ../../devices/pci0000:00/0000:00:02.0/0000:03:00.2/net/eth4
lrwxrwxrwx 1 root root 0 Sep 11 08:46 /sys/class/net/eth5 -> ../../devices/pci0000:20/0000:20:02.2/0000:24:00.0/net/eth5
lrwxrwxrwx 1 root root 0 Sep 11 08:46 /sys/class/net/eth6 -> ../../devices/pci0000:20/0000:20:02.2/0000:24:00.1/net/eth6
lrwxrwxrwx 1 root root 0 Sep 11 08:46 /sys/class/net/eth7 -> ../../devices/pci0000:20/0000:20:02.0/0000:27:00.0/net/eth7
lrwxrwxrwx 1 root root 0 Sep 11 08:46 /sys/class/net/eth8 -> ../../devices/pci0000:00/0000:00:02.0/0000:03:00.3/net/eth8
lrwxrwxrwx 1 root root 0 Sep 11 08:46 /sys/class/net/eth9 -> ../../devices/pci0000:20/0000:20:02.0/0000:27:00.1/net/eth9

From EQ6
From the Console messages during boot we can see:
Setting up (localfs) network interfaces:
    lo      
    lo        IP address: 127.0.0.1/8
              IP address: 127.0.0.2/8                                done

    eth0      device: Broadcom Corporation NetXtreme BCM5719 Gigabi
    eth0      IP address: 192.168.11.16/24                           done

    eth1      device: Intel Corporation 82599EB 10-Gigabit SFI/SFP+
              No configuration found for eth1                        unused

    eth2      device: Intel Corporation 82599EB 10-Gigabit SFI/SFP+  done

    eth3      device: Broadcom Corporation NetXtreme BCM5719 Gigabi
              No configuration found for eth3                        unused

    eth4      device: Intel Corporation 82599EB 10-Gigabit SFI/SFP+
              No configuration found for eth4                        unused

    eth5      device: Intel Corporation 82599EB 10-Gigabit SFI/SFP+  done

    eth6      device: Broadcom Corporation NetXtreme BCM5719 Gigabi  done

    eth7      device: Intel Corporation 82599EB 10-Gigabit SFI/SFP+  done

    eth8      device: Intel Corporation 82599EB 10-Gigabit SFI/SFP+  done

    eth9      device: Broadcom Corporation NetXtreme BCM5719 Gigabi  done

    bond0  
    bond0     enslaved interface: eth2
    bond0     enslaved interface: eth5
    bond0     IP address: 192.168.20.16/22                           done

    bond1  
    bond1     enslaved interface: eth8
    bond1     enslaved interface: eth9
    bond1     enslaved interface: eth6
    bond1     enslaved interface: eth7                               done

    vlan1
    vlan1   IP address: 192.168.23.31/19                            done

    vlan2
    vlan2   IP address: 192.168.64.36/19                           done

    vlan3
    vlan3   IP address: 192.168.93.13/19                            done
Setting up service (localfs) network  .  .  .  .  .  .  .  .  .  .   done

On this system the initialized order (and naming) is:
Broadcom
Intel
Intel
Broadcom
Intel
Intel
Broadcom
Intel
Intel
Broadcom

...and sorted by NIC name we have:
ls -l /sys/class/net/eth*
lrwxrwxrwx 1 root root 0 Sep 11 08:57 /sys/class/net/eth0 -> ../../devices/pci0000:00/0000:00:02.0/0000:03:00.0/net/eth0
lrwxrwxrwx 1 root root 0 Sep 11 08:57 /sys/class/net/eth1 -> ../../devices/pci0000:00/0000:00:03.0/0000:04:00.0/net/eth1
lrwxrwxrwx 1 root root 0 Sep 11 08:57 /sys/class/net/eth2 -> ../../devices/pci0000:00/0000:00:03.0/0000:04:00.1/net/eth2
lrwxrwxrwx 1 root root 0 Sep 11 08:57 /sys/class/net/eth3 -> ../../devices/pci0000:00/0000:00:02.0/0000:03:00.1/net/eth3
lrwxrwxrwx 1 root root 0 Sep 11 08:57 /sys/class/net/eth4 -> ../../devices/pci0000:20/0000:20:02.2/0000:24:00.0/net/eth4
lrwxrwxrwx 1 root root 0 Sep 11 08:57 /sys/class/net/eth5 -> ../../devices/pci0000:20/0000:20:02.2/0000:24:00.1/net/eth5
lrwxrwxrwx 1 root root 0 Sep 11 08:57 /sys/class/net/eth6 -> ../../devices/pci0000:00/0000:00:02.0/0000:03:00.2/net/eth6
lrwxrwxrwx 1 root root 0 Sep 11 08:57 /sys/class/net/eth7 -> ../../devices/pci0000:20/0000:20:02.0/0000:27:00.0/net/eth7
lrwxrwxrwx 1 root root 0 Sep 11 08:57 /sys/class/net/eth8 -> ../../devices/pci0000:20/0000:20:02.0/0000:27:00.1/net/eth8
lrwxrwxrwx 1 root root 0 Sep 11 08:57 /sys/class/net/eth9 -> ../../devices/pci0000:00/0000:00:02.0/0000:03:00.3/net/eth9

From EQ7
Again the NICs are initialized in a different order:
Setting up (localfs) network interfaces:
    lo      
    lo        IP address: 127.0.0.1/8
              IP address: 127.0.0.2/8                                done

    eth0      device: Broadcom Corporation NetXtreme BCM5719 Gigabi
    eth0      IP address: 192.168.11.17/24                           done

    eth1      device: Intel Corporation 82599EB 10-Gigabit SFI/SFP+
              No configuration found for eth1                        unused

    eth2      device: Intel Corporation 82599EB 10-Gigabit SFI/SFP+  done

    eth3      device: Intel Corporation 82599EB 10-Gigabit SFI/SFP+
              No configuration found for eth3                        unused

    eth4      device: Intel Corporation 82599EB 10-Gigabit SFI/SFP+
              No configuration found for eth4                        unused

    eth5      device: Broadcom Corporation NetXtreme BCM5719 Gigabi  done

    eth6      device: Intel Corporation 82599EB 10-Gigabit SFI/SFP+  done

    eth7      device: Intel Corporation 82599EB 10-Gigabit SFI/SFP+  done

    eth8      device: Broadcom Corporation NetXtreme BCM5719 Gigabi  done

    eth9      device: Broadcom Corporation NetXtreme BCM5719 Gigabi  done

    bond0  
    bond0     enslaved interface: eth2
    bond0     enslaved interface: eth5
    bond0     IP address: 192.168.20.17/22                           done

    bond1  
    bond1     enslaved interface: eth8
    bond1     enslaved interface: eth9
    bond1     enslaved interface: eth6
    bond1     enslaved interface: eth7                               done

    vlan1
    vlan1   IP address: 192.168.23.32/19                            done

    vlan2
    vlan2   IP address: 192.168.64.37/19                           done

    vlan3
    vlan3   IP address: 192.168.93.14/19                            done
Setting up service (localfs) network  .  .  .  .  .  .  .  .  .  .   done

The order is:
Broadcom
Intel
Intel
Intel
Intel
Broadcom
Intel
Intel
Broadcom
Broadcom

ls -l /sys/class/net/eth*
lrwxrwxrwx 1 root root 0 Sep 11 05:27 /sys/class/net/eth0 -> ../../devices/pci0000:00/0000:00:02.0/0000:03:00.0/net/eth0
lrwxrwxrwx 1 root root 0 Sep 11 05:27 /sys/class/net/eth1 -> ../../devices/pci0000:00/0000:00:03.0/0000:04:00.0/net/eth1
lrwxrwxrwx 1 root root 0 Sep 11 05:27 /sys/class/net/eth2 -> ../../devices/pci0000:00/0000:00:03.0/0000:04:00.1/net/eth2
lrwxrwxrwx 1 root root 0 Sep 11 05:27 /sys/class/net/eth3 -> ../../devices/pci0000:20/0000:20:02.2/0000:24:00.0/net/eth3
lrwxrwxrwx 1 root root 0 Sep 11 05:27 /sys/class/net/eth4 -> ../../devices/pci0000:20/0000:20:02.2/0000:24:00.1/net/eth4
lrwxrwxrwx 1 root root 0 Sep 11 05:27 /sys/class/net/eth5 -> ../../devices/pci0000:00/0000:00:02.0/0000:03:00.1/net/eth5
lrwxrwxrwx 1 root root 0 Sep 11 05:27 /sys/class/net/eth6 -> ../../devices/pci0000:20/0000:20:02.0/0000:27:00.0/net/eth6
lrwxrwxrwx 1 root root 0 Sep 11 05:27 /sys/class/net/eth7 -> ../../devices/pci0000:20/0000:20:02.0/0000:27:00.1/net/eth7
lrwxrwxrwx 1 root root 0 Sep 11 05:27 /sys/class/net/eth8 -> ../../devices/pci0000:00/0000:00:02.0/0000:03:00.2/net/eth8
lrwxrwxrwx 1 root root 0 Sep 11 05:27 /sys/class/net/eth9 -> ../../devices/pci0000:00/0000:00:02.0/0000:03:00.3/net/eth9

Summary of Random NICs

To summarize the names given by each of these three systems and the system bus location:

Bus Pos.  EQ5   EQ6   EQ7
--------  ----  ----  ----
03.00.0   eth0  eth0  eth0
03.00.1   eth1  eth3  eth5
03.00.2   eth4  eth6  eth8
03.00.3   eth8  eth9  eth9
04.00.0   eth2  eth1  eth1
04.00.1   eth3  eth2  eth2
24.00.0   eth5  eth4  eth3
24.00.1   eth6  eth5  eth4
27.00.0   eth7  eth7  eth6
27.00.1   eth9  eth8  eth7
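
A column of the table above can be produced on any node with a short loop over the sysfs symlinks.  This is a sketch only; the function name list_nic_bus_order is made up, and it assumes the interfaces are named eth* and that the PCI address is the path component immediately before /net/.

```shell
# Print "pci-address name" for every eth* interface, sorted by bus
# position, so the output of several nodes can be compared side by side.
list_nic_bus_order() {
    ln_sysfs=${1:-/sys/class/net}
    for ln_link in "$ln_sysfs"/eth*; do
        [ -L "$ln_link" ] || continue
        ln_target=$(readlink "$ln_link")
        # Drop the trailing /net/ethN, then keep the last path component.
        ln_pci=${ln_target%/net/*}
        ln_pci=${ln_pci##*/}
        printf '%s %s\n' "$ln_pci" "${ln_link##*/}"
    done | LC_ALL=C sort
}

# Example:
#   list_nic_bus_order
```

Running it on each node and comparing the second column shows at a glance which bus positions received different names.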

Now from a CMU Net Booted Node:

Now we take EQ7 and boot it via CMU's net boot image, and observe how the NICs are named.

From a Net Booted Node we can get reliable MAC addresses from all of the systems.  The net boot image is Debian running in a RAMdisk, which by default names the NICs in the same order on each system.

EQ7
From dmesg: We can see that the NICs are named in order of appearance on the system bus while the system is being initialized.

[   32.308843] tg3 0000:03:00.0: irq 286 for MSI/MSI-X
[   32.308853] tg3 0000:03:00.0: irq 287 for MSI/MSI-X
[   32.308863] tg3 0000:03:00.0: irq 288 for MSI/MSI-X
[   32.308873] tg3 0000:03:00.0: irq 289 for MSI/MSI-X
[   32.308883] tg3 0000:03:00.0: irq 290 for MSI/MSI-X
[   33.518953] tg3 0000:03:00.1: irq 291 for MSI/MSI-X
[   33.518963] tg3 0000:03:00.1: irq 292 for MSI/MSI-X
[   33.518974] tg3 0000:03:00.1: irq 293 for MSI/MSI-X
[   33.518984] tg3 0000:03:00.1: irq 294 for MSI/MSI-X
[   33.519001] tg3 0000:03:00.1: irq 295 for MSI/MSI-X
[   33.648691] tg3 0000:03:00.2: irq 296 for MSI/MSI-X
[   33.648701] tg3 0000:03:00.2: irq 297 for MSI/MSI-X
[   33.648712] tg3 0000:03:00.2: irq 298 for MSI/MSI-X
[   33.648729] tg3 0000:03:00.2: irq 299 for MSI/MSI-X
[   33.648739] tg3 0000:03:00.2: irq 300 for MSI/MSI-X
[   34.007534] tg3 0000:03:00.3: irq 301 for MSI/MSI-X
[   34.007545] tg3 0000:03:00.3: irq 302 for MSI/MSI-X
[   34.007556] tg3 0000:03:00.3: irq 303 for MSI/MSI-X
[   34.007566] tg3 0000:03:00.3: irq 304 for MSI/MSI-X
[   34.007576] tg3 0000:03:00.3: irq 305 for MSI/MSI-X
[   34.500756] ixgbe 0000:04:00.0: registered PHC device on eth4
[   34.751358] ixgbe 0000:04:00.0 eth4: detected SFP+: 4
[   34.816928] ixgbe 0000:04:00.1: registered PHC device on eth5
[   34.991322] ixgbe 0000:04:00.1 eth5: detected SFP+: 3
[   35.059786] ixgbe 0000:24:00.0: registered PHC device on eth6
[   35.311264] ixgbe 0000:24:00.0 eth6: detected SFP+: 6
[   35.382667] ixgbe 0000:24:00.1: registered PHC device on eth7
[   35.475273] ixgbe 0000:04:00.0 eth4: NIC Link is Up 10 Gbps, Flow Control: RX/TX
[   35.635214] ixgbe 0000:24:00.1 eth7: detected SFP+: 5
[   35.655225] ixgbe 0000:04:00.1 eth5: NIC Link is Up 10 Gbps, Flow Control: RX/TX
[   35.698550] ixgbe 0000:27:00.0: registered PHC device on eth8
[   35.871132] ixgbe 0000:27:00.0 eth8: detected SFP+: 6
[   35.936563] ixgbe 0000:27:00.1: registered PHC device on eth9
[   36.139133] ixgbe 0000:27:00.1 eth9: detected SFP+: 5
[   36.411973] tg3 0000:03:00.0 eth0: Link is up at 1000 Mbps, full duplex
[   36.499187] tg3 0000:03:00.0 eth0: Flow control is off for TX and off for RX
[   36.592109] tg3 0000:03:00.0 eth0: EEE is disabled
[   36.724664] ixgbe 0000:24:00.0 eth6: NIC Link is Up 10 Gbps, Flow Control: RX/TX
[   37.081889] ixgbe 0000:24:00.1 eth7: NIC Link is Up 10 Gbps, Flow Control: RX/TX
[   37.316612] ixgbe 0000:27:00.0 eth8: NIC Link is Up 10 Gbps, Flow Control: RX/TX
[   37.517785] ixgbe 0000:27:00.1 eth9: NIC Link is Up 10 Gbps, Flow Control: RX/TX
[   36.055141] Sending BOOTP requests .. OK
[   43.165986] IP-Config: Got BOOTP answer from 192.168.44.11, my address is 192.168.44.90
[   44.382179] ixgbe 0000:04:00.0: removed PHC on eth4
[   44.524657] ixgbe 0000:04:00.1: removed PHC on eth5
[   44.666746] ixgbe 0000:24:00.0: removed PHC on eth6
[   44.811539] ixgbe 0000:24:00.1: removed PHC on eth7
[   44.955456] ixgbe 0000:27:00.0: removed PHC on eth8
[   45.099432] ixgbe 0000:27:00.1: removed PHC on eth9
[   45.242954] IP-Config: Complete:
[   45.285428]      device=eth0, hwaddr=a0:d3:c1:fa:2a:fc, ipaddr=192.168.44.90,\ mask=255.255.255.128, gw=255.255.255.255
[   45.426422]      host=adappsrku007-cmu, domain=, nis-domain=(none)
[   45.507822]      bootserver=192.168.44.11, rootserver=192.168.44.11,\ rootpath=/opt/cmu/ntbt/rp/x86_64
[   45.627036]      nameserver0=192.168.44.11

Once the system is up, we can observe again, that the NIC names appear in system bus order.

PCI addresses of NICs:
CMU netboot adappsrku007-cmu:/tmp# cat EQ-EdgeNode-Netbooted-clioutput.txt
lrwxrwxrwx 1 root root 0 Sep 11 04:13 /sys/class/net/eth0 -> ../../devices/pci0000:00/0000:00:02.0/0000:03:00.0/net/eth0
lrwxrwxrwx 1 root root 0 Sep 11 04:13 /sys/class/net/eth1 -> ../../devices/pci0000:00/0000:00:02.0/0000:03:00.1/net/eth1
lrwxrwxrwx 1 root root 0 Sep 11 04:13 /sys/class/net/eth2 -> ../../devices/pci0000:00/0000:00:02.0/0000:03:00.2/net/eth2
lrwxrwxrwx 1 root root 0 Sep 11 04:13 /sys/class/net/eth3 -> ../../devices/pci0000:00/0000:00:02.0/0000:03:00.3/net/eth3
lrwxrwxrwx 1 root root 0 Sep 11 04:13 /sys/class/net/eth4 -> ../../devices/pci0000:00/0000:00:03.0/0000:04:00.0/net/eth4
lrwxrwxrwx 1 root root 0 Sep 11 04:13 /sys/class/net/eth5 -> ../../devices/pci0000:00/0000:00:03.0/0000:04:00.1/net/eth5
lrwxrwxrwx 1 root root 0 Sep 11 04:13 /sys/class/net/eth6 -> ../../devices/pci0000:20/0000:20:02.2/0000:24:00.0/net/eth6
lrwxrwxrwx 1 root root 0 Sep 11 04:13 /sys/class/net/eth7 -> ../../devices/pci0000:20/0000:20:02.2/0000:24:00.1/net/eth7
lrwxrwxrwx 1 root root 0 Sep 11 04:13 /sys/class/net/eth8 -> ../../devices/pci0000:20/0000:20:02.0/0000:27:00.0/net/eth8
lrwxrwxrwx 1 root root 0 Sep 11 04:13 /sys/class/net/eth9 -> ../../devices/pci0000:20/0000:20:02.0/0000:27:00.1/net/eth9

Hardware (MAC) Addresses of NICs:
for NIC in $(seq 0 9); do printf "Device eth${NIC} hardware address "; cat /sys/class/net/eth${NIC}/address; done
Device eth0 hardware address de:ad:be:ef:2a:fc
Device eth1 hardware address de:ad:be:ef:2a:fd
Device eth2 hardware address de:ad:be:ef:2a:fe
Device eth3 hardware address de:ad:be:ef:2a:ff
Device eth4 hardware address be:ef:de:ad:db:d4
Device eth5 hardware address be:ef:de:ad:db:d5
Device eth6 hardware address be:ef:de:ad:d7:fc
Device eth7 hardware address be:ef:de:ad:d7:fd
Device eth8 hardware address be:ef:de:ad:db:cc
Device eth9 hardware address be:ef:de:ad:db:cd

We can use this consistent naming order to capture the MAC address and write a udev persistent net rules file for use when the system boots from its own hard drive.

CMU reconf.sh script.


We need to add code to the reconf.sh script that writes unique, MAC-based rules to the 70-persistent-net.rules file for each node.  First I put in the section that CMU adds automagically to the rules file, which I started out using as an example.  I leave it in here as reference.  However, on the system this was developed for, the suggested elements and rule structure did not work.

#--custom code starts here --
#
########## Setup of the /etc/udev/rules.d/70-persistent-net.rules file correctly.
# CMU UDEV rule added at cloning time
#
# see CMU_ADD_NETBOOT_NIC_UDEV_RULE environment variable
# into /opt/cmu/etc/cmuserver.conf on the CMU management node
#
#ACTION=="add",SUBSYSTEM=="net",ATTR{address}=="a0:d3:c1:fa:2b:58",NAME="eth0"
#
########## END CMU Section
#Capture the current, net booted MACs which Debian lists in order.
ETH0_MAC=$(cat /sys/class/net/eth0/address)
ETH1_MAC=$(cat /sys/class/net/eth1/address)
ETH2_MAC=$(cat /sys/class/net/eth2/address)
ETH3_MAC=$(cat /sys/class/net/eth3/address)
ETH4_MAC=$(cat /sys/class/net/eth4/address)
ETH5_MAC=$(cat /sys/class/net/eth5/address)
ETH6_MAC=$(cat /sys/class/net/eth6/address)
ETH7_MAC=$(cat /sys/class/net/eth7/address)
ETH8_MAC=$(cat /sys/class/net/eth8/address)
ETH9_MAC=$(cat /sys/class/net/eth9/address)

# In SLES 11.3, SuSE's parsing of Udev rules seemed to need more rule
# elements than other distributions, so the system generated rule
# structure was followed.
RULES_FILE=${CMU_RCFG_PATH}/etc/udev/rules.d/70-persistent-net.rules
RULE_START='SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="'
RULE_MIDDLE='", ATTR{dev_id}=="0x0", ATTR{type}=="1", KERNEL=="eth*", '

echo "" > ${RULES_FILE}
# ETHn_MAC follows the net-boot (bus) order; NAME is the desired disk-boot
# name.  So the third and fourth Broadcom ports become eth3 and eth4, and
# the first Intel port becomes eth2.
echo '# system board NICs' >> "${RULES_FILE}"
echo "${RULE_START}${ETH0_MAC}${RULE_MIDDLE}"'NAME="eth0"' >> "${RULES_FILE}"
echo "${RULE_START}${ETH1_MAC}${RULE_MIDDLE}"'NAME="eth1"' >> "${RULES_FILE}"
echo "${RULE_START}${ETH2_MAC}${RULE_MIDDLE}"'NAME="eth3"' >> "${RULES_FILE}"
echo "${RULE_START}${ETH3_MAC}${RULE_MIDDLE}"'NAME="eth4"' >> "${RULES_FILE}"
echo '# bond0 NICs'  >> "${RULES_FILE}"
echo "${RULE_START}${ETH4_MAC}${RULE_MIDDLE}"'NAME="eth2"' >> "${RULES_FILE}"
echo "${RULE_START}${ETH5_MAC}${RULE_MIDDLE}"'NAME="eth5"' >> "${RULES_FILE}"
echo '# bond1 NICs'  >> "${RULES_FILE}"
echo "${RULE_START}${ETH6_MAC}${RULE_MIDDLE}"'NAME="eth6"' >> "${RULES_FILE}"
echo "${RULE_START}${ETH7_MAC}${RULE_MIDDLE}"'NAME="eth7"' >> "${RULES_FILE}"
echo "${RULE_START}${ETH8_MAC}${RULE_MIDDLE}"'NAME="eth8"' >> "${RULES_FILE}"
echo "${RULE_START}${ETH9_MAC}${RULE_MIDDLE}"'NAME="eth9"' >> "${RULES_FILE}"

# Notice in the above rules that I grouped them by bonding
# so eth2 and eth5 are listed together.  The order of the rules does not
# matter.  What matters is the correct MAC is assigned the correct name.
# In this case the third device was named eth2 when I began work on the
# cluster.  I did not want to change it from what was working in the
# factory.

# I am including in this report, how I set up the bonded interfaces and
# vlans.

# Set up of the bond0 interface for the Cloudera network.
#
##Variables
IFCFG_BOND0=${CMU_RCFG_PATH}/etc/sysconfig/network/ifcfg-bond0
IFCFG_BOND1=${CMU_RCFG_PATH}/etc/sysconfig/network/ifcfg-bond1
IFCFG_VLAN1=${CMU_RCFG_PATH}/etc/sysconfig/network/ifcfg-vlan1
IFCFG_VLAN2=${CMU_RCFG_PATH}/etc/sysconfig/network/ifcfg-vlan2
IFCFG_VLAN3=${CMU_RCFG_PATH}/etc/sysconfig/network/ifcfg-vlan3
#
## I like separate temp files.
TMPFILE_B0=/tmp/cmu-tmpB0
TMPFILE_B1=/tmp/cmu-tmpB1
TMPFILE_V1=/tmp/cmu-tmpV1
TMPFILE_V2=/tmp/cmu-tmpV2
TMPFILE_V3=/tmp/cmu-tmpV3
#
## IP variables
IPSUFFIX=`echo ${CMU_RCFG_IP} | awk -F. '{print $4}'`
BOND0_IP_BASE=192.168.20
BOND1_IP_BASE=192.168
#   The variable CMU_RCFG_IP is a "built-in" variable supplied by CMU.
#   There are several CMU "built-in" variables available in reconf.sh.
#
## The last octet of the vlan IPs does not match the iLO, eth0, or bond0 IP,
## so it must be adjusted.
### Do the math
VLAN1_IP=$((IPSUFFIX + 15))
VLAN2_IP=$((IPSUFFIX + 20))
VLAN3_IP=$((IPSUFFIX - 5))
#
#  This is one reason to have IP numbering run consistently, or you will
#  have to do a LOT more scripting to give each node's interfaces the
#  correct IP address(es) on any interface except the main CMU interface
#  (the one in the CMU database).  Read the manual if you are lost here.

## Make bond0 config file.
grep -v IPADDR ${IFCFG_BOND0} > ${TMPFILE_B0}
echo IPADDR=${BOND0_IP_BASE}.${IPSUFFIX} >> ${TMPFILE_B0}
mv ${TMPFILE_B0} ${IFCFG_BOND0}

## Make basic bond1 config file. In this case
### bond1 contains no IP address or NETMASK.
grep -v -e IPADDR -e NETMASK ${IFCFG_BOND1} > ${TMPFILE_B1}
mv ${TMPFILE_B1} ${IFCFG_BOND1}

## The vlans ride on bond1, so they use BOND1_IP_BASE for the first two
## octets.
## Set up the vlan1 interface to the external network
grep -v IPADDR ${IFCFG_VLAN1} > ${TMPFILE_V1}
echo IPADDR=${BOND1_IP_BASE}.23.${VLAN1_IP} >> ${TMPFILE_V1}
mv ${TMPFILE_V1} ${IFCFG_VLAN1}

## Set up the vlan2 interface to the external network
grep -v IPADDR ${IFCFG_VLAN2} > ${TMPFILE_V2}
echo IPADDR=${BOND1_IP_BASE}.64.${VLAN2_IP} >> ${TMPFILE_V2}
mv ${TMPFILE_V2} ${IFCFG_VLAN2}

## Set up the vlan3 interface to the external network
grep -v IPADDR ${IFCFG_VLAN3} > ${TMPFILE_V3}
echo IPADDR=${BOND1_IP_BASE}.96.${VLAN3_IP} >> ${TMPFILE_V3}
mv ${TMPFILE_V3} ${IFCFG_VLAN3}

exit 0

# End of reconf.sh
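
The vlan IP arithmetic in reconf.sh adds or subtracts a fixed offset from the last octet of the node's CMU IP, which can silently produce an unusable address on nodes near either end of the range.  The sketch below shows one way to guard that step.  The function name offset_octet is hypothetical, and treating 1-254 as the usable range is an assumption that may be too strict for subnets wider than /24.

```shell
# Derive a last octet from an IP address plus an offset.  Prints the new
# octet, or returns 1 with a message on stderr if the result falls
# outside the 1-254 range.
offset_octet() {
    oo_ip=$1
    oo_offset=$2
    oo_suffix=${oo_ip##*.}
    oo_result=$((oo_suffix + oo_offset))
    if [ "$oo_result" -lt 1 ] || [ "$oo_result" -gt 254 ]; then
        echo "offset_octet: $oo_ip with offset $oo_offset gives $oo_result" >&2
        return 1
    fi
    echo "$oo_result"
}

# Example, in the style of the script above:
#   VLAN1_IP=$(offset_octet "${CMU_RCFG_IP}" 15) || exit 1
```

Failing loudly in reconf.sh is usually easier to debug than a cloned node that comes up with a nonsensical address on one vlan.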

Result of reconf.sh on diskbooted systems.

From EQ7 after booting from its disks.

On each system, reconf.sh creates a rules file like the following, with the MACs unique to that system's interfaces and each assigned the right name.

(NOTE: each rule begins with SUBSYSTEM and ends with the NAME element. Each rule is on a line by itself, and CANNOT be on two lines.  Due to formatting here you may see rules on two lines.)

# system board NICs
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="de:ad:be:ef:2a:fc", ATTR{dev_id}=="0x0", ATTR{type}=="1", KERNEL=="eth*", NAME="eth0"
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="de:ad:be:ef:2a:fd", ATTR{dev_id}=="0x0", ATTR{type}=="1", KERNEL=="eth*", NAME="eth1"
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="de:ad:be:ef:2a:fe", ATTR{dev_id}=="0x0", ATTR{type}=="1", KERNEL=="eth*", NAME="eth3"
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="de:ad:be:ef:2a:ff", ATTR{dev_id}=="0x0", ATTR{type}=="1", KERNEL=="eth*", NAME="eth4"
# bond0
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="be:ef:de:ad:db:d4", ATTR{dev_id}=="0x0", ATTR{type}=="1", KERNEL=="eth*", NAME="eth2"
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="be:ef:de:ad:db:d5", ATTR{dev_id}=="0x0", ATTR{type}=="1", KERNEL=="eth*", NAME="eth5"
# bond1
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="be:ef:de:ad:d7:fc", ATTR{dev_id}=="0x0", ATTR{type}=="1", KERNEL=="eth*", NAME="eth6"
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="be:ef:de:ad:d7:fd", ATTR{dev_id}=="0x0", ATTR{type}=="1", KERNEL=="eth*", NAME="eth7"
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="be:ef:de:ad:db:cc", ATTR{dev_id}=="0x0", ATTR{type}=="1", KERNEL=="eth*", NAME="eth8"
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="be:ef:de:ad:db:cd", ATTR{dev_id}=="0x0", ATTR{type}=="1", KERNEL=="eth*", NAME="eth9"
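
Because a wrapped or duplicated rule is easy to miss by eye, a generated rules file can be given a quick sanity check before the node reboots.  This is a sketch under assumptions: the function names are invented, and it only checks the properties noted above (each rule on one line, starting with SUBSYSTEM and ending with NAME) plus uniqueness of names and MACs.  It does not validate udev syntax in general.

```shell
# Print only the rule lines of a file (no comments, no blank lines).
rules_only() {
    grep -v -e '^[[:space:]]*#' -e '^[[:space:]]*$' "$1"
}

# Return 1 and describe the problem if a rules file contains wrapped
# rules, or uses the same NAME or the same MAC address twice.
check_rules_file() {
    cr_rules=$1
    cr_status=0

    # Count rule lines that are not one complete SUBSYSTEM ... NAME rule.
    cr_bad=$(rules_only "$cr_rules" |
        grep -v -c '^SUBSYSTEM==.*NAME="eth[0-9][0-9]*"[[:space:]]*$' || true)
    if [ "$cr_bad" -ne 0 ]; then
        echo "$cr_bad line(s) are not complete single-line rules"
        cr_status=1
    fi

    cr_dup_names=$(rules_only "$cr_rules" |
        sed -n 's/.*NAME="\([^"]*\)".*/\1/p' | sort | uniq -d)
    cr_dup_macs=$(rules_only "$cr_rules" |
        sed -n 's/.*ATTR{address}=="\([^"]*\)".*/\1/p' | sort | uniq -d)
    [ -z "$cr_dup_names" ] || { echo "duplicate NAME: $cr_dup_names"; cr_status=1; }
    [ -z "$cr_dup_macs" ] || { echo "duplicate MAC: $cr_dup_macs"; cr_status=1; }
    return $cr_status
}

# Example:
#   check_rules_file /etc/udev/rules.d/70-persistent-net.rules
```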


From the Console messages during boot we can see:
Setting up (localfs) network interfaces:
    lo
    lo        IP address: 127.0.0.1/8
              IP address: 127.0.0.2/8                                done

    eth0      device: Broadcom Corporation NetXtreme BCM5719 Gigabi
    eth0      IP address: 192.168.44.90/25                           done

    eth1      device: Broadcom Corporation NetXtreme BCM5719 Gigabi
              No configuration found for eth1                        unused

    eth2      device: Intel Corporation 82599EB 10-Gigabit SFI/SFP+  done

    eth3      device: Broadcom Corporation NetXtreme BCM5719 Gigabi
              No configuration found for eth3                        unused

    eth4      device: Broadcom Corporation NetXtreme BCM5719 Gigabi
              No configuration found for eth4                        unused

    eth5      device: Intel Corporation 82599EB 10-Gigabit SFI/SFP+  done

    eth6      device: Intel Corporation 82599EB 10-Gigabit SFI/SFP+  done

    eth7      device: Intel Corporation 82599EB 10-Gigabit SFI/SFP+  done

    eth8      device: Intel Corporation 82599EB 10-Gigabit SFI/SFP+  done

    eth9      device: Intel Corporation 82599EB 10-Gigabit SFI/SFP+  done

    bond0
    bond0     enslaved interface: eth2
    bond0     enslaved interface: eth5
    bond0     IP address: 192.168.43.18/26                           done

    bond1
    bond1     enslaved interface: eth8
    bond1     enslaved interface: eth9
    bond1     enslaved interface: eth6
    bond1     enslaved interface: eth7                               done

    vlan1
    vlan1   IP address: 172.21.8.109/22                            done

    vlan2
    vlan2   IP address: 172.21.13.121/22                           done

    vlan3
    vlan3   IP address: 172.21.80.16/23                            done
Setting up service (localfs) network  .  .  .  .  .  .  .  .  .  .   done

The NICs came up in the desired order:
Broadcom
Broadcom
Intel
Broadcom
Broadcom
Intel
Intel
Intel
Intel
Intel


From the booted system we can see that the bus paths are in the order expected from the rules file.

ls -l /sys/class/net/eth*
lrwxrwxrwx 1 root root 0 Sep 11 05:41 /sys/class/net/eth0 -> ../../devices/pci0000:00/0000:00:02.0/0000:03:00.0/net/eth0
lrwxrwxrwx 1 root root 0 Sep 11 05:41 /sys/class/net/eth1 -> ../../devices/pci0000:00/0000:00:02.0/0000:03:00.1/net/eth1
lrwxrwxrwx 1 root root 0 Sep 11 05:41 /sys/class/net/eth2 -> ../../devices/pci0000:00/0000:00:03.0/0000:04:00.0/net/eth2
lrwxrwxrwx 1 root root 0 Sep 11 05:41 /sys/class/net/eth3 -> ../../devices/pci0000:00/0000:00:02.0/0000:03:00.2/net/eth3
lrwxrwxrwx 1 root root 0 Sep 11 05:41 /sys/class/net/eth4 -> ../../devices/pci0000:00/0000:00:02.0/0000:03:00.3/net/eth4
lrwxrwxrwx 1 root root 0 Sep 11 05:41 /sys/class/net/eth5 -> ../../devices/pci0000:00/0000:00:03.0/0000:04:00.1/net/eth5
lrwxrwxrwx 1 root root 0 Sep 11 05:41 /sys/class/net/eth6 -> ../../devices/pci0000:20/0000:20:02.2/0000:24:00.0/net/eth6
lrwxrwxrwx 1 root root 0 Sep 11 05:41 /sys/class/net/eth7 -> ../../devices/pci0000:20/0000:20:02.2/0000:24:00.1/net/eth7
lrwxrwxrwx 1 root root 0 Sep 11 05:41 /sys/class/net/eth8 -> ../../devices/pci0000:20/0000:20:02.0/0000:27:00.0/net/eth8
lrwxrwxrwx 1 root root 0 Sep 11 05:41 /sys/class/net/eth9 -> ../../devices/pci0000:20/0000:20:02.0/0000:27:00.1/net/eth9
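
The same listing can be reduced to "PCI address, interface name" pairs sorted by bus position, which makes it easier to see where a name such as eth2 sits relative to the system board NICs. This is a rough sketch; list_nics_by_bus is a hypothetical helper name, and it assumes the sysfs symlinks have the form shown above, with the PCI address as the path component just before /net/ethN.

```shell
# list_nics_by_bus is a hypothetical helper name.
# Usage: list_nics_by_bus [NET_CLASS_DIR]   (defaults to /sys/class/net)
list_nics_by_bus() {
    net_dir=${1:-/sys/class/net}
    for link in "$net_dir"/eth*; do
        [ -L "$link" ] || continue          # skip anything that is not a symlink
        target=$(readlink "$link")
        pci=${target%/net/*}                # drop the trailing /net/ethN
        pci=${pci##*/}                      # keep only the last component, e.g. 0000:03:00.0
        echo "$pci ${link##*/}"
    done | LC_ALL=C sort                    # fixed-width hex addresses sort correctly as text
}
```

The optional directory argument exists only so the function can be tried against a copy of the symlinks instead of the live /sys tree.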

Example 2:

In a different set of nodes, SL4540s, a Mellanox Ethernet card was encountered that threw a nice curve ball into this solution.  These nodes have 4 Ethernet interfaces: eth0, eth1, eth2 & eth3.  Interfaces 2 & 3 are on the Mellanox NIC, BUT, when you probe the system's hardware via normal utilities, the Mellanox NIC only reports one of the interfaces; not both.

Interfaces eth0 & eth1 are built into the system board.

There was not much time to spend digging into this variation of the problem.  A solution similar to the DL360pG8 solution was put in place for interfaces eth0, eth1, and eth2.  Eth0 was used for the Admin network, and eth2 & eth3 were bonded together.  (No vlan tagging was used for these nodes' interfaces.)

Since I was not able to find a quick programmatic way to identify eth3 from a CMU Net Booted system, I decided that the reconf.sh script should NOT write a rule for that interface.  Rules are written for NICs eth0, eth1, and eth2.

On boot up from disk, the system will notice that eth3 is not defined and will generate a rule for that interface automagically.  As long as there are no UNcommented rules in the file that name eth3, udev will name the interface eth3.

When the network comes up, it finds both eth2 and eth3 defined, and the bonded interface is up at full capacity.

If there is an UNcommented eth3 in the rules file, then the system will generate an interface eth4 for that NIC.  The bonded interface will only contain eth2, and will be running at half capacity.
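
A quick guard against that failure mode is to look for an active (uncommented) rule that names eth3 before the node is booted from disk. This is a minimal sketch; has_active_rule_for is a hypothetical helper name and is not something CMU or udev provides.

```shell
# has_active_rule_for is a hypothetical helper name.
# Returns 0 if an uncommented rule in RULES_FILE assigns NAME="<nic>".
# Usage: has_active_rule_for RULES_FILE NIC_NAME
has_active_rule_for() {
    rules_file=$1
    nic_name=$2
    # Drop comment lines first, then look for the NAME assignment.
    grep -v '^[[:space:]]*#' "$rules_file" | grep -q "NAME=\"$nic_name\""
}

# Example use inside a script such as reconf.sh:
#   if has_active_rule_for "$RULES_FILE" eth3; then
#       echo "WARNING: active eth3 rule found; udev would name the port eth4" >&2
#   fi
```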

#Example entries for reconf.sh
#Capture the current, net booted MACs which Debian lists in order.
##Built in NICs
ETH0_MAC=$(cat /sys/class/net/eth0/address)
ETH1_MAC=$(cat /sys/class/net/eth1/address)
##Mellanox Ethernet NIC with two ports.
ETH2_MAC=$(cat /sys/class/net/eth2/address)
ETH3_MAC=$(cat /sys/class/net/eth3/address)

RULES_FILE=${CMU_RCFG_PATH}/etc/udev/rules.d/70-persistent-net.rules
RULE_START='SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="'
RULE_MIDDLE='", ATTR{dev_id}=="0x0", ATTR{type}=="1", KERNEL=="eth*", '

# Quote the variables so the shell cannot glob-expand "?*" or "eth*" in the rule text.
echo "" > "${RULES_FILE}"
echo '# system board NICs' >> "${RULES_FILE}"
echo "${RULE_START}${ETH0_MAC}${RULE_MIDDLE}"'NAME="eth0"' >> "${RULES_FILE}"
echo "${RULE_START}${ETH1_MAC}${RULE_MIDDLE}"'NAME="eth1"' >> "${RULES_FILE}"
echo '# bond0 NICs'  >> "${RULES_FILE}"
echo "${RULE_START}${ETH2_MAC}${RULE_MIDDLE}"'NAME="eth2"' >> "${RULES_FILE}"
####### Keep the rule for eth3 commented out so the udev system generates it.
####### echo "${RULE_START}${ETH3_MAC}${RULE_MIDDLE}"'NAME="eth3"' >> "${RULES_FILE}"
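
For nodes with more interfaces, the repeated echo lines above could be folded into a pair of small functions. This is only a sketch of that idea, not what was deployed: append_rule and write_rules are hypothetical names, the rule text is copied from the rules shown in this article, and the comment lines written by the original script are left out for brevity.

```shell
# append_rule and write_rules are hypothetical helper names.
# append_rule RULES_FILE MAC NAME : append one single-line rule to the file.
append_rule() {
    printf 'SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="%s", ATTR{dev_id}=="0x0", ATTR{type}=="1", KERNEL=="eth*", NAME="%s"\n' \
        "$2" "$3" >> "$1"
}

# write_rules RULES_FILE NET_CLASS_DIR NIC... : start a fresh rules file and
# write one rule per listed NIC, reading each MAC from NET_CLASS_DIR/NIC/address.
write_rules() {
    rules_file=$1
    net_dir=$2
    shift 2
    : > "$rules_file"                       # truncate or create the file
    for nic in "$@"; do
        mac=$(cat "$net_dir/$nic/address")
        append_rule "$rules_file" "$mac" "$nic"
    done
}

# Example use for the SL4540 nodes (eth3 deliberately left out):
#   write_rules "${CMU_RCFG_PATH}/etc/udev/rules.d/70-persistent-net.rules" \
#       /sys/class/net eth0 eth1 eth2
```

Using printf with a single format string keeps each rule on one line, which is the one formatting requirement udev will not forgive.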

When the system comes up, it can be observed that another interface was generated by udev.  It should be named eth3.
ls -l /sys/class/net/eth*
lrwxrwxrwx 1 root root 0 Sep 11 05:41 /sys/class/net/eth0 -> ../../devices/pci0000:00/0000:00:02.0/0000:03:00.0/net/eth0
lrwxrwxrwx 1 root root 0 Sep 11 05:41 /sys/class/net/eth1 -> ../../devices/pci0000:00/0000:00:02.0/0000:03:00.1/net/eth1
lrwxrwxrwx 1 root root 0 Sep 11 05:41 /sys/class/net/eth2 -> ../../devices/pci0000:00/0000:00:04.0/0000:04:00.0/net/eth2
lrwxrwxrwx 1 root root 0 Sep 11 05:41 /sys/class/net/eth3 -> ../../devices/pci0000:00/0000:00:04.0/0000:04:00.1/net/eth3
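
Seeing eth3 in the listing does not by itself prove the bond picked it up. The bonding driver lists its member interfaces in a sysfs file, so the member count can be checked directly. This is a minimal sketch; bond_has_slaves is a hypothetical helper name, and bond0 with two members is an assumption taken from the description of these nodes.

```shell
# bond_has_slaves is a hypothetical helper name.
# Returns 0 if the bond currently has exactly EXPECTED member interfaces.
# Usage: bond_has_slaves BOND EXPECTED [NET_CLASS_DIR]
bond_has_slaves() {
    bond=$1
    expected=$2
    net_dir=${3:-/sys/class/net}
    # The file holds a space-separated list such as "eth2 eth3";
    # load the names into the positional parameters and count them.
    set -- $(cat "$net_dir/$bond/bonding/slaves")
    [ "$#" -eq "$expected" ]
}

# Example use on a booted node:
#   bond_has_slaves bond0 2 || echo "bond0 is not at full capacity" >&2
```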

Summary:

The udev system is very picky, and can be more so on certain Linux distributions.  I know we encounter more udev problems with SuSE than we do with RHEL or derivatives of it.

In this example, we had to have a generic solution that would fit all of the nodes, real or potential, that could exist in the cluster.  A few variations on this solution had to be employed due to system and functional differences between the various types of nodes operating in this cluster.

If a different version of Debian were to be used, the solution would have to be rechecked to make sure that Debian still names the devices in bus order.

One major lesson learned is that if udev appears to be ignoring the rules written for it, that typically means there is something wrong with the rule.  Either it does not contain enough elements to take effect, or there is a bad key and/or value present in the rule.
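
One way to catch the "not enough elements" case early is to compare each active rule against the set of keys that is known to work. The sketch below is an assumption-heavy illustration: report_missing_keys is a hypothetical helper name, and its key list is simply the set used in this article's rules, not the full udev vocabulary.

```shell
# report_missing_keys is a hypothetical helper name.
# Prints one line for every key that an active rule lacks; prints nothing
# when every rule carries the full set of keys used in this article.
# Usage: report_missing_keys RULES_FILE
report_missing_keys() {
    rules_file=$1
    grep -v -e '^[[:space:]]*#' -e '^[[:space:]]*$' "$rules_file" |
    while IFS= read -r rule; do
        for key in 'SUBSYSTEM==' 'ACTION==' 'DRIVERS==' 'ATTR{address}==' \
                   'ATTR{dev_id}==' 'ATTR{type}==' 'KERNEL==' 'NAME='; do
            case $rule in
                *"$key"*) ;;                              # key present
                *) echo "missing $key in: $rule" ;;
            esac
        done
    done
}
```

An empty result only means the rules have the expected keys; it says nothing about whether the values, such as the MAC addresses, are right.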

When there are bad keys in the rule, error messages can sometimes be observed scrolling on the boot screen, and may be available in dmesg.

I also found that reboots were necessary to fully realize the changes made manually while experimenting with rule syntax.  The udevadm tool was not sufficient to have interface name changes realized in an already running system.

In the end a few different groups of nodes came up with multiple NICs in multiple bond/vlan configurations, connected to various networks.

Tuesday, September 16, 2014

Creating Floppy Disk Images for use with Virtualbox

Two solutions:

I made a slight edit to the name of the floppy disk image.   Depending on your Linux distribution, you may also need to use 'sudo' for some of these commands.

From Superuser forum:

(http://superuser.com/questions/342433/how-to-create-an-empty-floppy-image-with-virtualbox-windows-guest)

# Either of the next two commands creates the empty 1474560-byte image; only one is needed.
fallocate  -l  1474560  floppy1.vfd
head  -c  1474560  /dev/zero  >  floppy1.vfd
mkfs.vfat   floppy1.vfd
mkdir  /media/floppy1
mount  -o  loop  ./floppy1.vfd   /media/floppy1


From the Virtualbox.org forum:

(https://forums.virtualbox.org/viewtopic.php?t=1426)

dd  if=/dev/zero  of=floppy2.img   bs=512  count=2880
mkfs.vfat   floppy2.img
mkdir  /media/floppy2
mount  -o  loop  ./floppy2.img   /media/floppy2
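
Both solutions produce the same size of image: a 1.44 MB floppy has 80 tracks x 2 heads x 18 sectors = 2880 sectors of 512 bytes, which is 1474560 bytes. The sketch below creates the blank image and checks its size before it is formatted; make_blank_floppy and image_size are hypothetical helper names, and the formatting and mounting steps are left to the commands above.

```shell
# make_blank_floppy and image_size are hypothetical helper names.
# make_blank_floppy IMAGE : create a zero-filled 1.44 MB floppy image.
make_blank_floppy() {
    image=$1
    sectors=2880        # 80 tracks x 2 heads x 18 sectors per track
    sector_size=512     # bytes per sector
    dd if=/dev/zero of="$image" bs="$sector_size" count="$sectors" 2>/dev/null
}

# image_size IMAGE : print the size of the image in bytes.
image_size() {
    wc -c < "$1" | tr -d ' '
}

# Example use:
#   make_blank_floppy floppy3.img
#   [ "$(image_size floppy3.img)" = "1474560" ] && mkfs.vfat floppy3.img
```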