The InfiniBand troubleshooting quick reference

Glossary of InfiniBand Terminology:

  • GID: Global Identifier
  • GUID: Globally Unique Identifier (also known as Direct Address)
  • HCA: Host Channel Adapter
  • LID: Local Identifier
  • TCA: Target Channel Adapter
  • SM: Subnet Manager

InfiniBand: The Host Perspective

Every host on an InfiniBand fabric has three identifiers: GUID, GID, and LID. A GUID is similar in concept to a MAC address: it is a 64-bit EUI-64 value made up of a 24-bit manufacturer’s prefix and a 40-bit device identifier.

The Global Identifier (GID) is a 128-bit identifier similar to an IPv6 address (technically, a GID is a valid IPv6 address with restrictions). The GID consists of a 64-bit subnet prefix followed by the 64-bit port GUID, for a total of 128 bits. The GID is used for routing between subnets. The default subnet prefix is fe80::/64.

Finally, there is the Local Identifier (LID), which is assigned by the subnet manager. The LID is a 16-bit identifier that is unique within a subnet. Host ports receive a unicast LID between 0x0001 and 0xBFFF (1 to 49,151); 0xC000 through 0xFFFE are used for multicast. LIDs are usually expressed in hexadecimal notation (such as 0xb1). Routing within a subnet is managed by LID.

On a Linux server, the GUID, GID, and LID are exposed as text files under /sys/class/infiniband. The exact path to these files depends on which InfiniBand driver is used on the system. For a system with the MLX4 driver (such as Red Hat/CentOS 5.x), the commands are:

cat /sys/class/infiniband/mlx4_0/node_guid 
0002:c903:0001:0a48
cat /sys/class/infiniband/mlx4_0/ports/1/gids/0
fe80:0000:0000:0000:0002:c903:0001:0a49
cat /sys/class/infiniband/mlx4_0/ports/1/lid
0x14a
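
As a rough illustration of how the LID value read above can be interpreted, the sketch below sorts a LID into the standard InfiniBand address ranges (0x0001 to 0xBFFF unicast, 0xC000 to 0xFFFE multicast, 0xFFFF permissive). The function name lid_class is a hypothetical helper, not part of any InfiniBand tool.

```shell
# lid_class: hypothetical helper that classifies a 16-bit LID.
# Accepts decimal or 0x-prefixed hexadecimal, as printed by sysfs.
lid_class() {
    lid=$(( $1 ))                       # shell arithmetic understands 0x...
    if [ "$lid" -lt 0 ] || [ "$lid" -gt 65535 ]; then
        echo "invalid"                  # does not fit in 16 bits
    elif [ "$lid" -eq 0 ]; then
        echo "reserved"                 # 0x0000 is never assigned
    elif [ "$lid" -le 49151 ]; then
        echo "unicast"                  # 0x0001 - 0xBFFF
    elif [ "$lid" -le 65534 ]; then
        echo "multicast"                # 0xC000 - 0xFFFE
    else
        echo "permissive"               # 0xFFFF
    fi
}

# The sample LID from the output above, shown in decimal with its class.
echo "0x14a = $(printf '%d' 0x14a) ($(lid_class 0x14a))"
```

On a live host the argument would typically come from the sysfs file itself, for example lid_class "$(cat /sys/class/infiniband/mlx4_0/ports/1/lid)".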

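Note that the lower 64 bits of the GID are the port GUID, so you don’t need to fetch the port GUID separately if you have the GID. The port GUID is not necessarily identical to the node GUID: in the output above, the node GUID ends in 0a48 while the GUID embedded in the GID ends in 0a49.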
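
A minimal sketch of pulling the two 64-bit halves out of a GID, using the sample GID shown above. The upper half is the subnet prefix and the lower half is the port GUID. The function names gid_prefix and gid_port_guid are hypothetical helpers, and the sketch assumes the fully expanded eight-group form that sysfs prints.

```shell
# Hypothetical helpers: split a fully expanded GID into its two halves.
# Each colon-separated group is 16 bits, so groups 1-4 and 5-8 are 64 bits each.
gid_prefix() {
    printf '%s\n' "$1" | cut -d: -f1-4
}

gid_port_guid() {
    printf '%s\n' "$1" | cut -d: -f5-8
}

# Sample value copied from the sysfs output above.
gid="fe80:0000:0000:0000:0002:c903:0001:0a49"
echo "subnet prefix: $(gid_prefix "$gid")"
echo "port GUID:     $(gid_port_guid "$gid")"
```

On a live host, the gid variable could instead be filled from /sys/class/infiniband/mlx4_0/ports/1/gids/0.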

InfiniBand: The Switch Perspective

Understanding an InfiniBand network from the perspective of a switch is more difficult. The concepts are the same, but an “InfiniBand switch” is actually composed of multiple spine and leaf (line) modules, each of which is considered a “switch” with its own GUID and LID. For example, consider this snippet of output from the ibswitches command on a Voltaire Grid Director 4200-L50:

Switch	: 0x0008f10500652600 ports 36 "Voltaire sLB-4018    Line 4  Chip 1 4200 #4200-CF50" enhanced port 0 lid 125 lmc 0
Switch	: 0x0008f10500652758 ports 36 "Voltaire sLB-4018    Line 3  Chip 1 4200 #4200-CF50" enhanced port 0 lid 127 lmc 0
Switch	: 0x0008f10500652748 ports 36 "Voltaire sLB-4018    Line 2  Chip 1 4200 #4200-CF50" enhanced port 0 lid 126 lmc 0
Switch	: 0x0008f10500380986 ports 36 "Voltaire 40Gb InfiniBand Switch Module for IBM BladeCenter" enhanced port 0 lid 124 lmc 0
Switch	: 0x0008f1050038084c ports 36 "Voltaire 40Gb InfiniBand Switch Module for IBM BladeCenter" enhanced port 0 lid 123 lmc 0
Switch	: 0x0008f105007521a6 ports 36 "Voltaire sFB-4200    Spine 4  Chip 1 4200 #4200-CF50" enhanced port 0 lid 128 lmc 0
Switch	: 0x0008f105007521dc ports 36 "Voltaire sFB-4200    Spine 3  Chip 1 4200 #4200-CF50" enhanced port 0 lid 131 lmc 0
Switch	: 0x0008f105007521cc ports 36 "Voltaire sFB-4200    Spine 2  Chip 1 4200 #4200-CF50" enhanced port 0 lid 129 lmc 0
Switch	: 0x0008f105007521d8 ports 36 "Voltaire sFB-4200    Spine 1  Chip 1 4200 #4200-CF50" enhanced port 0 lid 130 lmc 0
Switch	: 0x0008f10500652742 ports 36 "Voltaire sLB-4018    Line 1  Chip 1 4200 #4200-CF50" enhanced port 0 lid 1 lmc 0
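
Because every line and spine module shows up as its own entry, it can help to reduce this output to one LID, GUID, and description per row. The following is a sketch only: parse_ibswitches is a hypothetical helper, and it assumes the line layout shown above (a quoted description, with the LID following the word "lid").

```shell
# parse_ibswitches: hypothetical filter for ibswitches output.
# Prints "LID <tab> GUID <tab> description" for each Switch line on stdin.
parse_ibswitches() {
    awk -F'"' '/^Switch/ {
        split($1, head, " ")           # Switch, ":", GUID, "ports", count
        n = split($3, tail, " ")       # ..., "lid", LID, "lmc", LMC
        lid = ""
        for (i = 1; i < n; i++)
            if (tail[i] == "lid") lid = tail[i + 1]
        desc = $2
        gsub(/[ \t]+/, " ", desc)      # collapse the padding inside the name
        printf "%s\t%s\t%s\n", lid, head[3], desc
    }'
}

# Two sample lines copied from the output above, sorted by LID.
printf 'Switch\t: 0x0008f10500652600 ports 36 "Voltaire sLB-4018    Line 4  Chip 1 4200 #4200-CF50" enhanced port 0 lid 125 lmc 0\nSwitch\t: 0x0008f10500652742 ports 36 "Voltaire sLB-4018    Line 1  Chip 1 4200 #4200-CF50" enhanced port 0 lid 1 lmc 0\n' |
    parse_ibswitches | sort -n
```

On a host with the diagnostic tools installed, the same filter could be fed directly: ibswitches | parse_ibswitches | sort -n.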
