My last post showed how to monitor networked devices with SNMP. You could try to remember to manually check the status of things periodically, but that would be missing the point of computers. Instead, automate your monitoring with Nagios, a web-based monitoring tool for Linux that automates the process of actively querying devices and doing something with the information. Nagios is available as free open source software (Nagios Core), and the company offers additional non-free products with premium features. The open-source version is fine for getting started and setting up basic monitoring. Nagios does a lot more than just SNMP monitoring. I’ll refer you to the Nagios Core documentation to get Nagios up and running, and I’ll focus on how to set up Nagios to actively monitor devices with SNMP.
In Part 1, I summarized the basic concepts of SNMP and defined the terms and acronyms used in this post. Now, I will show how to use SNMP to monitor actual devices. As an example, I will monitor an enterprise-grade uninterruptible power supply (UPS) and power distribution units (PDUs) from Tripp-Lite. These devices have an SNMPWEBCARD installed to support communication over Ethernet.
Command-line tools for SNMP communication should be available for any Linux distribution (or any other UNIX-derived OS). Documentation for the basic SNMP tools is available online. The challenge with SNMP is figuring out what parameters are supported by a particular device. Most devices support a set of standard OIDs that return basic information such as device name, uptime, etc.
SNMP is a protocol for conveying information and controlling devices over a network. SNMP can be used in two ways:
- Active: a device sends a command to set a parameter on another device or to request information from it
- Passive: a device sends an alert (called a trap) to another device, which is configured to receive traps and do something with the information.
The “payload” of an SNMP message is called an Object Identifier, or OID. An OID is an ordered list of non-negative numbers, such as:
1.3.6.1.2.1.1.3.0
The sequence is hierarchical, starting with the highest-level object and progressing to lower-level objects. The above sequence corresponds to:
iso(1) org(3) dod(6) internet(1) mgmt(2) mib-2(1) system(1) sysUpTime(3) 0
When this OID is requested from a device, the device returns its uptime.
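The relationship between the symbolic and numeric forms can be checked with a few lines of shell. This is only a sketch: it takes the symbolic path quoted above, keeps the number inside each pair of parentheses, and joins the numbers with dots. The snmpget line in the comment assumes the Net-SNMP tools are installed; the host name and community string there are placeholders, not values from this post.

```shell
# Symbolic form of the OID, as written in the text above.
symbolic='iso(1) org(3) dod(6) internet(1) mgmt(2) mib-2(1) system(1) sysUpTime(3) 0'

# Keep only the number inside each pair of parentheses, then join with dots.
numeric=$(printf '%s\n' "$symbolic" | sed -E 's/[^ ]*\(([0-9]+)\)/\1/g; s/ +/./g')
echo "$numeric"

# With the Net-SNMP tools installed, the same OID could be requested like this
# (ups.example.com and "public" are hypothetical placeholders):
#   snmpget -v 2c -c public ups.example.com "$numeric"
```

The script prints the dotted numeric OID for sysUpTime, which is the form a device actually receives.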
The translation between the numerical sequence and the human-readable form is stored in a text file called a Management Information Base, or MIB. The format of the MIB is defined in RFC 2578. Some MIB files are standard and contain object IDs that are recognized by almost all devices. Device manufacturers also provide custom MIB files in which they define specialized object IDs for a particular device. Unfortunately, some devices don’t have MIB files, and you will have to query the device to see what objects it supports and decipher what they mean.
In Part 2 of this series, I will use active SNMP to monitor infrastructure.
What do you do when you want to distribute or release source code that is stored in a Git repository? Obviously, if your target audience is using Git, you can just compress the directory that contains the repository and distribute the copies, or give the users a way to clone your repository (such as GitHub). However, your audience may not be Git users, or the hidden .git directory may be very large and you don’t want to distribute it. The solution is the git archive command, which packs the files from a tree-ish into an archive (ZIP or TAR). “Tree-ish” means that you can specify a branch, commit, HEAD, etc.
git archive is somewhat analogous to the svn export command. I find the most useful form of this command to be:
git archive --output ~/example.zip --format=zip --prefix=example/ HEAD
Do not forget the trailing slash after the directory that you specify with the --prefix option.
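To see why the trailing slash matters, here is a small sketch that builds a throwaway repository in a temporary directory and lists the archive contents with and without the slash. The repository, file name, and committer identity are made up for the illustration. It uses the tar format so the listing can be piped straight into tar.

```shell
# Throwaway repository for the demonstration (all names are hypothetical).
repo=$(mktemp -d)
git init -q "$repo"
echo "hello" > "$repo/README.txt"
git -C "$repo" add README.txt
git -C "$repo" -c user.name="Example User" -c user.email="user@example.com" \
    commit -q -m "Initial commit"

# With the trailing slash, the prefix becomes a directory.
with_slash=$(git -C "$repo" archive --format=tar --prefix=example/ HEAD | tar -tf -)

# Without it, the prefix is simply glued onto each file name.
without_slash=$(git -C "$repo" archive --format=tar --prefix=example HEAD | tar -tf -)

echo "with slash:";    echo "$with_slash"
echo "without slash:"; echo "$without_slash"
```

The first listing contains example/README.txt, while the second contains exampleREADME.txt, which is rarely what you want.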
REFERENCE: How to do a “git export” (like svn export)
The GENI Project is a networking testbed that is used by researchers studying novel networking technologies. While the technology is fascinating, the web site is, unfortunately, a confusing mess. Here are some pointers to get you started (or refresh your memory). This post will be updated as I learn more.
Key GENI Links
This is where you log into the GENI Project. Your institution must have Shibboleth enabled and be part of the InCommon Federation. Click on the “Use GENI” button, enter the name of your institution into the search box, and you will be redirected to your institution’s login page.
Tutorials, How-Tos, and other documentation can be found at the GENI Experimenter Page.
GitHub is a great tool for collaborating on projects. However, sometimes it is necessary to mimic the “GitHub workflow” using a shared repository on a local Linux server. The following example shows how I shared an example repository with multiple users. We are also using the Git flow model for branching, aided by the handy git flow plugin.
On my workstation
I started by creating a repo on my local workstation and setting it up to use the git flow plugin.
git init .
git flow init
git flow feature start first_feature
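For the server side, one common approach is a bare repository with group-writable permissions. The sketch below is an assumption about how such a setup could look, not necessarily the one used here: it creates the bare repository in a temporary directory standing in for the server path, then pushes a first commit to it from a separate working repository. All paths, names, and the committer identity are hypothetical.

```shell
# Stand-ins for the shared server path and a user's working copy.
shared=$(mktemp -d)/example.git
work=$(mktemp -d)

# A bare repository has no working tree; --shared=group makes it group-writable.
git init -q --bare --shared=group "$shared"

# Create a first commit in the working repository and publish it.
git init -q "$work"
echo "hello" > "$work/README.txt"
git -C "$work" add README.txt
git -C "$work" -c user.name="Example User" -c user.email="user@example.com" \
    commit -q -m "Initial commit"
git -C "$work" remote add origin "$shared"
git -C "$work" push -q origin HEAD

# List the branches that now exist in the shared repository.
git -C "$shared" for-each-ref --format='%(refname:short)' refs/heads
```

On a real server the remote would typically be an SSH URL rather than a local path, and the users would need to belong to a common Unix group.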
Our high performance compute cluster (HPCC) has fairly primitive tools for managing the deployment of the operating system on the compute nodes. Our current tools are “aspencopy,” which takes an “image” of the filesystem of a running server and saves it as a .tar.gz file (NOT a disk image), and its counterpart “aspenrestore,” which deploys an “image” to another server. The utility is smart enough to update things like the host name, IP address, host SSH keys, etc. However, the images are essentially “black boxes,” in the sense that there is no system for keeping track of which configuration changes have been applied to which image, and no way to know which image is running on each server. The next cluster that I am responsible for purchasing must include a configuration management/data center automation system, such as:
On a related note, Vagrant is a system for managing virtual machines. You can define a virtual machine configuration in a specification file, and Vagrant will automate the startup and shutdown of arbitrary numbers of virtual machines.
My recent Ubuntu installation was my first experience with the new GRUB 2.x series of bootloaders. Unfortunately, the process of manually configuring GRUB2 on Ubuntu is not well documented for the cases where things don’t work “automagically.” I had to solve two problems: the blank screen at boot, and getting GRUB to boot to an existing partition with CentOS 5 installed.
I recently installed Ubuntu on an older PC with 1GB of RAM and 80GB of hard drive space, so I wanted a lightweight desktop interface. I chose XFCE, since it is both lightweight and usable, and I have used it extensively. You can get Ubuntu pre-made with XFCE (xubuntu), but there are some disadvantages. The ISO is slightly too large to fit on a CD, and it comes with a lot of applications that I don’t need. Instead, I installed Ubuntu from a minimal CD and then used apt-get to install XFCE and the LightDM display manager. I learned that a couple of extra steps must be taken to get XFCE to play with lightdm. I got the error “can’t find session ubuntu”. Don’t bother changing the .dmrc file in your home directory; this file is overwritten every time lightdm starts! I configured lightdm to use an XFCE session with the command:
/usr/lib/lightdm/lightdm-set-defaults --session xfce4-session
I then got the error “can’t find session xfce” until I realized that I had to install another package to get the xfce4 session file:
apt-get install xubuntu-default-settings
That was a lot more trouble than it had to be, but I now have a responsive Linux GUI running in 1GB of RAM that consumes only 2.2GB of hard drive space.
A previous post documented that a Linux server running a pre-2.6.24 kernel can fail to allocate large chunks of memory after its memory has been fragmented by a “thrashing” incident. In this post, I will point out some ways to prevent this problem.
Use a Newer Kernel
We have some servers running RHEL 5.9 with the kernel updated to 18.104.22.168. After a thrashing incident, these servers do not experience the same problem with allocating large blocks of memory. I think the fix is documented in the release notes for kernel 2.6.24. Section 2.4 talks about “anti-fragmentation patches” and includes a link to this article about Linux memory management, which links to this thorough documentation of the anti-fragmentation patches. (BTW, here is the full list of 2.6 kernel changelogs.) My plan is to deploy RHEL 5.9 with the updated kernel to all the compute nodes in our cluster. However, this still doesn’t solve the problem of a user who requests some portion of the RAM on a node and then proceeds to consume more memory than requested. This is unfair to any other user whose job is running on the same node.
Limit RAM Used By a Process
There are ways to prevent servers from thrashing in the first place. My discussion will be specific to HPC compute nodes, not the more general case of web servers, mail servers, etc. I’ll start by saying that ulimit is not the solution because, in part, its limits apply to each process separately rather than to a job’s whole tree of child processes. Read this thorough discussion of the limitations of ulimit, and check out this script to limit the time and memory used by a Linux program. I haven’t evaluated that script yet, and its approach (polling run time and memory usage of the process and all of its children, grandchildren, etc.) seems a bit brute-force. I hope that process groups and Control Groups (cgroups) can be used instead. Also check out this Red Hat documentation on the memory subsystem in Linux.
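As a small illustration of ulimit’s per-process scope, the sketch below lowers the virtual memory limit inside a subshell and then reads the limit back from a child shell. The 1 GiB value is arbitrary, and the example assumes a Linux shell where ulimit -v is supported. Each process ends up with its own copy of the same limit, so a job that starts many processes can still use several times that amount in total.

```shell
limit_kb=1048576   # 1 GiB expressed in KB; an arbitrary example value

# Run in a subshell so the lowered limit does not affect the current shell.
seen_by_child=$(
    ulimit -v "$limit_kb"
    # The child shell reports the limit that applies to itself alone.
    sh -c 'ulimit -v'
)
echo "limit seen by child process: ${seen_by_child} KB"
```

Nothing here adds up the memory of the parent and its children, which is the accounting that a mechanism such as cgroups would have to provide.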
It is conceivable that a compute node could have processes owned by multiple users.