Automating IP Host Data Collection on a LAN

Joe Nasal

Issue #67, November 1999

Using Linux and the ACEDB database system makes easy work of local area network management.

Linux's agility and power inspires the efficient design and implementation of specialty tools for specific tasks. On a data network, engineers and administrators appreciate the ease and flexibility with which Linux can be implemented as a platform for data collection, analysis and processing. In this article I'll demonstrate techniques for implementing ACEDB (an object-oriented database) and a few other tools to provide comprehensive access to administrative data that you might already be collecting from your network.

The management of TCP/IP local area networks often entails an enduring struggle to control address space. Workstations, servers and managed subsystems (routers, firewalls, etc.) are all added to and subtracted from the network as the shape of the organization and the flow of data changes. However, remembering which machine is assigned to a particular IP address is not always as simple as keeping an up-to-date list of IP address to node assignments. Sometimes network architecture changes without proper documentation even under the best of circumstances—DHCP, BOOTP, ubiquitous SNMP and managed repeater ports included. Yet, keeping track of IP address assignments is important. Since the logical address space of an IP subnet is limited to a finite number of usable addresses, recycling IP addresses is a must. It's also useful to know the kind of machine responsible for a particular instance of packet generation when performing data-analysis on a LAN (metrics collection, troubleshooting, security audits, etc.). Automating the collection of and access to this information would go a long way toward reclaiming lost or unknown administrative data.

If you haven't yet discovered Arpwatch (available via anonymous ftp at ftp://ftp.ee.lbl.gov/arpwatch.tar.Z), please allow me to introduce you. From Arpwatch's man page, “Arpwatch keeps track of Ethernet and IP address pairings. It syslogs activity and reports certain changes via e-mail.” Arpwatch uses the libpcap API (ftp://ftp.ee.lbl.gov/libpcap.tar.Z) to listen for and capture ARP (address resolution protocol) requests and replies on a local Ethernet interface.

RFC 826 introduces the origins of the address resolution protocol, but you may find a more up-to-date description in your favorite networking handbook. For the purposes of this column, ARP is the method by which any machine on a logical TCP/IP subnet determines the Ethernet address (sometimes called a hardware or MAC (mandatory access control) address) of any other machine on the same logical subnet. ARP with respect to unicast data communications on a LAN provides a “sending node” with the Ethernet address of the “receiving node”--information essential to the successful completion of a communications transaction. The sending node generates an ARP request, a broadcast heard by all machines on the network asking for the machine assigned a specific IP address to respond with its Ethernet address. An ARP reply comes from the machine owning the particular IP address and contains its own Ethernet address as an answer to the query.

Arpwatch listens to the ARP conversations taking place between the machines on a network, extracts Ethernet address, IP address pairings from the dialogue, and stores the results with a timestamp in a local table. As ARP happens over time, this table grows to represent an accurate list of the available nodes on a network. Arpwatch cross references a machine's Ethernet address against a list of vendor descriptions of network interfaces, formats a report of its findings, and finally, e-mails the report or prints it to STDERR, like this:

From: arpwatch (Arpwatch)
To: root
Subject: new station (node.yourdomain.com)
   hostname:  node.yourdomain.com
   ip address:  192.168.10.10
   Ethernet address:  0:a0:24:56:c4:3a
   Ethernet vendor:  3com
   timestamp:  Sunday, May 9, 1999 11:16:57 -0400

Arpwatch only processes information collected from the same logical network that the interface it listens on participates in. This is true even if the LAN has been designed so that different subnets share the same physical wire for data transmission. Collecting data from multiple logical subnetworks requires the execution of separate instances of Arpwatch, one for each logical subnet, each tied to an autonomous network interface. Since Linux allows multiple network interfaces to reside in the same system (reference the “Mini-HOWTO on using multiple Ethernet adapters with Linux”), this can be conveniently accomplished in a single box. A modest PC with one Intel 486 DX-2 66 processor and an ISA bus can easily collect data from several busy subnetworks. Conversely, since Linux and Arpwatch are both readily obtained, multiple Arpwatch collection stations, one for each subnetwork, can be established if preferred.

Typically, Arpwatch is configured to distribute the information it collects via sendmail. This functions well as an alert mechanism with respect to changes within or additions to the network infrastructure, yet the utility of an Arpwatch report is therefore limited to the expression of this information in a static e-mail message. Sure, these reports can be searched and archived, but wouldn't it be useful to process and funnel the information collected by Arpwatch into a database? In this way Arpwatch's reports could be supplemented with a description of the node's responsible individual, its physical location, its operating system, its primary function or any number of other useful attributes.

The ACEDB database system was created by Richard Durbin and Jean Thierry-Mieg to provide a flexible and dynamic storage medium for a complex data set in support of their biological research. ACEDB is flexible in that it allows for a wide variety of different kinds of data to be stored. It is dynamic in that the structure of the database is easily modified as the data either comes to be understood differently over time or as it changes shape, such as through the addition of a new attribute or a new data type. ACEDB is also object-oriented, meaning that organization of data within the system preserves the real-world uniqueness and autonomy of each individual data object. Data within an ACEDB database is easily accessible because it is not sliced up into constituent parts and stored within relational tables. This provides for more intuitive access to data than is typical within many classic relational or object-relational database management systems. Don't get too hung-up on ACEDB's “object-oriented” tag. ACEDB's flair for objects simply means that it is easy to create and understand complex relationships between different kinds of data.

ACEDB is easily implemented within Linux. You'll find the latest distribution in the index at ftp://ftp.ncbi.nlm.nih.gov/repository/acedb/ace4/. Fetch a copy of the INSTALL script and examine the NOTES and README files for news and installation instructions. Download binary distributions of the ACEDB database and server, each of which has been precompiled for libc6.

Like Linux, ACEDB has enjoyed the benefits of collaborative engineering and development that are characteristic of an open-source software product. One of the major contributors to the ACEDB project is Lincoln Stein, perhaps best known for his work as the author of the CGI.pm module for Perl-based CGI scripting. Mr. Stein has created Jade, a Java-based interface between ACEDB and the Java programming language. He's also written an award-winning, object-oriented Perl interface for ACEDB called AcePerl. AcePerl provides functionality for connecting to and updating ACEDB databases, performing queries, fetching objects and other administrative tasks. To compliment AcePerl Mr. Stein has created and released AceBrowser, a small set of CGI scripts which work with AcePerl to provide for the easy creation of a web-browsable interface to any ACEDB database. In this column, we'll use both AcePerl and AceBrowser to build, view and modify the database of IP host data collected by Arpwatch. You'll find links to the latest versions of AcePerl and AceBrowser at http://stein.cshl.org/. Download these distributions and make sure your local Perl is up to date (perl -v should report version 5.004_04 or higher). One goal of this project is to make the data stored in the ACEDB database available via the Web, so you'll also need a web server. Of course, the Apache web server is freely available from http://www.apache.org/. Finally, AceBrowser references CGI.pm, so make sure you've got version 2.46 or higher installed. CGI.pm is available at http://stein.cshl.org/WWW/software/CGI/.

We're going to use Linux to build a single system that functions as an Arpwatch collection station (a platform for storing the collected data in an ACEDB database), an ACEDB server and a web server for providing natural access to this data from anywhere on the network.

Install the ACEDB distribution via the INSTALL script. ACEDB comes with both a graphical (xace) and a text-based (tace) front end, each of which provide easy access to the system. Accessing an ACEDB database is as simple as running the acedb script from the top-level directory of an ACEDB installation and using the GUI to navigate through the data structure. Though ACEDB is powerful and highly customizable, a few other things must be known before getting started.

In an ACEDB installation, content of the files in the /wspec directory defines the ACEDB environment—everything from the appearance of xace's GUI interface and the allocation of cache space on the local disk to the internal definition of data structures. Rely on the default values presented in these files and verify that your user name is included in the passwd.wrm file in the list of authorized users. The key to understanding and getting started with ACEDB comes from an examination of the models.wrm, file also known as the “Model File”. The Model File is a template that defines both the structure of objects within the database and relationships that exists between objects. Each object in an ACEDB database has both a class and a name. An object's class describes the type of object that it is, and an object's name serves to uniquely identify it. Using ACEDB jargon, I could describe myself as an object of class “person” with the name of “Joe”. Or, if I were building a personnel database for a company, it might be more useful to describe myself as an object of class “Engineer” with a name of “NasalJS”, for example. I could make the object “NasalJS” more useful still by attaching a tag called “Full_name” to it with a value of “Joseph S. Nasal” and a tag called “Birthday” with a value of “06June1969”, so that people who use my database will know some additional information about “NasalJS”. ACEDB objects are built like this with a tree-like hierarchy of tags and values. The object file's content, called the schema, represents the generic structure (or form) of each object class within an ACEDB database including the tags used to build objects and give them meaning.

The schema can be complex (as it is with the databases of genome sequences for which ACEDB was originally written) or simple depending on the data and the requirements of the application. We'll begin with a very simple schema to represent the data collected via Arpwatch. We can use the format of an Arpwatch report to suggest the structure of our ACEDB objects and create a schema that looks like this:

?Host   host_name UNIQUE Text
   ip_address UNIQUE Text
   ethernet_address current_ea UNIQUE Text
                         previous_ea Text
   ethernet_vendor  current_ev UNIQUE
      ?EtherType XREF participating_nodes
                         previous_ev Text
   timestamp current_ts UNIQUE Text
                  previous_ts Text
                  delta_ts Text
?EtherType   participating_nodes ?Host XREF
   ethernet_vendor

This schema defines two classes of objects, Host and EtherType (classes are identified using the syntax ?Classname). In the Host class, the host_name tag has been constructed to contain a text value which is further described by the capitalized directive UNIQUE. Therefore, in an object of class Host the host_name tag will contain text and have no more than one data value. Similarly, the ip_address tag also utilizes the UNIQUE directive and may contain only data of type Text. Notice that the tags ethernet_address, ethernet_vendor and timestamp are each complex data types which are further described by their respective subtags.

An ACEDB database may utilize other data types (such as Int, Float and DateType, for example) but we'll keep things simple and treat all of the Arpwatch data as plain text. Directives other than UNIQUE are also available. In this example, the XREF directive is used to establish a cross-reference relationship between the current_ev subtag within the class Host and the participating_nodes tag within the class EtherType. The effect of this will be to build lists of Host objects which share a common network interface vendor.

Listing 1

To test this schema, we can fire-up xace via the acedb script and load a flat file containing sample data, as shown in Listing 1. In this example file, objects are separated by blank lines, and each tag of every object occupies its own row. Notice that when providing a value for complex data types, only the rightmost subtag needs to be specified. To test the schema, save this data in a file with an extension of .ace in the top-level directory of the ACEDB installation. Edit the contents of the layout.wrm file in the /wspec directory to contain two lines, the first being Host and the second EtherType, so that xace will display these sample objects by default in the GUI. Start up xace and import the data file by selecting “Edit” and “Read an .ace file”. After opening the data file and reading in the data, use the GUI to navigate through the data set. Observe how objects are presented in accordance with the template established in the model file, including the results of the cross-reference relationship between objects of class Host and objects of class EtherType.

In the live system, we'll funnel Arpwatch records directly into the database using AcePerl. The first step is to capture Arpwatch's output to a file. Arpwatch's man page reads, “Starting Arpwatch with the -d flag inhibits forking into the background and e-mailing reports. Instead, they are sent to STDERR.” Using redirection, we can save Arpwatch's reports to a file, like this:

arpwatch -d > report.data

Listing 2

Perl is the perfect tool for extracting record data and building the database. In Listing 2, you'll find code which utilizes Mr. Stein's AcePerl modules to connect with an ACEDB server listening at port 20000100 on the local machine. After connecting with the server, the script drops into a loop of fetching Arpwatch records from the data file. When a complete record has been built, the last “else” condition in the while loop calls the “process” subroutine to update the database. First, the subroutine removes unnecessary whitespace and checks to see if $host_name contains the string <unknown> (if Arpwatch is unable to resolve the name associated with a node's IP address it marks the record <unknown>). Since we're going to use the data in $host_name to name ACEDB Host objects, the code translates <unknown> into a unique identifier based upon timestamp and the current Ethernet address.

Next, the data in $host_name and AcePerl's fetch method are used to see if a corresponding Host object exists. If it doesn't, a new Host object is created using AcePerl's new method, and the object is built by adding value to its tags with AcePerl's add method. Finally, the new Host object is written to the database with a call to AcePerl's commit method.

If it turns out that Arpwatch is only reporting some new information about an existing Host object, then process uses AcePerl's add and commit methods to simply update the existing object with the new data. Finally, the new record flag is reset and the subroutine exits back into the while loop to collect more record data.

When all of the records in the file have been objectified and added to the database, the script pauses for five minutes, then utilizes Perl's native seek function to reset the end-of-file error condition. This trick allows the code to follow and process the growing data file (emulating UNIX's tail -f command) as Arpwatch continues to collect more records.

As easy as that, we've built an ACEDB database by turning Arpwatch reports into objects and processing them with AcePerl. However, we've barely touched upon ACEDB's power for data representation and AcePerl's flexibility as an API for building and manipulating ACEDB objects. With a little bit more coding, for example, we could call methods within the SNMP Perl module to probe and collect data from SNMP-aware devices on the network and add this data to Host objects as they are built or updated. Or, we could add subclasses to the Host object based upon machine type (server, router, firewall, etc.). Any functional modifications which require changes to object definitions within the schema are easily handled by ACEDB. Go ahead and make the change and then let ACEDB automatically update the structure of existing objects within the database.

The last step is to make this database of Arpwatch records available via the Web. Mr. Stein's AceBrowser CGI scripts provide an easy solution. Unpack and install AceBrowser in a subdirectory of your web server's CGI directory. AceBrowser comes with a set of scripts for fetching, displaying and interacting with text and static GIF images stored in an ACEDB database. AceBrowser code is provided as an excellent starting point for creating custom web interfaces to any ACEDB database. However, AceBrowser's model-independent scripts can be used to display our data right out of the box, without modification. Follow the instructions for supplying the information in the site-specific global definitions (the location of the ACEDB server, the HTML stylesheet, etc.) and fire-up your web browser. I've used the simple search script to discover the Host object presented in Figure 1. The object is displayed using AceBrowser's tree script and is represented on-screen to mirror the object's structure as defined in the Model File. Notice that the data in the current_ev tag is represented as a hotlink. The cross-reference relationship defined in the schema creates lists of Host objects which share a common network interface vendor. In this example, the hotlink leads to a browsable list of hosts which use network interfaces manufactured by 3Com Corporation.

With ease and only a little bit of coding, we've used Linux and the ACEDB database to create a powerful tool for the collection of IP administrative data on a LAN. Linux and ACEDB are a good match because they are both flexible enough to allow for the invention of specialized databases on the fly and powerful enough to ensure data integrity. Lincoln Stein's AcePerl modules provide a powerful and friendly API for interacting with an ACEDB database, and his AceBrowser scripts are ready-made to interface with any ACEDB database over the Web.

So what are you waiting for? What can you make Linux and ACEDB do?

All listings referred to in this article are available by anonymous download in the file ftp.linuxjournal.com/pub/lj/listings/issue67/3517.tgz.

Joe Nasal (joe.nasal@usa.net) lives to hack Linux and networks and often gets to do both as a software/systems engineer. However, he needs to escape from the lab more often to spend time with his beautiful wife, Kristin, who is eternally patient.