Characterizing the FreeBSD File Tree
Rich Morin
Last month's article (The
FreeBSD Browser) said quite a bit about the purposes
of the browser, but very little about how it actually
works. This month, I would like to cover some
high-level design issues, while postponing showing off
my (rather hacky :-) implementation.
The hardest job that the browser performs is finding
a "closely related" set of files and manual pages. It
does this by applying some fairly naive graph
traversal logic to some (pre-defined and
dynamically-generated) linkage information. Let's
look at the linkage information first, then move on to
the graph traversal logic.
Linkage Information
The demo obtains its linkage information from a
variety of sources, including automated analysis of
system files, hand-edited XML files, and rule-based
links. The automated analysis is performed by a
relatively small amount of code, but it generates
hundreds of thousands of links. Hand editing adds a
few thousand more links, some annotations, etc. The
CGI script weaves this material together, adding some
(rule-based) links of its own.
Automated analysis is great for some things. Grabbing
file system links (hard and symbolic) is a piece of
cake. Parsing man pages for FILES and SEE ALSO
references isn't much harder. It's even possible,
with a bit of cheating, to pull "contrib" directory
references out of random makefiles. In short,
automated analysis provides a lot of leverage.
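To give the flavor of this, here is a minimal sketch of a
SEE ALSO scanner for man page source. The heuristics and
names are mine, for illustration; the browser's actual
analysis code (which presumably handles more cases) isn't
shown here:

    #!/usr/bin/perl
    # Sketch: collect SEE ALSO references from man page source.
    use strict;
    use warnings;

    my $file = shift or die "usage: see_also.pl man-page-source\n";
    open(my $fh, '<', $file) or die "can't open $file: $!\n";

    my ($in_see_also, %refs);
    while (<$fh>) {
        if    (/^\.S[Hh]\s+"?SEE ALSO"?/) { $in_see_also = 1; next }
        elsif (/^\.S[Hh]\b/)              { $in_see_also = 0; next }
        next unless $in_see_also;
        # mdoc cross references look like:  .Xr ls 1
        $refs{"$1($2)"} = 1 if /^\.Xr\s+(\S+)\s+(\S+)/;
        # plain-text references look like:  ls(1)
        $refs{"$1($2)"} = 1 while /([\w.+-]+)\((\d\w*)\)/g;
    }
    print "$_\n" for sort keys %refs;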
When automated analysis starts to break down,
hand-edited annotations can be used to fill in the
gaps. If the man pages don't mention particular files
or man pages, snippets of hand-coded XML can make the
missing connections. Unfortunately, although any
given bit of XML is easy to generate, producing
hundreds or thousands of snippets can be a
mind-boggling task. So, I have tried to keep the
number of hand-edited files to a minimum; the current
number is just short of a thousand...
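For concreteness, a hand-coded snippet might look something
like the following. The element and attribute names are
invented for illustration; the actual schema isn't shown
here.

    <item path="/etc/crontab">
      <link to="/usr/sbin/cron" type="used_by"/>
      <link to="man5/crontab"   type="man_page"/>
      <note>System-wide crontab; per-user tables live elsewhere.</note>
    </item>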
One trick I use is a simple-minded form of
inheritance. The tree of hand-edited files contains
sub-trees for all/all (all OS variants and
versions), fb/all (FreeBSD, all versions),
fb/42 (FreeBSD, version 4.2), etc. The
database loading script finds the most specific match
it can for each node in the file tree. A simple
"include" mechanism can be used to import more
abstract declarations into, say, an fb/42
annotation file.
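The lookup itself can be quite small. Here is a sketch of
the most-specific-match idea, assuming the annotation tree
is laid out as described above (the file naming is my
guess):

    use strict;
    use warnings;
    use File::Spec;

    # Return the most specific annotation file for a node,
    # trying fb/42, then fb/all, then all/all.
    sub find_annotation {
        my ($root, $os, $ver, $node) = @_;
        for my $dir ("$os/$ver", "$os/all", "all/all") {
            my $file = File::Spec->catfile($root, $dir, "$node.xml");
            return $file if -e $file;
        }
        return;    # no annotation for this node
    }

    # e.g., find_annotation("annot", "fb", "42", "etc/fstab")
    # falls back from annot/fb/42/... to annot/fb/all/...
    # to annot/all/all/...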
If a relationship can be encoded (e.g., by a bit of
run-time Perl code), we can avoid storing large
numbers of explicit links. This saves storage and
batch-mode processing time, but it may have an
offsetting cost at run time. Deciding which things to
encode in the database and which things to encode as
rules is a tricky proposition; the current balance is
workable, but I wouldn't claim that it is
optimal...
I have only implemented a few categories of rules to
date, but I expect to add more over time. If you
think of any rules that might be useful, please let me
know! Here, in any case, is a summary of the
browser's current rule base:
Manual Pages
Files in certain directories (e.g.,
/bin) commonly have manual pages in
well-known places (e.g., section 1). By guessing
(and then confirming) the existence of related
files, the CGI script can map files to manual
pages, and vice versa.
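A sketch of the guess-and-confirm idea (the
directory-to-section table is a plausible subset, not the
browser's actual table):

    use strict;
    use warnings;

    my %section_for = (
        '/bin'      => 1,  '/usr/bin'  => 1,
        '/sbin'     => 8,  '/usr/sbin' => 8,
    );

    sub guess_man_page {
        my ($path) = @_;
        my ($dir, $name) = $path =~ m{^(.*)/([^/]+)$} or return;
        my $sec = $section_for{$dir} or return;
        # guess the page's location, then confirm that it exists
        for my $cand ("/usr/share/man/man$sec/$name.$sec.gz",
                      "/usr/share/man/man$sec/$name.$sec") {
            return $cand if -e $cand;
        }
        return;    # no confirmed manual page
    }

    # e.g., guess_man_page("/bin/ls") should confirm ls(1)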
Source and Build Directories
FreeBSD's build tree has a (mostly :-) regular
mapping to the resulting system. The exceptions are
few enough that some table-driven code suffices to
map many items to their source and build
directories.
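The table-driven code might look roughly like this (the
prefix table is illustrative; the real exception handling
is more involved):

    use strict;
    use warnings;

    # installed prefix => source directory (illustrative entries)
    my @src_map = (
        [ qr{^/bin/}      => '/usr/src/bin/'      ],
        [ qr{^/sbin/}     => '/usr/src/sbin/'     ],
        [ qr{^/usr/bin/}  => '/usr/src/usr.bin/'  ],
        [ qr{^/usr/sbin/} => '/usr/src/usr.sbin/' ],
    );

    sub source_dir {
        my ($path) = @_;
        for my $pair (@src_map) {
            my ($pat, $src) = @$pair;
            if ($path =~ $pat) {
                (my $rest = $path) =~ s/$pat//;
                return $src . $rest;
            }
        }
        return;    # exception: needs a table entry or a hand link
    }

    # e.g., source_dir("/usr/bin/which") -> "/usr/src/usr.bin/which"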
Symbolic Links
Although automated analysis is used to find symbolic
links, a rule is used to assess the interesting
implications of a given link. Thus, if a symlink is
(or points to) part of an item's path name, the
result is an alternate path name for that item.
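A sketch of that alternate-path rule: given a table of
known symlinks, substituting a link for its target (or vice
versa) at the front of a path yields another name for the
same item. The table entry here is only an example:

    use strict;
    use warnings;

    # link path => target, gathered by the automated analysis
    my %symlink = ( '/home' => '/usr/home' );

    sub alternate_paths {
        my ($path) = @_;
        my @alts;
        while (my ($link, $target) = each %symlink) {
            push @alts, "$target$1" if $path =~ m{^\Q$link\E((?:/.*)?)$};
            push @alts, "$link$1"   if $path =~ m{^\Q$target\E((?:/.*)?)$};
        }
        return @alts;
    }

    # e.g., alternate_paths("/home/rich/notes") -> "/usr/home/rich/notes"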
Graph Traversal
The CGI script performs a "constrained, breadth-first
search" of the links that are connected to the initial
item. The initial item is used to seed some Perl
hashes, which then drive a single-level search. As
each new item is "discovered", assorted
information about the associated link is added to the
hashes. This information is used to constrain the
search, explain the logic to humans (via "Linkage
Trail" pages), etc.
For example, each link type has an associated
"weight" value. The initial item has a weight of 1.0,
but some link types have weights of less than this. As
links are added to the linkage trail, the link weights
are multiplied together. If the product falls below a
threshold, no further searching is done.
An even simpler mechanism, based on the number of
links traversed, keeps the script from chasing its
tail around link cycles, etc. The perceptive reader
will note that this is borrowed from the "time to
live" technique that the Internet uses to keep packets
from bouncing around the network forever.
Some items are worth reporting, but do not qualify for
further investigation. If the "follow" flag on a link
is false, no further searching is done in that
direction. A man page's FILES or SEE ALSO references
to a preceding item fall into this category.
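Putting these pieces together, the search loop might be
sketched as follows. The field names, the weight threshold,
and the hop limit are all made up for illustration; only
the overall shape follows the description above:

    use strict;
    use warnings;

    my $WEIGHT_FLOOR = 0.1;   # stop when the weight product drops below this
    my $MAX_HOPS     = 5;     # "time to live": stop after this many links

    # $links_from->($item) is assumed to return a list of hashes like
    #   { to => $other, weight => 0.5, follow => 1, type => 'see_also' }
    sub neighborhood {
        my ($start, $links_from) = @_;
        my %seen  = ( $start => { weight => 1.0, hops => 0 } );
        my @queue = ( $start );

        while (@queue) {
            my $item = shift @queue;
            my $info = $seen{$item};
            for my $link ( $links_from->($item) ) {
                my $next = $link->{to};
                next if $seen{$next};                # already discovered
                my $w = $info->{weight} * $link->{weight};
                next if $w < $WEIGHT_FLOOR;          # this trail has died out
                my $hops = $info->{hops} + 1;
                $seen{$next} = { weight => $w, hops => $hops };
                push @queue, $next
                    if $link->{follow} and $hops < $MAX_HOPS;
            }
        }
        return \%seen;
    }

In the real script, each entry also has to record which
link led to it, so that the "Linkage Trail" pages can
explain the reasoning to the reader.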
In summary, the browser searches a local "neighborhood"
of the original item, stopping when all of the trails
have "died out". By adding links, tuning the weights
and tweaking the code, I have been able to increase
the number of "interesting" results, while keeping the
"noise" (i.e., false positives) under control.
Future Directions
I have recently started to add Mac OS X coverage to the
browser and may also add NetBSD to the mix. Covering
a "family" of similar operating systems allows me to
test (and extend) the flexibility of the design, while
staying in a relatively constrained problem domain.
Once this family is under control, I can consider
adding other operating systems to the coverage.
With few exceptions, the automated analysis stops at
the file level. By adding parsing rules for assorted
file types (e.g., control files, man pages, source
files), I could allow the browser to explore more
types of relationships. At the same time, adding some
file-level browsing capability would allow users to
follow this information into the files themselves.
I have also been thinking about how to analyze,
characterize, and represent "subsystems" such as the
mail and print spoolers. Because this problem brings
in the possibility of handling the entire Ports
Collection, some caution is warranted in approaching
it!
The browser's current (web-based) design limits it to
covering "generic" operating system information. By
creating support for "local clients" (e.g., via XML),
I could allow a user to examine his or her own system,
using the browser's information resources. This would
be a significant step towards the implementation of a
true Meta
system.