.ds AU Gary Perlman
.CW "Gary Perlman"
.(T "Notes on the History and Design of UNIX|STAT
Gary Perlman
School of Information Technology
Wang Institute of Graduate Studies
Tyngsboro, MA 01879 USA
(617) 649-9731
.)T "History of UNIX|STAT
.ls 1
.bp 1
.!T
.P
In this paper, I discuss some of the issues in the design of the
UNIX|STAT programs.
UNIX|STAT is a collection of programs running on the UNIX operating system.
The programs are designed to be easy to use
in the sorts of data analysis problems
common in analyzing experimental psychological data.
Because data analysis is a common activity in diverse fields,
the programs have found use in a variety of contexts.
To date, the programs are being distributed to over 100 UNIX sites
in 11 countries on 4 continents,
and with their recent release on the USENIX distribution tape,
their use will probably increase.
.XX US Canada England France Switzerland Belgium Australia Japan Israel SaudiArabia Greece
Most UNIX|STAT users have access to other, more powerful,
statistical packages,
but for the cases where UNIX|STAT programs are applicable,
the programs seem to be preferred because they are so easy to use.
The reasons for their ease of use are discussed in later sections.
.MH "History
.P
UNIX|STAT began when I was a psychology graduate student
at the University of California, San Diego.
We had Version 6 UNIX running on a DEC PDP 11/45 with limited memory.
As a psychological laboratory, we had data to analyze,
but we had no analysis programs on UNIX.
Occasionally, we got a tape from a site with their latest programs,
but they tended to be too big, too slow, too clumsy,
or not do what we wanted them to do.
I tried to find suitable statistical packages, with no success.
We even got a standard large statistical package, BMD-P,
but its programs were not designed for our little PDP 11/45.
So huge were they, with all their immense power,
that to load a program and do all the necessary overlaying of subroutines
required two minutes to get the mean of one number.
We were not happy.
.P
I wrote the first version of
.T desc
about that time.
It was late 1978.
.T desc
was a program to describe a single distribution of data
with summary statistics, frequency tables, and histograms.
I was tired of writing a
.T mean
program every time I wanted to do the simplest data analysis,
so I tried to write a general descriptive program once and for all.
.T desc
was not welcomed with much fanfare;
I had to go door to door to get people to use it.
Evidently, people thought it would be easier to write their own programs,
or do the analyses with a calculator, than learn to use my program.
I searched personal binary directories to see if people
were making their own programs to do sub-parts of
.T desc .
Sure enough, people had programs, even shell scripts,
to compute means and the occasional standard deviation.
Their programs were difficult to use, with no documentation
and silly conventions like having to end the input with a minus one.
.T desc
allowed free format input and read until the end of a file,
provided over a dozen options, and its output was easy to read.
After about six months,
.T desc
was in moderate use by part of our lab.
I was encouraged to see people other than myself
berating a third party for writing their own programs to do what
.T desc
did.
.P
After
.T desc ,
I wrote
.T pair ,
a program for paired data analysis (a common design in psychology).
.T pair
was based on a program written by Don Gentner of our lab.
I liked his program, but not its output format, which I modified.
.T pair
followed format conventions similar to those of
.T desc ,
and people familiar with
.T desc
found
.T pair
that much easier to use.
Over the years since then, additional statistics
and a simple scattergram routine have been added to
.T pair .
.P
Both
.T desc
and
.T pair
read from the standard input and write to the standard output.
The main reason for this was that it is easier than opening files.
Looking back on this convenient decision,
I see it as critical to the design of the rest of the package.
The standard UNIX philosophy is to create modular programs
that can be connected via pipelines with existing tools in novel ways.
I knew of this philosophy, but did not follow it for any particular reason.
As other UNIX|STAT programs developed,
particularly ones for transforming data,
the philosophy dominated the design.
.P
We tended to keep our data in matrix type files
with some uniform number of blank separated fields per line.
We used these as master data files
from which we extracted the columns we were interested in.
Jay McClelland wrote a program called
.T colex
that extracted columns from such files.
.T colex
was indispensable, and Jay added features
to extract columns based on conditions
by allowing analysts to set minimum and maximum
allowable values for each column.
We wanted to be able to add, subtract,
and do other transformations on the values of the columns
to create new files of data,
so I wrote a program called
.T trans ,
which did an inadequate job using a simple syntax.
We needed some data massager that could be handed complex expressions.
.P
In the summer of 1980, I was teaching myself about compilers
and at the same time learned of a "compiler-compiler" called
.T yacc .
Based on our need for a data massager and my interest in
.T yacc ,
I wrote
.T dm ,
now called the data
.I manipulator
for public consumption.
.T dm
allows an analyst to write a series of C-like expressions
for conditional extraction of algebraic combinations of columns
in a master data file.
With
.T dm ,
.T colex
and
.T trans
became immediately obsolete,
especially after Jay helped add
.T dm "'s"
string handling capabilities.
It was not until a year later that we found out
that there was a program called
.T awk
that could do all that
.T dm
did, and more.
Even so,
.T dm
held its own because it was more convenient for most of our needs.
.P
After
.T dm
was around, our view of data analysis changed
to that of a series of transformations followed by a specific analysis.
.ce
cat data | transformations | analysis
Data begins in a master data file,
is sent via a pipe to a series of transformations (e.g., by
.T dm )
followed by a pipe into an analysis program (e.g.,
.T desc ).
Jay McClelland's concept of having all the data in a master data file,
along with a set of transformation routines,
prompted him to write the original version of
.T abut ,
on which mine is based.
.T abut
helps construct master data files while
.T dm
takes them apart.
Dave Rumelhart wrote a program called
.T mc
(for Make Column) to re-shape data files;
.T maketrix
is based on this.
Later, the
.T transpose
program was added to the list of transformation tools.
We were using files like these for storing string matrix data.
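.P
As a concrete (and entirely hypothetical) illustration of the model,
suppose a master data file
.T scores
holds one line per trial,
and we want summary statistics
for the difference between its second and third columns.
If we assume that
.T dm
expressions refer to column values by names like x1, x2, and so on,
the whole analysis is one pipeline:
.ce
cat scores | dm "x2-x3" | desc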
.P
About the same time, Craig Will was working on a front end
to a standard statistical package's anova program, BMD-08V,
the kind with unforgiving fixed column input formats
(all UNIX|STAT programs use white-space separated fields).
I thought this was noble,
but the design of such programs was not that easy to fix.
Besides the input formats, the control languages were abominable,
especially the parts that specify the experimental design.
I decided that there must be a natural way
to enter the data from a design,
one that would make it impossible to err
while making the data easier to describe.
.P
Then, Jay McClelland wrote his
.T dt
(for Data Tabulation) program
to get cell means using an innovative input format.
Each datum is accompanied by a set of identifying strings
that are used by
.T dt
to decide what cell a datum belongs to.
For example, if you found that a department store named ABC
sold large shoes for $10.95, you would enter a line like:
.ce
ABC large shoe 10.95
This is the essence of the relational database model.
.P
This concept of self identification was followed in
.T anova ,
in which each datum is preceded by a series of labels,
one for each factor,
identifying the conditions of the factors
under which the datum was collected.
A file of data like this is self documenting, and
.T anova
is able to infer the design from its format
without any interactive specification by the user.
The
.T regress
program was designed to accept its natural input format too:
a file of N columns implies an N variable regression.
This is easy to explain to even the most computer naive analysts.
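.P
For example, in a hypothetical two factor design
(the exact labeling conventions are spelled out in the
.T anova
documentation), a few lines of input might read:
.ce
drug-A male 17
.ce
drug-A female 21
.ce
drug-B male 12
From lines like these,
.T anova
can infer the factors and their levels
without any further specification.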
.P
By this time, the package had reached a critical mass,
at least for our laboratory's needs.
Someone suggested I share these programs with other psychologists,
so I sent a general description of them to a journal
in which psychologists share technical information about
.I doing
research as opposed to the
.I results
of research.
On Greg Davidson's suggestion,
I added more extensive error checking to the programs
before releasing them to the public.
.MH "Interactive Use
.P
The UNIX shell is the command language for UNIX|STAT.
Using the standard input and output in the programs
allows joining them together with pipelines,
and the shell provides the command language to do it.
Using the shell is advantageous because it saves analysts learning time.
If analysts are familiar with the shell and some UNIX programs like
.T ed ,
.T sort ,
.T mv ,
.T rm ,
.T cp ,
and others, then they will be on familiar ground.
If not, they can be introduced to these utilities in a familiar context,
and they will be able to use this knowledge in other contexts.
The main point of having all transformation and analysis routines
as stand alone programs, with the shell as the command language,
is that they can be used with UNIX utilities
by anyone familiar with UNIX.
The opposite is also true:
some of the transformation routines
are used in contexts other than data analysis.
.P
The
.T io
program is an intelligent controller and monitor
of input and output in the shell,
based on Don Norman's notions about the flow of data
through pipelines and the need to monitor it.
Monitoring the progress of programs is important
when complex calculations are being performed,
because they can take a long time, sometimes several minutes.
.T io
allows an analyst to get feedback
about how much of a source file has been exhausted,
or how many blocks of data have passed through a pipe.
With a complex pipeline of transformations,
an irritating error is to forget to specify an input file.
In addition to
.T io
giving feedback of progress,
most of the UNIX|STAT programs inform the analyst
when no input redirection is specified
and data are to be read from the terminal keyboard.
The programs assume that data entry from the keyboard
is the exception rather than the rule.
.T io
is also a safety minded replacement for input and output redirection
because it does not allow analysts
to accidentally overwrite the contents of files.
.P
UNIX|STAT assumes that user efficiency is more valuable
than computer efficiency.
The cost paid by having a slow algorithm
(perhaps because a free-format Ascii data file is being processed)
or larger than necessary data files (because they are self documenting)
is easy to compute.
The cost in wasted human resources is more difficult to compute:
increased command specification time,
increased probability of error and increased correction time,
and the high cost of undetected errors.
Rather than count the time for a program to run,
it is more reasonable to count the time an analyst spends
constructing the commands to run the program
and interpreting the results.
These ideas are most evident in the design of the
.T anova
program,
which removes many time consuming and error prone tasks from the analyst.
.MH "Comparison With S
.P
I am often asked to compare UNIX|STAT to the S data analysis system,
developed at Bell Labs by Rick Becker and John Chambers.
UNIX|STAT is entirely independent of S
and was developed without knowledge of it.
The major difference is that UNIX|STAT was developed
for working with the sorts of experimental psychological data
we in our lab at UCSD were accustomed to using,
while S was developed as a general data analysis system,
and so offers a more comprehensive set of functions.
The design philosophies of the two systems
are similar in some ways and different in others.
.P
UNIX|STAT is a set of programs that are
.I added
to the existing set of UNIX programs.
They all can read and write using pipes,
and so can be used with most other UNIX utilities,
using the facilities provided by the UNIX shell.
For example, if you wanted to sort the data in a data file,
you would use the UNIX
.T sort
program, not one provided by UNIX|STAT.
Or if you wanted graphical output, you would use the standard UNIX
.T plot
and
.T graph
programs (though these could benefit from a high level front end).
It is possible to use UNIX utilities with UNIX|STAT programs
because all assume Ascii format files
with data separated by white space (spaces, tabs, and sometimes newlines).
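.P
For example, the selection step of an analysis can come from a stock
UNIX tool.
Using the hypothetical store data shown earlier
(and again assuming the
.T dm
expression x4 simply echoes the fourth column),
describing the prices of shoes is one pipeline:
.ce
grep shoe sales | dm x4 | desc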
.P
S is a special environment for doing data analysis,
separated from the UNIX environment.
Its programs use a special data format, so, in general,
S functions can not be used with UNIX utilities,
nor can standard UNIX utilities be used with S data sets.
One result of this is that S needs to have
its own versions of utilities found in UNIX.
Another is that it is, in general, difficult at best
to use S functions outside the S environment.
.P
Both systems are extensible,
though I suspect adding to a set of UNIX programs is easier.
Both systems offer programming capabilities:
S has its own macro definition language
while UNIX|STAT depends on the shell's programming language.
Using one over the other depends on the needs of the analyst;
certainly, S's set of statistical functions is much larger.
Concerning efficiency, my impression,
without benchmarks or controlled data,
is that the UNIX|STAT programs are faster and easier to use
for small to medium data sets, especially for one shot analyses.
For larger data sets, S's matrix manipulation primitives
seem to give it an advantage that makes up for its slow startup time.
Concerning portability and related areas,
the UNIX|STAT programs use 300K bytes for the C source files,
and 350K bytes for the executables,
with programs averaging 15K bytes.
I am unsure about the requirements for S,
but I know them to be much larger.
The UNIX|STAT programs run on UNIX systems
dating back to Version 6 UNIX
(of course, all subsequent systems can run them),
and on computers as small as PDP 11/23's without problem.
.MH "Recent Developments
.P
After joining the faculty of the Wang Institute,
I restarted development on UNIX|STAT.
After a few hundred hours of work,
the result is a much better set of programs
that current UNIX|STAT users should consider reordering.
.LH "Improvements
.P
The first focus of development was to produce a better quality product:
to improve the portability, reliability, and usability of the programs.
.PH "Option Standards
The option parser
.T getopt
is used in all the programs taking options,
to make the syntax more consistent.
Several commands had options renamed or the order of operands changed
to make the package internally consistent.
Few of the programs read or write files;
where possible, programs deal only with the standard input and output.
.PH "Double Precision
All computations are now done in double precision,
so the results will be less susceptible to rounding errors.
.PH "Probability of F-Ratio
A better approximation to
.I pi
has removed the anomalous -0.000 probabilities.
.PH "Error Messages
The error checking is more rigorous throughout the package,
and the error messages are standardized and more diagnostic.
.PH "Random Number Seeding
The random number generation has a better seeding procedure.
.PH "Runtime Optimization
Much of the input processing is more efficient.
.PH "Exit Status
All the programs return a zero exit status after a successful run,
and non-zero otherwise,
following the conventions used throughout UNIX.
.PH "Version Control
SCCS (Source Code Control System) version strings
have been installed in all the routines.
The second quarter 1985 release of UNIX|STAT
has all the modules set at version 5.0.
.PH "MSDOS Port
With the aid of Fred Horan at Cornell,
the complete package has been ported to MSDOS on the IBM PC
using the Lattice C compiler.
To make this possible, some non-portable features of UNIX
were replaced with simpler routines, at no discernible loss.
.PH "New Documentation
All the documentation has been updated.
.LH "Additions
.P
Some new functionality was added for Version 5.0,
and more can be expected in the future;
in particular, I am working on a general
cross-tabulation and chi-square program.
.PH "Partial Correlation
Thanks to the help of Caroline Palmer, the
.T regress
program now allows you to analyze
the contribution of individual variables to the regression.
Because no one ever saw the point of doing N simultaneous regressions,
.T regress
now predicts the first column's data with the rest.
.PH "One Way ANOVA & T-test
The
.T oneway
program was added to simplify the comparison of data from different groups.
.T oneway
performs a one-way analysis of variance, allowing unequal cell sizes.
It can also be called as the
.T ttest
program, for which (in the two variable case)
it prints its ANOVA table in a simpler format.
.PH "Data Manipulation Routines
New versions of
.T colex
and
.T trans
were added as alternatives to
.T dm
for systems that cannot compile
.T dm .
.T colex
might still be used instead of
.T dm
because it is faster at its limited task.
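.P
For example, assuming
.T colex
takes the numbers of the columns to extract as its arguments,
as its name suggests
(the file name here is hypothetical),
pulling two paired columns out of a master data file is simply:
.ce
colex 1 3 < master | pair
Conditional selection and algebraic combinations of columns
still call for
.T dm .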