.ds AU Gary Perlman
.CW "Gary Perlman"
.(T "Notes on the History and Design of UNIX|STAT
Gary Perlman
School of Information Technology
Wang Institute of Graduate Studies
Tyngsboro, MA 01879 USA
(617) 649-9731
.)T "History of UNIX|STAT
.ls 1
.bp 1
.!T
.P
In this paper, I discuss some of the issues in the design of the
UNIX|STAT programs.
UNIX|STAT is a collection of programs running on the UNIX operating system.
The programs are designed to be easy to use
in the sorts of data analysis problems
common in analyzing experimental psychological data.
Because data analysis is a common activity in diverse fields,
the programs have found use in a variety of contexts.
To date, the programs are being distributed to over 100 UNIX sites
in 11 countries on 4 continents,
and with their recent release on the USENIX distribution tape,
their use will probably increase.
.XX US Canada England France Switzerland Belgium Australia Japan Israel SaudiArabia Greece
Most UNIX|STAT users have access to other, more powerful,
statistical packages,
but for the cases where UNIX|STAT programs are applicable,
the programs seem to be preferred because they are so easy to use.
The reasons for their ease of use are discussed in later sections.
.MH "History
.P
UNIX|STAT began when I was a psychology graduate student
at the University of California, San Diego.
We had Version 6 UNIX running on a DEC PDP 11/45 with limited memory.
As a psychological laboratory, we had data to analyze,
but we had no analysis programs on UNIX.
Occasionally, we got a tape from a site with their latest programs,
but they tended to be too big, too slow, too clumsy,
or not do what we wanted them to do.
I tried to find suitable statistical packages, with no success.
We even got a standard large statistical package, BMD-P,
but its programs were not designed for our little PDP 11/45.
So huge were they, with all their immense power,
that to load a program and do all the necessary overlaying of subroutines
required two minutes to get the mean of one number.
We were not happy.
.P
I wrote the first version of
.T desc
about that time.
It was late 1978.
.T desc
was a program to describe a single distribution of data
with summary statistics, frequency tables, and histograms.
I was tired of writing a
.T mean
program every time I wanted to do the simplest data analysis,
so I tried to write a general descriptive program once and for all.
.T desc
was not welcomed with much fanfare;
I had to go door to door to get people to use it.
Evidently, people thought it would be easier to write their own programs,
or do the analyses with a calculator, than learn to use my program.
I searched personal binary directories to see if people
were making their own programs to do sub-parts of
.T desc .
Sure enough, people had programs, even shell scripts,
to compute means and the occasional standard deviation.
Their programs were difficult to use, with no documentation
and silly conventions like having to end the input with a minus one.
.T desc
allowed free format input and read until the end of a file,
provided over a dozen options, and its output was easy to read.
After about six months,
.T desc
was in moderate use by part of our lab.
I was encouraged to see people other than myself
berating a third party for writing their own programs to do what
.T desc
did.
.P
After
.T desc ,
I wrote
.T pair ,
a program for paired data analysis (a common design in psychology).
.T pair
was based on a program written by Don Gentner of our lab.
I liked his program, but not its output format, which I modified.
.T pair
followed format conventions similar to those of
.T desc ,
and people familiar with
.T desc
found
.T pair
that much easier to use.
Over the years since then, additional statistics
and a simple scattergram routine have been added to
.T pair .
.P
Both
.T desc
and
.T pair
read from the standard input and write to the standard output.
The main reason for this was that it is easier than opening files.
Looking back on this convenient decision,
I see it as critical to the design of the rest of the package.
The standard UNIX philosophy is to create modular programs
that can be connected via pipelines with existing tools in novel ways.
I knew of this philosophy, but did not follow it for any particular reason.
As other UNIX|STAT programs developed,
particularly ones for transforming data,
the philosophy dominated the design.
.P
We tended to keep our data in matrix type files
with some uniform number of blank separated fields per line.
We used these as master data files
from which we extracted the columns we were interested in.
Jay McClelland wrote a program called
.T colex
that extracted columns from such files.
.T colex
was indispensable, and Jay added features
to extract columns based on conditions
by allowing analysts to set minimum and maximum
allowable values for each column.
We wanted to be able to add, subtract,
and do other transformations on the values of the columns
to create new files of data,
so I wrote a program called
.T trans ,
which did an inadequate job using a simple syntax.
We needed some data massager that could be handed complex expressions.
.P
In the summer of 1980, I was teaching myself about compilers
and at the same time learned of a "compiler-compiler" called
.T yacc .
Based on our need for a data massager and my interest in
.T yacc ,
I wrote
.T dm ,
now called the data
.I manipulator
for public consumption.
.T dm
allows an analyst to write a series of C-like expressions
for conditional extraction of algebraic combinations of columns
in a master data file.
With
.T dm ,
.T colex
and
.T trans
became immediately obsolete,
especially after Jay helped add
.T dm "'s"
string handling capabilities.
It was not until a year later that we found out
that there was a program called
.T awk
that could do all that
.T dm
did, and more.
Even so,
.T dm
held its own because it was more convenient for most of our needs.
.P
After
.T dm
was around, our view of data analysis changed
to that of a series of transformations followed by a specific analysis.
.ce
cat data | transformations | analysis
Data begins in a master data file,
is sent via a pipe to a series of transformations (e.g., by
.T dm )
followed by a pipe into an analysis program (e.g.,
.T desc ).
Jay McClelland's concept of having all the data in a master data file,
along with a set of transformation routines,
prompted him to write the original version of
.T abut ,
on which mine is based.
.T abut
helps construct master data files while
.T dm
takes them apart.
Dave Rumelhart wrote a program called
.T mc
(for Make Column) to re-shape data files;
.T maketrix
is based on this.
Later, the
.T transpose
program was added to the list of transformation tools.
We were using files like these for storing string matrix data.
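.P
As a concrete (and entirely hypothetical) illustration of the model,
suppose a master data file
.T scores
holds one line per trial,
and we want summary statistics
for the difference between its second and third columns.
If we assume that
.T dm
expressions refer to column values by names like x1, x2, and so on,
the whole analysis is one pipeline:
.ce
cat scores | dm "x2-x3" | desc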
.P
About the same time, Craig Will was working on a front end
to a standard statistical package's anova program, BMD-08V,
the kind with unforgiving fixed column input formats
(all UNIX|STAT programs use white-space separated fields).
I thought this was noble,
but the design of such programs was not that easy to fix.
Besides the input formats, the control languages were abominable,
especially the parts that specify the experimental design.
I decided that there must be a natural way
to enter the data from a design,
one that would make it impossible to err
while making the data easier to describe.
.P
Then, Jay McClelland wrote his
.T dt
(for Data Tabulation) program
to get cell means using an innovative input format.
Each datum is accompanied by a set of identifying strings
that are used by
.T dt
to decide what cell a datum belongs to.
For example, if you found that a department store named ABC
sold large shoes for $10.95, you would enter a line like:
.ce
ABC large shoe 10.95
This is the essence of the relational database model.
.P
This concept of self identification was followed in
.T anova ,
in which each datum is preceded by a series of labels,
one for each factor,
identifying the conditions of the factors
under which the datum was collected.
A file of data like this is self documenting, and
.T anova
is able to infer the design from its format
without any interactive specification by the user.
The
.T regress
program was designed to accept its natural input format too:
a file of N columns implies an N variable regression.
This is easy to explain to even the most computer naive analysts.
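.P
For example, in a hypothetical two factor design
(the exact labeling conventions are spelled out in the
.T anova
documentation), a few lines of input might read:
.ce
drug-A male 17
.ce
drug-A female 21
.ce
drug-B male 12
From lines like these,
.T anova
can infer the factors and their levels
without any further specification.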
.P
By this time, the package had reached a critical mass,
at least for our laboratory's needs.
Someone suggested I share these programs with other psychologists,
so I sent a general description of them to a journal
in which psychologists share technical information about
.I doing
research as opposed to the
.I results
of research.
On Greg Davidson's suggestion,
I added more extensive error checking to the programs
before releasing them to the public.
.MH "Interactive Use
.P
The UNIX shell is the command language for UNIX|STAT.
Using the standard input and output in the programs
allows joining them together with pipelines,
and the shell provides the command language to do it.
Using the shell is advantageous because it saves analysts learning time.
If analysts are familiar with the shell and some UNIX programs like
.T ed ,
.T sort ,
.T mv ,
.T rm ,
.T cp ,
and others, then they will be on familiar ground.
If not, they can be introduced to these utilities in a familiar context,
and they will be able to use this knowledge in other contexts.
The main point of having all transformation and analysis routines
as stand alone programs, with the shell as the command language,
is that they can be used with UNIX utilities
by anyone familiar with UNIX.
The opposite is also true:
some of the transformation routines
are used in contexts other than data analysis.
.P
The
.T io
program is an intelligent controller and monitor
of input and output in the shell,
based on Don Norman's notions about the flow of data
through pipelines and the need to monitor it.
Monitoring the progress of programs is important
when complex calculations are being performed,
because they can take a long time, sometimes several minutes.
.T io
allows an analyst to get feedback
about how much of a source file has been exhausted,
or how many blocks of data have passed through a pipe.
With a complex pipeline of transformations,
an irritating error is to forget to specify an input file.
In addition to
.T io
giving feedback of progress,
most of the UNIX|STAT programs inform the analyst
when no input redirection is specified
and data are to be read from the terminal keyboard.
The programs assume that data entry from the keyboard
is the exception rather than the rule.
.T io
is also a safety minded replacement for input and output redirection
because it does not allow analysts
to accidentally overwrite the contents of files.
.P
UNIX|STAT assumes that user efficiency is more valuable
than computer efficiency.
The cost paid by having a slow algorithm
(perhaps because a free-format Ascii data file is being processed)
or larger than necessary data files (because they are self documenting)
is easy to compute.
The cost in wasted human resources is more difficult to compute:
increased command specification time,
increased probability of error and increased correction time,
and the high cost of undetected errors.
Rather than count the time for a program to run,
it is more reasonable to count the time an analyst spends
constructing the commands to run the program
and interpreting the results.
These ideas are most evident in the design of the
.T anova
program,
which removes many time consuming and error prone tasks from the analyst.
.MH "Comparison With S
.P
I am often asked to compare UNIX|STAT to the S data analysis system,
developed at Bell Labs by Rick Becker and John Chambers.
UNIX|STAT is entirely independent of S
and was developed without knowledge of it.
The major difference is that UNIX|STAT was developed
for working with the sorts of experimental psychological data
we in our lab at UCSD were accustomed to using,
while S was developed as a general data analysis system,
and so offers a more comprehensive set of functions.
The design philosophies of the two systems
are similar in some ways and different in others.
.P
UNIX|STAT is a set of programs that are
.I added
to the existing set of UNIX programs.
They all can read and write using pipes,
and so can be used with most other UNIX utilities,
using the facilities provided by the UNIX shell.
For example, if you wanted to sort the data in a data file,
you would use the UNIX
.T sort
program, not one provided by UNIX|STAT.
Or if you wanted graphical output, you would use the standard UNIX
.T plot
and
.T graph
programs (though these could benefit from a high level front end).
It is possible to use UNIX utilities with UNIX|STAT programs
because all assume Ascii format files
with data separated by white space (spaces, tabs, and sometimes newlines).
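.P
For example, the selection step of an analysis can come from a stock
UNIX tool.
Using the hypothetical store data shown earlier
(and again assuming the
.T dm
expression x4 simply echoes the fourth column),
describing the prices of shoes is one pipeline:
.ce
grep shoe sales | dm x4 | desc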
.P
S is a special environment for doing data analysis,
separated from the UNIX environment.
Its programs use a special data format, so, in general,
S functions can not be used with UNIX utilities,
nor can standard UNIX utilities be used with S data sets.
One result of this is that S needs to have
its own versions of utilities found in UNIX.
Another is that it is, in general, difficult at best
to use S functions outside the S environment.
.P
Both systems are extensible,
though I suspect adding to a set of UNIX programs is easier.
Both systems offer programming capabilities:
S has its own macro definition language
while UNIX|STAT depends on the shell's programming language.
Using one over the other depends on the needs of the analyst;
certainly, S's set of statistical functions is much larger.
Concerning efficiency, my impression,
without benchmarks or controlled data,
is that the UNIX|STAT programs are faster and easier to use
for small to medium data sets, especially for one shot analyses.
For larger data sets, S's matrix manipulation primitives
seem to give it an advantage that makes up for its slow startup time.
Concerning portability and related areas,
the UNIX|STAT programs use 300K bytes for the C source files,
and 350K bytes for the executables,
with programs averaging 15K bytes.
I am unsure about the requirements for S,
but I know them to be much larger.
The UNIX|STAT programs run on UNIX systems
dating back to Version 6 UNIX
(of course, all subsequent systems can run them),
and on computers as small as PDP 11/23's without problem.
.MH "Recent Developments
.P
After joining the faculty of the Wang Institute,
I restarted development on UNIX|STAT.
After a few hundred hours of work,
the result is a much better set of programs
that current UNIX|STAT users should consider reordering.
.LH "Improvements
.P
The first focus of development was to produce a better quality product:
to improve the portability, reliability, and usability of the programs.
.PH "Option Standards
The option parser
.T getopt
is used in all the programs taking options,
to make the syntax more consistent.
Several commands had options renamed or the order of operands changed
to make the package internally consistent.
Few of the programs read or write files;
where possible, programs deal only with the standard input and output.
.PH "Double Precision
All computations are now done in double precision,
so the results will be less susceptible to rounding errors.
.PH "Probability of F-Ratio
A better approximation to
.I pi
has removed the anomalous -0.000 probabilities.
.PH "Error Messages
The error checking is more rigorous throughout the package,
and the error messages are standardized and more diagnostic.
.PH "Random Number Seeding
The random number generation has a better seeding procedure.
.PH "Runtime Optimization
Much of the input processing is more efficient.
.PH "Exit Status
All the programs return a zero exit status after a successful run,
and non-zero otherwise,
following the conventions used throughout UNIX.
.PH "Version Control
SCCS (Source Code Control System) version strings
have been installed in all the routines.
The second quarter 1985 release of UNIX|STAT
has all the modules set at version 5.0.
.PH "MSDOS Port
With the aid of Fred Horan at Cornell,
the complete package has been ported to MSDOS on the IBM PC
using the Lattice C compiler.
To make this possible, some non-portable features of UNIX
were replaced with simpler routines, at no discernible loss.
.PH "New Documentation
All the documentation has been updated.
.LH "Additions
.P
Some new functionality was added for Version 5.0,
and more can be expected in the future;
in particular, I am working on a general
cross-tabulation and chi-square program.
.PH "Partial Correlation
Thanks to the help of Caroline Palmer, the
.T regress
program now allows you to analyze
the contribution of individual variables to the regression.
Because no one ever saw the point of doing N simultaneous regressions,
.T regress
now predicts the first column's data with the rest.
.PH "One Way ANOVA & T-test
The
.T oneway
program was added to simplify the comparison of data from different groups.
.T oneway
performs a one-way analysis of variance, allowing unequal cell sizes.
It can also be called as the
.T ttest
program, for which (in the two variable case)
it prints its ANOVA table in a simpler format.
.PH "Data Manipulation Routines
New versions of
.T colex
and
.T trans
were added as alternatives to
.T dm
for systems that cannot compile
.T dm .
.T colex
might still be used instead of
.T dm
because it is faster at its limited task.
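.P
For example, assuming
.T colex
takes the numbers of the columns to extract as its arguments,
as its name suggests
(the file name here is hypothetical),
pulling two paired columns out of a master data file is simply:
.ce
colex 1 3 < master | pair
Conditional selection and algebraic combinations of columns
still call for
.T dm .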