DataMuseum.dk

Presents historical artifacts from the history of:

DKUUG/EUUG Conference tapes

This is an automatic "excavation" of a thematic subset of
artifacts from Datamuseum.dk's BitArchive.

See our Wiki for more about DKUUG/EUUG Conference tapes

Excavated with: AutoArchaeologist - Free & Open Source Software.


⟦3daa0c67c⟧ TextFile

    Length: 32002 (0x7d02)
    Types: TextFile
    Names: »unixstat«

Derivation

└─⟦87ddcff64⟧ Bits:30001253 CPHDIST85 Tape, 1985 Autumn Conference Copenhagen
    └─ ⟦this⟧ »cph85dist/stat/doc/unixstat« 


.ds AU Gary Perlman
.ds UX UNIX
.ds US UNIX|STAT
.ds DT March 1985
.CW
.TC 4
.(T "\*(US" "Data Analysis Programs for \*(UX" "A Tutorial Introduction"
Gary Perlman
School of Information Technology
Wang Institute of Graduate Studies
Tyngsboro, MA 01879 USA
(617) 649-9731
.)T "\*(US Tutorial
.ls 1
.de PG
.       LH "\\$1: \\$2
.P
..
.bp 1
.!T
.HH Abstract
.sp
I describe some programs for data analysis
for use on the \*(UX operating system.
The programs fall into four classes:
validation, transformations, description, and inference.
These programs are compact and versions run on microcomputers
as small as the PDP 11/23 and on PCs running MSDOS.
The programs are designed to be used
in pipelines and require little user interaction
because the programs often are able to infer the design of the data
from data file formats.
An ordinary command involves a pipeline of transformations
terminated by an analysis command.
Most users familiar with statistics
learn to do simple analyses in a few seconds,
and complex analyses inside an hour.
.bp
.!T
.P
In this paper I demonstrate some programs available on \*(UX
for analyzing data.
The programs were written by me while I was at the University of California
at San Diego between 1979 and 1982
and at the Wang Institute of Graduate Studies
from 1984 to the present.
Collectively, they are called \*(US,
so named because of their heavy use of the
.T |
symbol
for the \*(UX ``pipe'' operation.
The programs have been designed to be easy to use,
and small enough to fit on minicomputers such as
DEC PDP
11/45's, 11/34's, and 11/23's.
I will first describe the format prescribed for the programs,
and how people use the programs in the \*(UX environment.
Although this document might be a sufficient introduction
to the facilities for analyzing data,
it is not complete and should not be used as a substitute for
the reference manuals for selected programs (like
.T dm
and
.T calc ),
nor for the manual entries on all the programs
(which can be printed by typing the
.T manstat
command in the shell).
In general, the on-line documentation is more up to date
than the printed documentation.
.P
The programs are of several different types:
.PH "Data Transformation
These programs are useful for changing the format of data files,
for transforming data, and for filtering unwanted data.
One program is useful for monitoring the progress
of the data transformations.
.PH "Data Validation
These include programs for checking the number of columns
in data files and their types (e.g., alphanumeric, integer).
.PH "Descriptive Statistics
These procedures include both numerical statistics,
and simple graphical displays.
There are procedures for single distributions,
paired data, and multivariate cases.
.PH "Inferential Statistics
These include multivariate linear regression
and multi-factor analysis of variance.
Some simple inferential statistics are also incorporated into
the descriptive statistics programs, but are used less often.
.PH "Other Programs
Some programs peculiar to psychological data analysis
are included in the \*(US distribution tape,
but are not described here.
For mathematical psychology fans there is a d' analysis program,
.T dprime ,
and a vincentizing program,
.T vincent .
.br
.ne 3i
.(D "Table of Programs Described
.if t .ta 1i
.if n .ta 10n
.ft R
abut	abut files beside each other
anova	multi-factor anova with unequal cell sizes and repeated measures
biplot	bivariate plotting with options for summary statistics
calc	algebraic calculator
corr	multivariate linear correlation with summary statistics
desc	univariate statistics, frequency tables, and histograms
dm	conditional transformations of data
io	control and monitor file input and output
maketrix	form a matrix from an unstructured file
oneway	one way analysis of variance with unequal cell sizes
pair	bivariate summary statistics with options for scatterplots
perm	randomly permute lines
regress	multivariate linear regression
repeat	repeat a file or string
reverse	reverse lines, fields, or characters
series	print a series of numbers
ts	time series analysis
transpose	transpose matrix type file (flip rows and columns)
ttest	between groups t-test
.)D
.P
All these programs run on the \*(UX operating system,
a program development and text editing facility developed
at Bell Laboratories.
\*(UX is a highly interactive operating system,
and because of this,
it is an ideal environment for data analysis.
Persons doing data analysis sit at a terminal and repeatedly
specify program options and data on which these programs act.
They have immediate access to intermediate results,
and can make analysis decisions based on them.
.HH "Using \*(UX
.P
This section describes the typical use of \*(UX
and is meant to give non-users a brief introduction
to its use.
It may also provide a useful summary for experienced users
who have little experience with constructing complex commands with pipelines.
\*(UX users sit at a terminal
at which they repeatedly specify a program,
the input to that program, and to where the output from the program
should be directed.
They specify this program, input, and output
to a program called a ``shell.''
The shell is most users' primary way of interacting with \*(UX.
If the user does not specify from where the input to a program
is to be read,
the default ``standard input'' is the user's terminal keyboard.
(For data analysis programs, this often is a mistake.)
Similarly, if unspecified, the default ``standard output''
is the terminal screen.
To override these default standard input and outputs,
\*(UX shells provide simple mechanisms
called ``redirection'' and ``pipelining.''
To indicate that a program should read its input from a file rather than
the terminal keyboard,
a user can ``redirect'' the input from a file with the
.T <
symbol.
(In all following examples, I will use the convention of placing
commands in a typewriter font so that
.T "interaction with the computer is shown like this" .)
Thus,
.(D
program < file
.)D
indicates to \*(UX
(really it indicates to the shell which controls input and output)
that the input to the program named
.T program
should be read from
the file named
.T file
rather than the terminal keyboard.
Analogously,
the output from a program can be saved in a file
by redirecting it to a file with the
.T >
symbol.
Thus,
.(D
program < input > output
.)D
indicates that the program
.T program
should read its input from
the file
.T input
and put its output into a new file called
.T output .
If the file
.T input
does not exist,
an error message will be printed.
If the file
.T output
exists,
then whatever was in that file before will get destroyed.
A mistake to avoid is a command like:
.(D
program < data > data       (bad)
.)D
which one might think replaces the contents of
.T data
with whatever
.T program
does to it.
The effect of such a command is to destroy the contents of
.T data
before
.T program
has a chance to read it.
For a safer method of input and output, see the later discussion of
the
.T io
program.
.P
The output from one program can be made the input to another program
without the need for temporary files.
This action is called ``pipelining'' or ``piping.''
It creates one complex function from a series of simple ones.
The vertical bar, or ``pipe'' symbol,
.T | ,
is placed between programs to pipe the output from one into the other.
Thus,
.(D
program1 < input | program2
.)D
tells \*(UX to run the program
.T program1
on the file
.T input
and feed the output to
.T program2 .
Here,
the final output would be printed on the terminal screen
because the final output from
.T program2
is not redirected.
Redirection could be accomplished with a command line like:
.(D
program1 < input | program2 > output
.)D
In general, only one input redirection is allowed,
and only one output redirection is allowed,
and the latter must follow the former.
Several programs can be joined with piping.
.P
In general, \*(UX programs do not know if their input is coming
from a terminal keyboard or from a file or pipeline.
Nor do they generally know where their output is destined.
One of the features of \*(UX is that the output from one program
can be the input to another via a pipeline.
It is possible to make complex programs from simple ones
without touching their program code.
Pipelining makes it desirable to keep the outputs of programs
clean of annotations so that they can be read by other programs.
This has the unfortunate result that the outputs of many \*(UX programs
are cryptic and have to be read with a legend.
The advantages of pipelining will be made clear in the examples of
later sections.
.HH "The Master Data File
.P
The key ideas of the format of the master data file are simplicity
and self documentation.
The reason for this is to make transformation of data easy,
and to be able to use a wide variety of programs to operate on 
a master data file.
Each line of a master data file has the same number of alphanumeric fields.
For readability, the fields can be separated
by any amount of white space (blank spaces or tabs),
and, in general, blank lines are ignored.
Each line of a master data file corresponds to the data collected
on one trial or series of trials of an experiment.
Along with the data, a set of fields describe the conditions under which
those data were obtained.
Usually, a master data file contains all the data for an experiment.
Often, however,
a user does not want all the data in this file
to be input to a program.
Some parts may be of particular interest for a specific statistical test,
or some data may need to be transformed before input to
a data analysis program.
.HH "Program Descriptions
.P
The programs described here were designed with the philosophy that data,
in a simple format,
can implicitly convey all or most of the information a program needs
to analyze them.
With data transforming utilities,
the need for a special language to specify design information
all but disappears.
Users can implicitly specify design information
by putting their data into specific formats.
For the most part,
the programs have few options,
preferring to print more statistics than analysts might want,
but removing the need for specifying options.
.P
The strategy I will use here is to describe the programs available
and give examples of how they are used.
This is meant only to make the ideas of analysis with these programs
familiar and should not be used as a substitute for the
manual entries on individual programs.
Only a few of the capabilities of the programs are described.
Before discussing specific programs,
a discussion of some general properties of their use is useful.
.MH "\*(US Conventions
.P
\*(US programs all follow a set of user interface conventions
that when learned,
make it easy to learn to use new programs.
These are in the areas of input format,
command syntax,
and error handling.
.LH "Input Format
.(D
.ft R
Input fields are separated by spaces or tabs.
Most programs read from the standard input.
All programs write to the standard output.
Numerical input is checked for type and range validity.
Input is read until the end of file.
.)D
.LH "Command Syntax
.(D
.ft R
Command line options are read using the \fIgetopt\fP parser.
Options are single letters and preceded by a dash (-).
Option values should be preceded by a space:
.T "	command -w 12"
Option values are not optional.
Options without values can be bundled.
.T "	command -a -b -c
.T "	command -abc
All options must precede any operands.
A single dash (-) in place of a file name indicates reading the standard input.
A double dash (--) indicates the end of the options.
.)D
.LH "Error Handling
.(D
.ft R
Error messages:
	print the name of the program,
	print diagnostic information,
	sound an audible bell,
	cause the program to exit.
Many programs print a usage summary on error.
.)D
.PG calc  "An Algebraic Calculator
.T calc
is a program that converts your computer
and terminal into a $20.00 algebraic calculator.
To use it, you just type its name,
and
.T calc
prompts you for expressions like:
.(D
12 + sqrt (50) / log (18)
hypotenuse = sqrt (a^2 + b^2)
.)D
for which it will print reasonable values.
.MH "Transforming Data
.P
In this section, I describe programs for transforming data.
The reason for describing transformation programs before statistical programs
is that usually analysts want to transform
their data before analysis.
The general form of a command would thus be:
.(D
transform < data | analyze > output
.)D
where
.T transform
is some program to transform data
from an input file
.T data
and the output from
.T transform
is piped to an analysis program,
.T analyze
whose output is directed to an output file,
.T output .
One program,
.T abut ,
takes the data from several files,
and puts corresponding lines from those files on the same
line of the output.
Another,
.T dm ,
is a data manipulator for extracting and transforming
columns of lines.
.PG abut "Abut Files
.T abut
is a program to take several files,
each with N lines, and make one file with N lines.
This is useful when data from repeated measures experiments,
such as paired data, are in separate files and need to be
placed into one file for analysis (see
.T pair
and
.T regress ).
For example, the command:
.(D
abut file1 file2 file3 > file123
.)D
would create
.T file123
with its first line
the first lines of
.T file1 ,
.T file2 ,
and
.T file3 ,
in order.
Successive lines of
.T file123
would have the data from
the corresponding lines of the named files joined together.
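The
.T abut
program ships with the \*(US distribution; on a system without it, the standard
.T paste (1)
command does a similar side-by-side join (note that it separates the
joined fields with tabs by default, not spaces). A sketch:

```shell
# paste(1) joins corresponding lines of its file arguments,
# much as abut does (but tab-separated).
printf 'a\nb\n' > file1
printf '1\n2\n' > file2
paste file1 file2
# prints:
# a	1
# b	2
```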
.PG io  "Control and Monitor Input/Output
.T io
is a general program for controlling input and output of files.
It can be used instead of the standard shell redirection
mechanisms and is sometimes safer.
It can also monitor the progress of data analysis commands,
some of which can take a long time.
.PH "Catenate Files
.T io
has a similar function to
.T abut .
Instead of, in effect, placing the files named
.I beside
each other,
.T io
places them one after another.
A user may want to analyze the data from several files,
and this can be accomplished with a command like:
.(D
io file1 file2 file3 | program
.)D
.PH "Monitoring Input and Output
.T io
also monitors the flow of data between programs.
When called with the
.T -m
flag, it acts as a meter of input and output flow,
printing the percentage of its input that has been processed.
The program can also be used in the middle of pipelines,
and at the end of pipelines to monitor the absolute flow of data,
printing a special character for every block of data processed.
.PH "Input and Output Control
Finally, it can be used as a safe form of controlling i/o
to files, creating temporary files and copying rather than
automatically overwriting output files.
For example,
.T io
can sort a file onto itself with
one pipeline using the standard \*(UX
.T sort
program:
.(D
io -m file | sort | io -m file
.)D
The above command would replace
.T file
with a sorted version of itself.
Because the monitor
.T -m
flag is used,
the user would see an output like:
.(D
 10%  20%  30%  40%  50%  60%  70%  80%  90% 100%
==========
.)D
The percentages show the flow from
.T file
into the
.T sort
program
(percentages are possible because
.T io
knows the length of
.T file ),
and the equal signs indicate a flow of about a thousand bytes each
coming out of the
.T sort
program.
The command:
.(D
sort file > file
.)D
would destroy the contents of
.T file
before
.T sort
had a chance to read it.
The command with
.T io ,
is both safer and acts as a meter showing
input and output progress.
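For sorting in particular, the standard
.T sort (1)
has its own built-in safeguard: with the
.T -o
option, the output file is opened only after all input has been read,
so a file may safely be sorted onto itself (though without the flow meter
that
.T io
provides).

```shell
# sort -o opens its output only after reading all input,
# so the same file can appear as both input and output.
printf '3\n1\n2\n' > file
sort -o file file
cat file
# prints:
# 1
# 2
# 3
```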
.PH "Saving Intermediate Transformations
In a command like:
.(D
io file1 file2 | program1 | program2 | io output
.)D
the intermediate results before
.T program1
and before
.T program2
are lost.
.T io
can save them by diverting a copy of its input
to a file before continuing a pipeline:
.(D
io file1 file2 | io in1 | program1 | io in2 | program2 | io output
.)D
Any of these calls to
.T io
could be made metering versions
by using the
.T -m
flag.
.PG transpose  "Transpose Matrix-Type Files
.T transpose
is a program to transpose a matrix-like file.
That is, it flips the rows and columns of the file.
For example, if
.T file
looks like:
.(D
1 2 3 4
5 6 7 8
9 10 11 12
.)D
then the command:
.(D
transpose < file
.)D
will print:
.(D
.ta 1i 2i 3i 4i
	1	5	9
	2	6	10
	3	7	11
	4	8	12
.)D
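Where
.T transpose
is unavailable, an
.T awk (1)
one-liner gives the same result for rectangular input; this sketch joins
fields with single spaces rather than tabs.

```shell
# Accumulate each column as a row, then print the rows:
# a standard-awk stand-in for the transpose program.
printf '1 2 3 4\n5 6 7 8\n9 10 11 12\n' |
awk '{ for (i = 1; i <= NF; i++) col[i] = col[i] (NR > 1 ? " " : "") $i }
     END { for (i = 1; i <= NF; i++) print col[i] }'
# prints:
# 1 5 9
# 2 6 10
# 3 7 11
# 4 8 12
```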
.PG series "Print a Series of Numbers
.T series
is useful for creating some test data or dummy variables.
You supply a starting number and an ending number,
and
.T series
prints all the numbers in between.
Optionally, the increments between the endpoints can be adjusted.
Series can be in reverse order, so the command:
.(D
series 20 10 | transpose
.)D
produces something like:
.(D
20 19 18 17 16 15 14 13 12 11 10	
.)D
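On systems without
.T series ,
the GNU/BSD
.T seq (1)
command covers the common cases, including descending series with an
explicit increment; here
.T xargs
joins the output onto one line, much like piping through
.T transpose .

```shell
# seq 20 -1 10 counts down from 20 to 10 in steps of -1,
# one number per line; xargs joins the lines with spaces.
seq 20 -1 10 | xargs
# prints: 20 19 18 17 16 15 14 13 12 11 10
```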
.PG repeat "Repeat a File or String
.T repeat
prints the named file the specified number of times,
or the standard input if no file name is supplied.
The repetitions are controlled by a numerical argument
which, if negative, tells
.T repeat
to simply repeat the name, not the file.
For example,
.(D
repeat -3 "The rain in Spain falls mainly on the plain"
.)D
will produce:
.(D
The rain in Spain falls mainly on the plain
The rain in Spain falls mainly on the plain
The rain in Spain falls mainly on the plain
.)D
.T repeat
can generate random numbers when combined with
.T dm ,
by providing
.T dm
with a specified number of input lines:
.(D
repeat -100 x | dm RAND
.)D
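The string-repetition case of
.T repeat
can be approximated with the standard
.T yes (1)
and
.T head (1)
commands:

```shell
# yes repeats its argument forever; head keeps the first 3 lines,
# mimicking "repeat -3 ...".
yes "The rain in Spain falls mainly on the plain" | head -3
# prints the sentence three times, one per line
```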
.PG reverse "Reverse Lines, Fields, and Characters
.T reverse
reorders lines,
space separated fields,
or characters in lines.
This may be useful for
.T dm
if the fields you want to work with are
in the final columns of a file.
The options can be combined to provide interesting results.
.(D
results. interesting provide to combined be can options The
.stluser gnitseretni edivorp ot denibmoc eb nac snoitpo ehT
ehT snoitpo nac eb denibmoc ot edivorp gnitseretni .stluser
.)D
.PG perm "Randomly Permute Lines
.T perm
produces a random ordering of the lines in a file.
.T perm
is useful to randomize experimental conditions,
and for Monte Carlo simulations.
.PG maketrix "Create Matrix-Type File
.T maketrix
is a program to create matrix type files for input to
other \*(US programs.
Its integer argument is the number of columns to form the file from.
If
.T file
contains:
.(D
1 2 3 4 5 6 7 8 9 10 11 12
.)D
then the command:
.(D
maketrix 3 < file
.)D
will produce:
.(D
.ta 1i 2i 3i 4i
	1	2	3
	4	5	6
	7	8	9
	10	11	12
.)D
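The same reshaping can be approximated with the standard
.T xargs (1)
command, whose
.T -n
option echoes its input a fixed number of words per line:

```shell
# xargs -n 3 regroups the twelve input words into rows of three,
# a standard-tool stand-in for "maketrix 3".
echo 1 2 3 4 5 6 7 8 9 10 11 12 | xargs -n 3
# prints:
# 1 2 3
# 4 5 6
# 7 8 9
# 10 11 12
```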
.PG dm  "A Data Manipulator
.T dm
is a data manipulating program that allows
its user to extract columns (delimited by white space) from a file,
possibly based on conditions,
and produce algebraic combinations of columns.
.T dm
is probably the most used of all the programs described in this paper.
To use
.T dm ,
a user writes a series of expressions,
and,
for each line of its input,
.T dm
reevaluates and prints the values of those expressions in order.
.P
.T dm
allows users to access the fields of each line of its input.
Numerical values of fields on a line can be accessed by the letter
.T x
followed by the column number.
Character strings can be accessed by the letter
.T s
followed by the column number.
Consider for example the following contents of the file
.T ex1 :
.(D
.ta 1i 2i 3i 4i
	12	45.2	red	***
	10	42	blue	---
	8	39	green	---
	6	22	orange	***
.)D
The first line of
.T ex1
has four columns, or fields.
In this line,
.T x1
is the number
.T 12 ,
and
.T s1
is the string
.T 12 .
.T dm
distinguishes between numbers and strings (the latter are
enclosed in quotes) and only numbers can be involved in algebraic
expressions.
.PH "Column extraction
Simple column extraction can be accomplished by typing the strings
in the columns desired.
To print, in order, the second, third, and first columns from the file
.T ex1 ,
one would use the call to
.T dm :
.(D
dm s2 s3 s1 < ex1
.)D
This would print a reordering of the first three columns, omitting column 4.
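Plain column extraction is also a one-line
.T awk (1)
job; this sketch is a standard-tool equivalent of the call to
.T dm
above, using inline sample data in place of the
.T ex1
file.

```shell
# Print the second, third, and first whitespace-separated fields,
# as "dm s2 s3 s1" does.
printf '12 45.2 red ***\n10 42 blue ---\n' |
awk '{ print $2, $3, $1 }'
# prints:
# 45.2 red 12
# 42 blue 10
```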
.PH "Algebraic Expressions
.T dm
produces algebraic combinations of columns.
For example, the following call to
.T dm
will print
the first column, the sum of the first two columns,
and the square root of the second column.
.(D
dm x1 x1+x2 "sqrt(x2)" < ex1
.)D
Note that the parentheses in the third expression
require quotes around the whole expression.
This is because parentheses are special characters
in the shell.
If a string in either of the columns was not a number,
.T dm
would print an error message and stop.
.PH "Conditional Operations
.T dm
ignores lines that are not wanted.
A simple example is to print only those lines with minus signs in them.
.(D
dm "if '-' C INPUT then INPUT else NEXT" < ex1
.)D
The above call to
.T dm
has one expression in quotes to
overcome problems with special characters and spaces
inserted for readability.
The conditional has the syntax
.T if-then-else .
Between the
.T if
and
.T then
is a condition that is tested,
here
testing if the one-character string
.T -
is in
.T INPUT ,
a special string holding the input line.
If the condition is true, the expression between
the
.T then
and
.T else
parts is printed,
in this example, the input line,
.T INPUT .
If the condition is not true,
then the expression after the
.T else
part is printed.
.T NEXT
is a special control variable that is not
printed but causes the next line to be read.
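The same filter can be written in standard
.T awk (1):
a pattern with no action prints only the matching lines, just as the
.T dm
conditional above passes through only the lines containing a minus sign.

```shell
# /-/ with no action prints each line containing a minus sign.
printf '12 45.2 red ***\n10 42 blue ---\n6 22 orange ***\n' |
awk '/-/'
# prints:
# 10 42 blue ---
```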
.MH "Data Validation
.P
Before analysis begins,
it is a good idea to make sure data are entered correctly.
The programs described in this sub-section
are useful for verifying the consistency of data files.
Individual analysis programs do their own verification,
so the programs described, in practice,
help find errors detected by specific analysis programs.
.PG validata  "Check Data Validity
A master data file is assumed to have an equal number of fields per line.
.T validata
checks its input from the standard input or argument file and
complains if the number of fields per line changes.
After reading its input,
.T validata
reports the number of entries of various data types
for each column.
The data types
.T validata
knows about include
integer and real numbers, alphabetics and alphanumeric fields,
and some others.
The minimum and maximum values of each column are also reported.
.T validata
shows incorrect entries
such as non-numerical data in columns expected to be numerical,
or accidentally entered invisible control characters.
.PG dm "Range Checking on Columns
.T dm
can verify data as well as transform it.
For example, to check that all numbers in column three
of a file are greater than zero and less than 100,
the following call to
.T dm
would print all lines that
did not display that property.
.(D
dm "if !(x3>0 & x3<100) then INPUT else NEXT"
.)D
If non-numerical data appeared in column three,
.T dm
would report an error.
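The same range check in standard
.T awk (1):
print every line whose third field is not strictly between 0 and 100.

```shell
# Negating the validity condition selects the offending lines,
# like the dm if-then-else above.
printf 'a b 50\nc d 150\ne f -1\n' |
awk '!($3 > 0 && $3 < 100)'
# prints:
# c d 150
# e f -1
```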
.MH "Descriptive Statistics
.PG desc  "Describing a Single Distribution
.T desc
analyzes a single distribution of data.
Its input is a series of numbers,
in any format,
so that numbers are separated by spaces, tabs, or newlines.
Like most of these programs,
.T desc
reads from the standard input.
.PH "Summary Statistics
.T desc
prints a variety of statistics, including order statistics.
Optionally,
.T desc
prints a t-test for any specified null mean.
.PH "Frequency Tables
.T desc
optionally prints frequency tables, or tables of proportions,
with cumulative entries if requested.
These tables will be formed based on the data
so they are a reasonable size,
or they can be formatted by the user
who can specify the minimum value of the first interval,
and the interval width.
For example, the following command would print
a table of cumulative frequencies and proportions,
.T -cfp ,
in a table with a minimum interval value of zero,
.T "-m\ 0"
and an interval width of ten,
.T "-i\ 10" .
.(D
desc -i 10 -m 0 -cfp < data
.)D
.PH "Histograms
If requested,
.T desc
will print a histogram with the same format
as would be obtained with options to control frequency tables.
For example, the following command would print a histogram of its
input by choosing an appropriate interval width for bins.
.(D
desc -h < data
.)D
The format of the histogram, as well as tables,
is controlled by setting options.
The following line sets the minimum of the first bin to zero,
and the interval width to ten,
an appropriate histogram for grading exams.
.(D
desc -h -i 10 -m 0 < grades
.)D
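The binning that
.T desc
performs can be roughly sketched in standard
.T awk (1):
count values into width-10 intervals starting at 0, then sort the bins
numerically (the order of awk's for-in loop is unspecified).

```shell
# Map each grade to the lower bound of its width-10 bin and
# count occurrences, a crude version of "desc -i 10 -m 0 -f".
printf '95\n87\n92\n75\n' |
awk '{ bin = int($1 / 10) * 10; count[bin]++ }
     END { for (b in count) print b, count[b] }' |
sort -n
# prints:
# 70 1
# 80 1
# 90 2
```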
.PG pair  "Paired Data Analysis
.T pair
analyzes paired data.
Its input is a series of lines, two numbers per line,
which it reads from the standard input.
Options are available for printing a bivariate plot,
which is the default when the program is called by its alias,
.T biplot .
Other options control the type of output.
.PH "Summary Statistics
From
.T pair "'s"
input, minima, maxima, means, and standard
deviations are printed for both columns as well as their
difference.
Also printed is the correlation of the two columns
and the regression equation relating them.
The simplest use of
.T pair
is with no arguments.
To analyze a data file of lines of X-Y pairs,
the following command will usually be satisfactory:
.(D
pair < data
.)D
.P
Often the paired data to be analyzed are in two files,
each variable occupying a single column.
These can be joined with
.T abut
and input to
.T pair
via a pipe:
.(D
abut var1 var2 | pair
.)D
Or perhaps the two variables occupy two columns in a master data file.
If the variables of interest are in columns four and six,
the following command would produce the paired data analysis:
.(D
io data | dm s4 s6 | pair
.)D
.PH "Scatterplots
With the
.T -p
or
.T -b
options,
a scatterplot of the two variables can be printed.
Alternatively,
.T pair
has an alias, called
.T biplot ,
which lets the user
get a bivariate plot of a data file of X-Y pairs:
.(D
biplot < data
.)D
.PG corr  "Multiple Correlation
.T corr
prints summary statistics for multivariate or repeated measures data.
Its input is a series of lines, each with an equal number of data.
It prints the mean, standard deviation, minimum, and maximum
for each column in its input.
Then it prints a correlation matrix with all pairwise correlations.
Like
.T pair ,
columns from files can be joined with
.T abut
or extracted from files with
.T dm .
.MH "Inferential Statistics
.PG anova  "Multi-Factor Analysis of Variance
.T anova
performs multi-factor analysis of variance
with repeated measures factors (within subjects),
and with unequal cell sizes allowed on grouping factors (between subjects).
.T anova
reads in a series of lines from the standard input,
each with the same number of alphanumeric fields.
Each datum occupies one line and is preceded by
a list of levels of independent variables describing
the conditions under which that datum was obtained.
The first field is some string indicating the level
of the random variable in the design,
and subsequent fields describe other independent variables.
From this input,
.T anova
infers which factors are between subjects,
and which are within subjects.
.T anova
prints cell sizes, means, and standard deviations
for all main effects and interactions.
Then
.T anova
prints a summary of the design of its input,
followed by a standard F-table.
.P
Suppose you had a design in which you presented problems
to subjects.  These problems varied in difficulty (easy/hard),
and in length (short/medium/long).
The dependent measure is time to solve the problem,
with a time limit of five minutes.
Your data file would have lines like this:
.(D
fred	easy	medium	5
ethel	hard	long	2
.)D
In the first column is a string that identifies the level of the
random factor (here, subject name), followed by strings indicating
the level of the independent factors, followed by the dependent
measure (here, solution time).
The data file holding lines like those above would be analyzed with:
.(D
anova subject difficulty length time < data
.)D
Individual factors can be ignored by excluding their
columns from the analysis:
.(D
dm s1 s2 s4 < data | anova subject difficulty time
.)D
Similarly, different factors can be used as the random factor.
This is common in psycho-linguistic experiments in which
both subjects and items can be thought of as random variables.
.PG "oneway/ttest" "One Way Analysis of Variance & T-Test
.T oneway
allows the comparison of up to 20 different groups of data
using a between groups analysis of variance.
It can also be called by the alias
.T ttest
for which it prints its significance test in a simpler format
for the common two group case.
The input format is simpler than for
.T anova .
Each group is read in free format with data separated by
spaces, tabs, or new lines.
The data for a new group is indicated by the presence of a special value,
called the splitter (by default, -1), which can be chosen by the analyst.
Similar to the way
.T anova
lets the analyst name factors,
the names of the groups can be supplied to
.T oneway
on the command line.
.P
Suppose we have the data for three groups in three files:
.T large ,
.T medium ,
and
.T small .
Further suppose that the data in these files range from
-10.0 to 10.0,
so the default group splitter value might be in the data.
We make a file, called
.T split
that contains the single value
.T 999 .
Now we can construct the input to
.T oneway
with the command:
.(D
cat large split medium split small
.)D
and send that construction to
.T oneway ,
specifying the new group splitter and some group labels:
.(D
cat large split medium split small | oneway -s 999 Large Medium Small
.)D
.PG regress  "Multivariate Linear Regression
.T regress
reads its input of a series of lines from the standard input,
each with the same number of columns of numbers.
From this input,
.T regress
prints minima, maxima,
means, and standard deviations for each variable in each column.
Also printed is the correlation matrix showing the correlations between
all pairs of variables.
Suppose you had a file called
.T data
with any number of lines
and five columns, respectively called ``blood\ pressure,''
``age,'' ``height,'' ``risk,'' and ``salary.''
You could do a multiple regression with:
.(D
regress pressure age height risk salary < data
.)D
If only a few columns were of interest,
they could be extracted with
.T dm :
.(D
dm x1 x3 x5 < data | regress pressure height salary
.)D
.PG pair  "Paired Data Comparisons
.T pair
compares two distributions of paired data.
Often, the two columns of interest are pulled out of a master data file
with
.T dm .
The following command takes columns 5 and 3 from
.T data
and inputs them to
.T pair :
.(D
dm x5 x3 < data | pair
.)D
For its two-column input,
.T pair
will print a t-test on the differences of the two columns,
which is equivalent to a paired t-test.
.T pair
will also print a regression equation relating the two,
along with a significance test on their correlation,
which is equivalent to testing the slope against zero.
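The heart of the paired t-test is the distribution of column differences;
the first step, the mean difference, can be sketched in standard
.T awk (1):

```shell
# Subtract column 2 from column 1 on each line and print the
# mean of the differences over all lines.
printf '5 3\n7 3\n' |
awk '{ d = $1 - $2; sum += d; n++ } END { print sum / n }'
# prints: 3
```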
.PG ts "Time Series Analysis
.T ts
produces a variety of statistics for data collected
in repeated measures over time.
The special options include auto-correlations of various lags
(these compare one part of the series to another),
and transformations between time series of different lengths.
A few simple plots are available.
.PG "pof & critf"  "F-Ratio to Probability Conversion
.T pof
determines the probability of an F-ratio
given the F-ratio, and degrees of freedom:
.(D
pof 12.2 3 48
.)D
.T critf
determines the critical F-ratio needed to
attain the given significance level (useful for confidence intervals):
.(D
critf .01 3 48
.)D
.HH "Acknowledgments
.P
Jay McClelland has been instrumental in his influence in
the user interfaces.
He also wrote the original version of
.T abut .
Mark Wallen helpfully conveyed
many of the intricacies of \*(UX C programming.
Greg Davidson was the force behind most of the error checking facilities.
Don Norman wrote the original flow meter on which the
.T io
program
is based.
Bob Elmasian was helpful in working on Version 6 compatibility.
Jeff Miller kindly supplied the F-ratio to probability conversion functions.
Caroline Palmer helped with enhancements to
.T regress .
Fred Horan spearheaded the port to MSDOS.
.TC