.TC 10
.ds AU Gary Perlman
.CW
.ds DT March 1985
.de EG
.(D
\\$1
.)D
..
.(T "DM: A Data Manipulator" "Tutorial Introduction and Manual"
Gary Perlman
Cognitive Science Laboratory
University of California, San Diego
.)T "DM Tutorial and Manual
.bp 1
.ls 1
.!T
.P
.T dm
is a data manipulating program with many
operators for manipulating columnated files of
numbers and strings.
.T dm
helps avoid writing little BASIC or C
programs every time some transformation
to a file of data is wanted.
To use
.T dm ,
a list of expressions is entered, and
for each line of data,
.T dm
prints the result of evaluating each expression.
.LH "Introductory Examples"
.P
Usually, the input to
.T dm
is a file of lines,
each with the same number of fields.
Put another way,
.T dm 's
input is a file with some set number of columns.
.PH "Column Extraction"
.T dm
can be used to extract columns easily.
If
.T data
is the name of a file of five columns,
then the following will extract
the third string followed by the first, followed by the fourth,
and print them to the standard output.
.EG "dm s3 s1 s4 < data
Thus
.T dm
is useful for putting data in a correct format for input to many programs,
notably the UNIX|STAT data analysis programs.
.PH "Simple Expressions"
In the preceding example, columns were accessed by typing the letter
.T s
(for string) followed by a column number.
The numerical value of a column can be accessed by typing
.T x
followed by a column number.
This is useful to form simple expressions
based on various columns.
Suppose
.T data
is a file of four numerical columns,
and that the task is to print the sum of the first two columns
followed by the difference of the second two.
The easiest way to do this is with:
.EG "dm x1+x2 x3-x4 < data
Almost all arithmetic operations are available and
expressions can be of arbitrary complexity.
However, care must be taken because many of the symbols
used by
.T dm
(such as
.T *
for multiplication) have special meaning when
used in UNIX.
This can be avoided by putting expressions in double quotes.
For example, the following will print the sum of the squares of
the first two columns followed by the square of the third,
a sort of Pythagorean program.
.EG "dm ""x1*x1+x2*x2"" ""x3*x3"" < data
.PH "Line Extraction Based on Conditions"
.T dm
allows testing conditions and printing values depending on whether
the conditions are met.
The
.T dm
call
.EG "dm ""if x1 >= 100 then INPUT else KILL"" < data
will print only those lines that have first columns with values
greater than or equal to 100.
The variable
.T INPUT
refers to the whole input line.
The special variable
.T KILL
instructs
.T dm
to terminate processing on
the current line and go to the next.
.MH "Data Types"
.LH "String Data"
.P
To access or print a column in a file,
the string variable,
.T s ,
is provided.
.T s i
(the letter
.T s
followed by a column number, such as
.T 5 )
refers to the ith column of the input, treated as a string.
The simplest example is to use an
.T s i
as the only part
of an expression.
.EG "dm s2 s3 s1
will print the second, third and first columns of the input.
One special string is called
.T INPUT ,
and is the current input line of data.
String constants in expressions are delimited by single quotes.
For example:
.EG " 'I am a string'
.LH "Numerical Data"
.P
Two general numerical variables are available.
To refer to the input columns, there is
.T x i
and for the result of evaluated expressions, there is
.T y i.
.T x i
refers to the ith column of the input, treated as a number.
.T x i
is the result of converting
.T s i
to a number.
If
.T s i
contains non-numerical characters,
.T x i
may have strange values.
A common use of the
.T x i
is in algebraic expressions.
.EG "dm x1+x2 x1/x2
will print out two columns,
first the sum of the first two input columns,
then their ratio.
.P
The value of a previously evaluated expression can be accessed
to avoid evaluating the same sub-expression more than once.
.T y i
refers to the numerical value of the ith expression.
Instead of writing:
.EG "dm x1+x2+x3 (x1+x2+x3)/3
the following would be more efficient:
.EG "dm x1+x2+x3     y1/3
.T y1
is the value of the first expression,
.T x1+x2+x3 .
String values of expressions are unfortunately inaccessible.
.P
Indexing numerical variables
is usually done by putting the index after
.T x
or
.T y .
Sometimes, however, the value of the index must depend on the input.
For example, when there are a variable number of columns
and only the last column is of interest,
the index depends on the number of columns.
If a computed index is desired for
.T x
or
.T y
the index should be an expression in square brackets following
.T x
or
.T y .
For example,
.T x[N]
is the value of the last column of the input.
.T N
is a special variable equal to the number of columns in
.T INPUT .
There is the option to use
.T x1
or
.T x[1]
but
.T x1
will execute faster,
so computed indexes should not be used unless necessary.
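.P
As a rough illustration of what a computed index such as
.T x[N]
selects, the following Python sketch picks the last column
of lines with varying numbers of columns.
It is not part of
.T dm ;
the function name
.T last_column
is hypothetical, and the sketch assumes every field is numeric.
.(D
```python
def last_column(line):
    # Split on spaces or tabs, as dm does for its input columns.
    fields = line.split()
    n = len(fields)                  # plays the role of N
    # dm columns are numbered from 1, Python lists from 0.
    return float(fields[n - 1])      # corresponds to x[N]

for line in ["3 4 5", "10 20", "7"]:
    print(last_column(line))
```
.)D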
.LH "Special Variables"
.P
.T dm
offers some special variables and control primitives for
commonly desired operations.
Many of the special variables have more than one name
to allow more readable expressions.
Many can be abbreviated,
and these will be shown in square brackets.
.PH N
.T N
is the number of columns in the current input line.
.PH SUM
.T SUM
is the sum of the
.T x i,
i = 1,
.T N .
This number may be of limited use if some columns are
non-numerical.
.PH INLINE
.T INLINE
is the line number of the input.
For the first line of input,
.T INLINE
is
.T 1.0 .
.PH OUTLINE
.T OUTLINE
is the number of lines so far output.
For the first line of input,
.T OUTLINE
is
.T 0.0 .
.T OUTLINE
will not increase if a line is killed with
.T KILL .
.PH RAND
.T "RAND [R]"
is a random number uniform in [0,1).
.PH "INPUT"
.T "INPUT [I]"
is the original input line, all spaces, etc. included.
.PH NIL
If an expression evaluates to
.T NIL ,
then there will be no output
for that expression (for that input line).
This should not be confused with
.T KILL ,
which suppresses output
for a whole line, or a prefix
.T X
that unconditionally suppresses
output for an expression.
.PH KILL
If an expression evaluates to
.T "KILL [K]" ,
then there will be no output
for the present line.
All expressions after a
.T KILL ed
expression are not evaluated.
This can be useful to avoid nastiness like division by zero.
.T NEXT
and
.T SKIP
can be used as synonyms for
.T KILL .
.PH EXIT
If an expression evaluates to
.T "EXIT [E]" ,
then
.T dm
immediately
exits.
This can be useful for stopping a search after
a target is found.
.MH "User Interface"
.LH "Expressions"
.P
Expressions are written in common computer language syntax,
and can have spaces inserted anywhere except
(1) in the middle of constants, and (2) in the middle of multicharacter
operators,
such as
.T <=
(less than or equal to).
Four modes are available for specifying expressions to
.T dm .
They provide the choice of entering expressions from the terminal or a file,
and the option to use
.T dm
interactively or in batch mode.
.PH "Argument Mode"
The most common but restrictive mode is to supply expressions as arguments
to the shell level call to
.T dm ,
as featured in previous examples.
The main problem with this mode is that many special characters
in UNIX are used as operators, requiring that many
expressions be quoted (with single or double quotes).
The main advantage is that this mode is most useful in
constructing pipelines and shell scripts.
.PH "Expression File Mode"
Another non-interactive method is to supply
.T dm
with a file with
expressions in it (one to each line) by calling
.T dm
with:
.EG "dm Efilename
where
.T filename
is a file of expressions.
This mode makes it easier to use
.T dm
with pipelines and redirection.
.PH "Interactive Mode"
.T dm
can also be used interactively by calling
.T dm
with no arguments.
In interactive mode,
.T dm
will first ask for a file of expressions.
If the expressions are not in a file,
type RETURN when asked for the expression file,
and they can be entered interactively.
A null filename tells
.T dm
to read expressions from the terminal.
In terminal mode,
.T dm
will prompt with the expression number,
and print out how it interprets what is typed in if it has correct syntax,
otherwise it allows corrections.
When the last expression has been entered,
an empty line informs
.T dm
that there are no more.
If the expressions are in a file, type in the name of that file,
and
.T dm
will read them from there.
.LH "Input"
.P
If
.T dm
is used in interactive mode, it will prompt for an input file.
A file name can be supplied;
a RETURN without a file name
tells
.T dm
to read data from the terminal.
Out of interactive mode,
.T dm
will read from the standard input.
.P
.T dm
reads data a line at a time and stores that line
in a string variable called
.T INPUT .
.T dm
then takes each column in
.T INPUT ,
separated by spaces or tabs,
and stores each in the string variables,
.T s i.
.T dm
then tries to convert these strings to numbers and stores the
result in the number variables,
.T x i.
If a column is not a number (e.g., it is a name)
then its numerical value will be inaccessible,
and trying to refer to such a column will cause an error message.
The number of columns in a line is stored in a special variable called
.T N ,
so variable numbers of columns can be dealt with gracefully.
The general control structure of
.T dm
is summarized in Figure 1.
.P
.ne 3i
.(D "Figure 1: Control structure for DM
read in n expressions; e1, e2, ..., en.
repeat while there is some input left
	INPUT  = <next line from input file>
	N      = <number of fields in INPUT>
	SUM    = 0
	RAND   = <a new random number in [0,1)>
	INLINE = INLINE + 1
	for i  = 1 until N do
		si  = <ith string in INPUT>
		xi  = <si converted to a number>
		SUM = SUM + xi
	for i = 1 until n do
		switch on <value of ei>
			case EXIT: <stop the program>
			case KILL: <go to get new INPUT>
			case NIL : <go to next expression>
			default  : yi = <value of ei>
				   print yi
	OUTLINE = OUTLINE + 1
	<print a newline character>
.)D
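.P
The control structure can also be sketched in a general purpose language.
The following Python sketch is only an approximation written for this manual,
not the implementation of
.T dm .
Several things in it are assumptions:
expressions are modelled as Python functions of a per-line environment,
the sentinels and the names
.T run
and
.T to_number
are hypothetical,
non-numeric fields are converted to zero,
an EXIT discards any partial output for the current line,
a line whose expressions are all NIL still counts as an output line,
and list indexes start at 0 rather than 1.
.(D
```python
import random

# Hypothetical sentinels standing in for dm's EXIT, KILL and NIL.
EXIT, KILL, NIL = object(), object(), object()

def to_number(field):
    # Assumption: a non-numeric field becomes 0.0 in this sketch.
    try:
        return float(field)
    except ValueError:
        return 0.0

def run(expressions, lines):
    # expressions: callables that take the per-line environment (a dict).
    output = []
    for inline, line in enumerate(lines, start=1):
        s = line.split()                      # columns split on spaces or tabs
        x = [to_number(f) for f in s]
        env = {"INPUT": line, "N": len(s), "SUM": sum(x),
               "RAND": random.random(), "INLINE": inline,
               "OUTLINE": len(output), "s": s, "x": x, "y": []}
        row, killed = [], False
        for expr in expressions:
            value = expr(env)
            if value is EXIT:
                return output                 # stop the program
            if value is KILL:
                killed = True                 # drop this line entirely
                break
            env["y"].append(value)            # keep y aligned with expressions
            if value is not NIL:
                row.append(value)
        if not killed:
            output.append(row)
    return output

# Keep only lines whose first column is at least 100.
keep = [lambda e: e["INPUT"] if e["x"][0] >= 100 else KILL]
print(run(keep, ["1 2", "200 5", "300 1"]))
```
.)D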
.LH "Output"
.P
In interactive mode,
.T dm
will ask for an output file or pipe.
.(D
Output file or pipe: 
.)D
A filename, a ``pipe command,'' or just RETURN can be entered.
A null filename tells
.T dm
to print to the terminal.
If output is being directed to a file,
the output file should be different from the input file.
.T dm
will ask permission to overwrite any
file that contains anything, but that does not mean
it makes sense to write the file it is reading from.
.P
The output from
.T dm
can be redirected
to another program by having the first character of the
output specification be a pipe symbol, the vertical bar:
.T | .
For example, the following line tells
.T dm
to pipe its output to
.T tee
which prints a copy of its output to the terminal,
and a copy to its argument file.
.(D
Output file or pipe: | tee dm.save
.)D
.P
Out of interactive mode,
.T dm
prints to the standard output.
.P
.T dm
prints the values of all its expressions
in
.T %.6g
format for numbers (maintaining at most six digits of precision
and printing in the fewest possible characters), and
.T %s
format for strings.
A tab is printed after every column to ensure separation.
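.P
The effect of the
.T %.6g
format can be checked with any language that follows the C conventions.
The Python sketch below is an illustration only;
the function name
.T format_row
is hypothetical, and the tab is written as
.T chr(9)
to keep the source free of escape sequences.
.(D
```python
TAB = chr(9)

def format_row(values):
    # Numbers use the C-style %.6g format, strings use %s,
    # and a tab follows every column, including the last.
    cells = []
    for v in values:
        if isinstance(v, str):
            cells.append("%s" % v)
        else:
            cells.append("%.6g" % v)
    return "".join(c + TAB for c in cells)

print("%.6g" % 3.0)          # 3
print("%.6g" % (1 / 3))      # 0.333333
print("%.6g" % 1234567.0)    # 1.23457e+06
print(format_row([3.0, 1 / 3, "abc"]))
```
.)D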
.MH "Operations"
.P
.T dm
offers many numerical, logical, and string operators.
The operators are evaluated in the usual order (e.g., times before plus)
and expressions tend to be evaluated from left to right.
Parentheses can be used to make the order of operations clear.
The way
.T dm
interprets expressions can be verified by entering them
interactively, in which case
.T dm
prints a fully parenthesized form.
.P
An assignment operator is not directly available.
Instead, variables can be evaluated but not printed
by using the expression suppression flag,
.T X .
If the first character of an expression is
.T X ,
it will
be evaluated, but not printed.
The value of a suppressed expression can later be accessed with
the expression value variable,
.T y i.
.LH "String Operations"
.P
Strings can be lexically compared with several comparators:
.T <
(less-than),
.T <=
(less-than or equal),
.T =
(equal),
.T !=
(not equal),
.T >=
(greater-than or equal),
and
.T >
(greater than).
They return
.T 1.0
if their condition holds, and
.T 0.0
otherwise.
For example,
.EG " 'abcde' <= 'eeek!'
is equal to
.T 1.0 .
The length of strings can be found with the
.T #
operator.
.EG "# 'five'
evaluates to four, the length of the string argument.
.P
Individual characters inside strings can be accessed by following a
string with an index in square brackets.
.EG " 'abcdefg'[4]
is the ASCII character number (100.0) of the 4th character in
.T "abcdefg" .
Indexing a string is mainly useful for comparing characters because
it is not the character that is printed, but the character number.
A warning is appropriate here:
.EG "s1[1] = '*'
will result in an error because the left side of the
.T =
is a number, and the right hand side is a string.
The correct (although inelegant) form is:
.EG "s1[1] = '*'[1]
.P
A substring test is available. The expression:
.EG "string1 C string2
will return
.T 1.0
if
.T string1
is somewhere in
.T string2 .
This can be used as a test for character
membership if string1 has only one character.
Also available is
.T !C
which returns
.T 1.0
if
.T string1
is NOT in
.T string2 .
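.P
The string operations can be mimicked in Python to check what they return.
This is a sketch for illustration, not the code used by
.T dm ;
the function names
.T char_code
and
.T contains
are hypothetical.
Indexes are counted from 1, as in the example above.
.(D
```python
def char_code(string, i):
    # The result is a number (the ASCII code), not a character.
    return float(ord(string[i - 1]))

def contains(string1, string2):
    # Mirrors "string1 C string2": 1.0 if string1 occurs in string2.
    return 1.0 if string1 in string2 else 0.0

print(char_code("abcdefg", 4))                       # 100.0, the code for d
# Comparing a character with a constant needs a code on both sides.
print(char_code("*rest", 1) == char_code("*", 1))    # True
print(contains("bad", "a bad line"))                 # 1.0
print(contains("x", "abc"))                          # 0.0
```
.)D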
.LH "Numerical Operators"
.P
The numerical comparators are:
.(D
<  <=  =  !=  >=  >
.)D
and have
meanings analogous to those of their string counterparts.
.P
The binary operators,
.T +
(addition),
.T -
(subtraction or "change-sign"),
.T *
(multiplication), and
.T /
(division)
are available.
Multiplication and division are evaluated
before addition and subtraction,
and are all evaluated left to right.
Exponentiation,
.T ^ ,
is the binary operator of highest precedence
and is evaluated right to left.
Modulo division,
.T % ,
has the same properties as division,
and is useful for tests of even/odd and the like.
NOTE: Modulo division truncates its operands to integers before
dividing.
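.P
The two less familiar rules, truncation before modulo division and
right to left grouping of exponentiation, can be illustrated in Python.
The sketch is an approximation: the function name
.T dm_mod
is hypothetical, and it assumes the remainder takes the sign of the
first operand, as integer remainder does in C.
.(D
```python
import math

def dm_mod(a, b):
    # Both operands are truncated to integers before dividing.
    return math.fmod(int(a), int(b))

print(dm_mod(7.9, 2.5))      # 1.0, computed as 7 % 2
print(dm_mod(10, 4))         # 2.0
# Exponentiation groups right to left: 2^3^2 is 2^(3^2).
print(2 ** 3 ** 2)           # 512
print((2 ** 3) ** 2)         # 64
```
.)D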
.P
Several unary functions are available:
.T l
(natural log
.T [log] ),
.T L
(base ten log
.T [Log] ),
.T e
(exponential
.T [exp] ),
.T a
(absolute value
.T [abs] ),
.T f
(floor
.T [floor] ),
.T c
(ceiling
.T [ceil] ).
Their meaning can be verified in the UNIX programmer's manual.
Single letter names for these functions
or the more mnemonic strings bracketed after their names can be used.
.LH "Logical Operators"
.P
Logical operators are of lower precedence than any other operators.
Both logical AND,
.T &
and OR
.T |
can be used to form complicated tests.
For example, to see if the first three columns are
in either increasing or decreasing order,
one could test if
.T x2
was between
.T x1
and
.T x3 :
.EG "x1<x2 & x2<x3 | x1>x2 & x2>x3
would equal
.T 1.0
if the condition was satisfied.
Parentheses are unnecessary because
.T <
and
.T >
are of
higher precedence than
.T &
which is of higher precedence than
.T | .
The unary logical operator,
.T !
(NOT), evaluates to
1.0
if its operand is
.T 0.0 ,
otherwise it equals
.T 0.0 .
Many binary operators can be immediately preceded by
.T !
to negate their value.
.T !=
is "not equal to,"
.T !|
is "neither,"
.T !&
is "not both,"
and
.T !C
is "not in."
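.P
The grouping of the ordering test above can be made explicit with parentheses.
The Python sketch below is an illustration under the assumption that
Python's
.T and
and
.T or
stand in for
.T &
and
.T | ;
the function names are hypothetical.
.(D
```python
def ordered(x1, x2, x3):
    # Comparisons bind tighter than AND, which binds tighter than OR.
    return float((x1 < x2 and x2 < x3) or (x1 > x2 and x2 > x3))

def neither(a, b):
    # Mirrors the negated OR: 1.0 only when both operands are zero.
    return float(not (a or b))

def not_both(a, b):
    # Mirrors the negated AND: 0.0 only when both operands are non-zero.
    return float(not (a and b))

print(ordered(1, 2, 3))      # 1.0
print(ordered(3, 2, 1))      # 1.0
print(ordered(1, 3, 2))      # 0.0
print(neither(0.0, 0.0))     # 1.0
print(not_both(1.0, 1.0))    # 0.0
```
.)D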
.LH "Conditional Expressions"
.P
The expressions:
.(D
if expression1 then expression2 else expression3
   expression1  ?   expression2   :  expression3
.)D
evaluate to
.T expression2
if
.T expression1
is non-zero, otherwise
they evaluate to
.T expression3 .
The first form is more mnemonic than the second, which is
consistent with C syntax.
Both forms have the same meaning.
.T expression1
has to be numerical,
.T expression2
or
.T expression3
can be numerical or string.
For example, the following expression will filter out lines
with the word
.T bad
in them.
.EG "if 'bad' C INPUT then KILL else INPUT
As another example, the following expression will print the
ratio of columns two and three if (a) there are at least three
columns, and (b) column three is not zero.
.EG "if (N >= 3) & (x3 != 0) then x2/x3 else 'bad line'
These are the only expressions, besides
.T s i
or a string constant
that can evaluate to a string.
If a conditional expression does evaluate to a string,
then it CANNOT be used in some other expression.
The conditional expression is of absolute lowest precedence
and groups left to right;
however, parentheses are recommended to make the semantics obvious.
.MH "Expression Syntax"
.P
Arithmetic expressions may be formed using
variables (with
.T x i
and
.T y i)
and constants
and can be of arbitrary complexity.
In the following table,
unary and binary operators are listed along with their
precedences and a brief description.
All unary operators are prefix except string indexing,
.T [] ,
which is postfix.  All binary operators are infix.
.P
Operators of higher precedence are executed first.
All binary operators are left associative except exponentiation,
which groups to the right.
An operator,
.T O ,
is left associative if
.T xOxOx
is parsed as
.T (xOx)Ox ,
while one that is right associative is parsed as
.T xO(xOx) .
.(D
.ft R
.ps 9
.vs 10
.ta .5iC 1iC 1.5i
Unary Operators:
.ul
	op	prec	description
	l	10	base e logarithm [log]
	L	10	base 10 logarithm [Log]
	e	10	exponential [exp]
	a	10	absolute value [abs]
	c	10	ceiling (rounds up to next integer) [ceil]
	f	10	floor (rounds down to last integer) [floor]
	#	10	number of characters in string
	[]	10	ASCII number of indexed string character
	-	9	change sign
	!	4	logical not

Binary Operators:
.ul
	op	prec	description
	^	8	exponentiation
	*	7	multiplication
	/	7	division
	%	7	modulo division
	+	6	addition
	-	6	subtraction
	=	5	test for equality (opposite !=)
	>	5	test for greater-than (opposite <=)
	<	5	test for less-than (opposite >=)
	C	5	substring (opposite !C)
	&	4	logical AND (opposite !&)
	|	3	logical OR (opposite !|)
.ps
.vs 12
.)D
.TC