Info file gawk-info, produced by Makeinfo, -*- Text -*- from input
file gawk.texinfo.

This file documents `awk', a program that you can use to select
particular records in a file and perform operations upon them.

Copyright (C) 1989 Free Software Foundation, Inc.

Permission is granted to make and distribute verbatim copies of this
manual provided the copyright notice and this permission notice are
preserved on all copies.

Permission is granted to copy and distribute modified versions of
this manual under the conditions for verbatim copying, provided that
the entire resulting derived work is distributed under the terms of a
permission notice identical to this one.

Permission is granted to copy and distribute translations of this
manual into another language, under the above conditions for modified
versions, except that this permission notice may be stated in a
translation approved by the Foundation.


▶1f◀
File: gawk-info,  Node: Function Calls,  Next: Precedence,  Prev: Conditional Exp,  Up: Expressions

Function Calls
==============

A "function" is a name for a particular calculation.  Because it has
a name, you can ask for it by name at any point in the program.  For
example, the function `sqrt' computes the square root of a number.

A fixed set of functions are "built in", which means they are
available in every `awk' program.  The `sqrt' function is one of
these.  *Note Built-in::, for a list of built-in functions and their
descriptions.  In addition, you can define your own functions in the
program for use elsewhere in the same program.  *Note User-defined::,
for how to do this.

The way to use a function is with a "function call" expression, which
consists of the function name followed by a list of "arguments" in
parentheses.  The arguments are expressions which give the raw
materials for the calculation that the function will do.  When there
is more than one argument, they are separated by commas.  If there
are no arguments, write just `()' after the function name.  Here are
some examples:

     sqrt(x**2 + y**2)    # One argument
     atan2(y, x)          # Two arguments
     rand()               # No arguments

*Do not put any space between the function name and the
open-parenthesis!*  A user-defined function name looks just like the
name of a variable, and space would make the expression look like
concatenation of a variable with an expression inside parentheses. 
Space before the parenthesis is harmless with built-in functions, but
it is best not to get into the habit of using space, lest you do
likewise for a user-defined function one day by mistake.

Each function expects a particular number of arguments.  For example,
the `sqrt' function must be called with a single argument, the number
to take the square root of:

     sqrt(ARGUMENT)

Some of the built-in functions allow you to omit the final argument. 
If you do so, they use a reasonable default.  *Note Built-in::, for
full details.  If arguments are omitted in calls to user-defined
functions, then those arguments are treated as local variables,
initialized to the null string (*note User-defined::.).

Like every other expression, the function call has a value, which is
computed by the function based on the arguments you give it.  In this
example, the value of `sqrt(ARGUMENT)' is the square root of the
argument.  A function can also have side effects, such as assigning
the values of certain variables or doing I/O.

Here is a command to read numbers, one number per line, and print the
square root of each one:

     awk '{ print "The square root of", $1, "is", sqrt($1) }'


▶1f◀
File: gawk-info,  Node: Precedence,  Prev: Function Calls,  Up: Expressions

Operator Precedence: How Operators Nest
=======================================

"Operator precedence" determines how operators are grouped, when
different operators appear close by in one expression.  For example,
`*' has higher precedence than `+'; thus, `a + b * c' means to
multiply `b' and `c', and then add `a' to the product.

You can overrule the precedence of the operators by writing
parentheses yourself.  You can think of the precedence rules as
saying where the parentheses are assumed if you do not write
parentheses yourself.  In fact, it is wise always to use parentheses
whenever you have an unusual combination of operators, because other
people who read the program may not remember what the precedence is
in this case.  You might forget, too; then you could make a mistake. 
Explicit parentheses will prevent any such mistake.

When operators of equal precedence are used together, the leftmost
operator groups first, except for the assignment, conditional and
exponentiation operators, which group in the opposite order.  Thus,
`a - b + c' groups as `(a - b) + c'; `a = b = c' groups as `a = (b =
c)'.

The precedence of prefix unary operators does not matter as long as
only unary operators are involved, because there is only one way to
parse them--innermost first.  Thus, `$++i' means `$(++i)' and `++$x'
means `++($x)'.  However, when another operator follows the operand,
then the precedence of the unary operators can matter.  Thus, `$x**2'
means `($x)**2', but `-x**2' means `-(x**2)', because `-' has lower
precedence than `**' while `$' has higher precedence.
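
The grouping rules above can be checked directly.  The following is a
minimal sketch, not part of the original text: the shell function
name `show_grouping' is hypothetical, and `^' is used in place of
`**' on the assumption that some `awk' implementations accept only
`^'.

```shell
# show_grouping is a hypothetical helper name used only for this sketch.
show_grouping() {
    awk 'BEGIN {
        x = 3
        print -x^2        # unary minus has lower precedence: -(x^2)
        print 2^3^2       # exponentiation groups right-to-left: 2^(3^2)
        print 10 - 4 + 2  # equal precedence groups left-to-right: (10 - 4) + 2
    }'
}
show_grouping
```

If the rules hold as described, this prints -9, 512 and 8, one per
line.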

Here is a table of the operators of `awk', in order of increasing
precedence:

assignment
     `=', `+=', `-=', `*=', `/=', `%=', `^=', `**='.  These operators
     group right-to-left.

conditional
     `?:'.  These operators group right-to-left.

logical ``or''
     `||'.

logical ``and''
     `&&'.

array membership
     `in'.

matching
     `~', `!~'.

relational, and redirection
     The relational operators and the redirections have the same
     precedence level.  Characters such as `>' serve both as
     relationals and as redirections; the context distinguishes
     between the two meanings.

     The relational operators are `<', `<=', `==', `!=', `>=' and `>'.

     The I/O redirection operators are `<', `>', `>>' and `|'.

     Note that I/O redirection operators in `print' and `printf'
     statements belong to the statement level, not to expressions. 
     The redirection does not produce an expression which could be
     the operand of another operator.  As a result, it does not make
     sense to use a redirection operator near another operator of
     lower precedence, without parentheses.  Such combinations, for
     example `print foo > a ? b : c', result in syntax errors.

concatenation
     No special token is used to indicate concatenation.  The
     operands are simply written side by side.

add, subtract
     `+', `-'.

multiply, divide, mod
     `*', `/', `%'.

unary plus, minus, ``not''
     `+', `-', `!'.

exponentiation
     `^', `**'.  These operators group right-to-left.

increment, decrement
     `++', `--'.

field
     `$'.


▶1f◀
File: gawk-info,  Node: Statements,  Next: Arrays,  Prev: Expressions,  Up: Top

Actions: Control Statements
***************************

"Control statements" such as `if', `while', and so on control the
flow of execution in `awk' programs.  Most of the control statements
in `awk' are patterned on similar statements in C.

All the control statements start with special keywords such as `if'
and `while', to distinguish them from simple expressions.

Many control statements contain other statements; for example, the
`if' statement contains another statement which may or may not be
executed.  The contained statement is called the "body".  If you want
to include more than one statement in the body, group them into a
single compound statement with curly braces, separating them with
newlines or semicolons.


* Menu:

* If Statement::            Conditionally execute some `awk' statements.

* While Statement::         Loop until some condition is satisfied.

* Do Statement::            Do specified action while looping until some
                            condition is satisfied.

* For Statement::           Another looping statement, that provides
                            initialization and increment clauses.

* Break Statement::         Immediately exit the innermost enclosing loop.

* Continue Statement::      Skip to the end of the innermost enclosing loop.

* Next Statement::          Stop processing the current input record.

* Exit Statement::          Stop execution of `awk'.

 
▶1f◀
File: gawk-info,  Node: If Statement,  Next: While Statement,  Prev: Statements,  Up: Statements

The `if' Statement
==================

The `if'-`else' statement is `awk''s decision-making statement.  It
looks like this:

     if (CONDITION) THEN-BODY [else ELSE-BODY]

Here CONDITION is an expression that controls what the rest of the
statement will do.  If CONDITION is true, THEN-BODY is executed;
otherwise, ELSE-BODY is executed (assuming that the `else' clause is
present).  The `else' part of the statement is optional.  The
condition is considered false if its value is zero or the null
string, true otherwise.

Here is an example:

     if (x % 2 == 0)
         print "x is even"
     else
         print "x is odd"

In this example, if the expression `x % 2 == 0' is true (that is, the
value of `x' is divisible by 2), then the first `print' statement is
executed; otherwise, the second `print' statement is executed.

If the `else' appears on the same line as THEN-BODY, and THEN-BODY is
not a compound statement (i.e., not surrounded by curly braces), then
a semicolon must separate THEN-BODY from `else'.  To illustrate this,
let's rewrite the previous example:

     awk '{ if (x % 2 == 0) print "x is even"; else
             print "x is odd" }'

If you forget the `;', `awk' won't be able to parse the statement,
and you will get a syntax error.

We would not actually write this example this way, because a human
reader might fail to see the `else' if it were not the first thing on
its line.
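
As a small sketch of the truth rule stated earlier (zero and the null
string count as false, anything else as true), the following tests
four constant conditions.  It is only an illustration; the function
name `show_truth' is hypothetical.  Each `else' shares a line with
its THEN-BODY, so each is preceded by the required semicolon.

```shell
# show_truth is a hypothetical helper name used only for this sketch.
show_truth() {
    awk 'BEGIN {
        if (0)   print "0 is true";    else print "0 is false"
        if ("")  print "null is true"; else print "null is false"
        if (7)   print "7 is true";    else print "7 is false"
        if ("a") print "a is true";    else print "a is false"
    }'
}
show_truth
```

The first two conditions should take the `else' branch and the last
two the THEN-BODY.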


▶1f◀
File: gawk-info,  Node: While Statement,  Next: Do Statement,  Prev: If Statement,  Up: Statements

The `while' Statement
=====================

In programming, a "loop" means a part of a program that is (or at
least can be) executed two or more times in succession.

The `while' statement is the simplest looping statement in `awk'.  It
repeatedly executes a statement as long as a condition is true.  It
looks like this:

     while (CONDITION)
       BODY

Here BODY is a statement that we call the "body" of the loop, and
CONDITION is an expression that controls how long the loop keeps
running.

The first thing the `while' statement does is test CONDITION.  If
CONDITION is true, it executes the statement BODY.  (Truth, as usual
in `awk', means that the value of CONDITION is not zero and not a
null string.)  After BODY has been executed, CONDITION is tested
again, and if it is still true, BODY is executed again.  This process
repeats until CONDITION is no longer true.  If CONDITION is initially
false, the body of the loop is never executed.

This example prints the first three fields of each record, one per
line.

     awk '{ i = 1
            while (i <= 3) {
                print $i
                i++
            }
     }'

Here the body of the loop is a compound statement enclosed in braces,
containing two statements.

The loop works like this: first, the value of `i' is set to 1.  Then,
the `while' tests whether `i' is less than or equal to three.  This
is the case when `i' equals one, so the `i'-th field is printed. 
Then the `i++' increments the value of `i' and the loop repeats.  The
loop terminates when `i' reaches 4.

As you can see, a newline is not required between the condition and
the body; but using one makes the program clearer unless the body is
a compound statement or is very simple.  The newline after the
open-brace that begins the compound statement is not required either,
but the program would be hard to read without it.


▶1f◀
File: gawk-info,  Node: Do Statement,  Next: For Statement,  Prev: While Statement,  Up: Statements

The `do'-`while' Statement
==========================

The `do' loop is a variation of the `while' looping statement.  The
`do' loop executes the BODY once, then repeats BODY as long as
CONDITION is true.  It looks like this:

     do
       BODY
     while (CONDITION)

Even if CONDITION is false at the start, BODY is executed at least
once (and only once, unless executing BODY makes CONDITION true). 
Contrast this with the corresponding `while' statement:

     while (CONDITION)
       BODY

This statement does not execute BODY even once if CONDITION is false
to begin with.

Here is an example of a `do' statement:

     awk '{ i = 1
            do {
               print $0
               i++
            } while (i <= 10)
     }'

This program prints each input record ten times.  It isn't a very realistic
example, since in this case an ordinary `while' would do just as
well.  But this reflects actual experience; there is only
occasionally a real use for a `do' statement.
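
The difference from `while' shows up only when CONDITION is false at
the start.  Here is a minimal sketch of that case; the function name
`show_do_vs_while' and the messages are made up for illustration.

```shell
# show_do_vs_while is a hypothetical helper name used only for this sketch.
show_do_vs_while() {
    awk 'BEGIN {
        i = 5
        # The condition is false from the start, yet the body runs once.
        do {
            print "do body ran with i =", i
        } while (i < 3)
        # The same condition in a while loop: the body never runs.
        while (i < 3) {
            print "while body ran with i =", i
        }
    }'
}
show_do_vs_while
```

Only the line from the `do' body should appear in the output.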


▶1f◀
File: gawk-info,  Node: For Statement,  Next: Break Statement,  Prev: Do Statement,  Up: Statements

The `for' Statement
===================

The `for' statement makes it more convenient to count iterations of a
loop.  The general form of the `for' statement looks like this:

     for (INITIALIZATION; CONDITION; INCREMENT)
       BODY

This statement starts by executing INITIALIZATION.  Then, as long as
CONDITION is true, it repeatedly executes BODY and then INCREMENT. 
Typically INITIALIZATION sets a variable to either zero or one,
INCREMENT adds 1 to it, and CONDITION compares it against the desired
number of iterations.

Here is an example of a `for' statement:

     awk '{ for (i = 1; i <= 3; i++)
               print $i
     }'

This prints the first three fields of each input record, one field
per line.

In the `for' statement, BODY stands for any statement, but
INITIALIZATION, CONDITION and INCREMENT are just expressions.  You
cannot set more than one variable in the INITIALIZATION part unless
you use a multiple assignment statement such as `x = y = 0', which is
possible only if all the initial values are equal.  (But you can
initialize additional variables by writing their assignments as
separate statements preceding the `for' loop.)

The same is true of the INCREMENT part; to increment additional
variables, you must write separate statements at the end of the loop.
The C compound expression, using C's comma operator, would be useful
in this context, but it is not supported in `awk'.

Most often, INCREMENT is an increment expression, as in the example
above.  But this is not required; it can be any expression whatever. 
For example, this statement prints all the powers of 2 between 1 and
100:

     for (i = 1; i <= 100; i *= 2)
       print i

Any of the three expressions in the parentheses following `for' may
be omitted if there is nothing to be done there.  Thus,
`for (;x > 0;)' is equivalent to `while (x > 0)'.  If the
CONDITION is omitted, it is treated as TRUE, effectively yielding an
infinite loop.

In most cases, a `for' loop is an abbreviation for a `while' loop, as
shown here:

     INITIALIZATION
     while (CONDITION) {
       BODY
       INCREMENT
     }

The only exception is when the `continue' statement (*note Continue
Statement::.) is used inside the loop; changing a `for' statement to
a `while' statement in this way can change the effect of the
`continue' statement inside the loop.

There is an alternate version of the `for' loop, for iterating over
all the indices of an array:

     for (i in array)
         DO SOMETHING WITH array[i]

*Note Arrays::, for more information on this version of the `for' loop.

The `awk' language has a `for' statement in addition to a `while'
statement because often a `for' loop is both less work to type and
more natural to think of.  Counting the number of iterations is very
common in loops.  It can be easier to think of this counting as part
of looping rather than as something to do inside the loop.

The next section has more complicated examples of `for' loops.


▶1f◀
File: gawk-info,  Node: Break Statement,  Next: Continue Statement,  Prev: For Statement,  Up: Statements

The `break' Statement
=====================

The `break' statement jumps out of the innermost `for', `while', or
`do'-`while' loop that encloses it.  The following example finds the
smallest divisor of any integer, and also identifies prime numbers:

     awk '# find smallest divisor of num
          { num = $1
            for (div = 2; div*div <= num; div++)
              if (num % div == 0)
                break
            if (num % div == 0)
              printf "Smallest divisor of %d is %d\n", num, div
            else
              printf "%d is prime\n", num  }'

When the remainder is zero in the first `if' statement, `awk'
immediately "breaks out" of the containing `for' loop.  This means
that `awk' proceeds immediately to the statement following the loop
and continues processing.  (This is very different from the `exit'
statement (*note Exit Statement::.) which stops the entire `awk'
program.)

Here is another program equivalent to the previous one.  It
illustrates how the CONDITION of a `for' or `while' could just as
well be replaced with a `break' inside an `if':

     awk '# find smallest divisor of num
          { num = $1
            for (div = 2; ; div++) {
              if (num % div == 0) {
                printf "Smallest divisor of %d is %d\n", num, div
                break
              }
              if (div*div > num) {
                printf "%d is prime\n", num
                break
              }
            }
     }'


▶1f◀
File: gawk-info,  Node: Continue Statement,  Next: Next Statement,  Prev: Break Statement,  Up: Statements

The `continue' Statement
========================

The `continue' statement, like `break', is used only inside `for',
`while', and `do'-`while' loops.  It skips over the rest of the loop
body, causing the next cycle around the loop to begin immediately. 
Contrast this with `break', which jumps out of the loop altogether. 
Here is an example:

     # print names that don't contain the string "ignore"
     
     # first, save the text of each line
     { names[NR] = $0 }
     
     # print what we're interested in
     END {
        for (x in names) {
            if (names[x] ~ /ignore/)
                continue
            print names[x]
        }
     }

If one of the input records contains the string `ignore', this
example skips the print statement for that record, and continues back
to the first statement in the loop.

This isn't a practical example of `continue', since it would be just
as easy to write the loop like this:

     for (x in names)
       if (names[x] !~ /ignore/)
         print names[x]

The `continue' statement in a `for' loop directs `awk' to skip the
rest of the body of the loop, and resume execution with the
increment-expression of the `for' statement.  The following program
illustrates this fact:

     awk 'BEGIN {
          for (x = 0; x <= 20; x++) {
              if (x == 5)
                  continue
              printf ("%d ", x)
          }
          print ""
     }'

This program prints all the numbers from 0 to 20, except for 5, for
which the `printf' is skipped.  Since the increment `x++' is not
skipped, `x' does not remain stuck at 5.  Contrast the `for' loop
above with the `while' loop:

     awk 'BEGIN {
          x = 0
          while (x <= 20) {
              if (x == 5)
                  continue
              printf ("%d ", x)
              x++
          }
          print ""
     }'

This program loops forever once `x' gets to 5.


▶1f◀
File: gawk-info,  Node: Next Statement,  Next: Exit Statement,  Prev: Continue Statement,  Up: Statements

The `next' Statement
====================

The `next' statement forces `awk' to immediately stop processing the
current record and go on to the next record.  This means that no
further rules are executed for the current record.  The rest of the
current rule's action is not executed either.

Contrast this with the effect of the `getline' function (*note
Getline::.).  That too causes `awk' to read the next record
immediately, but it does not alter the flow of control in any way. 
So the rest of the current action executes with a new input record.

At the grossest level, `awk' program execution is a loop that reads
an input record and then tests each rule's pattern against it.  If
you think of this loop as a `for' statement whose body contains the
rules, then the `next' statement is analogous to a `continue'
statement: it skips to the end of the body of this implicit loop, and
executes the increment (which reads another record).

For example, if your `awk' program works only on records with four
fields, and you don't want it to fail when given bad input, you might
use this rule near the beginning of the program:

     NF != 4 {
       printf("line %d skipped: doesn't have 4 fields\n", FNR) > "/dev/stderr"
       next
     }

so that the following rules will not see the bad record.  The error
message is redirected to the standard error output stream, as error
messages should be.  *Note Special Files::.

The `next' statement is not allowed in a `BEGIN' or `END' rule.
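
To see the effect of `next' on later rules, here is a stripped-down
sketch of the same idea, without the error message.  The sample input
and the function name `show_next' are assumptions made for the
illustration.

```shell
# show_next is a hypothetical helper name used only for this sketch.
show_next() {
    printf '1 2 3 4\nshort line\n5 6 7 8\n' |
    awk 'NF != 4 { next }               # skip records without four fields
         { print "processing:", $0 }    # never sees the skipped records'
}
show_next
```

The second rule should report only the two four-field records.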


▶1f◀
File: gawk-info,  Node: Exit Statement,  Prev: Next Statement,  Up: Statements

The `exit' Statement
====================

The `exit' statement causes `awk' to immediately stop executing the
current rule and to stop processing input; any remaining input is
ignored.

If an `exit' statement is executed from a `BEGIN' rule, the program
stops processing everything immediately.  No input records are read. 
However, if an `END' rule is present, it is executed (*note
BEGIN/END::.).

If `exit' is used as part of an `END' rule, it causes the program to
stop immediately.

An `exit' statement that is part of an ordinary rule (that is, not part
of a `BEGIN' or `END' rule) stops the execution of any further
automatic rules, but the `END' rule is executed if there is one.  If
you don't want the `END' rule to do its job in this case, you can set
a variable to nonzero before the `exit' statement, and check that
variable in the `END' rule.

If an argument is supplied to `exit', its value is used as the exit
status code for the `awk' process.  If no argument is supplied,
`exit' returns status zero (success).

For example, let's say you've discovered an error condition you
really don't know how to handle.  Conventionally, programs report
this by exiting with a nonzero status.  Your `awk' program can do
this using an `exit' statement with a nonzero argument.  Here's an
example of this:

     BEGIN {
            if (("date" | getline date_now) <= 0) {
              print "Can't get system date" > "/dev/stderr"
              exit 4
            }
     }
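
The technique of setting a variable before `exit' and checking it in
the `END' rule can be sketched as follows.  The variable name
`failed', the status value 3, the sample input and the function name
`show_exit' are all chosen for the illustration.

```shell
# show_exit is a hypothetical helper name used only for this sketch.
show_exit() {
    printf 'ok\nbad\nnever seen\n' |
    awk '/bad/ { failed = 1; exit 3 }   # stop reading input, status 3
         { print "read:", $0 }
         END { if (failed) print "stopped early"; else print "all input read" }'
}
# The right-hand side runs only when awk returns a nonzero status.
show_exit || echo "exit status: $?"
```

The `END' rule still runs after the `exit', and the status given to
`exit' is what the shell sees.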


▶1f◀
File: gawk-info,  Node: Arrays,  Next: Built-in,  Prev: Statements,  Up: Top

Arrays in `awk'
***************

An "array" is a table of various values, called "elements".  The
elements of an array are distinguished by their "indices".  Indices
may be either numbers or strings.  Each array has a name, which looks
like a variable name, but must not be in use as a variable name in
the same `awk' program.


* Menu:

* Intro: Array Intro.      Basic facts about arrays in `awk'.
* Reference to Elements::  How to examine one element of an array.
* Assigning Elements::     How to change an element of an array.
* Example: Array Example.  Sample program explained.

* Scanning an Array::      A variation of the `for' statement.  It loops
                           through the indices of an array's existing elements.

* Delete::                 The `delete' statement removes an element from an array.

* Multi-dimensional::      Emulating multi-dimensional arrays in `awk'.
* Multi-scanning::         Scanning multi-dimensional arrays.

 
▶1f◀
File: gawk-info,  Node: Array Intro,  Next: Reference to Elements,  Prev: Arrays,  Up: Arrays

Introduction to Arrays
======================

The `awk' language has one-dimensional "arrays" for storing groups of
related strings or numbers.

Every `awk' array must have a name.  Array names have the same syntax
as variable names; any valid variable name would also be a valid
array name.  But you cannot use one name in both ways (as an array
and as a variable) in one `awk' program.

Arrays in `awk' superficially resemble arrays in other programming
languages; but there are fundamental differences.  In `awk', you
don't need to specify the size of an array before you start to use it.
What's more, in `awk' any number or even a string may be used as an
array index.

In most other languages, you have to "declare" an array and specify
how many elements or components it has.  In such languages, the
declaration causes a contiguous block of memory to be allocated for
that many elements.  An index in the array must be a nonnegative
integer; for example, the index 0 specifies the first element in the
array, which is actually stored at the beginning of the block of
memory.  Index 1 specifies the second element, which is stored in
memory right after the first element, and so on.  It is impossible to
add more elements to the array, because it has room for only as many
elements as you declared.

A contiguous array of four elements might look like this,
conceptually, if the element values are 8, `"foo"', `""' and 30:

     +---------+---------+--------+---------+
     |    8    |  "foo"  |   ""   |    30   |    value
     +---------+---------+--------+---------+
          0         1         2         3        index

Only the values are stored; the indices are implicit from the order
of the values.  8 is the value at index 0, because 8 appears in the
position with 0 elements before it.

Arrays in `awk' are different: they are "associative".  This means
that each array is a collection of pairs: an index, and its
corresponding array element value:

     Element 4     Value 30
     Element 2     Value "foo"
     Element 1     Value 8
     Element 3     Value ""

We have shown the pairs in jumbled order because their order doesn't
mean anything.

One advantage of an associative array is that new pairs can be added
at any time.  For example, suppose we add to that array a tenth
element whose value is `"number ten"'.  The result is this:

     Element 10    Value "number ten"
     Element 4     Value 30
     Element 2     Value "foo"
     Element 1     Value 8
     Element 3     Value ""

Now the array is "sparse" (i.e., some indices are missing): it has
elements 4 and 10, but doesn't have elements 5, 6, 7, 8, or 9.

Another consequence of associative arrays is that the indices don't
have to be positive integers.  Any number, or even a string, can be
an index.  For example, here is an array which translates words from
English into French:

     Element "dog" Value "chien"
     Element "cat" Value "chat"
     Element "one" Value "un"
     Element 1     Value "un"

Here we decided to translate the number 1 in both spelled-out and
numeric form--thus illustrating that a single array can have both
numbers and strings as indices.

When `awk' creates an array for you, e.g., with the `split' built-in
function (*note String Functions::.), that array's indices are
consecutive integers starting at 1.
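
A brief sketch of this behavior follows.  The array name `parts', the
sample string and the function name `show_split' are made up for the
example.

```shell
# show_split is a hypothetical helper name used only for this sketch.
show_split() {
    awk 'BEGIN {
        n = split("red green blue", parts, " ")   # n is the number of pieces
        for (i = 1; i <= n; i++)
            print i, parts[i]
        if (!(0 in parts))
            print "no element 0"                  # indices start at 1, not 0
    }'
}
show_split
```

The pieces should be stored at indices 1, 2 and 3, with no element at
index 0.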


▶1f◀
File: gawk-info,  Node: Reference to Elements,  Next: Assigning Elements,  Prev: Array Intro,  Up: Arrays

Referring to an Array Element
=============================

The principal way of using an array is to refer to one of its elements.
An array reference is an expression which looks like this:

     ARRAY[INDEX]

Here ARRAY is the name of an array.  The expression INDEX is the
index of the element of the array that you want.

The value of the array reference is the current value of that array
element.  For example, `foo[4.3]' is an expression for the element of
array `foo' at index 4.3.

If you refer to an array element that has no recorded value, the
value of the reference is `""', the null string.  This includes
elements to which you have not assigned any value, and elements that
have been deleted (*note Delete::.).  Such a reference automatically
creates that array element, with the null string as its value.  (In
some cases, this is unfortunate, because it might waste memory inside
`awk'.)

You can find out if an element exists in an array at a certain index
with the expression:

     INDEX in ARRAY

This expression tests whether or not the particular index exists,
without the side effect of creating that element if it is not present.
The expression has the value 1 (true) if `ARRAY[INDEX]' exists, and 0
(false) if it does not exist.

For example, to test whether the array `frequencies' contains the
index `"2"', you could write this statement:

     if ("2" in frequencies) print "Subscript \"2\" is present."

Note that this is *not* a test of whether or not the array
`frequencies' contains an element whose *value* is `"2"'.  (There is
no way to do that except to scan all the elements.)  Also, this *does
not* create `frequencies["2"]', while the following (incorrect)
alternative would do so:

     if (frequencies["2"] != "") print "Subscript \"2\" is present."
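
The side effect described above can be observed with a short sketch.
The array name `freq' and the function name `show_membership' are
hypothetical.

```shell
# show_membership is a hypothetical helper name used only for this sketch.
show_membership() {
    awk 'BEGIN {
        # The in test does not create the element.
        if ("2" in freq) print "present before any reference"
        # A plain reference creates freq["2"] with a null value.
        if (freq["2"] != "") print "has a non-null value"
        # Now the element exists, although nothing was assigned to it.
        if ("2" in freq) print "present after the reference"
    }'
}
show_membership
```

Only the last message should be printed: the element is absent at
first and exists, with the null string as its value, once it has been
referenced.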


▶1f◀
File: gawk-info,  Node: Assigning Elements,  Next: Array Example,  Prev: Reference to Elements,  Up: Arrays

Assigning Array Elements
========================

Array elements are lvalues: they can be assigned values just like
`awk' variables:

     ARRAY[SUBSCRIPT] = VALUE

Here ARRAY is the name of your array.  The expression SUBSCRIPT is
the index of the element of the array that you want to assign a
value.  The expression VALUE is the value you are assigning to that
element of the array.


▶1f◀
File: gawk-info,  Node: Array Example,  Next: Scanning an Array,  Prev: Assigning Elements,  Up: Arrays

Basic Example of an Array
=========================

The following program takes a list of lines, each beginning with a
line number, and prints them out in order of line number.  The line
numbers are not in order, however, when they are first read:  they
are scrambled.  This program sorts the lines by making an array using
the line numbers as subscripts.  It then prints out the lines in
sorted order of their numbers.  It is a very simple program, and gets
confused if it encounters repeated numbers, gaps, or lines that don't
begin with a number.

     {
       if ($1 > max)
         max = $1
       arr[$1] = $0
     }
     
     END {
       for (x = 1; x <= max; x++)
         print arr[x]
     }

The first rule keeps track of the largest line number seen so far; it
also stores each line into the array `arr', at an index that is the
line's number.

The second rule runs after all the input has been read, to print out
all the lines.

When this program is run with the following input:

     5  I am the Five man
     2  Who are you?  The new number two!
     4  . . . And four on the floor
     1  Who is number one?
     3  I three you.

its output is this:

     1  Who is number one?
     2  Who are you?  The new number two!
     3  I three you.
     4  . . . And four on the floor
     5  I am the Five man

If a line number is repeated, the last line with a given number
overrides the others.

Gaps in the line numbers can be handled with an easy improvement to
the program's `END' rule:

     END {
       for (x = 1; x <= max; x++)
         if (x in arr)
           print arr[x]
     }


▶1f◀
File: gawk-info,  Node: Scanning an Array,  Next: Delete,  Prev: Array Example,  Up: Arrays

Scanning All Elements of an Array
=================================

In programs that use arrays, often you need a loop that executes once
for each element of an array.  In other languages, where arrays are
contiguous and indices are limited to nonnegative integers, this is
easy: the largest index is one less than the length of the array, and
you can find all the valid indices by counting from zero up to that
value.  This technique won't do the job in `awk', since any number or
string may be an array index.  So `awk' has a special kind of `for'
statement for scanning an array:

     for (VAR in ARRAY)
       BODY

This loop executes BODY once for each different value that your
program has previously used as an index in ARRAY, with the variable
VAR set to that index.

Here is a program that uses this form of the `for' statement.  The
first rule scans the input records and notes which words appear (at
least once) in the input, by storing a 1 into the array `used' with
the word as index.  The second rule scans the elements of `used' to
find all the distinct words that appear in the input.  It prints each
word that is more than 10 characters long, and also prints the number
of such words.  *Note Built-in::, for more information on the
built-in function `length'.

     # Record a 1 for each word that is used at least once.
     {
       for (i = 1; i <= NF; i++)
         used[$i] = 1
     }
     
     # Find number of distinct words more than 10 characters long.
     END {
       num_long_words = 0
       for (x in used)
         if (length(x) > 10) {
           ++num_long_words
           print x
         }
       print num_long_words, "words longer than 10 characters"
     }

*Note Sample Program::, for a more detailed example of this type.

The order in which elements of the array are accessed by this
statement is determined by the internal arrangement of the array
elements within `awk' and cannot be controlled or changed.  This can
lead to problems if new elements are added to ARRAY by statements in
BODY; you cannot predict whether or not the `for' loop will reach
them.  Similarly, changing VAR inside the loop can produce strange
results.  It is best to avoid such things.
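
Because the scanning order is not under your control, a program that
needs a predictable order has to impose one itself.  The following is
a minimal sketch, assuming the standard `sort' utility is available
(it is a separate program, not part of `awk'); the input words and the
names `used', `w' and `out' are chosen only for illustration.

```shell
# Print each distinct word once.  The for-in order is unspecified,
# so the external sort utility imposes a predictable order.
out=$(printf 'pear apple\nfig apple pear\n' | awk '
{
  for (i = 1; i <= NF; i++)
    used[$i] = 1
}

END {
  for (w in used)
    print w
}' | sort)
printf '%s\n' "$out"
```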


▶1f◀
File: gawk-info,  Node: Delete,  Next: Multi-dimensional,  Prev: Scanning an Array,  Up: Arrays

The `delete' Statement
======================

You can remove an individual element of an array using the `delete'
statement:

     delete ARRAY[INDEX]

When an array element is deleted, it is as if you had never referred
to it and had never given it any value.  Any value the element
formerly had can no longer be obtained.

Here is an example of deleting elements in an array:

     for (i in frequencies)
       delete frequencies[i]

This example removes all the elements from the array `frequencies'.

If you delete an element, a subsequent `for' statement to scan the
array will not report that element, and the `in' operator to check
for the presence of that element will return 0:

     delete foo[4]
     if (4 in foo)
       print "This will never be printed"
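
To see the effect of `delete' on the `in' operator directly, you can
test the same index before and after the deletion.  This is only a
small sketch; the array name `foo' follows the example above, and the
words `present' and `absent' are arbitrary labels.

```shell
# Check membership before and after deleting an element.
out=$(awk 'BEGIN {
  foo[4] = "x"
  print ((4 in foo) ? "present" : "absent")
  delete foo[4]
  print ((4 in foo) ? "present" : "absent")
}')
printf '%s\n' "$out"
```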


▶1f◀
File: gawk-info,  Node: Multi-dimensional,  Next: Multi-scanning,  Prev: Delete,  Up: Arrays

Multi-dimensional Arrays
========================

A multi-dimensional array is an array in which an element is
identified by a sequence of indices, not a single index.  For
example, a two-dimensional array requires two indices.  The usual way
(in most languages, including `awk') to refer to an element of a
two-dimensional array named `grid' is with `grid[X,Y]'.

Multi-dimensional arrays are supported in `awk' through concatenation
of indices into one string.  What happens is that `awk' converts the
indices into strings (*note Conversion::.) and concatenates them
together, with a separator between them.  This creates a single
string that describes the values of the separate indices.  The
combined string is used as a single index into an ordinary,
one-dimensional array.  The separator used is the value of the
built-in variable `SUBSEP'.

For example, suppose we evaluate the expression `foo[5,12]="value"'
when the value of `SUBSEP' is `"@"'.  The numbers 5 and 12 are
converted to strings and concatenated with an `@' between them,
yielding `"5@12"'; thus, the array element `foo["5@12"]' is set to
`"value"'.

Once the element's value is stored, `awk' has no record of whether it
was stored with a single index or a sequence of indices.  The two
expressions `foo[5,12]' and `foo[5 SUBSEP 12]' always have the same
value.
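
The equivalence can be checked with a short sketch.  It sets `SUBSEP'
to `"@"' purely so that the combined index is printable; the shell
variable `out' and the loop variable `k' are illustrative names.

```shell
# With SUBSEP set to "@", foo[5,12] and foo["5@12"] name one element.
out=$(awk 'BEGIN {
  SUBSEP = "@"
  foo[5,12] = "value"
  print foo["5@12"]
  print foo[5 SUBSEP 12]
  for (k in foo)
    print k
}')
printf '%s\n' "$out"
```

The loop prints a single index, which shows that only one element was
ever created.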

The default value of `SUBSEP' is actually the string `"\034"', which
contains a nonprinting character that is unlikely to appear in an
`awk' program or in the input data.

The usefulness of choosing an unlikely character comes from the fact
that index values that contain a string matching `SUBSEP' lead to
combined strings that are ambiguous.  Suppose that `SUBSEP' were
`"@"'; then `foo["a@b", "c"]' and `foo["a", "b@c"]' would be
indistinguishable because both would actually be stored as
`foo["a@b@c"]'.  Because `SUBSEP' is `"\034"', such confusion can
actually happen only when an index contains the character with ASCII
code 034, which is a rare event.
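
The ambiguity described above is easy to reproduce.  In this sketch
`SUBSEP' is deliberately set to `"@"' so that the two index sequences
collide; the stored strings `first' and `second' and the counter `n'
are made up for the demonstration.

```shell
# Two different index sequences collapse into the same combined string.
out=$(awk 'BEGIN {
  SUBSEP = "@"
  foo["a@b", "c"] = "first"
  foo["a", "b@c"] = "second"
  n = 0
  for (k in foo)
    n++
  print n, foo["a@b@c"]
}')
printf '%s\n' "$out"
```

Only one element exists afterward, and the second assignment has
overwritten the first.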

You can test whether a particular index-sequence exists in a
``multi-dimensional'' array with the same operator `in' used for
one-dimensional arrays.  Instead of a single index as the
left-hand operand, write the whole sequence of indices, separated by
commas, in parentheses:

     (SUBSCRIPT1, SUBSCRIPT2, ...) in ARRAY
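
Here is a small sketch of this test in use.  The array name `grid'
echoes the earlier discussion; the stored value and the messages are
arbitrary.

```shell
# Test for the presence of index sequences in a two-dimensional array.
out=$(awk 'BEGIN {
  grid[1, 2] = "x"
  if ((1, 2) in grid)
    print "(1,2) is present"
  if (!((2, 1) in grid))
    print "(2,1) is absent"
}')
printf '%s\n' "$out"
```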

The following example treats its input as a two-dimensional array of
fields; it rotates this array 90 degrees clockwise and prints the
result.  It assumes that all lines have the same number of elements.

     awk '{
          if (max_nf < NF)
               max_nf = NF
          max_nr = NR
          for (x = 1; x <= NF; x++)
               vector[x, NR] = $x
     }
     
     END {
          for (x = 1; x <= max_nf; x++) {
               for (y = max_nr; y >= 1; --y)
                    printf("%s ", vector[x, y])
               printf("\n")
          }
     }'

When given the input:

     1 2 3 4 5 6
     2 3 4 5 6 1
     3 4 5 6 1 2
     4 5 6 1 2 3

it produces:

     4 3 2 1
     5 4 3 2
     6 5 4 3
     1 6 5 4
     2 1 6 5
     3 2 1 6


▶1f◀
File: gawk-info,  Node: Multi-scanning,  Prev: Multi-dimensional,  Up: Arrays

Scanning Multi-dimensional Arrays
=================================

There is no special `for' statement for scanning a
``multi-dimensional'' array; there cannot be one, because in truth
there are no multi-dimensional arrays or elements; there is only a
multi-dimensional *way of accessing* an array.

However, if your program has an array that is always accessed as
multi-dimensional, you can get the effect of scanning it by combining
the scanning `for' statement (*note Scanning an Array::.) with the
`split' built-in function (*note String Functions::.).  It works like
this:

     for (combined in ARRAY) {
       split(combined, separate, SUBSEP)
       ...
     }

This finds each concatenated, combined index in the array, and splits
it into the individual indices by breaking it apart where the value
of `SUBSEP' appears.  The split-out indices become the elements of
the array `separate'.

Thus, suppose you have previously stored a value in `ARRAY[1, "foo"]';
then an element with index `"1\034foo"' exists in ARRAY.  (Recall that
the default value of `SUBSEP' contains the character with code 034.)
Sooner or later the `for' statement will find that index and do an
iteration with `combined' set to `"1\034foo"'.  Then the `split'
function is called as follows:

     split("1\034foo", separate, "\034")

The result of this is to set `separate[1]' to 1 and `separate[2]' to
`"foo"'.  Presto, the original sequence of separate indices has been
recovered.
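
The whole procedure can be tried out with the following sketch.  It
relies on the default value of `SUBSEP'; the array name `arr', the
stored value and the variable `n' (the count returned by `split') are
illustrative.

```shell
# Recover the separate indices of a combined multi-dimensional index.
out=$(awk 'BEGIN {
  arr[1, "foo"] = "v"
  for (combined in arr) {
    n = split(combined, separate, SUBSEP)
    print n, separate[1], separate[2]
  }
}')
printf '%s\n' "$out"
```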


▶1f◀
File: gawk-info,  Node: Built-in,  Next: User-defined,  Prev: Arrays,  Up: Top

Built-in Functions
******************

"Built-in" functions are functions that are always available for your
`awk' program to call.  This chapter defines all the built-in
functions in `awk'; some of them are mentioned in other sections, but
they are summarized here for your convenience.  (You can also define
new functions yourself.  *Note User-defined::.)


* Menu:

* Calling Built-in::   How to call built-in functions.

* Numeric Functions::  Functions that work with numbers,
                       including `int', `sin' and `rand'.

* String Functions::   Functions for string manipulation,
                       such as `split', `match', and `sprintf'.

* I/O Functions::      Functions for files and shell commands


▶1f◀
File: gawk-info,  Node: Calling Built-in,  Next: Numeric Functions,  Prev: Built-in,  Up: Built-in

Calling Built-in Functions
==========================

To call a built-in function, write the name of the function followed
by arguments in parentheses.  For example, `atan2(y + z, 1)' is a
call to the function `atan2', with two arguments.

Whitespace is ignored between the built-in function name and the
open-parenthesis, but we recommend that you avoid using whitespace
there.  User-defined functions do not permit whitespace in this way,
and you will find it easier to avoid mistakes by following a simple
convention which always works: no whitespace after a function name.

Each built-in function accepts a certain number of arguments.  In
most cases, any extra arguments given to built-in functions are
ignored.  The defaults for omitted arguments vary from function to
function and are described under the individual functions.

When a function is called, expressions that create the function's
actual parameters are evaluated completely before the function call
is performed.  For example, in the code fragment:

     i = 4
     j = sqrt(i++)

the variable `i' is set to 5 before `sqrt' is called with a value of
4 for its actual parameter.
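
You can confirm this by printing both variables after the call.  The
fragment is wrapped in a `BEGIN' rule here only so that it runs
without any input; the shell variable `out' is an illustrative name.

```shell
# The argument i++ is fully evaluated before sqrt is called:
# sqrt receives 4, and i is already 5 afterward.
out=$(awk 'BEGIN {
  i = 4
  j = sqrt(i++)
  print i, j
}')
printf '%s\n' "$out"
```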


▶1f◀
File: gawk-info,  Node: Numeric Functions,  Next: String Functions,  Prev: Calling Built-in,  Up: Built-in

Numeric Built-in Functions
==========================

Here is a full list of built-in functions that work with numbers:

`int(X)'
     This gives you the integer part of X, truncated toward 0.  This
     produces the nearest integer to X, located between X and 0.

     For example, `int(3)' is 3, `int(3.9)' is 3, `int(-3.9)' is -3,
     and `int(-3)' is -3 as well.

`sqrt(X)'
     This gives you the positive square root of X.  It reports an
     error if X is negative.  For example, `sqrt(4)' is 2.

`exp(X)'
     This gives you the exponential of X, or reports an error if X is
     out of range.  The range of values X can have depends on your
     machine's floating point representation.

`log(X)'
     This gives you the natural logarithm of X, if X is positive;
     otherwise, it reports an error.

`sin(X)'
     This gives you the sine of X, with X in radians.

`cos(X)'
     This gives you the cosine of X, with X in radians.

`atan2(Y, X)'
     This gives you the arctangent of `Y / X', with the result
     expressed in radians.

`rand()'
     This gives you a random number.  The values of `rand' are
     uniformly distributed between 0 and 1.  The value can be 0 but
     is never 1.

     Often you want random integers instead.  Here is a user-defined
     function you can use to obtain a random nonnegative integer less
     than N:

          function randint(n) {
               return int(n * rand())
          }

     The multiplication produces a random real number greater than or
     equal to 0 and less than N.  We then make it an integer (using
     `int') between 0 and `N - 1'.

     Here is an example where a similar function is used to produce
     random integers between 1 and N:

          awk '
          # Function to roll a simulated die.
          function roll(n) { return 1 + int(rand() * n) }
          
          # Roll 3 six-sided dice and print total number of points.
          {
                printf("%d points\n", roll(6)+roll(6)+roll(6))
          }'

     *Note:* `rand' starts generating numbers from the same point, or
     "seed", each time you run `awk'.  This means that a program will
     produce the same results each time you run it.  The numbers are
     random within one `awk' run, but predictable from run to run. 
     This is convenient for debugging, but if you want a program to
     do different things each time it is used, you must change the
     seed to a value that will be different in each run.  To do this,
     use `srand'.

`srand(X)'
     The function `srand' sets the starting point, or "seed", for
     generating random numbers to the value X.

     Each seed value leads to a particular sequence of ``random''
     numbers.  Thus, if you set the seed to the same value a second
     time, you will get the same sequence of ``random'' numbers again.

     If you omit the argument X, as in `srand()', then the current
     date and time of day are used for a seed.  This is the way to
     get random numbers that change from one run to the next.

     The return value of `srand' is the previous seed.  This makes it
     easy to keep track of the seeds for use in consistently
     reproducing sequences of random numbers.
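
A few of the functions above can be exercised together in one short
sketch.  It shows `int' truncating toward 0, `atan2' returning an
angle in radians (the arctangent of `0 / -1' taken in the correct
quadrant is pi), and `srand' reproducing a sequence when the same seed
is reused.  The seed 42 and the variable names are arbitrary choices
made for this illustration.

```shell
# Exercise int, atan2 and srand/rand from a BEGIN rule.
out=$(awk 'BEGIN {
  print int(3.9), int(-3.9)        # truncation toward 0
  printf("%.4f\n", atan2(0, -1))   # pi, to four decimal places
  srand(42); a = rand()
  srand(42); b = rand()            # same seed, same first value
  print ((a == b) ? "same" : "different")
}')
printf '%s\n' "$out"
```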