Using awk

Revision: $Revision: 1.12 $ ($Date: 2004-01-29 21:20:25 $)

Resources and further reading: Robbins96; man awk

awk is an interpreted language, used mostly to extract metadata from data files. It can be used to select rows and columns from a file and to calculate derived data, such as sums. It is also often used to filter input, such as log-file data, into a more readable format.

awk uses blocks of statements, each preceded by a so-called pattern. awk opens its input file(s), reads lines of text from them and matches the contents of the input lines against the patterns specified in its program. If a pattern matches, the corresponding block of code is executed. Such patterns can be regular expressions (see the section called “Regular Expressions”), but a number of other forms are supported as well.

awk can be used to extract reports from various input files, to validate data, produce indices, manage small databases and to play around with algorithms to be used in other programming languages. Some rather complex programs have been written in awk. However, awk's capabilities are strained by tasks of such complexity. If you find yourself writing awk scripts of more than, say, a few hundred lines, you might consider using a different programming language, such as Perl, Scheme, C or C++.

To give you an impression of the look and feel of an awk program, we offer the following example. Don't worry if you do not understand (all) this code (yet), we will clarify it later.

function resume(line, part) 
{
   len=length(line)
   if (len > ( 2 * part) )
   {
      retval = substr(line,1,part) " ... " substr(line,len-part+1)

   } else {

      retval = line
   }
   return retval
}

/^90:/ {
   print resume(substr($0,4),4)
}

!/^90:/ {
   print resume($0,4)
}

The name awk is an acronym and stands for the initials of its authors: Alfred V. Aho, Peter J. Weinberger, and Brian W. Kernighan. It also pokes some fun at the somewhat awkward language that awk implements. The original version was written in 1977 at AT&T Bell Laboratories. In 1985, a new version made the programming language more powerful, introducing user-defined functions, multiple input streams and computed regular expressions. awk was enhanced and improved over many years and eventually became part of the POSIX Command Language and Utilities standard.

The GNU implementation, gawk, which is the version in use on most Unix systems nowadays, was written in 1986. In 1988, it was reworked to be compatible with newer (non-GNU) awk implementations. Current development focuses on bug fixes, performance improvements and standards compliance.

Once you are familiar with awk, you will most likely get into the habit of typing simple one-liners, for example, to extract and/or combine fields from input files. Such a one-liner is typically built like this:

$ awk 'program' input-file1 input-file2 ...

where program consists of a series of patterns and actions (more about these later). This command format instructs the shell to start awk and use the program (in single quotes to prevent shell expansion of special characters) to process records in the input file(s).

Longer awk programs are often put in files. These files are either called from the command-line interface:

$ awk --file programname input-file1 input-file2 ...

or they are contained in files that start with the line:

#!/usr/bin/awk -f

and have the execution bit set (chmod +x).
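
As a minimal sketch of the program-file form: the file name count.awk and its contents are made up for illustration. The sketch uses -f, the short POSIX spelling of the --file option shown above.

```shell
# Hypothetical program file: count the records that contain "error".
cat > count.awk <<'EOF'
/error/ { n++ }
END     { print n + 0, "matching records" }
EOF

# Run the program file against some sample input.
printf 'ok\nerror 1\nerror 2\n' | awk -f count.awk
```

With this input the program prints 2 matching records. The n + 0 forces a numeric 0 when no record matched at all.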

Now that we have an overview, let's go in and take a closer look.

Generic flow

As stated before, an awk program reads lines and processes those lines, using blocks of code. A block of code is denoted by curly braces. A typical program consists of one or more of these blocks, often preceded by patterns that determine whether or not the code in the code block should be executed for the current line of input.

More formally, an awk program is a sequence of pattern-action statements and optional function-definitions:

pattern   { action statements }

.. and optional ..

function name(parameter list) { statements }           

For each line in the input, awk tests to see if the record matches any pattern specified in the program. For each pattern that matches, the associated action is executed. The patterns are tested in the order they occur in the program.

Patterns can have various forms, for example, regular expressions or relational expressions; pattern types are described later (see the section called “Patterns”).

Actions are lines of code to be executed by awk. The regular programming structures (branching and looping) are available; so are variables and arrays. Conditional testing, matching on regular expressions and various operators and assignments are also available. All of these components are described below. The # (hash) starts a comment; the remainder of the line is ignored.
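
The flow described above can be sketched with a two-rule program. The file name order.awk and the sample data are assumptions for illustration only. Every pattern is tested for every record, in program order, so one record can trigger more than one action.

```shell
cat > order.awk <<'EOF'
# Rule 1: records that contain the letter a.
/a/     { print "has a:", $0 }
# Rule 2: records whose first field is greater than 10.
$1 > 10 { print "big:", $0 }
EOF

printf '5 apple\n20 banana\n30 kiwi\n' | awk -f order.awk
```

The record 20 banana matches both patterns and is therefore printed twice, first by rule 1 and then by rule 2.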

Variables and arrays

awk supports the use of variables. There are a number of internal variables whose values either can be read or set by both awk and the programmer. These can be used to influence awk's behavior or to query awk's environment. The programmer may also set other variables, which automatically come into existence when first used. Their values are floating point numbers or strings: automatic conversion is performed by awk if the context requires it.

Only one-dimensional arrays are supported. Arrays are always subscripted with strings, for example, x[1] (the 1 is automatically converted to the string 1) or x[ray]. You may also use a list of values (a comma-separated list of expressions) as an array subscript since these lists are converted to strings by awk. Simulations of multi-dimensional arrays can be done this way:

i="X"
j="Y"
x[i,j]="text"

In this example, the array x is subscripted by a list of expressions. However, the list is converted to a string first. awk does this conversion by concatenating the values in i and j using a special separator, stored in the built-in variable SUBSEP (by default the character with the octal value \034). Thus, the subscript would be the string X\034Y, which is an acceptable subscript for awk's one-dimensional associative arrays.
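
A small sketch of how such a combined subscript behaves. The file name subsep.awk is hypothetical. SUBSEP is the built-in variable that holds the separator, and split() is used here to take a combined subscript apart again.

```shell
cat > subsep.awk <<'EOF'
BEGIN {
    x["X", "Y"] = "text"
    # The combined subscript is the single string "X" SUBSEP "Y".
    if (("X" SUBSEP "Y") in x) print "same key"
    # Membership can also be tested with a parenthesised list.
    if (("X", "Y") in x) print "found"
    # split() recovers the parts of a combined subscript.
    for (key in x) {
        n = split(key, part, SUBSEP)
        print n, part[1], part[2], x[key]
    }
}
EOF

awk -f subsep.awk
```

This prints same key, found and 2 X Y text, each on its own line.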

To delete an entire array, you can use the command:

delete array

To delete an array named scores, for example, you could issue the command delete scores. Additionally, you can delete one member of an array by using the construction

delete array[index]

So, for example, delete scores[9]. Note that this just deletes the element from the array, but does not rearrange the members to close the gap.
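
The following sketch illustrates both forms of delete. The file name delete.awk and the array contents are made up for the example.

```shell
cat > delete.awk <<'EOF'
BEGIN {
    scores[8] = "b"; scores[9] = "c"; scores[10] = "d"

    delete scores[9]                    # remove a single element
    state = (9 in scores) ? "present" : "gone"
    print "element 9 is", state
    print scores[8], scores[10]         # the other subscripts are unchanged

    delete scores                       # remove all remaining elements
    n = 0
    for (k in scores) n++
    print n, "elements left"
}
EOF

awk -f delete.awk
```

The output is element 9 is gone, then b d, then 0 elements left. Element 10 keeps its subscript; nothing is renumbered.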

Input files, records and fields

awk is normally used with text-file input. Within awk, text files are seen as sets of records and fields. In most cases, a simplified view can be used: records are lines of text, delimited by a newline. However, it is possible to set your own record separator by setting the variable RS, which can be a regular expression or a single character. When RS contains a regular expression, the text in the input that matches this regular expression separates the records.

awk splits the input records into fields. By default, (sequences of) white space are seen as a separator. It is possible to use the value of the FS variable to define the field separator. By setting FS to the null string, each character in the input stream becomes a separate field. If FS is a single character, the records are divided into fields using that single character. In all other cases, FS should be a regular expression, and the fields will be separated by the characters that match that regular expression.
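
Two hedged one-liners to illustrate the single-character and the regular-expression case; the sample input is invented for the purpose.

```shell
# FS is a single character: fields are split on every colon.
printf 'root:x:0:0\n' | awk 'BEGIN { FS = ":" } { print $1, $3 }'

# FS is a regular expression: a comma or semicolon with optional spaces around it.
printf 'a, b ;c\n' | awk 'BEGIN { FS = " *[,;] *" } { print $2 "|" $3 }'
```

The first command prints root 0, the second prints b|c.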

To refer to a field, specify a dollar sign followed by a numerical representation of its position within the record. To refer to the first field, for example, you would use $1; the fifth field would be $5. $0 is the whole record. It is acceptable to refer to a field using a variable, like this:

n = 7
print $n

This prints the seventh field from the input record, and demonstrates the use of the print statement, which prints its arguments to standard output. To determine the number of fields in the current input record, you can use the internal variable NF. More about internal variables will be explained later on.
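
Because a field can be addressed through any expression, NF and $NF are easy to confuse. A short sketch with made-up input: NF is the number of fields, $NF is the contents of the last field, and $(NF-1) is the field before it.

```shell
printf 'a b c\none two three four five\n' | awk '{ print NF, $NF, $(NF-1) }'
```

This prints 3 c b for the first record and 5 five four for the second.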

To split records into fields, you can also set the variable FIELDWIDTHS to a space-separated list of numbers: the records will be seen as sequences of fields with fixed width, as specified in the variable, and awk will not use field separators any more, unless you assign a new value to FS.

awk allows you to assign values to the fields, and that assignment will prevail over the values assigned when awk reads the input record. Suppose the seventh field of a record contains the string scoobydoo, and you set it to another value:

$7 = "seven"

Then the seventh field will contain the string seven and $0 will be recomputed to reflect the change.

As an example, and to review a number of the concepts above, let's study this awk program:

awk -F: '{ 
    OFS="."      # set the output field separator
    print $0     # print all input fields
    $1="first"   # re-assign the first field
    print $0     # print all input fields again
}' myfile

awk is called using the -F flag to set the field separator to a colon. We could have achieved the same result by assigning the value : to the FS variable. In our example, we assume the program has been entered at the command-line interface. Therefore we placed the program between quotes to prevent variable expansion by the shell. The main block of code is specified between curly braces, and no conditions or expressions are specified before that block, therefore it will be executed for any record.

In the main code block, we start by setting the OFS variable to a dot. This signifies that all output fields will be separated by a dot. Next, the input line is printed by referring to $0. Then, on the next line, we force the first field into a fixed value, and then we print out the (newly created) full record again. On the final line, the block is closed and the input file name is specified. Assuming the file myfile contains these two lines:

a:b:c:d
1:2:3:4

The output would be:

a:b:c:d
first.b.c.d
1:2:3:4
first.2.3.4

It is possible to refer to non-existent fields or to assign values to non-existing fields, but it's beyond the scope of this introduction. See the manual pages for (GNU) awk for more details.

Branching, looping and other control statements

Within awk programs, you can use a number of control statements. Here, too, statements can be grouped together by using curly braces. A statement (or group of statements) can be preceded by a control statement. If the condition of the control statement evaluates to true, the corresponding block of code (or single statement) will be executed. For example, the if statement has the form:

 
if (condition) statement

.. or, if blocks of code are used:

 
if (condition) { statements...  }

A sample fragment that uses if and demonstrates the use of else and of blocks of code follows:

if (NF > 4) {
   print "This record has over 4 fields"
   print "It is a very lengthy record!"
   long++
} else {
   print "Never more than four - good :-)"
   short++
}

or, the simpler case, this fragment of code uses a single statement:

if (NF < 3) print "less than 3 fields found"
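
An if statement has to live inside an action. The following runnable sketch (file name fields.awk and input are assumptions) tests the number of fields, NF, for every record and reports the counters in an END block.

```shell
cat > fields.awk <<'EOF'
{
    if (NF > 4) {
        long++          # record with more than four fields
    } else {
        short++         # record with four fields or fewer
    }
}
END { print long + 0, "long,", short + 0, "short" }
EOF

printf 'a b c d e f\na b\na b c d\n' | awk -f fields.awk
```

For the three sample records this prints 1 long, 2 short.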

Looping (iteration) can be done by using either the while, do while or for statement:

 
while (condition) statement
do statement while (condition)
for (expr1; expr2; expr3) statement

An example is given below. It prints out the numbers from 1 to 10, using all three methods. This program makes use of a simple trick to make it run without input data: it consists of just a BEGIN block (see the section called “Patterns”), so awk terminates without reading any input. Also note the use of the increment operator ++ (see the section called “Operators”).

BEGIN {
   # first method:
   #
   number=1
   do { 
        print number
        number = number + 1  
   } while (number < 11)

   # second method:
   #
   number=1
   while ( number < 11 ) print number++

   # third method: 
   #
   for (number=1; number<11; number = number + 1) print number

}

The exit statement terminates the program. It can be followed by an expression that should return a value from 0...255. This value is passed on to the calling program, for example, to enable it to see why a program terminated:

$ awk 'BEGIN { exit 56 }'; echo $?
56

Note

Only values between 0 and 255 are meaningful exit values; all other values will be converted into that range.

The break statement can be used to jump out of a do, while or for loop unconditionally. The loop is terminated, and the program continues at the first statement after the loop. For example, here is another (somewhat clumsy) way to count to ten using awk:

BEGIN {
    z=0
    while (1==1) {
        if (++z == 10) break
        print z
    }
    print "10"
}

Under normal circumstances, all commands within a block of code executed under the control of a while, do or for statement will be executed for each iteration. The continue statement can be used to start the next iteration immediately, thereby skipping the code between the continue statement and the end of the loop. Yet another way to count to ten, for example, is:

BEGIN {
    z=1
    while (1==1) {
        print z
        if (++z < 10) continue
        print "10" 
        break
    }
}

Patterns

As stated before, awk programs are composed of blocks of code prepended by patterns. The blocks of code will be executed if the pattern applies.

Patterns can be in any of the following forms:

BEGIN
END
/regular expression/
relational expression
pattern && pattern
pattern || pattern
pattern ? pattern : pattern
(pattern)
! pattern
pattern1, pattern2         

BEGIN and END.  An awk program can have multiple BEGIN and END patterns. The actions in the BEGIN block will be executed before input is read, the actions in the END block after the input is terminated. Blocks of code that start with a BEGIN pattern are merged together into one virtual BEGIN block. The same applies to END blocks. Thus, a (somewhat unusable) interactive awk program like this:

BEGIN {
    print "1"
}
END { 
    print "a"
}
BEGIN {
    print "2"
}
END {
    print "b"
}
{ exit }

.. assuming you would enter some input, would print:

1
2
.. some input you gave ..
a
b
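
A more typical use of BEGIN and END is to print a header before the data and a summary after it. The sketch below is illustrative only; the file name total.awk and the input are made up.

```shell
cat > total.awk <<'EOF'
BEGIN { print "item amount" }          # runs before any input is read
      { print $1, $2; sum += $2 }      # runs for every record
END   { print "total", sum }           # runs after the last record
EOF

printf 'tea 3\ncoffee 4.5\n' | awk -f total.awk
```

This prints the header, the two records and finally total 7.5.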

Regular Expressions.  For blocks prepended with a /regexp/ pattern (see the section called “Regular Expressions”), the code found in the corresponding block will be executed for each input record that matches the regular expression regexp specified. awk uses an extended set of regular expressions, similar to egrep(1) (remember that in some, mostly older, awk implementations the interval expressions are either not supported at all or can only be enabled on request).

Relational expressions.  Another way to select which block to execute for which record is using relational expressions. Such an expression may use any of the operators that are also available in actions (see the section called “Operators”). Frequently, this is used to match on fields within a record:

$ awk '$1 == "01" { print }'

This would print all input records (lines) that have the string 01 as their first field. You can also check fields to match regular expressions, by using the match operator ~ (the tilde character), as is done in this one-liner:

$ awk '$1 ~ /[abZ]x/ { print $2 " " $3 }'

Given the following input:

Zx the rain
cx and snow 
ax in Spain 
gx and France

this would produce:

the rain 
in Spain

Logical operators.  Combinations of pattern expressions can be made by using the logical operators && (and), || (or) and ! (not). As in C and Perl, these work in a short-circuit fashion: if enough of the expression is resolved to determine the outcome (true or false) of the expression, the other patterns are not resolved. Thus, given this example:

  
( $1 > 0 ) && ( $2 < 42 )

if $1 had the value 0, the first part of the expression would be false. Because the result of the second expression, ( $2 < 42 ), is then of no relevance to the outcome of the entire expression, it will not be evaluated. This example:

$ awk ' $1 == 3 || $2 ~ /[a-z]9/ { print }'

would print all records where the first field has the value 3, or, if the first field does not contain a string with the value 3, would check if the second field matches the regular expression [a-z]9.
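
Short-circuit evaluation can be made visible with a function that counts how often it is evaluated. This is only a sketch: the function seen(), the counters and the file name shortcircuit.awk are hypothetical. The second operand of && is evaluated only when the first one is true.

```shell
cat > shortcircuit.awk <<'EOF'
# Hypothetical helper: always true, but counts its own evaluations.
function seen() { calls++; return 1 }

$1 > 0 && seen() { hits++ }

END { print hits + 0, "hits,", calls + 0, "calls" }
EOF

printf '1\n0\n5\n-2\n' | awk -f shortcircuit.awk
```

Only two of the four records have a first field greater than 0, so seen() is evaluated twice and the output is 2 hits, 2 calls.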

Conditional operators.  The conditional operator pattern ? pattern : pattern (sometimes called the ternary operator) works just like the same operator in C. If the first pattern is true, then the pattern used for testing is the second pattern, otherwise it is the third. Only one of the second and third patterns is evaluated.

$ awk '$1==3 ? $2=="b" : $2=="c" { print }'

This means that, if the first field has the value 3, the record will be printed if the second field contains the string b. If the first field does not have the value 3, the record will be printed if the second field contains the string c. All other records are not printed.

Range operator.  Finally, there is the form where a range of records can be specified. This is done by specifying the start of the range, a comma and the end of a range. In this example, selected records will be printed:

$ awk '$1=="start", $1=="stop" { print }'

This program will suppress lines in its input until a record (line) is found in the input where the first field is the string start. That record, and all records following it, are printed, up to and including the first record where the first field is the string stop. Any records following that line will silently be discarded, until the next record is found where the first field is the word start.
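
To see the range pattern at work, the one-liner can be fed some made-up input in which the range opens twice:

```shell
printf 'a\nstart 1\nb\nstop 1\nc\nstart 2\nd\n' |
    awk '$1=="start", $1=="stop" { print }'
```

This prints start 1, b, stop 1, start 2 and d. The records a and c fall outside a range. The second range is never closed, so it runs to the end of the input.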

Operators

awk supports standard operators similar to those used in C. From the manual pages:

Table 13.13. awk operators

Operator                   Description
(...)                      Grouping, for example, to force the order of evaluation
++ --                      Increment and decrement, both prefix and postfix
^ **                       Exponentiation. Both alternatives support the same functionality
+ - !                      Unary plus, unary minus and logical negation
* / %                      Multiplication, division and modulus
+ -                        Addition, subtraction
(space)                    String concatenation
< <= > >= != ==            Relational operators: less than, less than or equal, greater than, greater than or equal, not equal, equal
~ !~                       Regular expression match, negated match
in                         Array membership
&& ||                      Logical and, logical or
?:                         This has the form expr1 ? expr2 : expr3. If expr1 is true, expr2 is evaluated, else expr3 is evaluated.
= += -= *= /= %= ^= **=    Assignments. The basic form is var = value. The operator form, var operator= value, is an alternate way to write var = var operator value; for example, b *= 3 is the same as b = b * 3. Note that the form ^= is an alternate way of writing **=
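
A few of these operators in action, as a small sketch (the file name operators.awk is made up). Only the ^ spelling of exponentiation is used here, on the assumption that not every awk implementation accepts **.

```shell
cat > operators.awk <<'EOF'
BEGIN {
    print 2 ^ 10            # exponentiation
    print 7 % 3             # modulus
    s = "a" "b" 1 + 2       # + binds tighter than concatenation
    print s
    n = 5; n *= 3           # same as n = n * 3
    print n
    arr["k"] = 1
    state = ("k" in arr) ? "member" : "absent"
    print state
}
EOF

awk -f operators.awk
```

The output is 1024, 1, ab3, 15 and member, one value per line.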


Using regular expressions

The chapter on Regular Expressions (see the section called “Regular Expressions”) gives a thorough overview of the regular expressions that can be used in either awk, perl or sed. This section, therefore, focuses on using them within awk, rather than on the Regular Expressions themselves.

A regular expression can be used as a pattern by enclosing it in slashes. Then the regular expression is tested against the entire text of each record. Normally, it only needs to match some part of the text in order to succeed. This, for example, prints the second field of each record that contains the characters snow anywhere in it:

$ awk '/snow/ { print $2 }' phonelist 

Built-in variables

An exhaustive list of all built-in variables can be found in the manual pages for awk; only the most commonly used ones are explained here. The built-in variables FS and OFS have been demonstrated before. They can be used to force the Field Separator and Output Field Separator by setting them from within an awk code block. The FNR variable holds the number of the current record within the current input file (the NR variable holds the total number of records read so far). A very basic one-liner to generate numbered program listings, for example, could look like this:

 
awk '{ printf "%05d %s\n",FNR, $0 }' 

By default, records are separated by newline characters, that is, awk reads lines of text by default and treats each line as a record. This behaviour can be changed by setting the variable RS. RS contains either a single character or a regular expression. Text in the input that matches that character or regular expression is used as a separator. The environment can be read by using the contents of the internal array ENVIRON. To read the variable TERM from the environment, you would use ENVIRON["TERM"].
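
The sketch below shows FNR next to two related built-in variables that are not discussed above: NR, the total number of records read so far, and FILENAME, the name of the current input file. The file names one.txt and two.txt and the environment variable DEMO_VAR are made up for the example.

```shell
printf 'a\nb\n' > one.txt
printf 'c\nd\n' > two.txt

# NR keeps counting across files, FNR restarts at 1 for each file.
awk '{ print FILENAME, NR, FNR, $0 }' one.txt two.txt

# ENVIRON gives access to the environment of the awk process.
DEMO_VAR=hello awk 'BEGIN { print ENVIRON["DEMO_VAR"] }'
```

The first command prints one.txt 1 1 a, one.txt 2 2 b, two.txt 3 1 c and two.txt 4 2 d; the second prints hello.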

Functions

awk allows the definition of functions by the programmer and has a large number of built-in functions.

Built-in functions

The built-in functions consist of mathematical functions, string functions, time functions and the system function.

The mathematical functions are: atan2(y, x), cos(expr), exp(expr), int(expr), log(expr), sin(expr) and sqrt(expr).

Additionally, there are two functions available to work with random numbers: srand([expr]), which seeds the random generator, and rand(), which returns a random number between 0 and 1.

A rich set of string functions is available; please consult the manual pages for a full description. A short list of the most commonly used functions follows: index(s,t) returns the position of the string t in the string s (or 0 if t does not occur in s), length([s]) returns the length of the string specified (or the length of the input record if no string is specified), match(s,r) returns the position in string s where the regular expression r first occurs, and substr(s,i[,n]) returns the substring of string s starting at position i, with optional length n.
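
A short sketch of these four functions (the file name strings.awk and the sample string are made up). Besides returning the position, match() also sets the built-in variables RSTART and RLENGTH to the start and the length of the matched text.

```shell
cat > strings.awk <<'EOF'
BEGIN {
    s = "hello world"
    print length(s)              # number of characters in s
    print index(s, "world")      # position where "world" starts
    print substr(s, 1, 5)        # five characters, starting at position 1
    print substr(s, 7)           # from position 7 to the end
    if (match(s, /o+/)) {
        print RSTART, RLENGTH    # start and length of the first match
    }
}
EOF

awk -f strings.awk
```

This prints 11, 7, hello, world and 5 1, each on its own line.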

One of the primary uses of awk programs is processing log files that contain time-stamp information. awk provides two functions for obtaining time stamps and formatting them: systime() returns the current time of day as the number of seconds since midnight UTC, January 1st, 1970, and strftime([format [, timestamp]]) formats time stamps according to the form specified in format. To learn more about the format string used with this function, the manual pages for strftime(3) should be consulted.

Self-written functions

User-defined functions were added later to awk, and therefore, the implementation is sometimes, somewhat, well, awkward. User-defined functions are constructed following this template:

function name(parameter list) { statements }

As an example, we'll list a function that returns a shortened version of a string (you may be familiar with it, since we've used it before in our examples):

function resume(line, part) 
{
   len=length(line)
   if (len > ( 2 * part) )
   {
      retval = substr(line,1,part) " ... " substr(line,len-part+1)

   } else {

      retval = line
   }
   return retval
}

This function is named resume (reh-suh-may). It returns a short version of its input line, consisting of an equal number of characters from the beginning and the end of the line, joined by an ellipsis. It has two parameters:

  • line, the line to determine the resume from, and

  • part, which is the exact number of characters counted from the beginning and the end of the line that should be displayed.

Within the function, the parameters will be available as (local) variables line and part. This function returns its result, which in this case will be a string.

Arrays are passed to functions by reference; therefore, changing the value of one of the members of an array will be visible in the other parts of the program, too. Regular variables are passed by value; therefore, when a function alters such a value, the change is not visible to other parts of the program. For example:

function fna(a) { a[0]=3 }
function fnb(b) { b=4 }
BEGIN {
        a[0]=4
        fna(a)
        print a[0]
     
        b=8
        fnb(b)
        print b
}

This would print

3
8

Note that when calling a user-defined function, the name should be followed immediately by the left parenthesis, without any white space. This restriction does not apply to the function definition or to built-in functions.

Functions are executed when called from within expressions in either actions (see the example above) or patterns. An example of usage of a function in a pattern follows:

function has_length_of (string, len)
{
    return (length(string) == len)
}

has_length_of($0,4) { print "four" }

This will print out the string four for each line that contains exactly 4 characters.

The provision for local variables is rather clumsy: they are declared as extra parameters in the parameter list. The convention is to separate local variables from real parameters by extra spaces in the parameter list, e.g.:

function  f(p, q,     a, b)   # a & b are local
{
     ... code ...
}

{ f(1, 2) }

Functions may call each other and may be recursive. To return a value from a function, you should use return expr. The return value will be undefined if no value is provided.
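
Recursion and the extra-parameter convention for local variables can be combined in one sketch. The function fact() and the file name fact.awk are made up for illustration; result is local to the function because it appears in the parameter list, so the global variable with the same name is left alone.

```shell
cat > fact.awk <<'EOF'
# Recursive factorial; "result" after the extra spaces is a local variable.
function fact(n,    result)
{
    if (n <= 1) return 1
    result = n * fact(n - 1)
    return result
}

BEGIN {
    result = "untouched"      # global variable with the same name
    print fact(5)
    print result
}
EOF

awk -f fact.awk
```

This prints 120 followed by untouched.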

Copyright Snow B.V. The Netherlands