Revision: $Revision: 1.12 $ ($Date: 2004-01-29 21:20:25 $)
Resources and further reading: Robbins96; man awk
awk is an interpreted language, used mostly to extract metadata from data files. It can be used to select rows and columns from a file and to calculate derived data, such as sums. It is also often used to filter input, such as log-file data, into a more readable format.
awk uses blocks of statements, each optionally preceded by a so-called pattern. awk opens its input file(s), reads lines of text from them and matches the contents of the input lines against the patterns specified in its program. If a pattern matches, the corresponding block of code is executed. Such patterns can be regular expressions (see the section called “Regular Expressions”), but a number of other forms are supported as well.
awk can be used to extract reports from various input files, to validate data, produce indices, manage small databases and to play around with algorithms to be used in other programming languages. Some rather complex programs have been written in awk. However, awk's capabilities are strained by tasks of such complexity. If you find yourself writing awk scripts of more than, say, a few hundred lines, you might consider using a different programming language, such as Perl, Scheme, C or C++.
To give you an impression of the look and feel of an awk program, we offer the following example. Don't worry if you do not understand (all) this code (yet), we will clarify it later.
function resume(line, part)
{
    len = length(line)
    if (len > (2 * part))
    {
        # first and last "part" characters, joined by an ellipsis
        retval = substr(line, 1, part) " ... " substr(line, len - part + 1)
    } else {
        retval = line
    }
    return retval
}

/^90:/ {
    print resume(substr($0, 4), 4)
}

!/^90:/ {
    print resume($0, 4)
}
The name awk is an acronym and stands for the initials of its authors: Alfred V. Aho, Peter J. Weinberger, and Brian W. Kernighan. It also pokes some fun at the somewhat awkward language that awk implements. The original version was written in 1977 at AT&T Bell Laboratories. In 1985, a new version made the programming language more powerful, introducing user-defined functions, multiple input streams and computed regular expressions. awk was enhanced and improved over many years and eventually became part of the POSIX Command Language and Utilities standard.
The GNU implementation, gawk, which is the version in use on many Linux systems nowadays, was written in 1986. In 1988, it was reworked to be compatible with newer (non-GNU) awk implementations. Current development focuses on bug fixes, performance improvements and standards compliance.
Once you are familiar with awk, you will most likely get into the habit of typing simple one-liners, for example, to extract and/or combine fields from input files. Such a one-liner is typically built like this:
$ awk 'program' input-file1 input-file2 ...
where program consists of a series of patterns and actions (more about these later). This command format instructs the shell to start awk and use the program (in single quotes to prevent shell expansion of special characters) to process records in the input file(s).
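As a minimal sketch of such a one-liner, the following command prints the second field of each line, a comma, and then the first field. The shell function name swap_names and the sample input are made up for this illustration.

```shell
# Hypothetical helper: print the second field, a comma, then the first field.
swap_names() {
    awk '{ print $2 ", " $1 }' "$@"
}

# The sample input is invented for this illustration.
printf 'Alfred Aho\nBrian Kernighan\n' | swap_names
```

When no input files are named, awk reads its standard input, which is why the pipe works here.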
Longer awk programs are often put in files. These files are either called from the command-line interface:
$ awk --file programname input-file1 input-file2 ...
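or they are contained in files that start with the line:
#!/usr/bin/awk -f
and have the execution bit set (chmod +x). The -f option is required on this line, so that awk reads its program from the script file itself.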
Now that we have an overview, let's go in and take a closer look.
As stated before, an awk program reads lines and processes those lines, using blocks of code. A block of code is denoted by curly braces. A typical program consists of one or more of these blocks, often preceded by patterns that determine whether or not the code in the code block should be executed for the current line of input.
More formally, an awk program is a sequence of pattern-action statements and optional function-definitions:
pattern { action statements }
.. and optional ..
function name(parameter list) { statements }
For each line in the input, awk tests to see if the record matches any pattern specified in the program. For each pattern that matches, the associated action is executed. The patterns are tested in the order they occur in the program.
Patterns can have various forms, for example, regular expressions or relational expressions; the pattern types are described later.
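To illustrate that every pattern is tested against every record, in program order, here is a small sketch with two rules. The function name classify and the input lines are assumptions made for the example.

```shell
# Hypothetical helper: both rules are tested against every record, in order.
classify() {
    awk '
    /a/     { print "contains a: " $0 }   # regular-expression pattern
    NF == 2 { print "two fields: " $0 }   # relational-expression pattern
    ' "$@"
}

# "a b" matches both rules, so it is printed twice.
printf 'a b\nc d\napple\n' | classify
```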
Actions are lines of code to be executed by awk. The regular programming structures (branching and looping) are available; so are variables and arrays. Conditional testing, matching on regular expressions and various operators and assignments are also available. All of these components are described below. The # (hash) is used to start a comment; the remainder of the line will be ignored.
awk supports the use of variables. There are a number of internal variables whose values either can be read or set by both awk and the programmer. These can be used to influence awk's behavior or to query awk's environment. The programmer may also set other variables, which automatically come into existence when first used. Their values are floating point numbers or strings: automatic conversion is performed by awk if the context requires it.
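The automatic conversion between numbers and strings can be seen in a short sketch like the following. The variable and function names are invented for the illustration.

```shell
# Hypothetical helper that shows automatic type conversion.
show_conversion() {
    awk 'BEGIN {
        count = "3" + 4          # the string "3" is converted to the number 3
        label = count " items"   # the number 7 is converted to the string "7"
        print count
        print label
        print unset_var + 0      # a variable that was never set counts as 0
    }'
}

show_conversion
```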
Only one-dimensional arrays are supported. Arrays are always subscripted
with strings, for example, x[1] (the 1 is
automatically converted to the string “1”) or
x[ray]. You may also use a list of
values (a comma-separated list of expressions) as an array subscript
since these lists are converted to strings by awk.
Simulations of multi-dimensional arrays can be done this way:
i="X"
j="Y"
x[i,j]="text"
In this example, the array “x” is subscripted by a list of expressions. However, the list is converted to a string first. awk does this conversion by concatenating the values in i and j using a special separator, which is held in the built-in variable SUBSEP (by default the character with the octal value \034). Thus, the subscript would be the string “X\034Y”, which is an acceptable subscript for awk's one-dimensional associative arrays.
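A minimal sketch of how such a combined subscript can be tested and taken apart again is given below. The function name grid_demo is an assumption; split() and SUBSEP are standard awk.

```shell
# Hypothetical helper: store one element under a two-part subscript.
grid_demo() {
    awk 'BEGIN {
        x["X", "Y"] = "text"
        for (key in x) {
            split(key, part, SUBSEP)   # recover the two original subscripts
            print part[1], part[2], x[key]
        }
        if (("X", "Y") in x) print "found"
    }'
}

grid_demo
```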
To delete an entire array, you can use the command:
delete array
To delete an array named scores, for example, you could issue the command delete scores. Additionally, you can delete one member of an array by using the construction
delete array[index]
So, for example, delete scores[9]. Note that this just deletes the element from the array, but does not rearrange the members to close the “gap”.
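The following sketch shows that deleting one element leaves the other subscripts untouched. The array contents and the function name delete_demo are made up for the example.

```shell
# Hypothetical helper: delete one element and inspect the rest.
delete_demo() {
    awk 'BEGIN {
        scores[8] = 70; scores[9] = 85; scores[10] = 90
        delete scores[9]
        if (!(9 in scores)) print "scores[9] is gone"
        print scores[8], scores[10]   # the other elements keep their subscripts
    }'
}

delete_demo
```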
awk is normally used with text-file input. Within awk, text files are seen as sets of records and fields. In most cases, a simplified view can be used: records are lines of text, delimited by a newline. However, it is possible to set your own record separator by setting the variable RS, which can be a regular expression or a single character. When RS contains a regular expression, the text in the input that matches this regular expression separates the records.
awk splits the input records into fields. By default,
(sequences of) white space are seen as a separator. It is possible to
use the value of the FS variable to define the
field separator. By setting FS to the null string,
each character in the input stream becomes a separate field. If
FS is a single character, the records
are divided into fields using that single character. In all other
cases, FS should be a regular expression, and the fields will be
separated by the characters that match that regular expression.
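As an illustration of FS holding a regular expression, the sketch below accepts either a comma or a semicolon as field separator. The function name second_field and the input are assumptions made for this example.

```shell
# Hypothetical helper: fields are separated by a comma or a semicolon.
second_field() {
    awk 'BEGIN { FS = "[,;]" } { print $2 }' "$@"
}

printf 'a,b;c\n1;2,3\n' | second_field
```

Setting FS in a BEGIN block makes sure the separator is in place before the first record is read.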
To refer to a field, specify a dollar sign followed by a numerical representation of its position within the record. To refer to the first field, for example, you would use $1, the fifth field would be $5. $0 is the whole record.
It is acceptable to refer to a field using a variable, like this:
n = 7
print $n
This prints the seventh field from the input record. The example also demonstrates the use of the print statement, which prints its arguments to standard output. To determine the number of fields in the current input record, you can use the internal variable NF. More about internal variables will be explained later on.
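The difference between NF (the number of fields) and $NF (the contents of the last field) is easy to overlook. A small sketch, with an invented function name and invented input, makes it visible:

```shell
# Hypothetical helper: print the field count, then the last field itself.
count_fields() {
    awk '{ print NF ": " $NF }' "$@"
}

printf 'one two three\nalpha beta\n' | count_fields
```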
To split records into fields, you can also set the variable FIELDWIDTHS (a GNU awk feature) to a space-separated list of numbers: the records will be seen as sequences of fields with fixed width, as specified in the variable, and awk will not use field separators any more, unless you assign a new value to FS.
awk allows you to assign values to the fields,
and that assignment will prevail over the values assigned when
awk reads the input record.
Suppose the seventh field of a record
contains the string scoobydoo, and
you set it to another value:
$7 = "seven"
Then the seventh field will contain the string
seven and the $0 will
be recomputed to reflect the change.
As an example, and to review a number of the concepts above, let's study this awk program:
awk -F: '{
    OFS="."      # set the output field separator
    print $0     # print all input fields
    $1="first"   # re-assign the first field
    print $0     # print all input fields again
}' myfile
awk is called using the -F flag to
set the field separator to a colon. We could have achieved the same result
by assigning the value “:” to the FS variable.
In our example, we assume the program has been
entered at the command-line interface.
Therefore we placed the program between quotes to prevent
variable expansion by the shell.
The main block of code is specified between curly braces, and no conditions or expressions are specified before that block, therefore it will be executed for any record.
In the main code block, we start by setting the OFS variable to a dot. This signifies that all output fields will be separated by a dot. Next, the input line is printed by referring to $0. Then, on the next line, we force the first field into a fixed value, and then we print out the (newly created) full record again. On the final line, the block is closed and the input file name is specified. Assuming the file myfile contains these two lines:
a:b:c:d
1:2:3:4
The output would be:
a:b:c:d
first.b.c.d
1:2:3:4
first.2.3.4
It is possible to refer to non-existent fields or to assign values to non-existing fields, but it's beyond the scope of this introduction. See the manual pages for (GNU) awk for more details.
Within awk programs, you can use a number of control statements. Here, too, statements can be grouped together by using curly braces. A statement (or group of statements) can be preceded by a control statement. If the condition of the control statement evaluates to true, the corresponding block of code (or single statement) will be executed. For example, the if statement has the form:
if (condition) statement
.. or, if blocks of code are used:
if (condition) { statements... }
A sample program that uses if and demonstrates the use of else and the use of blocks of code follows:
if (NF > 4) {
    print "This record has over 4 fields"
    print "It is a very lengthy record!"
    long++
} else {
    print "Never more than four - good :-)"
    short++
}
or, the simpler case, this fragment of code uses a single statement:
if (NF < 3) print "fewer than 3 fields found"
Looping (iteration) can be done by using either the while,
do while or for statement:
while (condition) statement
do statement while (condition)
for (expr1; expr2; expr3) statement
An example is given below. It prints out the numbers from 1 to 10, using all three methods. Additionally, the keyword exit is introduced. exit simply terminates the program. This program makes use of a simple trick to make it run without input data: it consists of just a BEGIN block (more about that later). Also note the use of the increment operator ++.
BEGIN {
    # first method:
    #
    number=1
    do {
        print number
        number = number + 1
    } while (number < 11)

    # second method:
    #
    number=1
    while ( number < 11 ) print number++

    # third method:
    #
    for (number=1; number<11; number = number + 1) print number
}
The exit statement terminates the program.
It can be followed by an expression that should return a value
from 0...255. This value is passed on to the calling program,
for example, to enable it to see why a program terminated:
$ awk 'BEGIN { exit 56 }'; echo $?
56
Only values between 0 and 255 are meaningful exit values; all other values will be converted into that range.
The break statement can be used to jump out of a do, while or for loop unconditionally. The loop is terminated, and the program continues at the first statement after the loop. For example, here is another (somewhat clumsy) way to count to ten using awk:
BEGIN {
    z=0
    while (1==1) {
        if (++z == 10) break
        print z
    }
    print "10"
}
Under normal circumstances, all commands within a block of code executed under the control of either a while, do or for statement will be executed for each iteration. The continue statement can be used to perform the next iteration, thereby skipping the code between the continue statement and the end of the loop.
Yet another way to count to ten, for example, is:
BEGIN {
    z=1                        # start at 1, so that 1 to 10 is printed
    while (1==1) {
        print z
        if (++z < 10) continue
        print "10"
        break
    }
}
As stated before, awk programs are composed of blocks of code prepended by patterns. The blocks of code will be executed if the pattern applies.
Patterns can be in any of the following forms:
BEGIN
END
/regular expression/
relational expression
pattern && pattern
pattern || pattern
pattern ? pattern : pattern
(pattern)
! pattern
pattern1, pattern2
BEGIN and END. An awk program can have multiple BEGIN and END patterns. The actions in the BEGIN block will be executed before input is read, the actions in the END block after the input is terminated. Blocks of code that start with a BEGIN pattern are merged together into one virtual BEGIN block. The same applies to END blocks. Thus, a (somewhat unusable) interactive awk program like this:
BEGIN {
    print "1"
}
END {
    print "a"
}
BEGIN {
    print "2"
}
END {
    print "b"
}
{ exit }
.. assuming you would enter some input, would print:
1
2
.. some input you gave ..
a
b
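A more typical use of BEGIN and END is to print a header before the input is read and a summary after it has been processed. The sketch below sums the first column of its input; the function name sum_column and the numbers are invented for the example.

```shell
# Hypothetical helper: print a header, add up column one, print the total.
sum_column() {
    awk '
    BEGIN { print "summing first column" }   # runs before any input is read
          { total += $1 }                    # runs for every record
    END   { print "total: " total }          # runs after the last record
    ' "$@"
}

printf '10\n20\n12\n' | sum_column
```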
Regular Expressions. For blocks preceded by a /regexp/ pattern (see the section called “Regular Expressions”), the code found in the corresponding block will be executed for each input record that matches the regular expression regexp. awk uses an extended set of regular expressions, similar to egrep(1) (remember that, depending on the awk version, interval expressions are either not supported at all or can only be enabled on request).
Relational expressions. Another way to select which block to execute for which record is using relational expressions. Such an expression is an ordinary awk expression, usually a comparison, that evaluates to true or false for each record. Frequently, this is used to match on fields within a record:
$ awk '$1 == "01" { print }'
This would print all input records (lines) that have the string
“01” as their first field. You can also check fields to match
regular expressions, by using
the “match” operator ~ (the
tilde character), as is done in this
one-liner:
$ awk '$1 ~ /[abZ]x/ { print $2 " " $3 }'
Given the following input:
Zx the rain
cx and snow
ax in Spain
gx and France
this would produce:
the rain
in Spain
Logical operators. Combinations of pattern expressions can be made by using the logical operators && (and), || (or) and ! (not). As in C and perl, these work in a short-circuit fashion: as soon as enough of the expression has been evaluated to determine the outcome (true or false), the remaining patterns are not evaluated. Thus, given this example:
( $1 > 0 ) && ( $2 < 42 )
if $1 had the value 0, the first part of the expression would return “false”. Because the second expression, “( $2 < 42 )”, can then no longer change the outcome of the entire expression, it will never be evaluated.
This example:
$ awk ' $1 == 3 || $2 ~ /[a-z]9/ { print }'
would print all records where the first field has the value 3. Only if the first field does not have the value 3 would awk check whether the second field matches the regular expression [a-z]9, and print the record if it does.
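Running that pattern against a few lines shows which records are selected. The function name pick and the three input lines are assumptions made for this sketch.

```shell
# Hypothetical helper wrapping the one-liner discussed above.
pick() {
    awk '$1 == 3 || $2 ~ /[a-z]9/ { print }' "$@"
}

# The first line matches on $1, the second on $2, the third on neither.
printf '3 zz\n4 x9\n5 99\n' | pick
```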
Conditional operators.
The conditional operator
{pattern}?{pattern}:{pattern}
(sometimes called the ternary operator)
works just like the same operator in C.
If the first pattern is true, then the pattern
used for testing is the second pattern, otherwise it is the third.
Only one of the second and third patterns is evaluated.
$ awk '$1==3 ? $2=="b" : $2=="c" { print }'
This signifies that, if the first field contains a string with value “3”, the record will be printed if the second field contains the string “b”. If the first field does not contain the value “3”, the record will be printed if the second field contains the string “c”. All other records are not printed.
Range operator. Finally, there is the form where a range of records can be specified. This is done by specifying the start of the range, a comma and the end of a range. In this example, selected records will be printed:
$ awk '$1=="start", $1=="stop" { print }'
This program will suppress lines in its input until a record (line) is found in the input where the first field is the string “start”. That record, and all records following it, are printed, up to and including the first record whose first field is the string “stop”. Any records following that line will silently be discarded, until the next record is found where the first field is the word “start”.
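The behaviour of the range pattern can be checked with a few lines of made-up input. The function name between is an assumption; note that a range that is opened but never closed simply runs to the end of the input.

```shell
# Hypothetical helper wrapping the range one-liner discussed above.
between() {
    awk '$1 == "start", $1 == "stop" { print }' "$@"
}

# Lines "a" and "c" fall outside the ranges; the second range is never closed.
printf 'a\nstart 1\nb\nstop 1\nc\nstart 2\nd\n' | between
```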
awk supports standard operators similar to those used in C. From the manual pages:
Table 13.13. awk operators
The chapter on Regular Expressions (see the section called “Regular Expressions”) gives a thorough overview of the regular expressions that can be used in either awk, perl or sed. This section, therefore, focuses on using them within awk, rather than on the Regular Expressions themselves.
A regular expression can be used as a pattern by enclosing it in slashes. Then the regular expression is tested against the entire text of each record. Normally, it only needs to match some part of the text in order to succeed. This, for example, prints the second field of each record that contains the characters “snow” anywhere in it:
$ awk '/snow/ { print $2 }' phonelist
An exhaustive list of all built-in variables can be found in the manual
pages for awk; only the most commonly used ones are
explained here. The built-in variables FS and
OFS have been
demonstrated before. They can be used to force the Field Separator and
Output Field Separator by setting them from within an
awk code block.
The FNR variable holds the number of the current record within the current input file; the related variable NR holds the total number of records read so far.
A very basic one-liner to generate numbered program listings, for example,
could look like this:
awk '{ printf "%05d %s\n",FNR, $0 }'
By default, records are separated by newline characters, that is,
awk reads lines of text by default and treats each
line as a record. This behaviour can be changed by setting the variable
RS. RS contains either a single character
or a regular expression. Text in the input that matches that character
or regular expression is used as a separator.
The environment can be read by using the contents of the internal array ENVIRON. To read the variable TERM from the environment, you would use ENVIRON["TERM"].
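A minimal sketch of reading the environment follows. GREETING is a made-up variable that is set only for the single awk invocation, and show_env is an invented function name.

```shell
# Hypothetical helper: read a variable from the environment of awk.
show_env() {
    # GREETING is set only for this one awk process.
    GREETING=hello awk 'BEGIN { print "greeting is " ENVIRON["GREETING"] }'
}

show_env
```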
awk allows the definition of functions by the programmer and has a large number of built-in functions.
The built-in functions consist of mathematical functions, string functions, time functions and the system function.
The mathematical functions are: atan2(y, x),
cos(expr), exp(expr),
int(expr), log(expr),
sin(expr) and sqrt(expr).
Additionally, there are two functions available to work with random
numbers: srand([expr]), which seeds the random
generator, and rand(), which returns a random
number between 0 and 1.
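A few of these functions are tried out in the sketch below. The function name math_demo is an assumption; the value of rand() depends on the awk implementation, so the example only checks that it falls in the documented range.

```shell
# Hypothetical helper exercising some of the mathematical functions.
math_demo() {
    awk 'BEGIN {
        print int(3.7)        # truncates toward zero
        print sqrt(16)
        print atan2(0, -1)    # pi, printed in the default output format %.6g
        srand(1)              # seed the random generator with a fixed value
        r = rand()
        if (r >= 0 && r < 1) print "rand in range"
    }'
}

math_demo
```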
A rich set of string functions is available; please consult the manual
pages for a full description. A short list of the most commonly used
functions follows: index(s,t) to return the index
of the string t in the string s,
length([s]) to obtain the length of the string
specified (or the length of the input record if no string is specified),
match(s,r) to return on which position in string
s the expression r occurs.
substr(s,i[,n]) returns the substring from string
s, starting at position i, with
optional length n.
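The string functions listed above can be tried out with a sketch like this one. The test string and the function name string_demo are made up; positions in awk strings are counted from 1.

```shell
# Hypothetical helper exercising the string functions described above.
string_demo() {
    awk 'BEGIN {
        s = "foobar"
        print index(s, "bar")     # 4: "bar" starts at position 4
        print length(s)           # 6
        print match(s, /o+/)      # 2: the first match starts at position 2
        print substr(s, 2, 3)     # oob
        print substr(s, 4)        # bar: no length given, so up to the end
    }'
}

string_demo
```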
One of the primary uses of awk programs is processing log files that contain time-stamp information. awk provides two functions for obtaining time stamps and formatting them: systime() returns the current time of day as the number of seconds since midnight UTC, January 1st, 1970, and strftime([format [, timestamp]]) formats time stamps according to the form specified in format. To learn more about the format string used with this function, the manual pages for strftime(3) should be consulted.
User-defined functions were added later to awk, and therefore, the implementation is sometimes, somewhat, well, awkward. User-defined functions are constructed following this template:
function name(parameter list) { statements }
As an example, we'll list a function that prints out a part of a string (you may be familiar with it, since we've used it before in our examples):
function resume(line, part)
{
    len = length(line)
    if (len > (2 * part))
    {
        # first and last "part" characters, joined by an ellipsis
        retval = substr(line, 1, part) " ... " substr(line, len - part + 1)
    } else {
        retval = line
    }
    return retval
}
This function is named “resume” (reh-suh-may). It returns a short version of its input line, consisting of an equal number of characters from the beginning and the end of the line, joined by an ellipsis. It has two parameters: line, the line to determine the resume from, and part, which is the exact number of characters counted from the beginning and the end of the line that should be displayed. Within the function, the parameters will be available as (local) variables line and part. This function returns its result, which in this case will be a string.
Arrays are passed to functions by reference, therefore, changing the value of one of the members of an array will be visible in the other parts of the program, too. Regular variables are passed by value, therefore when a function alters its value, that is not automatically visible to other parts of the program. For example:
function fna(a) { a[0]=3 }
function fnb(b) { b=4 }
BEGIN {
a[0]=4
fna(a)
print a[0]
b=8
fnb(b)
print b
}
This would print
3
8
Note that when calling a user-defined function, the name should be followed immediately by the left parenthesis, without any white space. This restriction does not apply to the built-in functions.
Functions are executed when called from within expressions in either actions (see the example above) or patterns. An example of usage of a function in a pattern follows:
function has_length_of (string, len)
{
return (length(string) == len)
}
has_length_of($0,4) { print "four" }
This will print out the string “four” for each line that contains exactly 4 characters.
The provision for local variables is rather clumsy: they are declared as extra parameters in the parameter list. The convention is to separate local variables from real parameters by extra spaces in the parameter list, e.g.:
function f(p, q,     a, b) # a & b are local
{
... code ...
}
{ f(1, 2) }
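The effect of this convention can be seen in the following sketch, in which the function uses a local variable i while the main program has a global variable of the same name. The names sum_to and local_demo are invented for the illustration.

```shell
# Hypothetical helper: a function with two local variables, i and total.
local_demo() {
    awk '
    function sum_to(n,    i, total) {   # i and total are local
        for (i = 1; i <= n; i++) total += i
        return total
    }
    BEGIN {
        i = 99                  # global i, not touched by sum_to
        print sum_to(4), i
    }'
}

local_demo
```

The function is called with one argument only; the extra parameters start out empty and act as local variables.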
Functions may call each other and may be recursive.
To return a value from a function, you should use
return expr.
The return value will be undefined if no value is provided.