
awk

awk is a sophisticated file parser that can manipulate data in columns and perform math on it (arithmetic, plus built-in functions such as sqrt() and log()). A reasonably general representation of a typical awk invocation is as follows:

    awk 'BEGIN {commands} /pattern/{commands} END {commands}' file

The BEGIN and END blocks are optional. They run once before and once after the parsing step, usually to initialize counter variables first and then print the result at the end. The pattern is compared against every line in the file, and when a match occurs the associated commands are executed (see sed below for more info on pattern matching). Omitting the pattern is equivalent to matching all lines. The commands can refer to columns in the data file, e.g. print $3 simply prints the third column (columns are delimited by "whitespace", i.e. any combination of tabs and spaces, up to the end of the line). You'll notice that the commands have a very C-like look and feel. Even the printf() function is supported, which makes awk a powerful way of cleaning up poorly formatted data files.
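As a rough sketch of how the three parts fit together, the following runs one BEGIN block, one pattern block, and one END block over a small made-up file. The file contents, the variable names tmp, n, and result, and the pattern /data/ are all assumptions chosen for illustration.

```shell
# Build a small sample file (hypothetical data: name, type, size).
tmp=$(mktemp)
printf 'alpha data 10\nbeta note 20\ngamma data 30\n' > "$tmp"

# BEGIN runs once before any input is read, the /data/ block runs on every
# line that matches the pattern, and END runs once after the last line.
result=$(awk 'BEGIN {n=0} /data/ {n++; print $3} END {print "matches:", n}' "$tmp")
echo "$result"

rm -f "$tmp"
```

Only the first and third lines contain "data", so the third column of those two lines is printed, followed by the match count.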

Here are a few examples:

  1. To count the number of lines in a file (this gives the same count as the Unix command wc -l, but it illustrates the power of awk):
        awk 'BEGIN {nl=0} {nl++} END {print nl}' file
    
    Variables are automatically initialized to zero, so the nl=0 step is unnecessary, but it's good practice. Also, awk has a special variable NR which contains the current line (record) number in the file. In the END block it holds the total number of lines read, so awk 'END {print NR}' file would accomplish the same thing as this example.
  2. To compute the average of the values in the 4th column of file:
        awk 'BEGIN {sum=0} {sum += $4} END {printf("%.2f\n",sum/NR)}' file
    
    Here we used printf() to restrict the output to two places after the decimal. Note the \n to force a new line, just like in C.
  3. To print the fields of each line in reverse order, for every run of lines from one matching "start" through the next one matching "stop" (why not?):
        awk '/start/,/stop/{for (i=NF;i>0;i--) printf("%s ",$i); printf("\n")}' file
    

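To see what examples 2 and 3 produce, here is a small sketch that runs slight variations of them on a made-up five-line file. The sample data and the variable names tmp, avg, and rev are assumptions for illustration. The two variations are a guard against dividing by zero on an empty file, and a separator that avoids the trailing space after the last field.

```shell
# Sample file (hypothetical): four columns, with a start/stop pair on lines 2-4.
tmp=$(mktemp)
printf 'a b c 1.5\nstart x y 2.5\nd e f 3.5\nstop g h 4.5\ni j k 8.0\n' > "$tmp"

# Average of column 4; the NR > 0 test skips the division on an empty file.
avg=$(awk '{sum += $4} END {if (NR > 0) printf("%.2f\n", sum/NR)}' "$tmp")
echo "$avg"

# Reverse the fields of every line from a "start" line through a "stop" line.
# The separator is a space between fields and a newline after the last one.
rev=$(awk '/start/,/stop/ {for (i=NF; i>0; i--) printf("%s%s", $i, (i>1 ? " " : "\n"))}' "$tmp")
echo "$rev"

rm -f "$tmp"
```

The column sums to 20 over 5 lines, so the first command prints 4.00. The second prints the three lines of the start/stop block with their fields reversed and leaves the first and last lines out.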
One handy argument to awk is -Fc, where c is any character. This character is used as the delimiter instead of whitespace, which can be useful for, say, extracting the month, day, and year digits from an entry like "MM/DD/YY".
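A minimal sketch of the date case, assuming a single made-up entry piped in on standard input (the variable name date_parts and the labels in the output are illustrative only):

```shell
# Split the entry on "/" instead of whitespace, then label the three fields.
date_parts=$(echo '01/26/09' | awk -F/ '{print "month=" $1, "day=" $2, "year=" $3}')
echo "$date_parts"
```

With -F/ the fields $1, $2, and $3 hold the month, day, and year, so this prints month=01 day=26 year=09.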

There's lots more you can do with awk. In fact, there's an entire O'Reilly book on both awk and sed if you're interested...


Massimo Ricotti 2009-01-26