Hello! Welcome to your next step in mastering Bash scripting. In this lesson, we will immerse ourselves in the world of text processing with the versatile command-line tool `awk`. `awk` is a powerful tool that allows you to manipulate and analyze text files with ease. By the end of this lesson, you’ll be equipped to efficiently handle and process text files, extracting meaningful data and performing relevant computations directly from your Bash scripts.
Let's get started by diving into how we can leverage `awk` for various text processing tasks.
First, let's create a sample data file to work with. This file will help us learn and practice various `awk` commands effectively.
The heredoc (short for "here document") is a special syntax in Unix shell scripting that allows you to create a multi-line string. It is particularly useful for creating files or including large blocks of text within your script. The syntax is `<<EOF ... EOF`, where `EOF` (End of File) is a marker indicating the beginning and end of the block of text. You can actually use any marker, but `EOF` is conventionally used.
Let's create a file called `data.txt` that includes data about computers in inventory.
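A minimal sketch of such a heredoc is shown here. The header row matches the "Brand Model RAM" columns used throughout this lesson, but the four data rows are placeholder values chosen for illustration, not the lesson's original inventory.

```shell
# Write a small inventory file with a heredoc.
# The data rows are illustrative placeholders (assumed values).
cat << EOF > data.txt
Brand Model RAM
Apple MacBook 16
Dell XPS 32
Lenovo ThinkPad 64
Apple iMac 64
EOF
```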
Let's break this code down:
- `cat << EOF`: This starts the heredoc and tells the `cat` command to begin reading the subsequent lines as a string until it encounters the ending `EOF` marker.
- `> data.txt`: This redirects the output of the `cat` command to a file named `data.txt`.
- The lines between `<< EOF` and `EOF` are the content that will be written to `data.txt`.
The basic syntax of the `awk` command in Unix-like systems is:
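The general form, as far as it can be reconstructed from the component names used in this lesson, appears as a comment in the sketch below, followed by one concrete instance. The file name `models.txt` and the sample rows are assumptions made for illustration.

```shell
# General form (a template, not runnable as-is):
#   awk options 'selection_criteria {action}' input-file > output-file

# Recreate an illustrative data.txt (placeholder rows) so this snippet runs on its own.
printf '%s\n' 'Brand Model RAM' 'Apple MacBook 16' 'Dell XPS 32' \
  'Lenovo ThinkPad 64' 'Apple iMac 64' > data.txt

# One concrete instance:
#   -F ' '          option (the default whitespace field separator, stated explicitly)
#   $1 == "Apple"   selection criteria
#   {print $2}      action
#   data.txt        input file
#   models.txt      output file (hypothetical name)
awk -F ' ' '$1 == "Apple" {print $2}' data.txt > models.txt
cat models.txt
# MacBook
# iMac
```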
Here's a detailed breakdown of each component:
- awk: The command itself.
- options: These are optional flags you can pass to `awk` to modify its behavior (e.g., `-F` to specify the field separator).
- selection_criteria: This is an optional condition or pattern that specifies which lines of the input file to process. It can be a regular expression or a logical condition based on field values.
- {action}: This is the block of code to execute for each line that matches the selection criteria. Actions are enclosed in curly braces `{}`.
- input-file: The file that `awk` processes.
- > output-file: This optional part redirects the output to a file. If omitted, `awk` prints the output to the terminal.
With this understanding of `awk` syntax, let's dive into some examples.
Let's begin with the most basic `awk` command to print the entire content of the file. The `print` statement is used to output text, fields, or expressions to the terminal or another output stream. It offers flexibility in how the data is displayed and allows custom formatting of text.
In this `awk` command:
- There are no `options` or `selection_criteria`.
- The `action` is `{print}`, enclosed in curly braces. The `print` statement tells `awk` to print each line of the file.
- There is no `output-file`, so the result is displayed on the terminal.
Running this command will display all lines of the `data.txt` file, mirroring the functionality of the `cat` command.
In `awk`, field numbers are used to refer to specific columns in a line of text. Fields are denoted by a dollar sign (`$`) followed by the field number. The records (lines of text) are automatically split into fields based on a delimiter, which is any run of spaces or tabs by default but can be changed using the `-F` option.
`$1` denotes the first field of a line of text, `$2` represents the second field, and so forth. `$0` refers to the whole line.
Often, we need to extract specific columns from a file. Suppose we only want to extract the "Brand" and "Model" of each line. The code is:
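A sketch of that command, run against placeholder rows (the sample data is an assumption):

```shell
# Recreate an illustrative data.txt (placeholder rows) so this snippet runs on its own.
printf '%s\n' 'Brand Model RAM' 'Apple MacBook 16' 'Dell XPS 32' \
  'Lenovo ThinkPad 64' 'Apple iMac 64' > data.txt

# Print only the first and second fields of every line.
awk '{print $1, $2}' data.txt
# Brand Model
# Apple MacBook
# Dell XPS
# Lenovo ThinkPad
# Apple iMac
```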
- `$1` and `$2` refer to the first and second fields (columns) of each line in the file.
- This command omits the third column (RAM), displaying only the brand and model of each item.
The output shows only the "Brand" and "Model" columns of the text file.
Often, you will need to filter lines based on specific conditions. To do this, you place the condition before the `{action}` block, inside the same single-quoted (`'`) program string.
For instance, we may want to find all entries where the RAM is 64 GB or greater.
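A sketch of this filter, using placeholder rows for `data.txt`:

```shell
# Recreate an illustrative data.txt (placeholder rows) so this snippet runs on its own.
printf '%s\n' 'Brand Model RAM' 'Apple MacBook 16' 'Dell XPS 32' \
  'Lenovo ThinkPad 64' 'Apple iMac 64' > data.txt

# Print whole lines whose third field is 64 or greater.
awk '$3 >= 64 {print $0}' data.txt
# Data rows printed:
# Lenovo ThinkPad 64
# Apple iMac 64
```

With a header row like this one, the header line is typically printed as well: its third field, "RAM", is not a number, so `awk` compares it as a string, and "RAM" sorts after "64". Combining the test with a line-number check, as in `NR > 1 && $3 >= 64`, is one way to exclude it.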
- `$3 >= 64 {print $0}` instructs `awk` to print any line where the third field (RAM) is 64 or greater.
- `$0` represents the entire line.
`NR` stands for "Number of Records" in `awk`. It is a built-in variable that keeps track of the current line number being processed in the input file. Each time `awk` reads a new line, it increments `NR` by one. This makes `NR` useful for actions based on the line number, such as skipping headers, processing specific lines, or adding line numbers to output.
Skip Header Line
Suppose we want to print every line, excluding the header line ("Brand Model RAM"). The header line has an `NR` value of 1. To skip this line, we use the condition `NR > 1`.
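A sketch of the header-skipping command, with placeholder rows in `data.txt` and `{print}` assumed as the action:

```shell
# Recreate an illustrative data.txt (placeholder rows) so this snippet runs on its own.
printf '%s\n' 'Brand Model RAM' 'Apple MacBook 16' 'Dell XPS 32' \
  'Lenovo ThinkPad 64' 'Apple iMac 64' > data.txt

# Print every line whose record number is greater than 1.
awk 'NR > 1 {print}' data.txt
# Apple MacBook 16
# Dell XPS 32
# Lenovo ThinkPad 64
# Apple iMac 64
```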
- `NR > 1`: This condition is false for the first line (the header), so that line is skipped and only the remaining lines are printed.
Process a Specific Line
Now let's print only the 3rd line of `data.txt`, using the condition `NR == 3`.
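A sketch of selecting a single record by its number, with the command and its result for placeholder data:

```shell
# Recreate an illustrative data.txt (placeholder rows) so this snippet runs on its own.
printf '%s\n' 'Brand Model RAM' 'Apple MacBook 16' 'Dell XPS 32' \
  'Lenovo ThinkPad 64' 'Apple iMac 64' > data.txt

# Print only the record whose number is exactly 3.
awk 'NR == 3 {print}' data.txt
# Dell XPS 32
```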
Pattern matching is one of the core strengths of `awk`, allowing you to perform actions only on lines that match specific patterns. The syntax for pattern matching in `awk` involves enclosing regular expressions within slashes (`/pattern/`). Suppose we only want to print lines that contain "Apple":
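A sketch of the pattern match, using placeholder rows for `data.txt`:

```shell
# Recreate an illustrative data.txt (placeholder rows) so this snippet runs on its own.
printf '%s\n' 'Brand Model RAM' 'Apple MacBook 16' 'Dell XPS 32' \
  'Lenovo ThinkPad 64' 'Apple iMac 64' > data.txt

# Print only the lines that match the regular expression /Apple/.
awk '/Apple/ {print}' data.txt
# Apple MacBook 16
# Apple iMac 64
```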
- The `/Apple/ {print}` pattern checks each line for the string "Apple".
- If a line contains "Apple", it is printed.
The `END` keyword is used to specify an action to be executed after all lines have been processed. The syntax is:
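The general shape, as far as it can be inferred from the `action1` and `action2` names used in this lesson, is given as a comment below, followed by a small runnable instance. The line-counting example and its variable name `lines` are assumptions made for illustration.

```shell
# General form (a template, not runnable as-is):
#   awk '{action1} END {action2}' input-file

# Recreate an illustrative data.txt (placeholder rows) so this snippet runs on its own.
printf '%s\n' 'Brand Model RAM' 'Apple MacBook 16' 'Dell XPS 32' \
  'Lenovo ThinkPad 64' 'Apple iMac 64' > data.txt

# action1 runs once per line and counts it; action2 runs once at the end.
awk '{lines++} END {print "Lines read:", lines}' data.txt
# Lines read: 5
```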
This command will perform `action1` for every line of text. After all lines have been processed, `action2` is run once.
Now let's write a command that calculates the average RAM across all entries. To do this:
- We create a `sum` variable and a `count` variable.
- For each line, we add the RAM value (column `$3`) to `sum` and increment the `count` variable by 1.
- After all lines have been processed, we print `sum/count`.
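The steps above can be sketched as a single command. The data rows are placeholders, so the average shown applies only to these assumed values.

```shell
# Recreate an illustrative data.txt (placeholder rows) so this snippet runs on its own.
printf '%s\n' 'Brand Model RAM' 'Apple MacBook 16' 'Dell XPS 32' \
  'Lenovo ThinkPad 64' 'Apple iMac 64' > data.txt

# Skip the header, accumulate RAM in sum, count the rows, then print the average.
awk 'NR>1 {sum+=$3; count++} END {print "Average RAM:", sum/count}' data.txt
# Average RAM: 44
```

Here `sum` and `count` need no explicit initialization because unset `awk` variables start out as zero in numeric context.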
- `NR>1` skips the header line because it does not contain a RAM value.
- `{sum+=$3; count++}` sums the values of the third field (RAM) and increments the count.
- `END {print "Average RAM:", sum/count}` executes after processing all lines, printing the calculated average RAM.
The output is a single line: the text "Average RAM:" followed by the computed average.
You can customize the output format for each line as well. We can add text to our print statement by separating strings/field references with commas. Let’s create a message for each entry.
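A sketch of such a message is shown below. The exact wording of the labels is hypothetical, and the data rows are placeholders.

```shell
# Recreate an illustrative data.txt (placeholder rows) so this snippet runs on its own.
printf '%s\n' 'Brand Model RAM' 'Apple MacBook 16' 'Dell XPS 32' \
  'Lenovo ThinkPad 64' 'Apple iMac 64' > data.txt

# Commas in print join the items with a single space (the default output separator).
awk 'NR > 1 {print "Brand:", $1, "Model:", $2, "RAM:", $3}' data.txt
# Brand: Apple Model: MacBook RAM: 16
# Brand: Dell Model: XPS RAM: 32
# Brand: Lenovo Model: ThinkPad RAM: 64
# Brand: Apple Model: iMac RAM: 64
```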
The output of this command is still a bit difficult to read. Let's continue to see how to use `printf` to format the output.
The `printf` function in `awk` offers more control over the formatting of the output compared to the `print` command. The syntax for `printf` is:
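The general shape of a `printf` call is given as a comment below, followed by a small runnable instance. The message text and sample rows are assumptions made for illustration.

```shell
# General form inside an awk action (a template, not runnable as-is):
#   printf "format string", value1, value2, ...

# Recreate an illustrative data.txt (placeholder rows) so this snippet runs on its own.
printf '%s\n' 'Brand Model RAM' 'Apple MacBook 16' 'Dell XPS 32' \
  'Lenovo ThinkPad 64' 'Apple iMac 64' > data.txt

# %s is filled with the model name, %d with the RAM value; \n ends each line.
awk 'NR > 1 {printf "%s has %d GB of RAM\n", $2, $3}' data.txt
# MacBook has 16 GB of RAM
# XPS has 32 GB of RAM
# ThinkPad has 64 GB of RAM
# iMac has 64 GB of RAM
```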
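The format string includes text and format specifiers that begin with `%`. Common format specifiers include:
- %d: Integer
- %s: String
- \n: Newline character (strictly an escape sequence rather than a format specifier; `printf` does not add a newline on its own)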
Modifiers can also be added to control the width and alignment:
- Minimum Field Width: The number between `%` and the format specifier defines the minimum width of the field.
- Positive Width: Right-justified by default. For example, `%10s` formats a string, right-aligned, with a minimum width of 10 characters.
- Negative Width: Left-justified if prefixed with a minus sign. For example, `%-10s` formats a string, left-aligned, with a minimum width of 10 characters.
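To make the padding visible, the sketch below wraps each formatted field in square brackets. The brackets and the sample rows are assumptions added for illustration.

```shell
# Recreate an illustrative data.txt (placeholder rows) so this snippet runs on its own.
printf '%s\n' 'Brand Model RAM' 'Apple MacBook 16' 'Dell XPS 32' \
  'Lenovo ThinkPad 64' 'Apple iMac 64' > data.txt

# Format the brand on line 2 twice: right-aligned, then left-aligned, width 10.
awk 'NR == 2 {printf "[%10s]\n[%-10s]\n", $1, $1}' data.txt
# [     Apple]
# [Apple     ]
```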
Now, let’s format our output as a neatly aligned table:
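A sketch of the table command follows. The specifiers `%-8s`, `%-10s`, and `%2d` come from this lesson, but the single spaces between them, the exact spacing inside the header string, and the data rows are assumptions.

```shell
# Recreate an illustrative data.txt (placeholder rows) so this snippet runs on its own.
printf '%s\n' 'Brand Model RAM' 'Apple MacBook 16' 'Dell XPS 32' \
  'Lenovo ThinkPad 64' 'Apple iMac 64' > data.txt

# BEGIN prints a hand-aligned header; NR > 1 formats every data row.
awk 'BEGIN {print "Brand    Model      RAM"}
     NR > 1 {printf "%-8s %-10s %2d\n", $1, $2, $3}' data.txt
# Brand    Model      RAM
# Apple    MacBook    16
# Dell     XPS        32
# Lenovo   ThinkPad   64
# Apple    iMac       64
```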
- `BEGIN {print "Brand Model RAM"}`: The `BEGIN` block is executed before any lines from the input file are processed. Its action, `{print "Brand Model RAM"}`, prints the header row "Brand Model RAM" before processing the actual data. The string contains specific spaces to align the header with the columns that will follow.
- `NR > 1`: Ensures that the action `{printf ...}` is applied only to lines after the first one, which is the header line.
- `%-8s`: Left-align (`-`) a string (`s`) with a width of 8 characters.
- `%-10s`: Left-align (`-`) a string (`s`) with a width of 10 characters.
- `%2d`: Print an integer (`d`) with a minimum width of 2 characters. Longer numbers are not truncated.
- `\n`: Newline character to move to the next line after printing.
- `$1, $2, $3`: These are the fields to be printed according to the format specifiers.
- `data.txt`: The input file that contains the data to be processed.
Using `printf` and format specifiers, the output of our table looks much cleaner!
Great job! In this lesson, you learned how to:
- Create and manipulate data files using heredoc syntax.
- Print the entire content of a file using `awk`.
- Extract specific columns from a file using field numbers.
- Filter lines based on conditions using `awk`.
- Perform pattern matching with `awk`.
- Calculate and display average values using the `END` block.
- Customize and format output using `printf` in `awk`.
Now, it’s time to apply what you’ve learned. Head to the practice section to sharpen your `awk` skills through hands-on exercises. Happy coding!
