Text-Related Commands

One of the original design goals of UNIX at AT&T was the easy processing of text data. UNIX and FreeBSD include a large number of commands for processing text data from the command line. This chapter doesn't cover every available text command, but you'll learn how to use the most useful ones. With a few judiciously applied commands invoking small but flexible command-line text-processing utilities, you can take a bare text file and extract information from it like a sorcerer.

Counting Lines, Words, and Characters

Use the wc command to count the number of lines, words, and characters in a text file. With no options, this command gives all three. For example, the following tells me that there are 1,160 lines, 7,823 words, and 51,584 characters currently in the text file that contains this chapter of the book:

# wc chapter08
    1160    7823   51584 chapter08

wc supports the -l option to display only the number of lines, the -w option to display only the number of words, and the -c option to display only the number of characters (strictly speaking, bytes; the two counts are the same for plain ASCII text). These options can be combined to control what information wc displays.
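As a minimal sketch of these options, the following builds a small file and collects each count separately. The file name sample.txt and the scratch directory are assumptions made for illustration only; they do not come from the chapter.

```shell
# Work in a scratch directory so nothing existing is overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical three-line sample file.
printf 'one two three\nfour five\nsix\n' > sample.txt

# Reading from standard input keeps the file name out of the output;
# tr -d ' ' strips the padding that some versions of wc add.
line_count=$(wc -l < sample.txt | tr -d ' ')
word_count=$(wc -w < sample.txt | tr -d ' ')
byte_count=$(wc -c < sample.txt | tr -d ' ')

echo "lines=$line_count words=$word_count bytes=$byte_count"
```

Capturing each count in a variable like this is handy in scripts, where the padded three-column default output is awkward to parse.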

Viewing Text Files: Less is More

Once two separate commands, more and less are now just two different names for the most commonly used tool for viewing text files in UNIX. You can use the less (or more) command to display text files on your screen, one screen at a time. In addition, you can search the file you are currently viewing for text strings, and you can scroll back and forth through the file by any number of lines you specify or by simply using the arrow keys.

Table 8.9 shows some examples of commands that can be used in less.

Table 8.9. Commands Allowed Within the `less` Program
Command	Usage
`/pattern`	If you replace `pattern` with the pattern you want to search for, `less` will find the specified pattern in the file.
Space or `f`	Scrolls forward one screen. If you type a number before pressing the spacebar, `less` will scroll forward that number of lines.
`b`	Scrolls back one screen. If you type a number before typing `b`, `less` will scroll back that number of lines.
Up and down arrows	Moves up and down, respectively, one line at a time in the file.
`# g`	If you replace `#` with a number and then type `g`, `less` will move to that exact line in the file.
`# %`	If you replace `#` with a number between 0 and 100, `less` will move to a new location that represents that percentage of the file.
`G`	Jumps to the end of the file.
`q`	Quits the `less` program.

These are probably the most common commands you will use within less. Many more are available, however: the man page for less is nearly 2,000 lines long. See that page for more information on the other options and commands that less offers (and note, too, that man uses the less interface to view the manual pages).

Viewing Only the Top or Bottom of a Text File

If you want to see only the first few lines or the last few lines of a file, you can use the head or tail commands.

By default, head and tail show the first 10 and last 10 lines of the file, respectively. You can change the number of lines that will be displayed by using the -n option followed by a number (for example, tail -n 100 log.txt).

For the tail command, you can use the -f option to have it continually update the display with new lines as they are appended to the end of a file. This can be useful for monitoring a log file and any new messages written to it in real time.
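The following sketch shows head and tail with an explicit line count. The 20-line file log.txt is generated here purely for illustration and is an assumption, not a real log.

```shell
# Work in a scratch directory so nothing existing is overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Build a hypothetical 20-line file: "line 1" through "line 20".
i=1
while [ "$i" -le 20 ]; do
    echo "line $i"
    i=$((i + 1))
done > log.txt

first_three=$(head -n 3 log.txt)   # the first three lines
last_two=$(tail -n 2 log.txt)      # the last two lines

echo "$first_three"
echo "$last_two"

# tail -f log.txt   # would keep following the file; press Ctrl-C to stop
```

The tail -f line is left commented out because it never exits on its own.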

Searching for Patterns

A hallmark of UNIX is the ability to search rapidly through a collection of files for a particular bit of text: someone's name, a command you're trying to remember, a function in a program, or the name of a service in a log file. With standard UNIX command-line tools, these kinds of searches can be executed with great efficiency.

You can use the grep series of commands to search for patterns in text files. Three different grep commands are available. There is plain-old grep, which simply searches for patterns and basic regular expressions; there is egrep, which can search for extended regular expressions (which employ a large suite of special wildcards to define variable patterns); and there is fgrep, which searches for fixed strings (strings that must be matched literally, without wildcards). Some earlier UNIX manual pages also referred to fgrep as "fast grep" because it was supposed to be faster than regular grep. In reality, though, fgrep is almost always slower than regular grep. Most man pages these days no longer refer to fgrep as fast grep.

Suppose you want to search the file textfile for the pattern cat. In its simplest form, grep looks like this:

# grep cat textfile

This command searches through every line of the file textfile and prints each line where the pattern cat is matched. Note that the command matches a pattern and not a word. This means that in addition to cat, the words catnip, catbird, catfish, and concatenate would also be matched because they all contain the string cat. If you only want to match the actual word cat, enclose the string in quotes and include spaces on each side, like this:

# grep " cat " textfile

Note, however, that this only matches occurrences of cat in the middle of a sentence; if cat appears at the beginning or end of a line, or it's followed by a period, this grep command won't match it. Adjust your search pattern to your needs accordingly.

Some common options to grep include -i to perform a case-insensitive search, -c to suppress the display of matching lines and print the number of lines that matched instead, -n to display the line number in front of each line where a match occurs, and -v to reverse the operation and print only lines that do not match the specified pattern.
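The options above can be tried on a small sample file, as in this sketch. The contents of textfile are invented for illustration. The -w option, which the text above does not cover, is included as one possible way to match cat only as a whole word, even at the start or end of a line or before a period.

```shell
# Work in a scratch directory so nothing existing is overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical sample file.
printf 'The cat sat.\nCatfish swim.\nconcatenate files\nA dog barked.\n' > textfile

any_case=$(grep -i cat textfile)          # ignores case, so Catfish matches too
match_count=$(grep -c cat textfile)       # number of lines containing lowercase cat
numbered=$(grep -n cat textfile)          # matching lines prefixed with line numbers
non_matching=$(grep -v -i cat textfile)   # lines with no form of cat at all
whole_word=$(grep -w cat textfile)        # cat as a complete word only

echo "$numbered"
echo "count: $match_count"
echo "whole word: $whole_word"
```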

The extended regular expression matching of egrep is beyond the scope of this chapter, but regular expressions will be covered in detail in Chapter 10.

Sorting Text in a File

Sometimes, you might want to view the text in a file sorted into a certain order. For example, you might want to take a list of names and sort it alphabetically, or take a list of expenses and sort the lines numerically. You can use the sort command for this. By default, this command sorts based on ASCII value, and it does not ignore leading whitespace. Some of the common options are shown in Table 8.10.

Table 8.10. Options for Use with the `sort` Command
Option	Result
`-d`	Sorts using "telephone book" sorting, which ignores anything other than letters, digits, and blanks when sorting.
`-b`	Ignores leading whitespace in lines when sorting.
`-f`	Folds lowercase letters into uppercase letters when sorting. This has the effect of creating a case-insensitive sort.
`-n`	Sorts according to the numeric value of a field.
`-t`	Changes the field separator that `sort` uses to indicate the end of a field and the beginning of the next field. By default, `sort` uses whitespace to separate fields.
`-u`	If there are identical lines in the input to be sorted, this option displays only one of the lines in the sorted output.
`-r`	Reverses the output of the sort.
`-o`	Sends the results to an output file instead of to the screen. The name of the desired file should be supplied after `-o`. This option has the same basic effect as redirecting the output to a file (there'll be more on input/output redirection later in this chapter).

If given more than one file on its command line, sort will concatenate the files and sort the combined contents. If each input file is already individually sorted, you can use the -m option, which lets sort work faster by merging the files rather than sorting them from scratch.
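A short sketch of a few of these options follows. The files expenses.txt, a.txt, and b.txt and their contents are assumptions created only for the example.

```shell
# Work in a scratch directory so nothing existing is overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical list of amounts, one per line, with a duplicate.
printf '100\n9\n25\n9\n' > expenses.txt

default_order=$(sort expenses.txt)           # character order: 100 sorts before 25
numeric_order=$(sort -n expenses.txt)        # numeric order: 9 sorts before 25
unique_desc=$(sort -n -r -u expenses.txt)    # largest first, duplicates removed

# Two files that are each already sorted can be merged with -m.
printf 'apple\ncherry\n' > a.txt
printf 'banana\ndate\n' > b.txt
merged=$(sort -m a.txt b.txt)

echo "$numeric_order"
echo "$merged"
```

Comparing default_order with numeric_order shows why -n matters for numbers: without it, 100 comes first because the character 1 sorts before 2 and 9.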

Replacing Characters Using `tr`

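You can use the tr ("translate") command to replace each occurrence of certain characters in a stream of text with other characters. Note that tr works on individual characters, not on whole strings: each character in the first set is replaced by the character at the same position in the second set. The basic form of the command is as follows: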

# tr 'a-z' 'A-Z'

This command would replace all lowercase letters with uppercase letters. By default, tr gets its input from standard input (which is normally the keyboard) and sends its output to standard output (which is normally the screen). This is not very useful in most cases, so normally tr is used with input and output redirection. You will learn more about input and output redirection later, but here is the basic form of tr that makes it receive input from a file and also direct output to a file:

# tr 'a-z' 'A-Z' < file1 > file2

This command will read file1, replace all lowercase letters in the file with capital letters, and store the new file in file2.

You can also use the -d option with tr. In this case, tr will simply go through the file and delete each occurrence of the specified characters. For example, the following will delete each occurrence of either uppercase A or uppercase B from file1 and store the results in file2:

# tr -d 'AB' < file1 > file2

The tr command is extremely flexible. When used with the proper pipes, redirections, and options, it can address a great many text-manipulation tasks that users have all too frequently written Perl scripts for, not realizing they're reinventing the wheel.
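The sketch below feeds tr from echo through a pipe rather than from a file, simply to keep the example short. The input strings are made up for illustration. The first command shows that tr maps characters one for one and does not search for whole strings.

```shell
# Each character in the first set maps to the character at the same
# position in the second set: a becomes x, b becomes y, c becomes z.
mapped=$(echo 'cat act' | tr 'abc' 'xyz')

# Ranges work the same way: every lowercase letter maps to its uppercase form.
upper=$(echo 'hello, world' | tr 'a-z' 'A-Z')

# With -d, every character in the set is deleted wherever it appears.
stripped=$(echo 'ABBA was A Band' | tr -d 'AB')

echo "$mapped"
echo "$upper"
echo "$stripped"
```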

Showing Only Certain Parts of Lines in Text Files

Sometimes, you might be interested in only a certain part of a line in a file (just the first half of each line, for instance), or you might want to divide up each line at commas or tab characters and print out only the third field of each. You can use the cut command to cut only certain fields or parts thereof from a file for display. For example, suppose you have a text file named phone.txt that contains the following simple address book:

Doe, John~105 Some Street~Anytown~NY~55555~123-555-1212
Doe, Jane~105 Some Street~Anytown~NY~55555~123-555-1212
James, Joe~251 Any Street~Sometown~CA~51111~321-555-1212

If you only want to see the first five characters of each line, you can use cut -c 1-5 phone.txt, in which the argument to -c (1-5) specifies a list of character positions, which in this case is characters 1 through 5. This results in the following:

# cut -c 1-5 phone.txt
Doe,
Doe,
James

A more useful application of cut is to cut only certain fields from a line with regular delimiters. The following command will return the first field from a set of lines delimited by tabs:

# cut -f 1 phone.txt

By default, cut expects fields to be separated with tab characters. However, you can change the field separator to any character you want. In this case, our address book text file doesn't use tab characters as field separators; it uses tildes (~). You can specify which delimiter character you want to use with the -d option:

# cut -f 1 -d '~' phone.txt
Doe, John
Doe, Jane
James, Joe

Here, you have told cut to display only the first field, and you also told it that fields are delimited by tildes (~). Because the first tilde comes after the name, the command lists only the name of the person and leaves out the rest of the information.

Similarly, you can get a listing of all the users on your system by using cut on the /etc/passwd file:

# cut -f 1 -d ':' /etc/passwd
frank
bob
alice
joe
simba
lee
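The -f option also accepts lists and ranges of fields, which the examples above do not show. As a sketch, the following recreates the phone.txt address book in a scratch directory (the directory is an assumption for illustration) and pulls out more than one field at a time.

```shell
# Work in a scratch directory so nothing existing is overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Recreate the tilde-delimited address book from the text.
cat > phone.txt <<'EOF'
Doe, John~105 Some Street~Anytown~NY~55555~123-555-1212
Doe, Jane~105 Some Street~Anytown~NY~55555~123-555-1212
James, Joe~251 Any Street~Sometown~CA~51111~321-555-1212
EOF

# Fields 1 and 6: the name and the phone number.
names_and_phones=$(cut -d '~' -f 1,6 phone.txt)

# Fields 3 through 4: the city and the state.
cities=$(cut -d '~' -f 3-4 phone.txt)

echo "$names_and_phones"
echo "$cities"
```

When more than one field is selected, cut keeps the original delimiter between them, so the tilde still appears in the output.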

Formatting Text with `fmt`

The fmt command formats text into nice 65-character lines (by default). This is most useful for preparing a text file to be sent through email, but it can be used for other simple formatting tasks as well. Here's an example:

Until he extends his circle of compassion to include all living things, man will not himself find peace.
-- Dr. Albert Schweitzer

The first line contains 105 characters, which is too long to display on one line of a character-based display (and even some graphical displays if the resolution is low). The result is that either the mail-reading program will break the line in an odd place (such as in the middle of a word) or the text will go off the right end of the screen, forcing the reader to scroll right to read the rest of it. (If you've ever gotten one of those email messages that looks like it is just one long line, the mail program is not breaking the lines properly for display.)

The fmt command will save us. Its typical use is simple, as follows:

# fmt quote.txt
Until he extends his circle of compassion to include all living
things, man will not himself find peace. -- Dr. Albert Schweitzer

This output could then be either piped to a mail program or redirected to a file that could then be mailed.

Here is an example that makes it easier to see the results of the fmt command. Suppose the text starts out with ragged line breaks like these:

Until he extends
his circle of compassion to include all
living things
man will not himself
find peace
-- Albert Schweitzer

It will look like this after being run through fmt:

Until he extends his circle of compassion to include all living
things man will not himself find peace -- Albert Schweitzer
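If the default width does not suit the task, fmt can be given a different one. The sketch below uses the -w option to request lines of at most 40 characters. The file quote.txt is recreated in a scratch directory for illustration, and the width of 40 is an arbitrary assumption. Exact line breaks can differ between fmt implementations, so no particular output is shown here.

```shell
# Work in a scratch directory so nothing existing is overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# One long line, as in the quote used in the text.
printf '%s\n' 'Until he extends his circle of compassion to include all living things, man will not himself find peace.' > quote.txt

# Default formatting.
fmt quote.txt

# Ask for narrower lines, at most 40 characters wide.
narrow=$(fmt -w 40 quote.txt)
echo "$narrow"
```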

This section has presented some of the most useful commands for working with text. By combining these various commands, you can perform some rather sophisticated tasks, such as analyzing web server logs for trends. Of course, these commands have their limits. When you run into them, you might want to look into sed and awk for text processing. Both sed and awk are beyond the scope of this chapter (whole books exist on both subjects, such as sed & awk published by O'Reilly), but you should be aware that they exist on your FreeBSD system and can be used to handle some very sophisticated text-processing tasks.

So, how can you combine the commands we've used to do more useful things? That's where pipes and input/output redirection come into play.
