One of the original design goals of UNIX at AT&T was the easy processing of text data. UNIX and FreeBSD include a large number of commands for processing text data from the command line. This chapter doesn't cover every available text command, but you'll learn how to use the most useful ones. With a few judiciously applied commands invoking small but flexible command-line text-processing utilities, you can take a bare text file and extract information from it like a sorcerer.

Counting Lines, Words, and Characters

Use the wc command to count the number of lines, words, and characters in a text file. With no options, this command gives all three. For example, the following tells me that there are 1,160 lines, 7,823 words, and 51,584 characters currently in the text file that contains this chapter of the book:

# wc chapter08
1160 7823 51584 chapter08

wc supports the -l option to display only the number of lines, the -w option to display only the number of words, and the -c option to display only the number of bytes (which equals the number of characters in plain ASCII text). These options can be combined to control what information wc displays.

Viewing Text Files: Less is More

Once two separate commands, more and less are now just two different names for the most commonly used tool for viewing text files in UNIX. You can use the less (or more) command to display text files on your screen, one screen at a time. In addition, you can search the file you are currently viewing for text strings, and you can scroll back and forth through the file by any number of lines you specify or by simply using the arrow keys. Table 8.9 shows some examples of commands that can be used in less.
These are probably the most common options you will use with less. Many more options are available, however. The man page for less is nearly 2,000 lines long. See that page for more information on the other options and commands that less offers (and note, too, that man uses the less interface to view the manual pages).

Viewing Only the Top or Bottom of a Text File

If you want to see only the first few lines or the last few lines of a file, you can use the head or tail commands. By default, head and tail show the first 10 and last 10 lines of the file, respectively. You can change the number of lines displayed by using the -n option followed by a number (for example, tail -n 100 log.txt). For the tail command, you can use the -f option to have it continually update the display with new lines as they are appended to the end of a file. This is useful for monitoring a log file in real time as new messages are written to it.

Searching for Patterns

A hallmark of UNIX is the ability to search rapidly through a collection of files for a particular bit of text: someone's name, a command you're trying to remember, a function in a program, or the name of a service in a log file. With standard UNIX command-line tools, these kinds of searches can be executed with great efficiency.

You can use the grep series of commands to search for patterns in text files. Three different grep commands are available. There is plain-old grep, which searches for patterns and basic regular expressions; there is egrep, which can search for extended regular expressions (which employ a large suite of special wildcards to define variable patterns); and there is fgrep, which searches for fixed strings (strings that must be matched literally, without wildcards). Some earlier UNIX manual pages also referred to fgrep as "fast grep" because it was supposed to be faster than regular grep. In practice, though, fgrep is often no faster than regular grep and can even be slower.
Most man pages these days no longer refer to fgrep as fast grep.

Suppose you want to search the file textfile for the pattern cat. In its simplest form, grep looks like this:

# grep cat textfile

This command searches through every line of the file textfile and prints each line where the pattern cat is matched. Note that the command matches a pattern and not a word. This means that in addition to cat, the words catnip, catbird, catfish, and concatenate would also be matched because they all contain the string cat. If you only want to match the actual word cat, enclose the string in quotes and include spaces on each side, like this:

# grep " cat " textfile

Note, however, that this only matches occurrences of cat in the middle of a sentence; if cat appears at the beginning or end of a line, or if it's followed by a period, this grep command won't match it. Adjust your search pattern to your needs accordingly.

Some common options to grep include -i to perform a case-insensitive search, -c to suppress the display of matching lines and print the number of matching lines instead, -n to prefix each matching line with its line number, and -v to reverse the operation and print only lines that do not match the specified pattern.

The extended regular expression matching of egrep is beyond the scope of this chapter, but regular expressions will be covered in detail in Chapter 10.

Sorting Text in a File

Sometimes, you might want to view the text in a file sorted into a certain order. For example, you might want to take a list of names and perform an alphabetical sort, or take a list of expenses and sort the lines numerically. You can use the sort command for this. By default, this command sorts based on ASCII value, and it does not ignore leading whitespace. Some of the common options are shown in Table 8.10.
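To make the default ordering concrete, here is a minimal sketch that contrasts it with a numeric sort. The sample values and the temporary file are invented for illustration; -n (numeric comparison) and -r (reverse order) are standard sort options.

```shell
# Hypothetical sample data: one expense amount per line.
tmp=$(mktemp "${TMPDIR:-/tmp}/sortdemo.XXXXXX")
printf '100\n9\n25\n' > "$tmp"

# Default sort compares lines as text, character by character,
# so "100" comes before "25" and "25" before "9".
# LC_ALL=C pins the comparison to plain ASCII order.
text_order=$(LC_ALL=C sort "$tmp")

# -n compares lines by numeric value instead.
numeric_order=$(sort -n "$tmp")

# -r reverses whichever ordering is in effect.
reverse_numeric=$(sort -rn "$tmp")

printf 'text order:\n%s\n\nnumeric order:\n%s\n\nreverse numeric order:\n%s\n' \
    "$text_order" "$numeric_order" "$reverse_numeric"

rm -f "$tmp"
```

The text ordering is the reason a list of numbers usually needs -n: without it, sort compares the first characters of each line rather than the values.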
If given more than one file on its command line, sort treats the files as one concatenated input and sorts all of their lines together. If you use the -m option when supplying multiple files, sort will work faster by merging them together. However, for the -m option to work properly, each input file should already be individually sorted.

Replacing Strings Using tr

You can use the tr ("translate") command to replace each occurrence of certain characters in a text file with other characters. Note that tr works character by character; it does not replace one multi-character string with another. The basic form of the command is as follows:

# tr 'a-z' 'A-Z'

This command would replace all lowercase letters with uppercase letters. By default, tr gets its input from standard input (which is normally the keyboard) and sends its output to standard output (which is normally the screen). This is not very useful in most cases, so normally tr is used with input and output redirection. You will learn more about input and output redirection later, but here is the basic form of tr to make it receive input from a file and also direct output to a file:

# tr 'a-z' 'A-Z' < file1 > file2

This command will read file1, replace all lowercase letters in the file with capital letters, and store the result in file2.

You can also use the -d option with tr. In this case, tr will simply go through the file and delete each occurrence of the specified characters. For example, the following will delete each occurrence of either uppercase A or uppercase B from file1 and store the results in file2:

# tr -d 'AB' < file1 > file2

The tr command is extremely flexible. When used with the proper pipes, redirections, and options, it can address a great many text-manipulation tasks that users have all too frequently written Perl scripts for, not realizing they're reinventing the wheel.
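As a rough sketch of that flexibility, the following script runs three tr invocations over the same input using the redirection form shown above. The sample sentence, the working directory, and the file names file3 and file4 are assumptions made for illustration; -s (squeeze repeated characters) is a standard tr option that the text above does not cover.

```shell
# Hypothetical working directory and sample input with uneven spacing.
workdir=$(mktemp -d "${TMPDIR:-/tmp}/trdemo.XXXXXX")
printf 'hello   brave  new world\n' > "$workdir/file1"

# Translate lowercase letters to uppercase, as in the text above.
tr 'a-z' 'A-Z' < "$workdir/file1" > "$workdir/file2"

# -s squeezes each run of repeated spaces down to a single space.
tr -s ' ' < "$workdir/file1" > "$workdir/file3"

# Translate spaces to newlines and squeeze the repeats,
# which leaves one word per line.
tr -s ' ' '\n' < "$workdir/file1" > "$workdir/file4"

upper=$(cat "$workdir/file2")
squeezed=$(cat "$workdir/file3")
one_per_line=$(cat "$workdir/file4")

printf '%s\n%s\n%s\n' "$upper" "$squeezed" "$one_per_line"

rm -r "$workdir"
```

The one-word-per-line form is a common first step before counting or sorting words with the other commands in this chapter.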
Showing Only Certain Parts of Lines in Text Files

Sometimes, you might be interested in only a certain part of each line in a file (just the first half of each line, for instance), or you might want to divide up each line at commas or tab characters and print out only the third field of each. You can use the cut command to cut only certain fields or parts thereof from a file for display. For example, suppose you have a text file named phone.txt that contains the following simple address book:

Doe, John~105 Some Street~Anytown~NY~55555~123-555-1212
Doe, Jane~105 Some Street~Anytown~NY~55555~123-555-1212
James, Joe~251 Any Street~Sometown~CA~51111~321-555-1212

If you only want to see the first five characters of each line, you can use cut -c 1-5 phone.txt, in which the argument to -c (1-5) specifies a list of character positions, which in this case is characters 1 through 5. This results in the following:

# cut -c 1-5 phone.txt
Doe,
Doe,
James

A more useful application of cut is to cut only certain fields from a line with regular delimiters. The following command will return the first field from a set of lines delimited by tabs:

# cut -f 1 phone.txt

By default, cut expects fields to be separated with tab characters. However, you can change the field separator to any character you want. In this case, our address book text file doesn't use tab characters as field separators; it uses tildes (~). Because a line that contains no delimiter is printed unchanged, the preceding command would simply print every line of phone.txt in full. You can specify which delimiter character you want to use with the -d option:

# cut -f 1 -d '~' phone.txt
Doe, John
Doe, Jane
James, Joe

Here, you have told cut to display only the first field, and you also told it that fields are delimited by tildes (~). Because the first tilde comes after the name, the command lists only the name of the person and leaves out the rest of the information.
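The -f option also accepts lists and ranges of fields, in the same way that -c accepts a range of character positions. The following is a minimal sketch using the phone.txt data from above; the temporary directory and the shell variable names are assumptions made for illustration.

```shell
# Recreate the address book from the text in a hypothetical temp directory.
workdir=$(mktemp -d "${TMPDIR:-/tmp}/cutdemo.XXXXXX")
cat > "$workdir/phone.txt" <<'EOF'
Doe, John~105 Some Street~Anytown~NY~55555~123-555-1212
Doe, Jane~105 Some Street~Anytown~NY~55555~123-555-1212
James, Joe~251 Any Street~Sometown~CA~51111~321-555-1212
EOF

# A comma-separated list selects individual fields: name and phone number.
names_and_phones=$(cut -d '~' -f 1,6 "$workdir/phone.txt")

# A range selects consecutive fields: city, state, and ZIP code.
city_state_zip=$(cut -d '~' -f 3-5 "$workdir/phone.txt")

printf '%s\n\n%s\n' "$names_and_phones" "$city_state_zip"

rm -r "$workdir"
```

When more than one field is selected, cut joins the selected fields in its output with the same delimiter it used to split the line, so the tildes remain between them.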
Similarly, you can get a listing of all the users on your system by using cut on the /etc/passwd file:

# cut -f 1 -d ':' /etc/passwd
frank
bob
alice
joe
simba
lee

Formatting Text with fmt

The fmt command formats text into nice 65-character lines (by default). This is most useful for preparing a text file to be sent through email, but it can be used for other simple formatting tasks as well. Here's an example:
The first line contains 105 characters, which is too long to display on one line of a character-based display (and even on some graphical displays if the resolution is low). As a result, either the mail-reading program will break the line in an odd place (such as in the middle of a word) or the text will run off the right edge of the screen, forcing the reader to scroll right to read the rest of it. (If you've ever gotten one of those email messages that looks like it is just one long line, the mail program is not breaking the lines properly for display.)

The fmt command will save us. Its typical use is simple, as follows:

# fmt quote.txt
Until he extends his circle of compassion to include all living
things, man will not himself find peace. -- Dr. Albert Schweitzer

This output could then be redirected either to a mail program or to a file that could then be mailed.

Here is an example that makes it easier to see the results of the fmt command:

Until he extends his circle of compassion to include all living things man will not himself find peace -- Albert Schweitzer

It will look like this after being run through fmt:

Until he extends his circle of compassion to include all living
things man will not himself find peace -- Albert Schweitzer

This section has presented some of the most useful commands for working with text. By combining these various commands, you can perform some rather sophisticated tasks, such as analyzing web server logs for trends. Of course, these commands have their limits. When you run into them, you might want to look into sed and awk for text processing. Both sed and awk are beyond the scope of this chapter (whole books exist on both subjects, such as sed & awk published by O'Reilly), but you should be aware that they exist on your FreeBSD system and can be used to handle some very sophisticated text-processing tasks.

So, how can you combine the commands we've used to do more useful things?
That's where pipes and input/output redirection come into play.
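As a small preview, and only as a sketch, the script below chains several of the commands from this section with pipes (the | character), so that the output of one command becomes the input of the next. It reuses the phone.txt address book from the cut examples; the temporary directory and the variable names are assumptions made for illustration.

```shell
# Recreate the address book in a hypothetical temp directory.
workdir=$(mktemp -d "${TMPDIR:-/tmp}/pipedemo.XXXXXX")
book="$workdir/phone.txt"
cat > "$book" <<'EOF'
Doe, John~105 Some Street~Anytown~NY~55555~123-555-1212
Doe, Jane~105 Some Street~Anytown~NY~55555~123-555-1212
James, Joe~251 Any Street~Sometown~CA~51111~321-555-1212
EOF

# grep keeps the lines containing "Doe,", cut keeps only the name
# field, and sort puts the names in order.
doe_names=$(grep 'Doe,' "$book" | cut -d '~' -f 1 | sort)

# Extract the state field, sort it, and keep only the first line.
first_state=$(cut -d '~' -f 4 "$book" | sort | head -n 1)

# Count the entries; tr -d strips the padding that some wc versions add.
entry_count=$(wc -l < "$book" | tr -d ' ')

printf 'Doe entries:\n%s\n\nFirst state alphabetically: %s\nTotal entries: %s\n' \
    "$doe_names" "$first_state" "$entry_count"

rm -r "$workdir"
```

Each stage does one small job, which is the pattern the rest of this discussion builds on.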