Project 23 Search File Content
“How do I find all files containing the text Dear Janet before my wife does?”
This project shows how to search a file, or many files, for particular text. The search term can be straight text or a regular expression. The project covers the commands grep, wc, awk, and sed.
Use grep
This project puts the spotlight on grep and friends. The grep command searches files for text that matches a search pattern. Each file is searched line by line, and a match occurs when a line contains the pattern. Note that a line need only contain the pattern to match; it does not have to be identical to it.
Let’s search all text files (*.txt) in the current directory for the words Dear Janet.
$ grep "Dear Janet" *.txt
hello.txt:Dear Janet,
lets-meet.txt:Dear Janet, sauciest of vixens
secret-liaison.txt:Dear Janet,
We see displayed all lines from all files that match, with the matching line of text preceded by the filename.
grep Options
The grep command has many options, the most useful of which are explained below.
To change the output format from filename:text of matching line, specify the following:
- -l to display just filenames. Use this option when you are interested in which filenames match but not the matching lines—when generating a list of files to process, for example.
- -h to display just the matching line. Use this option when you want to process the lines of text and don’t want filenames polluting the output.
- -n to display line numbers too. This option is handy when you wish to edit the file later, as you can jump straight to the line in question.
- -Cn to display n lines before and after the matching line. C is for context. This option is useful when you search text documents for information.
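The output options above can be tried safely on throwaway files. The following is a minimal sketch, assuming a scratch directory and two made-up files (hello.txt and bob.txt are illustrative names, not files from this project).

```shell
# Work in a scratch directory so no real files are touched.
demo=$(mktemp -d)
cd "$demo" || exit 1
printf 'Dear Janet,\nSee you at eight.\n' > hello.txt
printf 'Hi Bob,\nNothing to see here.\n' > bob.txt

# -l: print only the names of files that contain a match.
names=$(grep -l "Dear Janet" *.txt)
echo "$names"

# -h: print only the matching lines, with no filename prefix.
lines=$(grep -h "Dear Janet" *.txt)
echo "$lines"

# -n: prefix each matching line with its line number.
numbered=$(grep -n "eight" hello.txt)
echo "$numbered"

# -C 1: print one line of context on either side of each match.
context=$(grep -C 1 "Dear Janet" hello.txt)
echo "$context"
```

Because only one file is named in the last two commands, grep omits the filename prefix there; with several files it would be included unless -h is given.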
To change the pattern-matching rules, specify the following:
- -i to ignore case. Hello will match hello and Hello.
- -v to invert the sense of the match. Lines that do not contain the pattern will be displayed.
- -w to match complete words only. Jan will match Dear Jan, hello but not Dear Janet, hello.
- -x to match whole lines only. The line must equal the pattern, not just contain it. This option has the same effect as specifying start- and end-of-line anchors in the regular expression.
- -E to activate extended regular expressions. By default, grep matches against basic regular expressions. Invoking grep as the command egrep is the same as using grep -E.
- -F to match fixed strings only, not regular expressions. The grep command operates faster on fixed strings than on regular-expression patterns. Also, in this mode it’s not necessary to escape characters like star that would otherwise be interpreted as pattern-matching operators. Invoking grep as the command fgrep is the same as using grep -F.
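To see how the matching options change which lines are selected, here is a small sketch that counts matches (with -c) in a four-line sample file. The file name and its contents are invented for the illustration.

```shell
demo=$(mktemp -d)
f="$demo/letters.txt"
printf 'Dear Jan, hello\nDear Janet, hello\ndear janet\nJan\n' > "$f"

plain=$(grep -c "Jan" "$f")       # lines 1, 2, and 4 contain Jan
nocase=$(grep -ci "jan" "$f")     # -i: all four lines match
inverted=$(grep -cv "Jan" "$f")   # -v: only line 3 lacks Jan
words=$(grep -cw "Jan" "$f")      # -w: lines 1 and 4; Janet is a different word
whole=$(grep -cx "Jan" "$f")      # -x: only line 4 is exactly Jan

echo "plain=$plain nocase=$nocase inverted=$inverted words=$words whole=$whole"
```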
Use recursive mode:
- -r to recursively search directories listed on the command line. In the following example, grep searches all files and directories in the current directory.
$ grep -r "Janet" *
archive/old-letter.txt:Dear Janet,
hello.txt:Dear Janet,
lets-meet.txt:Dear Janet, sauciest of vixens
secret-liaison.txt:Dear Janet,
The next example of recursion doesn’t work as expected. We intended to say, “Search the current directory recursively for all *.txt files.” What actually happens is that the shell expands *.txt to include all matching filenames (which does not include the directory archive); grep then searches each filename in the expansion, and if it’s a directory, grep does so recursively. We can’t specify to grep both a directory to search recursively and at the same time which files to consider.
$ grep -r "Janet" *.txt
hello.txt:Dear Janet,
lets-meet.txt:Dear Janet, sauciest of vixens
secret-liaison.txt:Dear Janet,
The solution is to use find and xargs.
$ find . -iname "*.txt" -print0 | xargs -0 grep "Janet"
./archive/old-letter.txt:Dear Janet,
./hello.txt:Dear Janet,
./lets-meet.txt:Dear Janet, sauciest of vixens
./secret-liaison.txt:Dear Janet,
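The difference between the approaches is easy to reproduce in a scratch directory. This sketch assumes a made-up layout (hello.txt at the top level, plus archive/old-letter.txt and archive/notes.log) and counts the files each command reports.

```shell
demo=$(mktemp -d)
cd "$demo" || exit 1
mkdir archive
printf 'Dear Janet,\n' > hello.txt
printf 'Dear Janet,\n' > archive/old-letter.txt
printf 'Dear Janet,\n' > archive/notes.log

# The shell expands *.txt before grep runs, so archive/ is never visited.
glob_hits=$(grep -rl "Janet" *.txt | wc -l | tr -d ' ')

# Recursing from . visits everything, including the .log file we did not want.
all_hits=$(grep -rl "Janet" . | wc -l | tr -d ' ')

# find selects the *.txt files at any depth; xargs passes them to grep.
find_hits=$(find . -iname "*.txt" -print0 | xargs -0 grep -l "Janet" | wc -l | tr -d ' ')

echo "glob=$glob_hits all=$all_hits find=$find_hits"
```

The -print0 and -0 options make find and xargs separate filenames with a null character, so names containing spaces are handled correctly.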
Some grep Examples
Mac OS X has a handy dictionary (a list of words, but bereft of definitions) located at /usr/share/dict/web2. Let’s use grep to count how many words contain the sequence xy. We use option -c to count the number of matches instead of displaying them.
$ grep -c "xy" /usr/share/dict/web2
579
How many words start with xy? This requires the use of a regular expression that says “a line that starts xy”.
$ grep -c "^xy" /usr/share/dict/web2
75
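If the dictionary file is not present on your system, the same two counts can be tried on a tiny word list. The five words below are an invented sample, so the counts differ from those for web2.

```shell
words=$(mktemp)
printf '%s\n' xylophone xylem oxygen taxying apple > "$words"

# Lines that contain xy anywhere.
contains=$(grep -c "xy" "$words")

# Lines that start with xy: the caret anchors the pattern to the start of the line.
starts=$(grep -c "^xy" "$words")

echo "contains=$contains starts=$starts"
```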
Name two of them! (Xylophone is the easy one.)
The grep command is often combined with the ps command to look for specific processes. In the next example, grep filters the output from ps to display only those lines containing safari. (The ps command does not require its options to be preceded by a dash.)
$ ps axww | grep -i safari
27946 ?? S 31:08.79 /Applications/Safari.app/Contents/MacOS/Safari -psn_0_1739980801
16705 std R+ 0:00.00 grep -i safari
If you want to use the results of this command to extract the process ID of Safari, for example, the second line of output is unwelcome. This can be eliminated in either of two ways.
Use grep -v.
$ ps axww | grep -i safari | grep -v grep
27946 ?? S 31:09.33 /Applications/Safari.app/Contents/MacOS/Safari -psn_0_1739980801
Employ some clever regular-expression trickery.
$ ps axww | grep -i "safar[i]"
27946 ?? S 31:09.50 /Applications/Safari.app/Contents/MacOS/Safari -psn_0_1739980801
How does this safar[i] trick work? It’s a regular expression that’s equivalent to "safari", so it still matches "Safari". The grep command line, however, does not match now because it contains "safar[i]" and not "safari". Think about it.
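If thinking about it doesn't help, the trick can be checked without running ps at all. The sketch below feeds grep a few hand-written lines that stand in for ps output; the PIDs and paths are sample values, not live data.

```shell
# Stand-ins for lines of ps output.
line_safari='27946 ?? S 31:08.79 /Applications/Safari.app/Contents/MacOS/Safari'
line_grep_plain='16705 std R+ 0:00.00 grep -i safari'
line_grep_trick='16705 std R+ 0:00.00 grep -i safar[i]'

# Plain pattern: grep's own command line contains "safari", so two lines match.
naive=$(printf '%s\n' "$line_safari" "$line_grep_plain" | grep -ci "safari")

# Bracket trick: the command line now contains "safar[i]", in which "safar" is
# followed by "[" rather than "i", so only the real Safari line matches.
trick=$(printf '%s\n' "$line_safari" "$line_grep_trick" | grep -ci "safar[i]")

echo "naive=$naive trick=$trick"
```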
Escape and Double Escape
Remember to enclose a regular expression in single quotes to avoid interpretation by the shell. The regular-expression sequence .* matches any string of characters, for example, but it must be escaped from the shell to stop the shell from treating the star as a globbing character and potentially expanding it. To match "line" and then any character sequence and then "1", we would type:
$ grep 'line.*1' *.txt
If we wish to search for the star character itself, star must also be escaped from regular-expression interpretation. To search for "line *1", we would type:
$ grep 'line \*1' *.txt
The escape character ensures that star is matched literally rather than being interpreted as a regular-expression operator. Refer to Project 77 if you are unfamiliar with regular expressions.
The next line is equivalent.
$ grep line\ \\\*1 *.txt
Remember fgrep? It searches for fixed patterns and does not activate regular expressions, so we can type simply
$ fgrep 'line *1' *.txt
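The effect of each level of escaping is easiest to see side by side. This is a small sketch using a three-line sample file made up for the purpose; it uses grep -F, which is the same as fgrep.

```shell
f=$(mktemp)
printf 'line *1\nline 1\nline 21\n' > "$f"

# .* matches any run of characters, so all three lines match.
any=$(grep -c 'line.*1' "$f")

# Unescaped, the star means "zero or more spaces", so a different line matches.
unescaped=$(grep 'line *1' "$f")

# Escaped from the regular expression, the star is matched literally.
escaped=$(grep 'line \*1' "$f")

# Without quotes, the shell needs its own escapes on top of the regex escape.
double=$(grep line\ \\\*1 "$f")

# Fixed-string mode needs no regex escape at all.
fixed=$(grep -F 'line *1' "$f")

echo "any=$any unescaped=[$unescaped] escaped=[$escaped] fixed=[$fixed]"
```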
Zipped Files
Use a grep-based command to examine the contents of a gzip- or bzip2-compressed file directly by using these commands:
- zgrep
- bzegrep
- bzfgrep
- bzgrep
These bz variants correspond to the versions of grep discussed in the “grep Options” section above.
Count Words
The wc command counts the number of characters, words, and lines in a text file. It’s often used to count the number of results returned by a command or pipeline. We can repeat the dictionary example from earlier by using wc.
$ grep "xy" /usr/share/dict/web2 | wc -l
579
$ grep "^xy" /usr/share/dict/web2 | wc -l
75
Option -l says to count lines only. Similarly, option -w counts words and option -c counts bytes (which, for plain ASCII text, is the same as the number of characters).
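The three counts can be checked on a short piece of text. This sketch uses a made-up two-line sample; tr strips the leading spaces that some versions of wc add to their output.

```shell
text='the quick brown fox
jumps over'

lines=$(printf '%s\n' "$text" | wc -l | tr -d ' ')   # two newline characters
words=$(printf '%s\n' "$text" | wc -w | tr -d ' ')   # six whitespace-separated words
bytes=$(printf '%s\n' "$text" | wc -c | tr -d ' ')   # 19 + 1 + 10 + 1 bytes

echo "lines=$lines words=$words bytes=$bytes"
```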
Use awk to Isolate and Format Text
The awk command (named after its authors, Aho, Weinberger, and Kernighan) is a powerful pattern-processing language. It’s explored in detail in Projects 60 and 62, but one (very simple) way it can be used is to isolate a particular portion of each line of text it receives as input.
More specifically, this use of awk involves printing a selected field from the input text—field in this instance meaning a sequence of characters separated by white space. We can use awk to isolate Safari’s process ID (PID) from the results of our earlier grep/ps search, for example. This example extends the earlier command with a pipeline to awk. An awk script, enclosed in single quotes, tells awk to print the value of the first field (field #1) of each input line. Because the first text string in a line of ps output is always a PID, this yields the PID of process Safari.
$ ps axww | grep -i "safar[i]" | awk '{print $1}'
27946
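Field extraction can be tried on canned input before pointing it at real processes. The sketch below assumes two hand-written lines in the style of ps output (sample PIDs and paths); note that awk ignores leading white space when it splits a line into fields.

```shell
# Stand-in for ps output; the second line begins with spaces, as short PIDs often do.
ps_output='27946 ?? S 31:08.79 /Applications/Safari.app/Contents/MacOS/Safari
  470 ?? S 0:15.71 /usr/sbin/cupsd'

# Field 1 of every line.
all_pids=$(printf '%s\n' "$ps_output" | awk '{print $1}')

# Field 1 of only the lines that grep lets through.
pid=$(printf '%s\n' "$ps_output" | grep -i "safar[i]" | awk '{print $1}')

echo "$all_pids"
echo "Safari PID: $pid"
```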
The number 27946 is the PID of Safari, and this number can be given as an argument to the kill command to abort the running process. We’ll enclose the pipeline sequence in $(), which tells Bash to execute it, write the result back to the command line, and then execute the new command line.
Before we do any actual killing, use echo to demonstrate that the expression enclosed by $() still outputs the Safari PID.
$ echo $(ps axww | grep -i "safar[i]" | awk '{print $1}')
27946
Now run kill.
$ kill $(ps axww | grep -i "safar[i]" | awk '{print $1}')
For completeness, let’s create a shell function killer to kill a given process by name.
$ killer () { kill $(ps axww | grep -i "$1" | grep -v "grep -i $1" | awk '{print $1}'); }
$ killer safari
The awk statement printf prints a formatted, or embellished, version of each input line. Here’s a quick example of what can be done.
$ ls -l | awk '{printf("Date: %s %s, File %s\n",$7,$6,$9)}'
Date: , File
Date: 13 Sep, File csv
Date: 13 Sep, File double-space
Date: 30 Aug, File script
The first line—Date:, File—results from the first line written by ls -l. This can easily be removed with grep.
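One way to do that removal, sketched here on a canned listing, is to drop the summary line with grep -v before awk sees it. The listing is made up (owner, sizes, and times are sample values), and it assumes the summary line starts with the word total, as it does in typical ls -l output.

```shell
listing='total 24
-rw-r--r--  1 janet  staff  120 Sep 13 10:02 csv
-rw-r--r--  1 janet  staff  340 Sep 13 10:05 double-space
-rwxr-xr-x  1 janet  staff   88 Aug 30 09:15 script'

# grep -v drops the "total" line; awk then formats fields 7, 6, and 9.
formatted=$(printf '%s\n' "$listing" | grep -v '^total' |
    awk '{printf("Date: %s %s, File %s\n", $7, $6, $9)}')

echo "$formatted"
```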
Use sed
The sed command is a stream editor and, like awk, processes its input lines based on matching patterns. It’s covered in detail in Projects 59 and 61, and we’ll use it here simply to search text files for lines that match a given pattern (Jan). Here are a couple of examples equivalent to the grep examples shown earlier in this project.
Option -n stops sed from echoing every input line, which it usually does. The construct /re/p searches for a regular expression (re) and displays the lines that contain it.
$ sed -n '/Jan/p' *.txt
Dear Janet,
Dear Janet, sauciest of vixens
Dear Jan,
Dear Janet,
Perhaps on Jan 31st?
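A short sketch makes the role of -n concrete. It assumes a three-line sample file invented for the illustration and compares sed with grep on the same pattern.

```shell
f=$(mktemp)
printf 'Dear Janet,\nHello Bob,\nPerhaps on Jan 31st?\n' > "$f"

# With -n, only lines selected by /Jan/p are printed, just as with grep.
with_sed=$(sed -n '/Jan/p' "$f")
with_grep=$(grep 'Jan' "$f")

# Without -n, every line is echoed once and matching lines are printed again.
without_n=$(sed '/Jan/p' "$f" | wc -l | tr -d ' ')

echo "$with_sed"
echo "lines printed without -n: $without_n"
```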
Next, we count the number of words starting with xy.
$ sed -n '/^xy/p' /usr/share/dict/web2 | wc -l
75
To filter the output from ps:
$ ps axww | sed -n "/Safar[i]/p"
470 ?? S 0:15.71 /Applications/Safari.app/Contents/MacOS/Safari -psn_0_3407873
Ignoring case is less elegant. One has to convert all uppercase letters to lowercase (or vice versa) by using the sed function y and then match the pattern.
$ ps axww | sed -n "y/ABCDEFGHIJKLMNOPQRSTUVWXYZ/abcdefghijklmnopqrstuvwxyz/;/safar[i]/p"
470 ?? s 0:15.71 /applications/safari.app/contents/macos/safari -psn_0_3407873
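The side effect of this approach is that the printed line is itself converted to lowercase. The sketch below shows this on a single hand-written line standing in for ps output (sample PID and path), with grep -i alongside for comparison.

```shell
line='470 ?? S 0:15.71 /Applications/Safari.app/Contents/MacOS/Safari'

# y/// rewrites the line before the pattern is tested, so the output is lowercased.
lowered=$(printf '%s\n' "$line" |
    sed -n 'y/ABCDEFGHIJKLMNOPQRSTUVWXYZ/abcdefghijklmnopqrstuvwxyz/;/safar[i]/p')

# grep -i ignores case while matching but prints the line unchanged.
preserved=$(printf '%s\n' "$line" | grep -i 'safar[i]')

echo "$lowered"
echo "$preserved"
```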