Project 29 File-Content Tips
“Is there an easy way to format the contents of text files?”
This project gives you tips for detecting the type of content a file contains and introduces some handy text-processing utilities.
Determine File Content
The file command reports the type of content a file contains.
$ file *
about-html.txt: ASCII text
fake.html: empty
index.html: ASCII HTML document text
letter.doc: ASCII English text
nodif: a /bin/tcsh script text executable
smtp-auth-plain: a /usr/bin/perl script text executable
unix2mac: a /bin/bash script text executable
week1: directory
week1.tar: POSIX tar archive
week1.tbz2: bzip2 compressed data, block size = 900k
Specify option -i to display the file type in MIME format. (On recent macOS releases the MIME option is -I; there, -i has a different meaning.)
$ file -i *
about-html.txt: text/plain; charset=us-ascii
fake.html: application/x-empty
index.html: text/html; charset=us-ascii
letter.doc: text/plain, English; charset=us-ascii
nodif: application/x-shellscript
smtp-auth-plain: application/x-perl
unix2mac: application/x-shellscript
week1: application/x-not-regular-file
week1.tar: application/x-tar, POSIX
week1.tbz2: application/octet-stream
Search for Files with a Specific Type of Content
We can pipe the results from file to grep to look for files with specific content.
$ file * | grep -i html
about-html.txt: ASCII text
fake.html: empty
index.html: ASCII HTML document text
This simple approach suffers from a problem: If the filename contains the search term, it will match too, regardless of the content. We must add a little sophistication to the search term to absorb everything from the beginning of the line to the colon after the filename, using a regular expression such as “^.*:”, and then search for html.
$ file * | grep -i "^.*:.*html"
index.html: ASCII HTML document text
The regular expression searches from the start of a line (^) for anything (.*) followed by a colon and then anything followed by html.
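The difference between the two searches can be checked without touching the filesystem. The sketch below feeds three lines copied from the file output shown earlier into both grep commands; the variable names (sample, naive, anchored) are illustrative only.

```shell
# Three lines copied from the earlier `file *` output, used as canned input.
sample='about-html.txt: ASCII text
fake.html: empty
index.html: ASCII HTML document text'

# Naive search: matches "html" anywhere on the line, including the filename.
naive=$(printf '%s\n' "$sample" | grep -i html)

# Anchored search: "html" must appear after the colon that ends the filename.
anchored=$(printf '%s\n' "$sample" | grep -i "^.*:.*html")

printf 'naive:\n%s\n\nanchored:\n%s\n' "$naive" "$anchored"
```

The naive search reports all three lines, while the anchored search reports only index.html.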
Process Files with a Specific Content Type
It’s easy to extend the pipeline shown above so that it passes the list of matching filenames to a program such as Apple’s TextEdit.
To do this, we use awk to pass on just the filename, which is the first field of each line.
$ file * | grep -i "^.*:.*html" | awk '{print $1}'
index.html:
Then we use sed to chop off the colon.
$ file * | grep -i "^.*:.*html" | awk '{print $1}' ¬
    | sed 's/://'
index.html
Finally, we use xargs to form a command line from the list of files.
$ file * | grep -i "^.*:.*html" | awk '{print $1}' ¬
    | sed 's/://' | xargs open -a textedit
In this example, the command line will be
open -a textedit index.html
The command open -a runs the specified GUI program, resulting in TextEdit’s opening index.html.
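Before letting xargs launch anything, it can help to preview the command line it will build. The following sketch, which assumes a single canned line in place of real file output, puts echo in front of open so that the command is printed rather than run; the variable names are illustrative.

```shell
# Canned line standing in for the output of: file * | grep -i "^.*:.*html"
matches='index.html: ASCII HTML document text'

# Step 1: awk keeps the first whitespace-separated field ("index.html:").
# Step 2: sed removes the first colon on each line.
# Step 3: xargs appends the names to the command; echo prints it instead of running it.
preview=$(printf '%s\n' "$matches" | awk '{print $1}' | sed 's/://' | xargs echo open -a textedit)

echo "$preview"
```

Because awk splits fields on whitespace, a filename that contains a space would be cut short at the first step, so this pipeline suits simple filenames only.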
An alternative approach uses option -F, telling file to separate the filename from the content type with space-colon instead of just colon. Consequently, the first field seen by awk will be the filename without the colon.
$ file -F " :" * | grep -i "^.*:.*html" ¬
    | awk '{print $1}' | xargs open -a textedit
Search Compressed Files
Option -z tells file to look inside compressed files. Compare the output of the next two examples.
$ file week1.tbz2
week1.tbz2: bzip2 compressed data, block size = 900k
$ file -z week1.tbz2
week1.tbz2: POSIX tar archive (bzip2 compressed data, block size = 900k)
Expand and Unexpand Tabs
The expand command expands tab characters to the appropriate number of spaces, and unexpand does the reverse. Pass option -a to unexpand to ensure that all spaces are converted; otherwise, only leading spaces are converted.
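As a small illustration of the round trip, the sketch below expands a single tab and then converts the spaces back, with and without -a. It assumes the default tab stops of eight columns; the variable names are illustrative.

```shell
# A line with one tab between two letters (printf turns \t into a real tab).
tabbed=$(printf 'a\tb')

# expand replaces the tab with spaces up to the next tab stop (column 8 here).
spaced=$(printf '%s\n' "$tabbed" | expand)

# unexpand without -a converts only leading blanks, so this line is left alone.
unchanged=$(printf '%s\n' "$spaced" | unexpand)

# unexpand -a also converts the interior run of spaces back into a tab.
restored=$(printf '%s\n' "$spaced" | unexpand -a)

# od -c makes the tab visible as \t.
printf '%s\n' "$restored" | od -c
```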
Fold Long Lines
Long lines can be broken into shorter lines by the fold command. In this example, the output has lines of no more than 40 characters. Output is displayed on the terminal screen; to save the results, simply redirect output to a file by using > name-of-output-file.
$ cat longlines
this is a file with one very long line and no linefeeds in it to demonstrate the use of fold to break long lines into the specified width
$ fold -w40 longlines
this is a file with one very long line a
nd no linefeeds in it to demonstrate the
 use of fold to break long lines into th
e specified width
The fmt command is more sophisticated and breaks lines at spaces instead of midword.
$ fmt -40 longlines
this is a file with one very long line
and no linefeeds in it to demonstrate
the use of fold to break long lines into
the specified width
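To compare the two commands side by side and keep the results, both outputs can be redirected to files. This is a minimal sketch that recreates the sample file in a scratch directory; the output names folded.txt and formatted.txt are arbitrary, and fmt is called with -w 40 to set the width.

```shell
# Work in a scratch directory so that no existing file is overwritten.
workdir=$(mktemp -d)

# Recreate the sample file: one long line ending in a single newline.
printf '%s\n' 'this is a file with one very long line and no linefeeds in it to demonstrate the use of fold to break long lines into the specified width' > "$workdir/longlines"

# fold cuts at exactly 40 characters, even in the middle of a word.
fold -w 40 "$workdir/longlines" > "$workdir/folded.txt"

# fmt breaks only at spaces and keeps each line within 40 characters.
fmt -w 40 "$workdir/longlines" > "$workdir/formatted.txt"

# Report the longest line in each result.
for name in folded.txt formatted.txt; do
    awk -v name="$name" '{ if (length($0) > max) max = length($0) }
        END { print name ": longest line " max }' "$workdir/$name"
done
```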
Split Large Files
Use the split command to split a large file into several smaller files. By default, each piece is 1,000 lines long; specify option -l followed by a line count to change the size of the pieces.
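For instance, the sketch below builds a 2,500-line sample file and splits it twice, once with the default size and once with -l 500. The file name big.txt and the prefixes part_ and small_ are arbitrary choices for this illustration.

```shell
# Work in a scratch directory so that the pieces do not clutter the current one.
workdir=$(mktemp -d)

# Build a 2,500-line sample file.
awk 'BEGIN { for (i = 1; i <= 2500; i++) print "line " i }' > "$workdir/big.txt"

# Default behavior: pieces of 1,000 lines, named with the prefix plus aa, ab, ac, ...
split "$workdir/big.txt" "$workdir/part_"

# -l changes the piece size; here each piece holds at most 500 lines.
split -l 500 "$workdir/big.txt" "$workdir/small_"

# Show the line count of every piece.
wc -l "$workdir"/part_* "$workdir"/small_*
```

The first split produces three pieces (1,000, 1,000, and 500 lines); the second produces five pieces of 500 lines each.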