Text Utilities

There are a number of text processing utilities available in Linux, of varying complexity. Some are stand-alone programs and some are filters designed for use in scripting.

cat simply concatenates the contents of one or more files and dumps them to the screen.

tac is a reverse cat: it dumps the contents of a file to the screen last line first.
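For example, given a file called shopping_list containing the lines bread, milk and eggs (the file name and contents are just for illustration):

cat shopping_list
bread
milk
eggs

tac shopping_list
eggs
milk
bread
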
more displays the contents of a file one screen at a time.

less is an advanced version of more; it allows you to move up and down within a file and has internal searching features.
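A handy pattern is to pipe any long stream of output into less, for example:

ls -lR /usr | less

Use the arrow keys to move around, / to search forwards and q to quit.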

echo simply prints its arguments to the screen on a single line; it also interprets certain escape sequences (such as \n and \t) if the -e flag is used.
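For example (the escapes are only interpreted when -e is given):

echo -e "one\ttwo\nthree"
one	two
three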

sort sorts the lines of the specified data stream alphabetically (or numerically if the -n flag is used)
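For example, compare the default (alphabetical) sort with a numeric one:

printf '10\n9\n100\n' | sort
10
100
9

printf '10\n9\n100\n' | sort -n
9
10
100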

uniq only displays one instance of consecutive identical lines (use it in conjunction with sort to get truly unique results, eg sort my_errors_file | uniq).
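For example, on its own uniq only collapses adjacent duplicates, which is why the sort is needed:

printf 'apple\npear\napple\n' | uniq
apple
pear
apple

printf 'apple\npear\napple\n' | sort | uniq
apple
pear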

head displays the first 10 lines of a file (or however many you ask for, eg head -n 20, or the older head -20 form).
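For example, to see just the first five entries of a long listing (any command with plenty of output will do):

ls /usr/bin | head -n 5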

tail displays the last 10 lines of a file; it is again adaptable and also takes the -f flag if you want to follow the progress of a log file as something is happening.
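For example, to watch new messages arrive in the system log (the log file name varies between distributions; /var/log/syslog is common on Debian-based systems):

tail -f /var/log/syslog

Press Ctrl-C to stop following.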

wc (word count) by default will return line, word and byte counts for a text file. To get only one of them use the appropriate flag (-l, -w or -c).
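For example:

printf 'one two\nthree\n' | wc
      2       3      14

i.e. 2 lines, 3 words and 14 characters (the exact spacing of the output varies), while wc -l /etc/passwd reports just the number of lines, i.e. the number of user accounts, in that file.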

printf printf is really useful for formatting the output of a script. It is used as follows:
printf "format string" argument list
The format string can contain literal text and placeholders (format specifiers such as %s for a string and %d for a whole number) for the arguments. For example

printf "Hello %s\nWelcome to %s\nThere are %d unsorted files in your home directory\n" $USER $(hostname) $(ls -l ~/ |egrep -v '^d'|wc -l)

Will print out

Hello philip
Welcome to luggage
There are 8 unsorted files in your home directory
philip@luggage:~$

diff takes two files and returns a formatted list of their differences, or nothing if their contents are identical
patch takes the output of the diff command and can be used to reapply the changes to a copy of the original file
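A typical round trip looks like this (the file names here are just placeholders):

diff -u original.txt edited.txt > changes.diff
patch original.txt < changes.diff

The first command records the differences in unified format; the second applies them to original.txt, bringing it into line with edited.txt.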

cut A simple way to split the lines of a file into fields and keep only the pieces you want, either by delimiter (-d and -f) or by character position (-c).
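For example, to pull a single field out of colon-separated data:

echo 'cheese:dairy:yellow' | cut -d: -f2
dairy

and cut -d: -f1 /etc/passwd lists just the usernames from the password file.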

join
Allows a simple pairing of two files on a common field. Note that join expects both inputs to be sorted on the join field, so below each file is run through sort first, eg
File_1 contains
cheese dairy
apple fruit
pea legume
turnip brassica

While File_2 contains
Yoghurt dairy
Orange fruit
carrot root
bean legume
cabbage brassica

join -1 2 -2 2 <(sort -k2 File_1) <(sort -k2 File_2)
(the -1 2 and -2 2 options say to join on the second field of each file, and each file is sorted on that field before being joined) gives
brassica turnip cabbage
dairy cheese Yoghurt
fruit apple Orange
legume pea bean
Note that carrot root has no match in File_1, so by default it is left out of the output.

tr
Translates each character in the first set into the corresponding character in the second set, eg.
echo "THIS IS FAR TOO LOUD" | tr 'A-Z' 'a-z'
this is far too loud

Brief discussion of RegEx
Regular expressions are used to specify the "shape" of a piece of text you are looking for. For example, "2 digits, followed by a space, followed by one or two letters, followed by one or more digits" is a good definition of an Irish licence plate. So how would we define that as a regular expression?

egrep -o '[0-9]{2} [A-Za-z]{1,2} [0-9]+' filename

would return all the licence plates listed in the file filename.
The [ and ] enclose a character class; a range such as 0-9 matches any single character between the first and last character of the range (depending on your distribution there may be issues with your locale's sort order, but that's for later). The curly braces { and } define the number of times the previous pattern should repeat; if there are two numbers separated by a comma then it may repeat anywhere between the first and the second number of times. The + also repeats the previous pattern, one or more times, with no upper bound. The -o flag prints only the part of each line that matched rather than the whole line. Finally, egrep is the enhanced (extended) global regular expression command; on modern Linux distributions it is simply equivalent to grep -E. We delve more deeply into regex here (to be written).
grep
grep allows you to match a regular expression in a file or stream of text and prints the lines that match.
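For example (my_errors_file is the same hypothetical file used above):

grep -i 'disk' my_errors_file

prints every line of the file containing disk, Disk, DISK and so on; the -i flag makes the match case-insensitive.
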
sed
The stream editor lets you modify text with substitution, appending, deletion and insertion operations.
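For example, a simple substitution on a stream:

echo "the quick brown fox" | sed 's/brown/red/'
the quick red fox

The s/old/new/ form replaces the first occurrence of old on each line; add a trailing g (s/old/new/g) to replace every occurrence.
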
awk
This would need a book in its own right; however, awk is a very powerful little language for text manipulation.
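Just as a taste (the column numbers assume the usual ls -l output layout):

ls -l | awk '{print $9, $5}'

prints the name and size of each entry in the current directory, and

awk -F: '{print $1}' /etc/passwd

does the same job as the cut example above, printing just the usernames.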