This article is part of a self-published book project by Balthazar Rouberol and Etienne Brodu, ex-roommates, friends and colleagues, aiming at empowering the up-and-coming generation of developers. We are currently hard at work on it!
If you are interested in the project, we invite you to join the mailing list!
Text processing in the shell
One of the things that makes the shell an invaluable tool is the amount of available text processing commands, and the ability to easily pipe them into each other to build complex text processing workflows. These commands can make it trivial to perform text and data analysis, convert data between different formats, filter lines, etc.
When working with text data, the philosophy is to break any complex problem you have into a set of smaller ones, and to solve each of them with a specialized tool.
Make each program do one thing well.
The examples in this chapter might seem a little contrived at first, but this is by design: each of these tools was designed to solve one small problem. However, they become extremely powerful when combined.
We will go over some of the most common and useful text processing commands the shell has to offer, and will demonstrate real-life workflows piping them together. I suggest you take a look at the man pages of these commands to see the full breadth of options at your disposal.
The example CSV (comma-separated values) file is available online. Feel free to download it yourself to test these commands.
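To give a first taste of such a workflow, here is a minimal sketch of the kind of pipeline this chapter builds towards, using only commands covered later in the chapter. The counts shown assume the 43-line sample metadata.csv file described above, in which every data line describes a gauge metric:
$ cut -d , -f 2 metadata.csv | sort | uniq -c
  42 gauge
   1 metric_type
cut extracts the metric_type column, sort groups identical values together, and uniq -c counts each group, telling us at a glance what kinds of metrics the file contains.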
cat
As seen in the previous chapter, cat is used to concatenate a list of one or more files and display their content on screen.
$ cat Documents/readme
Thanks again for reading this book!
I hope you're following so far!
$ cat Documents/computers
Computers are not intelligent
They're just fast at making dumb things.
$ cat Documents/readme Documents/computers
Thanks again for reading this book!
I hope you're following so far!
Computers are not intelligent
They're just fast at making dumb things.
head
head prints the first n lines in a file. It can be very useful to peek into a file of unknown structure and format without burying your shell under a wall of text.
$ head -n 2 metadata.csv
metric_name,metric_type,interval,unit_name,per_unit_name,description,orientation,integration,short_name
mysql.galera.wsrep_cluster_size,gauge,,node,,The current number of nodes in the Galera cluster.,0,mysql,galera cluster size
If -n is unspecified, head will print the first 10 lines in its argument file or input stream.
tail
tail is head's counterpart: it prints the last n lines in a file.
$ tail -n 1 metadata.csv
mysql.performance.queries,gauge,,query,second,The rate of queries.,0,mysql,queries
If you want to print all lines in a file located after the nth line (included), you can use the -n +n argument.
$ tail -n +42 metadata.csv
mysql.replication.slaves_connected,gauge,,,,Number of slaves connected to a replication master.,0,mysql,slaves connected
mysql.performance.queries,gauge,,query,second,The rate of queries.,0,mysql,queries
Our file has 43 lines, so tail -n +42 only prints the 42nd and 43rd lines of our file.
If -n is unspecified, tail will print the last 10 lines in its argument file or input stream.
tail -f (or tail --follow) displays the last lines of a file and keeps displaying each new line as the file is written to. It is very useful for watching real-time activity written to a log file, such as a web server log file.
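For example, to watch requests hitting a web server as they arrive (the log file path here is purely illustrative and will vary from system to system):
$ tail -f /var/log/nginx/access.log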
wc
wc (for word count) prints either the number of characters (when using -c), words (when using -w) or lines (when using -l) in its argument files or input stream.
$ wc -l metadata.csv
43 metadata.csv
$ wc -w metadata.csv
405 metadata.csv
$ wc -c metadata.csv
5094 metadata.csv
By default, wc prints all of the above.
$ wc metadata.csv
43 405 5094 metadata.csv
If the text data is piped in or redirected into wc's stdin, only the counts will be printed out, without the file name.
$ cat metadata.csv | wc
43 405 5094
$ cat metadata.csv | wc -l
43
$ wc -w < metadata.csv
405
grep
grep is the Swiss Army knife of line filtering. It allows you to filter lines matching a given pattern.
For example, we can use grep to find all occurrences of the word mutex in our metadata.csv file.
$ grep mutex metadata.csv
mysql.innodb.mutex_os_waits,gauge,,event,second,The rate of mutex OS waits.,0,mysql,mutex os waits
mysql.innodb.mutex_spin_rounds,gauge,,event,second,The rate of mutex spin rounds.,0,mysql,mutex spin rounds
mysql.innodb.mutex_spin_waits,gauge,,event,second,The rate of mutex spin waits.,0,mysql,mutex spin waits
grep can either operate on files passed as arguments, or on a stream of text passed to its stdin. We can thus chain multiple grep commands to further filter our text. In the next example, we filter lines in our metadata.csv file that contain both the words mutex and OS.
$ grep mutex metadata.csv | grep OS
mysql.innodb.mutex_os_waits,gauge,,event,second,The rate of mutex OS waits.,0,mysql,mutex os waits
Let’s go over some of the options you can pass to grep and their associated behavior.
grep -v performs an inverted match: it selects the lines that do not match the argument pattern.
$ grep -v gauge metadata.csv
metric_name,metric_type,interval,unit_name,per_unit_name,description,orientation,integration,short_name
grep -i performs case-insensitive matching. In the next example, grep -i os matches both OS and os.
$ grep -i os metadata.csv
mysql.innodb.mutex_os_waits,gauge,,event,second,The rate of mutex OS waits.,0,mysql,mutex os waits
mysql.innodb.os_log_fsyncs,gauge,,write,second,The rate of fsync writes to the log file.,0,mysql,log fsyncs
grep -l only lists files containing a match.
$ grep -l mysql metadata.csv
metadata.csv
grep -c counts the number of lines matching a pattern.
$ grep -c select metadata.csv
3
grep -r recursively searches files in a given directory and all subdirectories below it.
$ grep -r are ~/Documents
/home/br/Documents/computers:Computers are not intelligent
/home/br/Documents/readme:I hope you are following so far!
grep -w only matches whole words.
$ grep follow ~/Documents/readme
I hope you are following so far!
$ grep -w follow ~/Documents/readme
$
cut
cut cuts out a portion of a file (or, as always, of its input stream). cut works by defining a field delimiter (what separates two columns) with the -d option, and which column(s) should be extracted with the -f option.
For example, the following command extracts the first column of the last 5 lines of our CSV file.
$ tail -n 5 metadata.csv | cut -d , -f 1
mysql.performance.user_time
mysql.replication.seconds_behind_master
mysql.replication.slave_running
mysql.replication.slaves_connected
mysql.performance.queries
As we are dealing with a CSV file, we can extract each column by cutting on the , character, and extract the first column with -f 1.
We could also select both the first and second columns by using the -f 1,2 option.
$ tail -n 5 metadata.csv | cut -d , -f 1,2
mysql.performance.user_time,gauge
mysql.replication.seconds_behind_master,gauge
mysql.replication.slave_running,gauge
mysql.replication.slaves_connected,gauge
mysql.performance.queries,gauge
paste
paste can merge together two different files into one multi-column file.
$ cat ingredients
eggs
milk
butter
tomatoes
$ cat prices
1$
1.99$
1.50$
2$/kg
$ paste ingredients prices
eggs	1$
milk	1.99$
butter	1.50$
tomatoes	2$/kg
By default, paste uses a tab delimiter, but you can change that using the -d option.
$ paste ingredients prices -d:
eggs:1$
milk:1.99$
butter:1.50$
tomatoes:2$/kg
Another common use of paste is to join all lines within a stream or a file using a given delimiter, through a combination of the -s and -d arguments.
$ paste -s -d, ingredients
eggs,milk,butter,tomatoes
If - is specified as an input file, stdin will be read instead.
$ cat ingredients | paste -s -d, -
eggs,milk,butter,tomatoes
sort
sort, well, sorts argument files or input.
$ cat ingredients
eggs
milk
butter
tomatoes
salt
$ sort ingredients
butter
eggs
milk
salt
tomatoes
sort -r performs a reverse sort.
$ sort -r ingredients
tomatoes
salt
milk
eggs
butter
sort -n performs a numerical sort, sorting fields by their arithmetic value.
$ cat numbers
0
2
1
10
3
$ cat numbers | sort
0
1
10
2
3
$ cat numbers | sort -n
0
1
2
3
10
uniq
uniq detects or filters out adjacent identical lines in its argument file or input stream.
$ cat duplicates
and one
and one
and two
and one
and two
and one, two, three
$ uniq duplicates
and one
and two
and one
and two
and one, two, three
As uniq only filters out adjacent identical lines, we can still see duplicate lines in its output. To filter out all identical lines from our duplicates file, we need to sort its content first.
$ sort duplicates | uniq
and one
and one, two, three
and two
uniq -c prepends each line with its number of occurrences.
$ sort duplicates | uniq -c
   3 and one
   1 and one, two, three
   2 and two
uniq -u only displays the unique lines within its input.
$ sort duplicates | uniq -u
and one, two, three
uniq is particularly useful in conjunction with sort, as | sort | uniq allows you to remove any duplicate lines in a file or a stream.
awk
awk is a little more than a text processing tool: it's actually a whole programming language of its own. One thing awk is really good at is splitting files into columns, and it especially shines when these files contain a mix of spaces and tabs.
$ cat -t multi-columns
John Smith Doctor^ITardis
Sarah-James Smith^I Companion^ILondon
Rose Tyler Companion^ILondon
cat -t displays tabs as ^I.
We can see that these columns are separated either by spaces or by tabs, and that they are not always separated by the same number of spaces. cut would be of no use here, because it only works with a single-character delimiter. awk, however, can easily make sense of this file.
awk '{ print $n }' prints the nth column in the text.
$ cat multi-columns | awk '{ print $1 }'
John
Sarah-James
Rose
$ cat multi-columns | awk '{ print $3 }'
Doctor
Companion
Companion
$ cat multi-columns | awk '{ print $1,$2 }'
John Smith
Sarah-James Smith
Rose Tyler
There is so much more we can do with awk; however, printing columns probably accounts for 99% of my personal usage. { print $NF } prints the last column of each line.
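For instance, applied to the multi-columns file shown above, it extracts the last column of each line, whatever that line's number of fields:
$ cat multi-columns | awk '{ print $NF }'
Tardis
London
London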
tr
tr stands for translate, and it replaces characters with others. It works either on characters or on character classes, such as lowercase, printable, spaces, alphanumeric, etc.
tr <char1> <char2> translates all occurrences of <char1> from its standard input into <char2>.
$ echo "Computers are fast" | tr a A computers Are fAst
tr can also translate character classes by using the [:class:] notation. The full list of available classes is described in the tr man page, but we'll demonstrate some of them here.
[:space:] represents all types of spaces, from a simple space to a tab or a newline.
$ echo "computers are fast" | tr '[:space:]' ',' computers,are,fast,%
All space-like characters were translated into commas. Note that the % character at the end of the output represents the lack of a trailing newline. Indeed, that newline was translated into a comma as well.
[:lower:] represents all lowercase characters, and [:upper:] represents all uppercase characters. Converting between cases is thus made very easy.
$ echo "computers are fast" | tr '[:lower:]' '[:upper:]' COMPUTERS ARE FAST $ echo "COMPUTERS ARE FAST" | tr '[:upper:]' '[:lower:]' computers are fast
tr -c SET1 SET2 transforms any character not in SET1 into the characters in SET2. The following example replaces all non-vowels with spaces.
$ echo "computers are fast" | tr -c '[aeiouy]' ' ' o u e a e a
tr -d deletes the matched characters, instead of replacing them. It's the equivalent of tr <char> ''.
$ echo "Computers Are Fast" | tr -d '[:lower:]' C A F
tr can also replace character ranges, for example all letters between a and e, or all numbers between 1 and 8, by using the notation s-e, where s is the start character and e is the end one.
$ echo "computers are fast" | tr 'a-e' 'x' xomputxrs xrx fxst $ echo "5uch l337 5p34k" | tr '1-4' 'x' 5uch lxx7 5pxxk
fold
fold wraps each input line to fit in a specified width. It can be useful to make sure an argument text fits in a small display size, for example. fold -w n folds the lines at n characters.
$ cat ~/Documents/readme | fold -w 16
Thanks again for
 reading this bo
ok!
I hope you're fo
llowing so far!
fold -s will only break lines on a space character, and can be combined with -w to fold up to a given number of characters.
$ cat ~/Documents/readme | fold -w 16 -s
Thanks again
for reading
this book!
I hope you're
following so
far!
sed
sed is a non-interactive stream editor, used to perform text transformations on its input stream, on a line-per-line basis. It can take its input from a file or its stdin, and will output its result either to a file or to its stdout.
It works by taking one or many optional addresses, a function and parameters. A sed command thus looks like this:
[address[,address]]function[arguments]
While sed can perform many functions, we will cover only substitution, as it is probably sed's most common use.
Substituting text
A sed substitution command looks like this:
s/PATTERN/REPLACEMENT/[options]
Example: replacing the first instance of a word on each line of a file:
$ cat hello
hello hello
hello world!
hi
$ cat hello | sed 's/hello/Hey I just met you/'
Hey I just met you hello
Hey I just met you world!
hi
We can see that only the first occurrence of hello was replaced on the first line. To replace all occurrences of hello on each line, we can use the g (for global) option.
$ cat hello | sed 's/hello/Hey I just met you/g'
Hey I just met you Hey I just met you
Hey I just met you world!
hi
sed allows you to use any separator other than /, which is especially useful to keep the command readable if the search or replacement pattern contains forward slashes.
$ cat hello | sed 's@hello@Hey I just met you@g'
Hey I just met you Hey I just met you
Hey I just met you world!
hi
By specifying an address, we can tell sed on which line or line range to actually perform the substitution.
$ cat hello | sed '1s/hello/Hey I just met you/g'
Hey I just met you Hey I just met you
hello world!
hi
$ cat hello | sed '2s/hello/Hey I just met you/g'
hello hello
Hey I just met you world!
hi
The address 1 tells sed to only replace hello by Hey I just met you on line 1. We can specify an address range with the notation <start>,<end>, where <end> can either be a line number or $, meaning the last line in the file.
$ cat hello | sed '1,2s/hello/Hey I just met you/g'
Hey I just met you Hey I just met you
Hey I just met you world!
hi
$ cat hello | sed '2,3s/hello/Hey I just met you/g'
hello hello
Hey I just met you world!
hi
$ cat hello | sed '2,$s/hello/Hey I just met you/g'
hello hello
Hey I just met you world!
hi
By default, sed displays its result on its stdout, but it can also edit the initial file in place, with the use of the -i option.
$ sed -i '' 's/hello/Bonjour/' sed-data
$ cat sed-data
Bonjour hello
Bonjour world!
hi
On Linux, only -i needs to be specified. However, because sed's behavior on macOS is slightly different, the '' needs to be added right after -i there.
Real-life examples
Filtering a CSV using grep and awk
$ grep -w gauge metadata.csv | awk -F, '{ if ($4 == "query") { print $1, "per", $5 } }'
mysql.performance.com_delete per second
mysql.performance.com_delete_multi per second
mysql.performance.com_insert per second
mysql.performance.com_insert_select per second
mysql.performance.com_replace_select per second
mysql.performance.com_select per second
mysql.performance.com_update per second
mysql.performance.com_update_multi per second
mysql.performance.questions per second
mysql.performance.slow_queries per second
mysql.performance.queries per second
In this pipeline, grep -w first selects the lines containing the whole word gauge, then awk splits each of them on commas (-F,), keeps the rows whose fourth column (unit_name) is query, and prints the metric name followed by per and the per_unit_name column.