Piping
Most programs/commands read input data from some source, then write output to some destination. A
data source can be a file, but can also be standard input. Similarly, a data destination can be a file but can
also be a stream such as standard output.
The power of the Linux command line is due in no small part to the power of piping. The pipe operator ( | )
connects one program's standard output to the next program's standard input.
A simple example is piping uncompressed data "on the fly" to count its lines using wc -l (word count command
with the lines option).
Pipe uncompressed output to a pager
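As a quick sketch of both ideas, assuming a compressed file named reads.fastq.gz (the file name is hypothetical):
zcat reads.fastq.gz | wc -l   # count the lines of the uncompressed data "on the fly"
zcat reads.fastq.gz | more    # page through the uncompressed data one screen at a time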
Because of its varied functionality, grep has many variants, including egrep (Extended grep), fgrep (Fixed
grep), pgrep (Process grep), rgrep (Recursive grep), etc. These variants differ only slightly from the original
grep, but those differences make each one popular with Linux users for specific tasks.
Some Special Meta-Characters of grep
1. + – Matches one or more occurrences of the previous character.
2. ? – Matches zero or one occurrence of the previous character. Like: ab? would
match 'a' or 'ab'.
3. ( – Start of an alternation (grouping) expression.
4. ) – End of an alternation (grouping) expression.
5. | – Matches either of the expressions separated by '|'. Like: "(a|b)cde" would
match either 'acde' or 'bcde'.
6. { – This meta-character indicates the start of a range (repetition) specifier.
Like: "a{2}" matches "aa" in the file, i.e. 'a' 2 times.
7. } – This meta-character indicates the end of a range (repetition) specifier.
Grep Command
grep, or Global Regular Expression Print, is the main search program on Unix-like
systems. It can search for any string in a single file, in a list of files, or in the output of
any command.
In addition to plain strings, grep accepts Basic Regular Expressions as search patterns. In Basic
Regular Expressions (BRE), meta-characters like '{', '}', '(', ')', '|', '+' and '?' lose
their special meaning and are treated as normal characters of the string; they need to be escaped if they
are to be treated as special characters.
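For example, a sketch assuming a file named file.txt that contains the words "file" and "gile" (GNU grep is assumed):
grep '(f|g)ile' file.txt      # BRE: searches for the literal text "(f|g)ile"
grep '\(f\|g\)ile' file.txt   # BRE with escapes: searches for "file" or "gile"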
Here, when the command is run without escaping '(' , ')' and '|', grep searches for the
complete literal string "(f|g)ile" in the file. But when the special characters are escaped,
grep treats them as meta-characters instead of part of the string and
searches for the words "file" or "gile" in the file.
Egrep Command
egrep, or grep -E, is the Extended grep. It uses the ERE, or Extended Regular
Expression, set. With ERE the meta-characters keep their special meaning without needing to be
escaped, so you are freed from the burden of escaping them as in basic grep, which makes
searching with regular expression patterns more convenient.
With egrep, even if you do not escape the meta-characters, they are treated as
special characters with their special meaning rather than as part of the literal string.
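For example, with the same hypothetical file.txt:
grep -E '(f|g)ile' file.txt   # ERE: matches "file" or "gile" with no escaping
egrep '(f|g)ile' file.txt     # equivalent invocation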
fgrep Command
fgrep, the Fixed grep or grep -F, is yet another version of grep. It is fast when searching
for an entire literal string because it does not recognize regular expressions or
meta-characters at all. When searching for a plain string, this is the version of grep that
should be selected.
fgrep searches for the complete string and never treats special characters as part of a
regular expression, whether they are escaped or not.
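For example, again with the hypothetical file.txt:
grep -F '(f|g)ile' file.txt   # matches only the literal text "(f|g)ile"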
TR
The tr command is a UNIX command-line utility for translating or deleting
characters. It supports a range of transformations including uppercase to
lowercase, squeezing repeating characters, deleting specific characters, and basic
find and replace. It can be used with UNIX pipes to support more complex
translation. tr stands for translate.
Syntax :
$ tr [OPTION] SET1 [SET2]
Options:
-c : complements the set of characters in the string, i.e. operations apply to characters not in the given set
-d : deletes characters in the first set from the output
-s : replaces repeated characters listed in SET1 with a single occurrence
-t : truncates SET1 to the length of SET2
$ cat greekfile | tr '[a-z]' '[A-Z]'
$ cat greekfile | tr '[:lower:]' '[:upper:]'
$ tr "{}" "()" <greekfile >newfile.txt
Paste Command
The paste command is one of the useful commands in Unix or Linux operating systems.
It is used to join files horizontally (parallel merging) by writing lines consisting of the
corresponding lines from each specified file, separated by a Tab delimiter, to standard output.
When no file is specified, or a dash ("-") is given instead of a file name, paste reads from
standard input and echoes it unchanged until an interrupt [Ctrl-c] is
given. Syntax:
paste [OPTION]... [FILES]...
$ cat state
Arunachal Pradesh
Assam
Andhra Pradesh
Bihar
Chhattisgrah
$ cat capital
Itanagar
Dispur
Hyderabad
Patna
Raipur
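The third input file, number, simply contains the line numbers 1 through 5, one per line:
$ cat number
1
2
3
4
5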
$ paste number state capital
1 Arunachal Pradesh Itanagar
2 Assam Dispur
3 Andhra Pradesh Hyderabad
4 Bihar Patna
5 Chhattisgrah Raipur
The -d option changes the delimiter; when only one character is specified, it is used between every pair of fields:
$ paste -d "|" number state capital
1|Arunachal Pradesh|Itanagar
2|Assam|Dispur
3|Andhra Pradesh|Hyderabad
4|Bihar|Patna
5|Chhattisgrah|Raipur
Advanced commands
cut, sort, uniq
• cut command lets you isolate ranges of data from its input lines (a short sketch follows this list)
o cut -f <field_number(s)> extracts one or more fields (-f) from each line of its input
▪ use -d <delim> to change the field delimiter (Tab by default)
o cut -c <character_number(s)> extracts one or more characters (-c) from each line of input
o the <numbers> can be
▪ a comma-separated list of numbers (e.g. 1,4,7)
▪ a hyphen-separated range (e.g. 2-5)
▪ a trailing hyphen says "and all items after that" (e.g. 3,7-)
o cut does not re-order fields, so cut -f 5,3,1 acts like -f 1,3,5
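A short sketch, using the colon-delimited /etc/passwd file:
cut -d ':' -f 1,7 /etc/passwd   # extract field 1 (account name) and field 7 (login shell)
cut -c 1-10 /etc/passwd         # extract the first 10 characters of each line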
awk
awk is a powerful scripting language that is easily invoked from the command line. Its field-oriented
capabilities make it the go-to tool for manipulating table-like delimited lines of text.
• awk '<script>' - the '<script>' is applied to each line of input (generally piped in)
• always enclose '<script>' in single quotes to inhibit shell evaluation, because awk has its own set
of metacharacters that are different from the shell's
Example that prints the average of its input numbers (echo -e converts backslash escape
characters like newline \n to the ASCII newline character so that the numbers appear on separate lines)
echo -e "1\n2\n3\n4\n5" | awk '
BEGIN{sum=0; ct=0}
{ sum = sum + $1
ct = ct + 1 }
END{print sum/ct}'
• BEGIN {<expressions>} – use to initialize variables before any script body lines are executed
o e.g. BEGIN {FS=":"; OFS="\t"; sum=0; ct=0}
▪ says use colon ( : ) as the input field separator (FS), and Tab ( \t ) as the output field separator (OFS)
• the default input field separator (FS) is whitespace
o one or more spaces or Tabs
• the default output field separator (OFS) is a single space
▪ initializes the variables sum and ct to 0
• {<body expressions>} – expressions to apply to each line of input
o use $1, $2, etc. to pick out specific input fields of each line
▪ e.g. {sum = sum + $4} adds field 4 of the input to the variable sum
o the built-in variable NF is the number of fields in the current line
o the built-in variable NR is the record (line) number of the current line
• END {<expressions>} – executed after all input is complete
o e.g. END {print sum,ct} prints the final values of the sum and ct variables, separated by the output field
separator (a fuller sketch combining these pieces follows this list).
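A sketch that combines these pieces, again using the colon-delimited /etc/passwd file (field 1 is the account
name, field 7 the login shell):
cat /etc/passwd | awk '
BEGIN{FS=":"; OFS="\t"}
{ print NR, $1, $7 }
END{ print "total accounts:", NR }'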
The basic functions of cut and awk are similar – both are field oriented; the main practical difference is in how
each handles field separators (awk's behavior is summarized near the end of this section).
grep
grep, in contrast, is line oriented: it prints the lines of its input that match a pattern (example invocations
follow this list).
• grep -P '<pattern>' searches for <pattern> in its input, and only outputs lines containing it
o always enclose '<pattern>' in single quotes to inhibit shell evaluation!
▪ pattern-matching metacharacters in grep are very different from those in the shell
o -P says to use Perl patterns, which are much more powerful (and standard) than default grep patterns
o -v (inverse match) – only print lines with no match
o -n (line number) – prefix output with the line number of the match
o -i (case insensitive) – ignore case when matching
o -l says return only the names of files that do contain the pattern match
o -L says return only the names of files that do not contain the pattern match
o -c says just return a count of line matches
o -A <n> (After) and -B <n> (Before) – output <n> number of lines after or before a match
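A few sketches of these options in use (logfile.txt is a hypothetical file name; the other paths are standard system files):
grep -P -i -n 'error' logfile.txt   # case-insensitive match, prefixed with line numbers
grep -P -v '^#' /etc/fstab          # print only the lines that are not comments
grep -P -c 'bash$' /etc/passwd      # count the accounts whose login shell ends in bash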
A regular expression (regex) is a pattern of literal characters to search for and metacharacters that control
and modify how matching is done.
A regex <pattern> can contain special match metacharacters and modifiers. The most useful ones are Perl
metacharacters, which are the "gold standard" and are supported by most languages (e.g. via grep -P).
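For instance (datafile.txt is a hypothetical file name):
grep -P '^\d+\t' datafile.txt   # lines beginning with one or more digits followed by a Tab
grep -P 'ab?c' datafile.txt     # matches 'ac' or 'abc' (? means zero or one of the preceding character)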
How awk handles field separators:
• default field separator: whitespace (one or more spaces or Tabs)
o Note: some older versions of awk do not treat Tabs as field separators.
• to change the field separator:
o in the BEGIN { } block, set FS= (the input field separator) and/or OFS= (the output field separator)
o or use the -F (--field-separator) command-line option
• examples:
cat /etc/fstab | grep -v '^#' | awk 'BEGIN{OFS="\t"}{print $2,$1}'
cat /etc/passwd | awk -F ":" '{print $1}'
When the shell processes a command line, it first parses the text into tokens ("words"), which are groups of
characters separated by whitespace (one or more spaces or Tabs). Quoting affects how this parsing happens,
including how metacharacters are treated and how text is grouped.
The quote characters themselves ( ' " ` ) are metacharacters that tell the shell to "start a quoting process"
then "end a quoting process" when the matching quote is found. Since they are part of the processing,
the enclosing quotes are not included in the output.
The first rule of quoting is: always enclose a command argument in quotes if it contains spaces so that the
command sees the quoted text as one item. In particular, always use single ( ' ) or double ( " ) quotes when
you define an environment variable whose value contains spaces.
foo='Hello world' # correct - defines variable "foo" to have value "Hello world"
foo=Hello world # error - no command called "world"
These two expressions using double quotes or single quotes are different because the single quotes tell the
shell to treat the quoted text as a literal, and not to look inside it for metacharacter processing.
# Inside double quotes, the text "$USER" is evaluated and its value substituted
echo "my account name is $USER"
One common use of for loops is to process multiple files, where the set of files to process is obtained by
pathname wildcarding. For example, the code below counts the number of reads in a set of
compressed FASTQ files:
For loop to count sequences in multiple FASTQs
for fname in *.gz; do
echo "$fname has $((`zcat $fname | wc -l` / 4)) sequences"
done
quotes matter
We saw how double quotes allow the shell to evaluate certain metacharacters in the quoted text.
But more importantly when assigning multiple lines of text to a variable, quoting the evaluated
variable preserves any special characters in the variable value's text such as Tab or newline characters.
Consider this case where a captured string contains newlines, as illustrated below.
txt=$( echo -e "aa\nbb\ncc" )
echo "$txt" # inside double quotes, newlines preserved
echo $txt # without double quotes, newlines are converted to spaces
the if statement
The general form of an if/then/else statement in bash is:
if [ <test expression> ]
then <expression> [ expression... ]
else <expression> [ expression... ]
fi
where the spaces around the brackets in [ <test expression> ] are required, and the else branch is optional.
A simple example:
for val in 5 0 "27" "$emptyvar" abc '0'; do
if [ "$val" ]
then echo "Value '$val' is true"
else echo "Value '$val' is false"
fi
done
If the input data is well structured, its fields can be read directly into variables. Notice that we can pipe all the
output to more – or redirect it to a file.
tail /etc/passwd | while IFS=':' read account x uid gid name shell
do
echo $account $name
done | more
The owner is typically also a member of the Unix group associated with a file, and other accounts may also be
members of the same group. G-801021 is one of the Unix groups I belong to at TACC. To see the Unix
groups you belong to, just type the groups command.
Permissions
File permissions and information about the file type are encoded in the first 10-character field of ls -l output.
Permissions govern who can access a file, and what actions they are allowed to perform.
• character 1 describes the file type (d for directory, - for regular file, l for symbolic link)
• the remaining 9 characters are 3 sets of 3-character designations
o characters 2-4: what the owning user account can do
o characters 5-7: what other members of the associated Unix group can do
o characters 8-10: what other non-group members (everyone else) can do
Each of the 3-character sets describes whether read ( r ), write ( w ), and execute ( x or s ) actions are allowed
or not allowed ( - ).
Examples:
ls -l ~/.bash_history   # a regular file in your home directory
ls -l /usr/bin/ls       # an executable program
ls -ld docs             # a directory (-d lists the directory itself rather than its contents)
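To see how to read one of these permission strings, here is a made-up listing line (the account name, file name
and size are hypothetical; G-801021 is the group mentioned above):
-rwxr-xr-x 1 student G-801021 14214 Aug 19 2021 myscript
The leading - marks a regular file; the first rwx says the owning user can read, write and execute it; the middle
r-x says members of group G-801021 can read and execute but not write; the final r-x says everyone else can
also read and execute but not write.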