Pipingfile

The document discusses piping in Linux and how it allows connecting the output of one program to the input of another. It provides examples of using piping with commands like zcat, wc, grep, egrep and fgrep. It also briefly covers the tr and paste commands.

Uploaded by

apiit.sachin12

Piping

Most programs/commands read input data from some source, then write output to some destination. A
data source can be a file, but can also be standard input. Similarly, a data destination can be a file but can
also be a stream such as standard output.

The power of the Linux command line is due in no small part to the power of piping. The pipe operator ( | )
connects one program's standard output to the next program's standard input.

A simple example is piping uncompressed data "on the fly" to count its lines using wc -l (the word count command with the lines option).

Pipe uncompressed output to a line counter

# zcat is like cat, except that it understands the gz compressed format
# and uncompresses the data before writing it to standard output.
# So, like cat, you need to be sure to pipe the output to a pager if
# the file is large.
zcat big.fq.gz | wc -l

Difference Between grep, egrep and fgrep in Linux

grep is one of the most renowned search tools on Unix-like systems; it can search for anything, whether a file, a line, or multiple lines within a file. Its functionality is very broad, thanks to the large number of options it supports: searching with string patterns, regular expression patterns, Perl-based regular expressions, and so on.

Because of these varying functionalities, it has many variants, including grep, egrep (Extended grep), fgrep (Fixed grep), pgrep (Process grep), rgrep (Recursive grep), etc. These variants have minor differences from the original grep, which has made them popular with Linux programmers for specific tasks.
Some Special Meta-Characters of grep
1. + – matches one or more occurrences of the previous character.
2. ? – matches zero or one occurrence of the previous character. For example: 'a?' would match the empty string or 'a'.
3. ( – start of an alternation expression.
4. ) – end of an alternation expression.
5. | – matches either of the expressions separated by '|'. For example: '(a|b)cde' would match either 'acde' or 'bcde'.
6. { – start of a range (repetition) specifier. For example: 'a{2}' matches 'aa', i.e. 'a' exactly 2 times.
7. } – end of a range (repetition) specifier.
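As a sketch of these meta-characters in action (shown in their extended form via grep -E; the sample file and its contents are hypothetical):

```shell
# Build a small sample file (hypothetical contents, for illustration only)
printf 'color\ncolour\nabcde\nbcde\naaab\n' > sample.txt

# ? : zero or one of the previous character -- matches "color" and "colour"
grep -E 'colou?r' sample.txt

# + : one or more of the previous character -- matches "aaab"
grep -E 'a+b' sample.txt

# (x|y) : alternation -- matches "abcde" and "bcde"
grep -E '^(a|b)cde' sample.txt

# {n} : range specifier -- 'a{3}' matches three a's in a row, as in "aaab"
grep -E 'a{3}' sample.txt
```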
Grep Command

grep or Global Regular Expression Print is the main search program on Unix-like
systems which can search for any type of string on any file or list of files or even output of
any command.

It uses Basic Regular Expressions in addition to plain strings as search patterns. In Basic
Regular Expressions (BRE), meta-characters like '{', '}', '(', ')', '|', '+', '?' lose
their special meaning and are treated as ordinary characters of the string; they need to be
escaped if they are to be treated as special characters.

$ grep -C 0 '(f|g)ile' check_file

$ grep -C 0 '\(f\|g\)ile' check_file

Here, when the command is run without escaping '(', ')' and '|', grep searches for the
complete literal string "(f|g)ile" in the file. But when those special characters are escaped,
grep treats them as meta-characters instead of as part of the string, and
searches for the words "file" or "gile" in the file.
Egrep Command

egrep, or grep -E, is another version of grep: the Extended grep. This version is
efficient and fast when searching for a regular expression pattern, because it treats
meta-characters as meta-characters by default rather than as literal strings the way grep
does, so you are freed from the burden of escaping them. It uses the Extended Regular
Expression (ERE) set.

With egrep, even if you do not escape the meta-characters, it treats them as
special characters and applies their special meaning instead of treating them as
part of the string.

$ egrep -C 0 '(f|g)ile' check_file

$ egrep -C 0 '\(f\|g\)ile' check_file


Here, egrep searched for "file" or "gile" when the meta-characters were not escaped,
applying the special meaning of those characters. But when these characters were
escaped, egrep treated them as part of the string and searched for the complete
string "(f|g)ile" in the file.

fgrep Command

fgrep, or Fixed grep, or grep -F, is yet another version of grep, and it is fast when
searching for an entire fixed string rather than a regular expression, because it does not
recognize regular expressions or meta-characters at all. For searching for a literal
string, this is the version of grep to choose.

fgrep searches for the complete string and never treats special characters as part of a
regular expression, whether or not they are escaped.

$ fgrep -C 0 '(f|g)ile' check_file

$ fgrep -C 0 '\(f\|g\)ile' check_file

TR
The tr command is a UNIX command-line utility for translating or deleting
characters. It supports a range of transformations including uppercase to
lowercase, squeezing repeating characters, deleting specific characters, and basic
find and replace. It can be used with UNIX pipes to support more complex
translation. tr stands for translate.
Syntax:
$ tr [OPTION] SET1 [SET2]

Options:
-c : complement the set of characters in SET1, i.e. operations apply to characters not in the given set
-d : delete the characters in SET1 from the output
-s : replace repeated characters listed in SET1 with a single occurrence
-t : truncate SET1 to the length of SET2

$ cat greekfile | tr 'a-z' 'A-Z'
$ cat greekfile | tr '[:lower:]' '[:upper:]'
$ tr '{}' '()' < greekfile > newfile.txt
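A few more sketches of tr's -d, -s and -c options (the input strings here are made up):

```shell
# -d : delete all digits from the input
echo "my phone is 555-1234" | tr -d '0-9'     # -> my phone is -

# -s : squeeze runs of repeated spaces down to a single space
echo "too    many     spaces" | tr -s ' '     # -> too many spaces

# -c : complement -- here, delete everything that is NOT a lowercase
# letter or a newline
echo "abc123def" | tr -cd 'a-z\n'             # -> abcdef
```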

The paste command is one of the most useful commands in the Unix or Linux operating system.
It joins files horizontally (parallel merging) by outputting lines consisting of the
corresponding lines from each file specified, separated by a Tab delimiter, to standard output.
When no file is specified, or when a dash ("-") is given instead of a file name, paste reads
from standard input and echoes it until an interrupt [Ctrl-C] is given.

Syntax:
paste [OPTION]... [FILES]...

$ cat number
1
2
3
4
5

$ cat state
Arunachal Pradesh
Assam
Andhra Pradesh
Bihar
Chhattisgrah

$ cat capital
Itanagar
Dispur
Hyderabad
Patna
Raipur
$ paste number state capital
1 Arunachal Pradesh Itanagar
2 Assam Dispur
3 Andhra Pradesh Hyderabad
4 Bihar Patna
5 Chhattisgrah Raipur
When only one delimiter character is specified:
$ paste -d "|" number state capital
1|Arunachal Pradesh|Itanagar
2|Assam|Dispur
3|Andhra Pradesh|Hyderabad
4|Bihar|Patna
5|Chhattisgrah|Raipur

When more than one delimiter character is specified, paste uses them in rotation:


$ paste -d "|," number state capital
1|Arunachal Pradesh,Itanagar
2|Assam,Dispur
3|Andhra Pradesh,Hyderabad
4|Bihar,Patna
5|Chhattisgrah,Raipur
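paste also has a -s (serial) option, which merges the lines of a single file into one line. A small sketch, using a five-line file of numbers:

```shell
# Create a small file of line numbers (one per line)
printf '1\n2\n3\n4\n5\n' > number

# -s merges all lines of the file into a single line, placing the
# -d delimiter between them (serial paste)
paste -s -d ',' number    # -> 1,2,3,4,5
```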

Advanced commands
cut, sort, uniq

• cut command lets you isolate ranges of data from its input lines
o cut -f <field_number(s)> extracts one or more fields (-f) from each line of its input
▪ use -d <delim> to change the field delimiter (Tab by default)
o cut -c <character_number(s)> extracts one or more characters (-c) from each line of input
o the <numbers> can be
▪ a comma-separated list of numbers (e.g. 1,4,7)
▪ a hyphen-separated range (e.g. 2-5)
▪ a trailing hyphen says "and all items after that" (e.g. 3,7-)
o cut does not re-order fields, so cut -f 5,3,1 acts like -f 1,3,5

• sort sorts its input lines using an efficient algorithm


o by default sorts each line lexically (as strings), low to high
▪ use -n to sort numerically
▪ use -V for Version sort (numbers with surrounding text)
▪ use -r to reverse the sort order
o use one or more -k <start_field_number>,<end_field_number> options to specify a range of "keys" (fields)
to sort on
▪ e.g. -k1,1 -k2,2nr to sort field 1 lexically and field 2 as a number high-to-low
▪ by default, fields are delimited by whitespace -- one or more spaces or Tabs
• use -t <delim> to change the field delimiter (e.g. -t "\t" for Tab only; ignore spaces)

• uniq -c counts groupings of its input (which must be sorted) and reports the text and count for each group
o use cut | sort | uniq -c for a quick-and-dirty histogram
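Putting the three together, a sketch of the quick-and-dirty histogram idiom (the choice of /etc/passwd field 7, the login shell, is just an example):

```shell
# Count how many accounts use each login shell:
# cut isolates field 7, sort groups identical values together
# (uniq -c requires sorted input), uniq -c counts each group,
# and the final sort -nr orders the counts high-to-low.
cut -d ':' -f 7 /etc/passwd | sort | uniq -c | sort -nr

# The same idiom on inline data:
printf 'bash\nzsh\nbash\nbash\n' | sort | uniq -c | sort -nr
```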

awk
awk is a powerful scripting language that is easily invoked from the command line. Its field-oriented
capabilities make it the go-to tool for manipulating table-like delimited lines of text.

• awk '<script>' - the '<script>' is applied to each line of input (generally piped in)
• always enclose '<script>' in single quotes to inhibit shell evaluation, because awk has its own set
of metacharacters that are different from the shell's

Example that prints the average of its input numbers (echo -e converts backslash escape
characters like newline \n to the ASCII newline character so that the numbers appear on separate lines)
echo -e "1\n2\n3\n4\n5" | awk '
BEGIN{sum=0; ct=0}
{ sum = sum + $1
ct = ct + 1 }
END{print sum/ct}'

General structure of an awk script:

• BEGIN {<expressions>} – use to initialize variables before any script body lines are executed
o e.g. BEGIN {FS=":"; OFS="\t"; sum=0; ct=0}
▪ says use colon ( : ) as the input field separator (FS), and Tab ( \t ) as the output field separator (OFS)
• the default input field separator (FS) is whitespace
o one or more spaces or Tabs
• the default output field separator (OFS) is a single space
▪ initializes the variables sum and ct to 0
• {<body expressions>} – expressions to apply to each line of input
o use $1, $2, etc. to pick out specific input fields of each line
▪ e.g. {sum = sum + $4} adds field 4 of the input to the variable sum
o the built-in variable NF is the number of fields in the current line
o the built-in variable NR is the record (line) number of the current line
• END {<expressions>} – executed after all input is complete
o e.g. END {print sum,ct} prints the final value of the sum and ct variables, separated by the output field
separator.
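A small sketch tying BEGIN, the body, and END together with the built-in NR and NF variables (the colon-delimited input lines are invented):

```shell
# For each colon-delimited input line, print its line number (NR)
# and its field count (NF), Tab-separated; then print a summary.
printf 'a:b:c\nd:e\n' | awk '
BEGIN { FS=":"; OFS="\t" }        # colon input fields, Tab output fields
{ print NR, NF }                  # runs once per input line
END { print "total lines:", NR }  # runs after all input is read
'
```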


cut versus awk

The basic functions of cut and awk are similar – both are field oriented. Here are the main differences:

• Default field separators


o Tab is the default field separator for cut
o whitespace (one or more spaces or Tabs) is the default field separator for awk
• Re-ordering
o cut cannot re-order fields
o awk can re-order fields, based on the order you specify
• awk is a full-featured programming language while cut is just a single-purpose utility.
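A quick sketch of the re-ordering difference (the Tab-separated input line is invented):

```shell
# cut keeps the original field order no matter how -f is written
printf 'one\ttwo\tthree\n' | cut -f 3,1                     # -> one   three

# awk prints fields in exactly the order you name them
printf 'one\ttwo\tthree\n' | awk -F '\t' '{print $3, $1}'   # -> three one
```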

grep and regular expressions

• grep -P '<pattern>' searches for <pattern> in its input, and only outputs lines containing it
o always enclose '<pattern>' in single quotes to inhibit shell evaluation!
▪ pattern-matching metacharacters in grep are very different from those in the shell
o -P says to use Perl patterns, which are much more powerful (and standard) than default grep patterns
o -v (inverse match) – only print lines with no match
o -n (line number) – prefix output with the line number of the match
o -i (case insensitive) – ignore case when matching
o -l says return only the names of files that do contain the pattern match
o -L says return only the names of files that do not contain the pattern match
o -c says just return a count of line matches
o -A <n> (After) and -B <n> (Before) – output <n> number of lines after or before a match
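A few of these options sketched against a tiny hypothetical file:

```shell
# Hypothetical sample file
printf 'apple\nBanana\ncherry\nbanana\n' > fruit.txt

grep -i 'banana' fruit.txt    # -i: matches Banana and banana
grep -v 'banana' fruit.txt    # -v: lines NOT containing "banana" (case-sensitive)
grep -c 'an' fruit.txt        # -c: count of lines that match
grep -n 'cherry' fruit.txt    # -n: prefix the match with its line number
```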

A regular expression (regex) is a pattern of literal characters to search for and metacharacters that control
and modify how matching is done.

A regex <pattern> can contain special match metacharacters and modifiers. The ones below
are Perl metacharacters, which are the "gold standard", supported by most languages (e.g. grep -P)

• ^ – matches beginning of line


• $ – matches end of line
• . – (period) matches any single character
• * – modifier; place after an expression to match 0 or more occurrences
• + – modifier, place after an expression to match 1 or more occurrences
• ? – modifier, place after an expression to match 0 or 1 occurrences
• \s – matches any whitespace character (\S any non-whitespace)
• \d – matches digits 0-9
• \w – matches any word character: A-Z, a-z, 0-9 and _ (underscore)
• \t matches Tab; \n matches linefeed; \r matches carriage return
• [xyz123] – matches any single character (including special characters) among those listed between the
brackets [ ]
o this is called a character class.
o use [^xyz123] to match any single character not listed in the class
• (Xyz|Abc) – matches either Xyz or Abc or any text or expressions inside parentheses separated by | characters
o note that parentheses ( ) may also be used to capture matched sub-expressions for later use
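A few sketches of these Perl metacharacters with grep -P (available in GNU grep when built with PCRE support; the file contents here are invented):

```shell
# Hypothetical Tab-delimited sample
printf 'chr1\t100\nchrX\t250\nnote line\n' > regions.txt

grep -P '^chr\d' regions.txt    # ^ anchors at line start, \d is a digit -> the chr1 line
grep -P '\t\d+$' regions.txt    # a Tab then one or more digits at end of line
grep -P 'chr[1X]' regions.txt   # character class: matches chr1 or chrX
```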

Regular expression modules are available in nearly every programming language


(Perl, Python, Java, PHP, awk, even R)

• each "flavor" is slightly different


• even the bash environment offers multiple regex-aware search commands: grep, egrep, fgrep.

Field delimiter summary


Be aware of the default field delimiter for the various bash utilities, and how to change them:

Utility: cut
• default delimiter: Tab
• how to change: -d or --delimiter option
• example: cut -d ':' -f 1 /etc/passwd

Utility: sort
• default delimiter: whitespace (one or more spaces or Tabs)
• how to change: -t or --field-separator option
• example: sort -t ':' -k1,1 /etc/passwd

Utility: awk
• default delimiter: whitespace (one or more spaces or Tabs)
o Note: some older versions of awk do not treat Tabs as field separators.
• how to change:
o in the BEGIN { } block: FS= (input field separator) and OFS= (output field separator)
o the -F or --field-separator option
• examples:
cat /etc/fstab | grep -v '^#' | awk 'BEGIN{OFS="\t"}{print $2,$1}'
cat /etc/passwd | awk -F ":" '{print $1}'

Quoting in the shell


What the different quote marks mean in the shell, and when to use them, can be quite confusing.

When the shell processes a command line, it first parses the text into tokens ("words"), which are groups of
characters separated by whitespace (one or more space characters). Quoting affects how this parsing happens,
including how metacharacters are treated and how text is grouped.

There are three types of quoting in the shell:


1. single quoting (e.g. 'some text') – this serves two purposes
• It groups together all text inside the quotes into a single token
• It tells the shell not to "look inside" the quotes to perform any evaluations
o all metacharacters inside the single quotes are ignored
o in particular, any environment variables in single-quoted text are not evaluated
2. double quoting (e.g. "some text") – also serves two purposes
• it groups together all text inside the quotes into a single token
• it allows environment variable evaluation, but inhibits some metacharacters
o e.g. asterisk ( * ) pathname globbing and some other metacharacters
• double quoting also preserves any special characters in the text
o e.g. newlines (\n) or Tabs (\t)
3. backtick quoting (e.g. `date`)
• evaluates the expression inside the backtick marks ( ` ` )
• the standard output of the expression replaces the text inside the backtick marks ( ` ` )
• the syntax $( date ) is equivalent

The quote characters themselves ( ' " ` ) are metacharacters that tell the shell to "start a quoting process"
then "end a quoting process" when the matching quote is found. Since they are part of the processing,
the enclosing quotes are not included in the output.
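A brief sketch of backtick evaluation and its $( ) equivalent:

```shell
# The standard output of the inner command replaces the backtick expression
echo "Today is `date`"

# $( ) does the same thing and nests more cleanly (generally preferred)
echo "Today is $(date)"

# The substituted output is often captured in a variable
count=$(ls /etc | wc -l)
echo "/etc contains $count entries"
```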

single and double quotes

The first rule of quoting is: always enclose a command argument in quotes if it contains spaces so that the
command sees the quoted text as one item. In particular, always use single ( ' ) or double ( " ) quotes when
you define an environment variable whose value contains spaces.
foo='Hello world' # correct - defines variable "foo" to have value "Hello world"
foo=Hello world # error - no command called "world"

These two expressions using double quotes or single quotes are different because the single quotes tell the
shell to treat the quoted text as a literal, and not to look inside it for metacharacter processing.
# Inside double quotes, the text "$USER" is evaluated and its value substituted
echo "my account name is $USER"

# Inside single quotes, the text "$USER" is left as-is


echo 'the environment variable storing my account name is $USER'
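The points below refer to a basic for loop; a minimal sketch using `seq 4` and a loop variable named num:

```shell
# Loop over the numbers 1 2 3 4 generated by `seq 4`
for num in `seq 4`; do
    echo "the value of num is $num"
done
```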

• The `seq 4` expression uses backtick evaluation to generate a set of 4 numbers: 1 2 3 4.


• The do/done block expressions are executed once for each of the items in the list
• Each time through the loop (the do/done block) the variable named num is assigned one of the values in the
list
o Then the value can be used by referencing the variable using $num
o The variable name num is arbitrary – it can be any name we choose

processing multiple files in a for loop

One common use of for loops is to process multiple files, where the set of files to process is obtained by
pathname wildcarding. For example, the code below counts the number of reads in a set of
compressed FASTQ files:
For loop to count sequences in multiple FASTQs
for fname in *.gz; do
echo "$fname has $((`zcat "$fname" | wc -l` / 4)) sequences"
done

quotes matter

We saw how double quotes allow the shell to evaluate certain metacharacters in the quoted text.

But more importantly when assigning multiple lines of text to a variable, quoting the evaluated
variable preserves any special characters in the variable value's text such as Tab or newline characters.

Consider this case where a captured string contains newlines, as illustrated below.
txt=$( echo -e "aa\nbb\ncc" )
echo "$txt" # inside double quotes, newlines preserved
echo $txt # without double quotes, newlines are converted to spaces

the if statement
The general form of an if/then/else statement in bash is:

if [ <test expression> ]
then <expression> [ expression... ]
else <expression> [ expression... ]
fi

Where

• The <test expression> is any expression that evaluates to true or false


o With the string test [ <value> ], an empty value is false and any non-empty string is true
o Note that the string "0" is non-empty, so [ "0" ] is true; in arithmetic (( )) contexts, however, the number 0 is false and anything else is true
o There must be at least one space around the <test expression> separating it from the enclosing bracket [ ].
o Double brackets [[ ]] can also be used to enclose the <test expression>
• When the <test expression> is true the then expressions are evaluated.
• When the <test expression> is false the else expressions are evaluated.

A simple example:
for val in 5 0 "27" "$emptyvar" abc '0'; do
if [ "$val" ]
then echo "Value '$val' is true"
else echo "Value '$val' is false"
fi
done

reading file lines with while


The read function can be used to read input one line at a time, in a bash while loop.

While the full details of the read command are complicated, a typical line-reading loop looks like this:


while IFS= read line; do
echo "Line: '$line'"
done < ~/.bashrc
• The IFS= clears all of read's default input field separators, which is normally whitespace (one or
more spaces or Tabs).
o This is needed so that read will set the line variable to exactly the contents of the input line, and not strip
leading whitespace from it.
• The lines are redirected from ~/.bashrc to the standard input of the while loop by the < ~/.bashrc expression
after the done keyword.

If the input data is well structured, its fields can be read directly into variables. Notice we can pipe all the output
to more – or could redirect it to a file.
tail /etc/passwd | while IFS=':' read account x uid gid name shell
do
echo $account $name
done | more

Owner and Group


A file's owner is the Unix account that created the file (here abattenh, me). That account belongs to one or
more Unix groups, and the group associated with a file is listed in field 4.

The owner will always be a member of the Unix group associated with a file, and other accounts may also be
members of the same group. G-801021 is one of the Unix groups I belong to at TACC. To see the Unix
groups you belong to, just type the groups command.

Permissions
File permissions and information about the file type are encoded in that 1st 10-character field. Permissions
govern who can access a file, and what actions they are allowed.

• character 1 describes the file type (d for directory, - for regular file, l for symbolic link)
• the remaining 9 characters are 3 sets of 3-character designations
o characters 2-4: what the owning user account can do
o characters 5-7: what other members of the associated Unix group can do
o characters 8-10: what other non-group members (everyone) can do

Each of the 3-character sets describes if read ( r ) write ( w ) and execute ( x or s ) actions are allowed, or not
allowed ( - ).

• read ( r ) access means file contents can be read, and copied


• write ( w ) access means a file's contents can be changed, and directory contents can be modified (files added
or deleted)
• execute ( x or s )
o for files, execute ( x ) means it is a program that can be called/executed
▪ e.g. /usr/bin/ls, the file that performs the ls command
o for directories, execute ( x ) means directory operations may be performed/executed
▪ the directory can be listed and changed into

Examples:
ls -l haiku.txt

haiku.txt description

• dash ( - ) in position one signifies this is a regular file


• rw- for owner allows read and write access
• r-- for group permits only read access
• --- for everyone means no access allowed

ls -l /usr/bin/ls

/usr/bin/ls description

• /usr/bin/ls is the program that performs the ls command


o root (the master admin account) is the owner, in the root group
• dash ( - ) in position one signifies this is a regular file
• rwx for owner allows read, write and execute
• r-x for group permits read and execute
• r-x for everyone permits read and execute

ls -l -d ~/local (-d says to list directory information, not directory contents)

~/local description

• d in position one signifies this is a directory


• rwx for owner allows read, write and "execute" (list for directories)
• r-x for group permits read and "execute" (list)
• --- for everyone means no access allowed
