Introduction To Bio-Linux: January 27, 2010
Table of Contents
INTRODUCTION TO BIO-LINUX
Part One: Introduction to the Bio-Linux System
You can log into your Bio-Linux machine locally or remotely, on an installed system or on a system running Live
from a USB memory stick or a DVD.
These course notes are written from the perspective of someone running the Live version of the system. The key
difference for people working on an installed system will be the name of the account you are logged into.
Please refer to our online document about various ways you can access Bio-Linux systems:
http://nebc.nox.ac.uk/tools/bio-linux/other-bl-docs/installoptionsleaflet
If you are booting the machine from a DVD or a USB memory stick, then choose
Option 1: Try Bio-Linux 5 without any change to your computer
After the system has started up, you will see the Bio-Linux desktop (Figure 1).
Key icons on the desktop
On the top of the screen you will see the default top taskbar (Figure 2).
Figure 2: The default top Bio-Linux taskbar, usually found at the top of the screen.
The three menus on the left of the Bio-Linux taskbar, from left to right are:
The default icons on the left side of the taskbar, from left to right, are:
● Evolution Mail (to read and write emails)
● Help (goes to the Ubuntu help documents)
● Terminal (starts up a terminal window)
On the right side of the taskbar, from left to right, you have:
● The Network Monitor icon - allows you to enter network settings;
● The System Clock - you can also click on this to open a calendar;
● The Power button - once pressed, this will pop up a menu with options to:
○ Log off
○ Switch user
○ Lock the screen
○ Power down the computer
Figure 3: Bioinformatics applications
The System menu gives you access to functions that
allow you to customise and administer your system
(Figure 5).
Figure 6: The bottom Bio-Linux taskbar. Left to right: Show Desktop Button, Three open windows, Four Virtual
Desktops, Deleted Items Folder
The four squares on the right hand side of the taskbar (three grey and one blue in Figure 6) represent “virtual
desktops”. Unlike Windows, Linux gives you access to multiple desktops. This allows you to have windows open
for different things in different virtual desktops. For example, if you were working on writing an article, you
could have programs relevant to that work open and visible via one of these desktops. Meanwhile, you could have
programs related to sequence analysis open on another desktop, and so on. This is a great tool for keeping things
organised during your working day.
The blue box represents the virtual desktop you are currently working in. Clicking on a grey box takes you to that
virtual desktop (turning that box blue).
In Figure 6, we are looking at the first of four virtual desktops, and this desktop has three windows open: two file
browsers and a Mozilla Firefox web browser.
Pressing the Show Desktop Button (the small square icon on the left of the taskbar) allows you to hide all the
open windows, giving you a clear view of the Desktop. Press it again to get all the windows back up.
The Deleted Items Folder icon (also commonly referred to as a Recycling bin or Trash can) is at the far right hand
side of this taskbar. This is where files deleted using graphical tools usually end up. This gives you a chance to
salvage them if you deleted them by mistake. Deleting files on the system is covered in more detail in the
Removing Files and Directories section of this tutorial.
Exercise 1-1
Take some time to explore the desktop. Look at the options under each of the menus covered in the previous
section. Try clicking the icons on the desktop. Also try using the right and middle mouse buttons when the mouse
pointer is over a blank area of the desktop and explore the menus presented to you.
Try going to a different virtual desktop and starting up some windows/applications there.
The sample files referred to in this tutorial can be downloaded in a compressed package from
http://nebc.nox.ac.uk/courses/Bio-Linux/bioinf_files.tar.gz
There are various ways you can do this. We offer two options here. Please choose the one you are most
comfortable with.
wget http://nebc.nox.ac.uk/courses/Bio-Linux/bioinf_files.tar.gz
● Open the Firefox web browser by clicking the Firefox icon on the top task bar
● Type the address given above into the web browser URL address field.
● Save the download file to your home directory: /home/live.
The file you just downloaded is referred to as a compressed tarball. Tar is a utility similar to zip; it allows you to
make a package of files. An additional utility, known as gzip has been used to compress the tarball.
You can use graphical or command line tools to extract the files from this compressed tarball. Choose whichever
of the following two options you are most comfortable with.
This command uncompresses and unpacks the tar file. The x means “unpack/extract”, the v means “in a verbose
fashion, reporting to the screen the list of files being unpacked”, the f means that the filename you are giving is
the tar file that will be unpacked (here, the tar file is in the same directory you are working in), and the z means
that the file is compressed and should be uncompressed.
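The flags described above can be tried safely on a throwaway archive. The following is only a sketch: the names demo_files and seq1.txt are made up for the illustration and are not part of the course files, and the archive is built in a temporary directory so nothing in your account is touched.

```shell
# Work in a temporary directory so nothing in your account is touched.
workdir=$(mktemp -d)
cd "$workdir"

# Build a small compressed tarball to practise on (hypothetical file names).
mkdir demo_files
echo "ACGT" > demo_files/seq1.txt
tar -czf demo_files.tar.gz demo_files   # c = create, z = compress with gzip, f = archive name
rm -r demo_files

# x = extract, v = verbose, z = uncompress with gzip, f = the archive file name follows
tar -xvzf demo_files.tar.gz

# The unpacked file is back in place.
cat demo_files/seq1.txt
```

Because the f flag takes the archive name as its argument, it is written last in the group of flags here.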
● Choose the option Home Folder from under the Places menu in the top taskbar.
● Click the right mouse button over the bioinf_files.tar.gz file and move the cursor down to Extract
Here (Figure 7).
● Choose to extract the file in its current location by clicking on Extract Here.
Figure 7: A graphical view of your home folder after clicking on the right mouse button over the
compressed bioinf_files.tar.gz file.
d) Removing the compressed tarball
The unpacked files that you will be working with in this tutorial are now in a directory called bioinf_files.
You can remove the compressed tar file now if you wish. Again, this can be done via the command line or using
graphical tools. Here we use the command line. More details about how to remove files from the system are
covered in the Removing Files and Directories part of this tutorial.
● Open a terminal window if you do not already have one open: click on the Terminal icon in the top toolbar.
rm bioinf_files.tar.gz
● Enter “y” to agree if you are asked if you wish to delete the file.
In Linux/Unix systems, documents are usually referred to as files, and file folders are referred to as directories.
Your Bio-Linux machine can be thought of as a huge file folder (directory), inside of which are many other file
folders (directories). Inside these there are more nested file folders (directories), and so on. As in the real world
where file folders can contain documents and other file folders, in Linux, directories can contain files and other
directories.
Your account is one directory within the set of directories that make up your Bio-Linux machine. In your account,
you can create other directories, store data, run programs, etc.
By default on a Bio-Linux machine, you have the right to create, delete and edit files in your own account, but not
in other people’s accounts. You can be given permission to work on files in such areas, but that topic is beyond
the scope of this course. Your system administrator or local IT support should be able to help you with this.
You can use command line or graphical tools to explore areas on the machine, including your home directory.
A graphical view of your home directory is available by clicking on the Home Folder option in the Places menu
of the top taskbar (Figure 8). This opens up a window, similar to Figure 7 above, that shows the files and
directories in your home directory.
You can access many areas and functions available outside your home directory through this interface also.
Exercise 1-2
● If you have not done so already, click on Home Folder in the Places menu of the top taskbar
● Click on the arrow symbol of the File System folder on the left panel. This allows you to see what other
directories are nested inside.
● Find the folder called home and click on the arrow symbol beside it.
● You should see your home directory listed. If you are working on an installed system, other user folders may
also be listed. A lock symbol on a folder informs you that you do not have permission to view the contents of that
folder.
The name of the base directory of the whole system, the one within which everything else on the system is
contained, is the root directory. It is referred to by a forward slash “/”.
When you opened your home directory in Exercise 1-2, the following information should have been shown next
to the Text button, which looks like a pencil and paper and is seen in the left of Figure 9. If you move the mouse
cursor over the Text button, you should see a message box containing information about what the button does.
This information tells you that you are in the directory called live, which is within the directory called home. The
directory home is under the root directory.
In other words, this information tells you where you are in the system.
The location of a file or directory within the system is its PATH. If you are asked for the full PATH to a file, you
need to provide a complete listing of all the directories traversed on the system to get to that file. That is, you need
to give the full path from the root directory to that file. You can get this information, in the format most programs
expect, by clicking on the Text button. See Figure 10.
Notice that when written out in text, all directory names are separated with a forward slash “/”.
Figure 10: Location in nautilus file browser Text View
As another example: the full PATH to the file capsall.fasta, in the bioinf_files directory within your home
directory is:
/home/live/bioinf_files/capsall.fasta
Sometimes you can provide just the route from where you are on the system to where your file is; this is referred
to as the relative path. For example, if you are working in your home directory, the relative path would be
bioinf_files/capsall.fasta
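The difference between a full path and a relative path can be seen in a small experiment. This is a sketch under stated assumptions: a temporary directory (the variable fakehome, a name invented here) stands in for your home directory, and an empty placeholder file takes the place of the real capsall.fasta.

```shell
# Hypothetical stand-in for a home directory, created in a temporary location.
fakehome=$(mktemp -d)
mkdir "$fakehome/bioinf_files"
echo ">seq1" > "$fakehome/bioinf_files/capsall.fasta"

cd "$fakehome"

# Relative path: interpreted starting from the current working directory.
ls bioinf_files/capsall.fasta

# Full path: starts at the root directory "/" and works from anywhere on the system.
cd /
ls "$fakehome/bioinf_files/capsall.fasta"
```

The relative form only works while you are in the directory it is relative to; the full form works regardless of where you are.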
If you threw all your paper documents into a filing cabinet without any organisation, it would rapidly become
difficult to find anything. This will also happen to your Bio-Linux account if you choose to keep all your
documents in your home directory instead of creating subdirectories and storing associated data together within
them.
A list of common Linux commands is provided in Appendix D of this document for reference.
Many programs and facilities are available through graphical options on Linux, but all programs and facilities can
be accessed by the command line. Some tasks are easier, or more appropriately done using graphical interfaces.
Equally though, other things are easier or more appropriately done using the command line. Obvious examples
include when you need to work with large numbers of files or want to automate processes.
gnome-terminal &
The & allows you to keep working in the original window by starting up the process, (in this case, a terminal),
and putting that process in the “background”. You can then continue to work in that same terminal. If you try
running the above command without the &, you will see that you cannot type anything into the “original”
terminal.
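The effect of the & can be tried without opening any windows. In this sketch, sleep 2 is simply a stand-in for a long-running program such as a terminal; the variable names bgpid and status are chosen for the illustration.

```shell
# "sleep 2" stands in for a long-running program such as a terminal window.
sleep 2 &
bgpid=$!            # $! holds the process ID of the most recent background job

# Control returns immediately, so this line runs while sleep is still going.
echo "still able to type commands"

# Wait for the background job to finish and record its exit status.
wait "$bgpid"
status=$?
echo "background job finished with status $status"
```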
Anatomy of a Command
The first item you supply on the command line is interpreted by the system as a command; that is – something the
system should do. Items that appear after that on the same line are separated by spaces. The additional input on
the command line indicates to the system how the command should work. For example, what file you want the
command to work on, or the format for the information that should be returned to you.
Most commands have options available that will alter the way they function. You make use of these options by
providing the command with parameters, some of which will take arguments. Examples in the following sections
should make it clear how this works. With some commands you don't need to issue any parameters or arguments.
Occasionally this is because there are none available, but usually this is because the command will use default
settings if nothing is specified.
If a command runs successfully, it will usually not report anything back to you, unless reporting to you was the
purpose of the command. If the command does not execute properly, you will often see an error message returned.
Whether or not the error is meaningful to you depends on your experience with Linux/Unix and how user-friendly
the errors generated were designed to be.
Note: Items supplied on the command line separated by spaces are interpreted as individual pieces of information
for the system. For this reason, a filename with a space in it will be interpreted as two filenames by default. This
is addressed in more detail later in the course.
Listing files in a directory
By default, the ls command will list the filenames of the files in your current working directory. At the moment, this
is probably your home directory.
If you add a space followed by a -l (that is, a hyphen and a small letter L), after the ls command, it alters the
behavior of the command: it will now list the files in your current directory, but with details about them including
who owns them, what the size is, and what kind of file it is. Information about this is shown in Figure 12.
Figure 12: The detailed output of the command ls when run with the -l flag
Exercise 1-3
a)
● Type the command ls. Compare what you see listed with what you see in the graphical representation of
your home directory.
● Type the command ls -l and note the kind of information being provided and how it compares to the
graphical representation of your files.
● In the Nautilus File Browser, click on the View as List button, on the far right, and compare this information
to that provided using the ls -l command.
● In the console, type ls -l bioinf_files. Click on the bioinf_files folder in Nautilus and compare what you are
seeing.
You can also use wildcards (shell pattern-matching characters) to limit the files you wish to list.
* an asterisk matches any number of characters, including none
? a question mark matches exactly one character
[] square brackets designate a group of characters, any one of which will match
More details about this are given in the Linux/Unix shorthand and shortcuts section below.
b)
● List all the files in the directory bioinf_files that start with the letters tes
ls bioinf_files/tes*
● List all the files in your directory that start with tes, and end in 1.embl, 2.embl or 3.embl
ls bioinf_files/tes*[1-3].embl
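To see how each pattern character behaves, it can help to try them on a set of empty files. The sketch below assumes made-up file names (test1.embl and so on) created in a temporary directory; they only imitate the course files.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Empty files with hypothetical names, just for pattern practice.
touch test1.embl test2.embl test3.embl test4.embl testseq.fasta other.txt

# * matches any run of characters, including none: everything starting with tes
ls tes*

# ? matches exactly one character: test1.embl to test4.embl, but not testseq.fasta
ls test?.embl

# [1-3] matches a single character in the range 1 to 3
ls tes*[1-3].embl

# Count how many names the last pattern matched.
matched=$(ls tes*[1-3].embl | wc -l | tr -d ' ')
echo "$matched files matched"
```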
Most Linux commands have a manual page that provides information about the command and options that can
alter its behaviour. Many tasks can be made easier by using command options. A good rule of thumb is to ask
yourself whether what you want to do is something many others may have wanted to do. If the answer is yes, then
there may well be commands and options available to do that task.
Linux manual pages are referred to as man pages. To open the man page for a particular command, you just need
to type man followed by the name of the command you are interested in. To browse through a man page, use the
cursor keys (↓ and ↑). To close the man page simply hit the q key on your keyboard.
If you do not know the name of a command to use for a particular job, you can search using man -k followed by
the type of thing you are trying to do. An example of this is in exercise 1-3, part c).
c)
● Look up the manual information for the ls command by typing the following in a terminal:
man ls
● Read through the man page. You can scroll up and down using the up and down arrow keys on your keyboard.
You can go forward a page by using the space bar, and move backwards a page by using the b key.
● What does the -m option do? What about the -a option? What would running ls -lrt do?
● Press the q key when you want to quit reading the man page.
● Look up some programs with man pages with the keywords “list directory”
man -k "list directory"
You could now look at the man pages for any Linux commands used in this tutorial to learn about what they do
and any options that could be useful to your work.
If you stick with letters, numbers, hyphens, underscores and full stops, you will be fine.
Filenames with spaces in them are a common problem when transferring files to Linux/Unix from computers
running Windows, or Mac operating systems.
If you end up with filenames with spaces in them, you will need to enclose the entire filename in quotation marks
so that Linux/Unix understands that the space is part of the name.
Alternatively, you can “escape” the space using a backslash. For example, if you have a file called
my document
then by default the system will interpret this as two separate names, my and document.
You could write either of the following to make it understand you mean a single file:
"my document"
my\ document
Our general advice is to change the name of such files to remove the space. A common practice is to replace the
space with an underscore. For example:
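A minimal sketch of such a rename, using the mv command in a temporary directory. The file is an empty placeholder created only for this illustration.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Create a file whose name contains a space (the quotes keep it as one name).
touch "my document"

# Rename it, replacing the space with an underscore.
mv "my document" my_document

# Only my_document is listed now.
ls
```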
Please note that there are some common naming conventions in place for biological data that you should try to
follow. More is said on this in the second part of this tutorial.
Understanding Unix commands can seem daunting at first. This is often due to particular characters (full stops,
question marks, etc.) having special meaning in commands. Once you learn the basics, these shorthand characters
are extremely useful.
The following incomplete list covers the symbols you will see most often today and describes their meanings as
you will most likely encounter them in this course.
ls cat??hat    list all files starting with the letters cat, followed by any 2 characters, and then hat
..             the directory one level above the one you are “currently in”.
Warning: some symbols have different meanings depending where they are used.
More Basic Linux Commands
A list of common Linux commands is provided in Appendix D of this document for reference.
Changing directories
If you think of your directory structure, (i.e. this set of nested file folders you are in), as a tree structure, then the
simplest directory change you can do is move into a directory directly above or below the one you are in.
To change to a directory one below the one you are in, just use the cd command followed by the subdirectory name:
cd subdir_name
To change directory to the one above the one you are in, use the shorthand for “the directory above” ..
cd ..
If you need to change directory to one far away on the system, you could explicitly state the full path:
cd /usr/local/bin
If you wish to return to your home directory at any time, just type cd by itself.
cd
cd -
This returns you to the last directory you were working in before this one.
If you get lost and want to confirm where you are in the directory structure, use the pwd command (print
working directory). This will return the full path of the directory you are currently in.
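The directory-changing commands above can be combined in a short practice session. This sketch assumes a made-up directory layout (projects/seqs) inside a temporary directory; the variable names start and here are invented for the illustration.

```shell
start=$(mktemp -d)
mkdir -p "$start/projects/seqs"   # hypothetical directory names

cd "$start/projects"
cd seqs          # move one level down
pwd              # prints the full path, ending in projects/seqs
cd ..            # move one level up
pwd              # the path now ends in projects
cd -             # jump back to the previous directory (seqs)
here=$(pwd)
echo "now in $here"
```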
Note also that by default in Bio-Linux, you see the name of the current directory you are working in as part of
your prompt.
For example, when you first opened the terminal you probably saw the prompt:
live@bio-linux[live]
This means you are logged in as the user live on the machine named bio-linux, and you are in a directory called
live.
If you then typed
cd bioinf_files
you would see the prompt:
live@bio-linux[bioinf_files]
Exercise 1-4
● Change directory from your home directory to the directory bioinf_files by typing
cd bioinf_files
pwd
cd
cd /usr/local/bin
cd
Tab completion
Tab completion is an incredibly useful facility for working on the command line.
One thing tab completion does is complete the filename or program name you want, saving huge amounts of
typing time.
cd bio
If there is only one directory with a name starting with the letters “bio”, the rest of the name will be completed for
you. Here this would give you:
cd bioinf_files
User accounts on Bio-Linux are set up such that if there is more than one file with that combination of letters, all
the files will be shown to you. You can choose the one you want by typing more of the filename, or by continuing
to hit the tab key.
Exercise 1-5
● Return to your home directory if you are not already there by typing cd
● Type cd bio and use tab completion for the rest of the command. Then press the return key.
● Type ls testseq and use tab completion. This will show you a list of files that start with testseq.
You now have the option of completing the filename yourself, or “tabbing” through the filenames available.
● Now press the tab key again. You can gradually add extra letters and use the tab key to limit the options
available.
As you get faster with this, it will save you a lot of typing effort.
Exercise 1-6
● Type a on the command line and then press the tab key.
● Add rte to the a so that you now have arte on the command line. Press the tab key again.
● You will see that there is only one command that starts with these letters: artemis
For programs whose names contain a mix of upper and lower case letters, tab completion can be especially useful.
● Type bl on the command line and press the tab key. You will see a number of program names listed.
● Keep pressing the tab key to see how the filenames will cycle through on the command line.
The default tab completion capabilities on Bio-Linux allow you to list the program names available on the
system by typing the first letter or letters of the name at the prompt. This works because the system knows
that the first item typed on the command line is a command (recall the Anatomy of a Command section
above).
Command history
Previous commands you have used are stored in your history. You can save a lot of typing by using your
command history effectively.
If you use the up arrow key when you are at the prompt in your terminal, you can see previous commands you
have run. This is particularly useful if you have mistyped something and want to edit the command without
writing the whole command out again.
You can also view past commands using the command history.
By default, history will return a list of the last 15 commands run. You can add a number as a parameter to the
command to ask for longer or shorter lists. For example, to return the last 30 commands run, you would type:
history -30
To re-run a command listed by the history command, you can just type the command number, preceded by an
exclamation mark. E.g. to run command number 12 returned to you, you can type:
!12
Exercise 1-7
● Run one of your previous commands using ! followed by the number of the command.
Making a directory
To make a new directory, use the command mkdir (make directory). For example:
mkdir newdir
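As a small illustration, the sketch below creates directories inside a temporary location. The nested names (results/run1/logs) are made up, and the -p option shown is a standard mkdir option that is not otherwise covered in these notes.

```shell
workdir=$(mktemp -d)
cd "$workdir"

mkdir newdir                 # create a single directory
mkdir -p results/run1/logs   # -p also creates any missing parent directories

# -d lists the directories themselves rather than their contents.
ls -d newdir results/run1/logs
```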
Exercise 1-8
The graphical view of your account in Nautilus will be automatically updated to show this new directory.
● Using information in the Linux/Unix shorthand and shortcuts section, try to move back into the
bioinf_files directory using a single command. (Answer 2 in Appendix A)
Office software
There are a number of word processors and spreadsheet programs available on your system. In this course we will
look primarily at the OpenOffice suite of programs. This is an open source alternative to Microsoft Office and can
be run on both Linux and Windows.
The programs within OpenOffice can be run graphically from the top taskbar: Applications → Office (Figure
13).
Notes on OpenOffice
OpenOffice programs are very much like Microsoft Office programs in their style and the options they
present.
OpenOffice programs can open Microsoft formatted files such as Word files (.doc), Excel
files (.xls) and PowerPoint files (.ppt) and save into those formats also.
OpenOffice will save in its own native formats by default if you create a new file, but you
can also save in Microsoft Office formats.
When saving a Microsoft format file, or choosing to save an OpenOffice file in a Microsoft format, you will
be presented with a message warning you that you may lose formatting or content. Your files will generally
be fine if they are documents with basic formatting like headings and simple tables, or spreadsheets without
macros.
If you have opened a Microsoft formatted file, the default is to save in that same Microsoft
format.
By default, the file will be saved in the same directory you opened the file from.
Exercise 1-9
● Make a few changes and save the file using the Save or Save As… options under the File menu.
● Close OpenOffice Calc by choosing Exit from under the File menu.
Plain text files are important for work on Linux systems, both as input to bioinformatics programs and as input or
configuration files for system programs. We highly recommend that you learn to use a text editor to prepare and
edit plain text files.
Documents written using a word processor such as Microsoft Word or OpenOffice Write are not plain text
documents. If your filename has an extension such as .doc or .odt, it is unlikely to be a plain text document.
(Try opening a Word document in notepad or another text editor on Windows if you want proof of this.)
Word processors are very useful for preparing documents, but we recommend you do not use them for
working with bioinformatics-related files.
We recommend that you prepare text files for bioinformatics analyses using Linux-based text editors and not
Windows- or Mac-based text editors. This is because Windows- or Mac-based text editors may insert
hidden characters that are not handled properly by Linux-based programs.
There are a number of different text editors available on Bio-Linux. These range in ease of use, and each has its
pros and cons. In this practical we will briefly look at two editors, nano and gedit.
Nano
Pros:
• very easy – for example, command options are visible at the bottom of the window
• can be used when logged in without graphical support
• fast to start up and use
Cons:
• by default it puts return characters into lines too long for the screen (i.e. using nano for system
administration can be dangerous!) This behaviour can be changed by setting different defaults for the
program or running it with the -w option.
• it is not completely intuitive for people who are used to graphical word processors
Gedit
Pros:
• very easy
• quite intuitive
• colouring schemes are available for programming syntax
• very similar to a word processor, but is in fact a powerful text editor
• has many plugins that are very easy to install and very useful for editing text files and for programming
Cons:
• it is a graphical program and cannot be run from a text-only environment
• it is slightly slower to start up than non-graphical editors
As most users will work on Bio-Linux using a graphical environment, we will only use gedit in the exercise for
this section.
Exercise 1-10
To start up gedit, you can use the command line, or an option under the Applications menu. Choose one of the
two methods to open gedit:
Command line
● Type gedit &
Graphical menu
● Follow Applications → Accessories → Text Editor on the top taskbar
● Save your file using the save option under the File menu or simply click the Save button on the Toolbar.
Save it as myfirstfile.txt in your testdir directory.
To save a file under the testdir directory, you may have to click on the drop down arrow to Browse for other
folders. This will expand this section into a File Browser like the one you've seen in past exercises. Simply
browse through to the location testdir is in and click the Save button.
● Add a new line to your file and save the file again using the Save As… option under the File menu. Save
this file as mysecondfile.txt in the testdir directory.
● Add more functionality to gedit by choosing the menu options; Edit → Preferences. A pop-up box will
appear with 4 tabs:
Seeing the line numbers in a file helps to keep track of your position in that file. We will enable line numbers
here.
● On the View tab enable Display line numbers. Now you can see the line numbers on the left.
● Next, click on the Plugins tab and enable the Change Case and the Document Statistics plugins. Browse
around the other plugins and see what functionality they provide.
● Try out the other newly added plugin, by selecting a piece of text from the document you are editing with
the mouse and click on the Edit menu. Hover the mouse over the Change Case menu and choose one of the
options you are presented with.
● Change part of one of the lines in this file and save it again using the Save As… option under the File menu.
This time save it as mythirdfile.txt in the testdir directory.
● Quit gedit by choosing the option Quit under the File menu.
There are many commands available for reading text files on Linux/Unix. These are useful when you want to look
at the contents of a file, but do not want to edit them. Among the most common of these commands are cat, more, and
less.
cat can be used for concatenating files and reading files into other programs; it is a very useful facility. However,
cat streams the entire contents of a file to your terminal and is thus not that useful for reading long files as the text
streams past too quickly to read.
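The two uses of cat mentioned above, concatenating files and feeding a file into another program, can be sketched as follows. The file names (a.fasta, b.fasta, both.fasta) and their contents are invented for the illustration, and wc is used here simply as an example of a program that reads the stream.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Two tiny placeholder sequence files.
printf '>seq1\nACGT\n' > a.fasta
printf '>seq2\nTTGA\n' > b.fasta

# Concatenate the two files into a third.
cat a.fasta b.fasta > both.fasta

# Read a file into another program: here wc -l counts its lines.
lines=$(cat both.fasta | wc -l | tr -d ' ')
echo "both.fasta has $lines lines"
```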
more and less are commands that show the contents of a file one page at a time. less has more functionality than
more. With both more and less, you can use the space bar to scroll down the page, and typing the letter q causes
the program to quit – returning you to your command line prompt.
Once you are reading a document with more or less, typing a forward slash / will start a prompt at the bottom of
the page, and you can then type in text that is searched for below the point in the document you were at. Typing in
a ? also searches for a text string you enter, but it searches in the document above the point you were at. Hitting
the n key during a search looks for the next instance of that text in the file.
With less (but not more), you can use the arrow keys to scroll up and down the page, and the b key to move back
up the document if you wish to.
Exercise 1-11
● Read the file hsy14768.embl using the commands cat, more and less.
Don’t forget that tab completion can save you typing effort.
cat hsy14768.embl
For reading files yourself, we recommend the command less. The command cat is more usually used in
conjunction with other commands when you wish to process text from within a file in some way.
There are many command line options available for each of the above commands, as well as
functionality we do not cover here. To read more about them, consult the manual pages:
man cat
man more
man less
Copying files
The basic command used to copy files using the command line is cp. At a minimum, you must also specify the
name of the file(s) to be copied and where you wish to copy the file(s) to.
•if you provide a directory name as the last argument to the command, the files will be copied into that directory
•if you provide a name that cannot be found from your current working directory as the last argument to the
command, it will be assumed that this is the filename you wish to use for the file copy.
•if you provide multiple filenames to cp, the final filename provided needs to be the name of a directory that
already exists – all the files will be copied into this directory.
Examples:
cp file1 file2 file3 location    copy file1, file2 and file3 to a directory called location
cp destdir/* location            copy all files in the directory called destdir to the directory called location
To copy whole directories, with all their files and subdirectories, use the -R option (meaning recursive).
cp -R somedir/mydir location     copy mydir and its contents into the directory called location
The Linux/Unix shorthand for “this directory right here” (a dot . ) comes in very handy when copying:
Make sure you leave a space between the directory name and the shorthand dot.
Also useful is the shorthand for someone’s home account. Instead of having to know and type the location of
their account, you can use ~username. In the case of your own account, you use just the ~ symbol, followed by a
/ if you want to specify any subdirectories in your account.
cp ~user2/somefile . copy the file somefile from user2’s home directory to my current working
directory. Note that you need the appropriate permissions to do this!
cp ~/somedir/mytext . copy the file or directory called mytext from within the somedir directory
under my home directory, to my current working directory.
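The copying rules above can be tried out safely in a scratch directory. The sketch below is for illustration only: the file and directory names (file1, location, somedir/mydir, workdir and so on) are made up for the example and are not part of the course data.

```shell
# Work in a throwaway directory so no real files are touched.
demo=$(mktemp -d)
cd "$demo"

# Set up some hypothetical files and directories to copy.
mkdir location somedir somedir/mydir workdir
touch file1 file2 file3 somedir/mydir/notes.txt

# Several sources: the last argument must be an existing directory.
cp file1 file2 file3 location

# A name that does not exist yet: the copy is given that name.
cp file1 file1_copy

# -R copies a directory together with everything inside it.
cp -R somedir/mydir location

# The dot means "this directory right here"; note the space before it.
cd workdir
cp ../file2 .

# Show the results.
cd "$demo"
ls location workdir
```

Everything is created under a temporary directory, so the whole experiment can be thrown away afterwards.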
Exercise 1-12
● Copy all the files that have the letters fil in them into the subdir directory. (Answer 5 in Appendix A)
● Copy all the files that start with the letters tes and end in .embl into the directory testdir. (Answer 6 in
Appendix A).
There are command line and graphical tools for deleting files. You should choose which to use on the basis of
what is convenient, and also on the basis of whether you wish to remove the file completely from the system
(command line), or whether you like to keep deleted files somewhere for a while, just in case you discover you
deleted the wrong thing (graphical).
Option 1: Using the command line (effect: deletes files from the system)
To remove a file or files, use the rm command followed by the name of the file(s) you wish to delete.
rm file1
rm file2 file3 file4
rm cat* remove all files starting with the letters cat
To remove an empty directory, use the rmdir command followed by the name of the directory.
rmdir thisdir
If that directory contains any files, you will not be able to delete the directory using rmdir until you have deleted all
the files within it.
To delete a directory and all the files in it at the same time, use the rm command with the -r or -R option
(recursive)
rm -r fulldir
If you use the above command, you will be prompted to confirm that you wish to delete each file. While
sometimes useful, this can be tedious. If you are certain that you want to delete all the files in that directory, as
well as the directory itself, then you can combine the recursive flag with the force (-f) flag
rm -rf fulldir
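The difference between rmdir and rm -r can be seen in a small experiment. This is a sketch only, run inside a throwaway directory; the names thisdir and fulldir simply mirror the examples above and the files inside them are made up.

```shell
# Work in a throwaway directory so nothing important can be deleted.
demo=$(mktemp -d)
cd "$demo"

# Hypothetical directories: one empty, one containing files.
mkdir thisdir fulldir
touch fulldir/a.txt fulldir/b.txt

# rmdir removes an empty directory.
rmdir thisdir

# On a directory that still contains files, rmdir refuses.
if ! rmdir fulldir 2>/dev/null; then
    echo "rmdir could not remove fulldir because it is not empty"
fi

# rm -rf removes the directory and its contents without prompting.
rm -rf fulldir

# Nothing should be left.
ls -A
```

Aliases such as rm -i are normally only active when you type commands interactively, so a script like this one is not prompted. This is one more reason to check the directory name carefully before using rm -rf.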
Option 2: Using a graphical tool (effect: moves files into the Deleted Items Folder)
If you view your files graphically, just find the file you wish to remove, right click on it and choose the Move to
the Deleted Items Folder option. Note that this file will not be removed from your system, only moved into a
folder allocated for files you no longer want.
Option 3: Using a graphical tool to completely remove files from the system
If you have a graphical view of your files, you can remove the file from the system permanently by finding the
file you wish to remove, highlighting it by clicking on it, pressing the Shift key on your keyboard and, while
keeping this key depressed, press the Delete key on your keyboard. A message box will pop up asking you to
confirm that you wish to permanently delete your file. Click on the Delete button in this window if you wish to.
Exercise 1-13
● Delete myfourthfile.txt using one of the graphical options (options 2 or 3 above). Did you choose to
permanently remove the file, or move it into the Deleted Items Folder?
● Delete myfirstfile.txt from testdir without moving to the testdir directory. (Answer 8 in Appendix A)
● Delete the entire subdir directory from within testdir without being prompted about the deletion of each file
individually. (Answer 9 in Appendix A)
On Bio-Linux the commands cp, mv and rm have been aliased to cp -i, mv -i and rm -i respectively.
This means the system will ask you if you really mean to overwrite files should the situation arise with cp or
mv, or delete the file you have just asked to delete when using rm. You must respond with a y or Y if you do
wish to proceed. Hitting any other key will cause the action you requested to be ignored.
You cannot assume that other Linux/Unix systems you work on have safety features like this configured.
Piping and outputting to files
An incredibly powerful facility on Linux/Unix systems is the ability to take the output of one command and use it
directly as the input to another command. This is referred to as “piping” the output of one command into another
command.
The symbol used for this is called a pipe and looks like: |
Many keyboards have the pipe symbol on the same key as the backslash symbol, at the bottom, right hand side of
the keyboard. Pressing the Shift key and the backslash key together will give you the pipe symbol.
On some keyboards, the pipe symbol is at the top left hand side, on the same key as the backtick. To type a pipe
symbol on such keyboards, hold down the key Alt Gr and hit the back tick ( ` ) key (left of the number 1 key).
An example of when you might use a pipe would be if you wanted to list all the files in a directory, but there are
too many to fit on a single page. This is probably what you saw when you listed the contents of /usr/local/bin in
Exercise 1-4.
You can pipe the output of the ls command (a list of files) into the less command, which will allow you to view
the list page by page. To list the files in /usr/local/bin and view them page by page, the command would be
written:
ls /usr/local/bin | less
A useful command is the wc command, which stands for wordcount. By default, wc returns the number of
newlines, words and bytes in a file (or in information given to it via a pipe). Using command line options, you can
get wc to return just the number of lines, just the number of words or just the number of bytes.
There are other options available for obtaining information from a file that can be found by reading the manual
page for wc.
For example, you could find out how many files you had in a directory by typing:
ls | wc -l
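To see what the different wc options return, you can try them on a small file whose contents you know. The sketch below builds a made-up file called words.txt (three lines, six words) in a temporary directory; the file name is an assumption for the example only.

```shell
# Build a small hypothetical file: three lines, six words.
demo=$(mktemp -d)
printf 'one two\nthree four\nfive six\n' > "$demo/words.txt"

# Default output: newlines, words and bytes, followed by the filename.
wc "$demo/words.txt"

# Individual counts: -l for lines, -w for words, -c for bytes.
# Reading from standard input means no filename is printed.
lines=$(wc -l < "$demo/words.txt")
words=$(wc -w < "$demo/words.txt")
bytes=$(wc -c < "$demo/words.txt")
echo "lines=$lines words=$words bytes=$bytes"
```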
Diff, Grep and Sort
In this section, we look briefly at three very useful commands: diff, grep and sort. As with all the commands
covered today, we recommend that you read the manual page for more information about how these work and
what options are available.
Diff
diff compares files line by line and reports the differences between the files. In fact, diff can be used for more
involved tasks as well, like comparing the contents of directories. This can be very useful when you are looking
for changes that you or someone else has made.
Exercise 1-14
In the above command the – refers to the information being given to diff from the pipe. That is, the information
resulting from the command cat myfourthfile.txt is put directly into the diff command. Obviously, in this
instance it would be easier just to give the name of the file, myfourthfile.txt, but there are many instances where
being able to use – to mean “what I am sending via a pipe” can be extremely useful.
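As a minimal sketch of this idea, the commands below compare two small made-up files, first.txt and second.txt (these names are assumptions for the example, not course files), once by naming both files and once by sending the second file through a pipe.

```shell
demo=$(mktemp -d)
cd "$demo"

# Two hypothetical files that differ only in their second line.
printf 'alpha\nbeta\ngamma\n' > first.txt
printf 'alpha\nBETA\ngamma\n' > second.txt

# Compare two named files directly.
direct=$(diff first.txt second.txt)

# The same comparison, with the second file arriving through a pipe.
# The - tells diff to read that input from the pipe.
piped=$(cat second.txt | diff first.txt -)

echo "$direct"
echo "$piped"
```

Both commands should report the same difference, because diff treats the piped text exactly as if it had been read from the named file.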
Grep
grep stands for global regular expression print; you use this command to search for text patterns in a file (or list).
You can also use flexible search terms, known as regular expressions, in your grep searches.
You have already used a similar kind of pattern in this practical. For example, when you listed all files with the pattern
tes*embl* you were using a shell wildcard pattern giving explicit characters (e.g. tes) and special symbols (*
meaning any character or characters). Regular expressions work in a similar way, but some symbols differ: in a
regular expression, . matches any single character and * means zero or more of the preceding character.
grep requires a regular expression as input, and returns all the lines containing that pattern to you as output.
grep is especially useful in combination with pipes as you can search through the results of other commands.
For example, perhaps you want to see only the information in an embl file relating to the origin of the
sequence, that is, the DE (description) line. You do not need to open the file, you can just cat it and grep for
lines beginning with DE.
Exercise 1-15
Read the manual page for ls if it is not clear what this command returns.
● Use the above command with a pipe and a grep command to search for files created or modified
today.
● List the files in the bioinf_files directory and use the grep command to look for those containing the
characters d4.
The second last command in the above exercise searches all the text in the hsy14768.embl file and returns the
lines in which it finds the letter D followed by the letter E.
The last command in the exercise returns only the lines in the file that have a letter D followed by a letter E, and
where DE is found at the beginning of a line. This is because the ^ symbol means “at the beginning of a line”.
The $ symbol can be used similarly to mean “at the end of a line”.
What this does in the example above is return to you just the organism information in the embl file. This is
because none of the other lines returned in the previous command started with DE, they just contained DE
somewhere in them. This is an example where knowing how information is stored in an embl file, along with a
few basic Linux commands, allows you to retrieve information quickly.
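The effect of the ^ and $ symbols can be checked on a small file. This is a sketch under stated assumptions: example.embl is a made-up, embl-style file written just for this illustration, not one of the course files, and its contents are invented.

```shell
demo=$(mktemp -d)

# A tiny, invented embl-style file.
cat > "$demo/example.embl" <<'EOF'
ID   X00001; SV 1; linear; mRNA; STD; HUM; 30 BP.
DE   Hypothetical example sequence
KW   DEMO.
FT   /note="CODE for illustration"
EOF

# Any line containing D followed by E, anywhere in the line.
cat "$demo/example.embl" | grep DE

# Only lines where DE is at the very beginning of the line.
cat "$demo/example.embl" | grep '^DE'

# Only lines that end with the word sequence.
cat "$demo/example.embl" | grep 'sequence$'

# Count the matches in each case.
anywhere=$(cat "$demo/example.embl" | grep DE | wc -l)
at_start=$(cat "$demo/example.embl" | grep '^DE' | wc -l)
at_end=$(cat "$demo/example.embl" | grep 'sequence$' | wc -l)
echo "containing DE: $anywhere, starting with DE: $at_start, ending in sequence: $at_end"
```

Putting the pattern in single quotes stops the shell from interpreting any special symbols before grep sees them.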
Another useful example where combining commands is handy is counting how many sequences are in a fasta
formatted file. We can do this with pipes between the commands cat, grep and the handy wc utility, which here
we use to count lines found by grep.
Each sequence in a fasta file starts with a header line that begins with a > . The above command streams the
contents of a file called multiseqs.fasta through a search with grep looking for lines that start with the symbol > .
The backslash before the > is necessary, as otherwise it is interpreted as a special character telling the system to
do something, rather than as a symbol you wish to look for. The ^ symbol means “look at the beginning of the
line”.
The output of this grep search is sent to the wc command, with the -l indicating that you want to know the
number of lines.
So an English translation of the command above is: Read through the multiseqs.fasta file and look for all the
header lines in the file, then count the number of header lines found and return the number to screen.
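A command of the kind described here can be sketched as follows. The file multiseqs.fasta created below is a made-up stand-in with three short sequences, so that the expected count is known in advance.

```shell
demo=$(mktemp -d)

# A hypothetical fasta file containing three sequences.
printf '>seq1 first\nACGT\n>seq2 second\nGGCC\nTTAA\n>seq3 third\nACAC\n' > "$demo/multiseqs.fasta"

# Stream the file through grep, keep lines starting with >, count them.
# The backslash stops the shell treating > as "send output to a file".
count=$(cat "$demo/multiseqs.fasta" | grep ^\> | wc -l)
echo "Number of sequences: $count"

# grep can also count matching lines itself with the -c option.
count_c=$(grep -c '^>' "$demo/multiseqs.fasta")
echo "Number of sequences (grep -c): $count_c"
```

The second form uses quotes instead of a backslash to protect the > symbol and leaves the counting to grep; both approaches should give the same number.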
Some other useful information
There are a number of ways that you can copy and paste text on Bio-Linux. The way we cover here involves
highlighting text to copy it, and using mouse buttons to paste the text.
The exact way to select, copy and paste text from within terminal windows depends on how your mouse has been
set up. Common configurations include:
● Highlight text with your left mouse button depressed to copy the text, and paste using the middle mouse
button (or the two outer mouse buttons pressed simultaneously)
or
● Highlight text with the left mouse button depressed and copy it to the clipboard by then using the right
mouse button and choosing Copy from the pop-up menu that appears. Paste by clicking the right mouse button
again and choosing Paste.
Sometimes a command or program goes on too long, or is obviously doing something you did not plan. If there is
no obvious way (such as a menu option or button) to stop the program running, try using Control-c (more
commonly written as Ctrl-c), i.e. hold down the Control key and hit the c key.
Note that this is the same key combination used on Windows machines for copying text. Remember that
highlighting text in Linux automatically copies it into the buffer – you don't need to press Ctrl-c.
To log out, you can press the Power Button on the far right of the top taskbar (Figure 2) and choose the Log Out
option.
To shut down the machine, you can choose the Shut Down option on the same menu. If you are working on the
console of a machine with users apart from you, then please check with your system administrator before
powering down the machine. Other people might want to log in remotely. In addition, Bio-Linux machines are
configured by default to update system and bioinformatics software overnight. If your machine is turned off, it
cannot update.
Your terminal windows can fill up with lots of text, and it can become difficult to see the information you want
because of all the clutter. You can clear the terminal window of all previous text by typing
clear
Changing permissions on files and directories
Every file on the system has a set of permissions on it that dictate who on the system can read, change or delete,
or execute the file. By default, all the files in your account are readable, changeable or executable by you.
However, you can grant other users permissions to access parts of your account if you wish.
Below is some basic information about file permissions. To set up access to your files for other people on the
system, please get advice from your system administrator.
The command to change permissions is chmod. You have to specify who you are modifying the permissions of,
what the new permissions are, and what file or directory to act on.
So, for example, the commands below give permissions to people in the group already set for that file or directory.
Obviously, to make your files accessible to the appropriate people, they would have to be in a particular group,
and that would have to be the group set for the files and directories referred to below. (Please refer to the manual
pages for the commands chown and chmod for more on this topic.)
chmod g+x ~                        give permission to people in the group to execute (in this case, to move
                                   through) your home directory
chmod g+rx ~/bioinf_files          give permission to people in the group to list files in the bioinf_files
                                   directory under your home directory
chmod g+r ~/bioinf_files/myfile    give permission to people in the group to read the file myfile
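You can watch the effect of chmod by listing permissions before and after a change. The sketch below works on a scratch copy of the directory layout rather than on your real home directory; the scratch location is an assumption made for safety, and the names bioinf_files and myfile simply mirror the examples above.

```shell
# Build a scratch directory and file to experiment on.
demo=$(mktemp -d)
mkdir "$demo/bioinf_files"
touch "$demo/bioinf_files/myfile"

# Start from a known state: no permissions for the group.
chmod g-rwx "$demo/bioinf_files" "$demo/bioinf_files/myfile"
ls -ld "$demo/bioinf_files" "$demo/bioinf_files/myfile"

# Let the group list the directory and read the file.
chmod g+rx "$demo/bioinf_files"
chmod g+r "$demo/bioinf_files/myfile"
ls -ld "$demo/bioinf_files" "$demo/bioinf_files/myfile"

# Characters 5 to 7 of the ls -l output are the group permissions.
dir_perms=$(ls -ld "$demo/bioinf_files" | cut -c5-7)
file_perms=$(ls -ld "$demo/bioinf_files/myfile" | cut -c5-7)
echo "directory group permissions: $dir_perms"
echo "file group permissions: $file_perms"
```

In the ls -l output, the first character is the file type, the next three are the owner's permissions, the following three are the group's and the last three apply to everyone else.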
Accessing a running program or working with others interactively
If you just run a job and then close down the terminal you ran it from, often the job will be terminated. It would be
nice to be able to leave a long job running and be able to log out and then log back in again to see how it is
progressing. This is especially true if you work remotely and experience network disruptions, or if you run
programs that can take quite a long time, but ask you for input periodically.
Luckily, there is a tool that makes it possible to leave programs running with no danger of them terminating if you
log off or your terminal is closed. In addition, when you log back into your system, either locally or remotely, you
can “re-attach” to your earlier session so it feels like you are picking up where you left off, in the same window
you were running your program from.
The utility that allows you to do this is called screen. It must be run before you start running other programs in
your window. Screen can also allow two people on different machines to work in the same session, i.e. real-time
collaborative editing is possible with screen.
Unfortunately, how to work with screen is beyond the scope of this course. However, the pages below provide
links to a number of tutorials about screen and multi-user sessions:
http://www.xmarks.com/topic/gnu_screen
http://www.oreillynet.com/linux/cmd/cmd.csp?path=s/screen
There are many useful commands available on Linux and we cannot begin to cover them in this course. We
recommend that you consider buying a book to help you learn how to use Linux efficiently.
Part Two: Introduction to Bioinformatics on Bio-Linux
This section of the tutorial introduces you to running bioinformatics software on Bio-Linux, including how to find
out what is available for particular types of bioinformatics tasks, some options you have for running programs on
the system, and where to find documentation about the software on the system. This course does not cover the
detailed use or understanding of any particular piece of software.
The main points we hope you take away after completing this section of the tutorial are:
a) If you have repetitive tasks to carry out, chances are there are ways of fully or partially
automating them.
b) Web interfaces are easy, and have certain benefits, but there are other ways to access software,
and sometimes these will suit your needs better.
c) If you are funded by the NERC, we can be contacted directly for help. Please email us if you
have questions or problems relating to Bio-Linux or bioinformatics analysis. Our contact
address is [email protected]
There are a number of sources of information about the bioinformatics software on Bio-Linux, including
● Bio-Linux bioinformatics documentation
● local copies of software documentation
● options under the help menus in some graphical programs
● web pages
● journal articles.
Categorised information about all bioinformatics software on the Bio-Linux system can be accessed via the
Bioinformatics Docs icon on the left hand side of your desktop. Software can be listed by name or by functional
category.
The information for each program includes an overview of what it does, links to local documentation when
available, as well as links to information on the internet.
We highly recommend that you read the documentation for any programs you intend to run.
This is especially important for programs that use heuristic algorithms (methods involving some level of
approximation, such as blast), and those that output numerical results.
Exercise 2-1
● Click on the names of any of the programs that might interest you and view the information in the
resulting web page.
● Return to the search form and click on the link to List all categories. This shows a view of all the
documented software according to the functional category (or categories) they are listed in.
Please refer to the bioinformatics documentation throughout this tutorial to find out more about the
programs introduced.
If you know of a good information resource for a program on Bio-Linux that is not mentioned in our
bioinformatics documentation system, or you have any problems with the system, please let us know by emailing
us at [email protected].
Documentation is available from within many programs. For example, many graphical programs have a Help
menu or button; many command line programs provide help if you type the name of the program followed by -h,
-help or --help. Some programs even have their own manual pages that can be accessed by typing man followed
by the program name.
The sequences referred to in this tutorial can be downloaded in a compressed package from
http://nebc.nox.ac.uk/courses/Bio-Linux/bioinf_files.tar.gz
If you have just done the associated Introduction to Linux tutorial, you will already have these files – please move
on to the next section of the tutorial.
If you have joined the tutorial at this point, please refer to Exercise 1-1, parts b, c and d to download and unpack
the necessary sample data files.
Interface choices
Software can be run on the command line, via graphical programs on your computer, via web interfaces, via web
services and/or via scripts. Bioinformatics programs can often be run using more than one of these options. Each
type of interface has pros and cons. We have summarised some of these for reference below.
Command line
Type out the command and press enter
Pros:
● Repetitive tasks are easy to run or automate
● Easy to log in remotely and carry out tasks

Prompted command line
Type out the command and respond to prompts on screen
Pros:
● Easy to run; don't have to remember the command line syntax
● Easy to log in remotely and carry out tasks
Cons:
● Easy to forget the diversity of options for a program because of the temptation to just reply to prompts provided
● Slower to get running than “pure” command line

Graphical interface
Start the program and interact via menus
Pros:
● Often more intuitive and visually pleasing than the command line
● Extensive help is often available via a menu option or button
● Some programs (not all!) can be run by clicking an icon in the Applications | Bioinformatics menu on your system.
Cons:
● Can be slower to use than the command line, especially for repetitive tasks
● For some programs, the command line version provides more functionality.
● You may need your system admin to set up programs so that you can run graphical programs when logging in remotely

Web services
Runs tasks over the internet from a program, usually locally installed or run via java webstart.
Pros:
● Can bring together the ease of a locally run program with the data and computing resources of a remote site
● Can be used via graphical programs or scripts
Cons:
● You are dependent on network connectivity
● You are dependent on the consistency of the remote server where the functions you need are running
● You are dependent on the functionality the remote site offers; this may not be as extensive as the functionality
you get locally for some programs.

Scripts
Using a small program that runs a program or programs for you
Pros:
● Very flexible
● Great for automating tasks
● Great for carrying out customised tasks
● Straightforward to learn enough to alter existing scripts to do exactly the task you want.
Cons:
● You have to write the script or find a script that does the job. This means learning a programming language
(or asking someone who knows one to help you)
For repetitive tasks, we highly recommend the use of the command line, workflow software
and/or scripting.
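As a small illustration of automating a repetitive task, the sketch below counts the sequences in every fasta file in a directory with a single loop, rather than running one command per file by hand. The three sample files are made up for the example and written to a temporary directory.

```shell
demo=$(mktemp -d)

# Three hypothetical fasta files with 2, 1 and 3 sequences.
printf '>a\nACGT\n>b\nGGCC\n' > "$demo/sample1.fasta"
printf '>c\nTTAA\n' > "$demo/sample2.fasta"
printf '>d\nAC\n>e\nGT\n>f\nCA\n' > "$demo/sample3.fasta"

# Loop over every file ending in .fasta and count its header lines.
# All the output of the loop is collected in one summary file.
for f in "$demo"/*.fasta; do
    n=$(grep -c '^>' "$f")
    echo "$(basename "$f"): $n sequences"
done > "$demo/summary.txt"

cat "$demo/summary.txt"
```

The same loop works unchanged whether the directory holds three files or three thousand, which is where the command line and scripting start to pay off.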
General points about working with bioinformatics programs
Sequence formats
A simple thing that often trips people up is sequence formats. There are many different sequence formats; the
reasons for this are both historical and functional.
Historically, when people first started writing analysis programs for molecular data, they designed a format that
they felt suited their needs. As time went on, numerous formats came into existence. We live with the legacy of
this. We must know what format our data is in, and whether the program we want to run can use data in that
format.
Functionally, a program may require information that can be included with data held in certain formats, but not
others. For example, embl format files can, in addition to the sequence data itself, contain descriptive information
about a sequence, such as its features. In contrast, plain format contains nothing inside the file except the
sequence data, while fasta format allows a small amount of information about a sequence to be given in a header
line. Clustal and msf formats handle multiple aligned sequences, while phylip and nexus format files contain
aligned sequences as well as information relevant to phylogenetic analysis programs.
To analyse data, it must be presented to the analysis program in a format the program
understands.
This seems obvious, but frequent errors (or worse, misleading results) occur when the data entered into
a program is not appropriate.
Converting files to different sequence formats used to be a frequent, and often time consuming, task in
bioinformatics. Luckily there are file conversion programs that take care of this easily for many formats. In
addition, many programs understand more than one format.
Some common bioinformatics sequence formats, along with common filename conventions used for those
formats, are listed in the table that follows the next section.
We recommend the following page for more information and examples of common bioinformatics file formats:
http://www.molecularevolution.org/resources/fileformats
File naming conventions in bioinformatics
The suffix, (the part of the filename after the final dot), is often used to denote to you, and other people, what the
format of the data inside the file is.
For example, the common suffix for clustal formatted alignments is aln. A bioinformatics file that ends in .aln is
usually assumed to be a clustal formatted alignment file.
Another multiple sequence alignment format is phylip. A common suffix used on files containing sequences in
phylip format is phy.
Common suffixes used for files containing data in particular formats are listed in the table following this section.
We highly recommend that you follow conventions when naming your data files.
● You will know your data format just by looking at the name of the file.
● Following standard conventions, (rather than making up your own naming system), makes it easier
for other people looking at your files, (e.g. collaborators, or people helping you); they will know the
data format just by looking at the name.
● Some graphical programs have filters set so that only files with particular suffixes will be listed in
the file browser window when you try to load some data. If you use conventional filename endings, this
is less likely to cause problems for you.
Certain programs use information in the filename to interpret aspects of the data, (not just the data format). Such
programs have strict naming conventions for the whole filename. For example, some sequence assembly
programs either require, or benefit from, defined naming schemes for sequence traces. The filename will
inform them about which sequences are read pairs, what direction sequence reads are in, and other information
relevant to assembly or visualisation. You will need to read the program documentation to find out what is
required in such instances.
You are not restricted to naming your files in any particular way but we highly recommend that you
follow the convention for the type of file you are generating/saving.
Following file naming conventions from the beginning will save you, and your collaborators,
a lot of time!
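Because the suffix is simply the part of the filename after the final dot, it is easy to pick out on the command line. The sketch below is illustrative only: describe_format is a hypothetical helper written for this example, and it knows only the three suffixes mentioned in this section.

```shell
# describe_format is a made-up helper for this example only.
# ${1##*.} removes everything up to and including the final dot.
describe_format() {
    suffix=${1##*.}
    case "$suffix" in
        aln)  echo "clustal alignment" ;;
        phy)  echo "phylip alignment" ;;
        embl) echo "embl entry" ;;
        *)    echo "no convention known to this sketch" ;;
    esac
}

# Try it on a few hypothetical filenames.
for name in myseqs.aln myseqs.phy hsy14768.embl results.out; do
    echo "$name: $(describe_format "$name")"
done
```

A filename with a conventional ending can be sorted, filtered or processed automatically in this way; a filename with a home-made ending cannot.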
Common bioinformatics file formats
Format: Embl or swissprot
Some common filename endings: .dat, .embl, .sprot, .swiss
Comments: Usually these files, along with genbank files, contain feature information as well as sequence.
The Embl and Swissprot (or Uniprot) formats are the same. Embl files contain nucleotide sequences and
Uniprot files contain peptide sequences.
Files downloaded from EMBL or Uniprot websites use the suffix .dat. Often these are compressed with gzip,
and so end in .dat.gz

Format: MSF
Comments: This was the standard output format from some of the suite of programs called GCG. The format is
still sometimes used.
Other multiple alignment formats are more generally used and thus are often a better option to choose if you
have a choice.

Format: Nexus
Some common filename endings: .nxs, .nex
Comments: Multiple sequence alignment format. Used by a number of phylogenetics programs.

Format: GFF
Some common filename endings: .gff
Comments: A format for describing genes and other features associated with DNA, RNA and Protein sequences.
Not generally used as input for analyses.
Naming files and the danger of over-writing previous results
Many programs will suggest a name for your results file. Sometimes this name is generated by taking the
beginning of the name of your input file, and adding a new suffix. However, sometimes it is just a generic name
like prettyplot.ps or clustalw.aln. We encourage you to change generic names as soon as you can.
Apart from the fact that filenames like prettyplot.ps give you little idea what is in the file, if you do not change the
name, the next time a file of the same name is generated, you will overwrite previous results.
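The habit of renaming generic output straight away can be sketched as below. No real alignment program is run here: the echo commands merely stand in for a program that always writes to clustalw.aln, and the new names (myseqs_run1.aln, myseqs_run2.aln) are made up for the example.

```shell
demo=$(mktemp -d)
cd "$demo"

# Pretend a program has just written its generic results file.
echo "first run" > clustalw.aln

# Rename it to something descriptive before doing anything else.
mv clustalw.aln myseqs_run1.aln

# The next run writes to the same generic name again...
echo "second run" > clustalw.aln

# ...but the earlier results are safe under their own name.
mv clustalw.aln myseqs_run2.aln

ls
cat myseqs_run1.aln myseqs_run2.aln
```

Had the first file not been renamed, the second run would have silently replaced it.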
Sequence data are usually stored in text or binary files. Text files contain human readable data. Binary files are
not human readable. The file formats referred to in the table above are all text formats. Examples of binary
formats include ABI sequences and SFF sequence files.
Word documents may look like text, but they aren’t. The letters you see on the page of a Word document (or
OpenOffice Write, or other word processing programs) are stored in a binary format.
Most sequence analysis programs expect text. Plain old, nothing fancy, text.
It is an unusual situation to need to use sequence data that has been stored as a Word document (if it is not
unusual to you, you are probably doing things the hard way!). To get a text document when using Word, save it as
text only.
Rule of thumb
If you are using Word or any other word processing program at any stage of your work with sequences, then it
is very likely that your life could be made a lot easier.
Please seek advice about other ways to handle your data. You will almost certainly save yourself time and
frustration. Honest.
Exercise 2-2
A useful Linux command to find out what type of file you are dealing with is file.
● In your bioinf_files directory is the file example.xls. Move into your bioinf_files directory if you
are not already there and try running the command
file example.xls
● In the bioinf_files directory is a file called testseq1.embl. Try running the command
file testseq1.embl
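If you would like to see how file responds to text and binary data without using the course files, the sketch below creates one of each. Both files are made up for the example, and the exact wording that file prints varies between systems, so no particular output is shown here.

```shell
demo=$(mktemp -d)

# A small, invented embl-style text file.
printf 'ID   X00001; SV 1; linear; mRNA; STD; HUM; 8 BP.\nDE   Hypothetical example\n' > "$demo/example.embl"

# A few bytes that are not human readable text.
printf '\000\001\002\003\377\376' > "$demo/example.bin"

# Ask file what each one contains.
file "$demo/example.embl"
file "$demo/example.bin"
```

The first file should be reported as some kind of text; the second should not.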
Examples of running bioinformatics programs on Bio-Linux
For each program covered, we include a list of interfaces available. Those interfaces with an asterisk next to
them are covered in the tutorial.
Documentation and links for all the software is available via the Bioinformatics Docs web pages discussed
earlier.
Artemis
Artemis is a DNA sequence viewer and annotation tool, allowing visualisation of sequence features and the
results of analyses within the context of the sequence and its six-frame translation. Artemis can read embl or
genbank format files. Sequences can be loaded from local files or via the network from the EBI.
Exercise 2-3
● Start Artemis on Bio-Linux by typing artemis on the command line or by choosing the option
Artemis from under the Applications | Bioinformatics graphical menu.
● Now choose the option Open... from under the Artemis File menu, and select the file
hsy14768.embl from within the bioinf_files directory.
This should open up a large window, like that shown in Figure 13, where this sequence is displayed
graphically.
● Open a terminal window and view the text of the embl entry using the command less
hsy14768.embl
Notice how Artemis is providing a graphical representation of what is in the text file.
● Try choosing Mark Open Reading Frames from under the Create menu of Artemis.
You should now see two boxes near the top in the Entry section, the first called hsy14768.embl and
the other called ORFS_200+.
● Uncheck the box next to hsy14768.embl. You should now be able to scroll along the window
horizontally and easily see the open reading frames you marked.
● Check the box next to hsy14768.embl again. Look at the information in the bottom
frame of the window. Notice how it is related to the images in the frames above.
● Try clicking on some of the lines in the bottom frame and seeing what happens in the images in
the other two frames.
● Explore the options available to you. (Not all options will be functional by default. See the
information about the Run menu below)
● You can also load up files direct from the EBI. If you want to try this, then choose File | Open
from the EBI – Dbfetch... option in the original small Artemis window and enter the accession
number BX255937.
● When you are done, close Artemis by choosing File | Close in the sequence entry window and
then choosing File | Quit in the main (small) Artemis window.
You can run various programs on your sequence, or parts of your sequence, from under the Run menu in
Artemis. Some of the options in this menu need to be configured to be appropriate for your site. There is
information on how to do this on our website at:
http://nebc.nox.ac.uk/tools/bioinformatics-docs/faq#blast_art
If you are not the system administrator of your Bio-Linux machine, then you will probably need to liaise with the
person who is to get this set up properly.
We also recommend ACT, a sister program to Artemis, which allows comparison of two or more sequences.
EMBOSS Programs
EMBOSS is an extensive package of programs that cover areas of bioinformatics analysis including:
● Sequence alignment
● Rapid database searching with sequence patterns
● Protein motif identification, including domain analysis
● Nucleotide sequence pattern analysis, for example to identify CpG islands or repeats
● Codon usage analysis for small genomes
● Rapid identification of sequence patterns in large scale sequence sets
● Presentation tools for publication
http://nebc.nox.ac.uk/tools/bioinformatics-docs/other-bioinf/emboss-applications-and-databases
A comparison of the Jemboss and command line interfaces for EMBOSS programs
Command Line
Pros:
● Prompted command line makes programs easy to run
● Programs accept input files with multiple sequences either directly or using lists of sequence or filenames.
Cons:
● Prompted command line makes it easy to overlook many of the options available
● You have to read the documentation to find out about the options available
The task:
Fetch a sequence file from the EMBL database, extract all the mRNA sequences from the feature table and search
for palindromes in those mRNA sequences.
Using Jemboss
Exercise 2-4
● Start Jemboss on Bio-Linux by typing jemboss & on the command line. It can also be started by clicking on
the icon under the Applications | Bioinformatics menu.
● Click on each of the categories (e.g. Alignment, Display, etc) to see what programs are listed.
● When you're ready, click on the Feature Tables category and choose coderet.
● Scroll to the bottom of the window and click on the i button to bring up a documentation window.
Please read about what coderet does.
Figure 15: The GO button is pressed when you are
ready to run the program. The i button pops up a
window with documentation. Some programs, but not
all, also have an Advanced Options button that
brings up optional fields, which are often very
useful.
● Scroll to the top of the Jemboss window again, and fill in a Sequence Filename. In fact, we want to pull a
sequence directly from embl at the EBI. The sequence we want has the accession number BX255937. To fetch it
from the EBI, you need to type
embl:BX255937
into the Sequence Filename box.
● Enter a filename into the outfile file name box. For example, to distinguish it from your later work,
you could use the name: jemboss_bx.coderet. Then press the GO button.
● When the program has finished, a new window called Saved Results should appear. (Don't be
fooled: your results haven't been saved yet!) There should be a number of tabs in that window. One will
have the name you entered into the outfile file name box (e.g. jemboss_bx.coderet). The others
will likely have names like bx255937.mrna, bx255937.noncoding, etc.
● Take a look at the type of information in each tab. In particular, take note that:
➢ each of the tabs that contains sequence information contains multiple sequences
➢ the command line you would use to run this program identically to how you just ran it via
Jemboss is provided to you under the cmd tab. This will be useful later.
● To work with any of this data further, you have to save it to a local file. Click on the tab with the
name ending in .mrna. Choose the File | Save to Local File... option and save this to a location you can
find again (e.g. under your bioinf_files directory). Give it a name that will distinguish it from later work,
e.g. jemboss_bx.mrna. Do not close the Saved Results window as we want to refer to the information
under the cmd tab later.
● Go back to the main Jemboss window, go to the Nucleic | Repeats section and choose palindrome
from the list of programs.
● Browse for the file you just saved using the Browse files... button next to the box under Sequence
Filename near the top of the page. Note that you'll have to change the Files of Type: option to All Files
to find your saved file because it has a .mrna suffix.
● Check that you're happy with all the required options, and give a filename in the outfile file name
box. For example, jemboss_palin.txt. Then press the GO button.
● Scan through the results to see what has been returned to you.
You can also view listings of the files on your system using the Jemboss file manager functionality. Click on the
symbol at the bottom right side of the Jemboss window. If you double click on the name of a file that contains
text, it will pop up in another window for you to view or edit. Note: the file manager listings in the Jemboss
window are not kept up to date automatically. The Nautilus file browser or the ls command are a better way to
keep track of what files have been created or deleted.
Much more information about the EMBOSS command line syntax is available at:
http://emboss.sourceforge.net/developers/acd/commandline.html
We're now going to run the same tasks from the command line that we ran via Jemboss earlier.
Exercise 2-5
● Look at the cmd tab in your jemboss results window for coderet. You should see the following:
This command runs coderet, specifies the sequence to use and sets the output file name. The -auto option
indicates that you do not want to be prompted for further information. This results in default values being used
for all options you have not specified on the command line.
● Read about coderet by bringing up the information via the command line:
● To make things simple, we will edit the command line in the coderet cmd tab of the Saved Results window
in Jemboss, and then copy and paste our final command line into a terminal to run the program.
Go to the coderet cmd tab of the Saved Results window in Jemboss, and edit the command to give a new
output filename. e.g.
● Open a new terminal window and navigate to your bioinf_files directory. Make a new directory to store your
result files (as it will make it easier to see what files the program generates by default).
mkdir cl_dir
● Change directory into your new directory, copy and paste the coderet command line above into the terminal
and press the return key. (Recall that we covered highlighting and pasting text using mouse buttons near the end
of the first half of this tutorial.)
cd cl_dir
coderet -seqall embl:BX255937 -outfile cl_bx.coderet -auto
● When the program finishes, list the files in your directory. What has coderet produced? How does this
compare with the tabs presented to you when you ran coderet via Jemboss?
You may notice that we have generated a lot of files we don't need. We could have specified to coderet that we
only wanted the mRNA sections from the embl entry BX255937. To find out how, you'll need to refer to the
coderet documentation (the lists of options won't tell you enough).
● Now run palindrome on the mRNA sequence. To do this, you could edit, copy and paste the command
in the Jemboss Saved Results window for palindrome, or you can type palindrome on the command line and
answer the prompts. Please run palindrome now, using one of these methods.
Once you get to know it, running programs from the command line is much faster than running them via Jemboss.
The real power of the EMBOSS command line, however, shows when you need to process groups of files, or do
things repetitively.
Below we'll go through an example of running an EMBOSS program on a batch of files using a single command.
If you want to run a job like this repetitively, you can save the commands in a text file and then set things up to
have those commands executed whenever you want (either by you directly, or by your computer at a time you
schedule). We do not cover this in these course notes, but please ask the demonstrator if you would like to know
more about this.
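As a rough illustration of saving commands in a text file, the sketch below writes two commands into a file and then runs the file with sh. The file name my_commands.sh and the echo placeholders are hypothetical; in real use the lines would be the EMBOSS commands you want to repeat.

```shell
# Hypothetical example: store commands in a text file and run them later.
workdir=$(mktemp -d)          # scratch directory so nothing real is touched
cat > "$workdir/my_commands.sh" <<'EOF'
# Each line is a command, exactly as you would type it in the terminal.
echo "step 1: fetch sequences"
echo "step 2: search for palindromes"
EOF

# Run every command in the file, in order.
script_output=$(sh "$workdir/my_commands.sh")
echo "$script_output"
```

Running the file again repeats exactly the same steps, which is the main reason to keep commands in a file rather than retyping them.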
Exercise 2-6
● Please look at the contents of the file hexaseqs.list in your bioinf_files directory, e.g. using the
command less. You will see a list of sequence ids and the database those sequences are in.
● Quit less.
● We need to tell EMBOSS programs when they are going to work on a list of files rather than just a
single file. To do this, we preface the filename with the @ symbol. So, to fetch the list of sequences in
the hexaseqs.list file, we can use the command:
The default behaviour of seqret is to fetch sequences in fasta format, with all sequences in a
single file with a filename that uses the id of the first sequence. By now you should know how to
go about finding out how to alter aspects of the program behaviour like these.
You can use this same “list of sequences” syntax with Jemboss. e.g. you could run seqret via
Jemboss and specify the sequence name as @hexaseqs.list.
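The list-file idea can be sketched as follows. This is only an illustration: the file example.list and the placeholder ids are made up (BX255937 is the accession used earlier), and the seqret command line is built and printed rather than run, on the assumption that the form is seqret followed by @ and the list file name.

```shell
# Build a small list file; each line is a database:id reference.
# BX255937 is the accession used earlier; the other ids are placeholders.
listdir=$(mktemp -d)
printf '%s\n' 'embl:BX255937' 'embl:PLACEHOLDER_ID_1' 'embl:PLACEHOLDER_ID_2' > "$listdir/example.list"

# Count the entries the list would hand to an EMBOSS program.
n_entries=$(grep -c . "$listdir/example.list")

# The @ prefix marks the argument as a list file rather than a sequence file.
list_command="seqret @$listdir/example.list"
echo "Would fetch $n_entries sequences with: $list_command"
```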
Blast
The Basic Local Alignment Search Tool (BLAST) searches for regions of local similarity between sequences.
The program compares nucleotide or protein sequences or patterns to sequence, or sequence-related, databases
and calculates the statistical significance of matches.
There are two main distributions of blast: NCBI blast (aka blastall) and the former Washington University blast
(aka wu-blast). Wu-blast is in the process of being acquired by Advanced Biocomputing LLC and so is not easily
obtained at the moment. Quite recently, the NCBI released a re-worked version of blast called blast+. Both the
standard blastall and blast+ packages are available on Bio-Linux.
The blastall and blast+ packages both contain a number of programs allowing you to carry out different types of
searches. Here we focus on the programs of the blastall package, as these are still the most commonly used and best
documented. We recommend that you consider the new blast+ package as well, as it has some enhancements and
may be faster for certain flavours of blast searches. The Bio-Linux Bioinformatics
Documentation System (icon on your Desktop) provides information and links for both packages.
For this course, we assume that you are familiar with running blast searches using at least one web-based
interface. If you are not, then this is a good time to look at the facilities offered through one of these sites:
NCBI: http://blast.ncbi.nlm.nih.gov/Blast.cgi
EBI: http://www.ebi.ac.uk/Tools/blast/
For small volumes of data, where you wish to search a commonly available database or subset of data available
through a website, web access is a very good option. Web-based utilities are also good for experimenting
with parameters when determining useful settings for your investigation. The command line comes into its own
for setting up searches quickly, for processing large volumes of data, for automating your searches, and for giving
you the ability to get just the information you want returned from the blast searches. (This last point has been
made easier than ever thanks to the blast+ programs, where you can specify which information to return in a tab
delimited format.)
We HIGHLY recommend you invest time learning about what blast does in detail, including how it works and
what the statistics it produces mean. The “take the top hit” method will rarely serve your research well.
We provide a list of references and helpful web pages in Appendix B that we hope will help you learn more
about blast programs.
Before you start searching with a sequence, it is useful to outline your answers to questions like:
There are many other programs available as part of the NCBI blastall release apart from the ones above. These
include blastclust, bl2seq, impala, rpsblast, blastpgp, seedtop, blastcl3 and megablast. These programs are not
covered here, but are worth learning about for your own work.
A simple blastp search
The following is a basic blastp command.
You can fine tune blast easily using additional command line options. We highly recommend that you read
about blast and determine appropriate settings for your research questions. This will ultimately save you a huge
amount of time and energy.
A copy of the swissprot part of uniprot, formatted for blast searches, is located in the directory blastdb, under
your bioinf_files directory. We do not cover the use of formatdb in this course, but its basic use is shown in
Appendix C. For completeness, the steps we took, including the command we used to create the blast formatted
swissprot database, are shown below:
ftp://ftp.ebi.ac.uk/pub/databases/fastafiles/uniprot/swissprot.gz
We then used the formatdb command from within the blastdb directory.
The location of the blast database used during this course is not the recommended location.
You can either give the full or relative path to your blast databases within the blast command, or you can
store your blast databases in a location that is supplied as the value for the BLASTDB environmental variable
and just provide the database name in the blast command line.
We recommend for your own work on Bio-Linux 5 that you put blast databases in the location
/home/db/blastdb OR change the environmental variable BLASTDB to a location appropriate for your work.
You may need to talk to the system administrator of the machine about this. Note that the default location for blast
databases may be different on different machines, and may change on Bio-Linux in the future.
For the purposes of this tutorial, we will give the blast command the relative location of the blast databases.
Exercise 2-7
● Move into the bioinf_files directory if you are not already there.
● List the files in the blastdb subdirectory. The files called sprot.p* are the files that blast uses when it
searches.
● Look at the results file (you defined the name of the results file using the -o flag).
Recall that a blastx search translates a nucleotide sequence in six frames and searches a peptide database.
A good summary page explaining the -m formatting option can be found at:
http://www.compbio.ox.ac.uk/analysis_tools/BLAST/BLAST_blastall/blastall_examples.shtml
Exercise 2-8
● Run either of the above blast searches again, this time adding the flag -m9 to the command. Make sure you
change the name of the output file as well.
● Look at the results from this search and compare it to what was returned using default formatting. Is it easier
or harder to read? Is there information present in one report that is not in the other?
Note: Blast+ programs offer the above formatting options as well as finer control over the format and
contents of results returned.
Handling multiple sequences
This section covers ways to deal with a small number of sequences at once – say up to a few hundred. For
thousands of sequences, you will probably want to use the ideas introduced here, in conjunction with running your
searches on a cluster and using scripts for pulling out the information of relevance to you from the results files.
This section looks at using blast on a number of sequences. However, the general principles presented apply to
many bioinformatics programs.
Blast searching using fasta files containing more than one sequence
Exercise 2-9
● Look at the contents of the file multiseqs.fasta in your bioinf_files directory. How many sequences are in
this file?
● Look at the results file to see how the results have been reported. How easy would this be to read and
understand?
● Try the above query again, but with the -m9 flag.
● Read about the -b and -v flags in the blast documentation. For very small studies, where you might read
through the blast reports yourself rather than doing further processing on them using the computer, these flags
may help you in some instances.
So, when running multiple blast searches, you might want to do something like:
“For each sequence in my list, run a blastx search against my swissprot database.”
You can also create nested foreach loops. For example, if you had a list of sequences and a list of databases, you
could use a nested foreach loop to get the computer to do something like this:
“For each sequence in my sequence list, run a blastx search against each database in my database list”
You can run a foreach loop on arbitrarily long lists. However, for the exercises below, we will use just five
sequences testseq1.fasta, testseq2.fasta, testseq3.fasta, testseq4.fasta and testseq5.fasta.
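A nested loop of this kind can be sketched as a dry run. The example below uses the for ... do ... done form, which zsh, bash and sh all accept, and it only prints what would be searched rather than calling blast. The database names dbA, dbB and dbC are hypothetical.

```shell
# Dry run of a nested loop: print each planned search instead of running blast.
# The outer loop walks the sequences, the inner loop walks the databases.
# Sequence and database names are placeholders.
nested_plan=$(
  for seq in testseq1.fasta testseq2.fasta; do
    for db in dbA dbB dbC; do
      echo "blastx search: query=$seq database=$db output=$seq.$db.blastx"
    done
  done
)
echo "$nested_plan"
```

Two sequences and three databases give six planned searches. Printing the plan first is a cheap way to check a nested loop before committing computer time to it.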
The foreach loop explained step by step
You need to tell the computer the list of files to work on. Here, we will use one of the commands below to
indicate the list of sequences we want to work with:
ls testseq*.fasta
ls testseq[1-5].fasta
We use the command for the list of files within the first line of the foreach loop:
This means: for each sequence in the list in the first line, run the command in the second line. When all the
sequences in the list have been dealt with, then finish.
A full explanation of each line in the loop follows, in the text below and in exercise 2-10.
● the “j” means “each thing” – more specifically, for each thing we get to in the list, let's refer to it by the
name j. This is an arbitrary name. You can use whatever you want. So the following are equivalent to the line
given above:
foreach myThing in `ls testseq[1-5].fasta`
    (calls each list item in turn “myThing”)
foreach seq in `ls testseq[1-5].fasta`
    (calls each list item in turn “seq”)
Once you have chosen a name for each thing in your list, you must use that name to refer to the list item in any
commands that follow within the foreach loop.
● the quotation marks around the ls testseq[1-5].fasta command are backquotes and the computer understands
this to mean “take the results of the command inside these backquotes”. Backquotes are usually found on a key
in the top left hand side of a standard UK keyboard.
● the list generated using the command in backquotes is the list of things to be processed
● So the overall effect of that one line is: “foreach thing in the list that can be generated using the command ls
testseq[1-5].fasta, do the following:”
Hint: It is usually a good idea to check that the command used to create a list does actually generate the list
you expect before including it within a foreach loop.
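The hint above can be tried safely in a scratch directory. In this sketch the six empty files are placeholders created just for the example; the backquoted ls command is the same one used in the text.

```shell
# Create placeholder files in a scratch directory.
listcheck_dir=$(mktemp -d)
cd "$listcheck_dir" || exit 1
touch testseq1.fasta testseq2.fasta testseq3.fasta testseq4.fasta testseq5.fasta other.fasta

# Backquotes capture the output of the command inside them.
file_list=`ls testseq[1-5].fasta`
echo "$file_list"

# Count the items before trusting the list in a loop.
n_files=`ls testseq[1-5].fasta | wc -l | tr -d ' '`
echo "The loop would process $n_files files"
```

The file other.fasta is deliberately left out of the list, because its name does not match the testseq[1-5].fasta pattern.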
Please note that the syntax used in the above command assumes that you are in the z-shell. If the
above command fails for you and you are sure that you have typed it in and used backquotes when
stated, please check you are using the z-shell.
You can check your default login shell by typing the command finger followed by your username. Look for the
information next to Shell: near the top of the output. If you are not in the z-shell already, just type zsh in your terminal
window.
Other shells provide the same functionality as the foreach loop demonstrated here, but the syntax is different.
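For comparison, here is a sketch of the same kind of loop in the for ... do ... done form used by bash and sh (zsh accepts it too). The blast command is replaced by a harmless echo, and the three fasta files are empty placeholders created for the example.

```shell
loop_dir=$(mktemp -d)
cd "$loop_dir" || exit 1
touch testseq1.fasta testseq2.fasta testseq3.fasta

# "for j in LIST; do ...; done" plays the role of "foreach j in LIST ... end".
for j in `ls testseq[1-3].fasta`
do
    # $j is replaced by each file name in turn; a real loop would call blast here.
    echo "query $j" > "$j.blastx"
done

ls *.blastx
```

Each output file takes the input file name with .blastx appended, the same naming idea used in the foreach exercises.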
Once you have told the computer what files to work on, you need to tell it what to do with each file.
Exercise 2-10
We will set up a foreach loop to run blastx searches using the five testseq?.fasta sequences with the swissprot
database.
● The foreach> is a prompt – it is here we tell the computer what we want it to do with each item in the list.
To do this, type:
Recall that we defined each thing that we want to work on by the letter j in the first line of the foreach
loop. In each subsequent line of the foreach loop, we refer to each thing by prefacing the j with a $ sign.
Each $j in that command will be replaced by the name of a file from the list.
So here, the blastall command is executed with each filename in turn, and output files are named using the
sequence filename with .blastx appended.
end
This indicates that there are no more processing steps to include in this foreach loop.
If when running the above you get an error such as “File ‘ls testseq[1-5].fasta’ not found”, it probably
means you used single quotes instead of backquotes. Try running the foreach loop above again, being
careful to use backquotes around the phrase in brackets.
ls -l *blastx
You should now see that you have five blastx results files. Imagine you had 100 sequences to blast – you could set
up a foreach loop and go get a coffee. (Of course, you still need to figure out how you're going to use or analyse
the results files if you're working with large numbers of sequences.)
We mentioned above that the j in the foreach loop was an arbitrary name. As an example, if we had used seq
instead of j, the foreach loop would have been written:
Exercise 2-11
● Read through all the files called testseq*.blastx by using the command less:
less testseq*.blastx
● When you get to the end of one document, (or just want to go to the next document), just type
:n
● To quit, type q
Why go to all this trouble when we could just create a multiple fasta file and run a blast search using it?
Foreach loops can be used with any programs – not just blast. So this method is widely applicable.
Multiple tasks can be carried out in a single foreach loop, as the following example shows.
Exercise 2-12
If you have time, you can run the following foreach loop. Try to figure out what it does before running it. You
may need to read the man pages for sed and cut to understand all the steps being taken.
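As a separate illustration of several steps in one loop (this is not the loop from Exercise 2-12), the hypothetical example below takes the header line of each fasta file, strips the leading > with sed, and keeps the first word with cut. The file names and sequence contents are invented, and the for ... do ... done form is used so that it runs in bash and sh as well as zsh.

```shell
multi_dir=$(mktemp -d)
cd "$multi_dir" || exit 1
# Two tiny made-up fasta files.
printf '>seqA first example\nACGT\n' > a.fasta
printf '>seqB second example\nTTGA\n' > b.fasta

for f in `ls *.fasta`
do
    # Step 1: take the header line and remove the leading ">".
    header=`head -n 1 "$f" | sed 's/^>//'`
    # Step 2: keep only the first space-separated field, the sequence id.
    id=`echo "$header" | cut -d ' ' -f 1`
    # Step 3: record which id came from which file.
    echo "$f $id" >> ids.txt
done

cat ids.txt
```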
Working with lots of blast results
Reading a few blast reports is fine, but when you have thousands, you presumably won't be reading them one by
one yourself.
A common way to handle large volumes of blast results is to get the computer to read through the report files,
pulling out key information. You can try using the new blast+ programs, which give you a great deal of fine tuned
control over what to report in tab delimited format. Alternatively, or if blast+ doesn't output the format you need,
you can use a customised script. You might choose to load such extracted information into a database, or for small
scale studies, into a spreadsheet. This topic is not covered further in this course, but we recommend BioPerl
modules for parsing blast report files. Example BioPerl scripts for blast parsing can be found on your Bio-Linux
machine under the following directory:
/usr/share/doc/bioperl/examples/searchio
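For small jobs, tab delimited output (blastall -m 8, or -m 9, which adds comment lines starting with #) can also be filtered with standard tools. The sketch below keeps the highest scoring hit for each query. The result lines are invented sample data in the standard 12-column layout, where column 1 is the query, column 2 the hit and column 12 the bit score; it is a minimal example and no substitute for a proper parser.

```shell
# Invented sample of tab-delimited blast output (12 columns per hit).
tab_dir=$(mktemp -d)
{
  printf '# example comment line as produced by -m 9\n'
  printf 'q1\thitA\t98.0\t100\t2\t0\t1\t100\t5\t104\t1e-50\t200\n'
  printf 'q1\thitB\t80.0\t90\t18\t0\t1\t90\t7\t96\t1e-20\t95.5\n'
  printf 'q2\thitC\t75.0\t60\t15\t0\t3\t62\t1\t60\t1e-10\t60.1\n'
} > "$tab_dir/results.tab"

# Skip comment lines, then remember the best bit score seen for each query.
best_hits=$(awk -F '\t' '
  /^#/ { next }
  !($1 in best) || $12 + 0 > best[$1] { best[$1] = $12 + 0; hit[$1] = $2 }
  END { for (q in best) print q "\t" hit[q] "\t" best[q] }
' "$tab_dir/results.tab" | sort)
echo "$best_hits"
```

The result has one line per query, which is a convenient size for loading into a spreadsheet. Keep in mind the earlier warning that the top hit alone rarely tells the whole story.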
We won’t be nicing any jobs today, but for the sake of all the other users of your Bio-Linux machine, please read
the documentation on nice:
man nice
man renice
To nice a job you are about to run, use nice -n level command. Levels range from -20 (highest priority) to 19
(lowest priority). For example, to nice a program called someprog.pl to level 15 (a fairly low priority), you could
type:
nice -n 15 someprog.pl
You can also move a running program to a lower priority using the command renice.
You may have to give the full path of the command you wish to run when using nice, rather than
just the short name.
There are other facilities, such as queuing and load balancing systems, which are more sophisticated than just
“nicing” a job, but nice is simple, built-in, and effective for machines with a very small number of users.
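A quick way to see nice in action without a long-running job is sketched below. It assumes GNU nice as shipped on Linux systems, where running nice with no arguments simply prints the niceness it is running at; the level 5 is an arbitrary choice for the example.

```shell
# Niceness of the current shell (usually 0).
base_level=$(nice)

# Run a command at lower priority. Here the command is nice itself, which
# with no arguments prints the niceness it is running at.
niced_level=$(nice -n 5 nice)

echo "current niceness: $base_level, under nice -n 5: $niced_level"
```

The second number should be the first plus 5, unless the shell is already close to the lowest priority of 19.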
If you suspect there may be a more efficient way to do what you are doing, there probably is!
If you find yourself doing anything repetitively, there is probably an easier way to do it.
Please read documentation and seek advice. It will save you a lot of time in the end!
Appendix A: Exercise Answers
It is a good idea to use the ls command to check that the files you have chosen are the correct files to
move.
ls test*embl*
would show the files that would be moved by the cp command shown above.
7. rm mythirdfile.txt
8. rm bioinf_files/testdir/myfirstfile.txt
9. rm -rf bioinf_files/testdir/subdir
Appendix B – Blast references and documentation
Web pages
The blastall page in your Bio-Linux Bioinformatics Docs provides links to local web pages with information
about NCBI blast programs. You can also access this remotely at the URL:
http://nebc.nox.ac.uk/bioinformatics/docs/blastall.html
http://nebc.nox.ac.uk/bioinformatics/docs/blast+.html
References
The book by Ian Korf is a good place to start in learning about what blast can do, how it does it and what blast
output means.
C. Camacho, G. Coulouris, V. Avagyan, N. Ma, J. Papadopoulos, K. Bealer and T.L. Madden. BLAST+: architecture
and applications. BMC Bioinformatics, 10:421, 2009.
S. R. Eddy. Where did the BLOSUM62 alignment score matrix come from? Nat
Biotechnol, 22(8):1035–6, 2004.
Ian Korf, Mark Yandell and Joseph Bedell, with a foreword by Stephen Altschul. BLAST: An Essential
Guide to the Basic Local Alignment Search Tool. O’Reilly, Sebastopol, Calif., 2003.
Appendix C – Creating local blast databases
Obtaining local blast databases
To get the most from blast, you should search against a relevant database, which may mean using the relevant
parts of a larger database. In general, blast searching against the whole of nr or the whole of embl is not a
particularly good idea. It takes up your time and computer resources, returns blast results with less useful statistics
and often less meaningful results. For example, if you are studying marine viruses, do you really care about all the
mouse sequence in nr or embl?
Web resources often offer different data subsets you can search against. For example, using the NCBI blast pages,
you can choose from a certain number of database sections, or you can fine tune the sequence set you blast against
using Entrez queries:
http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=FAQ#entrez
http://www.ncbi.nlm.nih.gov/bookshelf/br.fcgi?book=helpentrez&part=EntrezHelp#EntrezHelp.Writing_Advanced_Sea
Using the EBI blast services, you can choose from a number of data subsets, as well as having a choice of
WU-blast or NCBI blastall.
http://www.ebi.ac.uk/Tools/blast/
To run blast locally, you need to index your database; it is these indices that blast reads when searching. For some
databases or database divisions, you can download prepared blast indices from sites such as the NCBI. These are
convenient, but do restrict you to searching against particular sets of sequences. It is often useful to create a set of
sequences chosen for the types of searches you wish to carry out (e.g. organism or tissue specific) and format
them into a database you can search using blast.
Any set of fasta sequences can be indexed for blast searching. Creating useful sets of sequences is beyond the
scope of this course, but two resources to consider are SRS (http://srs.ebi.ac.uk) and Entrez
(http://www.ncbi.nlm.nih.gov/books/bookres.fcgi/helpentrez/EntrezHelp.pdf).
For NCBI blastall, the formatdb command is run on fasta formatted files to create blast indices.
For blast+, the program used is called makeblastdb.
One thing to note in the table above is that uniprot divisions are provided in embl format. However, blast indices
are created from fasta format files. Unfortunately, the EMBOSS program seqret, which you saw earlier, does not
handle entire database divisions well. Instead, you can use a simple script to do the conversion. Instructions for this
are below.
If you choose to use pre-formatted blast databases, make sure you read the notes about them (usually available as
a file called something like README on the ftp site you get the blast files from) as they can be slightly different
from the database that results from downloading and formatting your own.
It is important to read the documentation about the databases you choose to work with.
For example, uniprot and nr are not the same. Nt is not a non-redundant database; nr is.
Knowing what is in a database you work with is vital in understanding your results.
Nucleic Acids Research publishes a database issue in January of each year.
This is an excellent resource for finding out more about available database resources.
Another useful resource is the information available via the links on the Library page of SRS at the EBI:
http://srs.ebi.ac.uk/srsbin/cgi-bin/wgetz?-page+top
We will use the uniprot swissprot virus division as an example here. As this is distributed in embl format, and we
need it in fasta format, we include a format conversion step in the instructions below.
Bio-Linux machines by default have the BLASTDB environmental variable set to /home/db/blastdb. If you are
logged in as an administrative user, then you will be able to download and work in this area using your sudo
privileges. If you are on a multi-user system and are not an administrative user, you should talk to your system
admin: either to ask them to give you privileges in the central blast database folder, or warn them that you are
about to use lots of space in your account for blast databases.
These instructions assume that you are working from the directory where you will be storing your blast database
files. This is not normally the case. Usually, if you download blast databases into your account, it is easiest to set
the BLASTDB environmental variable to the location of these blast databases, and then work from a convenient
folder where you plan to store your results. You can set the BLASTDB environmental variable for a single
session by typing a line of the form below in the terminal you are working in. To set this variable for every
session, you can add the line to your ~/.zshrc file.
export BLASTDB=/home/yourUserName/blastdbDir
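Setting and checking the variable for one session can be sketched as follows. The directory here is a throwaway scratch location standing in for /home/yourUserName/blastdbDir, so the example can be run without touching your real files.

```shell
# Hypothetical database directory inside a scratch area.
blastdb_home=$(mktemp -d)
mkdir -p "$blastdb_home/blastdbDir"

# Set the variable for this session only.
export BLASTDB="$blastdb_home/blastdbDir"

# Confirm the variable is set and points at a real directory.
if [ -d "$BLASTDB" ]; then
    blastdb_status="ok"
else
    blastdb_status="missing"
fi
echo "BLASTDB=$BLASTDB ($blastdb_status)"
```

Checking that the directory exists catches typing mistakes in the path early, before a blast search fails to find its database.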
● Download the database section of interest. Here we will work with the uniprot swissprot virus division:
wget ftp://ftp.ebi.ac.uk/pub/databases/uniprot/current_release/knowledgebase/taxonomic_divisions/uniprot_sprot_viruses.dat.gz
● If you don't already have a sequence conversion tool, download the emblToFastaAndPreProcess.pl script
from the NEBC site.
wget http://nebc.nox.ac.uk/scripts/bioinf/emblToFastaAndPreProcess.pl
This script converts embl sequence to fasta sequence. Due to issues that sometimes appear because of the
formatting of information in the feature table, it does so by removing the feature lines from the entry before
conversion. A version of the script that does not pre-edit the feature lines is also available:
http://envgen.nox.ac.uk/scripts/bioinf/emblToFasta.pl
● Make this script executable.
● This script can handle compressed files, so you can create a fasta formatted copy of the
uniprot_sprot_viruses division by running the command:
./emblToFastaAndPreProcess.pl uniprot_sprot_viruses.dat.gz
Notice the ./ at the start of the line. You need this if you are running the script from the directory you are in. There
are better ways to do this if you plan to keep this script for use again, but they are not covered here.
● When the script is finished, you should find a file called uniprot_sprot_viruses.fasta in your directory. This
is the file we build the blast database from.
● You should now have four new files in your directory: sprot_virus.psq, sprot_virus.pin, sprot_virus.phr and
formatdb.log. The last of these lets you know how the blast formatting went.
The sprot_virus.p* files are your blast indices. You search against them by specifying the blast database name
sprot_virus.
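The steps of making a script executable and running it with ./ can be tried on a stand-in script. In the sketch below demo_script.sh is a made-up two-line script, not the NEBC conversion script, and chmod +x is used as the usual way to add execute permission.

```shell
exec_dir=$(mktemp -d)
cd "$exec_dir" || exit 1
# A stand-in script; it is not the NEBC conversion script.
printf '#!/bin/sh\necho "script ran"\n' > demo_script.sh

# Give the file execute permission.
chmod +x demo_script.sh

# ./ tells the shell to look for the script in the current directory.
run_result=`./demo_script.sh`
echo "$run_result"
```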
Note:
If you were interested in the swissprot virus division, you would probably be interested in the trembl virus
division also. You could download and format that division as described above, and then search the swissprot and
trembl virus divisions separately, or as a single, virtual database. Alternatively, you could create a single blast
formatted database from the two fasta files using formatdb:
formatdb -i "uniprot_sprot_viruses.fasta uniprot_trembl_viruses.fasta" -n uniprot_viruses -t "combined sprot and trembl virus divisions"
Appendix D: Basic Linux Commands
mv file1 dirName     Move a file called file1 into a directory called dirName
mv file1 file2       Rename file1 and call it file2
nano                 A basic text editor
gedit                A nicer text editor
passwd               Change your password
pgrep pattern        Find process names that contain the pattern. See also ps
pkill processname    Kill a running process using the process name. See also ps, pgrep and kill
pwd                  Print the full path of your current directory
ps -u                List your current processes
ps -aux              List all processes on the machine. See also top
rm filename          Delete a file
rm -rf dirName       Delete a directory and all its contents
rmdir                Delete an empty directory
screen               Run the screen manager (read the man page!)
top                  List the processes running that are using the most CPU
touch filename       Create an empty file
who                  List users currently logged on
Ctrl-c               Stop a process
Ctrl-z               Suspend a process. See also jobs, fg and bg